NAME
    Lextools symbol files, regular expression syntax, general file
    formats, grammar formats - lextools file formats

DESCRIPTION

Symbol Files
    The lextools symbol file specifies the label set of an application.
    Labels come in three basic flavors:

        Basic labels
        Superclass labels
        Category labels

    Basic labels are automatically assigned a unique integer
    representation (excluding 0, which is reserved for "<epsilon>"),
    and this information is compiled by lexmakelab into the basic label
    file, which is in the format specified in fsm(5). Superclass labels
    are collections of basic labels: a superclass inherits all of the
    integral values of all of the basic labels (or superclass labels)
    that it contains. Category labels are described below.

    The lines in a symbol file look as follows:

        superclass1 basic1 basic2 basic3

    You may repeat the same superclass label on multiple (possibly
    non-adjacent) lines: whatever is specified in the new line just
    gets added to that superclass. The "basic" labels can also be
    superclass labels: in that case, the superclass in the first column
    recursively inherits all of the labels from these superclass
    labels.

    The one exception is a category expression, which is specified as
    follows:

        Category: catname feat1 feat2 feat3

    The literal "Category:" must be the first entry; it is case
    insensitive. "catname" should be a new name. The remaining labels
    ("feat1", "feat2", ...) are the category's features, each a
    previously defined superclass whose members are that feature's
    values.

    The following sample symbol file serves to illustrate:

        dletters a b c d e f g h i j k l m n o p
        dletters q r s t u v w x y z
        uletters A B C D E F G H I J K L M N O P
        uletters Q R S T U V W X Y Z
        letters dletters uletters
        gender masc fem
        case nom acc gen dat
        number sg pl
        person 1st 2nd 3rd
        Category: noun gender case number
        Category: verb number person

    For this symbol set, the superclass "dletters" will contain the
    labels for all the lower-case letters, "uletters" the upper-case
    letters, and "letters" both. The defined categories are "noun" and
    "verb".
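    For illustration only, the superclass inheritance described above
    can be sketched in Python (this helper is hypothetical and not part
    of the lextools distribution):

```python
def expand(symfile_lines):
    """Expand superclass definitions into sets of basic labels:
    repeated superclass names accumulate members, and members that
    are themselves superclasses are expanded recursively.
    (Illustrative sketch, not lextools code.)"""
    table = {}
    for line in symfile_lines:
        fields = line.split()
        if not fields or fields[0].lower() == "category:":
            continue  # category lines define features, handled separately
        table.setdefault(fields[0], []).extend(fields[1:])

    def members(name, seen=()):
        if name not in table:              # a basic label
            return {name}
        out = set()
        for m in table[name]:
            if m not in seen:              # guard against accidental cycles
                out |= members(m, seen + (name,))
        return out

    return {name: members(name) for name in table}

classes = expand([
    "dletters a b c",
    "uletters A B C",
    "letters dletters uletters",
])
# "letters" recursively inherits the members of both superclasses:
assert classes["letters"] == {"a", "b", "c", "A", "B", "C"}
```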
    "noun", for instance, has the features "gender", "case" and
    "number". The feature "gender" can have the values "masc" and
    "fem". (NB: this way of representing features is inherited in
    concept from Lextools 2.0.)

    Some caveats: You should not use the reserved terms "<epsilon>" or
    "<sigma>" in a symbol file. If you pass the -L flag to lexmakelab
    you should also not use any of the special symbols that it
    introduces with that flag (see lextools(1)). Symbol files cannot
    contain comments. Backslash continuation syntax does not work.

Regular Expressions
    Regular expressions consist of strings, possibly with specified
    costs, conjoined with one or more operators. Strings are
    constructed out of basic and superclass labels. Labels themselves
    may be constructed out of one or more characters. Characters are
    defined as follows: if 2-byte characters are specified (Chinese,
    Japanese, Korean...), a character can be a pair of bytes if the
    first byte of the pair has the high bit set. In all other
    conditions a character is a single byte.

    Multicharacter tokens MUST BE delimited in strings by a left
    bracket (default: "[") and a right bracket (default: "]"). This
    includes the special tokens "<epsilon>", "<sigma>", "<bos>" and
    "<eos>". This may seem inconvenient, but the regular expression
    compiler must have some way to figure out what a token is.
    Whitespace is inconvenient since if you have a long string made up
    of single-character tokens, you don't want to be putting spaces
    between every character: trust me. You may also use brackets to
    delimit single-character tokens if you wish.

    Some well-formed strings are given below:

        abc[foo]
        [<epsilon>]ab[<sigma>]

    Note that the latter uses the superclass "<sigma>" (constructed by
    lexmakelab to include all symbols of the alphabet except
    "<epsilon>"): in order to compile this expression, the superclass
    file must have been specified using the -S flag.
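    For illustration only, the bracketing convention can be sketched as
    a small Python tokenizer (hypothetical, not part of lextools):

```python
def tokenize(s, lb="[", rb="]"):
    """Split a lextools-style string into tokens: characters outside
    brackets are single-character tokens; a bracketed span is one
    multicharacter token. (Sketch only; ignores 2-byte characters
    and backslash escapes.)"""
    tokens, i = [], 0
    while i < len(s):
        if s[i] == lb:
            j = s.index(rb, i)        # raises ValueError if unbalanced
            tokens.append(s[i + 1:j])
            i = j + 1
        else:
            tokens.append(s[i])
            i += 1
    return tokens

# The two well-formed strings above tokenize as:
assert tokenize("abc[foo]") == ["a", "b", "c", "foo"]
assert tokenize("[<epsilon>]ab[<sigma>]") == ["<epsilon>", "a", "b", "<sigma>"]
```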
    If features are specified in the label set, then one can specify
    the features in strings in a linguistically appealing way as
    follows:

        food[noun gender=fem number=sg case=nom]

    Order of the feature specifications does not matter: the order is
    determined by the order of the symbols in the symbol file. Thus the
    following is equivalent to the above:

        food[noun case=nom number=sg gender=fem]

    The internal representation of such feature specifications looks as
    follows: "food[_noun][nom][sg][fem]". Unspecified features will
    have all legal values filled in. Thus

        food[noun case=nom number=sg]

    will produce a lattice with both [fem] and [masc] as alternatives.
    Inappropriate feature values will cause a warning during the
    compilation process. Since features use superclasses, again, in
    order to compile such expressions the superclass file must have
    been specified using the -S flag.

    Costs can be specified anywhere in strings. They are specified by a
    positive or negative floating point number within angle brackets.
    The current version of lextools assumes the tropical semiring, so
    costs are accumulated across strings by summing. Thus the following
    two strings have the same cost:

        abc[foo]<3.0>
        a<-1.0>b<2.0>c<1.0>[foo]<0.5><0.5>

    Note that a cost on its own -- i.e., without an accompanying string
    -- specifies a machine with a single state, no arcs, and an exit
    cost equal to the cost specified.

    Regular expressions can be constructed as follows. First of all, a
    string is a regular expression. Next, a regular expression can be
    constructed out of one or two other regular expressions using the
    following operators:

        regexp1*            Kleene star
        regexp1+            Kleene plus
        regexp1^n           power
        regexp1?            optional
        !regexp1            negation
        regexp1 | regexp2   union
        regexp1 & regexp2   intersection
        regexp1 : regexp2   cross product
        regexp1 @ regexp2   composition
        regexp1 - regexp2   difference

    In general the same restrictions on these operations apply as
    specified in fsm(1).
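    For illustration only, the tropical-semiring accumulation of costs
    can be checked with a small Python sketch (hypothetical helper, not
    part of lextools):

```python
import re

def total_cost(expr):
    """Sum all <...> costs in a lextools string. In the tropical
    semiring, costs along a string accumulate by addition.
    (Illustrative sketch; assumes costs never occur inside
    bracketed tokens.)"""
    return sum(float(c) for c in re.findall(r"<([-+]?[0-9.]+)>", expr))

# The two example strings above carry the same total cost:
assert total_cost("abc[foo]<3.0>") == 3.0
assert total_cost("a<-1.0>b<2.0>c<1.0>[foo]<0.5><0.5>") == 3.0
```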
    For example, the second argument to "-" (difference) must be an
    unweighted acceptor. Note also that the two arguments to ":" (cross
    product) must be acceptors. The argument n to "^" must be a
    positive integer. The arguments to "@" (composition) are assumed to
    be transducers.

    The algorithm for parsing regular expressions finds the longest
    stretch that is a string, and takes that to be the (first) argument
    of the unary or binary operator immediately to the left, or the
    second argument of the binary operator immediately to the right.
    Thus "abcd | efgh" represents the union of "abcd" and "efgh" (which
    is reminiscent of Unix regular expression syntax) and "abcd*"
    represents the transitive closure of "abcd" (i.e., not "abc"
    followed by the transitive closure of "d", which is what one would
    expect on the basis of Unix regular expression syntax).

    The precedence of the operators is as follows (from lowest to
    highest), the last parenthesized group being of equal precedence:

        |  &  -  :  (* + ? ^)

    But this is hard to remember, and in the case of multiple operators
    it may be complex to figure out which elements get grouped first.
    The use of parentheses is highly recommended: use parentheses to
    disambiguate "!(abc | def)" from "(!abc) | def". Spaces are never
    significant in regular expressions.

Escapes, Comment Syntax and Miscellanea
    Comments can appear in input files, with the exception of symbol
    files. Comments are preceded by "#" and continue to the end of the
    line. You can split lines, or regular expressions within lines,
    onto multiple lines if you include `\' at the end of the line,
    right before the newline character.

    Special characters, including the comment character, can be escaped
    with `\'. To get a `\', escape it with `\': `\\'. Characters that
    should be escaped are: `\', `*', `^', `?', `!', `|', `&', `:', `@',
    `-', `[', `]', `<' and `>'.

Lexicons
    The input to lexcomplex is simply a list of regular expressions.
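    For illustration only, the escaping rules above can be sketched as
    a small Python helper (hypothetical, not part of lextools):

```python
# The characters the manual says should be escaped, plus the comment
# character "#" (also escapable per the text above).
SPECIAL = set("\\*^?!|&:@-[]<>#")

def escape(s):
    """Backslash-escape special characters so a literal string can be
    embedded in a lextools regular expression. (Illustrative sketch.)"""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in s)

assert escape("a-b") == "a\\-b"
assert escape("cost<3>") == "cost\\<3\\>"
```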
    The default interpretation is that these expressions are to be
    unioned together, but other interpretations are possible: see
    lextools(1) for details. If any of the regular expressions denotes
    a relation (i.e., a transducer), the resulting union also denotes a
    relation; otherwise it denotes a language (i.e., an acceptor).

Arclists
    An arclist (borrowing a term from Tzoukermann and Liberman's 1990
    work on Spanish morphology) is a simple way to specify a
    finite-state morphological grammar. Lines can be of one of the
    following three formats:

        instate outstate regexp
        finalstate
        finalstate cost

    Note that cost here should be a floating point number not enclosed
    in angle brackets. State names need not be enclosed in square
    brackets: they are not regular expressions.

    The following example, for instance, specifies a toy grammar for
    English morphology that handles the words "grammatical", "able",
    "grammaticality", "ability" (mapping it to "able + ity"), and their
    derivatives in "un-":

        START  ROOT   un [++] | [<epsilon>]
        ROOT   FINAL  grammatical | able
        ROOT   SUFFIX grammatical
        ROOT   SUFFIX abil : able
        SUFFIX FINAL  [++] ity
        FINAL  1.0

Paradigms
    The paradigm file specifies a set of morphological paradigms,
    specified as follows. Each morphological paradigm is introduced by
    the word "Paradigm" (case insensitive), followed by a bracketed
    name for the paradigm:

        Paradigm [m1a]

    Following this are specifications of one of the following forms:

        Suffix    suffix    features
        Prefix    prefix    features
        Circumfix circumfix features

    The literals "Suffix", "Prefix" and "Circumfix" are matched
    case-insensitively. The remaining two fields are regular
    expressions describing the phonological (or orthographic) material
    in the affix, and the features. The "Circumfix" specification has a
    special form, namely "regexp...regexp". The three adjacent dots,
    which must be present, indicate the location of the stem inside the
    circumfix.
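    For illustration only, the three arclist line formats can be
    distinguished by field count, as in this Python sketch
    (hypothetical parser, not part of lextools; the 0.0 default cost
    for a bare final state is an assumption of the sketch):

```python
def parse_arclist(lines):
    """Classify arclist lines by shape: 3 fields = arc
    (instate, outstate, regexp; the regexp may contain spaces),
    2 fields = final state with cost, 1 field = final state
    (cost assumed 0.0 here). Illustrative sketch only."""
    arcs, finals = [], {}
    for line in lines:
        fields = line.split(None, 2)    # regexp kept intact as 3rd field
        if len(fields) == 3:
            arcs.append(tuple(fields))
        elif len(fields) == 2:
            finals[fields[0]] = float(fields[1])
        else:
            finals[fields[0]] = 0.0     # assumption: bare final = cost 0
    return arcs, finals

arcs, finals = parse_arclist([
    "ROOT SUFFIX abil : able",
    "SUFFIX FINAL [++] ity",
    "FINAL 1.0",
])
assert arcs[0] == ("ROOT", "SUFFIX", "abil : able")
assert finals == {"FINAL": 1.0}
```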
    In all cases, features are placed at the end of the morphologically
    complex form. There is no provided mechanism for infixes, though
    that would not be difficult to add.

    One may specify, in a third field in the "Paradigm" line, another
    previously defined paradigm from which the current paradigm
    inherits forms:

        Paradigm [mo1a] [m1a]

    In such a case, a new paradigm will be set up, and all the forms
    will be inherited from the prior paradigm except those forms whose
    features match entries specified for the new paradigm: in other
    words, you can override, say, the form for "[noun num=sg case=dat]"
    by specifying a new form with those features. (See the example
    below.) One may also add additional entries (with new features) in
    inherited paradigms.

    A sample set of paradigms (for Russian) is given below:

        Paradigm [m1a]
        Suffix [++]     [noun num=sg case=nom]
        Suffix [++]a    [noun num=sg case=gen]
        Suffix [++]e    [noun num=sg case=prep]
        Suffix [++]u    [noun num=sg case=dat]
        Suffix [++]om   [noun num=sg case=instr]
        Suffix [++]y    [noun num=pl case=nom]
        Suffix [++]ov   [noun num=pl case=gen]
        Suffix [++]ax   [noun num=pl case=prep]
        Suffix [++]am   [noun num=pl case=dat]
        Suffix [++]ami  [noun num=pl case=instr]

        Paradigm [mo1a] [m1a]

        Paradigm [m1e] [m1a]
        Suffix [++]"ov  [noun num=pl case=gen]
        Suffix [++]"ax  [noun num=pl case=prep]
        Suffix [++]"am  [noun num=pl case=dat]
        Suffix [++]"ami [noun num=pl case=instr]

    Note that "[mo1a]" inherits all of "[m1a]", whereas "[m1e]"
    inherits all except the genitive, prepositional, dative and
    instrumental plurals. See lextools(1) for some advice on how to
    link the paradigm labels to individual lexical entries in the
    lexicon file argument to lexparadigm.

Context-Free Rewrite Rules
    The input to lexcfcompile is a set of expressions of the following
    form:

        NONTERMINAL -> regexp

    The "->" must be literally present. "NONTERMINAL" can actually be a
    regular expression over nonterminal symbols, though the only useful
    regular expressions in this case are unions of single symbols.
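    For illustration only, paradigm inheritance behaves like a
    dictionary override, as this Python sketch shows (hypothetical
    helper, not part of lextools):

```python
def build_paradigm(entries, parent=None):
    """Paradigm inheritance sketched with dicts: keys are feature
    bundles, values are affix expressions. Entries in the child
    paradigm override parent forms with matching features; entries
    with new features are simply added. (Illustrative only.)"""
    forms = dict(parent) if parent else {}
    forms.update(entries)
    return forms

m1a = build_paradigm({
    "[noun num=pl case=gen]":  "[++]ov",
    "[noun num=pl case=prep]": "[++]ax",
})
# [m1e] inherits from [m1a] but overrides the genitive plural:
m1e = build_paradigm({
    "[noun num=pl case=gen]": '[++]"ov',
}, parent=m1a)

assert m1e["[noun num=pl case=gen]"] == '[++]"ov'    # overridden
assert m1e["[noun num=pl case=prep]"] == "[++]ax"    # inherited
```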
    The "regexp" can in principle be any regular expression specifying
    a language (i.e., not a relation) containing a mixture of terminals
    and nonterminals. However, while lexcfcompile imposes no
    restrictions on what you put in the rule, the algorithm implemented
    in GRMCfCompile, which lexcfcompile uses, can only handle certain
    kinds of context-free grammars. The user is strongly advised to
    read and understand the description in grm(1) to understand the
    restrictions on the kinds of context-free grammars that can be
    handled.

    By default the start symbol is assumed to be the first nonterminal
    mentioned in the grammar; see lextools(1) for further details.

    The following grammar implements the toy English morphology example
    we saw above under Arclists, this time putting brackets around the
    constituents (and this time without the mapping from "abil" to
    "able"):

        [NOUN]   -> \[ ( \[ [ADJ] \] | \[ [NEGADJ] \] ) ity \]
        [NOUN]   -> \[ [ADJ] \] | \[ [NEGADJ] \]
        [NEGADJ] -> un \[ [ADJ] \]
        [ADJ]    -> grammatical | able

Context-Dependent Rewrite Rules
    A context-dependent rewrite rule file consists of specifications of
    one of the following two forms:

        phi -> psi / lambda __ rho
        phi => psi / lambda __ rho

    In each case "phi", "psi", "lambda" and "rho" are regular
    expressions specifying languages (acceptors). All but "psi" must be
    unweighted (a requirement of the underlying GRMCdCompile; see
    grm(1), grm(3)). The connectors "->", "=>", and "/" must literally
    occur as such. The underbar separating "lambda" and "rho" can be
    any number of consecutive underbars.

    The interpretation of all such rules is that "phi" is changed to
    "psi" in the context "lambda" on the left and "rho" on the right.
    The difference between the two productions "->" and "=>" is the
    following. "->" denotes a mapping where any element of "phi" can be
    mapped to any element of "psi".
    With "=>", the inputs and outputs are matched according to their
    order in the symbol file: this is most useful with single
    (superclass) symbol to single (superclass) symbol replacements. For
    example, suppose you have the following entries in your symbol
    file:

        V       a e i o u
        +voiced b d g
        -voiced p t k

    The rule:

        [-voiced] -> [+voiced] / V __ V

    will replace any symbol in {p,t,k} with any symbol in {b,d,g}
    between two vowels. Probably what you want in this case is the
    following:

        [-voiced] => [+voiced] / V __ V

    This will replace "p" with "b", "t" with "d" and "k" with "g". The
    matching is done according to the order of the symbols in the
    symbol file. If you had specified instead:

        +voiced b g d
        -voiced p t k

    then "t" would be replaced with "g" and "k" with "d". Similarly,
    with

        +voiced b d g
        -voiced p t k x

    "p", "t" and "k" would be replaced as in the first case, but "x"
    would be ignored since there is nothing to match it to: nothing
    will happen to "x" intervocalically. Use of the matching rule
    specification "=>" thus requires some care in label-set management.

    Beginning of string and end of string can be specified as "[<bos>]"
    and "[<eos>]", respectively: these are added to the label set by
    default if you use the -L flag to lexmakelab.

    A line may also consist of one of the following specifications
    (which are case insensitive):

        left-to-right
        right-to-left
        simultaneous
        optional
        obligatory

    The first three set the direction of the rule application; the last
    two set whether the rule application is obligatory or optional; see
    grm(1). All specifications are in effect until the next
    specification or until the end of the file. The default setting is
    obligatory, left-to-right. In practice the user will rarely need to
    fiddle with these default settings.

Replacements
    A replacement specification (for lexreplace) is a file consisting
    of lines like the following:

        foo.fst a|b|c|d

    The first column specifies a single fsm that must exist in the
    named file.
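    For illustration only, the positional matching behavior of "=>" can
    be sketched in Python (hypothetical helper, not part of lextools;
    the sketch ignores the V __ V context and only shows the pairing):

```python
def matched_replace(tokens, phi, psi):
    """The '=>' matching semantics sketched in Python: each symbol in
    phi is replaced by the symbol at the same position in psi; phi
    symbols with no counterpart are left alone, as are symbols not
    in phi at all. (Illustrative sketch only.)"""
    pairing = dict(zip(phi, psi))     # zip stops at the shorter list
    return [pairing.get(t, t) for t in tokens]

# -voiced ordered p t k, +voiced ordered b d g, as in the symbol file:
assert matched_replace(list("apataka"), ["p", "t", "k"], ["b", "d", "g"]) \
       == list("abadaga")
# With an extra "x" in -voiced, "x" has nothing to match and is unchanged:
assert matched_replace(list("axa"), ["p", "t", "k", "x"], ["b", "d", "g"]) \
       == list("axa")
```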
    The remainder of the line specifies a union of labels to be
    replaced, in the topology fsm argument to lexreplace, with said
    fsm. A specified substitution for a given label will override any
    previous substitution. In the following case:

        foo.fst a|b|c|d
        bar.fst a

    foo.fst will be substituted for "b", "c" and "d", and bar.fst for
    "a". See also grmreplace in grm(1).

Currency Expressions
    Currency specification files contain lines of the following form:

        sym major-expr point minor-expr large-number

    Each entry is a regular expression. See lextools(1) for a
    description of the function of the entries. Note that the
    whitespace separator MUST BE A TAB: the regular expressions
    themselves may contain non-tab whitespace. There must therefore be
    four tabs. You must still have tabs separating entries even if you
    split an entry across multiple lines (with `\').

SEE ALSO
    lextools(1)   lextools user functions.
    fsmintro(1)   Introduction to the FSM programs and library.
    grmintro(1)   Introduction to the GRM programs and library.

FILES
    /usr/local/bin/lextools-3
        Distribution Lextools binaries.
    /usr/local/lib/lextools-3/grammars
        Sample grammars.

AUTHOR
    Richard Sproat (rws@research.att.com)