NAME
       lexarclist, lexcomplex, lexcurrency, lexfsmstrings, lexnumbuilder, lexreplace, lexcfcompile, lexcompre, lexdownsize, lexmakelab, lexparadigm, lexrulecomp - tools for finite-state linguistic analysis

SYNOPSIS
       lexarclist -l basic labels [ -S superclass labels ] [ -F output file ] [ -2dhv? ] [arclist file]
       lexcfcompile -l basic labels [ -S superclass labels ] [ -F output file ] [ -2dhsv? ] [grammar specification file]
       lexcomplex -l basic labels [ -S superclass labels ] [ -F output file ] [ -2Mdhmsv? ] [lexicon file]
       lexcompre -l basic labels -s regular expression [ -S superclass labels ] [ -F output file ] [ -2Mdhmx? ]
       lexcurrency -l basic labels -n number fsm [ -S superclass labels ] [ -F output file ] [ -2LMdhmv? ] [currency specification file]
       lexdownsize -l basic labels -S superclass labels -f fsm -n new labelset name [ -2dhio? ]
       lexfsmstrings -l basic labels [ -S superclass labels ] [ -2cdh? ] [fsm]
       lexmakelab [ -Lhv? ] symbol file
       lexnumbuilder -l basic labels [ -S superclass labels ] [ -c comma fsm ] [ -L linguistic filter fsm ] [ -D decimal model fsm ] [ -i intermediate fsm dump file ] [ -2dghiv? ] [compiled grammar specification]
       lexparadigm -l basic labels -p paradigm file [ -S superclass labels ] [ -F output file ] [ -2dhv? ] [lexicon file]
       lexreplace -l basic labels -t topology fsm [ -S superclass labels ] [ -F output file ] [ -2dhv? ] [replacement specifications]
       lexrulecomp -l basic labels [ -S superclass labels ] [ -F output file ] [ -2dhv? ] [rule file]

DESCRIPTION
       Lextools is a package of tools for creating weighted finite-state transducers from high-level linguistic descriptions.
These descriptions include:

       - Regular expressions or strings
       - Lists of regular expressions or strings
       - Context-dependent rewrite rules
       - Context-free rewrite rules
       - Specifications for various more specialized tools for building particular kinds of grammars --- for instance, grammars that specify the mapping between digit strings and the number names for those strings.

Lextools is built on the FSM (fsmintro(1)) and GRM (grmintro(1)) libraries. In particular, see the FSM library for the definitions of the components of an FSM: state, start state, arc (transition), input symbol, output symbol, arc cost, final cost and final state; and the GRM library for descriptions of the restrictions on context-free and context-sensitive grammars.

With the exception of lexmakelab (which builds label sets) and lexfsmstrings (which prints acyclic automata in terms of a given label set), all of the commands take as input a label set and a set of one or more grammar specifications (in some cases specified as compiled FSMs), and output an FSM that implements the grammatical specifications. Where the input file (the last argument) is listed as optional in the synopses above, the information can be read either from a file or from the standard input.

The majority of the tools share a common set of command-line flags, which for convenience are listed here with their interpretations. These flags are optional, with the exception of -l:

       -l     basic label file
       -S     superclass label file
       -F     write the result to this output file rather than to the standard output
       -2     characters are 2-byte characters (useful for Asian writing systems)
       -d     turn on various lextools debugging flags
       -v     turn on debugging flags for this tool
       -[h?]  print the command-line flag information

Label Set Construction
lexmakelab compiles symbol file into a basic label file and a superclass label file.
The symbol file should be named name.sym; the basic label file and superclass label file will be named name.lab and name.scl, respectively. The label files produced are in the format specified in fsm(5), where the first column is the label and the second column gives the integral value (the internal arc label value for the FSMs) corresponding to that label.

The file format of the symbol file is specified in lextools(5), but basically the first entry on a line defines the superclass to which the basic labels, or predefined superclasses, in the remaining entries on the line belong. Superclass names may appear in the first column of multiple lines: whenever lexmakelab sees a superclass label in the first column, it merely adds the labels specified in the rest of the line to that superclass. Thus the integer values labeled with the superclass label foo in the superclass label file are exactly those integer values specified for the basic labels belonging to foo. So, assuming that a is assigned to 1, b to 2 and c to 3, the superclass specification:

       foo     a b c

will result in the entries

       foo     1
       foo     2
       foo     3

being placed in the superclass label file. If another superclass, say bar, is defined in terms of foo, then bar's entries in the superclass label file will include all of foo's entries. Superclasses are useful for defining classes of symbols. For example, one could define all of the vowel symbols as belonging to the superclass vowel.

The special designation Category: in the first column of a line defines the entry in the second column as having the features specified by the remaining entries on the line, which should be predefined superclasses. Thus, the set of lines

       gender          masc fem
       case            nom acc gen dat
       number          sg pl
       person          1st 2nd 3rd
       Category:       noun gender case number
       Category:       verb number person

specifies noun and verb as categories which have, respectively, the features gender, case and number; and number and person.
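As a rough sketch of the superclass expansion just described, the following Python function mimics how superclass lines accumulate integer values (illustrative only; the function name and data layout are invented here, and lexmakelab itself works on the actual symbol file format):

```python
def expand_superclasses(basic, lines):
    """basic maps each basic label to its integer value; lines is the
    sequence of (first-column name, remaining entries) pairs from the
    symbol file.  Returns the superclass-name -> integer-list table
    corresponding to what would be written to the .scl file."""
    scl = {}
    for name, members in lines:
        ints = scl.setdefault(name, [])   # repeated names merge into one class
        for m in members:
            if m in scl:                  # a predefined superclass
                ints.extend(scl[m])
            else:                         # a basic label
                ints.append(basic[m])
    return scl

table = expand_superclasses(
    {"a": 1, "b": 2, "c": 3},
    [("foo", ["a", "b", "c"]),
     ("bar", ["foo"])])        # bar defined in terms of foo
```

Here table["foo"] is [1, 2, 3], and table["bar"] inherits all of foo's entries, exactly as described above.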
Then one can specify these features, with their appropriate values (in any order), in regular expressions: Hund[noun case=nom number=sg gender=masc].

By default the label <epsilon>, with numeric value 0, is added to the basic label file. Similarly, <sigma> is put in the superclass file for each non-epsilon member of the label set.

The -L flag instructs lexmakelab to add various additional labels that are needed by other tools in the lextools package, or by the gentex multilingual TTS text-analysis module. These labels are:

       -- Linguistic boundary symbols: $$, ++, ww, aa, ii, ,,, .., !!, ??
       -- XML start and end tags: <xml>, </xml>
       -- Accentuation marks: acc:+, acc:-, acc:c
       -- Powers of 10 and 20 (needed by lexnumbuilder): 10^1 - 10^20, 20^1 - 20^2
       -- Internal representations of multipliers of powers of 10 or 20 (needed by lexnumbuilder): 0*, 1*, 2*, 3*, 4*, 5*, 6*, 7*, 8*, 9*
       -- Beginning- and end-of-string symbols (useful for lexrulecomp): <bos>, <eos>

Regular Expression Compilation
lexcompre compiles regular expression into an FSM. If the -x flag is specified, regular expression is interpreted as a string rather than as a regular expression: long symbols (enclosed in [...]) and costs (enclosed in <...>) are still interpreted, but regular expression operators are treated as ordinary symbols.

Printing Acyclic Machines As Strings
lexfsmstrings prints acyclic automata and transducers as, respectively, strings or pairs of strings, one string or pair of strings per line, using the symbols in basic label file. Long symbols are enclosed in [...] and costs in <...>. Printing of string costs can be suppressed with the -c flag.

Producing Subsets of Label Sets
lexdownsize takes a basic label file and a superclass file (required), and produces a new pair of label/superclass files with the basename name (i.e., name.lab and name.scl) containing only those labels found in the output (default, or -o flag) or input (-i) arc labels of fsm.
This is particularly useful if you have a large label set but you know that, say, the output of a given transducer T uses only a small subset of those labels: call this Labels(T). Necessarily, any transducer that T is composed with need only know what to do with Labels(T). So a set of ordered rules that composes with T, for instance, only needs to know what to do with strings constructed out of Labels(T). If Labels(T) is much smaller than the full label set, compilation of the ruleset can be much more efficient.

Lexicon Compilation
lexcomplex compiles the list of regular expressions in lexicon file. Normal regular expression syntax, line continuation with backslash, and comment syntax as defined in lextools(5) apply. By default, it produces the union of the individual regular expressions.

If either -m or -M is set, the file is interpreted as specifying a map, which defines some output for any input string over the alphabet of the label set: note that the individual regular expressions should in this case specify transducers.

With the -m flag the relation (Id(<sigma> - Projection_1(R)) U R)* is computed, which is efficient to compute and probably what you want if you are interested in building a map involving transductions of single symbols to strings.

With the -M flag the transductions specified by each expression in lexicon file are treated as corresponding parts of a batch left-to-right obligatory context-sensitive rewrite rule. Thus lines such as:

       input1 : output1
       input2 : output2
       input3 : output3

are treated as if one had written a batch rule

       |input1|    |output1|
       |input2| -> |output2|
       |input3|    |output3|

and compiled it with lexrulecomp (which, lamentably, does not currently support batch rules in general).

The behavioral difference between -m and -M is best illustrated by example.
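As a rough plain-string sketch of what an obligatory left-to-right batch rewrite does, consider the following Python function (the real tool, of course, compiles a transducer; this helper function is invented purely for illustration):

```python
def batch_rewrite(s, rules):
    """Apply (input, output) pairs obligatorily, left to right:
    at each position the first matching input is rewritten and the
    scan resumes after it; unmatched symbols are copied through."""
    out, i = [], 0
    while i < len(s):
        for inp, rep in rules:
            if s.startswith(inp, i):
                out.append(rep)
                i += len(inp)
                break
        else:                  # no rule matched here: copy one symbol
            out.append(s[i])
            i += 1
    return "".join(out)
```

With the rules a : b and cd : ac of the example that follows, batch_rewrite("abcd", [("a", "b"), ("cd", "ac")]) returns "bbac".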
Suppose one has the following alphabet

       a b c d

and the following lexicon file:

       a : b
       cd : ac

Compiled with the -M flag, output.fst will guarantee that any a in a string is transduced to b, and any sequence cd is transduced to ac. Thus abcd yields as output only bbac:

       lexcompre -l foo.lab -s abcd | fsmcompose - output.fst | lexfsmstrings -l foo.lab
       abcd    bbac

Compiled with -m, output.fst will guarantee that any a in a string is transduced to b (since the input a is subtracted from <sigma> in computing the complement of the left projection of what is specified in lexicon file), but the sequence cd can be transduced either to itself (via (Id(<sigma> - Projection_1(R)))*) or else to ac (via R):

       lexcompre -l foo.lab -s abcd | fsmcompose - output.fst | lexfsmstrings -l foo.lab
       abcd    bbac
       abcd    bbcd

Note that since <sigma> is used in the compilation of maps, if one uses the -[mM] flags one must also specify the superclass label file with -S.

Finite-State Grammars
lexarclist compiles an arclist specification of a finite-state grammar from arclist file into an FSM that implements that grammar.

Morphological Paradigms
lexparadigm takes the paradigm specification in paradigm file and the stems and morphological class information in lexicon file, and produces a replace-class FSM (grm(1)) that recognizes all and only the inflected forms of the stems in lexicon file.

The assignment of lexical entries to paradigms is accomplished by specifying the appropriate paradigm as part of the lexical entry. This can be done in any fashion by including the paradigm label as part of the entry, but it is recommended that you pick an appropriate feature name such as "pdm" and an attribute with an appropriate name like "class". Thus if there is a paradigm definition for class [m1a] in the paradigm file, an entry in the lexicon file might look as follows: abazh'ur[pdm class=m1a].
Context-Free Grammars
lexcfcompile compiles the context-free grammar specified in grammar specification file into a finite-state acceptor that recognizes the language defined by the grammar. lexcfcompile first compiles the grammar into a transducer representation compatible with that produced by grmread (grm(1)), and then passes that transducer to GRMCfCompile (grm(3)). Thus lexcfcompile produces exactly what one gets by first calling grmread and piping the result to grmcfcompile (except that the grammar file format for lexcfcompile is a little more flexible than the file format for grmcfcompile). Note that only context-free grammars that are formally equivalent to finite-state grammars can be compiled: GRMCfCompile does not currently support finite-state approximation of context-free grammars. See grm(1) for discussion of the restrictions on what kinds of CF grammars can be compiled.

By default the start symbol is assumed to be the first nonterminal mentioned in the grammar. With the -s flag, no start symbol is set. This can be useful if you want to set the start state at some later time: see the section on Dynamic Modifications in grm(1).

Context-Sensitive Rulesets
lexrulecomp compiles a set of context-dependent rewrite rules specified in rule file into a series of transducers which are then composed together. The algorithm is the one described under grmcdcompile in grm(1); lexrulecomp calls GRMCdCompile (grm(3)). See grm(1) for a description of the formal restrictions on context-dependent rules.

Number Expression Compilation
lexnumbuilder allows one to construct transducers that map between digit strings and their number names. The input to lexnumbuilder is an FSM that enumerates the well-formed number names of the language, where each word has been annotated with its semantic interpretation as (sums of) products of powers of ten, in terms of the powers and multipliers listed above.
Such an FSM can be conveniently produced with a context-free grammar and lexcfcompile. For example, rules like

       [HUNDRED]  -> [UNITW] [ww] [HUNDREDW] ([ww] and [ww] [TEN])?
       [HUNDREDW] -> hundred[10^2]
       [TEN]      -> [TENW] ([ww] [UNITW])?
       [TENW]     -> twen[2*]ty[10^1]
       [UNITW]    -> one[1*]

as part of a context-free grammar will produce as output an analysis like the following for "one hundred and twenty one":

       one[1*][ww]hundred[10^2][ww]and[ww]twen[2*]ty[10^1][ww]one[1*]

An FSM that produces representations like that is appropriate as input to lexnumbuilder.

The first operation that is performed on the input is to compute a bimorphism, producing a transducer with the multipliers and powers on the output and everything else on the input. Thus the example above would become:

       one[ww]hundred[ww]and[ww]twenty[ww]one
       [1*][10^2][2*][10^1][1*]

At this stage, you can dump this transducer to an intermediate fsm dump file using the -i flag, if you wish. This is useful for debugging. The remainder of the processing performed by lexnumbuilder involves composing this transducer with a transducer that maps from factorizations like "[1*][10^2][2*][10^1][1*]" into digit strings like "121". Since the rules that lexnumbuilder uses (see below) assume that these expressions occur in a certain order, if there is a bug in the grammar you will probably lose some of the expressions that you had thought were allowed. Computing the difference (using fsmdifference) between the input of the dumped transducer and the output of the final transducer produced by lexnumbuilder allows one to see what is missing.

If the -g flag is specified, the next phase involves computing "Germanic inversion" on the output of the bimorphized transducer. Number names in German and other Germanic languages in particular have the well-known property that units occur before decades, so that "twenty one" becomes "one and twenty". This would generally result in the "reversed" factorization [1*][2*][10^1] (i.e.
1 + 2*10^1). Germanic inversion will flip this into the normal order [2*][10^1][1*]. This is performed using the transducer specified in /usr/local/lib/lextools-3/Fragments/NBGmcFlip.asc, which is compiled at runtime using the specified label set.

The next phase uses the transducer specified in /usr/local/lib/lextools-3/Fragments/NBPowerInterp.asc (compiled at runtime using the label set) to "normalize" to a fully explicit factorization into sums of products of powers of ten. This becomes necessary when dealing with larger numbers like "eight hundred and ten thousand and eighty nine", which would at this stage be expressed as

       [8*][10^2][1*][10^1][0*][10^3][8*][10^1][9*]

(i.e., (8*10^2 + 1*10^1 + 0)*10^3 + 8*10^1 + 9), for which we want

       [8*][10^5][1*][10^4][0*][10^3][0*][10^2][8*][10^1][9*]

(i.e., 8*10^5 + 1*10^4 + 0*10^3 + 0*10^2 + 8*10^1 + 9). The rules in /usr/local/lib/lextools-3/Fragments/NBPowerInterp.rul (the source of /usr/local/lib/lextools-3/Fragments/NBPowerInterp.asc) perform this normalization.

Number name systems differ in which powers of the base (= 10, usually) there are distinctive words for. European languages typically have words for 10^0, 10^1, 10^2, 10^3, 10^6, 10^9, 10^12, etc. Chinese and Japanese have words for 10^0, 10^1, 10^2, 10^4, 10^8, 10^12, etc. Hindi and other South Asian languages have words for 10^0, 10^1, 10^2, 10^3, 10^5, 10^7. These differences affect the set of rules needed to perform the normalization. /usr/local/lib/lextools-3/Fragments/NBPowerInterp.rul should handle most cases that you will encounter. The ruleset also contains a small set of vigesimal rules to handle French-type constructions like quatre-vingt dix [4*][20^1][10^1], converting them into (in this case) [9*][10^1]. There is no support for full-blown vigesimal systems, though this could be added.
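For intuition about what these decimal factorizations denote, here is a small Python evaluator for them (a sketch only, which does not handle the vigesimal cases; lexnumbuilder itself performs all of this by transducer composition, and this function is invented for illustration):

```python
def factor_value(tokens):
    """Evaluate a factorization given as a token list, e.g.
    ["8*", "10^2", "1*", "10^1", "0*", "10^3", ...].  A power larger
    than everything accumulated so far acts as a group multiplier
    (like "thousand"); otherwise it scales the pending multiplier
    (like "hundred" after a unit word)."""
    seg, pending = 0, 0
    for t in tokens:
        if t.endswith("*"):                 # a multiplier, e.g. 8*
            pending = int(t[:-1])
        else:                               # a power, e.g. 10^3
            base, exp = t.split("^")
            p = int(base) ** int(exp)
            if p > seg + pending:           # group multiplier
                seg = (seg + pending) * p
            else:                           # local power
                seg += pending * p
            pending = 0
    return seg + pending
```

Both the unnormalized factorization above and its fully explicit normalization evaluate to 810089, and the "one hundred and twenty one" factorization evaluates to 121.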
The next phase of the compilation performs the complementary operation on digit strings, expanding them into a factorized representation using /usr/local/lib/lextools-3/Fragments/NBFactors.asc, and a rule of the form

       [digits] => [multipliers]

to map from the actual digit symbols to the multipliers ([2*], etc.). This is then composed with the transducer from the previous phase, and you now have something that goes from digit strings to number names.

There are several additional options.

The -c flag allows the use of commas or other separators in numerical expressions, such as "1,234" instead of "1234". The argument should be an fsm that inserts commas at specified places in the string. One for English might be specified as follows:

       [digits]?[digits]?[digits](([<epsilon>]:,)?[digits][digits][digits])*

Note, though, that the argument must be a compiled fsm rather than a regular expression.

With the -D flag one can specify an fsm that implements the decimal point and the manner of reading the numbers after that point. It should specify:

       1. The decimal point symbol(s) to be used, and their manner of reading
       2. The manner of splitting up the digit string so that it can be read properly.

For instance, in English we tend to read decimals as sequences of single digits, so that ".1415" is read as if it were written "point 1 4 1 5". So for English we want to introduce spaces after each digit:

       (.:[ww]point[ww]) [digits] (([<epsilon>]:[ww]) [digits])*

For Romance languages one would want to group the digits into pairs or triples.

The -L flag allows one to specify a linguistic filter to be applied to the output of the digit-to-number-name transduction. For instance, in Spanish there is gender agreement among the members of the number name. This can be expressed as an automaton that is then passed to the -L flag.

To use lexnumbuilder, one must have an entry in the symbol file that defines the superclass label "digits", with entries 0-9 in that order.
Thus:

       digits  0 1 2 3 4 5 6 7 8 9

lexnumbuilder uses this superclass to compile a matched rule of the form

       [digits] => [multipliers]

to map between the digit representation and the internal multiplier symbols listed above. The reason that the digit set must be specified by the user is that it is likely to vary from coding scheme to coding scheme. If you do not provide this, or if you do not put the entries in the specified order, you will have no joy.

lexnumbuilder is by far the most complicated tool in the lextools package. Some sample grammars have been provided which can be used as a basis for building new grammars; see /usr/local/lib/lextools-3/grammars/README for details.

Prefixed Currency Expression Compilation
lexcurrency allows for the compilation of transducers that handle currency expressions where the currency symbol occurs before the number and the currency word occurs after the number. It is a rather specialized tool.

lexcurrency requires that the user have defined the superclass labels:

       digits  as for lexnumbuilder above
       comma   the symbol(s) used to group long numbers (e.g. the "," in "2,000")
       zero    the symbol(s) used for zero in the script for the language

The task of lexcurrency is to convert an expression like "$3.25" into "three dollars and twenty five cents", or an expression like "$2.5 million" into "two point five million dollars". Call the "$" the currency symbol; the numeric part before the decimal point (or whatever is standard for that language) the "major" currency portion; the point (or whatever) itself the "point"; the part after the decimal point the "minor" currency portion; and things like "million" the "large number" portion. Terms like "dollars" and "cents" we will term the "major" and "minor" expressions.
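This terminology can be made concrete with a toy Python tokenizer (illustrative only; the regular expression and function name here are invented and have nothing to do with how lexcurrency itself is implemented):

```python
import re

def split_currency(expr):
    """Split a prefixed currency expression into the parts named above."""
    m = re.fullmatch(r"(\$)(\d[\d,]*)(?:\.(\d+))?(?:\s+(\w+))?", expr)
    if m is None:
        return None
    symbol, major, minor, large = m.groups()
    return {"symbol": symbol,                          # currency symbol, "$"
            "major": major,                            # major currency portion
            "point": "." if minor is not None else None,
            "minor": minor,                            # minor currency portion
            "large": large}                            # large number portion
```

For example, split_currency("$2.5 million") yields symbol "$", major "2", point ".", minor "5", and large "million".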
The currency specification file is a file of the following form (see also lextools(5)), where tabs (it has to be tabs) separate the elements:

       symbol  major-expr      point   minor-expr      large-number

For example:

       $       [ww](dollars)   .:[ww]  [ww]cents       [ww](m|b)illion

would allow one to generate "six dollars twenty five cents" from "$6.25" and "two point five million dollars" from "$2.5 million".

In addition to the currency specification file, lexcurrency requires an already constructed number fsm (the argument to the -n flag), as constructed by lexnumbuilder. Note that it is assumed that you have provided a way of reading decimal fractions!

For large number expressions lexcurrency takes the normal mode of reading decimal numbers and combines it with the large number set specified and the word for the major currency. Thus if your number model tells you that "6.253" is read as "six point two five three", then that is what gets used by lexcurrency for, e.g., "$6.253 billion".

For other currency expressions lexcurrency takes the number model plus the currency specification, and from

       symbol any-number-of-digits point exactly-2-digits

produces

       number-name major-expression point-expression number-name minor-expression

The point-expression is just the output side of the transducer specified in the point part of the currency specification file: in the English example above, ".:[ww]" means that the "point" is written as a decimal point and is expressed as a word boundary ("[ww]"). If you wanted it to be, e.g., "and" ("six dollars and twenty five cents"), you would specify something like ".:([ww]and[ww])".

The exactly-2-digits restriction is of course a concession to the practical fact that virtually all currency expressions can be written (maximally) in an X*.XX format, where X is a digit. This would not have worked, for example, for traditional British currency expressions expressed in pounds, shillings and pence.
Three separate filters are allowed on the output: one for the major expression (-M), one for the minor expression (-m), and one for the large number expression (-L). In each case these should be fsms that specify some operation on the output of the transduction from currency-symbol-plus-numeric strings to linguistic expressions. For instance, the above grammar as it stands will produce anomalous output like "*one dollars and one cents" for "$1.01", since the major and minor currency terms are "dollars" and "cents", respectively. This can be cleaned up, though, with a rule such as the following, which can then be compiled and passed as the argument to both -M and -m:

       s -> [<epsilon>] / [<bos>] one [ww] (cent|dollar) __

FSM Replacements
lexreplace functions much as grmreplace (grm(1)) except that instead of mapping between single symbols and the names of FSMs to replace those symbols, one can specify unions of symbols and have the specified FSM replace any symbol in that union in topology fsm. Thus:

       foo.fst a|b|c|d

specifies that any of a, b, c, d is replaced by foo.fst in topology fsm.

CAVEATS
       Large grammars and rulesets can produce large intermediate results during the compilation process, and large final results at the output of the compilation process. Sometimes the intermediate machines may overrun the available memory on your machine. A simple solution is almost always available in such cases: split the rule file or lexicon file into a set of smaller files, and then combine them at the end. For example, if a set of 100 ordered rules is too big or too slow to compile, split it into, say, four files of 25 rules each, compile them separately, and then compose the four together when you are done. Alternatively, the last step can be omitted, and you can compose the rules together only when you use them, at runtime.
       For lexparadigm, you might think that you could do something like the following to say that "volos" is a member of both paradigm m1e and m1a:

       volos([pdm class=m1e]|[pdm class=m1a])

       Do this instead:

       volos[pdm class=m1e]
       volos[pdm class=m1a]

       The latter will inflect the strings "volos[pdm class=m1e]" and "volos[pdm class=m1a]" with the endings from their respective paradigms, which is generally what you want. The former will inflect each of those strings with the endings from both paradigms, which is generally not what you want.

BUGS
       The file format for lexcurrency is a little funky.

       Some of these tools could use better names.

SEE ALSO
       lextools(5)     lextools file formats
       grmintro(1)     introduction to the GRM programs and library
       fsmintro(1)     introduction to the FSM programs and library

FILES
       /usr/local/bin/lextools-3           distributed Lextools binaries
       /usr/local/lib/lextools-3/grammars  sample grammars

AUTHOR
       Richard Sproat (rws@research.att.com)