     NAME
          lexarclist, lexcomplex, lexcurrency, lexfsmstrings,
          lexnumbuilder, lexreplace, lexcfcompile, lexcompre,
          lexdownsize, lexmakelab, lexparadigm, lexrulecomp - tools
          for finite-state linguistic analysis

     SYNOPSIS
          lexarclist -l basic labels [ -S superclass labels ] [ -F
          output file ] [ -2dhv? ] [arclist file]

          lexcfcompile -l basic labels [ -S superclass labels ] [ -F
          output file ] [ -2dhsv? ] [grammar specification file]

          lexcomplex -l basic labels [ -S superclass labels ] [ -F
          output file ] [ -2Mdhmsv? ] [lexicon file]

          lexcompre -l basic labels -s regular expression [ -S
          superclass labels ] [ -F output file ] [ -2Mdhmx? ]

          lexcurrency -l basic labels -n number fsm [ -S superclass
          labels ] [ -F output file ] [ -2LMdhmv? ] [currency
          specification file]

          lexdownsize -l basic labels -S superclass labels -f fsm -n
          new labelset name [ -2dhio? ]

          lexfsmstrings -l basic labels [ -S superclass labels ] [
          -2cdh? ] [fsm]

          lexmakelab [ -Lhv? ] symbol file

          lexnumbuilder -l basic labels [ -S superclass labels ] [ -c
          comma fsm ] [ -L linguistic filter fsm ] [ -D decimal model
          fsm ] [ -i intermediate fsm dump file ] [ -2dghiv? ]
          [compiled grammar specification]

          lexparadigm -l basic labels -p paradigm file [ -S superclass
          labels ] [ -F output file ] [ -2dhv? ] [lexicon file]

          lexreplace -l basic labels -t topology fsm [ -S superclass
          labels ] [ -F output file ] [ -2dhv? ] [replacement
          specifications]

          lexrulecomp -l basic labels [ -S superclass labels ] [ -F
          output file ] [ -2dhv? ] [rule file]

     DESCRIPTION
          Lextools is a package of tools for creating weighted
          finite-state transducers from high-level linguistic descrip-
          tions. These descriptions include:
             - Regular expressions or strings

             - Lists of regular expressions or strings

             - Context dependent rewrite rules

             - Context free rewrite rules

             - Various more specialized formats for building
             particular kinds of grammars --- for instance, grammars
             that specify the mapping between digit strings and the
             number-names for those strings.

          Lextools is built on the FSM (fsmintro(1)) and GRM
          (grmintro(1)) libraries. In particular, see the FSM Library
          for the definitions of the components of an FSM: state,
          start state, arc (transition), input symbol, output symbol,
          arc cost, final cost and final state, and the GRM library
          for descriptions of restrictions on context free and context
          sensitive grammars.

          With the exception of lexmakelab (which builds label sets)
          and lexfsmstrings (which prints acyclic automata in terms of
          a given label set), all of the commands take as input a
          label set and a set of one or more grammar specifications
          (in some cases specified as compiled FSMs), and output an
          FSM that implements the grammatical specifications.

          Where the input file (the last argument) is listed as
          optional in the synopses above, the information can be read
          either from a file or from the standard input.

          The majority of the tools share a common set of command-line
          flags, which for convenience are listed here with their
          interpretation. These flags are usually optional, with the
          exception of -l:

             -l basic label file

             -S superclass label file

             -F write result to this output file rather than standard
             out

             -2 characters are 2-byte characters (useful for Asian
             writing systems)

             -d turn on various lextools debugging flags

             -v turn on debugging flags for this tool

             -[h?] print the command-line flag information

        Label Set Construction
          lexmakelab compiles symbol file into a basic label file, and
          a superclass label file. The symbol file should be named
          name.sym, and the basic label file and superclass label file
          will be named, respectively, name.lab and name.scl.  The
          label files produced are in the format specified in fsm(5),
          where the first column is the label, and the second column
          gives the integral values (internal arc label values for the
          FSMs) corresponding to those labels.

          The file format of the symbol file is specified in 
          lextools(5), but basically the first entry in the line defines
          the superclass to which the basic labels, or predefined
          superclasses, in the remaining entries in the line belong.
          Superclass names may appear in the first column of multiple
          lines: whenever lexmakelab sees a superclass label in the
          first column, it merely adds the labels specified in the
          rest of the line to that superclass. Thus the integer values
          labeled with the superclass label foo in the superclass
          label file are exactly those integer values specified for
          the basic labels belonging to foo. So, assuming that a is
          assigned to 1, b to 2 and c to 3, then the superclass speci-
          fication:

             foo          a b c

          will result in the entries

             foo          1
             foo          2
             foo          3

          being placed in the superclass label file. If another super-
          class is defined in terms of foo, say bar, then bar's
          entries in the superclass label file will include all of
          foo's entries.
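          The expansion just described can be sketched in a few lines
          of Python (an illustration only; the label values and
          function names here are hypothetical, not part of
          lextools):

```python
# Illustrative sketch of lexmakelab's superclass expansion; the data
# and function names here are hypothetical, not the lextools API.
basic = {"a": 1, "b": 2, "c": 3, "d": 4}

def expand_superclasses(lines):
    """Each line: (superclass, members); members may be basic labels
    or previously defined superclasses, and a superclass may be
    extended over several lines."""
    classes = {}
    for name, members in lines:
        values = classes.setdefault(name, set())
        for m in members:
            if m in classes:          # a predefined superclass
                values |= classes[m]
            else:                     # a basic label
                values.add(basic[m])
    return classes

scl = expand_superclasses([
    ("foo", ["a", "b", "c"]),
    ("bar", ["foo", "d"]),   # bar inherits all of foo's entries
])
# scl["foo"] == {1, 2, 3}; scl["bar"] == {1, 2, 3, 4}
```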

          Superclasses are useful for defining classes of symbols. For
          example, one could define all of the vowel symbols as
          belonging to the superclass vowel.

          The special designation Category: in the first column of a
          line defines the entry in the second column as having the
          features specified by the remaining entries in the line,
          which should be predefined superclasses. Thus, the set of
          lines

             gender          masc fem
             case            nom acc gen dat
             number          sg pl
             person          1st 2nd 3rd
             Category:       noun gender case number
             Category:       verb number person

          specifies noun and verb as categories which have, respec-
          tively, the features gender, case and number; and number and
          person. Then one can specify these features, with their
          appropriate values (in any order), in regular expressions:
          Hund[noun case=nom number=sg gender=masc].
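          As an illustration of what such a Category: declaration
          licenses, the set of fully specified feature bundles for a
          category is the cross product of its features' values
          (hypothetical Python sketch, not the lextools
          implementation):

```python
from itertools import product

# Hypothetical illustration of how a Category: line expands into
# fully specified feature bundles; not part of the lextools API.
features = {
    "gender": ["masc", "fem"],
    "case":   ["nom", "acc", "gen", "dat"],
    "number": ["sg", "pl"],
}

def expand_category(name, feature_names):
    """Enumerate every fully specified bundle for a category."""
    value_sets = [features[f] for f in feature_names]
    return [
        "[%s %s]" % (name, " ".join(
            "%s=%s" % (f, v) for f, v in zip(feature_names, values)))
        for values in product(*value_sets)
    ]

bundles = expand_category("noun", ["gender", "case", "number"])
# 2 genders x 4 cases x 2 numbers = 16 fully specified bundles
```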

          By default the label <epsilon>, with numeric value 0, is
          added to the basic label file.  Similarly <sigma> is put in
          the superclass file for each non-epsilon member of the label
          set.  The -L flag instructs lexmakelab to add a number of
          additional labels that are needed by other tools in the
          lextools package, or by the gentex multilingual TTS
          text-analysis module. These labels are:

             -- Linguistic boundary symbols: $$, ++, ww, aa, ii, ,,,
             .., !!, ??

             -- XML start and end tags: <xml>, </xml>

             -- Accentuation marks: acc:+, acc:-, acc:c

             -- Powers of 10 and 20 (needed by lexnumbuilder):
                10^1 - 10^20, 20^1 - 20^2

             -- Internal representation of multipliers of powers of
                10 or 20 (needed by lexnumbuilder): 0*, 1*, 2*, 3*,
                4*, 5*, 6*, 7*, 8*, 9*

             -- Beginning and end of string symbols (useful for
                lexrulecomp): <bos>, <eos>


        Regular Expression Compilation
          lexcompre compiles regular expression into an FSM.

          If the -x flag is specified, regular expression is inter-
          preted as a string, rather than as a regular expression.
          Long symbols (enclosed in [...]) and costs (enclosed in
          <...>) are still interpreted, but regular expression opera-
          tors are treated as ordinary symbols.


        Printing Acyclic Machines As Strings
          lexfsmstrings prints acyclic automata and transducers as,
          respectively, strings or pairs of strings, one string or
          pair of strings per line, using the symbols in basic label
          file.  Long symbols are enclosed in [...] and costs in
          <...>.

          Printing of string costs can be suppressed with the -c flag.

        Producing Subsets of Label Sets
          lexdownsize takes a basic label file and a superclass file
          (required), and produces a new pair of label and superclass
          files with the basename name (i.e., name.lab and name.scl),
          containing only those labels found in the output (default,
          or -o flag) or input (-i) arc labels of fsm.

          This is a particularly useful thing to do if you have a
          large label set, but you know that, say, the output of a
          given transducer T only uses a small subset of those labels:
          call this Labels(T). Necessarily, any transducers that T is
          composed with need only know what to do with Labels(T). So a
          set of ordered rules, for instance, that composes with T,
          only needs to know what to do with strings constructed out
          of Labels(T). If Labels(T) is a lot smaller than the full
          labelset, compilation of the ruleset can be much more effi-
          cient.
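          The computation lexdownsize performs can be pictured as
          follows (a Python sketch; the tuple-based arc
          representation is a stand-in for the FSM library's actual
          structures):

```python
# Sketch of what lexdownsize computes; the arc representation here
# (src, dst, ilabel, olabel) is a stand-in, not the FSM library API.
def used_labels(arcs, side="output"):
    """Collect the set of labels on one side of an FSM's arcs."""
    idx = 3 if side == "output" else 2
    return {arc[idx] for arc in arcs}

def downsize(label_file, arcs, side="output"):
    """Keep only label-file entries whose value appears in the FSM,
    always retaining <epsilon> (value 0)."""
    keep = used_labels(arcs, side) | {0}
    return [(name, v) for name, v in label_file if v in keep]

labels = [("<epsilon>", 0), ("a", 1), ("b", 2), ("c", 3), ("d", 4)]
arcs = [(0, 1, 1, 2), (1, 2, 3, 2)]   # outputs use only label 2
small = downsize(labels, arcs)         # keeps <epsilon> and b
```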


        Lexicon Compilation
          lexcomplex compiles the list of regular expressions in lexi-
          con file. Normal regular expression syntax, line continua-
          tion with backslash, and comment syntax as defined in
          lextools(5) apply.  By default, it produces the union of the
          individual regular expressions.

          If either -m or -M is set, the file is interpreted as
          specifying a map, which defines some output for any input
          string over the alphabet of the label set: note that the
          individual regular expressions should in this case specify
          transducers.

          With the -m flag the relation

             (Id(<sigma> - Projection_1(R)) U R)*

          is computed, which is efficient to compute and probably what
          you want if you are interested in building a map involving
          transductions of single symbols to strings.

          With the -M flag the transductions specified by each expres-
          sion in lexicon file are treated as corresponding parts of a
          batch left-to-right obligatory context-sensitive rewrite
          rule. Thus lines such as:


             input1    :     output1
             input2    :     output2
             input3    :     output3

          are treated as if one had written a batch rule


             |input1|         |output1|
             |input2|   ->    |output2|
             |input3|         |output3|

          and compiled it with lexrulecomp (which, lamentably, does
          not currently support batch rules in general).

          The behavioral difference between -m and -M is best illus-
          trated by example. Suppose one has the following alphabet:


          a b c d

          and the following lexicon file:


           a  : b
           cd : ac

          Compiled with the -M flag output.fst will guarantee that any
          a in a string will be transduced to b, and any sequence cd
          will be transduced to ac. Thus abcd yields as output only
          bbac:

          lexcompre -l foo.lab -s abcd | fsmcompose - output.fst |
               lexfsmstrings -l foo.lab
          abcd    bbac

          Compiled with -m, output.fst will guarantee that any a in a
          string will be transduced to b (since the input a is
          subtracted from <sigma> in computing the complement of the
          left projection of what is specified in lexicon file), but
          the sequence cd can be transduced either to itself (via
          (Id(<sigma> - Projection_1(R)))*) or else to ac (via R).

          lexcompre -l foo.lab -s abcd | fsmcompose - output.fst |
               lexfsmstrings -l foo.lab
          abcd    bbac
          abcd    bbcd

          Note that since <sigma> is used in the compilation of maps,
          if one uses the -[mM] flags one must also specify the super-
          class label file with -S.
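          The difference between the two flags can be simulated
          directly on strings (an illustrative Python sketch; the
          real tools build transducers, and the rule set here is the
          two-line lexicon file above):

```python
# Illustrative simulation of the -m vs. -M semantics on strings (a
# sketch; the real tools operate on transducers, not Python sets).
SIGMA = set("abcd")
RULES = {"a": "b", "cd": "ac"}        # the lexicon file's relation R

def m_outputs(s):
    """(Id(<sigma> - Projection_1(R)) U R)*: at each point either
    apply some rule of R, or copy one symbol not in R's domain."""
    if not s:
        return {""}
    outs = set()
    for lhs, rhs in RULES.items():
        if s.startswith(lhs):
            outs |= {rhs + rest for rest in m_outputs(s[len(lhs):])}
    if s[0] in SIGMA - set(RULES):     # identity only off R's domain
        outs |= {s[0] + rest for rest in m_outputs(s[1:])}
    return outs

def M_output(s):
    """Obligatory batch rewrite: rules apply wherever they can."""
    out, i = "", 0
    while i < len(s):
        for lhs, rhs in RULES.items():
            if s.startswith(lhs, i):
                out, i = out + rhs, i + len(lhs)
                break
        else:
            out, i = out + s[i], i + 1
    return out

# m_outputs("abcd") == {"bbac", "bbcd"}; M_output("abcd") == "bbac"
```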


        Finite-State Grammars
          lexarclist compiles an arclist specification of a finite-
          state grammar from arclist file into an FSM that implements
          that grammar.


        Morphological Paradigms
          lexparadigm takes the paradigm specification in paradigm
          file and the stems and morphological class information in
          lexicon file and produces a replace class FSM (grm(1)) that
          recognizes all and only the inflected forms of the stems in
          lexicon file.

          The assignment of lexical entries to paradigms is
          accomplished by specifying the appropriate paradigm as part
          of the lexical entry. This can be done simply by including
          the paradigm label as part of the entry, but it is
          recommended that you pick some appropriate feature name
          such as "pdm" and some attribute with an appropriate name
          like "class". Thus if there is a paradigm definition for
          class [m1a] in the paradigm file, an entry in the lexicon
          file might look as follows: abazh'ur[pdm class=m1a].


        Context-Free Grammars
          lexcfcompile compiles the context-free grammar specified in
          grammar specification file into a finite-state acceptor
          that recognizes the language defined by the grammar.
          lexcfcompile first compiles the grammar into a transducer
          representation compatible with that produced by grmread
          (grm(1)), and then passes that transducer to GRMCfCompile
          (grm(3)).  Thus lexcfcompile produces exactly what one gets
          by first calling grmread and piping the result to
          grmcfcompile (except that the grammar file format for
          lexcfcompile is a little more flexible than the file format
          for grmcfcompile).

          Note that only context-free grammars that are formally
          equivalent to finite-state grammars can be compiled: GRMCf-
          Compile does not currently support finite-state approxima-
          tion of context free grammars. See grm(1) for discussion of
          restrictions on what kinds of CF grammars can be compiled.

          By default the start symbol is assumed to be the first non-
          terminal mentioned in the grammar. With the -s flag, no
          start symbol is set. This can be useful if you want to set
          the start state at some later time: see the section on
          Dynamic Modifications in grm(1).


        Context-Sensitive Rulesets
          lexrulecomp compiles a set of context-dependent rewrite
          rules specified in rule file into a series of transducers
          which are then composed together. The algorithm is the one
          described in grmcdcompile in grm(1); lexrulecomp calls GRM-
          CdCompile (grm(3)).

          See grm(1) for the description on formal restrictions on
          context-dependent rules.


        Number Expression Compilation
          lexnumbuilder allows one to construct transducers that map
          between digit strings and their number names.

          The input to lexnumbuilder is an FSM that enumerates the
          well-formed number names of the language, where each word
          has been annotated with its semantic interpretation in terms
          of (sums of) products of powers of ten, in terms of the pow-
          ers and multipliers listed above. Such an FSM can be conve-
          niently produced with a context-free grammar and lexcfcom-
          pile. For example, rules like


           [HUNDRED]   -> [UNITW] [ww] [HUNDREDW] ([ww] and [ww] [TEN])?
           [HUNDREDW]  -> hundred[10^2]
           [TEN]       -> [TENW] ([ww] [UNITW])?
           [TENW]      -> twen[2*]ty[10^1]
           [UNITW]     -> one[1*]

          as part of a context-free grammar will produce as output an
          analysis like the following for "one hundred and twenty
          one":
          one[1*][ww]hundred[10^2][ww]and[ww]twen[2*]ty[10^1][ww]one[1*]

          An FSM that produces representations like that is appropri-
          ate as input to lexnumbuilder.

          The first operation that is performed on the input is to
          compute a bimorphism, producing a transducer with the multi-
          pliers and powers on the output and everything else on the
          input. Thus the example above would become:

           one[ww]hundred[ww]and[ww]twenty[ww]one
           [1*][10^2][2*][10^1][1*]
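          The effect of this step on the annotated string can be
          sketched as follows (illustrative Python; lexnumbuilder of
          course performs this on the FSM, not on strings):

```python
import re

# Sketch of the bimorphism step: semantic labels (multipliers and
# powers) move to the output tape, everything else stays on the
# input tape. Illustrative only; lexnumbuilder works on FSMs.
SEMANTIC = re.compile(r"\[(?:\d\*|(?:10|20)\^\d+)\]")
TOKEN = re.compile(r"\[[^\]]*\]|[^\[]+")

def bimorphize(annotated):
    inp, out = [], []
    for tok in TOKEN.findall(annotated):
        (out if SEMANTIC.fullmatch(tok) else inp).append(tok)
    return "".join(inp), "".join(out)

pair = bimorphize(
    "one[1*][ww]hundred[10^2][ww]and[ww]twen[2*]ty[10^1][ww]one[1*]")
# ("one[ww]hundred[ww]and[ww]twenty[ww]one", "[1*][10^2][2*][10^1][1*]")
```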

          At this stage, you can dump this transducer to an intermedi-
          ate fsm dump file using the -i flag, if you wish. This is
          useful for debugging. The remainder of the processing per-
          formed by lexnumbuilder involves composing this transducer
          with a transducer that maps from factorizations like
          "[1*][10^2][2*][10^1][1*]" into digit strings like "121".
          Since the rules that lexnumbuilder uses (see below) do
          assume that these expressions occur in a certain order, if
          there is a bug in the grammar you will probably lose some of
          the expressions that you had thought were allowed. Computing
          the difference (using fsmdifference) between the input of
          the dumped transducer and the output of the final transducer
          produced by lexnumbuilder allows one to see what is missing.

          If the -g flag is specified, the next phase involves comput-
          ing "Germanic inversion" on the output of the bimorphized
          transducer. Number names in German and other Germanic lan-
          guages in particular have the well-known property that units
          occur before decades, so that "twenty one" becomes "one and
          twenty". This would generally result in the "reversed" fac-
          torization [1*][2*][10^1] (i.e. 1 + 2*10^1). Germanic inver-
          sion will flip this into the normal order [2*][10^1][1*].
          This is performed using the transducer specified in
          /usr/local/lib/lextools-3/Fragments/NBGmcFlip.asc, which is
          compiled at runtime using the specified label set.

          The next phase uses the transducer specified in
          /usr/local/lib/lextools-3/Fragments/NBPowerInterp.asc (com-
          piled at runtime using the label set) to "normalize" to a
          fully explicit factorization into sums of products of powers
          of ten. This becomes necessary when dealing with larger
          numbers like eight hundred and ten thousand and eighty
          nine, which would at this stage be expressed as
          [8*][10^2][1*][10^1][0*][10^3][8*][10^1][9*] (i.e., (8*10^2
          + 1*10^1 + 0)*10^3 + 8*10^1 + 9), for which we want
          [8*][10^5][1*][10^4][0*][10^3][0*][10^2][8*][10^1][9*]
          (i.e., 8*10^5 + 1*10^4 + 0*10^3 + 0*10^2 + 8*10^1 + 9).
          The rules in
          /usr/local/lib/lextools-3/Fragments/NBPowerInterp.rul (the
          source of
          /usr/local/lib/lextools-3/Fragments/NBPowerInterp.asc)
          perform this normalization.  Number name systems differ in
          what
          powers of the base (= 10, usually) there are distinctive
          words for. European languages typically have words for 10^0,
          10^1, 10^2, 10^3, 10^6, 10^9, 10^12, etc. Chinese and
          Japanese have words for 10^0, 10^1, 10^2, 10^4, 10^8, 10^12,
          etc. Hindi and other South Asian languages have words for
          10^0, 10^1, 10^2, 10^3, 10^5, 10^7. These differences affect
          the set of rules needed to perform the normalization.
          /usr/local/lib/lextools-3/Fragments/NBPowerInterp.rul should
          handle most cases that you will encounter. The ruleset also
          contains a small set of vigesimal rules to handle French-
          type constructions like quatre-vingt dix [4*][20^1][10^1],
          converting them into (in this case) [9*][10^1]. There is no
          support for full-blown vigesimal systems, though this could
          be added.
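          The effect of the normalization (for the plain decimal
          case, ignoring the vigesimal rules) can be sketched by
          evaluating the packed factorization and re-expanding it
          (illustrative Python; the choice of 10^3 and above as
          group-closing powers is an assumption of this sketch, not a
          statement about the rule file):

```python
# Sketch of the effect of the NBPowerInterp normalization, decimal
# powers only (no 20^k): evaluate the packed factorization to an
# integer, then re-expand it into a fully explicit sum of products
# of powers of ten. Illustrates the result, not the rules themselves.
def evaluate(tokens):
    total = group = pending = 0
    for t in tokens:
        if t.endswith("*"):
            pending += int(t[:-1])
        else:
            p = 10 ** int(t.split("^")[1])
            if p >= 1000:          # assume big powers close a group
                total += (group + pending) * p
                group = pending = 0
            else:
                group += pending * p
                pending = 0
    return total + group + pending

def normalize(tokens):
    n = str(evaluate(tokens))
    out = []
    for i, d in enumerate(n):
        out.append(d + "*")
        power = len(n) - 1 - i
        if power:
            out.append("10^%d" % power)
    return out

# normalize(["8*","10^2","1*","10^1","0*","10^3","8*","10^1","9*"])
# == ["8*","10^5","1*","10^4","0*","10^3","0*","10^2","8*","10^1","9*"]
```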

          The next phase of the compilation performs the complementary
          operation on digit strings, expanding them into a factorized
          representation, using
          /usr/local/lib/lextools-3/Fragments/NBFactors.asc, and a
          rule of the form

           [digits] => [multipliers]

          to map from the actual digit symbols to the multipliers
          ([2*], etc.).  This is then composed with the transducer
          from the previous phase, and you now have something that
          goes from digit strings to number names.

          There are several additional options.

          The -c flag allows the use of commas or other separators in
          numerical expressions, such as "1,234" instead of "1234".
          The argument should be an fsm that inserts commas at speci-
          fied places in the string. One for English might be speci-
          fied as follows

          [digits]?[digits]?[digits](([<epsilon>]:,)?[digits][digits][digits])*

          Note, though, that the argument must be a compiled fsm
          rather than a regular expression.

          With the -D flag one can specify an fsm that implements the
          decimal point and the manner of reading the numbers after
          that point. It should specify:


             1. The decimal point symbol(s) to be used, and their man-
             ner of reading

             2. The manner of splitting up the digit string so that it
             can be read properly.

          For instance, in English we tend to read decimals as
          sequences of single digits, so that ".1415" is read as if it
          were written, "point 1 4 1 5". So for English we want to
          introduce spaces after each digit:

          (.:[ww]point[ww]) [digits] (([<epsilon>]:[ww]) [digits])*

          For Romance languages one would want to group the digits
          into pairs or triples.

          The -L flag allows one to specify a linguistic filter to be
          performed on the output of the digit-to-number-name trans-
          duction. For instance, in Spanish you have gender agreement
          among the members of the number name. This can be expressed
          as an automaton that is then passed to the -L flag.

          To use lexnumbuilder, one must have an entry in the symbol
          file that defines the superclass label "digits", with
          entries 0-9 in that order. Thus:

           digits        0 1 2 3 4 5 6 7 8 9

          lexnumbuilder uses this superclass to compile a matched rule
          of the form:

           [digits] => [multipliers]

          to map between the digit representation and the internal
          multiplier symbols listed above. The reason that the digit
          set must be specified by the user is that this is likely to
          vary from coding scheme to coding scheme. If you do not pro-
          vide this, or if you do not put the entries in the specified
          order, you will have no joy.
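          The matched rule pairs the i-th entry of the digits
          superclass with the i-th internal multiplier symbol, which
          is why the order matters. A sketch of the intended
          correspondence (illustrative Python, not the lextools
          implementation):

```python
# Sketch of the matched rule [digits] => [multipliers]: the i-th
# label in the "digits" superclass maps to the i-th internal
# multiplier symbol, which is why the entries must appear in the
# order 0-9. Illustrative only, not the lextools implementation.
digits = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
multipliers = ["%d*" % i for i in range(10)]

def match_rule(s):
    """Rewrite each digit symbol as its positional multiplier."""
    table = dict(zip(digits, multipliers))
    return [table[ch] for ch in s]

# match_rule("121") == ["1*", "2*", "1*"]
```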

          lexnumbuilder is by far the most complicated tool in the
          lextools package. Some sample grammars have been provided
          which can be used as a basis for building new grammars; see
          /usr/local/lib/lextools-3/grammars/README for details.


        Prefixed Currency Expression Compilation
          lexcurrency allows for the compilation of transducers that
          handle currency expressions where the currency symbol occurs
          before the number and the currency word occurs after the
          number. It is a rather specialized tool.

          lexcurrency requires that the user have defined the super-
          class labels:

             digits, as for lexnumbuilder above

             comma, the symbol(s) used to group long numbers (e.g. the
             "," in "2,000")

             zero, the symbol(s) used for zero in the script for the
             language

          The task of lexcurrency is to convert an expression like
          "$3.25" into "three dollars and twenty five cents", or an
          expression like "$2.5 million" into "two point five million
          dollars". Call the "$" the currency symbol; the numeric
          part before the decimal point (or whatever is standard for
          that language) the "major" currency portion; the point (or
          whatever) itself the "point"; the part after the decimal
          point the "minor" currency portion; and things like
          "million" the "large number" portion. Terms like "dollars"
          and "cents" we will term the "major" and "minor"
          expressions.

          The currency specification file is a file of the form (see
          also lextools(5)), where tabs (it has to be tabs) separate
          the elements:

           symbol    major-expr    point    minor-expr    large-number

          For example:

           $    [ww](dollars)    .:[ww]    [ww]cents    [ww](m|b)illion

          would allow one to generate "six dollars twenty five cents"
          from "$6.25" and "two point five million dollars" from
          "$2.5 million".

          In addition to the currency specification file lexcurrency
          requires an already constructed number fsm (argument to the
          -n flag), as constructed by lexnumbuilder.  Note that it is
          assumed that you have provided a way of reading decimal
          fractions!

          For large number expressions lexcurrency takes the normal
          mode of reading decimal numbers and combines it with the
          large number set specified and the word for the major cur-
          rency. Thus if your number model tells you that "6.253" is
          read as "six point two five three", then that is what gets
          used by lexcurrency for, e.g., "$6.253 billion".

          For other currency expressions lexcurrency takes the number
          model plus the currency specification, and from

          symbol any-number-of-digits point exactly-2-digits

          produces

          number-name major-expression point-expression number-name
          minor-expression

          The point-expression is just the output side of the trans-
          ducer specified in the point part of the currency expres-
          sions file: in the English example above ".:[ww]" means that
          the "point" is written as a decimal point, and is expressed
          as a word boundary ("[ww]"). If you wanted it to be, e.g.,
          "and" ("six dollars and twenty five cents"), you would spec-
          ify something like ".:([ww]and[ww])".

          The exactly-2-digits restriction is of course a concession
          to the practical fact that virtually all currency expres-
          sions can be written (maximally) in an X*.XX format, where X
          is a digit. This would not have worked, for example, for
          traditional British currency expressions expressed in
          pounds, shillings and pence.

          Three separate filters are allowed on the output, one for
          the major expression (-M), one for the minor expression (-m)
          and one for the large number expression (-L). In each case,
          these should be fsms that specify some operation on the out-
          put of the transduction from currency-symbol-plus-numeric
          strings to linguistic expressions. For instance, the above
          grammar as it stands will produce anomalous output like
          "*one dollars and one cents" for "$1.01", since the major
          and minor currency terms are "dollars" and "cents", respec-
          tively. This can be cleaned up though with a rule such as
          the following, which can then be compiled and passed as the
          argument to both -M and -m:
          s -> [<epsilon>] / [<bos>] one [ww] (cent|dollar) __

        FSM Replacements
          lexreplace functions much as grmreplace (grm(1)) except that
          instead of mapping between single symbols and names of FSM's
          to replace those symbols, one can specify unions of symbols
          and have the specified FSM replace any symbol in that union
          in topology fsm. Thus:


           foo.fst      a|b|c|d

          specifies that any of a, b, c, d is replaced by foo.fst in
          topology fsm.


     CAVEATS
          Large grammars and rulesets can produce large intermediate
          results during the compilation process, and large final
          results at the output of the compilation process. Sometimes
          the intermediate machines may overrun the available memory
          on your machine. A simple solution is almost always avail-
          able in such cases: split up the rule file or lexicon file
          into a set of smaller files, and then combine them at the
          end. For example, if a set of 100 ordered rules is too big
          or too slow to compile, split it into, say, four files of
          25 rules each, compile them separately, and then compose
          the four together when you are done. Alternatively, the
          last step can be omitted, and the rules composed together
          only when they are used, at runtime.

          For lexparadigm, you might think that you could do something
          like the following to say that "volos" is a member of both
          paradigm m1e and m1a:

             volos([pdm class=m1e]|[pdm class=m1a])

          Do this instead:

             volos[pdm class=m1e]
             volos[pdm class=m1a]

          The latter will inflect the strings "volos[pdm class=m1e]"
          and "volos[pdm class=m1a]" with the endings from the respec-
          tive paradigms, which is generally what you want. The former
          will inflect the strings "volos[pdm class=m1e]" and
          "volos[pdm class=m1a]" with the endings from both paradigms,
          which is generally not what you want.


     BUGS
          The file format for lexcurrency is a little funky.
          Some of these tools could use better names.


     SEE ALSO
          lextools(5)                        lextools file formats
          grmintro(1)                        Introduction to the GRM
                                             programs and library.
          fsmintro(1)                        Introduction to the FSM
                                             programs and library.
     FILES
          /usr/local/bin/lextools-3          Distribution Lextools
                                             binaries.
          /usr/local/lib/lextools-3/grammars Sample Grammars
     AUTHOR
          Richard Sproat (rws@research.att.com)