NAME
Lextools symbol files, regular expression syntax, general
file formats, grammar formats - lextools file formats
DESCRIPTION
Symbol Files
The lextools symbol file specifies the label set of an
application. Labels come in three basic flavors:
Basic labels
Superclass labels
Category labels
Basic labels are automatically assigned a unique integer
representation (excluding 0, which is reserved for
"<epsilon>"), and this information is compiled by lexmakelab
into the basic label file, which is in the format specified
in fsm(5).
Superclass labels are collections of basic labels: a super-
class inherits all of the integral values of all of the
basic labels (or superclass labels) that it contains.
Category labels are described below.
The lines in a symbol file look as follows:
superclass1 basic1 basic2 basic3
You may repeat the same superclass label on multiple (possi-
bly non-adjacent) lines: whatever is specified in the new
line just gets added to that superclass. The "basic" labels
can also be superclass labels: in that case, the superclass
in the first column recursively inherits all of the labels
from these superclass labels.
The one exception to this line format is a category specifi-
cation, which is written as follows:
Category: catname feat1 feat2 feat3
The literal "Category:" must be the first entry: it is case
insensitive. "catname" should be a new name. The remaining
entries ("feat1" and so forth) must be previously defined
superclass labels: each names a feature, and the basic
labels it contains are that feature's possible values.
The following sample symbol file serves to illustrate:
dletters a b c d e f g h i j k l m n o p
dletters q r s t u v w x y z
uletters A B C D E F G H I J K L M N O P
uletters Q R S T U V W X Y Z
letters dletters uletters
gender masc fem
case nom acc gen dat
number sg pl
person 1st 2nd 3rd
Category: noun gender case number
Category: verb number person
For this symbol set, the superclass "dletters" will contain
the labels for all the lower-case letters, "uletters" the
upper-case letters, and "letters" both. Defined categories
are "noun" and "verb". "noun", for instance, has the fea-
tures "gender", "case" and "number". The feature "gender"
can have the values "masc" and "fem".
(NB: this way of representing features is inherited in con-
cept from Lextools 2.0.)
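For concreteness, the superclass expansion described above can be sketched as follows. This is a hypothetical illustration, not the lexmakelab implementation; the function name expand_symbols is invented:

```python
def expand_symbols(lines):
    """Parse symbol-file lines; return {superclass: set of basic labels}.

    A name that never appears in the first column is a basic label
    and denotes itself; superclasses inherit recursively.
    """
    defs = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].lower() == "category:":
            continue  # category lines are handled separately
        defs.setdefault(fields[0], []).extend(fields[1:])

    def basics(name, seen=()):
        if name not in defs:          # a basic label denotes itself
            return {name}
        result = set()
        for member in defs[name]:
            if member not in seen:    # guard against accidental cycles
                result |= basics(member, seen + (name,))
        return result

    return {name: basics(name) for name in defs}

classes = expand_symbols([
    "dletters a b c",
    "uletters A B C",
    "letters dletters uletters",
])
```

Repeated superclass lines simply extend the same entry, mirroring the rule that a repeated superclass label accumulates members.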
Some caveats:
You should not use the reserved terms "<epsilon>" or
"<sigma>" in a symbol file. If you pass the -L flag to lex-
makelab you should also not use any of the special symbols
that it introduces with that flag (see lextools(1)).
Symbol files cannot contain comments. Backslash continuation
syntax does not work.
Regular Expressions
Regular expressions consist of strings, possibly with speci-
fied costs, conjoined with one or more operators.
strings are constructed out of basic and superclass labels.
Labels themselves may be constructed out of one or more
characters. Characters are defined as follows:
If 2-byte characters are specified (Chinese, Japanese,
Korean...), a character can be a pair of bytes if the
first byte of the pair has the high bit set.
In all other conditions a character is a single byte.
Multicharacter tokens MUST BE delimited in strings by a left
bracket (default: "[") and right bracket (default: "]").
This includes special tokens "<epsilon>", "<sigma>", "<bos>"
and "<eos>". This may seem inconvenient, but the regular
expression compiler must have some way to figure out what a
token is. Whitespace is inconvenient since if you have a
long string made up of single-character tokens, you don't
want to be putting spaces between every character: trust me.
You may also use brackets to delimit single-character tokens
if you wish.
Some well-formed strings are given below:
abc[foo]
[<epsilon>]ab[<sigma>]
Note that the latter uses the superclass "<sigma>" (con-
structed by lexmakelab to include all symbols of the alpha-
bet except "<epsilon>"): in order to compile this expres-
sion, the superclass file must have been specified using the
-S flag.
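The bracket-delimiting rule can be sketched as follows (single-byte characters only; the real compiler also handles the 2-byte case described above, and tokenize is an invented name):

```python
def tokenize(s, lb="[", rb="]"):
    """Split a lextools string into tokens: a bracketed span is one
    multicharacter token; everything else is one token per character."""
    tokens, i = [], 0
    while i < len(s):
        if s[i] == lb:
            j = s.index(rb, i)        # find the closing bracket
            tokens.append(s[i + 1:j])
            i = j + 1
        else:
            tokens.append(s[i])
            i += 1
    return tokens
```

Applied to the examples above, "abc[foo]" yields the four tokens a, b, c, foo.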
If features are specified in the label set, then one can
specify the features in strings in a linguistically appeal-
ing way as follows:
food[noun gender=fem number=sg case=nom]
Order of the feature specifications does not matter: the
order is determined by the order of the symbols in the sym-
bol file. Thus the following is equivalent to the above:
food[noun case=nom number=sg gender=fem]
The internal representation of such feature specifications
looks as follows: "food[_noun][nom][sg][fem]".
Unspecified features will have all legal values filled in.
Thus
food[noun case=nom number=sg]
will produce a lattice with both [fem] and [masc] as alter-
natives. Inappropriate feature values will cause a warning
during the compilation process. Since features use super-
classes, again, in order to compile such expressions, the
superclass file must have been specified using the -S flag.
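The fill-in behavior for unspecified features can be sketched as follows. This is an illustration of the idea only, using the sample symbol file above; the names expand, CATEGORIES and FEATURES are invented, and the ordering of the output is illustrative rather than the exact internal ordering:

```python
import itertools

CATEGORIES = {"noun": ["gender", "case", "number"]}
FEATURES = {"gender": ["masc", "fem"],
            "case": ["nom", "acc", "gen", "dat"],
            "number": ["sg", "pl"]}

def expand(cat, **given):
    """Return all full feature vectors for cat consistent with the
    given feature=value pairs; unspecified features range over all
    of their legal values."""
    choices = [[given[f]] if f in given else FEATURES[f]
               for f in CATEGORIES[cat]]
    return [("_" + cat,) + combo for combo in itertools.product(*choices)]
```

With case and number fixed but gender unspecified, both the masc and fem vectors come out as alternatives, as described above.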
Costs can be specified anywhere in strings. They are speci-
fied by a positive or negative floating point number within
angle brackets. The current version of lextools assumes the
tropical semiring, so costs are accumulated across strings
by summing. Thus the following two strings have the same
cost:
abc[foo]<3.0>
a<-1.0>b<2.0>c<1.0>[foo]<0.5><0.5>
Note that a cost on its own -- i.e. without an accompanying
string -- specifies a machine with a single state, no arcs,
and an exit cost equal to the cost specified.
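Since the tropical semiring accumulates costs by ordinary addition, the total cost of a string is just the sum of its bracketed costs. A minimal sketch (total_cost is an invented name):

```python
import re

def total_cost(s):
    """Sum every <float> cost annotation in a lextools string.
    Special tokens such as [<epsilon>] are not numeric and are ignored."""
    return sum(float(c) for c in re.findall(r"<(-?\d+(?:\.\d+)?)>", s))
```

Both example strings above sum to 3.0.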
Regular expressions can be constructed as follows. First of
all a string is a regular expression. Next, a regular
expression can be constructed out of one or two other regu-
lar expressions using the following operators:
regexp1* Kleene star
regexp1+ Kleene plus
regexp1^n power
regexp1? optional
!regexp1 negation
regexp1 | regexp2 union
regexp1 & regexp2 intersection
regexp1 : regexp2 cross product
regexp1 @ regexp2 composition
regexp1 - regexp2 difference
In general the same restrictions on these operations apply
as specified in fsm(1). For example, the second argument to
"-" (difference) must be an unweighted acceptor. Note also
that the two arguments to ":" (cross product) must be
acceptors. The argument n to "^" must be a positive integer. The
arguments to "@" (composition) are assumed to be transduc-
ers.
The algorithm for parsing regular expressions finds the
longest stretch that is a string, and takes that to be the
(first) argument of the unary or binary operator immediately
to the left, or the second argument of the binary operator
immediately to the right. Thus "abcd | efgh" represents the
union of "abcd" and "efgh" (which is reminiscent of Unix
regular expression syntax) and "abcd*" represents the tran-
sitive closure of "abcd" (i.e., not "abc" followed by the
transitive closure of "d", which is what one would expect on
the basis of Unix regular expression syntax).
The precedence of the operators is as follows (from lowest
to highest), the last parenthesized group being of equal
precedence:
| & - : (* + ? ^)
But this is hard to remember, and in the case of multiple
operators, it may be complex to figure out which elements
get grouped first. The use of parentheses is highly recom-
mended: use parentheses to disambiguate "!(abc | def)" from
"(!abc) | def".
Spaces are never significant in regular expressions.
Escapes, Comment Syntax and Miscellaneous other
Comments can appear in input files, with the exception of
symbol files. Comments are preceded by "#" and continue to
the end of the line.
You can split a line, or a regular expression within a line,
across multiple lines by placing a backslash (`\') at the
end of the line, immediately before the newline character.
Special characters, including the comment character, can be
escaped with `\'. To get a `\', escape it with `\': `\\'.
Characters that should be escaped are: `\', `*', `^', `?',
`!', `|', `&', `:', `@', `-', `[', `]', `<' and `>'.
Lexicons
The input to lexcomplex is simply a list of regular expres-
sions. The default interpretation is that these expressions
are to be unioned together, but other interpretations are
possible: see lextools(1) for details.
If any of the regular expressions denotes a relation (i.e.,
a transducer) the resulting union also denotes a relation,
otherwise it denotes a language (i.e., an acceptor).
Arclists
An arclist (borrowing a term from Tzoukermann and Liberman's
1990 work on Spanish morphology) is a simple way to specify
a finite-state morphological grammar. Lines can be of one of
the following three formats:
instate outstate regexp
finalstate
finalstate cost
Note that cost here should be a floating point number not
enclosed in angle brackets. State names need not be enclosed
in square brackets: they are not regular expressions.
The following example, for instance, specifies a toy grammar
for English morphology that handles the words, "grammati-
cal", "able", "grammaticality", "ability" (mapping it to
"able + ity"), and their derivatives in "un-":
START ROOT un [++] | [<epsilon>]
ROOT FINAL grammatical | able
ROOT SUFFIX grammatical
ROOT SUFFIX abil : able
SUFFIX FINAL [++] ity
FINAL 1.0
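A reading of the three line formats can be sketched as follows. This is a hypothetical illustration (parse_arclist is an invented name, and the regexp field is left uncompiled), not the lexarclist implementation:

```python
def parse_arclist(lines):
    """Split arclist lines into arcs and final states."""
    arcs, finals = [], {}
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) == 3:                      # instate outstate regexp
            arcs.append(tuple(fields))
        elif len(fields) == 2:                    # finalstate cost
            finals[fields[0]] = float(fields[1])
        elif len(fields) == 1:                    # finalstate, zero cost
            finals[fields[0]] = 0.0
    return arcs, finals

arcs, finals = parse_arclist([
    "START ROOT un [++] | [<epsilon>]",
    "ROOT FINAL grammatical | able",
    "ROOT SUFFIX grammatical",
    "ROOT SUFFIX abil : able",
    "SUFFIX FINAL [++] ity",
    "FINAL 1.0",
])
```

Note that everything after the second whitespace-delimited field is the regular expression, which may itself contain spaces.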
Paradigms
The paradigm file specifies a set of morphological para-
digms, specified as follows.
Each morphological paradigm is introduced by the word "Para-
digm" (case insensitive), followed by a bracketed name for
the paradigm
Paradigm [m1a]
Following this are specifications of one of the following
forms:
Suffix suffix features
Prefix prefix features
Circumfix circumfix features
The literals "Suffix", "Prefix" and "Circumfix" are matched
case-insensitively. The remaining two fields are regular
expressions describing the phonological (or orthographic)
material in the affix, and the features. The "Circumfix"
specification has a special form, namely "regexp...regexp".
The three adjacent dots, which must be present, indicate the
location of the stem inside the circumfix. In all cases,
features are placed at the end of the morphologically com-
plex form. There is no provided mechanism for infixes,
though that would not be difficult to add.
A third field of the "Paradigm" line may name another,
previously defined paradigm from which the current
paradigm inherits forms:
Paradigm [mo1a] [m1a]
In such a case, a new paradigm will be set up, and all the
forms will be inherited from the prior paradigm except those
forms whose features match entries specified for the new
paradigm: in other words, you can override, say, the form
for "[noun num=sg case=dat]" by specifying a new form with
those features. (See the example below.) One may also add
additional entries (with new features) in inherited para-
digms.
A sample set of paradigms (for Russian) is given below:
Paradigm [m1a]
Suffix [++] [noun num=sg case=nom]
Suffix [++]a [noun num=sg case=gen]
Suffix [++]e [noun num=sg case=prep]
Suffix [++]u [noun num=sg case=dat]
Suffix [++]om [noun num=sg case=instr]
Suffix [++]y [noun num=pl case=nom]
Suffix [++]ov [noun num=pl case=gen]
Suffix [++]ax [noun num=pl case=prep]
Suffix [++]am [noun num=pl case=dat]
Suffix [++]ami [noun num=pl case=instr]
Paradigm [mo1a] [m1a]
Paradigm [m1e] [m1a]
Suffix [++]"ov [noun num=pl case=gen]
Suffix [++]"ax [noun num=pl case=prep]
Suffix [++]"am [noun num=pl case=dat]
Suffix [++]"ami [noun num=pl case=instr]
Note that "[mo1a]" inherits all of "[m1a]", whereas "[m1e]"
inherits all except the genitive, prepositional, dative and
instrumental plurals.
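The inheritance-with-override behavior can be sketched as a dictionary keyed on features, where a child paradigm copies the parent and then overwrites any matching entries. This is an illustration only (build_paradigm is an invented name, and the Russian paradigms are abbreviated to two forms each):

```python
def build_paradigm(entries, parent=None):
    """entries: list of (features, affix) pairs. Entries whose features
    match an inherited form override it; new features are added."""
    forms = dict(parent) if parent else {}
    forms.update(entries)
    return forms

m1a = build_paradigm([
    ("[noun num=pl case=gen]", "[++]ov"),
    ("[noun num=pl case=dat]", "[++]am"),
])
m1e = build_paradigm([
    ("[noun num=pl case=gen]", '[++]"ov'),
], parent=m1a)
```

Here [m1e] overrides the genitive plural but inherits the dative plural unchanged, just as in the sample above.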
See lextools(1) for some advice on how to link the paradigm
labels to individual lexical entries in the lexicon file
argument to lexparadigm.
Context-Free Rewrite Rules
The input to lexcfcompile is a set of expressions of the
following form:
NONTERMINAL -> regexp
The "->" must be literally present. "NONTERMINAL" can actu-
ally be a regular expression over nonterminal symbols,
though the only useful regular expressions in this case are
unions of single symbols. The "regexp" can in principle be
any regular expression specifying a language (i.e., not a
relation) containing a mixture of terminals and non-
terminals. However, while lexcfcompile imposes no restric-
tions on what you put in the rule, the algorithm implemented
in GRMCfCompile, which lexcfcompile uses, can only handle
certain kinds of context-free grammars. The user is
strongly advised to read and understand the description in
grm(1) to understand the restrictions on the kinds of
context-free grammars that can be handled.
By default the start symbol is assumed to be the first non-
terminal mentioned in the grammar; see lextools(1) for fur-
ther details.
The following grammar implements the toy English morphology
example we saw above under Arclists, this time putting
brackets around the constituents (and this time without the
mapping from "abil" to "able"):
[NOUN] -> \[ ( \[ [ADJ] \] | \[ [NEGADJ] \] ) ity \]
[NOUN] -> \[ [ADJ] \] | \[ [NEGADJ] \]
[NEGADJ] -> un \[ [ADJ] \]
[ADJ] -> grammatical | able
Context-Dependent Rewrite Rules
A context-dependent rewrite rule file consists of specifica-
tions of one of the following two forms:
phi -> psi / lambda __ rho
phi => psi / lambda __ rho
In each case "phi", "psi", "lambda" and "rho" are regular
expressions specifying languages (acceptors). All but "psi"
must be unweighted (a requirement of the underlying GRMCd-
Compile; see grm(1), grm(3)). The connectors "->", "=>", and
"/" must literally occur as such. The underbar separating
"lambda" and "rho" can be any number of consecutive under-
bars. The interpretation of all such rules is that "phi" is
changed to "psi" in the context "lambda" on the left and
"rho" on the right.
The difference between the two productions, "->" and "=>",
is the following. "->" denotes a mapping where any element of
"phi" can be mapped to any element of "psi". With "=>", the
inputs and outputs are matched according to their order in
the symbol file: this is most useful with single (super-
class) symbol to single (superclass) symbol replacements.
For example, suppose you have the following entries in your
symbol file:
V a e i o u
+voiced b d g
-voiced p t k
The rule:
[-voiced] -> [+voiced] / V __ V
will replace any symbol in {p,t,k} with any symbol in
{b,d,g} between two vowels. Probably what you want in this
case is the following:
[-voiced] => [+voiced] / V __ V
This will replace "p" with "b", "t" with "d" and "k" with
"g". The matching is done according to the order of the sym-
bols in the symbol file. If you had specified instead:
+voiced b g d
-voiced p t k
then "t" would be replaced with "g" and "k" with "d". Simi-
larly with
+voiced b d g
-voiced p t k x
"p", "t" and "k" would be replaced as in the first case, but
"x" would be ignored since there is nothing to match it to:
nothing will happen to "x" intervocalically. Use of the
matching rule specification "=>" thus requires some care in
labelset management.
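The positional matching that "=>" performs amounts to zipping the two symbol lists in symbol-file order, dropping anything unmatched. A minimal sketch (matched_mapping is an invented name):

```python
def matched_mapping(phi_symbols, psi_symbols):
    """Pair input and output symbols by position; symbols with no
    counterpart (here "x") are simply left out of the mapping."""
    return dict(zip(phi_symbols, psi_symbols))

mapping = matched_mapping(["p", "t", "k", "x"], ["b", "d", "g"])
```

With the extra "x" in the input class, "p", "t" and "k" are mapped as before and "x" is ignored, as described above.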
Beginning of string and end of string can be specified as
"[<bos>]" and "[<eos>]", respectively: these are added to
the label set by default if you use the -L flag to lexmake-
lab.
A line may also consist of one of the following specifica-
tions (which are case insensitive):
left-to-right
right-to-left
simultaneous
optional
obligatory
The first three set the direction of the rule application;
the last two set whether the rule application is obligatory
or optional; see grm(1). All specifications are in effect
until the next specification or until the end of the file.
The default setting is obligatory, left-to-right. In
practice the user will rarely need to fiddle with these
default settings.
Replacements
A replacement specification (for lexreplace) is a file con-
sisting of lines like the following:
foo.fst a|b|c|d
The first column specifies a single fsm that must exist in
the named file. The remainder of the line specifies a union
of labels to be replaced in the topology fsm argument to
lexreplace with said fsm. A substitution specified for a
given label overrides any previous substitution for that
label. In the following case:
foo.fst a|b|c|d
bar.fst a
foo.fst will be substituted for "b", "c" and "d", and
bar.fst for "a".
See also grmreplace in grm(1).
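The override rule amounts to a last-writer-wins table from labels to fsm files. A minimal sketch (replacement_map is an invented name):

```python
def replacement_map(lines):
    """Map each label to the fsm file that will replace it;
    a later line overrides an earlier one for any label it mentions."""
    table = {}
    for line in lines:
        fstfile, labels = line.split()
        for label in labels.split("|"):
            table[label] = fstfile
    return table

table = replacement_map(["foo.fst a|b|c|d", "bar.fst a"])
```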
Currency Expressions
Currency specification files contain lines of the following
form:
sym major-expr point minor-expr large-number
Each entry is a regular expression. See lextools(1) for a
description of the function of the entries.
Note that the whitespace separator MUST BE A TAB: the regu-
lar expressions themselves may contain non-tab whitespace.
There must therefore be four tabs. You must still have tabs
separating entries even if you split an entry across multi-
ple lines (with `\').
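The tab-separation requirement can be checked mechanically: splitting on tabs must yield exactly five fields, each of which may itself contain non-tab whitespace. A sketch (parse_currency_line and the field names in the returned dictionary are invented for illustration):

```python
def parse_currency_line(line):
    """Split a currency line into its five tab-separated regular
    expressions, rejecting lines with the wrong number of tabs."""
    fields = line.split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d"
                         % len(fields))
    return dict(zip(["sym", "major", "point", "minor", "large"], fields))
```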
SEE ALSO
lextools(1) lextools user functions.
fsmintro(1) Introduction to the FSM
programs and library.
grmintro(1) Introduction to the GRM
programs and library.
FILES
/usr/local/bin/lextools-3 Distribution Lextools
binaries.
/usr/local/lib/lextools-3/grammars Sample Grammars
AUTHOR
Richard Sproat (rws@research.att.com)