NAME
Lextools symbol files, regular expression syntax, general
file formats, grammar formats - lextools file formats
DESCRIPTION
Symbol Files
The lextools symbol file specifies the label set of an
application. Labels come in three basic flavors:
Basic labels
Superclass labels
Category labels
Basic labels are automatically assigned a unique integer
representation (excluding 0, which is reserved for
"<epsilon>"), and this information is compiled by lexmakelab
into the basic label file, which is in the format specified
in fsm(5).
Superclass labels are collections of basic labels: a super-
class inherits all of the integral values of all of the
basic labels (or superclass labels) that it contains.
Category labels are described below.
The lines in a symbol file look as follows:
superclass1 basic1 basic2 basic3
You may repeat the same superclass label on multiple (possi-
bly non-adjacent) lines: whatever is specified in the new
line just gets added to that superclass. The "basic" labels
can also be superclass labels: in that case, the superclass
in the first column recursively inherits all of the labels
from these superclass labels.
The one exception to this line format is a category specifi-
cation, which is written as follows:
Category: catname feat1 feat2 feat3
The literal "Category:" must be the first entry: it is case
insensitive. "catname" should be a new name. The remaining
entries ("feat1" and so forth) must be previously defined
superclass labels: each names a feature, and the basic
labels it contains are that feature's possible values.
The following sample symbol file serves to illustrate:
dletters a b c d e f g h i j k l m n o p
dletters q r s t u v w x y z
uletters A B C D E F G H I J K L M N O P
uletters Q R S T U V W X Y Z
letters dletters uletters
gender masc fem
case nom acc gen dat
number sg pl
person 1st 2nd 3rd
Category: noun gender case number
Category: verb number person
For this symbol set, the superclass "dletters" will contain
the labels for all the lower-case letters, "uletters" the
upper-case letters, and "letters" both. Defined categories
are "noun" and "verb". "noun", for instance, has the fea-
tures "gender", "case" and "number". The feature "gender"
can have the values "masc" and "fem".
(NB: this way of representing features is inherited in con-
cept from Lextools 2.0.)
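For concreteness, the superclass expansion described above can be sketched as follows. This is a hypothetical illustration, not the lexmakelab implementation; the function name expand_symbols is invented:

```python
def expand_symbols(lines):
    """Parse symbol-file lines; return {superclass: set of basic labels}.

    A name that never appears in the first column is a basic label
    and denotes itself; superclasses inherit recursively.
    """
    defs = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].lower() == "category:":
            continue  # category lines are handled separately
        defs.setdefault(fields[0], []).extend(fields[1:])

    def basics(name, seen=()):
        if name not in defs:          # a basic label denotes itself
            return {name}
        result = set()
        for member in defs[name]:
            if member not in seen:    # guard against accidental cycles
                result |= basics(member, seen + (name,))
        return result

    return {name: basics(name) for name in defs}

classes = expand_symbols([
    "dletters a b c",
    "uletters A B C",
    "letters dletters uletters",
])
```

Repeated superclass lines simply extend the same entry, mirroring the rule that a repeated superclass label accumulates members.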
Some caveats:
You should not use the reserved terms "<epsilon>" or
"<sigma>" in a symbol file. If you pass the -L flag to lex-
makelab you should also not use any of the special symbols
that it introduces with that flag (see lextools(1)).
Symbol files cannot contain comments. Backslash continuation
syntax does not work.
Regular Expressions
Regular expressions consist of strings, possibly with speci-
fied costs, conjoined with one or more operators.
strings are constructed out of basic and superclass labels.
Labels themselves may be constructed out of one or more
characters. Characters are defined as follows:
If 2-byte characters are specified (Chinese, Japanese,
Korean...), a character can be a pair of bytes if the
first byte of the pair has the high bit set.
In all other conditions a character is a single byte.
Multicharacter tokens MUST BE delimited in strings by a left
bracket (default: "[") and right bracket (default: "]").
This includes special tokens "<epsilon>", "<sigma>", "<bos>"
and "<eos>". This may seem inconvenient, but the regular
expression compiler must have some way to figure out what a
token is. Whitespace is inconvenient since if you have a
long string made up of single-character tokens, you don't
want to be putting spaces between every character: trust me.
You may also use brackets to delimit single-character tokens
if you wish.
Some well-formed strings are given below:
abc[foo]
[<epsilon>]ab[<sigma>]
Note that the latter uses the superclass "<sigma>" (con-
structed by lexmakelab to include all symbols of the alpha-
bet except "<epsilon>"): in order to compile this expres-
sion, the superclass file must have been specified using the
-S flag.
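The bracket-delimiting rule can be sketched as follows (single-byte characters only; the real compiler also handles the 2-byte case described above, and tokenize is an invented name):

```python
def tokenize(s, lb="[", rb="]"):
    """Split a lextools string into tokens: a bracketed span is one
    multicharacter token; everything else is one token per character."""
    tokens, i = [], 0
    while i < len(s):
        if s[i] == lb:
            j = s.index(rb, i)        # find the closing bracket
            tokens.append(s[i + 1:j])
            i = j + 1
        else:
            tokens.append(s[i])
            i += 1
    return tokens
```

Applied to the examples above, "abc[foo]" yields the four tokens a, b, c, foo.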
If features are specified in the label set, then one can
specify the features in strings in a linguistically appeal-
ing way as follows:
food[noun gender=fem number=sg case=nom]
Order of the feature specifications does not matter: the
order is determined by the order of the symbols in the sym-
bol file. Thus the following is equivalent to the above:
food[noun case=nom number=sg gender=fem]
The internal representation of such feature specifications
looks as follows: "food[_noun][nom][sg][fem]".
Unspecified features will have all legal values filled in.
Thus
food[noun case=nom number=sg]
will produce a lattice with both [fem] and [masc] as alter-
natives. Inappropriate feature values will cause a warning
during the compilation process. Since features use super-
classes, again, in order to compile such expressions, the
superclass file must have been specified using the -S flag.
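The fill-in behavior for unspecified features can be sketched as follows. This is an illustration of the idea only, using the sample symbol file above; the names expand, CATEGORIES and FEATURES are invented, and the ordering of the output is illustrative rather than the exact internal ordering:

```python
import itertools

CATEGORIES = {"noun": ["gender", "case", "number"]}
FEATURES = {"gender": ["masc", "fem"],
            "case": ["nom", "acc", "gen", "dat"],
            "number": ["sg", "pl"]}

def expand(cat, **given):
    """Return all full feature vectors for cat consistent with the
    given feature=value pairs; unspecified features range over all
    of their legal values."""
    choices = [[given[f]] if f in given else FEATURES[f]
               for f in CATEGORIES[cat]]
    return [("_" + cat,) + combo for combo in itertools.product(*choices)]
```

With case and number fixed but gender unspecified, both the masc and fem vectors come out as alternatives, as described above.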
Costs can be specified anywhere in strings. They are speci-
fied by a positive or negative floating point number within
angle brackets. The current version of lextools assumes the
tropical semiring, so costs are accumulated across strings
by summing. Thus the following two strings have the same
cost:
abc[foo]<3.0>
a<-1.0>b<2.0>c<1.0>[foo]<0.5><0.5>
Note that a cost on its own -- i.e. without an accompanying
string -- specifies a machine with a single state, no arcs,
and an exit cost equal to the cost specified.
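Since the tropical semiring accumulates costs by ordinary addition, the total cost of a string is just the sum of its bracketed costs. A minimal sketch (total_cost is an invented name):

```python
import re

def total_cost(s):
    """Sum every <float> cost annotation in a lextools string.
    Special tokens such as [<epsilon>] are not numeric and are ignored."""
    return sum(float(c) for c in re.findall(r"<(-?\d+(?:\.\d+)?)>", s))
```

Both example strings above sum to 3.0.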
Regular expressions can be constructed as follows. First of
all a string is a regular expression. Next, a regular
expression can be constructed out of one or two other regu-
lar expressions using the following operators:
regexp1* Kleene star
regexp1+ Kleene plus
regexp1^n power
regexp1? optional
!regexp1 negation
regexp1 | regexp2 union
regexp1 & regexp2 intersection
regexp1 : regexp2 cross product
regexp1 @ regexp2 composition
regexp1 - regexp2 difference
In general the same restrictions on these operations apply
as specified in fsm(1). For example, the second argument to
"-" (difference) must be an unweighted acceptor. Note also
that the two arguments to ":" (cross product) must be
acceptors. The argument n to "^" must be a positive integer. The
arguments to "@" (composition) are assumed to be transduc-
ers.
The algorithm for parsing regular expressions finds the
longest stretch that is a string, and takes that to be the
(first) argument of the unary or binary operator immediately
to the left, or the second argument of the binary operator
immediately to the right. Thus "abcd | efgh" represents the
union of "abcd" and "efgh" (which is reminiscent of Unix
regular expression syntax) and "abcd*" represents the tran-
sitive closure of "abcd" (i.e., not "abc" followed by the
transitive closure of "d", which is what one would expect on
the basis of Unix regular expression syntax).
The precedence of the operators is as follows (from lowest
to highest), the last parenthesized group being of equal
precedence:
| & - : (* + ? ^)
But this is hard to remember, and in the case of multiple
operators, it may be complex to figure out which elements
get grouped first. The use of parentheses is highly recom-
mended: use parentheses to disambiguate "!(abc | def)" from
"(!abc) | def".
Spaces are never significant in regular expressions.
Escapes, Comment Syntax and Miscellaneous other
Comments can appear in input files, with the exception of
symbol files. Comments are preceded by "#" and continue to
the end of the line.
You can split a line, or a regular expression within a line,
across multiple lines by placing a backslash (`\') at the
end of the line, immediately before the newline character.
Special characters, including the comment character, can be
escaped with `\'. To get a `\', escape it with `\': `\\'.
Characters that should be escaped are: `\', `*', `^', `?',
`!', `|', `&', `:', `@', `-', `[', `]', `<' and `>'.
Lexicons
The input to lexcomplex is simply a list of regular expres-
sions. The default interpretation is that these expressions
are to be unioned together, but other interpretations are
possible: see lextools(1) for details.
If any of the regular expressions denotes a relation (i.e.,
a transducer) the resulting union also denotes a relation,
otherwise it denotes a language (i.e., an acceptor).
Arclists
An arclist (borrowing a term from Tzoukermann and Liberman's
1990 work on Spanish morphology) is a simple way to specify
a finite-state morphological grammar. Lines can be of one of
the following three formats:
instate outstate regexp
finalstate
finalstate cost
Note that cost here should be a floating point number not
enclosed in angle brackets. State names need not be enclosed
in square brackets: they are not regular expressions.
The following example, for instance, specifies a toy grammar
for English morphology that handles the words, "grammati-
cal", "able", "grammaticality", "ability" (mapping it to
"able + ity"), and their derivatives in "un-":
START ROOT un [++] | [<epsilon>]
ROOT FINAL grammatical | able
ROOT SUFFIX grammatical
ROOT SUFFIX abil : able
SUFFIX FINAL [++] ity
FINAL 1.0
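A reading of the three line formats can be sketched as follows. This is a hypothetical illustration (parse_arclist is an invented name, and the regexp field is left uncompiled), not the lexarclist implementation:

```python
def parse_arclist(lines):
    """Split arclist lines into arcs and final states."""
    arcs, finals = [], {}
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) == 3:                      # instate outstate regexp
            arcs.append(tuple(fields))
        elif len(fields) == 2:                    # finalstate cost
            finals[fields[0]] = float(fields[1])
        elif len(fields) == 1:                    # finalstate, zero cost
            finals[fields[0]] = 0.0
    return arcs, finals

arcs, finals = parse_arclist([
    "START ROOT un [++] | [<epsilon>]",
    "ROOT FINAL grammatical | able",
    "ROOT SUFFIX grammatical",
    "ROOT SUFFIX abil : able",
    "SUFFIX FINAL [++] ity",
    "FINAL 1.0",
])
```

Note that everything after the second whitespace-delimited field is the regular expression, which may itself contain spaces.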
Paradigms
The paradigm file specifies a set of morphological para-
digms, specified as follows.
Each morphological paradigm is introduced by the word "Para-
digm" (case insensitive), followed by a bracketed name for
the paradigm
Paradigm [m1a]
Following this are specifications of one of the following
forms:
Suffix suffix features
Prefix prefix features
Circumfix circumfix features
The literals "Suffix", "Prefix" and "Circumfix" are matched
case-insensitively. The remaining two fields are regular
expressions describing the phonological (or orthographic)
material in the affix, and the features. The "Circumfix"
specification has a special form, namely "regexp...regexp".
The three adjacent dots, which must be present, indicate the
location of the stem inside the circumfix. In all cases,
features are placed at the end of the morphologically com-
plex form. There is no provided mechanism for infixes,
though that would not be difficult to add.
A third field of the "Paradigm" line may name another,
previously defined paradigm from which the current
paradigm inherits forms:
Paradigm [mo1a] [m1a]
In such a case, a new paradigm will be set up, and all the
forms will be inherited from the prior paradigm except those
forms whose features match entries specified for the new
paradigm: in other words, you can override, say, the form
for "[noun num=sg case=dat]" by specifying a new form with
those features. (See the example below.) One may also add
additional entries (with new features) in inherited para-
digms.
A sample set of paradigms (for Russian) is given below:
Paradigm [m1a]
Suffix [++] [noun num=sg case=nom]
Suffix [++]a [noun num=sg case=gen]
Suffix [++]e [noun num=sg case=prep]
Suffix [++]u [noun num=sg case=dat]
Suffix [++]om [noun num=sg case=instr]
Suffix [++]y [noun num=pl case=nom]
Suffix [++]ov [noun num=pl case=gen]
Suffix [++]ax [noun num=pl case=prep]
Suffix [++]am [noun num=pl case=dat]
Suffix [++]ami [noun num=pl case=instr]
Paradigm [mo1a] [m1a]
Paradigm [m1e] [m1a]
Suffix [++]"ov [noun num=pl case=gen]
Suffix [++]"ax [noun num=pl case=prep]
Suffix [++]"am [noun num=pl case=dat]
Suffix [++]"ami [noun num=pl case=instr]
Note that "[mo1a]" inherits all of "[m1a]", whereas "[m1e]"
inherits all except the genitive, prepositional, dative and
instrumental plurals.
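The inheritance-with-override behavior can be sketched as a dictionary keyed on features, where a child paradigm copies the parent and then overwrites any matching entries. This is an illustration only (build_paradigm is an invented name, and the Russian paradigms are abbreviated to two forms each):

```python
def build_paradigm(entries, parent=None):
    """entries: list of (features, affix) pairs. Entries whose features
    match an inherited form override it; new features are added."""
    forms = dict(parent) if parent else {}
    forms.update(entries)
    return forms

m1a = build_paradigm([
    ("[noun num=pl case=gen]", "[++]ov"),
    ("[noun num=pl case=dat]", "[++]am"),
])
m1e = build_paradigm([
    ("[noun num=pl case=gen]", '[++]"ov'),
], parent=m1a)
```

Here [m1e] overrides the genitive plural but inherits the dative plural unchanged, just as in the sample above.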
See lextools(1) for some advice on how to link the paradigm
labels to individual lexical entries in the lexicon file
argument to lexparadigm.
Context-Free Rewrite Rules
The input to lexcfcompile is a set of expressions of the
following form:
NONTERMINAL -> regexp
The "->" must be literally present. "NONTERMINAL" can actu-
ally be a regular expression over nonterminal symbols,
though the only useful regular expressions in this case are
unions of single symbols. The "regexp" can in principle be
any regular expression specifying a language (i.e., not a
relation) containing a mixture of terminals and non-
terminals. However, while lexcfcompile imposes no restric-
tions on what you put in the rule, the algorithm implemented
in GRMCfCompile, which lexcfcompile uses, can only handle
certain kinds of context-free grammars. The user is
strongly advised to read and understand the description in
grm(1) to understand the restrictions on the kinds of
context-free grammars that can be handled.
By default the start symbol is assumed to be the first non-
terminal mentioned in the grammar; see lextools(1) for fur-
ther details.
The following grammar implements the toy English morphology
example we saw above under Arclists, this time putting
brackets around the constituents (and this time without the
mapping from "abil" to "able"):
[NOUN] -> \[ ( \[ [ADJ] \] | \[ [NEGADJ] \] ) ity \]
[NOUN] -> \[ [ADJ] \] | \[ [NEGADJ] \]
[NEGADJ] -> un \[ [ADJ] \]
[ADJ] -> grammatical | able
Context-Dependent Rewrite Rules
A context-dependent rewrite rule file consists of specifica-
tions of one of the following two forms:
phi -> psi / lambda __ rho
phi => psi / lambda __ rho
In each case "phi", "psi", "lambda" and "rho" are regular
expressions specifying languages (acceptors). All but "psi"
must be unweighted (a requirement of the underlying GRMCd-
Compile; see grm(1), grm(3)). The connectors "->", "=>", and
"/" must literally occur as such. The underbar separating
"lambda" and "rho" can be any number of consecutive under-
bars. The interpretation of all such rules is that "phi" is
changed to "psi" in the context "lambda" on the left and
"rho" on the right.
The difference between the two productions, "->" and "=>",
is the following. "->" denotes a mapping where any element of
"phi" can be mapped to any element of "psi". With "=>", the
inputs and outputs are matched according to their order in
the symbol file: this is most useful with single (super-
class) symbol to single (superclass) symbol replacements.
For example, suppose you have the following entries in your
symbol file:
V a e i o u
+voiced b d g
-voiced p t k
The rule:
[-voiced] -> [+voiced] / V __ V
will replace any symbol in {p,t,k} with any symbol in
{b,d,g} between two vowels. Probably what you want in this
case is the following:
[-voiced] => [+voiced] / V __ V
This will replace "p" with "b", "t" with "d" and "k" with
"g". The matching is done according to the order of the sym-
bols in the symbol file. If you had specified instead:
+voiced b g d
-voiced p t k
then "t" would be replaced with "g" and "k" with "d". Simi-
larly with
+voiced b d g
-voiced p t k x
"p", "t" and "k" would be replaced as in the first case, but
"x" would be ignored since there is nothing to match it to:
nothing will happen to "x" intervocalically. Use of the
matching rule specification "=>" thus requires some care in
labelset management.
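The positional matching that "=>" performs amounts to zipping the two symbol lists in symbol-file order, dropping anything unmatched. A minimal sketch (matched_mapping is an invented name):

```python
def matched_mapping(phi_symbols, psi_symbols):
    """Pair input and output symbols by position; symbols with no
    counterpart (here "x") are simply left out of the mapping."""
    return dict(zip(phi_symbols, psi_symbols))

mapping = matched_mapping(["p", "t", "k", "x"], ["b", "d", "g"])
```

With the extra "x" in the input class, "p", "t" and "k" are mapped as before and "x" is ignored, as described above.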
Beginning of string and end of string can be specified as
"[<bos>]" and "[<eos>]", respectively: these are added to
the label set by default if you use the -L flag to lexmake-
lab.
A line may also consist of one of the following specifica-
tions (which are case insensitive):
left-to-right
right-to-left
simultaneous
optional
obligatory
The first three set the direction of the rule application;
the last two set whether the rule application is obligatory
or optional; see grm(1). All specifications are in effect
until the next specification or until the end of the file.
The default setting is obligatory, left-to-right. In
practice the user will rarely need to fiddle with these
default settings.
Replacements
A replacement specification (for lexreplace) is a file con-
sisting of lines like the following:
foo.fst a|b|c|d
The first column specifies a single fsm that must exist in
the named file. The remainder of the line specifies a union
of labels to be replaced in the topology fsm argument to
lexreplace with said fsm. A substitution specified for a
given label overrides any previous substitution for that
label. In the following case:
foo.fst a|b|c|d
bar.fst a
foo.fst will be substituted for "b", "c" and "d", and
bar.fst for "a".
See also grmreplace in grm(1).
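The override rule amounts to a last-writer-wins table from labels to fsm files. A minimal sketch (replacement_map is an invented name):

```python
def replacement_map(lines):
    """Map each label to the fsm file that will replace it;
    a later line overrides an earlier one for any label it mentions."""
    table = {}
    for line in lines:
        fstfile, labels = line.split()
        for label in labels.split("|"):
            table[label] = fstfile
    return table

table = replacement_map(["foo.fst a|b|c|d", "bar.fst a"])
```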
Currency Expressions
Currency specification files contain lines of the following
form:
sym major-expr point minor-expr large-number
Each entry is a regular expression. See lextools(1) for a
description of the function of the entries.
Note that the whitespace separator MUST BE A TAB: the regu-
lar expressions themselves may contain non-tab whitespace.
There must therefore be four tabs. You must still have tabs
separating entries even if you split an entry across multi-
ple lines (with `\').
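The tab-separation requirement can be checked mechanically: splitting on tabs must yield exactly five fields, each of which may itself contain non-tab whitespace. A sketch (parse_currency_line and the field names in the returned dictionary are invented for illustration):

```python
def parse_currency_line(line):
    """Split a currency line into its five tab-separated regular
    expressions, rejecting lines with the wrong number of tabs."""
    fields = line.split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d"
                         % len(fields))
    return dict(zip(["sym", "major", "point", "minor", "large"], fields))
```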
SEE ALSO
lextools(1) lextools user functions.
fsmintro(1) Introduction to the FSM
programs and library.
grmintro(1) Introduction to the GRM
programs and library.
FILES
/usr/local/bin/lextools-3 Distribution Lextools
binaries.
/usr/local/lib/lextools-3/grammars Sample Grammars
AUTHOR
Richard Sproat (rws@research.att.com)