Hermit Crab Parsing Engine Specification
Mike Maxwell
24 February 1999
The morpher/lexical lookup module is also referred to as the "morpher module" in this specification. Its function is to analyze each word of the input into a stem plus possible affixes. Conceptually, this is done by applying morphological and (morpho-)phonological rules in analysis order (i.e. the reverse of the order linguists usually think of) until the morpher discovers a string matching the lexical entry of some stem in the user's dictionary. The rules are applied in this reverse order in as many ways as possible to generate all possible analyses of each word. Each lexical entry discovered in this way is then acted on by the rules in synthesis order, to allow the testing of various criteria more conveniently tested when the lexical entry is known. (The algorithm assumed here is then a generate-and-test algorithm.) The output is the set of analyses, in the form of lexical entries for the input word.
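The generate-and-test flow described above can be illustrated with a deliberately tiny sketch. Everything in it is a hypothetical stand-in (the LEXICON, the three suffix rules, and all function names are assumptions made for illustration), and it works on spelling rather than on phonetic feature matrices, so it should not be read as Hermit Crab's actual implementation.

```python
# Toy generate-and-test morpher. All names and data are illustrative assumptions.

# Real lexical entries: stem -> part of speech.
LEXICON = {"walk": "V", "seed": "N", "see": "V"}

# Each suffix rule: (name, suffix, required input POS, output POS).
RULES = [
    ("PAST", "ed", "V", "V"),
    ("PLURAL", "s", "N", "N"),
    ("AGENT", "er", "V", "N"),
]


def unapply(form):
    """Analysis direction: strip any suffix that could have been attached.

    Conditions such as part of speech cannot be checked yet, because the
    lexical entry is not known at this point.
    """
    for rule in RULES:
        suffix = rule[1]
        if form.endswith(suffix) and len(form) > len(suffix):
            yield rule, form[: -len(suffix)]


def candidates(form, applied=()):
    """Generate every (possible stem, rules in synthesis order) pair."""
    yield form, applied
    for rule, shorter in unapply(form):
        # A rule undone later in analysis applies earlier in synthesis.
        yield from candidates(shorter, (rule,) + applied)


def synthesize(stem, pos, rules):
    """Synthesis direction: re-apply the rules, now testing POS requirements."""
    form = stem
    for _name, suffix, need_pos, out_pos in rules:
        if pos != need_pos:
            return None
        form, pos = form + suffix, out_pos
    return form, pos


def analyze(word):
    """Return (stem, rule names, resulting POS) for every analysis that survives."""
    analyses = []
    for stem, rules in candidates(word):
        if stem in LEXICON:
            result = synthesize(stem, LEXICON[stem], rules)
            if result is not None and result[0] == word:
                analyses.append((stem, [r[0] for r in rules], result[1]))
    return analyses
```

In this sketch, analyzing "walks" strips the plural suffix and finds the stem "walk", but the synthesis pass rejects the analysis because the plural rule requires a noun. That is the kind of criterion the text describes as more conveniently tested once the lexical entry is known.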
The user is free to provide lexical entries for roots, stems, or partially or completely inflected/derived words. Because of this freedom on the part of the user to provide both inflected and uninflected lexical entries, the lexical entries into which the morpher module analyzes input words are of one of two types: real entries, and virtual entries. A real lexical entry is one which the user has listed in the dictionary, while a virtual entry is one which the morpher has constructed from a dictionary entry plus one or more affixes.
The dictionary is then the repository of all real (as opposed to virtual) lexical entries. Since the dictionary is potentially very large, it may not be stored in the lexical module itself, but may be a separate module (perhaps a database program).
Regardless of whether the dictionary is actually internal to the morpher or not, the morpher may handle access to the lexical entries of the dictionary. That is, the morpher may serve as the front end to the dictionary. Dictionary commands are therefore listed together with other morpher commands in the following specification.
This section describes the linguistic characteristics of the morpher module in general terms. Succeeding sections provide a more rigorous definition of these capabilities.
Morphological and phonological rules are discussed in this specification from the viewpoint of the linguist. That is, the "input" and "output" of rules are seen from the viewpoint of the generation of surface forms from underlying forms. (However, the term "input to the morpher module" refers to the unanalyzed tokens read in by the morpher, while the term "output of the morpher" refers to the lexical entries written out by the morpher.)
The morpher may be used to model either an Item-and-Process theory or an Item-and-Arrangement theory.
The user may define various strata of rule application, where a "stratum" of rules refers to a set of rules which apply in a block, before or after the application of rules of other strata.
A morphological rule applies in just one stratum, while a phonological rule may apply in more than one stratum. Which stratum (or strata) a given rule applies in is designated by the user, as is the order of application of the various strata.
Linguistic theories may vary in the number of strata they assume. A structuralist theory, for instance, might have a stratum of allophonic phonological rules and another stratum of morphological and morphophonemic rules. The theory of The Sound Pattern of English (Chomsky and Halle 1968, henceforth SPE), on the other hand, assumes that morphological and phonological rules exist in at least two strata, a cyclic stratum and a postcyclic stratum. (Some generative phonologists would propose a stratum of precyclic rules as well.)
Within each stratum, the user (or the shell) may define several types of rule interaction, including cyclic and non-cyclic application. (Cyclic application, as implemented by Hermit Crab, is not precisely the same as that described in SPE. Under Hermit Crab, each cycle of phonological rules applies immediately after each morphological rule, not after all the morphological rules of the cyclic stratum have applied. If a morphological rule is sensitive to the phonetic form of the word to which it attaches, this leaves open the possibility that a preceding cycle of phonological rules will feed or bleed that morphological rule.) Cyclic phonological rules, in addition to applying as a block after each application of a cyclic morphological rule, are constrained by Kiparsky's Strict Cycle Condition (see below, Cyclic Phonological Rules, 2.3.2).
Within each stratum, morphological rules may be specified as being ordered in a linear fashion, or as being unordered (i.e. as potentially applying whenever their structural description is met). Similarly, phonological rules may be specified as being linearly ordered, as applying whenever their structural description is met, or as applying simultaneously (the latter option being unique to phonological rules). If linear order is specified for morphological and/or phonological rules, the relative ordering of individual rules must be specified.
Finally, subsets of the phonological rules in a given stratum may be specified as applying disjunctively. Within such a set of rules, the order is linear; and as soon as one such rule has applied once, no other rule in the set may apply to the same position in the phonetic shape of the lexical entry (except that in a cyclic stratum, the entire set may be applied again on the next cycle, subject to the Strict Cycle Condition).
Morphological rules analyze the phonetic (or phonological) shape of their input (a lexical entry) into one or more substrings, and output a lexical entry whose phonetic shape is the concatenation of one or more phonological substrings. These output substrings may be copies of the original substrings, copies altered by the modification of designated features, or entirely new sequences of segments and boundary markers. A morphological rule may also change the syntactic feature content, part of speech, etc. of a lexical entry.
Morphological Rules have one or more subrules, which apply disjunctively to a given form: the first subrule to match a given form is the only subrule which can apply. This mechanism can be used to encode variant forms of a rule whose application depends on the phonetic form of their input (e.g. English pluralization), conjugation class membership (e.g. Spanish verb classes), etc.
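As a rough illustration of disjunctive subrules, the sketch below encodes a simplified English plural rule. It uses spelling as a stand-in for phonetic form, and the names PLURAL_SUBRULES and apply_rule are hypothetical; a real rule would match phonetic feature matrices.

```python
import re

# Subrules are tried in order; each pairs a condition on the stem with a suffix.
# Spelling patterns stand in for phonetic conditions (an assumption of this sketch).
PLURAL_SUBRULES = [
    (re.compile(r".*(s|z|x|ch|sh)$"), "es"),  # sibilant-final stems
    (re.compile(r".*"), "s"),                 # elsewhere case
]


def apply_rule(stem, subrules):
    """Apply the first subrule whose condition matches; later subrules are skipped."""
    for pattern, suffix in subrules:
        if pattern.fullmatch(stem):
            return stem + suffix
    return None  # no subrule matched, so the rule does not apply
```

Because the general subrule is listed last, it acts as the elsewhere case: "fox" is caught by the first subrule and never reaches the second.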
Sequences of phonetic segments in a morphological rule are specified in terms of their phonetic features. These sequences are matched against a translation into phonetic features of the string representing the phonetic shape of the rule’s input.
A morphological rule may require or prohibit the presence of Morphological Rule Features or syntactic features; may require that the input belong to a certain part of speech; and may require that the input have certain syntactic subcategorization properties.
All the following affix types can be analyzed by the morpher: prefixes, suffixes, circumfixes, infixes, suprafixes, replacives, reduplication, and null affixes.
However, care should be taken in writing null affixation rules, lest they cause the morpher to loop infinitely. For instance, if a language had a null affixation rule that derived nouns from verbs, and another null affixation rule that derived verbs from nouns, with no further stipulation the morpher could enter an infinite regress of deriving nouns from verbs and vice versa. Such looping can be prevented by the use of assigned and prohibited features.
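The following sketch shows one way an assigned-and-prohibited feature can stop the noun/verb regress. It runs in the synthesis direction only, and the feature name zero_derived, the Entry and NullRule classes, and derive_all are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    shape: str
    pos: str
    mpr: frozenset = frozenset()  # MPR features present on the entry


@dataclass(frozen=True)
class NullRule:
    name: str
    in_pos: str
    out_pos: str
    prohibited: frozenset  # MPR features the input must not bear
    assigned: frozenset    # MPR features added to the output

    def apply(self, entry):
        if entry.pos != self.in_pos or entry.mpr & self.prohibited:
            return None
        return Entry(entry.shape, self.out_pos, entry.mpr | self.assigned)


ZERO = frozenset({"zero_derived"})  # hypothetical feature name
RULES = [
    NullRule("N->V", "N", "V", prohibited=ZERO, assigned=ZERO),
    NullRule("V->N", "V", "N", prohibited=ZERO, assigned=ZERO),
]


def derive_all(entry):
    """Collect every entry derivable from `entry` by the null rules.

    The loop terminates only because each rule prohibits the feature that
    both rules assign; without that, N->V and V->N would feed each other forever.
    """
    results, agenda = [], [entry]
    while agenda:
        current = agenda.pop()
        results.append(current)
        for rule in RULES:
            derived = rule.apply(current)
            if derived is not None:
                agenda.append(derived)
    return results
```

Starting from a noun, this yields the noun and one zero-derived verb, and then stops.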
As discussed in more detail below (see section 2.3.1, Reverse Application of Phonological Rules), when parsing, Hermit Crab operates part of the time in analysis mode, undoing rules, rather than applying rules to an underlying form (as linguists are accustomed to doing). When a rule specifies a change in the value of some feature (e.g. that a particular segment in the stem becomes voiced), Hermit Crab "undoes" this rule by leaving the value for voicing of that segment unspecified. This is because the original (underlying) value of that feature is unknown: the morphological rule may apply in the synthesis direction to one underlying form by changing the feature specification (in this case changing an underlying [– voiced] segment to [+ voiced]), while to another underlying form, the rule may apply vacuously. The original voicing contrast thus becomes neutralized in this context. When features which have become uninstantiated are referred to by another rule, Hermit Crab assumes the rule applies without actually instantiating the features in question to all possible combinations of values.
As mentioned above, a morphological rule may be assigned to any one stratum. Within each stratum, morphological rules may be specified as linearly ordered or as unordered (i.e. as applying whenever their structural description is met).
If the user has specified multiple strata, there is a linear order among those strata, and no morphological rule from an earlier stratum may apply after a rule from a later stratum. "Looping back" (as advocated in Halle and Mohanan 1985) is therefore not provided for. This can lead to problems. For example, a common analysis of English is that rules of the cyclic stratum precede rules of the post-cyclic stratum. For instance, –al is a cyclic suffix (in SPE terms, it is attached with a + boundary), while –ment is a post-cyclic suffix (in terms of the SPE system, it is attached with a # boundary). However, in the word developmental, the –al suffix attaches outside the –ment suffix, an impossibility if there is no looping back. In defense of the non-provision for looping back in Hermit Crab, it should be said that there is no clear answer in morphological theory to such ordering paradoxes. (An ad hoc solution is to regard –mental as a single post-cyclic suffix. Alternatively, development can be listed in the lexicon as a stem.)
A similar ordering paradox occurs across word boundaries in compound words like transformational grammarian—a phrase which refers to a person who studies transformational grammar, not a grammarian who undergoes transformations. Again, theory provides no simple answer, nor does Hermit Crab.
Hermit Crab also lacks any direct provision for orders of affixes (i.e. position classes, such as first order suffixes, second order suffixes, etc.). However, it is possible to enforce an order of affixes in two ways. One is by the use of features. For instance, first order affix rules could assign the pseudo-syntactic feature (level (one)), and second order affix rules could require the presence of the feature (level (one)), assigning the new feature value two to the feature level, resulting in the new feature-name feature-value pair (level (two)). If a third order suffix could attach outside either a second order suffix or a first order suffix, such a rule could require the presence of either a (level (two)) or a (level (one)) feature. More precisely, the rule would have among its Required Syntactic Features the feature (level (one two)), and would assign the feature (level (three)).
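The level-feature scheme can be sketched as follows. The suffix shapes, the AFFIX_RULES table, and the attach function are hypothetical. The sketch also assumes that an entry with no value for level fails a rule's requirement, which matches the "require the presence" wording here but would need an explicit default value under the default unification behavior described later.

```python
# Hypothetical affix rules keyed by position class.
AFFIX_RULES = {
    "order1": {"suffix": "-a", "requires": None, "assigns": ("one",)},
    "order2": {"suffix": "-b", "requires": ("one",), "assigns": ("two",)},
    "order3": {"suffix": "-c", "requires": ("one", "two"), "assigns": ("three",)},
}


def attach(entry, rule_name):
    """Attach an affix if the entry's level value is compatible with the rule.

    A list-valued requirement such as ("one", "two") is met when the entry's
    level shares at least one value with it.
    """
    rule = AFFIX_RULES[rule_name]
    required = rule["requires"]
    level = entry["level"]
    if required is not None:
        if level is None or not set(required) & set(level):
            return None
    # The rule's output value overrides any earlier value of level.
    return {"shape": entry["shape"] + rule["suffix"], "level": rule["assigns"]}
```

With these rules, a third order suffix attaches outside either a first or a second order suffix, while a second order suffix cannot attach to a bare stem or outside a third order suffix.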
A more straightforward way of assigning orders to affixes is by linearly ordering the morphological rules that attach the affixes. A rule attaching a first order suffix would be ordered before the rule attaching a second order suffix, etc.
Phonological rules are of the general form
X → Y / W __ Z
where X, Y, W and Z represent (possibly optional) sequences of phonetic segments (specified as phonetic feature matrices) and/or boundary markers. These sequences cannot include any segments of morphemes which have not yet been attached by a morphological rule, i.e. which are outside of the "current" stem. The application of phonological rules may also be restricted by requiring the presence or absence of specified MPR features and/or part of speech. Syntactic features (i.e. head and foot features) are invisible to phonological rules.
"Phonological rules", as here defined, encompass the allophonic and morpho-phonemic rules of structuralist linguistics, as well as the phonological rules of generative phonology.
Phonetic features are specified in phonological rules by a unique value (such as ‘+’ or ‘-’), or by alpha variables (as used e.g. in Chomsky and Halle 1968).
As mentioned above, phonological rules are allowed to apply at one or more strata. Within each stratum, phonological rules may be specified as being ordered linearly, or as applying simultaneously; disjunctive subsets of rules may also be defined.
One peculiarity of Hermit Crab's application of phonological rules is due to its operation in analysis mode. Rather than starting with an underlying form and applying phonological and morphological rules to unambiguously synthesize a surface form, Hermit Crab begins with a known surface form, and attempts to analyze it into one or more underlying forms. As with morphological rules, phonological rules are neutralizing in the synthesis mode: they assign a value to a feature regardless of what the previous value (if any) of that feature may have been. In undoing the application of such a rule, i.e. applying it in an analysis manner, one doesn't know what feature values the affected segment had prior to the rule's application, at least until lexical lookup. The rule may have applied by changing the feature values in question, but it is equally possible the rule applied vacuously. When un-applying a phonological or morphological rule, Hermit Crab un-instantiates such features. When another rule needs to know the value of an un-instantiated feature, the morpher simply assumes that uninstantiated features have the required value. This may result in over-application of phonological rules, but any incorrect application will be discovered when the rules are applied in synthesis mode to looked-up lexical entries.
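A minimal sketch of un-instantiation, assuming a segment is represented as a dictionary from feature names to "+", "-", or None (where None means un-instantiated). The representation and both function names are assumptions of this sketch, not the data structures defined elsewhere in this specification.

```python
def unapply_feature_change(segment, changed):
    """Undo a rule whose output sets the features in `changed`.

    The surface segment must be compatible with the rule's output. The
    earlier values of the changed features are unknown, so they become None.
    """
    if any(segment.get(f) not in (v, None) for f, v in changed.items()):
        return None  # the rule could not have produced this segment
    undone = dict(segment)
    for feature in changed:
        undone[feature] = None
    return undone


def meets(segment, required):
    """Analysis-mode matching: an un-instantiated feature is assumed to
    have whatever value the rule requires."""
    return all(segment.get(f) in (v, None) for f, v in required.items())
```

After a voicing rule is undone on a surface [+voiced] segment, a deeper rule that requires [-voiced] is still taken to match. Any resulting over-application is left for the synthesis pass to catch.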
At any rate, a deeper rule will require the value of a feature which was uninstantiated by a shallower rule only in the case of opaque rule orderings, i.e. when one rule counterbleeds or counterfeeds another rule. For instance, consider two rules A → B / C __ D and C → E / F __ G, applying in that (synthesis) order, and suppose that there are at least some forms in which both rules will apply. When Hermit Crab undoes the application of the second (shallower) rule, it will uninstantiate the features of C. But when Hermit Crab comes to the first rule, the values of those features required by that rule are unknown, so Hermit Crab simply assumes that the rule applies. (For further background, see Maxwell 1991.)
The situation is more complex in the case of length-changing rules (e.g. deletion rules, epenthesis rules, and diphthongization rules). Due to the internal representation used by Hermit Crab, the morpher will need to explore two search paths: one for which the length-changing rule is unapplied, and one for which it is not. This is true regardless of the interaction of the length-changing rule with other rules.
When doing lexical lookup of a partially instantiated phonetic feature matrix, Hermit Crab looks for all lexical forms which would match the feature matrix, ignoring unspecified feature values. For instance, suppose a language has both front rounded and front unrounded vowels, and Hermit Crab has undone a rule which rounds vowels in some environment. Then given that the rule has been "un-applied" to a form with surface [ü], the morpher will create a high front vowel with unknown rounding, and attempt to find a lexical entry with either an [i] or an [ü] in that position. (If the phonological analysis which Hermit Crab were modeling in this example used archiphonemes, the morpher would also attempt to find a lexical entry with the appropriate archiphoneme in that position.)
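The lookup behavior in the rounding example can be sketched as below. The segment inventory, its feature values, the three-word lexicon, and the function names are all invented for illustration; words are stored as tuples of segment symbols.

```python
# Hypothetical segment inventory: symbol -> fully specified feature matrix.
SEGMENTS = {
    "i": {"cons": "-", "high": "+", "back": "-", "round": "-"},
    "ü": {"cons": "-", "high": "+", "back": "-", "round": "+"},
    "u": {"cons": "-", "high": "+", "back": "+", "round": "+"},
    "t": {"cons": "+", "son": "-"},
    "r": {"cons": "+", "son": "+"},
}

# Hypothetical lexicon of real entries.
LEXICON = [("t", "i", "r"), ("t", "ü", "r"), ("t", "u", "r")]


def seg_matches(pattern, segment):
    """A feature whose value is None (un-instantiated) in the pattern is ignored."""
    return all(v is None or segment.get(f) == v for f, v in pattern.items())


def lookup(pattern_seq, lexicon):
    """Return every lexical form compatible with the partially specified matrices."""
    hits = []
    for word in lexicon:
        if len(word) == len(pattern_seq) and all(
            seg_matches(p, SEGMENTS[sym]) for p, sym in zip(pattern_seq, word)
        ):
            hits.append(word)
    return hits


# Surface t-ü-r after the rounding rule has been un-applied to the vowel.
pattern = [SEGMENTS["t"], {**SEGMENTS["ü"], "round": None}, SEGMENTS["r"]]
```

Calling lookup(pattern, LEXICON) finds the entries with [i] and with [ü] but not the one with [u], since backness is still instantiated in the pattern.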
Cyclic phonological rules apply once at the beginning of a cyclic stratum, and once after each application of a cyclic morphological rule; they are ordered among themselves as specified by the user. The morpher further constrains the application of cyclic phonological rules on all but the first cycle by Kiparsky's (1982) Strict Cycle Condition, given below:
Cyclic phonological rules apply only to derived representations.
A representation X is derived with respect to phonological rule R in cycle j iff X meets the structural analysis of R by virtue of a combination of morphemes introduced in cycle j or by virtue of the application of a previous cyclic phonological rule in cycle j (even if that application was vacuous).
Non-cyclic phonological rules are applied as a block after any applicable morphological rules of the same stratum have applied. Their order among themselves may be specified by the user.
Theories differ as to the number and kind of boundary markers they countenance. Hermit Crab makes no commitment to any of these theories, save that there is no provision for treating boundary markers as segments with features (as in Chomsky and Halle 1968).
Boundary markers are inserted as strings (not phonetic feature matrices).
Boundary markers in the phonetic shape of a lexical entry are ignored when matching that lexical entry against a phonological rule, unless the rule explicitly requires the boundary.
Boundary markers are erased at the end of each cycle and stratum.
Both morphological rules and phonological rules may insert boundary markers. However, the use of phonological rules to insert or alter boundary markers (i.e. readjustment rules) is discouraged, as it may lead to computational intractability.
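The treatment of boundary markers during matching can be sketched on plain strings. This assumes single-character boundary symbols and literal patterns, both simplifications; BOUNDARIES, find_matches, and erase_boundaries are hypothetical names.

```python
BOUNDARIES = {"+", "#"}  # assumed single-character boundary symbols


def find_matches(form, pattern):
    """Return the start indices in `form` at which a non-empty `pattern` matches.

    A boundary in `form` is skipped unless the pattern itself names a
    boundary at that point, in which case it must be present.
    """
    hits = []
    for start in range(len(form)):
        if form[start] in BOUNDARIES and pattern[0] not in BOUNDARIES:
            continue  # do not start a match on a boundary the rule ignores
        i, j = start, 0
        while j < len(pattern) and i < len(form):
            if form[i] == pattern[j]:
                i += 1
                j += 1
            elif form[i] in BOUNDARIES and pattern[j] not in BOUNDARIES:
                i += 1  # ignore a boundary the rule did not mention
            else:
                break
        if j == len(pattern):
            hits.append(start)
    return hits


def erase_boundaries(form):
    """Boundary erasure, as performed at the end of each cycle and stratum."""
    return "".join(ch for ch in form if ch not in BOUNDARIES)
```

A pattern "np" matches "in+possible" across the boundary, while a pattern "n+p" matches only when the boundary is actually present.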
In the absence of other restrictions, the fact that phonological rules can delete segments puts phonology into the domain of an unrestricted rewrite grammar. Since such a grammar would be impossible to parse, Hermit Crab places arbitrary (i.e. nonlinguistic) restrictions on deletion rules. (This is not to say that we have placed sufficient restrictions on such rules—a sufficiently ingenious linguist may still find some way of putting the morpher into an infinite loop, perhaps by including a deletion rule in a cyclic stratum!)
For the purposes of this discussion, a deletion rule is any rule which deletes part of its input, i.e. where the number of segments in the output of the rule is less than the number of segments in its input.
Understanding the arbitrary restriction that Hermit Crab uses requires an understanding of the way in which unapplication of deletion rules proceeds. Unlike all other rules, deletion rules are always unapplied as if they had been applied simultaneously. That is, during unapplication to a form X, X is scanned for all places where the deletion rule could be unapplied, and the rule is unapplied to those places, resulting in the new form X'. By default, that is the end of it; deletion rules cannot be unapplied again. However, if the user is brave, he can raise the variable *del_re_apps* from its default of zero to some larger number; then the deletion rule is unapplied again to X', and to X'', etc., *del_re_apps* times.
The above definition is couched in terms of simultaneous (un-)application of the deletion rule. However, if *del_re_apps* is set to a sufficiently large number, unapplication of a deletion rule will generate from a surface form all the underlying forms (and more) from which iterative application of the deletion rule might have generated the surface form (I think!).
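The effect of *del_re_apps* can be sketched with a toy deletion rule of the form x → ∅ / a __ operating on plain strings. The rule representation and the function name are assumptions of this sketch; only the variable name comes from the specification.

```python
def unapply_deletion(form, left_context, deleted, del_re_apps=0):
    """Return the forms X', X'', ... produced by unapplying a deletion rule.

    Each pass is simultaneous: every site present in the current form gets
    the deleted material restored in a single scan. The first pass is always
    performed; `del_re_apps` further passes are made on the results.
    """
    results = []
    current = form
    for _ in range(1 + del_re_apps):
        # str.replace scans the string once, left to right, so insertions
        # made in this pass cannot create new sites within the same pass.
        current = current.replace(left_context, left_context + deleted)
        results.append(current)
    return results
```

For the surface form "ab", the default setting yields only "axb". With del_re_apps set to 2 the sketch also yields "axxb" and "axxxb", which are forms from which iterative application of the deletion rule would reach "ab". Since every pass finds a new site, the search would never end without the bound.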
Morphological/phonological rule features (abbreviated as MPR features) and syntactic features are arbitrary features assigned by the user. MPR features govern the application of morphological and phonological rules, while syntactic features govern the application of morphological and syntactic rules. Syntactic features bear values, while MPR features do not bear values (i.e. if an MPR feature name appears on the MPR list of a given lexical entry, its value is implicitly +, while if it is absent, its value is implicitly –). Syntactic features include Head Features and Foot Features. However, this distinction is essentially invisible to the Morpher Module; morphological rules can assign features as either Head- or Foot-features in their output, but make no use of the distinction. (The distinction is, however, visible to the Parser Module.)
The value of a syntactic feature is a list (this is an extension of many theories, in which syntactic features are atomic valued; atomic valued features can be simulated by lists of length one). The interpretation of a list value whose length is greater than one is that the feature in question is ambiguous between (or among) the values listed.
Typical examples of features are tense (past present) (a syntactic feature-name feature-value pair) and verb_class_3 (an MPR feature).
A morphological rule may require that the syntactic features of the lexical entry which constitutes its input be unifiable with the features specified in the input of the rule. The rule may also require the presence or absence of specified MPR feature names.
A phonological rule may require the presence or absence of designated MPR features; syntactic features are invisible to phonological rules.
There are three ways in which features can become attached to a lexical entry. First, syntactic and MPR feature values may be assigned in the user's dictionary (i.e. lexically); syntactic feature assignments may later be changed by unification with specified features of a morphological rule’s input. Second, syntactic features have default values (see below); if a morphological rule calls for the unification of a specified feature with the value of that feature in a lexical entry, but the lexical entry has not received any values for that feature thus far in the derivation, then the unification of the rule’s feature specification with the default feature value becomes the new value of the feature in the lexical entry. Third, both syntactic and MPR features may be introduced in the output of a morphological rule. However, assignment of syntactic feature values in the output side of a morphological rule overrides feature values (if any) previously assigned to the lexical entry. (E.g. a rule may change a singular noun into a plural noun.)
There is no restriction on the meaning of features. For instance, the English suffix –ee is restricted to verbs which take animate direct or indirect objects: employee, *tearee. This restriction might be encoded with the ad hoc MPR feature AnimObj.
Feature value assignment, together with null affixation rules, allows Hermit Crab to distinguish between true null affixes, such as the plural marker on sheep, and optional affixes. That is, one analysis of English would hold that there are two words sheep: one singular, and one plural. The null affix pluralization rule for words like sheep, deer, antelope, reindeer, bison etc. would require that the value of the feature number of the input be unifiable with the value (singular), while the output would be assigned the feature number (plural). The lexical entry for the singular noun sheep (in the user's lexicon) would bear the feature number (singular). The surface string sheep would then be ambiguous between two lexical entries, one the singular noun sheep (listed in the user's lexicon), and the other the plural noun sheep (derived from the singular form by the rule of null affixation).
On the other hand, in a language in which the plural suffix was optional, the syntax will require that an unsuffixed word be unmarked (and therefore ambiguous) for the feature number (so as to support both singular and plural number agreement between the subject and the verb, for instance). Likewise, the morphological rule for plural affixation would require that the lexical entry which serves as its input be unifiable with the feature number (singular) (so as not to pluralize a noun already marked plural); the output of this rule would assign the feature number (plural). Under the system described in this chapter, if unsuffixed noun lexical entries bear no value for the feature number, they will be unifiable with the value (singular); i.e. a feature with no value serves as the identity feature under unification.
By default, any syntactic features not specifically assigned values are treated as having a maximal set of values for purposes of unification, i.e. the unification of A and B, where A is a feature with no values assigned and B is a feature of the same name with one or more values assigned, is B.
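The default behavior of unification can be sketched as follows. Treating the unification of two list values as their intersection is an assumption of this sketch (the text states only the case in which one side has no values); FAIL, unify_value, and unify are hypothetical names.

```python
FAIL = object()  # sentinel for unification failure


def unify_value(a, b):
    """Unify two list-valued features; None means no values assigned.

    A feature with no values acts as the identity. Otherwise the result is
    assumed to be the values the two lists share, failing if there are none.
    """
    if a is None:
        return b
    if b is None:
        return a
    common = tuple(v for v in a if v in b)
    return common if common else FAIL


def unify(features1, features2):
    """Unify two feature sets given as dicts from feature name to value tuple."""
    result = {}
    for name in {**features1, **features2}:
        value = unify_value(features1.get(name), features2.get(name))
        if value is FAIL:
            return FAIL
        result[name] = value
    return result
```

An unsuffixed noun with no value for number therefore unifies with a rule requiring (number (singular)), while a noun already marked (number (plural)) does not.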
The grammar writer may assign other default values to any feature names by use of the function assign_default_morpher_feature_value (see section 6.1.11). There is no provision for making default feature assignment dependent on part of speech or on the values of other features, although this is a possible future enhancement.
2.5.1. Irregular and Suppletive Forms
Irregular or semi-regular forms may be treated in two ways:
(1) By specifying morphological or phonological rules which only apply to (or which fail to apply to) forms marked with specified features (see section 2.4, Syntactic and Phonological/Morphological Rule Features); and
(2) By listing irregular forms in the lexicon.
Method (1) might be used for verb classes that take different suffixes (e.g. Spanish –ar, –er and –ir verbs), while (2) might be used for a highly irregular verb, such as the English verb be.
However, it is not sufficient for the morpher to merely recognize irregular forms; it must also not analyze a given string as if it were the regular form of an irregular word. For instance, not only must the morpher recognize the English word saw as the past tense of see, it must not morph the English word seed as if it were a regular past tense of see. This situation is treated in terms of the blocking of the analyzed form by an irregular form listed in the lexicon (see section 3.4, Families of Lexical Entries). Blocking allows for words which are irregular in their phonology, morphology, or subcategorization (for arguments that a form can be irregular in its subcategorization, see Carlson and Roeper 1981).
A morphological rule may require that the stem to which it attaches have a certain phonetic form.
However, occasionally affixes will attach to any morpho-phonological form except a certain one. An example is the English suffix –al, which does not attach to a stem ending in the suffix –ism: *fatalismal (Aronoff 1976). Aronoff's solution is a negative phonological condition on the rule attaching –al: the stem must not be analyzable into a root + the suffix –ism.
Hermit Crab does not allow negative conditions on the phonological composition of stems, but this particular case could easily be handled by having the –ism suffixation rule assign the ad hoc MPR feature ISM, and having the –al suffixation rule forbid that feature. An alternative analysis of this case (suggested by Siegel 1974) is that the –ism rule is ordered after the –al rule. Either of these solutions would fail if the negative condition were purely phonological (which it is not in this case: cf. baptismal). It is not clear whether affixes in natural languages can have purely phonetic negative conditions on their attachment (but see Scalise 1986: 46-48 for some possible examples). At any rate, Hermit Crab does not provide for negative phonetic conditions.
Rarely, languages will have gaps in their paradigms. A paradigm gap occurs when there is no form for a given position of the paradigm. For instance, the English phrasal verb have got lacks a past tense (J.D. Fodor 1978).
Provided the nonexistent forms would not be derivable by rule from the existing forms (perhaps because the morphological rules that would derive them are blocked by MPR features), the nonexistent paradigm forms could be blocked by listing all and only the existing inflected forms in the lexicon. Beyond this, there is no special provision for handling paradigm gaps in the morpher. This is in part because there is no widely accepted theoretical explanation for this phenomenon.
Morphological rules may be written in Hermit Crab for compounding and incorporation processes, i.e. processes which combine two lexical entries to form a derived word, provided that the word is written solid (i.e. with no internal white space).
However, there is no provision for lexical entries for idioms and compound nouns which are not written solid. Such idioms and compound nouns must be handled syntactically (for instance by selecting one word as the head, and having that word subcategorize a special syntactic idiom rule).
This section defines various kinds of lexical entries.
Lexical entries represent words, stems, or roots, including their phonological, morphological and syntactic properties (plus any additional information added by the linguist).
As used in this specification, the term dictionary refers to a permanent repository of lexical information; this may be contained in one or more files. The lexicon, on the other hand, appears to the user as a temporary repository of information during a given session. The lexicon may be loaded from the dictionary or from a portion of the dictionary (such as a single file containing only nouns). Additions, deletions and changes to lexical entries affect only the lexicon until the lexicon is saved to the dictionary. This specification has little to say about the structure of the dictionary, except that the lexicon must be derivable from the dictionary. (The dictionary might be used as the lexicon, except that changes would be stored only in the main memory until saved.)
The actual form of the lexicon is not specified here; it may be in memory, in temporary disk files, or some combination of the two. What is specified is the form of the lexical entries which the lexicon contains.
Lexical entries may be classified as real (listed in the user's lexicon) or virtual (constructed from other lexical entries on the basis of morphological and phonological rules). Both real and virtual lexical entries may be cross-classified as complete entries, which correspond to full words in the target language, and incomplete entries, which correspond to roots or stems.
The following subsections further describe this classification of lexical entries. For a definition of lexical entries as data structures, see section 5.2.
A Real Lexical Entry is a lexical entry which is listed in the lexicon. A Real Lexical Entry must be a Storable Lexical Entry (as defined below). Real Lexical Entries are added to the lexicon by the user (see section 6.4.1, load_lexical_entry; section 6.5.2, load_dictionary_from_text_file; and section 6.5.3, merge_text_file_with_dictionary).
A Virtual Lexical Entry is a lexical entry which is derived from another lexical entry (either real or virtual) by the application of one or more morphological or phonological rules (see section 4.2, Definitions of Morphological Rule Application, and section 4.4, Definitions of Phonological Rule Application).
A storable lexical entry is one which is a candidate for entry in the user's dictionary. In most cases, economy of storage (and the patience of the user) will dictate that only roots and irregular forms will actually be stored in the lexicon. However, lexical lookup is attempted for each storable lexical entry found in the analysis of an input word.
Each Real Lexical Entry may specify a Family Name. The set of all real lexical entries which have the same Family Name is referred to as a Family of Lexical Entries, and the individual members of that family are each other's Relative Lexical Entries.
The purpose of having families of lexical entries is to allow for blocking of regular derivations by the presence of irregular lexical entries listed in the lexicon. For instance, consider the English word seed. This word is properly formed as a noun, but not as the past tense of the verb see, since it is blocked by the irregular past tense saw. It would not be sufficient to simply list the irregular form saw in the lexicon, since that would not prevent morphing seed as a past tense verb. Rather, it is necessary to block the incorrect morphing by setting up the irregular form as the unique past tense of see.
Suppose the morpher is analyzing some surface form. Once a real lexical entry has been looked up in the course of analysis, its Family Name (if any) is known. The morpher can then compare the various storable lexical entries which it produces in the course of the derivation which synthesizes the surface form from this real lexical entry against the relative lexical entries (i.e. all lexical entries with the same Family Name as that of the real lexical entry which it found). If any relative lexical entry has the same Part of Speech, Subcategorization, Head and Foot Features as one of the storable lexical entries in the derivation, then that Relative Lexical Entry represents an irregular form which blocks the derivation.
Note: There is nothing to prevent the user from redundantly listing a regular form in the lexicon as a relative lexical entry. Such a regular form will be found at lexical lookup, and will block its own derivation by rule from some other real lexical entry, which at least prevents duplicate analyses of a given word. One situation where it might be desirable to list productive forms is the case where two forms of a given word exist (due to historical change or dialectal variation). Examples in English include hanged–hung and learned–learnt. If both forms are listed, either form will be correctly analyzed (since real lexical entries do not block each other).
The mechanism of blocking is detailed below (see section 3.6, Analyzable Word).
A Complete Lexical Entry potentially represents a fully inflected word, as opposed to an Incomplete Lexical Entry, which represents a form that is not fully inflected, i.e. a stem or root. ("Potentially", because it may in fact be blocked by an irregular form; see 3.6, Analyzable Word.)
A Complete Lexical Entry results from the application of zero or more morphological and phonological rules to some Real Lexical Entry, provided all Obligatory Features required by that Real Lexical Entry and the morphological rules which applied in the derivation are instantiated in the Complete Lexical Entry. The sequence of lexical entries beginning with the Real Lexical Entry, followed by a series of zero or more Virtual Lexical Entries, and terminating in the Complete Lexical Entry, represents the derivation of that Complete Lexical Entry.
More specifically, a lexical entry L is a Complete Lexical Entry if:
(1) it is a lexical entry of the *surface* stratum;
(2) it is derived from a Real Lexical Entry by the application of zero or more morphological rules and the corresponding phonological rules in accordance with the definitions of Morphological Rule Application and of Phonological Rule Application; and
(3) for each feature name in its Obligatory Head Features list, that feature name has been assigned a value in its Head Features list.
Note: Under part (3) above, it is not sufficient that a feature have a default value; it must have been assigned some value in the Real Lexical Entry from which the Complete Lexical Entry is derived, or by a morphological rule. (Default feature values may be assigned by the function assign_default_morpher_feature_value, section 6.1.11.)
Example of the use of Obligatory Features: Suppose that in some language, count nouns are obligatorily marked with a number suffix. Then the obligatory_features list of all count noun stems should contain the feature name number.
This mechanism provides a means of distinguishing between obligatory number marking (but where a null affix may indicate the unmarked value of number), and the situation in which number marking is optional (so that the lack of a number marking affix indicates ambiguity as to number). In the former case, all count noun stems would be listed in the lexicon (or would be designated by some derivational rule) as requiring a value for the feature number, and there would be one or more rules attaching number affixes, of which rules one might be a rule of null affixation providing the unmarked (default) value of number. All lexical entries for count nouns which lack a value for the feature number would be incomplete lexical entries.
In the second case, in which number marking is optional, noun stems would not be listed as requiring the feature name number, and a noun to which a number affixation rule has not applied is simply unmarked for number. Such a noun would (all other requirements being met) be a Complete Lexical Entry, ambiguous for number.
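The obligatory-feature test of condition (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name has_obligatory_features and the dictionary layout of the example entries are hypothetical and are not part of the Hermit Crab data structures. Default values are deliberately not consulted, in line with the note on part (3).

```python
def has_obligatory_features(head_features, obligatory_head_features):
    """Condition (3): every obligatory feature name has an assigned value.

    head_features: dict mapping feature name -> set of assigned values
        (hypothetical layout; default values are NOT consulted here).
    obligatory_head_features: iterable of feature names.
    """
    return all(head_features.get(name) for name in obligatory_head_features)


# Obligatory number marking: the bare stem stays incomplete
# until some number rule (possibly a null affix) assigns a value.
bare_stem = {"head": {}, "obligatory": ["number"]}
plural_noun = {"head": {"number": {"plural"}}, "obligatory": ["number"]}

# Optional number marking: the bare stem is complete but ambiguous for number.
optional_stem = {"head": {}, "obligatory": []}
```

In this sketch only plural_noun and optional_stem would pass the test; bare_stem would remain an Incomplete Lexical Entry.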
An input word is analyzable if it can be matched by the morpher with one or more complete lexical entries.
An input token (word) matches a complete lexical entry if the phonetic shape of the complete lexical entry is identical to the input token's shape.
This section defines the application of morphological and phonological rules to lexical entries.
Externally, there are two different representations for sequences of phonetic segments in the morpher. Input words (tokens) and the phonetic shape of Real Lexical Entries are represented as strings, in which each segment and/or suprasegmental is represented by one or more string characters. Phonological and morphological rules, on the other hand, use a Phonetic Template data structure (defined below), in which each segment is defined in terms of its phonetic features. These differing representations are made compatible internally to the morpher by being translated into a Phonetic Sequence (also defined below). At the other "end", the phonetic shape of a virtual lexical entry (i.e. a lexical entry derived by the application of phonological and/or morphological rules) is translated from a Phonetic Sequence into a string before lexical lookup. We therefore begin with definitions of the correspondences among these phonetic representations: strings of characters, phonetic templates, and phonetic sequences.
The translation between a string and its representation as a Phonetic Sequence makes use of the Character Definition Table (defined below). The translation from string to phonetic sequence is unambiguous; the reverse translation may be ambiguous.
The translations are defined here in algorithmic form for convenience. (Hermit Crab need not use the same algorithm internally.)
The translation of the string representing an input word into a phonetic sequence, defined in this section, is unambiguous.
The phrase "exit with error, returning X" means return an error message containing X. Error messages for this translation process are listed under the command morph_and_lookup_word.
Let Str be a string consisting of string characters C1...Cm. (String characters are defined in chapter two.) This string may be translated into the Phonetic Sequence PS = (F1...Fn), where each Fi is a boundary marker or a set of phonetic features, by the following procedure.
(1) Set PS equal to the empty list.
(2) Remove from Str the longest sequence of characters C = C1...Cj beginning at the left of Str and matching a Character Sequence in the Character Definition Table. (Note that Str is now of length m–j.) If no sequence beginning at the left end of Str matches any Character Sequence in the Character Definition Table, exit with error, returning the first character of Str.
(3) If sequence C matches the Character Sequence of a Segment Definition Record, append the Phonetic Features field of that Segment Definition Record to the right end of PS. If sequence C matches the Character Sequence of a Boundary Definition Record, append C to the right end of PS. (Boundary markers are not associated with any phonetic features, hence the character(s) which represent them in Str are also used to represent them in PS.)
(4) If Str is non-empty, go to step (2). Else exit with success, returning PS.
Note that some features in PS may be uninstantiated for some segments.
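Steps (1) through (4) amount to a longest-match scan of the input string. The following Python sketch shows one way to express it. It is not Hermit Crab's internal algorithm. The function name, the SEGMENTS and BOUNDARIES tables (standing in for the Segment and Boundary Definition Records of a Character Definition Table) and the use of ValueError for "exit with error" are all assumptions made for the example. Character sequences are assumed to be non-empty.

```python
def string_to_phonetic_sequence(s, segments, boundaries):
    """Translate string s into a phonetic sequence by longest-match lookup.

    segments: dict mapping character sequence -> feature dict (hypothetical layout)
    boundaries: set of boundary-marker character sequences
    Raises ValueError carrying the first unmatched character.
    """
    ps = []                                             # step (1)
    # Try longer character sequences first so that "ch" wins over "c".
    keys = sorted(set(segments) | set(boundaries), key=len, reverse=True)
    while s:                                            # step (4)
        match = next((k for k in keys if s.startswith(k)), None)  # step (2)
        if match is None:
            raise ValueError(s[0])
        # Step (3): segments contribute their features,
        # boundary markers are copied into PS as strings.
        ps.append(dict(segments[match]) if match in segments else match)
        s = s[len(match):]
    return ps


# Hypothetical character definition table for the example.
SEGMENTS = {
    "c": {"cons": "+", "cont": "-"},
    "ch": {"cons": "+", "cont": "-", "strid": "+"},
    "a": {"cons": "-"},
}
BOUNDARIES = {"+"}
```

With these tables, the string cha+c is read as the segment ch, the segment a, the boundary marker +, and the segment c.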
In the following definition of the translation from phonetic sequence to a regular expression, no translation is defined for a Phonetic Sequence which contains an Optional Segment Sequence record. Phonetic sequences containing Optional Segment Sequence records should appear only in rule environments, not in the structural change of rules or in lexical entries, and therefore will never need to be translated into a regular expression. (However, traces of rule unapplication may contain optional segments resulting from the unapplication of epenthesis or deletion rules; see section 5.8.3.2 Phonological Rule Analysis Trace Record--Rule Input.)
Let PS = (F1..Fn) be a Phonetic Sequence. This list may be translated into the Regular Expression RegExpr consisting of the terms C1..Cm by the following algorithm. (If each Fi is sufficiently instantiated to be unambiguously translated into a segment, RegExpr will represent a single string.)
(1) Set RegExpr equal to the empty string, and i = 1.
(2) (a) If Fi is a string (i.e. a boundary marker), append it to the right end of RegExpr (bracketing it with ASCII 2 (STX) and ASCII 3 (ETX) to the left and right respectively if it is marked "optional"), and go to step (3).
(b) Else, let SDR = {SDR1...SDRj} be the set of all Segment Definition Records whose Phonetic Features fields are supersets of Fi, and let CS = {CS1...CSj} be the set of Character Sequences of the members of SDR. Then if SDR is of length one (i.e. Fi is unambiguously translatable into a segment), set RegExpr equal to the result of appending CS1 to the right end of RegExpr; else (if SDR is of length greater than one, meaning Fi is ambiguously translatable), set RegExpr equal to RegExpr plus an ASCII 28 (FS) plus the members of CS, each separated by an ASCII 29 (GS), plus an ASCII 30 (RS). If the segment(s) is/are marked as optional, enclose the segment or the bracketed list of segments in ASCII 2 (STX) and ASCII 3 (ETX) to the left and right respectively. If there is no Segment Definition Record whose features are a superset of Fi, exit with error, returning Fi.
(3) If i < n, set i = i+1 and go to step 2. Else exit with success, returning RegExpr.
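The reverse translation can be sketched in the same style. Again this is a hedged illustration rather than the engine's own code. The representation of each element of PS as an (item, optional) pair, the segments dictionary, and the use of ValueError for "exit with error" are assumptions. The control characters are those named in step (2).

```python
# ASCII control characters used as brackets and separators in RegExpr.
STX, ETX, FS, GS, RS = "\x02", "\x03", "\x1c", "\x1d", "\x1e"


def phonetic_sequence_to_regexpr(ps, segments):
    """Translate a phonetic sequence into the bracketed regular expression.

    ps: list of (item, optional) pairs; item is a boundary-marker string
        or a feature dict (hypothetical layout).
    segments: dict mapping character sequence -> feature dict.
    """
    out = []
    for item, optional in ps:
        if isinstance(item, str):                 # step (2a): boundary marker
            term = item
        else:                                     # step (2b): feature set
            # Keep every segment whose features are a superset of item.
            cs = [chars for chars, feats in segments.items()
                  if all(feats.get(name) == value for name, value in item.items())]
            if not cs:
                raise ValueError(item)            # no segment matches Fi
            term = cs[0] if len(cs) == 1 else FS + GS.join(cs) + RS
        out.append(STX + term + ETX if optional else term)
    return "".join(out)
```

A fully instantiated sequence yields a plain string; an underspecified segment yields an FS...RS group listing the candidate character sequences in table order.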
4.1.2. Definition of the Partition of a Phonetic Sequence by a Phonetic Template
Let PSTSeq = (PST1...PSTm) be a Phonetic Sequence of a Phonetic Template, and let INIT and FINAL be the values of the init and final fields of that Phonetic Template. Furthermore, let PSLSeq = (PSLx...PSLy) (the Lexical Sequence) be a subsequence of the Phonetic Sequence PSL1...PSLz of a lexical entry. Then PSTSeq partitions PSLSeq into the list PART = (BM1 Part1...BMm Partm BMm+1), where each BMi is a list of zero or more Boundary Markers, and each Parti is a variable-free phonetic sequence, iff:
(1) If INIT is true, the left-most segment of the left-most non-empty Parti in PART is PSLx (i.e. PSTSeq must match PSLSeq beginning at the left-most segment of PSLSeq);
(2) If FINAL is true, the right-most segment of the right-most non-empty Parti in PART is PSLy (i.e. PSTSeq must match PSLSeq ending with the right-most segment of PSLSeq);
(3) If PSTi is a Simple Context, then Parti contains a single segment Seg such that PSTi is a subset of Seg (i.e. every feature in PSTi has that same value in Seg);
(4) If PSTi is a string of one or more boundary markers, then Parti is that same string of boundary markers;
(5) If PSTi is an Optional Segment Sequence, let MIN and MAX be the values of the Minimum Occurrence and Maximum Occurrence fields of PSTi (default 0 and 1, respectively), and let OptSeq be the Optional Sequence of PSTi. Then Parti is a list divisible into between MIN and MAX nonoverlapping adjacent subsequences, each of which matches OptSeq; and
(6) For all i, BMi is a list of zero or more boundary markers. (Boundary markers in the lexical sequence need not be accounted for by the template; this corresponds to the generally accepted notion that phonological rules can apply freely across morpheme boundaries. However, the definition of the application of a phonetic rule to a lexical entry, as given below, requires that the portion of a phonetic sequence matched by the input of a phonetic rule must not contain a boundary marker unless the marker is specifically required by the rule.)
Note 1: The above definition assumes synthesis order, whereas rules must be applied in analysis order to the morpher's input. In particular, when (un-)applying rules in analysis order, boundary markers which the input side of a phonological rule may call for are unlikely to be present in the lexical form.
Note 2: By step (3) above, a template which requires a feature-value pair (Fi Vi) will not match (during synthesis) against a segment for which Fi does not have an instantiated value.
This section describes the lexical entry generated by applying a morphological rule to another lexical entry.
In the following subsections, the application of a morphological rule MR is defined in terms of its application to an input lexical entry ILE, resulting in an output lexical entry OLE. ILE may be a real or virtual lexical entry; OLE will be a virtual lexical entry. (The terms "input" and "output" are here used in the synthesis sense.)
A morphological rule may be blocked under certain circumstances. When blocking occurs, the input lexical entry is replaced by a different lexical entry, and the derivation continues as if the rule had already applied.
Blocking of morphological rules is defined as follows. (Blocking of affix templates is defined separately; see section 4.3.) Let DLE be a Derived Lexical Entry to which morphological rule MR has just applied, and let StemSet be the Family of DLE.
Then DLE is replaced with a member RLE of StemSet if:
(1) MR is a blockable rule; and
(2) the Stratum, Part of Speech, and Subcategorization of RLE are identical to the corresponding fields of DLE; and
(3) the Head and Foot Features of DLE are subsets of the corresponding fields of RLE.
Example of the Use of Blocking: Suppose that the word seed has been (incorrectly) analyzed as being derived from the verb see by the application of the morphological rule attaching the –ed suffix, a rule which adds the Head Feature tense (past) and changes the phonetic form of this stem to seed. Suppose further that the lexical entry for see and the lexical entry for the verb saw are Relative Lexical Entries, with the entry for saw identical to the lexical entry for see save for its phonetic form and the addition of the head feature tense (past). Then the analysis of seed as the past tense of see will be blocked by the lexical entry for saw; that is, in the derivation of the past tense of see, the Derived Lexical Entry seed is replaced by the Lexical Entry for saw. (If Hermit Crab is parsing seed, i.e. running the command morph_and_lookup_word, the resulting word saw will not match the input, and the derivation will fail. If Hermit Crab is instead generating the past tense of see, i.e. it is running the command generate_word, the output will be saw instead of seed.)
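The blocking test and the seed/saw example can be illustrated with a short Python sketch. The dictionary keys (shape, stratum, pos, subcat, head, foot), the function name apply_blocking and the subcategorization value are hypothetical choices for this example only. The sketch returns the first family member that satisfies the conditions.

```python
def apply_blocking(dle, family, rule_is_blockable):
    """Return the relative entry that blocks dle, or dle itself if none does."""
    if not rule_is_blockable:
        return dle
    for rle in family:
        # Stratum, Part of Speech and Subcategorization must be identical.
        same_fields = all(dle[f] == rle[f] for f in ("stratum", "pos", "subcat"))
        # Head and Foot Features of dle must be subsets of those of rle.
        subsumed = all(dle[f].items() <= rle[f].items() for f in ("head", "foot"))
        if same_fields and subsumed:
            return rle
    return dle


def entry(shape, head):
    """Build a hypothetical lexical entry; only the fields used above are present."""
    return {"shape": shape, "stratum": "surface", "pos": "V",
            "subcat": ["transitive"], "head": head, "foot": {}}


see = entry("see", {})
saw = entry("saw", {"tense": "past"})      # irregular relative of see
seed = entry("seed", {"tense": "past"})    # regular form derived by the -ed rule
family = [see, saw]
```

Here apply_blocking(seed, family, True) returns the entry for saw, so a parse of seed as a past tense fails to match the input, while generation produces saw.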
This section defines the unification of the Head (or Foot) Features of an Input Lexical Entry ILE with the head (foot) features of the Required Head (Foot) Features field of a subrule SR of a morphological rule. (The result of this operation is then combined with the Head (Foot) Features of the subrule to create the Head (Foot) Features of the Output Lexical Entry; see 4.2.6 below.)
Note that features may be either uninstantiated or instantiated. An instantiated feature is a feature which either has one or more values, or whose value is the designated atom ‘*NONE*’. (The latter is used in Required Features to ensure that no value has been assigned to a lexical entry’s Head Features.)
We first define the unification of a single Required Feature (RFN RFV) with the Head (or Foot) Features LF= (LFN1 LFV1...LFNn LFVn) of a Lexical Entry.
If RFV is the atom ‘*NONE*’, then
if RFN is not included in (LFN1...LFNn) and there is no default value for RFN, unification succeeds with the value (RFN ‘*NONE*’);
else if RFN is included in (LFN1...LFNn) with the value ‘*NONE*’, unification succeeds with the value (RFN ‘*NONE*’);
else if RFN has the default value ‘*NONE*’, unification succeeds with the value (RFN ‘*NONE*’);
otherwise (RFN is included in (LFN1...LFNn) but has a value other than ‘*NONE*’, or it is not included in LF but there is a default value for RFN other than ‘*NONE*’), unification is said to fail (and the value of the unification is undefined).
Otherwise (if RFV is not ‘*NONE*’), then
If the feature name RFN is included in (LFN1...LFNn), let the set intersection of RFV and the value of RFN in the Head (Foot) Features of the Lexical Entry be OFV. If OFV is non-empty, unification succeeds with the value (RFN OFV); otherwise, unification fails;
else (if RFN is not included in (LFN1...LFNn)), then if RFN has a default value, then let the set intersection of RFV and the default value of RFN be OFV. If OFV is non-empty, unification succeeds with the value (RFN OFV); otherwise (if OFV is empty), unification fails;
else (if RFN does not appear among the Head (Foot) Features of the Lexical Entry, and it does not have a default value), unification succeeds with the value (RFN RFV). (That is, an uninstantiated feature in the lexical entry acts as the identity element under unification.)
The unification of a set of Required Features (RFN1 RFV1...RFNn RFVn) with the Head (or Foot) Features of a Lexical Entry succeeds if the unification of each of the Required Features with the Head (or Foot) Features of a Lexical Entry succeeds, producing an output set of features OF = (OFN1 OFV1...OFNm OFVm) determined as follows:
For every RFNi, OF includes the unification of (RFNi RFVi) with LF;
and for every LFNi not included in (RFN1...RFNn), OF includes (LFNi LFVi) (that is, the features of the Lexical Entry not mentioned in the Required Features of the rule remain unchanged in the output features).
The output features OF are used as the new features of the lexical entry for the purposes of applying the morphological rule.
In (hopefully!) more intuitive terms, unification means that any features in the Input Lexical Entry’s Head Features which are incompatible with the Required Head Features of the morphological subrule are removed; if the result is empty, unification fails. Furthermore, if the Lexical Entry lacks a specified value for any Required Feature, the default value (if any) is used in place of a specified value; failing a default value, the value of the feature in the Lexical Entry is treated as compatible with anything, which is to say the value of the Required Feature is taken to be the value of the actual feature. The special value ‘*NONE*’ is used when it is required that a feature have no assigned value (e.g. if an affix attaches to a noun only if the noun does not as yet bear any marking for number).
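The case analysis above can be condensed into a small Python sketch. It is offered as an aid to reading, not as the engine's implementation. Feature values are modeled as Python sets of atoms, failure is modeled by returning None, and the names unify_feature, unify_features and defaults are assumptions. One case is not spelled out in the definition: a Required Feature with an ordinary value meeting a lexical value of '*NONE*'. The sketch assumes this fails.

```python
NONE = "*NONE*"


def unify_feature(rfn, rfv, lf, defaults):
    """Unify one required feature (rfn, rfv) with the lexical features lf.

    Values are sets of atoms, or the atom NONE. Returns the unified value,
    or None if unification fails.
    """
    # Listed value if present, else the default value, else None.
    actual = lf.get(rfn, defaults.get(rfn))
    if rfv == NONE:
        return NONE if actual is None or actual == NONE else None
    if actual is None:
        return set(rfv)            # uninstantiated feature: identity element
    if actual == NONE:
        return None                # assumption: a value cannot unify with NONE
    return (set(rfv) & set(actual)) or None


def unify_features(required, lf, defaults):
    """Unify all required features; unmentioned lexical features pass through."""
    out = dict(lf)
    for rfn, rfv in required.items():
        value = unify_feature(rfn, rfv, lf, defaults)
        if value is None:
            return None
        out[rfn] = value
    return out
```

For example, a noun listed with number {sg, pl} unifies with a required number {pl} to give number {pl}, and fails against a required number {dual}.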
An Ordinary (non-realizational) Morphological Rule R applies to a Lexical Entry ILE if:
(1) If a Part of Speech is specified on the input side of R, it is identical to the Part of Speech of ILE;
(2) If the Required Subcategorization Rules list of R is non-empty, the Subcategorization field of ILE contains at least one of the syntactic rule names contained in the Required Subcategorization Frame field of R;
(3) The Required Head/Foot Features lists of R have been successfully unified with the Head and Foot Features lists of ILE (as defined above, see section 4.2.2, Definition of Feature Unification);
(4) The value of the Multiple Application field of R is greater than the number of times the Rule Name of R appears in the Morphological Rules list of ILE; and
(5) The Rule Stratum of R is one deeper than or the same as the Morphological Stratum of ILE. (See section 3.3, Storable Lexical Entries for a more detailed definition of when a morphological rule may apply to a lexical entry of a given stratum.)
If the Morphological Rule applies to the Lexical Entry, its subrules are applied disjunctively. That is, the Input Side of each of the Subrules is checked in order for a match (see below); if there is a match, that Subrule is applied, and the application of the Morphological Rule is complete. (It is not an error if the Morphological Rule as a whole applies to the Lexical Entry, but none of its subrules apply.)
Let the Phonetic Template MRITemp (= Morphological Rule Input Template) be the Required Phonetic Input of a subrule SR of a morphological rule, and let the Phonetic Sequence PLSeq be the Phonetic Shape of the Lexical Entry ILE.
Then subrule SR matches against ILE iff:
(1) MRITemp matches against PLSeq;
(2) For each atom in SR's Required Morphological Rule Features list, ILE must contain that same atom in its MPR Features list;
(3) For each atom in SR's Excluded Morphological Rule Features list, ILE must not contain that atom in its MPR Features list.
4.2.5. Definition of Transformation of a Phonetic Sequence by a Morphological Rule
Note: The following definition is given in terms of synthesis of a derived phonetic sequence from another phonetic sequence (that of the stem) plus the phonetic sequence of an affix (given by a morphological rule).
Let the Phonetic Template MRITemp = MRI1...MRIm be the Required Phonetic Input of a subrule SR of a morphological rule MR, and MROList = MRO1...MROn be the Phonetic Output of SR. (Note that while MRITemp is a phonetic template, MROList is a list of integers, simple contexts, lists of integers plus feature specifications, and lists of strings plus the name of a character definition table; cf. Morphological Rule Notation--Phonetic Output.) Further let PLISeq be the Phonetic Sequence which represents the Phonetic Shape of some lexical entry LE, let PartI = (BMI1 PI1...BMIm PIm BMIm+1) be the partition of PLISeq by MRITemp, and let PLOSeq be the Phonetic Sequence which is to represent the transformation of PLISeq according to rule SR.
Then the rule SR transforms PartI into PartO = (BMO1 PO1...BMOn POn BMOn+1), a list of boundary markers (BMOq) and phonetic sequences (POq), according to the following rules:
(1) If MROq is an integer p, POq = PIp; (boundary markers in the input phonetic sequence which are not mentioned in the rule associate with the segments to their left if).
(2) If MROq is a list composed of an integer p followed by a feature list FL, then POq is identical to PIp except that for every Simple Context Sk in POq, and for every feature-name feature-value pair {FN FV} in FL, the value of FN in Sk is FV; and boundary markers associate as per (1). (The feature values specified in FL are inserted in each segment of POq, replacing the values of those same features, if any, in PLISeq. Note that any boundary markers in PIp are simply copied over into POq.)
(3) If MROq is a list composed of a string s followed by the name of a character definition table CT, then POq is the sequence of segments into which the string s is translated using the specified character definition table.
(4) If MROq is a Simple Context, POq is identical to MROq (i.e. it is a single segment whose features are those of MROq);
(5) If MROq is a boundary marker (string), POq is identical to MROq.
(6) BMO1 = BMI1; and all BMOq not specified above are empty.
Finally, SR transforms PLISeq into PLOSeq iff PLOSeq is the phonetic sequence composed by concatenating all the members of the list PartO.
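Rules (1) through (5) can be pictured with the following Python sketch, which builds the output sequence from the input parts. It is a simplified illustration under several assumptions. The boundary-marker lists BMI and BMO are left out entirely. Segments are feature dictionaries. An integer-plus-features item and a string-plus-table item are both modeled as Python tuples. Each character of an affix string is assumed to be one segment. The names transform and tables are hypothetical.

```python
def transform(part_i, mro_list, tables):
    """Build the output phonetic sequence from the partition of the input.

    part_i: list of input parts PI1..PIm, each a list of segments (feature dicts).
    mro_list: output items as described in rules (1)-(5).
    tables: dict mapping table name -> {character: feature dict} (hypothetical).
    """
    out = []
    for mro in mro_list:
        if isinstance(mro, int):                                  # rule (1)
            out.extend(dict(seg) for seg in part_i[mro - 1])
        elif isinstance(mro, tuple) and isinstance(mro[0], int):  # rule (2)
            p, fl = mro
            # Copy part p, overriding the features named in fl.
            out.extend({**seg, **fl} for seg in part_i[p - 1])
        elif isinstance(mro, tuple):                              # rule (3)
            s, table = mro
            out.extend(dict(tables[table][ch]) for ch in s)
        else:                                                     # rules (4), (5)
            out.append(dict(mro) if isinstance(mro, dict) else mro)
    return out
```

A suffixing rule would then be written as something like [1, "+", ("d", "main")]: copy part 1, insert a boundary marker, and append the segments of the affix string.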
Note: It is unwise to have a morphological rule delete optional segment sequences. One reason is that it is computationally expensive to insert (during analysis) an unknown number of unknown segments. There is also the undesirable possibility of inadvertently deleting boundary markers during synthesis.
The following definition is written in the synthesis sense: rule MR attaches an affix to ILE to produce OLE. Note also that this defines a single application of MR; in some cases, a morphological rule may apply more than once (see section 4.2.8, Definition of Application of a Set of Non-Realizational Morphological Rules).
Rule MR transforms the input lexical entry ILE into the output lexical entry OLE iff for SR, the first subrule of MR to match lexical entry ILE (as defined above, see section 4.2.3 Definition of Match between a Morphological Rule and a Lexical Entry):
(1) The phonetic sequence representing the Phonetic Shape of ILE has been transformed into the Phonetic Sequence of OLE by the application of SR (as defined above, see section 4.2.5, Definition of Transformation of a Phonetic Sequence by a Morphological Rule);
(2) The Lexical Entry ID of OLE is the same as the Lexical Entry ID of ILE;
(3) The Stratum of OLE is the same as the Rule Stratum of MR;
(4) The Gloss String of OLE is the result of concatenating the Gloss String of SR to the right of the Gloss String of ILE, with a space separating the two;
(5) The Part of Speech of OLE is the same as the Part of Speech of the output of SR if that field is non-empty; otherwise it is the same as the Part of Speech of ILE.
(6) If there is a Subcategorization field in the output of SR, the Subcategorization field of OLE consists of (1) all atomic members of the Subcategorization field of the output of SR, (2) the second member (if any) of each sublist of that field for which the first member of the sublist is a member of the Subcategorization field of ILE, and (3) any members of the Subcategorization field of ILE which are not mentioned in the Subcategorization field of SR. Otherwise (if there is no Subcategorization field in the output of SR), the Subcategorization field of OLE is the same as the Subcategorization field of ILE.
Note: If the Subcategorization field of ILE is absent, it is considered to be empty, i.e. the Subcategorization of OLE = the Subcategorization of SR. If, however, the Subcategorization field of the output record of SR is the empty list, the above definition implies that the Subcategorization field of OLE will be empty.
(7) The Morphological Rules list field of OLE consists of the Morphological Rules list of ILE appended to (the left of) a list containing the Rule Name of MR.
(8) The MPR Features list of OLE is the set union of the MPR Features list of ILE and the MPR Features list of SR.
(9) The Head Features list of OLE is the Head Features to be realized on ILE, plus any non-conflicting features of the Head Features list of SR, plus any non-conflicting features of the Head Features list of ILE as modified by the unification of the Required Head Features of the input of SR with the previous Head Features of ILE (see section 4.2.2, Definition of Feature Unification). (That is, the Head Features to be realized on ILE take precedence over the Head Features of SR, which in turn take precedence over any other Head Features of ILE.)
(10) The Foot Features list of OLE is the Foot Features list of SR plus any non-conflicting features of the Foot Features list of ILE, as modified by the unification of the Required Foot Features of the input of SR with the previous Foot Features of ILE (see section 4.2.2, Definition of Feature Unification).
(11) The Obligatory Features list of OLE is the set union of the Obligatory Features lists of ILE and SR.
Note: The Head- and Foot-features fields of OLE bear only values which have been assigned to them by virtue of percolation from a real lexical entry. Default values are not listed in lexical entries, and therefore are not output by the morpher module.
A compounding rule is a morphological rule with two input fields: one Head field and one Non-head field. Such a rule analyzes a word into two lexical entries; for computational reasons, the Non-head field is required to be a Real Lexical Entry. (This is probably linguistically motivated, as well.) Compounding rules are applied in the same way as other morphological rules, except for the differences specified in the following subsections.
For these subsections, SRH and SRNH refer to the Head and Non-head fields respectively of SR, and ILEH and ILENH refer to the corresponding input lexical entries.
The Head and Foot Features of ILEH and ILENH must be unifiable with the Required Head and Required Foot Features of SRH and SRNH respectively, as defined above (see section 4.2.2, Definition of Feature Unification).
ILEH and ILENH must each be partitionable by SRH and SRNH respectively, as defined above (section 4.2.4, Definition of Match between the Input Side of a Morphological Subrule and a Lexical Entry). (Given the specification of compounding rules given later, SRNH cannot contain a Multiple Application field.)
OLE is formed by appending the partition of the Phonetic Sequence of ILEH by SRH to the left of the partition of the Phonetic Sequence of ILENH by SRNH, and transforming the resulting partition as if it were the input to an ordinary morphological rule (section 4.2.5, Definition of Transformation of a Phonetic Sequence by a Morphological Rule). (This does not imply that the non-head word will appear to the right of the head word, but is only a convention to standardize application of compounding rules.)
The result of applying a compounding rule to two lexical entries is the same as the result of applying an ordinary morphological rule to a single lexical entry (section 4.2.6, Definition of Application of a Morphological Rule to a Lexical Entry), with the following exceptions:
The Phonetic Sequence of OLE is as defined in the section immediately above (see 4.2.7.3, Transformation of Phonetic Sequences by Compounding Rule).
The Gloss String of OLE is the result of concatenating the Gloss String of ILENH to the right of the Gloss String of ILEH; the two Gloss Strings are separated by a space (ASCII 32).
The Lexical Entry ID, Part of Speech, Subcategorization, Morphological Rules list, MPR Features, Head Features, Foot Features, and Obligatory Features fields of OLE are as specified above for ordinary morphological rules, but substituting ILEH for ILE.
Finally, ILENH must be a Real Lexical Entry.
4.2.8. Definition of Application of a Set of Non-Realizational Morphological Rules
This section specifies the application of a set of ordinary and/or compounding (but not realizational) morphological rules of a given stratum.
Let the set of morphological rules of the stratum be MRSet = {MR1,...MRn}, and let ILE be the Input Lexical Entry to which MRSet applies to produce the Output Lexical Entry OLE. (Again, "input" and "output" are used here in the synthesis sense.) Each subsection below defines the application of one or more rules of MRSet, according to the ordering of morphological rules for the stratum.
Note: Additional applications of phonological rules, not described in the following subsections, may be necessary to generate a Storable Lexical Entry; see section 3.3, Storable Lexical Entries.
This definition applies if the value of the m_rule_order field of the current stratum is linear.
Let MRList = MR1...MRn be the list of morphological rules in MRSet in their order of application. Then ILE is related to OLE by the following algorithm:
(1) Set InterLE = ILE.
(2) If MRList is empty, set OLE = InterLE and exit, returning InterLE. Otherwise set CurRule to one of the rules in MRList, and remove CurRule and all rules preceding it from MRList. Set NumApplics = 0.
(3) Apply CurRule to InterLE, set InterLE equal to the result, and increment NumApplics by 1.
(4) If the current stratum is cyclic, apply the phonological rules of the current stratum to InterLE, and set InterLE equal to the result.
(5) If NumApplics is less than the Multiple Application Field of CurRule, optionally go to step (3).
(6) Go to step (2).
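Because step (2) chooses "one of the rules" and step (5) loops "optionally", the algorithm relates one ILE to a set of possible OLEs. The Python sketch below enumerates that set. It is an illustration under stated assumptions: apply_rule and phonology are hypothetical callbacks, a rule that does not apply is modeled by apply_rule returning None, and the derivation is assumed to be free to stop at any point (i.e. all remaining rules may be skipped).

```python
def linear_derivations(le, rules, apply_rule, phonology=None):
    """Yield every output obtainable from le under linear rule ordering.

    rules: list of (name, max_applications) pairs in order of application.
    apply_rule(name, le): hypothetical callback returning the new entry,
        or None if the rule does not apply.
    phonology: optional callback run after each application (cyclic strata).
    """
    yield le                                   # stop here: no further rule applies
    for i, (name, max_applics) in enumerate(rules):   # step (2): pick CurRule
        inter = le
        for _ in range(max_applics):           # steps (3) and (5)
            inter = apply_rule(name, inter)
            if inter is None:
                break
            if phonology is not None:          # step (4)
                inter = phonology(inter)
            # Step (6): only rules ordered after CurRule remain available.
            yield from linear_derivations(inter, rules[i + 1:], apply_rule, phonology)
```

With two rules a (at most once) and b (at most twice) that simply append their names, the outputs from an empty entry are the empty entry, a, ab, abb, b and bb; ba is never produced because b is ordered after a.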
This definition applies if the value of the m_rule_order field of the current stratum is unordered.
For each rule MRi in MRSet, applics(MRi) represents the number of times MRi has applied. Then OLE is derivable from ILE by the following algorithm:
(1) Set MRSub equal to any subset (including the empty set) of MRSet. For all MRi in MRSub, set applics(MRi) = 0. Set InterLE = ILE.
(2) If MRSub is empty, set OLE = InterLE and exit. If MRSub contains only rules whose Multiple Application Field is greater than one, optionally set OLE = InterLE and exit. Otherwise set CurRule to any rule of MRSub. Increment applics(CurRule); if the result is equal to the Multiple Application Field of CurRule, remove CurRule from MRSub.
(3) Apply CurRule to InterLE, and set InterLE equal to the result.
(4) If the current stratum is cyclic, apply the phonological rules of the current stratum to InterLE, and set InterLE equal to the result.
(5) Go to step (2).
Warning: Because every subset of the rules is tried in every possible order, this algorithm can be very slow. In practice, the situation is not quite as bad as it might seem, because Hermit Crab will either be given a particular ordering of rules to use (if it is running the command generate_word), or it will have chosen a particular order of rules based on the analysis of a surface form. (However, the analysis may be indeterminate if the stratum in question contains null affixes.)
Realizational Morphological Rules are applied according to an Affix Template. The Affix Template of a given Stratum applies after all relevant ordinary Morphological Rules of that Stratum have been applied, but before any Phonological Rules of a non-cyclic Stratum have been applied.
Let Templates = T1...Tk be the list of Affix Templates of a Stratum. (Note that Templates may be empty, in which case there are no Realizational Rules to be applied for this Stratum.) Also let LE be a Lexical Entry to which the Stratum is being applied, and let RzF be the set of features to be realized in the derivation.
Then a stem LE' is selected as follows: Let StemSet be the set of lexical entries in the family of LE. Then set LE' to the member of StemSet whose Head and Foot Features are a superset of those of LE and the largest subset of RzF. (This should be a unique lexical entry; if more than one lexical entry matches this description, an error results. If RzF is empty, this step is skipped.) If there is no such lexical entry in StemSet, then set LE' to LE.
The application of the Realizational Morphological Rules of the Stratum to LE' is as follows. Templates is scanned for an Affix Template whose Required Part of Speech matches the Part of Speech of LE', and whose Required Subcategorized Rules are a (possibly improper) subset of the Subcategorized Rules of LE'. (It is not an error if no Affix Template matches against LE', but an error will occur if more than one Template matches.) Let T be the selected Template.
Let Slots = S1...Sm be the list of Slots of T. The Slots are scanned in order. For Slot Sj, let Rules = R1...Rn be the list of Realizational Rules. Rules are then applied in disjunctive order, that is: the Head Features of R1 are checked against RzF. If the Realizational Features of R1 are a subset of RzF, and not also a subset of the Head Features of LE', the rule is applied, and if the Stratum is cyclic, the phonological rules of the Stratum are then applied. Processing then continues with Slot Sj+1. If the Realizational Features of R1 are not a subset of RzF, rule R2 is checked, and so forth. If none of the rules of slot Sj match, processing continues with Slot Sj+1. (It is not an error if none of the rules of a given slot apply, nor is it an error if a rule of a slot matches LE', but none of its subrules matches. Note that the test of the Realizational Features is not a unification test; any features of the Realizational Features of the rule must be present with that same value in the Realizational Features of the derivation.)
After processing the slots, set the head features of the resulting word equal to RzF plus any nonconflicting Head Features of LE'. (An alternative would be to assign the Head Features of each Realizational Rule as it is applied, which would have the effect of allowing one affix to block attachment of a later affix. It is not clear which of these approaches is correct.)
The reason for requiring that the Realizational Features of R1 are not a subset of the Head Features of LE', is to allow blocking of inflectional affixation if the stem is inherently specified for all the features which the inflectional affix would realize. For instance, on the assumption that oxen is listed in the lexicon and bears the feature [+plural], the plural suffix -s should be prevented from attaching to it to give *oxens; see Anderson (1992: 134, example (20)).
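The slot-by-slot scan and the blocking condition can be sketched in Python as follows. This is a simplified illustration, not a definitive implementation: features are modeled as flat name/value dictionaries, a rule is a (features, affixation function) pair, phonological rules of cyclic strata are omitted, and the names is_subset and apply_slots are hypothetical. The text does not say whether the next rule of a slot is tried when a rule is blocked by the stem's Head Features; the sketch assumes that it is.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, str]
# Assumption: a Realizational Rule is (Realizational Features, affixation function).
RealizationalRule = Tuple[Features, Callable[[str], str]]


def is_subset(small: Features, big: Features) -> bool:
    """True if every feature in small occurs in big with the same value."""
    return all(big.get(name) == value for name, value in small.items())


def apply_slots(
    shape: str,
    stem_head_features: Features,
    rzf: Features,
    slots: List[List[RealizationalRule]],
) -> str:
    for rules in slots:                      # Slots are scanned in order
        for rz_features, affix in rules:     # rules of a slot are disjunctive
            realizes = is_subset(rz_features, rzf)
            blocked = is_subset(rz_features, stem_head_features)
            if realizes and not blocked:
                shape = affix(shape)
                break                        # at most one rule applies per slot
            # Assumption: a non-matching or blocked rule lets the next rule be tried.
    return shape


plural_s: RealizationalRule = ({"plural": "+"}, lambda s: s + "s")
```

Under these assumptions, apply_slots("dog", {}, {"plural": "+"}, [[plural_s]]) gives "dogs", while a stem "oxen" that already bears {"plural": "+"} is returned unchanged.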
The application of a phonological rule to a lexical entry changes the phonetic form of the input lexical entry.
The following subsections define the application of a phonological rule PR to an input lexical entry ILE, resulting in the output lexical entry OLE. (ILE may be a Real or a Virtual Lexical Entry, and OLE will be a Virtual Lexical Entry.)
4.4.1. Phonetics of Phonological Rule Application
This section describes the phonetic effects of the application of a phonological rule to a lexical entry.
Let the Phonetic Template PRLTemp = <LInit, LFinal, (PRL1...PRLi)> be the Left Environment of phonological rule PR, the Phonetic Sequence PRISeq = (PRI1...PRIj) be the Phonetic Input Sequence of PR, and the Phonetic Template PRRTemp = <RInit, RFinal, (PRR1...PRRk)> be the Right Environment of PR. Let the Phonetic Sequence PrevWord be the prev_word field (if any) of PR, and let the Phonetic Sequence NextWord be the next_word field (if any) of PR.
Further let the Phonetic Template PETemp = <LInit, RFinal, (PRL1...PRLi PRI1...PRIj PRR1...PRRk)> be the combined template for PR, where the Phonetic Sequence of PETemp is the concatenation of the Phonetic Sequences of the Left Environment template + Phonetic Input Sequence + Right Environment. (Note that LFinal and RInit are ignored. Also, either PRLSeq or PRRSeq may be omitted in PR, and any of the Phonetic Sequences of the environments or of the input of PR may be empty; in that case, the Phonetic Sequence of PETemp consists of the concatenations of the non-empty fields. PESeq itself should never be empty, since then the rule would apply everywhere.)
Then phonological rule PR matches against the Phonetic Sequence PLSeq = PL1...PLn, a subsequence of the phonetic sequence representing the Phonetic Shape of the lexical entry ILE, iff:
(1) PETemp partitions PLSeq (section 4.1.2, Definition of Partition of a Phonetic Sequence by a Phonetic Template);
(2) If PLISeq = PLw...PLx is the subsequence of PLSeq that matches PRI1...PRIj (the input sequence of PR), then PLISeq does not contain any boundary markers not specifically required in PRI1...PRIj. (Unlike the part of the phonetic shape which matches against the rule’s environment, the portion which matches the rule’s input cannot contain any boundary markers not called for by the rule.)
(3) If the Phonetic Sequence of the Input Template of PR is empty (i.e. a rule of epenthesis), and if PRR1 is not a boundary marker, then if PLy...PLz is the subsequence of PLSeq that matches PRRSeq (the Right Environment of PR), then PLy is not a boundary marker. (In a rule of epenthesis, the epenthesized segment(s) is (arbitrarily) attached to the right of a boundary marker not specifically mentioned in the rule.)
(4) The Stratum of ILE must be included in the Rule Strata of PR.
(5) If PrevWord has a value, then PrevWord matches the Phonetic Shape of the word preceding the word being analyzed, if there is one; if there is no preceding word and PrevWord has a value, it is the atom *null*.
(6) If NextWord has a value, then NextWord matches the Phonetic Shape of the word following the word being analyzed, if there is one; if there is no following word and NextWord has a value, it is the atom *null*.
The sub-sequence PLISeq of PLSeq which matches against the Input Sequence of PR is referred to as the "input stretch" of PLSeq. (There may be more than one input stretch in a given lexical entry.)
In this section, the single application of a phonological rule to a phonetic sequence is defined. This is an abstraction from the more general situation in which a phonological rule may apply multiple times to a single phonetic sequence; that case is defined in the next section, based on the definition given here. It is also an abstraction from the application of a disjunctive set of phonological rules to a lexical entry, which is described in the second section following.
Note: The following definition is given in terms of the synthesis of a derived lexical entry by applying a phonological rule to another (underlying) lexical entry.
Let the variable-free Phonetic Sequence PRISeq = (PRI1...PRIm) be the Phonetic Input Sequence of rule PR, and PROSeq = (PRO1...PROn) be its Phonetic Output Sequence, and let the variable-free phonetic sequences PLISeq = (PLI1...PLIi) and PLOSeq = (PLO1...PLOj) be the Input Stretch of some lexical entry LE and its transformation according to rule PR.
(There is no guarantee that m=i or that n=j, since the Input Lexical Sequence and its transformation may have a boundary marker not mentioned in the rule; and there is no guarantee that m=n or that i=j, since segments may be epenthesized or deleted by the rule.)
Then PR transforms PLISeq into PLOSeq iff:
(1) Rule PR matches LE, with PLISeq being the Input Stretch according to this match (see definition in section 4.4.1.1, Match between a Phonological Rule and a Lexical Entry);
(2) If PRISeq and PLISeq are the empty list (a rule of epenthesis), PLOSeq = PROSeq;
(3) If PROSeq is the empty list (a rule of deletion), PLOSeq is the empty list;
(4) If PRISeq and PROSeq are non-empty phonetic sequences of the same length, then each PLOk is identical to PLIk except that for each segment PLIk matched to the corresponding simple context PRIl, each feature-name feature-value pair in PROl is substituted into PLOk in place of the corresponding feature of the same name (if any) in PLIk;
(5) If PRISeq is of length one and PROSeq is of length greater than one (for instance, a diphthongization rule), then PLOSeq consists of the same number of segments as PROSeq, and each segment PLOk bears all the features of PLI1 except that the feature-name feature-value pairs given in PROk have been substituted for the features of the same name (if any) in PLI1; or
(6) If PRISeq and PLISeq are of length greater than one, and PROSeq is of length one (for instance, a rule of degemination), PLOSeq is of length one, and its features are those of PRO1 plus any non-conflicting features from the intersection of the feature-name feature-value pairs of the set of all segments in PLISeq.
Note 1: There is no provision for a rule which takes as input two or more segments, and transforms them into some different number of segments greater than one.
Note 2: For reasons of computational tractability, the use of phonological rules to add, delete or change boundary markers is not recommended.
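Cases (2) through (6) of the definition above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a segment is a flat dictionary of feature names and values, the input stretch contains no boundary markers (so it has the same length as the rule's input sequence), and the matching of condition (1) is taken as already done. The name transform_stretch is hypothetical.

```python
from typing import Dict, List

Segment = Dict[str, str]  # assumption: feature name -> feature value


def transform_stretch(
    rule_in_len: int,          # length of PRISeq
    rule_out: List[Segment],   # PROSeq, as partial feature specifications
    stretch: List[Segment],    # PLISeq, assumed free of boundary markers
) -> List[Segment]:
    if rule_in_len != len(stretch):
        raise ValueError("sketch assumes no boundary markers in the stretch")
    if not rule_out:                     # case (3): deletion
        return []
    if rule_in_len == 0:                 # case (2): epenthesis
        return [dict(seg) for seg in rule_out]
    if rule_in_len == len(rule_out):     # case (4): feature change
        return [{**seg, **out} for seg, out in zip(stretch, rule_out)]
    if rule_in_len == 1:                 # case (5): one segment to several
        return [{**stretch[0], **out} for out in rule_out]
    if len(rule_out) == 1:               # case (6): several segments to one
        first, others = stretch[0], stretch[1:]
        shared = {name: value for name, value in first.items()
                  if all(seg.get(name) == value for seg in others)}
        return [{**shared, **rule_out[0]}]   # features of PRO1 take precedence
    raise ValueError("no provision for this combination (see Note 1)")
```

For example, applying a voicing rule whose output is [{"voice": "+"}] to the stretch [{"cons": "+", "voice": "-"}] gives [{"cons": "+", "voice": "+"}].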
If its structural description is met more than once in a given input, a phonological rule will apply to that sequence multiple times (cf. Kenstowicz and Kisseberth 1979, chapter 8). The way multiple application works in Hermit Crab depends on the setting of the field mult_applic for the rule (section 4.4.1.3 Definition of Phonetics of Multiple Application of a Phonological Rule). This field may have the value simultaneous (section 4.4.1.3.1), lr_iterative (section 4.4.1.3.2), or rl_iterative (section 4.4.1.3.3). Left-to-right iterative application is the default. The following subsections define the application of a phonological rule to a phonetic sequence under these three settings of the mult_applic field.
For the purposes of this specification, a rule is said to apply to a form when one of the following algorithms has been applied, regardless of whether the rule actually changes the input form. In other words, a rule "applies" whenever it is tried against an input string, regardless of whether its structural description is met by any part of that string.
The definitions below refer to application of phonological rules. Because of the difficulty of parsing forms to which deletion rules have been applied, Hermit Crab imposes an arbitrary restriction on the unapplication of deletion rules. (A deletion rule is one whose Phonetic Output Sequence is the empty list.) The application of deletion rules remains unchanged, but there is the possibility that during the analysis phase, a form will not be found that would have produced the correct surface form during the synthesis phase. This could happen if the variable *del_re_app* were set to zero (the default) and a deletion rule was self-opaquing (by virtue of deleting part of its own environment through multiple application). The solution is to set the variable *del_re_app* to a number higher than zero (probably one; setting it too high will cause the search space to expand greatly and likely result in severe slowing). This will cause the morpher to generate further forms in which the deletion rule has been unapplied to its own output, and should generate the forms from which iterative application of the deletion rule can later generate the surface form. See Phonological Rules—Deletion Rules (section 2.3.5) for further details.
As a result of the application of a set of phonological rules, the stratum to which a lexical entry belongs may change; see Storable Lexical Entries (section 3.3).
The application of a disjunctive rule set to a lexical entry differs from the application of a (simple) phonological rule (which is modeled as a disjunctive rule with a single subrule); see Definition of Phonetics of Application of a Disjunctive List of Phonological Rules, section 4.4.1.4.
If the mult_applic variable for the rule has the value simultaneous, the following describes the application of a phonological rule to a phonetic sequence.
Phonological rule PR transforms the phonetic sequence ILESeq into the phonetic sequence OLESeq, iff ILESeq is identical to OLESeq except that for every phonetic sub-sequence SSi = Seg1...Segj of ILESeq which matches against rule PR (see Definition above of a Match between a Phonological Rule and a Lexical Entry, section 4.4.1.1); and which, if the stratum is cyclic, contains one or more segments which have been changed or inserted since the beginning of this cycle, or which has had one or more segments deleted between Seg1 and Segj since the beginning of the cycle, the Input Stretch I1...Im of SSi has been transformed into the Phonetic Sequence O1...On by the application of PR.
Note 1: The special condition on the application of a cyclic phonological rule approximates the Strict Cycle Condition.
Note 2: There is no guarantee that the portions of ILESeq that matched against the Left and Right Environments of PR will still match in OLESeq. In other words, "why" opacity may occur.
Note 3: The input stretch of SSi should not overlap the input stretch of SSi+1. (This possibility can arise only if the input stretches contain more than one segment. The result of simultaneous application of a rule to overlapping sequences of segments is in the general case ill-defined.)
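A toy Python sketch of simultaneous application in a noncyclic stratum follows. It is illustrative only: characters of a string stand in for segments, environments are literal strings, the rule's input is assumed non-empty, and the names ToyRule and apply_simultaneous are hypothetical. The point it shows is that every match is located in the unchanged input, so one application of the rule cannot feed or bleed another.

```python
from dataclasses import dataclass


@dataclass
class ToyRule:
    left: str    # left environment (may be empty)
    target: str  # input sequence, assumed non-empty in this sketch
    right: str   # right environment (may be empty)
    output: str  # output sequence


def apply_simultaneous(rule: ToyRule, seq: str) -> str:
    n = len(rule.target)
    if n == 0:
        raise ValueError("this sketch does not handle epenthesis rules")
    # All matches are located in the unchanged input sequence.
    starts = [
        i for i in range(len(seq) - n + 1)
        if seq[i:i + n] == rule.target
        and seq[:i].endswith(rule.left)
        and seq[i + n:].startswith(rule.right)
    ]
    pieces, pos = [], 0
    for i in starts:
        if i < pos:  # Note 3: overlapping input stretches are ill-defined
            raise ValueError("overlapping input stretches")
        pieces.append(seq[pos:i])
        pieces.append(rule.output)
        pos = i + n
    pieces.append(seq[pos:])
    return "".join(pieces)
```

With the rule "a becomes b after a", the input "aaaa" becomes "abbb": each of the last three segments is preceded by "a" in the input, even though that environment no longer holds in the output (compare Note 2).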
If the mult_applic variable for the rule has the value lr_iterative (the default), the following describes the application of a rule to a phonetic sequence.
Phonological rule PR transforms the phonetic sequence ILESeq into the phonetic sequence OLESeq, by the following algorithm:
(1) Set TempSeq = ILESeq, and set CurSeg = the first segment of TempSeq.
(2) If PR matches against TempSeq, then set InStretch = the left-most input stretch of TempSeq such that the first segment of InStretch is CurSeg or to the right of CurSeg, and either
(a) the current rule stratum is noncyclic, or
(b) the portion of TempSeq which PR partitions, with InStretch as its input stretch, contains one or more segments which have been changed or inserted since the beginning of this cycle, or one or more segments have been deleted from that portion since the beginning of this cycle,
then set OutStretch = the result of applying PR to InStretch, and then replace InStretch in TempSeq with OutStretch.
Otherwise (if PR does not match against TempSeq while meeting the above requirements), then set OLESeq = TempSeq and exit.
(3) Set CurSeg to the first segment after OutStretch and go to step (2).
Note 1: Condition (2b) approximates the Strict Cycle Condition.
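The left-to-right algorithm can be sketched in Python for the noncyclic case, where condition (2a) always holds. This is a toy illustration: characters of a string stand in for segments, environments are literal strings, the rule's input is assumed non-empty, and the names ToyRule, find_stretch and apply_lr_iterative are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRule:
    left: str    # left environment (may be empty)
    target: str  # input sequence, assumed non-empty in this sketch
    right: str   # right environment (may be empty)
    output: str  # output sequence


def find_stretch(rule: ToyRule, seq: str, start: int) -> Optional[int]:
    """Index of the left-most input stretch beginning at or after start."""
    n = len(rule.target)
    for i in range(start, len(seq) - n + 1):
        if (seq[i:i + n] == rule.target
                and seq[:i].endswith(rule.left)
                and seq[i + n:].startswith(rule.right)):
            return i
    return None


def apply_lr_iterative(rule: ToyRule, seq: str) -> str:
    if not rule.target:
        raise ValueError("this sketch does not handle epenthesis rules")
    cur = 0                                   # step (1): CurSeg = first segment
    while True:
        i = find_stretch(rule, seq, cur)      # step (2): left-most input stretch
        if i is None:
            return seq                        # no further match: OLESeq = TempSeq
        seq = seq[:i] + rule.output + seq[i + len(rule.target):]
        cur = i + len(rule.output)            # step (3): first segment after OutStretch
```

With the rule "a becomes b after a", the input "aaaa" becomes "abab": once the second segment has changed, the third segment no longer has "a" to its left, so each application is evaluated against the partly transformed sequence.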
If the mult_applic variable for the rule has the value rl_iterative, the rule is applied iteratively from right to left. The algorithm is identical to that for left-to-right iterative application (see above), except for the obvious difference of direction.
For any given segment in a lexical entry, a disjunctive list of phonological rules may apply only once in a given stratum (unless the disjunctive rule belongs to a cyclic stratum, in which case it may apply only once in each cycle, as allowed by the principle of Strict Cyclicity). Furthermore, only one subrule of the disjunctive list may apply to that segment. (Note that "ordinary" phonological rules are modeled by disjunctive rules with a single subrule.)
Let disjunctive rule R be a list of subrules (R1...Rn), and LESeq a phonetic sequence (the input sequence). Then R applies to LESeq by the following algorithm:
(1) Set CurSeg = the first segment of LESeq.
(2) Set CurRule = R1.
(3) Test CurRule for a match beginning with CurSeg in LESeq.
(4) If CurRule matches LESeq beginning with CurSeg, let InStretch be the input stretch of LESeq beginning with CurSeg. Then set CurSeg to the first segment following InStretch; set LESeq to the result of applying CurRule to InStretch. If this moves CurSeg past the end of the word, exit, returning LESeq; else go to step 2.
Else (if CurRule does not match LESeq beginning with CurSeg), set CurRule = the next rule after CurRule. If there is no rule after CurRule, set CurRule = R1 and set CurSeg = the next segment after CurSeg. If this moves CurSeg past the end of the word, exit, returning LESeq; else go to step 2.
If the current rule stratum is cyclic, the stretch of LESeq matching CurRule must contain one or more segments which have been changed or inserted since the beginning of the cycle, or one or more segments must have been deleted from that stretch since the beginning of this cycle.
Note 1: Step 4 provides for vacuous application of a subrule to count as application, i.e. the first subrule which applies blocks other subrules even if it only applies vacuously.
Note 2: The above algorithm (like all algorithms in this specification) is not necessarily the most computationally efficient way to implement the process in question.
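The disjunctive algorithm can be sketched in Python for the noncyclic case. As with the algorithm itself, this is not meant to be efficient, and it rests on simplifying assumptions: characters of a string stand in for segments, environments are literal strings, and each subrule's input is non-empty. The names ToyRule, matches_at and apply_disjunctive are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyRule:
    left: str    # left environment (may be empty)
    target: str  # input sequence, assumed non-empty in this sketch
    right: str   # right environment (may be empty)
    output: str  # output sequence


def matches_at(rule: ToyRule, seq: str, pos: int) -> bool:
    n = len(rule.target)
    return (n > 0
            and seq[pos:pos + n] == rule.target
            and seq[:pos].endswith(rule.left)
            and seq[pos + n:].startswith(rule.right))


def apply_disjunctive(subrules: List[ToyRule], seq: str) -> str:
    cur = 0                                      # step (1): CurSeg = first segment
    while cur < len(seq):
        for rule in subrules:                    # steps (2)-(3): try R1, then R2, ...
            if matches_at(rule, seq, cur):       # step (4): first match wins
                seq = seq[:cur] + rule.output + seq[cur + len(rule.target):]
                cur += len(rule.output)          # first segment after the stretch
                break
        else:
            cur += 1                             # no subrule matched at CurSeg
    return seq
```

With the subrules "k becomes c before i" and "k becomes g elsewhere", the input "kika" becomes "ciga". If the first subrule instead rewrites k as k (a vacuous change), the result is "kiga": the vacuous application still blocks the second subrule at that segment, as Note 1 states.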
The application of a phonological rule PR to an input lexical entry ILE translates ILE into an output lexical entry OLE iff the application of rule PR to the Phonetic Shape of ILE results in the Phonetic Shape of OLE (see Definition of the Phonetics of Multiple Application of a Phonological Rule, section 4.4.1.3).
This section specifies the application of a set of phonological rules of a given stratum.
The ordering of such sets of rules of different strata or in different cycles with respect to each other, and with respect to morphological rules, is defined above (see Storable Lexical Entries, section 3.3).
Let the set of phonological rules of the stratum be PRSet = {PR1...PRn}, and let ILESeq be the input Phonetic Shape to which PRSet applies to produce the output Phonetic Shape OLESeq. (Again, "input" and "output" are used in the synthesis sense.) Each subsection below then defines the application of PRSet, according to the rule ordering of phonological rules for the current stratum, whether linear or simultaneous.
In addition to linear and simultaneous ordering, it is logically possible that a set of rules would be freely ordered, that is, the rules of the set would reapply to a given form until they produced no further change. In Kenstowicz and Kisseberth (1979, chapter 8), this is referred to as "the Free Reapplication Hypothesis." Hermit Crab does not implement this form of ordering, because (1) it is computationally expensive (and can lead to nontermination); and (2) few if any phonologists have proposed such ordering.
This definition applies to PRSet if the value of the p_rule_order field of the current stratum is linear.
Let PR1...PRn be the list of phonological rules in PRSet in order of application. Then OLESeq is derived from ILESeq by first applying rule PR1 to ILESeq, then applying PR2 to the output of PR1, etc., and finally applying PRn to the output of PRn–1.
Note: In Kenstowicz and Kisseberth (1979, chapter 8), this is referred to as "the Ordered Rule Hypothesis."
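Linear ordering amounts to composing the rules in sequence. A minimal Python sketch, in which each rule is modeled as a function from one string to another (an assumption made purely for illustration; apply_linear, raise_e and assibilate are hypothetical names):

```python
from functools import reduce
from typing import Callable, List

Rule = Callable[[str], str]  # assumption: a rule maps one phonetic shape to another


def apply_linear(rules: List[Rule], ile_seq: str) -> str:
    """Apply PR1 to the input, PR2 to the output of PR1, and so on."""
    return reduce(lambda seq, rule: rule(seq), rules, ile_seq)


# Two toy rules whose relative order matters.
def raise_e(seq: str) -> str:
    return seq.replace("e", "i")      # e becomes i


def assibilate(seq: str) -> str:
    return seq.replace("ti", "si")    # t becomes s before i
```

Here apply_linear([raise_e, assibilate], "te") gives "si", because the first rule creates the environment for the second, whereas apply_linear([assibilate, raise_e], "te") gives "ti".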
This definition applies to PRSet if the value of the p_rule_order field of the current stratum is simultaneous.
OLESeq is derived from ILESeq by the set of phonological rules PRSet iff, for every rule PRi in PRSet which matches against ILESeq, that rule has been applied to ILESeq to produce OLESeq.
Warning: Hermit Crab does not prevent two rules with contradictory effects from applying in such a way that one rule undoes the effect of the other, nor does Hermit Crab signal this situation.
Note: In Kenstowicz and Kisseberth (1979, chapter 8), this is referred to as "the Direct Mapping Hypothesis."
The following defines the application of a single Stratum of rules to a Lexical Entry.
4.5.1. Application of a Noncyclic Stratum
Let Si be a noncyclic stratum. Then the application to a lexical entry from stratum Si of one morphological rule of stratum Si produces a storable lexical entry of stratum Si. If stratum Si+1 is a cyclic stratum, then the application to a lexical entry from stratum Si of the relevant Affix Template (if any) of Si, followed by the application of all the phonological rules of stratum Si, followed by the erasure of any boundary markers, followed by the application of all the phonological rules of stratum Si+1, produces a storable lexical entry of stratum Si+1. Otherwise (if stratum Si+1 is a non-cyclic stratum), the application to a lexical entry from stratum Si of the relevant Affix Template (if any) of Si, followed by all the phonological rules of stratum Si, followed by the erasure of any boundary markers, produces a storable lexical entry of stratum Si+1.
4.5.2. Application of a Cyclic Stratum
Let Sj be a stratum of cyclic rules. Then the application of one or more cycles to a storable lexical entry from stratum Sj produces a storable lexical entry of stratum Sj. (A "cycle" is defined as the application of one morphological rule of the stratum, followed by the application of all phonological rules of that stratum, followed by the erasure of any boundary markers.) If stratum Sj+1 is also a cyclic stratum, then the application of all the phonological rules of stratum Sj+1 to a storable lexical entry of stratum Sj, followed by the application of the relevant Affix Template (if any) of Sj, produces a storable lexical entry of stratum Sj+1. Otherwise (if stratum Sj+1 is a non-cyclic stratum), a storable lexical entry of stratum Sj to which the relevant Affix Template (if any) of Sj has been applied is also a storable lexical entry of stratum Sj+1.
4.6. Definition of Generation of a Surface Lexical Entry
For convenience, the pseudo-stratum *surface* is defined as the final stratum; it has no rules and is considered a non-cyclic stratum for purposes of the following definition. (That is, a lexical entry belonging to the *surface* stratum may have no further rules, morphological or phonological, applied to it. The user should not define another stratum with the name *surface*.)
Let LE be a lexical entry of stratum S1 to which no morphological or phonological rules have applied, and let RzHF be a set of Head Features which are to be realized on LE. Then LE may be converted into a Derived Lexical Entry of the Surface Stratum by first setting the Head Features list of LE to RzHF plus any non-conflicting features of the existing Head Features of LE, then applying all the Strata beginning with S1 through the Surface Stratum in order.
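The initial setting of the Head Features can be sketched in Python as a merge in which RzHF takes precedence. This is an illustration only: features are modeled as a flat dictionary from feature name to value, and merge_head_features is a hypothetical name.

```python
from typing import Dict

Features = Dict[str, str]  # assumption: feature name -> feature value


def merge_head_features(rzhf: Features, existing: Features) -> Features:
    """Return RzHF plus any non-conflicting features of the existing Head Features."""
    merged = dict(rzhf)
    for name, value in existing.items():
        # A feature already given by RzHF wins; the existing value is kept
        # only when RzHF says nothing about that feature.
        merged.setdefault(name, value)
    return merged
```

For example, merging RzHF {"tense": "past"} with existing Head Features {"tense": "pres", "person": "3"} gives {"tense": "past", "person": "3"}.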
5. Data Structures
The data input to the morpher module for the commands morph_and_lookup_word and morph_and_lookup_list is the output of the Preprocessor module (see chapter five), and contains the data to be morphed. To summarize that chapter: the input to the morpher is a list of one or more Token Record data structures, each containing the print form of the word and its normalized form, and representing a single word of the input string.
The Phonetic Shape field of those records is visible to the morpher, while the Orthographic Shape field is invisible to the morpher rules (although the morpher module passes it on to downstream modules in the Orthographic Shape field of Lexical Entry records).
The function morph_and_lookup_word accepts a list of length three; each member of the list is a Token Record data structure, and the members represent a single input word plus the preceding and following words, in that order. The function morph_and_lookup_list accepts a list of Token Record data structures of any length. The morpher morphs each word separately; the previous word and the following word (if any) are, however, accessible to phonological rules through the phonological rule fields prev_word and next_word.
The inputs to the morpher module for the commands generate_word, apply_stratum, and apply_morpher_rule are similar, and are described under each command.
Lexical Entries are record structures; as described above (see Lexical Entries, section 3), each lexical entry represents a root, stem or word. The Lexical Entry data structure is used in the lexicon and in the output of the morpher. (A nearly identical structure is used in the syntactic parser to represent terminal nodes; see chapter seven, Parse Tree Format—Terminal Node Record Structure.)
This section describes the record structure of a lexical entry.
Note: The Lexical Entry structure may be augmented in future versions of Hermit Crab by the addition of fields, e.g. for indicating functional structure.
Record Label: lexical_entry
Fields:
Optionality: obligatory
Label: id
Type: string
Contents: A code which uniquely identifies this lexical entry data structure.
Purpose: used in debugging to refer to lexical entries.
A derived lexical entry inherits the lex ID of the lexical entry from which it is derived.
A real lexical entry's lex ID remains valid during a single session of Hermit Crab; a virtual lexical entry's lex ID remains valid only until the next time either the function morph_and_lookup_word or the function morph_and_lookup_list is called. Deleting a (real) lexical entry also causes its lex ID to become invalid, as does resetting the lexicon (see reset_lexicon, section 6.4.6).
Optionality: obligatory in Real Lexical Entries; pertains to Virtual Lexical Entries only during debugging
Label: sh
Type: string
Contents: A string which represents the phonological form of the lexical entry. For lexical entries which represent entire tokens in the input, this field is copied from the field of the same name in the input Token Record data structure; in the case of lexical entries in the lexicon, it is the result of lexical lookup. In the case of virtual lexical entries, this field is translated from the phonetic sequence which represents its phonological form; this translation is only necessary when matching a storable lexical entry against a real lexical entry, or during debugging.
Implementation note: The translation of the phonetic sequence of a virtual lexical entry into a string may be ambiguous; see Translation from Phonetic Sequence to Regular Expression, section 4.1.1.2.
Optionality: optional, used only in Real Lexical Entries
Label: fam
Type: atom
Contents: Gives the family to which a given (real) lexical entry belongs.
Purpose: To allow blocking of derivations by irregular forms listed in the lexicon.
It may be useful for the shell to treat families of lexical entries as units when the user is editing lexical entries, so that changes to one member of the family are consistently propagated to others. An inheritance schema is one way this might be implemented.
Optionality: optional
Label: gl
Type: string
Contents: A translation of the lexical item as listed in the dictionary (for real lexical entries) or as morphed (for virtual lexical entries).
If this field is empty in a real lexical item, the default string "?" is used, as described below (see Morphological Rule Notation—Gloss String, section 7.2.1.14).
Purpose: To represent the morpher's analysis of the word's meaning. The intention is that it will contain the translation of one or more of the morphemes composing the word. This field may also be used by the Display Module as a label for the word.
Glosses are shown in Hermit Crab’s output if the global variable *show_glosses* is true (default), otherwise they are not included.
Optionality: obligatory
Label: pos
Type: atom
Contents: The name of the part of speech of the lexical item.
Optionality: optional
Label: sub
Type: list
Contents: A list of atoms, each one of which is the name of a syntactic (parser) rule which the lexical item subcategorizes. If this field is absent, the lexical item does not subcategorize any rules.
Purpose: To allow the lexical item to subcategorize certain syntactic rules. Morphological rules may also be constrained to require that the lexical entry to which they apply subcategorize a specified rule.
Warning: The morpher does not check whether the rules in this list actually exist in the parser's rulebase.
Optionality: empty
Label: gf
Type: atom
Purpose: This field is meant to carry information specified in syntactic rules as to the function of this node. This information is added by the Parser and/or Functional Structure Modules; the field is always empty in the Morpher module, and may therefore be omitted from all lexical entries within this module. (It is mentioned here only for completeness.)
Optionality: optional (defaults to "?")
Label: mrs
Type: list
Contents: The names (atoms) of the morphological rules (if any) which have applied to form this lexical entry; left-to-right order of this list represents the order in which morpher rules applied to produce this lexical entry (in the synthesis sense). This field will often be the empty list for real lexical entries. However, if a real lexical entry represents a stem, rather than a root, it may be desirable to indicate the morphological rules which "would have" applied, in order to prevent their applying. (For instance, if the irregular past tense verb ran is listed in the lexicon, its lexical entry might list the past tense rule as having applied, to avoid generating *ranned.)
Purpose: Used to prevent multiple application of morphological rules, and in debugging.
Optionality: obligatory
Label: str
Type: atom
Contents: The name of a rule stratum.
Purpose: This encodes the stratum of rules which may apply to this lexical entry.
The value of *surface* means that no more rules may apply to the lexical entry (it is a surface form).
For real lexical entries, the value of this field must be supplied by the user. For virtual lexical entries, the value is automatically supplied by the morpher.
See also: Storable Lexical Entries (section 3.3)
Optionality: optional
Label: rf
Type: list
Contents: zero or more atoms, each of which is the name of a Morphological/ Phonological Rule (MPR) feature.
Purpose: These rule features govern which morphological or phonological rules a lexical entry will exceptionally undergo or not undergo. They may be used to encode such things as conjugation class and gender.
If this field is absent, the lexical entry has no MPR features.
If membership in a conjugation class or gender class is important in the syntax, the class membership should be indicated as a Head Feature, since syntactic rules make reference only to Head and Foot Features. Head and Foot Features are visible both to morphological and phonological rules, and to syntactic (phrase structure) rules, whereas MPR features are visible only to morphological/ phonological rules.
Optionality: optional
Label: hf
Type: list-valued feature list
Purpose: This list represents the assigned (non-default) Head Features of the lexical entry.
If this field is absent, the values of all Head Features of the lexical entry are the default values.
See also: Foot Features (section 5.2.12); Morphological/ Phonological Rule Features (5.2.10)
Optionality: optional
Label: ff
Type: list-valued feature list
Purpose: This list represents the assigned (non-default) Foot Features of the lexical entry.
If this field is absent, the values of all Foot Features of the lexical entry are the default values.
Foot features are invisible to phonological rules.
See also: Head Features (section 5.2.11); Morphological/ Phonological Rule Features (5.2.10)
Optionality: optional
Label: of
Type: list
Contents: A list of atoms, each of which is the name of a Head Feature.
Purpose: For each feature-name listed, some value must be assigned to that feature by the end of the derivat