Implementation of the ESIS file parser
From:
Importing SGML data into CELLAR by means of architectural forms
Gary F. Simons
Summer Institute of Linguistics
Last revised: 12 November 1997
Contents
- Overview
- Sample of ESIS input
- An introduction to CELLAR's sublanguage for defining parsers
- The main recursive function
- The supporting code
1. Overview
This document describes the implementation of the parser that converts an SGML document (which has been mapped onto CELLAR architectural forms) into objects in the CELLAR database. (See the main paper for an explanation of the mapping process.) The input is actually the ESIS file output by the SGML parser. After a brief introduction to the sublanguage used to define parsers in CELLAR, the source code is given for the main parsing function, which recursively handles one SGML element at a time. Finally, all the smaller supporting functions are presented.
2. Sample of ESIS input
The output of the mapping performed by the nsgmls parser is not actually in SGML format. Rather it is a simple text representation of the architectural document's Element Structure Information Set. (See http://www.sil.org/sgml/topics.html#esis for information about ESIS.) It is a file in this ESIS format that the CELLAR parser reads. See http://jclark.com/sp/sgmlsout.htm for an explanation of the nsgmls output format.
For instance, here is the beginning of the ESIS output file for the CriticalText example (see complete file). Note that attribute lines for attributes with IMPLIED values have been omitted to improve readability.
ACLASS CDATA CriticalText
APCDATACLASS CDATA String
(OBJECT
(IGNORE
(IGNORE
ACONTENTATTR CDATA title
APCDATACLASS CDATA String
(ATTR
-2 Clement, chapter 7
)ATTR
ACONTENTATTR CDATA authorities
APCDATACLASS CDATA String
(ATTR
ACLASS CDATA Manuscript
ACONTENTATTR CDATA description
APCDATACLASS CDATA String
AID TOKEN A
AATTRNAME CDATA siglum
AATTRVALUE CDATA A
AATTRTYPE CDATA String
(OBJECT
-Codex Alexandrinus\n
ACONTENTATTR CDATA source
APCDATACLASS CDATA String
(ATTR
-A Greek uncial of the fifth century. Housed in the British \nMuseum. Published in: The Codex Alexandrinus in reduced photographic\n facsimile, with an introduction by F. G. Kenyon, London 1909.
)ATTR
)OBJECT
...
The first character of each line is a command character that identifies the kind of information on the line:
- A specifies an attribute of the next element
- (gi marks the start of an element whose generic identifier is gi
- )gi marks the end of an element whose generic identifier is gi
- - marks character data
The A command has three fields: attribute name, type, and value. In character data, \n represents a line end.
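As an illustration of the line format just described, the following Python sketch splits one ESIS line into its command character and fields. It is not part of the CELLAR implementation; the names EsisLine and parse_esis_line are hypothetical, and only the four commands listed above are handled. The \n codes in character data are left untouched.

```python
from typing import NamedTuple, Optional


class EsisLine(NamedTuple):
    """One line of ESIS output (hypothetical helper type for this sketch)."""
    command: str               # 'A', '(', ')' or '-'
    name: Optional[str]        # attribute name or generic identifier
    attr_type: Optional[str]   # attribute type (A lines only)
    value: Optional[str]       # attribute value or character data


def parse_esis_line(line: str) -> EsisLine:
    """Split one ESIS line into its command character and fields."""
    command, rest = line[0], line[1:]
    if command == "A":
        # Name, type, value. The value may contain spaces, and it is
        # absent when the type is IMPLIED.
        parts = rest.split(" ", 2)
        attr_type = parts[1] if len(parts) > 1 else None
        value = parts[2] if len(parts) > 2 else None
        return EsisLine(command, parts[0], attr_type, value)
    if command in "()":
        # Start or end of an element; the rest is the generic identifier.
        return EsisLine(command, rest, None, None)
    if command == "-":
        # Character data; '\n' codes are not decoded here.
        return EsisLine(command, None, None, rest)
    raise ValueError(f"unsupported ESIS command: {command!r}")
```

For example, parse_esis_line("AID TOKEN A") yields the attribute name ID, the type TOKEN, and the value A.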
3. An introduction to CELLAR's sublanguage for defining parsers
CELLAR's model of class definitions for objects [RST93] includes not only the attributes which store information, but also:
- queries for retrieving information
- methods for manipulating the information
- views for displaying the information in a predefined format
- parsers for building instances of objects from data in text files
The programming language built into CELLAR has a sublanguage for each of the above purposes. The full documentation for these is given in the books CELLAR Programmer's Tutorial and CELLAR Programmer's Reference in the infobase file \LIBRARY\WRKLNK.NFO on the LinguaLinks CD-ROM [SIL97].
The function that converts the data in the ESIS file to the equivalent structure of objects in CELLAR is implemented as a parser. A parser definition has the following form:
parser-name( parameter-list ) --> local-variable-declarations pattern
The body of the parser is a regular-expression-like pattern. The following metacharacters are used to express a pattern:
{ } encloses a set of alternative patterns
( ) groups a sequence of patterns into a single pattern
? an optional pattern
* the pattern may match zero or more times
+ the pattern must match one or more times
The following are some of the primitive patterns out of which complex patterns can be built:
'xxx' match a literal xxx
nl match a newline (i.e. line end)
blank match a span of spaces, tabs, and newlines
c.n invoke parser named n for class c
v:=p set variable v to the value returned by parsing pattern p
For instance, the following pattern assumes that test and action have been declared as local variables. It repeatedly matches one of three alternatives: a line beginning with 'if' (in which case it puts the contents of the line into test), a line beginning with 'do' (in which case it puts the contents of the line into action), or any other line (in which case it does nothing with the contents). String.upToNL is a built-in parser for class String which matches (and returns) all the characters up to (but not including) the next newline:
*{ ( 'if' blank test:=String.upToNL nl )
( 'do' blank action:=String.upToNL nl )
( String.upToNL nl )
}
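For readers unfamiliar with the pattern notation, the behaviour of the pattern above can be approximated in Python as follows. This is only a rough sketch: the function name scan is hypothetical, blank is simplified to spaces and tabs, and it assumes that a later 'if' or 'do' line overwrites the value stored by an earlier one.

```python
import re
from typing import Optional, Tuple

# 'if' or 'do', then a span of blanks, then the rest of the line.
_KEYWORD_LINE = re.compile(r"(if|do)[ \t]+(.*)")


def scan(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (test, action) as collected from the lines of text."""
    test: Optional[str] = None
    action: Optional[str] = None
    for line in text.splitlines():
        match = _KEYWORD_LINE.match(line)
        if match is None:
            continue  # third alternative: consume the line, keep nothing
        keyword, contents = match.groups()
        if keyword == "if":
            test = contents
        else:
            action = contents
    return test, action
```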
There are two special pseudo-patterns that provide an interface to the sublanguage for expressing actions and queries:
do( ) perform the enclosed action (which may return a value)
test( ) treat the enclosed boolean query as a pattern that succeeds or fails
Some key constructs of that sublanguage are:
^v get value of the local variable or parameter named v
!g get value of the global variable named g
a of e get attribute a of each object returned by expression e
do m to e execute method m on each object returned by expression e
[ ] create a sequence of objects
m over s execute method m over the sequence s as a whole
This brief introduction to the programming language should make it possible to read the source code which follows.
4. The main recursive function
The heart of the implementation is a parser named ESISelement which is called recursively for each element and each instance of character data in the ESIS file. The loadESISfile method (which is defined below) opens the ESIS file and sets up the parameters for the top-level call to the ESISelement parser.
The complete source code of the ESISelement parser follows, with commentary inserted between the major sections of the code. At the beginning is a header comment which summarizes the effect of the parser and the functions of its parameters:
/* ESISelement : a ParserDefn on Object ****************************************
Description:
Reads one element out of an ESIS (i.e. parsed SGML) file
that has been mapped onto CELLAR architectural forms and builds
the CELLAR object that corresponds to it (along with everything it
recursively owns).
Parameters:
currentObject -- This is the object that is currently being built.
It will be the owner of the object that this call will create.
currentAttr -- This is the attribute of the currentObject that new
content is currently being added to. If the value of this
parameter is DISCARD, then the potential value is discarded.
idTable -- This is the table of SGML ID to CELLAR object associations.
currentEncoding -- This is the encoding that strings in element
content should be put into.
pcdataClass -- The class of basic object to create for PCDATA.
textBefore, textAfter, textBetween -- Strings to add before, after,
and between PCDATA that become Strings.
Side Effects:
This parser does not return an object. Rather it modifies
the currentObject which was called by reference.
************************************************************************************************/
Next come the declarations of the parser name, the parameters, and the local variables.
ESISelement ( currentObject : Object default is lit. missing,
currentAttr : String default is lit. missing,
idTable : IdTable default is lit. missing,
currentEncoding : Encoding default is lit. missing,
pcdataClass : String default is 'String',
textBetween : String default is lit. missing,
textBefore : String default is lit. missing,
textAfter : String default is lit. missing ) -->
var void, cdata, newObject, newObject2,
class, contentAttr, parentAttr, id, attrName, attrValue, attrType,
class2, contentAttr2, attrName2, attrValue2, attrType2
The first action taken in the parser is to read all the architectural attributes for the next element; in the ESIS file these are in A lines. This is a large repeatable alternatives pattern. On each repetition, one parenthesized pattern matches. Since the asterisk operator makes it optional, it simply falls through if there are no attributes at all; this is the case when the next thing in the ESIS file is a line of data (i.e. beginning with the hyphen (-) code). In the ESIS file, an attribute value of IMPLIED is given when the SGML document specifies no value for an architectural attribute; the parser must explicitly match these lines in order to consume them. For an attribute that allows a conditional expression, the value is read by using the String.ESISevalExpr parser (which evaluates the expression as it is being parsed and returns just the value appropriate for the current context); otherwise, the value is simply copied from the file by using the String.upToNL parser (nl = newline).
/* First, read the attributes */
*{ ('ACLASS CDATA ' class:=String.ESISevalExpr(^currentObject) nl)
('ACONTENTATTR IMPLIED' nl)
('ACONTENTATTR CDATA ' contentAttr:=String.ESISevalExpr(^currentObject) nl)
('APARENTATTR IMPLIED' nl)
('APARENTATTR CDATA ' parentAttr := String.ESISevalExpr(^currentObject) nl
do( if exists of ^parentAttr then currentAttr:=^parentAttr) )
('APCDATACLASS IMPLIED' nl)
('APCDATACLASS CDATA ' pcdataClass:=String.ESISevalExpr(^currentObject) nl)
('AATTRNAME IMPLIED' nl)
('AATTRNAME CDATA ' attrName:=String.ESISevalExpr(^currentObject) nl)
('AATTRVALUE IMPLIED' nl)
('AATTRVALUE CDATA ' attrValue:=String.upToNL nl)
('AATTRTYPE IMPLIED' nl)
('AATTRTYPE CDATA ' attrType:=String.ESISevalExpr(^currentObject) nl)
('AID IMPLIED' nl)
('AID TOKEN ' id:=String.upToNL nl)
('AENCODING IMPLIED' nl)
('AENCODING CDATA ' cdata:=String.ESISevalExpr(^currentObject) nl
do( currentEncoding:= encodingWithCode(^cdata) of !Configuration ) )
('ACLASS2 CDATA ' class2:=String.ESISevalExpr(^currentObject) nl)
('ACONTENTATTR2 IMPLIED' nl)
('ACONTENTATTR2 CDATA ' contentAttr2:=String.ESISevalExpr(^currentObject) nl)
('AATTRNAME2 IMPLIED' nl)
('AATTRNAME2 CDATA ' attrName2:=String.ESISevalExpr(^currentObject) nl)
('AATTRVALUE2 IMPLIED' nl)
('AATTRVALUE2 CDATA ' attrValue2:=String.upToNL nl)
('AATTRTYPE2 IMPLIED' nl)
('AATTRTYPE2 CDATA ' attrType2:=String.ESISevalExpr(^currentObject) nl)
('ATEXTBEFORE IMPLIED' nl)
('ATEXTBEFORE CDATA ' textBefore:=String.ESISevalExpr(^currentObject) nl)
('ATEXTAFTER IMPLIED' nl)
('ATEXTAFTER CDATA ' textAfter:=String.ESISevalExpr(^currentObject) nl)
('ATEXTBETWEEN IMPLIED' nl)
('ATEXTBETWEEN CDATA ' textBetween:=String.ESISevalExpr(^currentObject) nl)
('A' void:=String.upToNL nl /* Ignore anything else */ ) }
The remainder of the parser is a large alternatives pattern. There are five alternatives: one for PCDATA and the others for the four architectural forms--IGNORE, ATTR, OBJECT, and DOUBLE.
{ /*Then perform the action that corresponds to the architectural form */
When the next line of the ESIS file is a line of character data, we first test whether the currentAttr is set to 'DISCARD'; if so, we do nothing more, which has the effect of discarding the data. Otherwise, we must add the data to the current attribute of the current object. There are two if statements. The first converts the string of data from the ESIS file into the appropriate CELLAR object. In the case of String and Text, a special ESIS parser (defined below) is used that handles the line break (\n) codes and sets the encoding appropriately. Furthermore, if we are building a String, the textBefore, textAfter, and textBetween strings are concatenated to it. The second if statement handles putting the data into the currentAttr of the currentObject. If the data item is a String or Text and the target attribute can store only a single value, then the new data is concatenated to the end of the existing attribute value; otherwise, the new value is just appended to the attribute (which overwrites an atomic attribute and adds another value to a sequence attribute).
/* On DATA: add an object of pcdataClass to the currently open attr */
('-' cdata:=String.upToNL nl
do( if (^currentAttr ~= 'DISCARD') then
begin
if (^pcdataClass = 'String')
then newObject := join over [
if storesValue(^currentAttr) of ^currentObject then ^textBetween,
^textBefore,
parse ^cdata using String.ESIS( ^currentEncoding ),
^textAfter ]
else if (^pcdataClass = 'Text')
then newObject := parse ^cdata using Text.ESIS( ^currentEncoding )
else newObject := parse ^cdata using Action(!Action(^pcdataClass)).default
if ( accepts(^newObject) of !Text and
atomic of attrDefnFor(^currentAttr) of class of ^currentObject )
then set Action( ^currentAttr) of ^currentObject to
join over [ Action( ^currentAttr) of ^currentObject, ^newObject ]
else append ^newObject to Action(^currentAttr) of ^currentObject
end )
)
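The way string data is combined with textBefore, textAfter, and textBetween can be sketched in Python as follows. This is an illustration only: the function add_pcdata and the use of a plain dictionary as the current object are hypothetical, and it assumes that storesValue means the attribute already holds a value, so that textBetween is added only from the second piece of data onward.

```python
from typing import Dict, Union

Value = Union[str, list]


def add_pcdata(obj: Dict[str, Value], attr: str, data: str, *,
               atomic: bool, text_before: str = "",
               text_after: str = "", text_between: str = "") -> None:
    """Add one piece of string data to obj[attr]."""
    if attr == "DISCARD":
        return  # the data is dropped
    has_value = bool(obj.get(attr))
    # textBetween is used only when the attribute already stores a value.
    piece = (text_between if has_value else "") + text_before + data + text_after
    if atomic:
        # Single-valued attribute: concatenate onto the existing string.
        obj[attr] = obj.get(attr, "") + piece
    else:
        # Sequence attribute: add another value.
        obj.setdefault(attr, []).append(piece)


entry: Dict[str, Value] = {}
add_pcdata(entry, "title", "2 Clement", atomic=True, text_between="; ")
add_pcdata(entry, "title", "chapter 7", atomic=True, text_between="; ")
```

After the two calls, the title value in entry is the single string "2 Clement; chapter 7".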
When the next line of the ESIS file is an IGNORE element, we don't do anything at this level. We make a recursive call to parse zero or more embedded ESIS elements, and then match the end tag for the IGNORE.
/* On IGNORE: recurse as though this element wasn't there */
( '(IGNORE' nl
*Object.ESISelement( ^currentObject, ^currentAttr, ^idTable, ^currentEncoding,
^pcdataClass, ^textBetween, ^textBefore, ^textAfter )
')IGNORE' nl
)
When the next line of the ESIS file is an ATTR element, we simply recurse with the value given for the contentAttr architectural attribute as the new value of the currentAttr parameter. Since textBefore and textAfter are not architectural attributes of ATTR, no value is given for these parameters in the recursive call. This has the effect of supplying the default values which are declared to be "missing" (or empty).
/* On ATTR: recurse with new contentAttr */
( '(ATTR' nl
*Object.ESISelement( ^currentObject, ^contentAttr, ^idTable, ^currentEncoding,
^pcdataClass, ^textBetween )
')ATTR' nl
)
When the next line of the ESIS file is an OBJECT element, we recurse with a new value of the currentAttr parameter as for ATTR. But first, we also create a new object to pass as a new value of currentObject. The creation of the new object proceeds in five steps:
- create the type of object specified by the class architectural attribute,
- append the object to the currentAttr of the currentObject,
- add an ID-to-object association (see below) if the id architectural attribute is set,
- set the CELLAR attribute specified by the architectural attributes attrName, attrType, and attrValue (see below), and
- set the attribute for attrName2, attrType2, and attrValue2.
When the currentAttr is specified as 'DISCARD', the new object does not get added to the current object. Although everything else happens, including the recursive building of the new object, the effect is that it gets discarded since it never gets added to the current object.
/* On OBJECT: create it, add to currentAttr, and recurse */
( '(OBJECT' nl
do( begin
newObject := create of !Action(^class)
if (^currentAttr ~= 'DISCARD')
then append ^newObject to Action(^currentAttr) of ^currentObject
if exists of ^id then do add(^id, ^newObject) to ^idTable
do ESISattribute( ^attrName, ^attrType, ^attrValue, ^idTable) to ^newObject
do ESISattribute( ^attrName2, ^attrType2, ^attrValue2, ^idTable) to ^newObject
end )
*Object.ESISelement( ^newObject, ^contentAttr, ^idTable, ^currentEncoding,
^pcdataClass, ^textBetween )
')OBJECT' nl
)
The behavior for the final alternative, the DOUBLE element, is similar to that for OBJECT. The difference is that two objects are created. The first object is put into the currentAttr of the currentObject, the second object is put into the contentAttr of the first object, and embedded content when the parser recurses is put into the contentAttr2 of the second object. Note that attrName, attrType, and attrValue apply to the first object, while attrName2, attrType2, and attrValue2 apply to the second object. Note too that assigning an ID in this construction is not yet supported; the plan is to provide architectural support for assigning it to either the first object or the second object.
/* On DOUBLE: create both objects and recurse */
( '(DOUBLE' nl
do( begin
newObject := create of !Action(^class)
if (^currentAttr ~= 'DISCARD')
then append ^newObject to Action(^currentAttr) of ^currentObject
do ESISattribute( ^attrName, ^attrType, ^attrValue, ^idTable) to ^newObject
newObject2 := create of !Action(^class2)
if (^contentAttr ~= 'DISCARD')
then append ^newObject2 to Action(^contentAttr) of ^newObject
do ESISattribute( ^attrName2, ^attrType2, ^attrValue2, ^idTable) to ^newObject2
end )
*Object.ESISelement( ^newObject2, ^contentAttr2, ^idTable, ^currentEncoding,
^pcdataClass, ^textBetween )
')DOUBLE' nl
)
}
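The overall shape of the recursion may be easier to see in a stripped-down Python sketch. The class Node and the function parse_element are hypothetical stand-ins, not CELLAR code. The sketch handles only character data and the IGNORE, ATTR, and OBJECT forms; it omits DOUBLE, IDs, conditional expressions, encodings, and the text parameters. As a simplification of this sketch only, a missing contentAttr is treated as DISCARD.

```python
from typing import Dict, List


class Node:
    """Hypothetical stand-in for a CELLAR object."""

    def __init__(self, class_name: str) -> None:
        self.class_name = class_name
        self.attrs: Dict[str, list] = {}


def parse_element(lines: List[str], pos: int, current: Node,
                  current_attr: str) -> int:
    """Consume one element or data line; return the next position."""
    arch: Dict[str, str] = {}
    # First, read the architectural attributes (A lines).
    while pos < len(lines) and lines[pos].startswith("A"):
        parts = lines[pos][1:].split(" ", 2)
        if len(parts) == 3:  # IMPLIED lines carry no value
            arch[parts[0]] = parts[2]
        pos += 1
    if pos >= len(lines):
        raise ValueError("unexpected end of ESIS input")
    line = lines[pos]
    pos += 1
    if line.startswith("-"):
        # On DATA: add it to the currently open attribute.
        if current_attr != "DISCARD":
            current.attrs.setdefault(current_attr, []).append(line[1:])
        return pos
    # Simplification of this sketch: no contentAttr means DISCARD.
    content_attr = arch.get("CONTENTATTR", "DISCARD")
    if line == "(IGNORE":
        target, attr = current, current_attr
    elif line == "(ATTR":
        target, attr = current, content_attr
    elif line == "(OBJECT":
        target, attr = Node(arch["CLASS"]), content_attr
        if current_attr != "DISCARD":
            current.attrs.setdefault(current_attr, []).append(target)
    else:
        raise ValueError(f"unexpected ESIS line: {line!r}")
    end_tag = ")" + line[1:]
    # Recurse over the embedded content until the matching end tag.
    while pos < len(lines) and lines[pos] != end_tag:
        pos = parse_element(lines, pos, target, attr)
    if pos >= len(lines):
        raise ValueError(f"missing {end_tag}")
    return pos + 1


sample = [
    "ACLASS CDATA CriticalText", "(OBJECT",
    "(IGNORE",
    "ACONTENTATTR CDATA title", "(ATTR", "-2 Clement, chapter 7", ")ATTR",
    ")IGNORE",
    "ACONTENTATTR CDATA authorities", "(ATTR",
    "ACLASS CDATA Manuscript", "ACONTENTATTR CDATA description", "(OBJECT",
    "-Codex Alexandrinus", ")OBJECT",
    ")ATTR",
    ")OBJECT",
]
root = Node("RootFolder")
end = parse_element(sample, 0, root, "contents")
```

Running the sketch on the shortened sample leaves one CriticalText node in the contents of root, with a title value and one Manuscript node under authorities.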
5. The supporting code
This section gives the source code for all the smaller parsers and methods that support the main parser.
5.1 The method for loading an ESIS file
The ESISelement parser cannot be called on an ESIS file directly. Rather, it needs a driver function to set up the parameters and then call it. The function is named loadESISfile and it is a method defined on the RootFolder. When executed, this method does the following:
- invokes a file open dialog to allow the user to select the ESIS file to load
- creates an empty IdTable to use for storing ID-to-object associations
- calls the ESISelement parser on the file with the parameters set such that the object that corresponds to the ESIS file is appended to the contents of the RootFolder and that by default the strings in the resulting object will use the encoding which is the default for this CELLAR installation
- ignores the C code in the last line of the ESIS file which indicates that the original document was SGML conforming
The source code is as follows:
/* loadESISfile : a MethodDefn on RootFolder **********************************
Description:
Reads an ESIS (parsed SGML) file that is mapped onto CELLAR
architectural forms, builds the corresponding CELLAR object, and
appends it to the contents of the RootFolder
Returns: Nothing
Side Effects: Adds new item to end of contents of RootFolder
************************************************************************************************/
loadESISfile : means
begin
var file, idTable
file := do getFilePathName ( "ESIS file to load" ) to !System
idTable := create of !IdTable
parse ^file using ( Object.ESISelement( self, "contents", ^idTable,
defaultEncoding of !Configuration )
/* The file may have a C at end to signal that
it was an SGML conforming document */
?'C' ?blank )
end
5.2 The parser for evaluating conditional expressions
The value of an architectural attribute may be a conditional expression. (The syntax of conditional expressions is reviewed in the header comment in the parser definition below.) When the CELLAR parser reads the value of an architectural attribute, it does so with the String parser named ESISevalExpr. This parser implements the logic which evaluates the conditional expression. The only condition that can be tested in these conditional expressions is the class of the object currently being built. Thus the parser is passed that object in a parameter named currentObject. The source code is as follows:
/* ESISevalExpr : a ParserDefn on String ****************************************
Description:
Evaluates a guarded expression in ESIS file to determine
appropriate class or attribute name for this context.
guarded-expression ::= guarded-case* otherwise-case
guarded-case ::= "if" current-class target-value
otherwise-case ::= target-value
target-value ::= quoted-string | cellar-name | "MISSING"
The guarded-cases are tested in order. If current-class is the
class of the current object, then that target-value is returned.
Otherwise, the value for the otherwise-case is returned.
A target-value of "MISSING" returns nothing
Examples:
heading
if CaptionedChunk caption heading
if CaptionedChunk caption if Article titleField heading
Parameters:
currentObject -- The object currently being built
************************************************************************************************/
ESISevalExpr( currentObject : Object ) -->
var current, target, void
{ /* If there's an "if", process a guarded-case */
( 'if' blank current:=String.cellarName blank
target:={ 'MISSING' String.doubleQuoted
String.singleQuoted String.cellarName} blank
/* If this current-class matches the current object, return target-value */
{ ( test( (name of class of ^currentObject = ^current) )
void:=String.upToNL do( ^target ) )
/* Else recurse to eval the rest of the expression */
String.ESISevalExpr( ^currentObject )
}
)
/* Otherwise, return the otherwise-case */
{ 'MISSING' String.doubleQuoted
String.singleQuoted String.cellarName }
}
At the top level, this parser is a pattern with two alternatives:
- if the string begins with if, then there is a guarded case to be tested;
- otherwise, the string gives an unconditional value.
When there is a guarded case, the name of the current class in the expression and the corresponding target value are read by the parser. If the current class just read is the same as the name of the class of object currently being built (currentObject), then the parser throws away the rest of the conditional expression (i.e. everything to the end of the line) and returns the target value. Otherwise, the parser calls itself recursively to process the remainder of the expression.
The target value of a guarded case or of the otherwise case is matched by a pattern with four alternatives. The target value may be the literal string 'MISSING', in which case no value is actually returned. It may be a double-quoted string or it may be a single-quoted string; in either case, the appropriate string parser is called to return the value of the string between the quotes. Finally, the value may be a string that is not quoted; this is parsed by the cellarName parser. Note, however, that the latter allows no punctuation or spaces in the string; thus a quoted string must be used when punctuation or spaces are required in the value.
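The evaluation rule for guarded expressions can be restated as a short Python sketch. It is not the CELLAR parser: the function eval_guarded is hypothetical, it takes the name of the current class rather than the object itself, and it works on the whole expression at once instead of parsing recursively. A result of None stands for the "MISSING" case, in which nothing is returned.

```python
import re
from typing import List, Optional

# A double-quoted string, a single-quoted string, or a run of non-blanks.
_TOKEN = re.compile(r'"[^"]*"|\'[^\']*\'|\S+')


def _target_value(token: str) -> Optional[str]:
    """Convert a target-value token into the value it denotes."""
    if token == "MISSING":
        return None
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "\"'":
        return token[1:-1]  # strip the surrounding quotes
    return token


def eval_guarded(expression: str, current_class: str) -> Optional[str]:
    """Evaluate guarded-case* otherwise-case for the given class name."""
    tokens: List[str] = _TOKEN.findall(expression)
    pos = 0
    # Test the guarded cases in order.
    while pos + 2 < len(tokens) and tokens[pos] == "if":
        if tokens[pos + 1] == current_class:
            return _target_value(tokens[pos + 2])
        pos += 3
    if pos >= len(tokens):
        raise ValueError("expression has no otherwise-case")
    # No guard matched: return the otherwise-case.
    return _target_value(tokens[pos])
```

With the third example from the header comment, the result is titleField when the current class is Article, caption when it is CaptionedChunk, and heading for any other class.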
5.3 The method for setting an attribute
In the CELLAR architecture, attributes of the CELLAR object are defined by giving values to the architectural attributes attrName, attrType, and attrValue, or attrName2, attrType2, and attrValue2. When the CELLAR object is constructed, the attribute is actually set by calling the ESISattribute method of class Object. It has four parameters. The first three--attrName, attrType, and attrValue--are for passing in the values of the architectural attributes; the fourth, idTable, is for passing in the current table of ID-to-object associations so that IDREFs can be handled. The source code is as follows:
/* ESISattribute : a MethodDefn on Object ***********************************
Description:
Sets an attribute of self following the specification in an ESIS
(i.e. parsed SGML) file that has been mapped onto CELLAR architectural
forms. If it is a forward reference (IDREF) to an ID, then the
attribute value is not actually set, but an unresolved reference
record is set up which will set the attribute when the ID is finally
encountered. If no attribute name is passed in, the method does nothing.
Parameters:
attrName -- The name of the attribute to set
attrType -- The type of object to put into the attribute. It is either
the name of a CELLAR class, or the keyword IDREF to indicate that
it is a reference to another object or IDREFS for multiple references
attrValue -- A string to convert into the attribute value (which is
the ID of the target element for IDREF, or IDs separated by spaces
for IDREFS)
idTable -- The table of associations from IDs to objects
Returns: nothing
************************************************************************************************/
ESISattribute( attrName : String, attrType : String,
attrValue : String, idTable : IdTable ) : means
if exists of ^attrName then
begin
var newValue
if ( ^attrType = 'IDREF' )
then /* If it is a backward reference, a value is returned.
If it is a forward reference, no value is returned but an
IdUnresolved record is added to the IdTable */
newValue := do find( ^attrValue, self, ^attrName) to ^idTable
else if ( ^attrType = 'IDREFS' )
then newValue :=
perform( { |item| do find( ^item, self, ^attrName) to ^idTable } )
of parse ^attrValue using *( String.upToBlank ?blank)
else if ( ( ^attrType = 'String' ) or ( ^attrType = 'Text' ) )
then newValue := ^attrValue
else newValue := parse ^attrValue using Action(!Action(^attrType)).default
append ^newValue to Action(^attrName) of self
end
The next section explains how the idTable works to resolve IDREF and IDREFS values.
5.4 The table of ID-to-object associations
The ID-to-object associations are handled by a set of three classes. (Their definitions are found in the conceptualModel attribute of the SGML97 DomainModel.) The main class is IdTable which has a contents consisting of a sequence of IdAssociations sorted by IDs:
class IdTable has contents: seq of IdAssociation sorted by 'id'
An IdAssociation has two main attributes: an id string which is an element ID from the SGML document and a reference (i.e. pointer) to the object in CELLAR which corresponds to the SGML element. For the case where the IDREF in the SGML document is a forward reference (that is, the target ID has not yet been encountered in the SGML document), we have the special case of an unresolved association. For every reference to an ID before the corresponding object is known, the IdAssociation stores an IdUnresolved in its unresolved attribute:
class IdAssociation has
id: String
object: refers to Object
unresolved: seq of IdUnresolved
Each IdUnresolved object records the fact that the object pointed to in the source attribute has an unresolved reference to the ID given in the owning IdAssociation. When the ID is finally encountered, the reference to the target object will be set in the attribute named attr of the source object:
class IdUnresolved has
attr: String
source: refers to Object
The IdTable is passed to the main parsing function as a parameter. It is accessed in two situations:
- When an IDREF is encountered in the ESIS input, a method named find is executed on the IdTable in order to find the object that is associated with that ID.
- When an ID is encountered in the ESIS input, a method named add is executed on the IdTable in order to add an association between that ID and the CELLAR object currently being constructed.
The find method has three parameters. The first is the id that is being looked up. The other two, the CELLAR object (source) from which the reference originates and the name of the attribute (attr) of that object in which that reference will be stored, must be passed in for the case in which this is actually a forward reference. When this is a backward reference, the method simply returns the object that corresponds to the ID. When it is a forward reference, there are two possible cases: either this is the first reference to the given ID and both an IdAssociation and an IdUnresolved must be set up, or this ID has already been referenced so we only need to add another IdUnresolved. The code is as follows:
/* find : a MethodDefn on IdTable ****************************************
Description:
Finds an ID in the table and returns its associated object. If there
is no object yet associated with the ID, an unresolved forward
reference is recorded in the table (which is fixed up when the ID is
later defined).
Parameters:
id -- The ID to lookup in the table
source -- The object from which the reference to the ID is originating
attr -- The attribute of the source object in which the reference will be stored
Returns:
If this is a backward reference, the associated object is returned.
If this is a forward reference, nothing is returned.
Side Effects:
If this is a forward reference, the table is modified to record an
unresolved forward reference that is resolved later when the ID is
defined.
************************************************************************************************/
find( id : String, source : Object, attr : String ) : Object means
begin
var assoc, forward
assoc := find( ^id, 'contents' ) of self
if isMissing of ^assoc then
begin
/* This is a first-time forward reference. Create an association. */
assoc := create of !IdAssociation
set id of ^assoc to ^id
append ^assoc to contents of self
end
if exists of object of ^assoc
then /* This is a backward reference. Return object */
object of ^assoc
else /* This is a forward reference. Set up the info needed to
resolve it later when the ID is finally encountered. */
begin
forward := create of !IdUnresolved
set attr of ^forward to ^attr
set source of ^forward to ^source
append ^forward to unresolved of ^assoc
end
end
The add method is called when an ID is encountered in the ESIS file. It has two parameters: the id being declared and the CELLAR object to be associated with it (which happens to be the object currently under construction). The method must first see if it already has an association for the given ID. If it does, then there were forward references to this ID; the appropriate action is to call the resolve method on the IdAssociation (see below) to set its object and resolve all the pending IdUnresolved forward references to this object. If there is no association already, then the ID has not yet been referenced and we can simply create a new IdAssociation. The code is as follows:
/* add : a MethodDefn on IdTable ****************************************
Description:
Adds a new ID-to-object association to the IdTable. If the ID is
already in the table, it means that an IDREF was encountered before
the ID itself. In this case, the prior references are also cleaned up.
Parameters:
id -- The SGML ID
object -- The corresponding CELLAR object
Returns: Nothing
Side Effects: Goes back and fixes pending forward references to this ID.
************************************************************************************************/
add( id : String, object : Object ) : means
begin
var assoc
assoc := find( ^id, 'contents' ) of self
if exists of ^assoc
then
/* This ID has already been referred to.
Resolve the pending forward references. */
do resolve( ^object) to ^assoc
else
/* This ID hasn't been referred to yet. Add the association. */
begin
assoc := create of !IdAssociation
set id of ^assoc to ^id
set object of ^assoc to ^object
append ^assoc to contents of self
end
end
The resolve method on IdAssociation has one parameter, the object to which the pending forward references are to be resolved. The method first sets the object attribute of the association to this target object. Then it goes through the unresolved references one at a time and sets the attr of the source object of each IdUnresolved to the object passed in as parameter. (This is done in the perform action which performs the embedded lambda action once for each value in the unresolved attribute; the current value is assigned to the ref parameter of the lambda action.) The unresolved list is then set to empty since there are now no pending forward references. The code is as follows:
/* resolve : a MethodDefn on IdAssociation ****************************************
Description:
Set the object associated with the ID already stored in this
IdAssociation to be the object passed in as the parameter. Then
resolve all the pending forward references to this ID.
Parameters:
object -- The object to be associated with the ID already stored in
the id attribute
Returns: Nothing
Side Effects:
Changes all source objects noted in the IdUnresolved records stored
in the unresolved attribute to now point to the associated object
that has just been passed in as parameter.
Assumptions:
Assumes both that the object attribute is not set and that the
unresolved attribute is.
************************************************************************************************/
resolve( object : Object ) : means
begin
/* Set the association */
if isMissing of my object
then set object of self to ^object
else /* This should not happen */
do printDebug( 'Double association for' ) to my id
if isMissing of my unresolved
then /* This, too, should not happen */
do printDebug( 'No unresolved references for' ) to my id
/* Resolve the pending forward references */
perform( { | ref |
set Action( attr of ^ref) of source of ^ref to ^object
} ) of my unresolved
set unresolved of self to lit. missing
end
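The interplay of find, add, and resolve can be summarized in a compact Python sketch. The class and method names mirror the CELLAR ones, but this is an assumption-laden illustration: it keeps the associations in a dictionary rather than a sorted sequence, stores each unresolved reference as a (source, attr) pair, overwrites the attribute instead of appending to it, and leaves out the debugging checks.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional, Tuple


class IdAssociation:
    """An SGML ID, its object (once known), and pending references."""

    def __init__(self, id_: str) -> None:
        self.id = id_
        self.object: Optional[Any] = None
        self.unresolved: List[Tuple[Any, str]] = []  # (source, attr) pairs

    def resolve(self, obj: Any) -> None:
        """Set the object and fix up every pending forward reference."""
        self.object = obj
        for source, attr in self.unresolved:
            setattr(source, attr, obj)
        self.unresolved = []


class IdTable:
    def __init__(self) -> None:
        # A dict replaces the sorted sequence used in CELLAR.
        self.contents: Dict[str, IdAssociation] = {}

    def find(self, id_: str, source: Any, attr: str) -> Optional[Any]:
        """Return the object for id_, or record a forward reference."""
        assoc = self.contents.setdefault(id_, IdAssociation(id_))
        if assoc.object is not None:
            return assoc.object  # backward reference
        assoc.unresolved.append((source, attr))  # forward reference
        return None

    def add(self, id_: str, obj: Any) -> None:
        """Associate id_ with obj, resolving any pending references."""
        assoc = self.contents.get(id_)
        if assoc is not None:
            assoc.resolve(obj)
        else:
            assoc = IdAssociation(id_)
            assoc.object = obj
            self.contents[id_] = assoc


# A forward reference followed by the definition of the ID.
table = IdTable()
reading = SimpleNamespace(witness=None)      # hypothetical referring object
first = table.find("A", reading, "witness")  # ID "A" is not yet defined
manuscript = SimpleNamespace(siglum="A")     # hypothetical target object
table.add("A", manuscript)                   # resolves the pending reference
```

After the call to add, the witness attribute of the referring object points at the manuscript, and a later find for the same ID returns the manuscript directly.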
5.5 The parsers for converting CDATA into CELLAR Strings and Texts
When a CDATA item is encountered in the main parser, in a line of the ESIS file beginning with - (hyphen), the parser must convert the CDATA string in the ESIS file to a CELLAR object. The pcdataClass parameter names the target CELLAR class. When the target class is String, the String.ESIS parser is called to make the conversion. The main issue that this parser must deal with is handling the \n codes that are in the ESIS representation of the CDATA. Whenever the original SGML document had a line break in the content of an element with a declared content type of PCDATA or CDATA, the ESIS representation encodes that line break as the character sequence \n. A String in CELLAR is not allowed to contain a line break; thus each occurrence of \n must be converted to a space. The following parser performs this task:
/* ESIS : a ParserDefn on String ****************************************
Description:
Converts string data from an ESIS file to a CELLAR string.
Converts \n (newline) in the string to space.
Parameters:
encoding -- Specifies the encoding for the resulting string
************************************************************************************************/
ESIS( encoding : Encoding default is lit. missing ) -->
var string, substrings
substrings := ( ?String.upTo('\n')
*( '\n' do( ' ' ) ?String.upTo('\n') )
)
do( begin
string := create of !String
set basicValue of ^string to ^substrings
set encoding of ^string to ^encoding
^string
end )
Note that this parser performs a second function; it also sets the encoding of the String to the one specified in the encoding parameter. In the above parser, the substrings variable ends up holding a sequence of all the substrings returned by the String.upTo('\n') parsers. A match of a literal string does not return a value. Thus the code '\n' do(' ') has the effect of returning a literal space when the \n sequence is matched. Later in set basicValue of ^string to ^substrings, all of the substrings are automatically concatenated as they are placed into the basicValue of the String.
A Text object in CELLAR is allowed to have line breaks. Thus, a pcdataClass of Text is specified when one wants the data to preserve the line breaks that are in the original SGML document. The Text.ESIS parser differs from String.ESIS only in that the value substituted when \n is matched is a literal line break rather than a literal space.
/* ESIS : a ParserDefn on Text ****************************************
Description:
Converts string data from an ESIS file to a CELLAR Text.
Converts \n (newline) in the string to a newline.
Parameters:
encoding -- Specifies the encoding for the resulting string
************************************************************************************************/
ESIS( encoding : Encoding default is lit. missing ) -->
var string, substrings
substrings := ( ?String.upTo('\n')
*( '\n' do( '
' ) ?String.upTo('\n') )
)
do( begin
string := create of !String
set basicValue of ^string to ^substrings
set encoding of ^string to ^encoding
^string
end )
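The difference between the two conversions comes down to what replaces the \n code. A minimal Python sketch follows; the function names are hypothetical, and the encoding parameter is left out. Like the parsers above, it replaces every occurrence of the two-character sequence \n and does not interpret any other escape sequence that may occur in the nsgmls output.

```python
def esis_to_string(cdata: str) -> str:
    """Convert ESIS character data for a String: '\\n' becomes a space."""
    return cdata.replace("\\n", " ")


def esis_to_text(cdata: str) -> str:
    """Convert ESIS character data for a Text: '\\n' becomes a line break."""
    return cdata.replace("\\n", "\n")
```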
Document date: 12-Nov-1997
