How to import SGML data into CELLAR using architectural processing
From:
Importing SGML data into CELLAR by means of architectural forms
Gary F. Simons
Summer Institute of Linguistics
Last revised: 13 December 1997
Contents
- Analyze the relationship between the original DTD and the CELLAR conceptual model
- Create a mapping DTD that maps from the original DTD to the architectural DTD
- Create a client DTD that invokes architectural processing for the original DTD
- Associate the client document with the client DTD
- Run the architecture engine to translate the client document into the corresponding architectural document
- Load the architectural document into CELLAR
0. Overview
This document explains how to import existing SGML data into the CELLAR database without changing the SGML data or doing any programming in CELLAR. This has been achieved by using the architectural processing features of James Clark's SP parser [Cla97] and a predefined parser in CELLAR that reads the output of the architectural processing.
The process involves four DTDs:
- Original DTD: The DTD for the SGML document to be imported into the object database.
- Client DTD: A substitute for the original DTD which adds an invocation of architectural processing for the mapping DTD.
- Mapping DTD: The meta-DTD which maps the elements and attributes of the client DTD onto the elements and attributes of the architectural DTD.
- Architectural DTD: The meta-DTD for the CELLAR architecture; this is always CELLAR.DTD.
The first and last DTDs are given at the outset; the other two are created as part of the process.
The process for importing an SGML file into the CELLAR database follows these steps:
- Analyze the relationship between the original DTD and the CELLAR conceptual model.
- Create a mapping DTD that maps from the original DTD to the architectural DTD.
- Create a client DTD that invokes architectural processing for the original DTD.
- Associate the client document with the client DTD.
- Run the architecture engine to translate the client document into the corresponding architectural document.
- Load the architectural document into CELLAR.
These steps are detailed in the subsections which follow. The example used throughout this procedure is that of the critical edition of a passage from the Greek text of Second Clement; all the files for the example are given in full elsewhere.
1. Analyze the relationship between the original DTD and the CELLAR conceptual model
To begin, you must find:
- a sample SGML data file that you want to import
- the DTD for this SGML file
- the conceptual model for the classes in CELLAR into which the SGML data will be imported
With these resources in front of you, go through the DTD one element at a time and determine which class and/or attribute it corresponds to in the conceptual model. Then consider how its attributes correspond to attributes in the conceptual model. If the DTD is much more complex than the document instance you want to import, you can do an inventory of the SGML file to see what elements and attributes it actually uses, and then consider just how these map to the conceptual model. The results of this analysis are expressed formally in step 2; that step is typically performed concurrently with this analysis step.
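The inventory mentioned above can be approximated with a short script. The following is a minimal sketch, not part of the original procedure: the function name inventory is hypothetical, and the regular expressions assume fully written-out start tags, so they do not handle SGML tag minimization or markup characters inside text content.

```python
import re
from collections import defaultdict

# A start tag such as <wit id=A type=Manuscript>. Declarations (<!...) and
# end tags (</...) do not match because the name must start with a letter.
ATTR = r"[\w.-]+\s*=\s*(?:'[^']*'|\"[^\"]*\"|[^\s>]+)"
START_TAG = re.compile(r"<([A-Za-z][\w.-]*)((?:\s+" + ATTR + r")*)\s*>")
ATTR_NAME = re.compile(r"([\w.-]+)\s*=\s*(?:'[^']*'|\"[^\"]*\"|[^\s>]+)")

def inventory(sgml_text):
    """Map each element name used in the text to the attributes it carries."""
    # Drop comments first so that tags mentioned inside them are not counted.
    text = re.sub(r"<!--.*?-->", "", sgml_text, flags=re.S)
    used = defaultdict(set)
    for match in START_TAG.finditer(text):
        # SGML names are case-insensitive by default, so fold to lower case.
        element = match.group(1).lower()
        used[element]  # register the element even if it has no attributes
        for attr in ATTR_NAME.finditer(match.group(2)):
            used[element].add(attr.group(1).lower())
    return {element: sorted(attrs) for element, attrs in used.items()}
```

Calling inventory on the text of the data file returns a dictionary of the elements actually used, each with a sorted list of its attributes, which can then be checked against the conceptual model one entry at a time.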
The SGML data file for the sample critical text is as follows. Note that a significant portion of the content has been elided in the interest of brevity. The Greek text is encoded in TLG beta code.
<!DOCTYPE TEI.2 SYSTEM "textcrit.dtd">
<TEI.2>
<text>
<front>
<docTitle>2 Clement, chapter 7</docTitle>
<witlist>
<wit id=A type=Manuscript>Codex Alexandrinus
<bibl>A Greek uncial of the fifth century. Housed in the British
Museum. Published in: The Codex Alexandrinus in reduced photographic
facsimile, with an introduction by F. G. Kenyon, London 1909.
</bibl></wit>
<wit id=C type=Manuscript>Codex Constantinopolitanus
<bibl> . . . </bibl></wit>
<wit id=S type=Manuscript>Syriac Version
<bibl> . . . </bibl></wit>
<wit id=L type=Edition>Lightfoot 1890
<bibl>Lightfoot, J. B. 1890. The Apostolic Fathers: Clement,
Ignatius, Polycarp (2nd edition). Part One: Clement, volume 2,
pages 210-261. Macmillan. (Reprinted 1989 by Hendrickson
Publishers, Peabody, MA)
</bibl></wit>
<wit id=Lb type=Edition>Loeb edition
<bibl> . . . </bibl></wit>
<wit id=B type=Edition>Bihlmeyer 1970
<bibl> . . . </bibl></wit>
<wit id=W type=Edition>Wengst 1984
<bibl> . . . </bibl></wit>
</witlist>
</front>
<body>
<div n=7>
<!-- ***************** Verse 1 ********************* -->
<s n=1>
w(/ste
<app><rdg wit='A L Lb B'>ou)=n</rdg>
<rdg wit='C S W'><omit></rdg></app>
a)delfoi/
<app><rdg wit='A L Lb B'>mou</rdg>
<rdg wit='C W'><omit></rdg></app>
a)gwnisw/meqa ei)do/tej, o(/ti e)n xersi\n o(
<app><rdg wit='C S L Lb B W'>a)gw\n</rdg>
<rdg wit='A'>ai)w/n</rdg></app>
kai\ o(/ti ei)j tou\j fqartou\j a)gw=naj kataple/ousin polloi/,
a)ll' ou) pa/ntej stefanou=ntai,
<app><rdg wit='C L Lb B W'>ei) mh\</rdg>
<rdg wit='A'>oi( mh/</rdg>
<rdg wit='S'>ei) mh\ mo/non</rdg></app>
oi( polla\ kopia/santej kai\ kalw=j a)gwnisa/menoi.
</s>
<!-- and so forth for remaining verses -->
</div>
</body></text>
</TEI.2>
The DTD for this file is the following:
<!-- TextCrit.DTD
A DTD for encoding a text critical edition. All tags
are from the TEI guidelines (Text Encoding Initiative).
The content models have been simplified to deal only
with the tags needed for the sample text of II Clement.
The aim is to faithfully represent the TEI scheme of
markup without having to deal with the huge TEI DTD.
This DTD reflects the "Parallel segmentation method"
of encoding. See section 19.2.3 of the TEI Guidelines.
Gary Simons, Summer Institute of Linguistics
Last revised: 18 october 1997 -->
<!ELEMENT TEI.2 - - ( text ) >
<!ELEMENT text - - ( front, body ) >
<!ELEMENT front - - ( docTitle, witList ) >
<!ELEMENT docTitle - - (#PCDATA) >
<!ELEMENT witList - - ( wit+ ) >
<!ELEMENT wit - - ( #PCDATA, bibl? ) >
<!ATTLIST wit id ID #REQUIRED
type CDATA #REQUIRED >
<!ELEMENT bibl - - (#PCDATA) >
<!ELEMENT body - - ( div+ ) >
<!ELEMENT div - - ( s+ ) >
<!ATTLIST div n CDATA #IMPLIED >
<!ELEMENT s - - ( #PCDATA | app )+ >
<!ATTLIST s n CDATA #IMPLIED >
<!ELEMENT app - - ( rdg+ ) >
<!ELEMENT rdg - - ( #PCDATA | omit ) >
<!ATTLIST rdg wit IDREFS #REQUIRED >
<!ELEMENT omit - O EMPTY >
The conceptual model for the objects and attributes into which we want to import the SGML file is diagrammed below. The notation and the model are explained in [Sim97a]. Here suffice it to say that solid arrows mean "contains" and the dotted arrow means "holds pointers to."
2. Create the mapping DTD
The correspondences between the elements and attributes of the original DTD and the objects and attributes of the CELLAR conceptual model are formally expressed in the mapping DTD. This second DTD is a meta-DTD that defines how the elements in the client DTD should be annotated so as to express their mapping onto the elements of the architectural DTD.
While performing this task, you will want to have access to:
- the architectural DTD and the explanation of its elements and attributes
- the listing of common mapping problems and their solutions
- the set of complete solved examples
All of the examples provided with this working paper follow the convention that the mapping DTD is named by adding a map- prefix to the name of the original DTD. The mapping DTD thus always has the following form:
<!-- map-original.dtd
This maps original.dtd onto CELLAR architectural forms -->
<!afdr "ISO/IEC 10744:1992" --Allow multiple ATTLIST declarations-->
<?ArcBase cellar>
<!ENTITY % cellarDTD SYSTEM "cellar.dtd" >
<!NOTATION cellar SYSTEM>
<!ATTLIST #NOTATION cellar
arcDocF NAME #FIXED object
arcFormA NAME #FIXED cellar
arcNamrA NAME #FIXED cellarNames
ArcDTD CDATA #FIXED "%cellarDTD" >
<!-- Mapping declarations go here; one ATTLIST declaration for
each original DTD element to be mapped. -->
<!ENTITY % originalDTD SYSTEM "original.dtd" >
%originalDTD;
The <!AFDR> declaration (for Architectural Form Definition Requirements) invokes meta-DTD extensions needed for architectural processing. In this case it instructs the SGML parser to permit multiple ATTLIST declarations for a single element; otherwise it would be a syntax error for this DTD both to define an ATTLIST for an element and then to read another one for the same element from the original DTD (as it does at the end).
The <?ArcBase cellar> processing instruction declares cellar as the name of the base architecture. The architectural support attributes for this architecture declare that:
- object is the top-level document element in the architectural document (ArcDocF),
- cellar is the attribute in the client document which names the corresponding architectural form to use in the architectural document (ArcFormA),
- cellarNames is the "attribute renamer" attribute (ArcNamrA), and
- cellar.dtd is the architectural DTD (ArcDTD).
These settings should be the same for all mapping DTDs.
This meta-DTD produces an architecturally annotated version of the client document. It must include all the declarations of the original DTD plus all the new ones that add the architectural information. The original DTD is thus included in full without modification at the end.
The bulk of this meta-DTD consists of second ATTLIST declarations for the elements in the original DTD. Their purpose is to add declarations for the attributes of the cellar architecture.
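Because these second ATTLIST declarations are so regular, they can be drafted from a small table and then reviewed by hand. The following is a minimal sketch under stated assumptions: the helper name attlist is hypothetical, every attribute is declared #FIXED, and the cellar attribute is typed NAME while all others are typed CDATA, which matches the declarations shown in this document but is not a rule stated by the architecture itself.

```python
def attlist(element, attrs):
    """Return an ATTLIST declaration that gives each attribute a #FIXED value.

    attrs maps attribute names to values, in the order they should appear.
    """
    lines = [f"<!ATTLIST {element}"]
    for name, value in attrs.items():
        # Assumption: only the form-naming attribute "cellar" is a NAME.
        declared_type = "NAME" if name == "cellar" else "CDATA"
        # Values containing spaces must be written as quoted literals.
        literal = f'"{value}"' if " " in value else value
        lines.append(f"  {name} {declared_type} #FIXED {literal}")
    return "\n".join(lines) + " >"

# Example: draft the declaration for an element whose content fills the
# "title" attribute of the enclosing object.
print(attlist("docTitle", {"cellar": "attr", "contentAttr": "title"}))
```

The generated text is only a starting point; mappings that rename client attributes through cellarNames still need to be thought through element by element.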
For our example, the original DTD is named textcrit.dtd. The complete mapping DTD for our example is as follows:
<!-- map-textcrit.dtd
This maps textcrit.dtd onto CELLAR architectural forms
Gary Simons, SIL, 18 Oct 1997 -->
<!afdr "ISO/IEC 10744:1992" --Allow multiple ATTLIST declarations-->
<?ArcBase cellar>
<!ENTITY % cellarDTD SYSTEM "cellar.dtd" >
<!NOTATION cellar SYSTEM>
<!ATTLIST #NOTATION cellar
arcDocF NAME #FIXED object
arcFormA NAME #FIXED cellar
arcNamrA NAME #FIXED cellarNames
ArcDTD CDATA #FIXED "%cellarDTD" >
<!ATTLIST TEI.2
cellar NAME #FIXED object
class CDATA #FIXED CriticalText >
<!ATTLIST text
cellar NAME #FIXED ignore >
<!ATTLIST front
cellar NAME #FIXED ignore >
<!ATTLIST docTitle
cellar NAME #FIXED attr
contentAttr CDATA #FIXED title >
<!ATTLIST witList
cellar NAME #FIXED attr
contentAttr CDATA #FIXED authorities >
<!ATTLIST wit
cellar NAME #FIXED object
cellarNames CDATA #FIXED "class type attrValue id"
attrName CDATA #FIXED siglum
attrType CDATA #FIXED String
contentAttr CDATA #FIXED description
-- id automatically preserved from
client attr of same name -- >
<!ATTLIST bibl
cellar NAME #FIXED attr
contentAttr CDATA #FIXED source >
<!ATTLIST body
cellar NAME #FIXED attr
contentAttr CDATA #FIXED body >
<!ATTLIST div
cellar NAME #FIXED object
class CDATA #FIXED CriticalTextChapter
contentAttr CDATA #FIXED contents
attrName CDATA #FIXED n
attrType CDATA #FIXED String
cellarNames CDATA #FIXED "attrValue n" >
<!ATTLIST s
cellar NAME #FIXED object
class CDATA #FIXED CriticalTextVerse
contentAttr CDATA #FIXED contents
attrName CDATA #FIXED n
attrType CDATA #FIXED String
cellarNames CDATA #FIXED "attrValue n"
encoding CDATA #FIXED GKOb >
<!ATTLIST app
cellar NAME #FIXED object
class CDATA #FIXED TextVariation
contentAttr CDATA #FIXED readings >
<!ATTLIST rdg
cellar NAME #FIXED object
class CDATA #FIXED Reading
contentAttr CDATA #FIXED text
attrName CDATA #FIXED witnesses
attrType CDATA #FIXED IDREFS
cellarNames CDATA #FIXED "attrValue wit" >
<!ATTLIST omit
cellar NAME #FIXED object
class CDATA #FIXED String >
<!ENTITY % originalDTD SYSTEM "textcrit.dtd" >
%originalDTD;
3. Create a client DTD that invokes architectural processing
The next step is to create a client DTD for the architectural processing. This is done by adding the declarations for architectural processing to the declarations from the original DTD. This could be done by editing the original DTD, but in keeping with the guideline that we do not want to modify the original DTD, all the examples provided with this working paper create a new file for the client DTD. By convention the client DTD has been named by adding a my- prefix to the name of the original DTD. The client DTD thus always has the following form:
<!-- my-original.dtd
This is a version of original.dtd that invokes
the mapping to CELLAR architectural forms. -->
<?ArcBase mapping>
<!ENTITY % mappingDTD SYSTEM "map-original.dtd" >
<!NOTATION mapping SYSTEM>
<!ATTLIST #NOTATION mapping
ArcDocF NAME #FIXED "originalDOCTYPE"
ArcDTD CDATA #FIXED "%mappingDTD" >
<!ENTITY % originalDTD SYSTEM "original.dtd" >
%originalDTD;
Note that this DTD does not modify the declarations in the original DTD in any way. Rather, it duplicates them exactly by including the original DTD in full at the end. The purpose of this version of the DTD is to declare the architectural support that will invoke the mapping DTD developed in the previous step.
The <?ArcBase mapping> processing instruction declares mapping as the name of the base architecture. Two architectural support attributes for this architecture must be declared:
- ArcDocF names the top-level document element in the resulting document, which in this case is the architecturally annotated client document; it is the same as the DOCTYPE for the original DTD.
- ArcDTD names the meta-DTD for this application of the architecture engine; it is the mapping DTD created in the previous step.
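Since the client DTD varies only in the name of the original DTD and its DOCTYPE, it can be produced from a template. The following is a minimal sketch: the function name client_dtd and its parameters are hypothetical, and it assumes the my- and map- naming conventions described in this document.

```python
CLIENT_TEMPLATE = """<!-- my-{stem}.dtd
This is a version of {stem}.dtd that invokes
the mapping to CELLAR architectural forms. -->
<?ArcBase mapping>
<!ENTITY % mappingDTD SYSTEM "map-{stem}.dtd" >
<!NOTATION mapping SYSTEM>
<!ATTLIST #NOTATION mapping
ArcDocF NAME #FIXED "{doctype}"
ArcDTD CDATA #FIXED "%mappingDTD" >
<!ENTITY % originalDTD SYSTEM "{stem}.dtd" >
%originalDTD;
"""

def client_dtd(stem, doctype):
    """Return the text of the client DTD for original DTD <stem>.dtd.

    stem is the file name of the original DTD without its extension;
    doctype is the document element named in the original DOCTYPE.
    """
    return CLIENT_TEMPLATE.format(stem=stem, doctype=doctype)

# Example: write the client DTD next to the original DTD.
if __name__ == "__main__":
    with open("my-textcrit.dtd", "w") as out:
        out.write(client_dtd("textcrit", "TEI.2"))
```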
For our example, the original DTD is named textcrit.dtd and the DOCTYPE is TEI.2. The client DTD for our example is thus as follows:
<!-- my-textcrit.dtd
This is a version of textcrit.dtd that invokes
the mapping to CELLAR architectural forms. -->
<?ArcBase mapping>
<!ENTITY % mappingDTD SYSTEM "map-textcrit.dtd" >
<!NOTATION mapping SYSTEM>
<!ATTLIST #NOTATION mapping
ArcDocF NAME #FIXED "TEI.2"
ArcDTD CDATA #FIXED "%mappingDTD" >
<!ENTITY % originalDTD SYSTEM "textcrit.dtd" >
%originalDTD;
4. Associate the client document with the client DTD
Before performing the automatic translation from client document to architectural document, one more detail must be attended to. The DOCTYPE declaration of the original SGML data file must be changed so that it uses the client DTD defined in the previous step.
For our example, the result would be:
<!DOCTYPE TEI.2 SYSTEM "my-textcrit.dtd">
<TEI.2>
<!-- Nothing is changed in the content -->
</TEI.2>
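If many files need this change, the system identifier in the DOCTYPE declaration can be swapped by a script. The following is a minimal sketch: the function name use_client_dtd is hypothetical, and it assumes the declaration has the simple form shown above (a SYSTEM identifier in quotes, with no PUBLIC identifier and no internal subset).

```python
import re

# Captures the text up to the system identifier, the quote character used,
# and the identifier itself.
DOCTYPE = re.compile(r"(<!DOCTYPE\s+\S+\s+SYSTEM\s+)([\"'])(.*?)\2", re.I)

def use_client_dtd(sgml_text, client_dtd_name):
    """Return sgml_text with its DOCTYPE pointing at client_dtd_name."""
    def swap(match):
        quote = match.group(2)
        return f"{match.group(1)}{quote}{client_dtd_name}{quote}"

    # Only the first declaration is touched; the content is left as it is.
    new_text, count = DOCTYPE.subn(swap, sgml_text, count=1)
    if count == 0:
        raise ValueError("no DOCTYPE declaration with a SYSTEM identifier found")
    return new_text
```

Writing the result to a copy of the data file, rather than over the original, keeps the source document untouched.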
5. Run the architecture engine to translate the document
The freeware SGML parsers in the SP family [Cla97] include an architecture engine that can perform the mapping to translate a client document instance into an architectural document instance. For the nsgmls parser, the following command line does the job:
nsgmls -Amapping -Acellar input.sgm >output.clr
This invokes the mapping architecture followed by the cellar architecture to result in an output file which is a document instance that conforms to the CELLAR architecture.
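When the translation is run from a script rather than typed at the command line, the same invocation can be wrapped as follows. This is a minimal sketch: the function names are hypothetical, nsgmls is assumed to be on the search path, and only the options shown in the command line above are used.

```python
import subprocess

def nsgmls_command(input_path):
    """Build the argument list for the translation step."""
    return ["nsgmls", "-Amapping", "-Acellar", input_path]

def translate(input_path, output_path):
    """Run nsgmls on input_path, writing its output to output_path.

    Returns the exit status of nsgmls. Parser messages go to standard
    error and are left visible on the console.
    """
    with open(output_path, "wb") as out:
        result = subprocess.run(nsgmls_command(input_path), stdout=out)
    return result.returncode

# Example use (requires nsgmls and the input file to be present):
#   status = translate("input.sgm", "output.clr")
```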
For our example, the command line is:
nsgmls -Amapping -Acellar clement.sgm >clement.clr
Performing this translation step on the sample Clement text (see step one) using the mappings from step two yields a document like the following (note that most of the content is elided to avoid excessive detail):
<object class="CriticalText">
<attr contentAttr="title" pcdataClass="String">
2 Clement, chapter 7</attr>
<attr contentAttr="authorities">
<object class="Manuscript" id="A" contentAttr="description"
attrName="siglum" attrType="String" attrValue="A"
pcdataClass="String">
Codex Alexandrinus
<attr contentAttr="source" pcdataClass="String">
A Greek uncial of the fifth century. . . </attr>
</object>
<!-- The other six authorities -->
</attr>
<attr contentAttr="body">
<!-- The CriticalTextChapter and its contents -->
</attr>
</object>
Note, however, that the nsgmls parser does not actually output an SGML-tagged document. Rather, it outputs a simple text representation of the document's Element Structure Information Set (ESIS). The ESIS format is much easier to parse in the final step of the process. (If you really want to output an SGML document like the above, use the sgmlnorm tool from the SP package: simply substitute sgmlnorm for nsgmls in the above command line.) In the ESIS format, the first character of each line is a command character that identifies the kind of information on the line:
- A specifies an attribute of the next element
- (gi marks the start of an element whose generic identifier is gi
- )gi marks the end of an element whose generic identifier is gi
- - marks character data
The actual output is thus as follows:
ACLASS CDATA CriticalText
(OBJECT
ACONTENTATTR CDATA title
APCDATACLASS CDATA String
(ATTR
-2 Clement, chapter 7
)ATTR
ACONTENTATTR CDATA authorities
(ATTR
ACLASS CDATA Manuscript
ACONTENTATTR CDATA description
APCDATACLASS CDATA String
AID TOKEN A
AATTRNAME CDATA siglum
AATTRVALUE CDATA A
AATTRTYPE CDATA String
(OBJECT
-Codex Alexandrinus\n
ACONTENTATTR CDATA source
APCDATACLASS CDATA String
(ATTR
-A Greek uncial of the fifth century. Housed in the British \nMuseum. Published in: The Codex Alexandrinus in reduced photographic\n facsimile, with an introduction by F. G. Kenyon, London 1909.
)ATTR
)OBJECT
...
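To inspect such output before loading it, the four command characters described above are enough to rebuild the element tree. The following is a minimal sketch and is not the parser used inside CELLAR: the function name parse_esis and the dictionary layout are hypothetical, other ESIS commands are ignored, and the only escape sequence handled is \n.

```python
def parse_esis(lines):
    """Build a nested structure from ESIS lines.

    Each element becomes a dict with keys "gi", "attrs" and "children";
    character data is stored as plain strings among the children.
    """
    root = {"gi": None, "attrs": {}, "children": []}
    stack = [root]
    pending = {}  # attributes seen since the last start of an element
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        command, rest = line[0], line[1:]
        if command == "A":
            # "name type value"; the value may be absent or contain spaces.
            name, _type, *value = rest.split(" ", 2)
            pending[name] = value[0] if value else ""
        elif command == "(":
            node = {"gi": rest, "attrs": pending, "children": []}
            pending = {}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif command == ")":
            stack.pop()
        elif command == "-":
            # Simplification: only the \n escape is decoded here.
            stack[-1]["children"].append(rest.replace("\\n", "\n"))
        # Any other command character is skipped in this sketch.
    return root["children"]
```

Passing the lines of the output file to parse_esis yields a list whose single item is the top-level OBJECT element, which makes it easy to check class names and contentAttr values against the conceptual model before attempting the load.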
6. Load the architectural document into CELLAR
The final step in the process is to run a method of the CELLAR system that invokes a data input parser that converts the architectural document instance into the corresponding structure of objects. The input to the CELLAR parser is the ESIS output file generated by the nsgmls parser in the previous step.
To load the data into CELLAR, do the following:
- Open a CELLAR object inspector to the RootFolder.
- Select the Methods command from the Attr menu.
- Scroll down the method list until you find loadESISfile.
- Double click loadESISfile and answer Yes to the confirmation dialog.
- When the parser terminates, you will find your imported data as the last item in the contents of the RootFolder. Select Owning from the Attr menu and click on contents to view the contents of the RootFolder.
The parser will fail with a CELLAR error message if the document cannot be imported. The most typical cause of failure is specifying the mapping DTD in such a way that it generates a structure of objects that violates the target conceptual model in CELLAR (for instance, it might attempt to put an object in an attribute that cannot accept that class). A separate page with pointers on troubleshooting failures is planned. For information on how the parser in CELLAR actually works, see the separate implementation documentation.
Document date: 12-Nov-1997
