How to import SGML data into CELLAR using architectural processing

From:

Importing SGML data into CELLAR by means of architectural forms
Gary F. Simons
Summer Institute of Linguistics

Last revised: 13 December 1997


Contents

  1. Analyze the relationship between the original DTD and the CELLAR conceptual model
  2. Create a mapping DTD that maps from the original DTD to the architectural DTD
  3. Create a client DTD that invokes architectural processing for the original DTD
  4. Associate the client document with the client DTD
  5. Run the architecture engine to translate the client document into the corresponding architectural document
  6. Load the architectural document into CELLAR


0. Overview

This document explains how to import existing SGML data into the CELLAR database without changing the SGML data or doing any programming in CELLAR. This has been achieved by using the architectural processing features of James Clark's SP parser [Cla97] and a predefined parser in CELLAR that reads the output of the architectural processing.

The process involves four DTDs:

Original DTD
This is the DTD for the SGML document to be imported into the object database.
Client DTD
A substitute for the original DTD which adds an invocation of architectural processing for the mapping DTD.
Mapping DTD
The meta-DTD which maps the elements and attributes of the client DTD onto the elements and attributes of the architectural DTD.
Architectural DTD
The meta-DTD for the CELLAR architecture; this is always CELLAR.DTD.

The first and last DTD are given at the outset; the other two are created as part of the process.

The process for importing an SGML file into the CELLAR database follows these steps:

  1. Analyze the relationship between the original DTD and the CELLAR conceptual model.
  2. Create a mapping DTD that maps from the original DTD to the architectural DTD.
  3. Create a client DTD that invokes architectural processing for the original DTD.
  4. Associate the client document with the client DTD.
  5. Run the architecture engine to translate the client document into the corresponding architectural document.
  6. Load the architectural document into CELLAR.

These steps are detailed in the subsections which follow.  The example used throughout this procedure is that of the critical edition of a passage from the Greek text of Second Clement; all the files for the example are given in full elsewhere.

1. Analyze the relationship between the original DTD and the CELLAR conceptual model

To begin, you must find:

With these resources in front of you, go through the DTD one element at a time and determine which class and/or attribute it corresponds to in the conceptual model.  Then consider how its attributes correspond to attributes in the conceptual model.  If the DTD is much more complex than the document instance you want to import, you can do an inventory of the SGML file to see what elements and attributes it actually uses, and then consider just how these map to the conceptual model.  The results of this analysis are expressed formally in step 2; that step is typically performed concurrently with this analysis step.
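
The inventory mentioned above can be approximated with a short script. The following Python sketch is only an illustration and is not part of CELLAR or SP; the function name inventory is hypothetical. It assumes that every start tag is written out in full (no omitted or minimized tags) and that names are case-insensitive, as they are in the default SGML declaration.

```python
import re
from collections import defaultdict

# Start tags such as <wit id=A type=Manuscript>. End tags (</x>), declarations
# (<!...>) and processing instructions (<?...>) do not match this pattern.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.\-]*)([^<>]*)>")
# Attribute names in name=value pairs; the value may be bare or quoted.
ATTR = re.compile(r"([A-Za-z][A-Za-z0-9.\-]*)\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\s>]+)")

def inventory(sgml_text):
    """Return {element name: set of attribute names} for the tags actually used."""
    # Remove comments first so that tags quoted inside them are not counted.
    text = re.sub(r"<!--.*?-->", "", sgml_text, flags=re.DOTALL)
    used = defaultdict(set)
    for name, rest in START_TAG.findall(text):
        # Assumption: names are case-insensitive, so fold them to lower case.
        used[name.lower()].update(a.lower() for a in ATTR.findall(rest))
    return dict(used)
```

Running inventory on the text of the data file gives, for each element that occurs, the set of attributes it carries; only those elements and attributes need to be considered in the analysis.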

The SGML data file for the sample critical text is as follows. Note that a significant portion of the content has been elided in the interest of brevity. The Greek text is encoded in TLG beta code.

<!DOCTYPE TEI.2 SYSTEM "textcrit.dtd"> 
<TEI.2>
<text>
<front>
<docTitle>2 Clement, chapter 7</docTitle>
<witlist>
<wit id=A type=Manuscript>Codex Alexandrinus
<bibl>A Greek uncial of the fifth century.  Housed in the British 
Museum.  Published in:  The Codex Alexandrinus in reduced photographic
 facsimile, with an introduction by F. G. Kenyon, London 1909.
</bibl></wit>
<wit id=C type=Manuscript>Codex Constantinopolitanus
   <bibl> . . . </bibl></wit>
<wit id=S type=Manuscript>Syriac Version
   <bibl> . . . </bibl></wit>
<wit id=L type=Edition>Lightfoot 1890
<bibl>Lightfoot, J. B.  1890.  The Apostolic Fathers: Clement,
Ignatius, Polycarp (2nd edition).  Part One: Clement, volume 2, pages 210-261.
Macmillan.  (Reprinted 1989 by Hendrickson Publishers, Peabody, MA)
</bibl></wit>
<wit id=Lb type=Edition>Loeb edition
   <bibl> . . . </bibl></wit>
<wit id=B type=Edition>Bihlmeyer 1970
<bibl> . . . </bibl></wit>
<wit id=W type=Edition>Wengst 1984
   <bibl> . . . </bibl></wit>
</witlist>
</front>
<body>
<div n=7>
<!-- ***************** Verse 1 ********************* -->
<s n=1>
w(/ste
<app><rdg wit='A L Lb B'>ou)=n</rdg>
   <rdg wit='C S W'><omit></rdg></app>
a)delfoi/
<app><rdg wit='A L Lb B'>mou</rdg>
   <rdg wit='C W'><omit></rdg></app>
a)gwnisw/meqa ei)do/tej, o(/ti e)n xersi\n o(
<app><rdg wit='C S L Lb B W'>a)gw\n</rdg>
   <rdg wit='A'>ai)w/n</rdg></app>
kai\ o(/ti ei)j tou\j fqartou\j a)gw=naj kataple/ousin
polloi/, a)ll' ou) pa/ntej stefanou=ntai,
<app><rdg wit='C L Lb B W'>ei) mh\</rdg>
   <rdg wit='A'>oi( mh/</rdg>
   <rdg wit='S'>ei) mh\ mo/non</rdg></app>
oi( polla\ kopia/santej kai\ kalw=j a)gwnisa/menoi.
</s>
<!-- and so forth for remaining verses  -->
</div>
</body></text>
</TEI.2>

The DTD for this file is the following:

<!-- TextCrit.DTD

     A DTD for encoding a text critical edition.  All tags
     are from the TEI guidelines (Text Encoding Initiative).
     The content models have been simplified to deal only
     with the tags needed for the sample text of II Clement.
     The aim is to faithfully represent the TEI scheme of
     markup without having to deal with the huge TEI DTD.

     This DTD reflects the "Parallel segmentation method"
     of encoding.  See section 19.2.3 of the TEI Guidelines.

     Gary Simons, Summer Institute of Linguistics
     Last revised: 18 October 1997                      -->  

<!ELEMENT TEI.2     - - ( text )              >

<!ELEMENT text      - - ( front, body )       >

<!ELEMENT front     - - ( docTitle, witList ) >

<!ELEMENT docTitle  - - (#PCDATA)             >

<!ELEMENT witList   - - ( wit+ )              >

<!ELEMENT wit       - - ( #PCDATA, bibl? )    >
<!ATTLIST wit       id   ID    #REQUIRED
                    type CDATA #REQUIRED      >

<!ELEMENT bibl      - - (#PCDATA)             >

<!ELEMENT body      - - ( div+ )              >

<!ELEMENT div       - - ( s+ )                >
<!ATTLIST div       n  CDATA  #IMPLIED        >

<!ELEMENT s         - - ( #PCDATA | app )+    >
<!ATTLIST s         n  CDATA  #IMPLIED        >

<!ELEMENT app       - - ( rdg+ )              >

<!ELEMENT rdg       - - ( #PCDATA | omit )    >
<!ATTLIST rdg       wit  IDREFS  #REQUIRED    >

<!ELEMENT omit      - O  EMPTY                >

The conceptual model for the objects and attributes into which we want to import the SGML file is diagrammed below. The notation and the model are explained in [Sim97a]. Here suffice it to say that solid arrows mean "contains" and the dotted arrow means "holds pointers to."

2. Create the mapping DTD

The correspondence between the elements and attributes of the original DTD and the objects and attributes of the CELLAR conceptual model is formally expressed in the mapping DTD. This second DTD is a meta-DTD that defines how the elements in the client DTD should be annotated so as to express their mapping onto the elements of the architectural DTD.

While performing this task, you will want to have access to:

All of the examples provided with this working paper follow the convention that the mapping DTD is named by adding a map- prefix to the name of the original DTD. The mapping DTD thus always has the following form:

<!-- map-original.dtd
     This maps original.dtd onto CELLAR architectural forms -->

<!afdr "ISO/IEC 10744:1992" --Allow multiple ATTLIST declarations-->

<?ArcBase cellar>
<!ENTITY % cellarDTD SYSTEM "cellar.dtd" >
<!NOTATION cellar SYSTEM>
<!ATTLIST  #NOTATION cellar
    arcDocF  NAME  #FIXED object 
    arcFormA NAME  #FIXED cellar
    arcNamrA NAME  #FIXED cellarNames
    ArcDTD   CDATA #FIXED "%cellarDTD" >

<!-- Mapping declarations go here; one ATTLIST declaration for 
     each original DTD element to be mapped. -->

<!ENTITY % originalDTD SYSTEM "original.dtd"  >
%originalDTD;

The <!AFDR> declaration (for Architectural Form Definition Requirements) invokes meta-DTD extensions needed for architectural processing. In this case it instructs the SGML parser to permit multiple ATTLIST declarations for a single element; otherwise it would be a syntax error for this DTD both to define an ATTLIST for an element and then to read another one from the original DTD (as it does at the end).

The <?ArcBase cellar> processing instruction declares cellar as the name of the base architecture. The architectural support attributes for this architecture declare that:

These settings should be the same for all mapping DTDs.

This meta-DTD produces an architecturally annotated version of the client document.  It must include all the declarations of the original DTD plus all the new ones that add the architectural information. The original DTD is thus included in full without modification at the end.

The bulk of this meta-DTD consists of second ATTLIST declarations for the elements in the original DTD. Their purpose is to add declarations for the attributes of the cellar architecture.

For our example, the original DTD is named textcrit.dtd. The complete mapping DTD for our example is as follows:

<!-- map-textcrit.dtd
     This maps textcrit.dtd onto CELLAR architectural forms
     Gary Simons, SIL, 18 Oct 1997 -->

<!afdr "ISO/IEC 10744:1992" --Allow multiple ATTLIST declarations-->

<?ArcBase cellar>
<!ENTITY % cellarDTD SYSTEM "cellar.dtd" >
<!NOTATION cellar SYSTEM>
<!ATTLIST  #NOTATION cellar
    arcDocF  NAME  #FIXED object 
    arcFormA NAME  #FIXED cellar
    arcNamrA NAME  #FIXED cellarNames
    ArcDTD   CDATA #FIXED "%cellarDTD" >

<!ATTLIST TEI.2
     cellar      NAME  #FIXED object
     class       CDATA #FIXED CriticalText    >

<!ATTLIST text
     cellar      NAME  #FIXED ignore          >

<!ATTLIST front
     cellar      NAME  #FIXED ignore          >

<!ATTLIST docTitle
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED title           >

<!ATTLIST witList
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED authorities     >

<!ATTLIST wit 
     cellar      NAME  #FIXED object 
     cellarNames CDATA #FIXED "class type attrValue id"
     attrName    CDATA #FIXED siglum
     attrType    CDATA #FIXED String
     contentAttr CDATA #FIXED description     
  -- id          automatically preserved from
                 client attr of same name --  >

<!ATTLIST bibl 
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED source          >

<!ATTLIST body
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED body            >

<!ATTLIST div
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED CriticalTextChapter
     contentAttr CDATA #FIXED contents
     attrName    CDATA #FIXED n
     attrType    CDATA #FIXED String
     cellarNames CDATA #FIXED "attrValue n"   >

<!ATTLIST s 
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED CriticalTextVerse
     contentAttr CDATA #FIXED contents
     attrName    CDATA #FIXED n
     attrType    CDATA #FIXED String
     cellarNames CDATA #FIXED "attrValue n"
     encoding    CDATA #FIXED GKOb            >

<!ATTLIST app
     cellar      NAME  #FIXED object
     class       CDATA #FIXED TextVariation
     contentAttr CDATA #FIXED readings        >

<!ATTLIST rdg
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED Reading
     contentAttr CDATA #FIXED text
     attrName    CDATA #FIXED witnesses
     attrType    CDATA #FIXED IDREFS
     cellarNames CDATA #FIXED "attrValue wit" >

<!ATTLIST omit 
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED String          >

<!ENTITY % originalDTD SYSTEM "textcrit.dtd"  >
%originalDTD;
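
Because the head and tail of every mapping DTD are fixed, they can be generated mechanically, leaving only the ATTLIST declarations to be written by hand. The following Python sketch illustrates this; it is not part of CELLAR or SP, and the function name mapping_skeleton and the TODO comments it emits are hypothetical. It assumes each ELEMENT declaration in the original DTD names a single element; declarations that use a name group are skipped.

```python
import re

# Fixed head of the mapping DTD, copied from the template in this step.
HEADER = """<!-- map-{name}.dtd
     This maps {name}.dtd onto CELLAR architectural forms -->

<!afdr "ISO/IEC 10744:1992" --Allow multiple ATTLIST declarations-->

<?ArcBase cellar>
<!ENTITY % cellarDTD SYSTEM "cellar.dtd" >
<!NOTATION cellar SYSTEM>
<!ATTLIST  #NOTATION cellar
    arcDocF  NAME  #FIXED object
    arcFormA NAME  #FIXED cellar
    arcNamrA NAME  #FIXED cellarNames
    ArcDTD   CDATA #FIXED "%cellarDTD" >
"""

# Fixed tail: include the original DTD in full, without modification.
FOOTER = """
<!ENTITY % originalDTD SYSTEM "{name}.dtd"  >
%originalDTD;
"""

def mapping_skeleton(name, dtd_text):
    """Return a mapping DTD skeleton with one placeholder comment per element."""
    # Collect element names in declaration order; name groups "(a|b)" do not match.
    elements = re.findall(r"<!ELEMENT\s+([^\s>(]+)", dtd_text, flags=re.IGNORECASE)
    stubs = [f"\n<!-- TODO: ATTLIST declaration mapping {e} -->\n" for e in elements]
    return HEADER.format(name=name) + "".join(stubs) + FOOTER.format(name=name)
```

Each placeholder comment is then replaced by the ATTLIST declaration that expresses the result of the analysis in step 1.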

3. Create a client DTD that invokes architectural processing

The next step is to create a client DTD for the architectural processing. This is done by adding the declarations for architectural processing to the declarations from the original DTD. This could be done by editing the original DTD, but in keeping with the guideline that the original DTD should not be modified, all the examples provided with this working paper create a new file for the client DTD. By convention the client DTD is named by adding a my- prefix to the name of the original DTD. The client DTD thus always has the following form:

<!-- my-original.dtd
     This is a version of original.dtd that invokes
     the mapping to CELLAR architectural forms. -->

<?ArcBase mapping>

<!ENTITY % mappingDTD SYSTEM "map-original.dtd" >
<!NOTATION mapping SYSTEM>
<!ATTLIST #NOTATION mapping
    ArcDocF  NAME  #FIXED "originalDOCTYPE" 
    ArcDTD   CDATA #FIXED "%mappingDTD" >

<!ENTITY % originalDTD SYSTEM "original.dtd" >
%originalDTD; 

Note that this DTD does not modify the declarations in the original DTD in any way. Rather, it duplicates them exactly by including the original DTD in full at the end. The purpose of this version of the DTD is to declare the architectural support that will invoke the mapping DTD developed in the previous step.

The <?ArcBase mapping> processing instruction declares mapping as the name of the base architecture. Two architectural support attributes for this architecture must be declared:

For our example, the original DTD is named textcrit.dtd and the DOCTYPE is TEI.2. The client DTD for our example is thus as follows:

<!-- my-textcrit.dtd
     This is a version of textcrit.dtd that invokes
     the mapping to CELLAR architectural forms. -->

<?ArcBase mapping>

<!ENTITY % mappingDTD SYSTEM "map-textcrit.dtd" >
<!NOTATION mapping SYSTEM>
<!ATTLIST #NOTATION mapping
    ArcDocF  NAME  #FIXED "TEI.2" 
    ArcDTD   CDATA #FIXED "%mappingDTD" >

<!ENTITY % originalDTD SYSTEM "textcrit.dtd" >
%originalDTD; 
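
Since the client DTD varies only in the name of the original DTD and the name of its document type, it too can be produced from the template. The Python sketch below is an illustration only; the function name client_dtd is hypothetical, and it assumes the file naming conventions (map- and my- prefixes) used in this working paper.

```python
# Template for the client DTD, copied from the general form in this step.
CLIENT_TEMPLATE = """<!-- my-{name}.dtd
     This is a version of {name}.dtd that invokes
     the mapping to CELLAR architectural forms. -->

<?ArcBase mapping>

<!ENTITY % mappingDTD SYSTEM "map-{name}.dtd" >
<!NOTATION mapping SYSTEM>
<!ATTLIST #NOTATION mapping
    ArcDocF  NAME  #FIXED "{doctype}"
    ArcDTD   CDATA #FIXED "%mappingDTD" >

<!ENTITY % originalDTD SYSTEM "{name}.dtd" >
%originalDTD;
"""

def client_dtd(name, doctype):
    """Fill in the template for original DTD <name>.dtd with document type <doctype>."""
    return CLIENT_TEMPLATE.format(name=name, doctype=doctype)
```

For the example of this paper, client_dtd("textcrit", "TEI.2") yields the text of my-textcrit.dtd.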

4. Associate the client document with the client DTD

Before performing the automatic translation from client document to architectural document, one more detail must be attended to. The DOCTYPE declaration of the original SGML data file must be changed so that it uses the client DTD defined in the previous step.

For our example, the result would be:

<!DOCTYPE TEI.2 SYSTEM "my-textcrit.dtd">
<TEI.2>
   <!-- Nothing is changed in the content -->
</TEI.2>
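
This change can be made on a copy of the data file so that the original stays untouched. The following Python sketch shows one way to do it; it is not part of CELLAR or SP, and the function name use_client_dtd is hypothetical. It assumes the DOCTYPE declaration uses a SYSTEM identifier that is a bare file name (no directory path) and that the client DTD follows the my- naming convention.

```python
import re

# <!DOCTYPE name SYSTEM "file"> with either kind of quote around the file name.
DOCTYPE = re.compile(r"(<!DOCTYPE\s+\S+\s+SYSTEM\s+)([\"'])(.*?)\2", re.IGNORECASE)

def use_client_dtd(sgml_text, prefix="my-"):
    """Return the document text with its DOCTYPE pointing at the client DTD."""
    def swap(match):
        head, quote, system_id = match.groups()
        if system_id.startswith(prefix):
            return match.group(0)          # already points at the client DTD
        return f"{head}{quote}{prefix}{system_id}{quote}"

    new_text, count = DOCTYPE.subn(swap, sgml_text, count=1)
    if count == 0:
        raise ValueError("no DOCTYPE declaration with a SYSTEM identifier found")
    return new_text
```

Writing the returned text to a new file leaves both the original data file and the original DTD exactly as they were.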

5. Run the architecture engine to translate the document

The freeware SGML parsers in the SP family [Cla97] include an architecture engine that can translate a client document instance into an architectural document instance. For the nsgmls parser, the following command line does the job:

nsgmls -Amapping -Acellar input.sgm >output.clr

This invokes the mapping architecture followed by the cellar architecture, producing an output file that is a document instance conforming to the CELLAR architecture.

For our example, the command line is:

nsgmls -Amapping -Acellar clement.sgm >clement.clr
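
When several files are to be translated, the same command can be driven from a script. The Python sketch below simply wraps the command line shown above; the function names nsgmls_command and translate are hypothetical, and the sketch assumes that nsgmls is installed and on the search path.

```python
import subprocess

def nsgmls_command(input_path, architectures=("mapping", "cellar")):
    """Build the nsgmls argument list, with one -A option per architecture."""
    return ["nsgmls", *[f"-A{name}" for name in architectures], input_path]

def translate(input_path, output_path):
    """Run nsgmls on input_path and save its standard output to output_path."""
    # Binary mode: the parser's output is written through without re-encoding.
    with open(output_path, "wb") as out:
        # check=True raises CalledProcessError if nsgmls exits with an error status.
        subprocess.run(nsgmls_command(input_path), stdout=out, check=True)
```

For the example, translate("clement.sgm", "clement.clr") is equivalent to the command line above.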

Performing this translation step on the sample Clement text (see step one) using the mappings from step two yields a document like the following (note that most of the content is elided to avoid excessive detail):

<object class="CriticalText">
   <attr contentAttr="title" pcdataClass="String">
      2 Clement, chapter 7</attr>
   <attr contentAttr="authorities">
      <object class="Manuscript" id="A" contentAttr="description"
         attrName="siglum" attrType="String" attrValue="A"
         pcdataClass="String">
         Codex Alexandrinus
         <attr contentAttr="source" pcdataClass="String">
            A Greek uncial of the fifth century. . . </attr>
      </object>
      <!-- The other six authorities -->
   </attr>
   <attr contentAttr="body">
      <!-- The CriticalTextChapter and its contents -->
   </attr>
</object>

Note, however, that the nsgmls parser does not actually output an SGML-tagged document. Rather, it outputs a simple text representation of the document's Element Structure Information Set (ESIS), which is much easier to parse in the final step of the process. (If you really want to output an SGML document like the one above, use the sgmlnorm tool from the SP package: simply substitute sgmlnorm for nsgmls in the above command line.) In the ESIS format, the first character of each line is a command character that identifies the kind of information on the line. In the output shown here, A gives the name, type, and value of an attribute of the element that follows, ( marks the start of an element, ) marks its end, and - introduces character data.

The actual output is thus as follows:

ACLASS CDATA CriticalText
(OBJECT
ACONTENTATTR CDATA title
APCDATACLASS CDATA String
(ATTR
-2 Clement, chapter 7
)ATTR
ACONTENTATTR CDATA authorities
(ATTR
ACLASS CDATA Manuscript
ACONTENTATTR CDATA description
APCDATACLASS CDATA String
AID TOKEN A
AATTRNAME CDATA siglum
AATTRVALUE CDATA A
AATTRTYPE CDATA String
(OBJECT
-Codex Alexandrinus\n
ACONTENTATTR CDATA source
APCDATACLASS CDATA String
(ATTR
-A Greek uncial of the fifth century.  Housed in the British \nMuseum.  Published in:  The Codex Alexandrinus in reduced photographic\n facsimile, with an introduction by F. G. Kenyon, London 1909.
)ATTR
)OBJECT
...

6. Load the architectural document into CELLAR

The final step in the process is to run a method of the CELLAR system that invokes a data input parser that converts the architectural document instance into the corresponding structure of objects. The input to the CELLAR parser is the ESIS output file generated by the nsgmls parser in the previous step.

To load the data into CELLAR, do the following:

  1. Open a CELLAR object inspector to the RootFolder.
  2. Select the Methods command from the Attr menu.
  3. Scroll down the method list until you find loadESISfile.
  4. Double click loadESISfile and answer Yes to the confirmation dialog.
  5. When the parser terminates, you will find your imported data as the last item in the contents of the RootFolder.  Select Owning from the Attr menu and click on contents to view the contents of the RootFolder.

The parser will fail with a CELLAR error message if the document cannot be imported. The most typical cause of failure is specifying the mapping DTD in such a way that it generates a structure of objects that violates the target conceptual model in CELLAR (for instance, it might attempt to put an object in an attribute that cannot accept that class). A separate page with pointers on troubleshooting failures is planned. For information on how the parser in CELLAR actually works, see the separate implementation documentation.


Document date: 12-Nov-1997