The SGML model versus the object model, and the problem of converting from one to the other

From:

Importing SGML data into CELLAR by means of architectural forms
Gary F. Simons
Summer Institute of Linguistics

Last revised: 12 November 1997


Contents

  1. The SGML model in a nutshell
  2. The object model in a nutshell
  3. An unsatisfactory default mapping from elements to objects
  4. An example of the kind of mapping we need
  5. The fundamental problem
  6. A basic architecture for mapping SGML data into objects


The problems inherent in importing SGML data into an object database stem from the differences between the SGML model of data and the object model of data. In speaking of the "object model of data," I am referring specifically to the way object databases [Cat97] and conceptual modeling languages [Bor85] represent information. Such systems replace the simple instance variables of an object-oriented programming language with attributes that encapsulate integrity constraints and the semantics of relationships to other objects.

In this document, the SGML model and the object model are reduced to their most basic features in order to make comparison easy. Then the problem of converting data from the SGML model to the object model is discussed. First, a fully automatic approach to translation is demonstrated but shown to be unusable because of a fundamental problem: sometimes an SGML element corresponds to an object, sometimes to an attribute, and sometimes to both. This sets the stage for the solution adopted in this working paper: an automatic translation guided by formal mapping rules that are derived from a human analysis of how elements in the source SGML data relate to objects and attributes in the target object model. SGML architectural forms are used to encode the formal mapping.

1. The SGML model in a nutshell

In SGML, the fundamental unit of data representation is the element. Each element must have a generic identifier; it may optionally have a number of attributes or content or both. Each attribute has a name and a value; the value is represented by a string of characters. The content of an element may consist of character data or embedded elements or a combination of both. These generalizations may be expressed in terms of the following declarations:

<!ELEMENT element   - - (attr* & content?)   >
<!ATTLIST element   gi    NAME #REQUIRED     >

<!ELEMENT attr      - O  EMPTY               >
<!ATTLIST attr      name  NAME #REQUIRED
                    value CDATA #IMPLIED     >

<!ELEMENT content   - - (#PCDATA | element)* >
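
The same generalizations can be restated as a small data structure. The following Python sketch is only an illustration and is not part of CELLAR or of any SGML toolkit; the class name Element, the helper text_of, and the sample <note> element are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """An SGML element under the nutshell model (hypothetical class)."""
    gi: str                                               # generic identifier, required
    attrs: Dict[str, str] = field(default_factory=dict)   # attribute name -> character string
    content: List[Union[str, "Element"]] = field(default_factory=list)  # text and elements, mixed


def text_of(elem: Element) -> str:
    """Concatenate all character data in elem, descending into embedded elements."""
    return "".join(
        part if isinstance(part, str) else text_of(part)
        for part in elem.content
    )


# Hypothetical sample: <note type="gloss">see <term>CELLAR</term> docs</note>
note = Element(
    "note",
    {"type": "gloss"},
    ["see ", Element("term", content=["CELLAR"]), " docs"],
)
print(text_of(note))
```

Note that attribute values are plain strings, whereas content is the only place where further elements can occur.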

2. The object model in a nutshell

In the object model, the fundamental unit of data representation is the object. Each object must have a class, and is either a primitive object that stores primitive data like a string or a number, or is a complex object that has attributes. Each attribute has a name and a value; the value consists of embedded objects. These generalizations may be expressed in terms of the following declarations:

<!ELEMENT object    - - (attr)*                     >
<!ATTLIST object    class NAME #REQUIRED            >

<!ELEMENT attr      - - (primitiveObject | object)* >
<!ATTLIST attr      name  NAME #REQUIRED            >

<!ELEMENT primitiveObject    - - (#PCDATA)          >
<!ATTLIST primitiveObject    class NAME #REQUIRED   >
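
For comparison, the object model can be sketched in the same way. Again this is an illustration under assumed names: the classes Obj and PrimitiveObject are hypothetical (Obj is used only to avoid shadowing Python's built-in object), and the sample Person and Document objects are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class PrimitiveObject:
    """A primitive object: a class name plus primitive data (kept as text here)."""
    cls: str
    data: str


@dataclass
class Obj:
    """A complex object: a class name plus named attributes.

    Each attribute value is a sequence of embedded objects, so all
    recursion happens inside the attributes.
    """
    cls: str
    attrs: Dict[str, List[Union[PrimitiveObject, "Obj"]]] = field(default_factory=dict)


# Hypothetical sample: a Document whose authors attribute holds one Person.
person = Obj("Person", {"name": [PrimitiveObject("String", "First Author")]})
document = Obj("Document", {"authors": [person]})
print(document.attrs["authors"][0].attrs["name"][0].data)
```

There is no separate slot for content here; anything embedded must be placed under some named attribute.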

3. An unsatisfactory default mapping from elements to objects

Element and object are superficially similar: generic identifier corresponds to class, both have attributes, and both occur recursively. They differ fundamentally, however, in the nature of the attributes and the recursion. With elements, the attributes cannot contain embedded structure; the recursion of elements is allowed only within the content of an element. With objects, there is no specialized notion of content; rather, the recursive embedding of further objects takes place within the attributes.

An SGML document following the model of section 1 can be automatically mapped onto the object model of section 2 by making four transformations:

  1. Convert every instance of <element gi=X>...</element> to <object class=X>...</object>.
  2. Convert every instance of <attr name=X value=Y> to <attr name=X><primitiveObject class="String">Y</primitiveObject></attr>.
  3. Convert every instance of <content>...</content> to <attr name="content">...</attr>.
  4. Embed every instance of #PCDATA within the tags <primitiveObject class="String">...</primitiveObject>.

For example, the following sample SGML element contains an instance of each of the four conditions listed above:

<phrase rend="ital">an italic phrase</phrase>

Following the nutshell model of SGML in section 1, this corresponds to the following semantic representation:

<element gi="phrase">
   <attr name="rend" value="ital">
   <content>an italic phrase</content>
</element>

This would be converted into the following object representation by the proposed default mapping:

<object class="phrase">
   <attr name="rend">
      <primitiveObject class="String">ital</primitiveObject>
   </attr>
   <attr name="content">
      <primitiveObject class="String">an italic phrase</primitiveObject>
   </attr>
</object>
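
The four transformations can be sketched in a few lines of Python. This is a minimal sketch, not the CELLAR importer: it assumes the source happens to be well-formed XML so that the standard library parser can read it, it works directly on the tagged element rather than on the intermediate <element> form, and it skips whitespace-only character data. The function names default_map and string_object are hypothetical.

```python
import xml.etree.ElementTree as ET


def string_object(text):
    """Rule 4 (and the value part of rule 2): wrap text in a String primitive."""
    prim = ET.Element("primitiveObject", {"class": "String"})
    prim.text = text
    return prim


def default_map(elem):
    """Apply the default mapping to one element, recursively."""
    obj = ET.Element("object", {"class": elem.tag})              # rule 1
    for name, value in elem.attrib.items():                      # rule 2
        attr = ET.SubElement(obj, "attr", {"name": name})
        attr.append(string_object(value))
    has_text = bool(elem.text and elem.text.strip())
    if has_text or len(elem):
        content = ET.SubElement(obj, "attr", {"name": "content"})  # rule 3
        if has_text:
            content.append(string_object(elem.text))             # rule 4
        for child in elem:
            content.append(default_map(child))                   # recursion
            if child.tail and child.tail.strip():
                content.append(string_object(child.tail))        # rule 4
    return obj


source = ET.fromstring('<phrase rend="ital">an italic phrase</phrase>')
result = default_map(source)
print(ET.tostring(result, encoding="unicode"))
```

Every embedded element ends up inside the single attribute named content, which is the behavior the next section shows to be inadequate.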

4. An example of the kind of mapping we need

The default transformation described in the preceding section can easily be done on any SGML document, but it will seldom yield a result that actually fits the conceptual model of a target object database. Consider, for instance, the following simplistic SGML document:

<!DOCTYPE document SYSTEM "document.dtd">
<document>
   <creationDate>12-Jun-97</creationDate>
   <title>
      <maintitle>The main title</maintitle>
      <subtitle>a subtitle</subtitle>
   </title>
   <authors>
      <author>
         <name>First Author</name>
         <affil>Some Company</affil>
      </author>
      <author>
         <name>Second Author</name>
         <affil>Another Company</affil>
      </author>
   </authors>
   <p>An introductory paragraph</p>
   <div1><!-- The first section --></div1>
   <div1><!-- The second section --></div1>
</document>

The above represents a typical approach to encoding a document in SGML. Compare it, however, to the following, which is equally typical of how a Document class might be defined in an object database:

class Document has
   creationDate : Date
   title        : TitleStatement
   authors      : sequence of Person
   content      : sequence of Paragraph or Division

The default mapping proposed in section 3 would first go wrong by putting all the subelements within the document in a single attribute named content; instead we want to map them into four different attributes. The first three subelements (<creationDate>, <title>, and <authors>) correspond to Document attributes of the same name. The remaining subelements (<p> and two instances of <div1>) correspond to objects that go into the Document attribute named content (which happens not to be explicitly tagged).

Though the first three subelements correspond to attributes, they differ significantly in the way they do so. <creationDate> additionally carries the information that the embedded PCDATA content should be mapped onto a primitive object of class Date. <title> corresponds not only to the attribute title but also to an object of class TitleStatement (which in turn has attributes maintitle and subtitle). By contrast, <authors> corresponds to the attribute and nothing more; each embedded <author> element corresponds to an object of class Person.

5. The fundamental problem

This example illustrates the following fundamental result when comparing the SGML model to the object model: some SGML elements encode an object, some encode an attribute, and still others simultaneously encode both. (We see in the full CELLAR architecture that still other relationships are possible.) The basic challenge of importing SGML data into an object database is to determine which of these cases holds for each of the element types occurring in the data, and then to express formally how each maps onto the corresponding classes and attributes of the target database schema.
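
The outcome of such an analysis for the sample document of section 4 can be written down as a simple lookup table. The Python sketch below is illustrative only; the names Role, ELEMENT_ROLES, and elements_with are hypothetical. The entries for <maintitle> and <subtitle> are inferred from the statement that TitleStatement has attributes of those names, and the entries for <p> and <div1> from the declared type of the content attribute.

```python
from enum import Enum


class Role(Enum):
    OBJECT = "object"
    ATTRIBUTE = "attribute"
    BOTH = "object and attribute"


# One entry per element type in the sample document of section 4.
ELEMENT_ROLES = {
    "document": Role.OBJECT,         # a Document object
    "creationDate": Role.ATTRIBUTE,  # its PCDATA becomes a Date
    "title": Role.BOTH,              # attribute title and a TitleStatement object
    "maintitle": Role.ATTRIBUTE,     # inferred
    "subtitle": Role.ATTRIBUTE,      # inferred
    "authors": Role.ATTRIBUTE,       # the attribute and nothing more
    "author": Role.OBJECT,           # a Person object
    "p": Role.OBJECT,                # inferred: a Paragraph object
    "div1": Role.OBJECT,             # inferred: a Division object
}


def elements_with(role):
    """Return the generic identifiers that play the given role, sorted."""
    return sorted(gi for gi, r in ELEMENT_ROLES.items() if r is role)


print(elements_with(Role.BOTH))
```

A table like this records only which case holds; it does not yet say which class or attribute each element maps onto, which is what the formal mapping in the next section supplies.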

6. A basic architecture for mapping SGML data into objects

The HyTime standard [ISO92] first introduced the concept of architectural forms as a way to associate standardized semantics with elements in user-defined DTDs [DD94]. Now that this notion has been generalized in the SGML Extended Facilities (defined in Annex A of the revised HyTime standard [ISO97]), we can use it to good advantage in solving the problem at hand. Architectural forms provide a mechanism we can use to express the semantics of how SGML elements map onto the object model. See [Cov97] for pointers to other applications of architectural forms.

There are two basic element forms in the architecture, <object> and <attr>. Rather than having a third form for the case when an element corresponds to both an object and an attribute, this case is treated as being a mapping to an object, and the object form adds an architectural attribute to name the attribute it also maps to. The basic definitions of these two forms are as follows (see the main paper for their full definition):

<!ELEMENT object  - - (object | attr | #PCDATA)*                >
<!ATTLIST object
     class       -- Create this class of CELLAR object        --
                 CDATA #REQUIRED
     parentAttr  -- Put the object in this attr of its parent --
                 CDATA #IMPLIED  
     contentAttr -- Put embedded objects in this attribute    --
                 CDATA #IMPLIED
     pcdataClass -- Create this class for embedded PCDATA     --
                 CDATA "String"                                 >

<!ELEMENT attr    - - (object | #PCDATA)*                       >
<!ATTLIST attr
     contentAttr -- Put embedded objects in this attribute    --
                 CDATA #IMPLIED
     pcdataClass -- Create this class for embedded PCDATA     --
                 CDATA "String"                                 >

The easiest way to explain these forms is by example. In the illustrative document in section 4, the <document> element corresponds to an object of class Document; the element content (unless an embedded element names a specific target attribute) goes into the content attribute of the object. The <document> element would be annotated as follows to indicate its mapping into the object model:

<document cellar=object class="Document" contentAttr="content">

This says that in the architecture named cellar, this <document> element corresponds to an <object> element whose class is "Document" and whose contentAttr is "content".

The <creationDate> element corresponds to an attribute. Its content goes into the creationDate attribute, and the embedded PCDATA needs to be converted into Date objects. Thus,

<creationDate cellar=attr contentAttr="creationDate" pcdataClass="Date">

The <title> element corresponds to a TitleStatement object, but it also corresponds to an attribute in that it maps into the title attribute of its parent object (that is, the Document). Thus,

<title cellar=object class="TitleStatement" parentAttr="title">

Finally, the <authors> element corresponds to the authors attribute; thus,

<authors cellar=attr contentAttr="authors">

An SGML parser that performs architectural processing can take elements annotated like this and translate them into elements of the target architecture. See the main paper for an explanation of how this works.
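
To make the intended effect of the annotations concrete, the following Python sketch interprets them directly and builds nested dictionaries in place of CELLAR objects. It is a rough approximation under several assumptions, not the architectural processing described in the main paper: the annotations are written with quoted values so that an XML parser can read them, unannotated elements are simply skipped, whitespace-only character data is ignored, and the annotations on <maintitle>, <author>, and <p> (including the attribute name "text") are hypothetical additions made for the example.

```python
import xml.etree.ElementTree as ET


def new_object(cls):
    return {"class": cls, "attrs": {}}


def put(owner, attr_name, value):
    """Append value to the named attribute of owner."""
    if attr_name is None:
        raise ValueError("no target attribute is in effect for " + owner["class"])
    owner["attrs"].setdefault(attr_name, []).append(value)


def process(elem, owner, content_attr, pcdata_class):
    """Map the content of elem into owner, the nearest enclosing object."""
    def text(chunk):
        if chunk and chunk.strip():
            put(owner, content_attr, {"class": pcdata_class, "data": chunk.strip()})

    text(elem.text)
    for child in elem:
        form = child.get("cellar")
        pc = child.get("pcdataClass", "String")        # declared default value
        if form == "object":
            obj = new_object(child.get("class"))
            # parentAttr wins; otherwise use the attribute currently in effect.
            put(owner, child.get("parentAttr", content_attr), obj)
            process(child, obj, child.get("contentAttr"), pc)
        elif form == "attr":
            # No new object: content is redirected to the named attribute.
            process(child, owner, child.get("contentAttr"), pc)
        text(child.tail)


def import_document(root):
    obj = new_object(root.get("class"))
    process(root, obj, root.get("contentAttr"), root.get("pcdataClass", "String"))
    return obj


SAMPLE = """
<document cellar="object" class="Document" contentAttr="content">
   <creationDate cellar="attr" contentAttr="creationDate" pcdataClass="Date">12-Jun-97</creationDate>
   <title cellar="object" class="TitleStatement" parentAttr="title">
      <maintitle cellar="attr" contentAttr="maintitle">The main title</maintitle>
   </title>
   <authors cellar="attr" contentAttr="authors">
      <author cellar="object" class="Person"/>
      <author cellar="object" class="Person"/>
   </authors>
   <p cellar="object" class="Paragraph" contentAttr="text">An introductory paragraph</p>
</document>
"""

doc = import_document(ET.fromstring(SAMPLE))
print(sorted(doc["attrs"]))
```

In this sketch the Document object ends up with four attributes (authors, content, creationDate, and title) rather than the single content attribute produced by the default mapping of section 3.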


Document date: 12-Nov-1997