| Date issued: | 2003-05-29 |
|---|---|
| Status of document: | Draft. This is only a preliminary draft that is still under development. |
| This version: | http://www.sil.org/~simonsg/metaschema/sil_2003-05-29.htm |
| Latest version: | http://www.sil.org/~simonsg/metaschema/sil.htm |
| Previous version: | None. |
| Abstract: | This document specifies SIL (Semantic Interpretation Language), an XML application that is a language for expressing a metaschema that maps the XML markup in a document (such as a language resource) to its semantic interpretation in terms of a formal semantic schema (or ontology). |
| Editors: |
Copyright © 2003 Gary Simons (SIL International). This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0 or later (the latest version is presently available at http://www.opencontent.org/openpub/).
References
A metaschema is an XML document that formally maps the elements and attributes used in the markup schema for a language resource onto an interpretation of what they mean in terms of the concepts of a formal semantic schema (or ontology). When a metaschema is provided with the XML document for a particular language resource, the particular resource can be interpreted in terms of the concepts of the semantic schema. A program that can perform this interpretation task is called a document interpreter; the reference implementation of a document interpreter is described in [ref]. When language resources that follow different markup schemas are mapped onto the same semantic schema, then disparate resources become interoperable.
This document specifies an XML application named SIL, for Semantic Interpretation Language. It is a language for expressing a metaschema that maps from markup to semantics. The following XML document type definition (DTD) may be used to create and validate metaschema documents that conform to the vocabulary and syntax of the metaschema language specified herein:
The following special terms are used in this specification of the metaschema language:
- source document
A language resource in its original XML-encoded format.
- source markup
The markup vocabulary (that is, XML elements and attributes) used in a source document.
- markup schema
A formal definition (as with an XML DTD [XML] or an XML Schema [XSD]) of the permitted vocabulary and syntax of markup for a class of source documents.
- semantic schema
A formal definition (as with an RDF Schema [RDFS] or an OWL ontology [OWL]) of the concepts in a particular domain, including the types of resources that exist, the properties that can relate pairs of resources, and the properties that can describe a single resource in terms of literal values.
- target semantics
The particular semantic schema (or set of semantic schemas) into which a source document is being interpreted.
- semantic interpretation
The interpretation of what a source document means in terms of the concepts defined in a particular semantic schema (or set of semantic schemas).
- metaschema
A formal definition (expressed as an XML document conforming to this specification) of how a particular source document (or set of source documents that share the same markup schema) is to be interpreted in terms of the concepts of a particular semantic schema (or set of semantic schemas).
- document interpreter
A process that applies a metaschema to a source document to yield the corresponding semantic interpretation of the document.
The approach to semantics followed in this specification is the approach embodied in the Resource Description Framework [RDF]. A semantic interpretation has the form of a set of statements. Each statement is a triple consisting of a subject, a predicate, and an object. The subject of a statement is always a resource (designated by a URI). The predicate is always a subclass of resource called a property (also designated by a URI). The object may be a resource or it may be a literal value. A built-in property named rdf:type is used to identify the class of thing that a particular resource is an instance of.
The convention followed in this document is that the names of resource classes begin with an upper case letter, while the names of properties begin with a lower case letter. The serialization syntax for RDF statements (which is followed in this specification) allows for a resource to be represented by an XML element whose tag name is its type and whose URI is given in the rdf:about attribute. Statements with that resource as the subject are made by embedding elements whose tag names are the properties being predicated. An object which is a literal value is expressed as the string content of the element representing the property. An object which is an existing resource is referenced by placing its URI in the rdf:resource attribute. An object which is a new resource is created by recursively embedding an element for the type, and so on.
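The statement model described above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the `Statement` and `Literal` types are hypothetical names, and the `rdfs:label` value is invented for the example. The sketch also shows that, because an interpretation is a set of statements, asserting the same statement twice has no effect.

```python
from typing import NamedTuple, Union

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
GOLD = "http://www.emeld.org/GOLD-namespace#"


class Literal(NamedTuple):
    """A literal value used as the object of a statement (hypothetical type)."""
    value: str


class Statement(NamedTuple):
    """One RDF statement: subject URI, predicate URI, and object."""
    subject: str
    predicate: str
    obj: Union[str, Literal]  # a resource URI or a literal value


# A semantic interpretation is a set of statements.
interpretation = {
    # rdf:type identifies the class that the resource is an instance of.
    Statement("#element(aba)", RDF_TYPE, GOLD + "LexicalItem"),
    # A property whose object is a literal (the label text is invented).
    Statement("#element(aba)", RDFS + "label", Literal("aba")),
}

# Re-adding an identical statement leaves the set unchanged.
interpretation.add(Statement("#element(aba)", RDF_TYPE, GOLD + "LexicalItem"))
print(len(interpretation))
```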
Two XML attributes are used throughout the metaschema language: markup to identify elements and attributes in the source markup, and concept to identify concepts in the target semantics.
Many elements of the metaschema language use the markup attribute. In the simplest case, the attribute value is the name of a single XML element in the source markup. In other cases the same interpretation directive may apply to multiple elements, or only to elements that satisfy certain constraints in terms of their context in the document or what they contain. To handle all of these situations, the metaschema language uses the full power of the XPath expression language [XPath] to identify elements and attributes in the source markup. Here are some examples of commonly used expression types:
markup="entry" Match any element named entry.
markup="entry | subentry" Match any element named entry or subentry.
markup="entry/ptr" Match an element named ptr only when it is directly inside an entry.
markup="@type" Match any attribute named type.
markup="entry/@type" Match an attribute named type only when it is on an entry element.
markup="entry[@type='minor']" Match an element named entry only when it has a type attribute with the value minor.
markup="pos[.='noun']" Match an element named pos only when its entire contents are the string noun.
These examples illustrate the most commonly used types of expressions for identifying source markup elements and attributes. Many more are possible; see a standard reference on the XPath expression language for complete documentation (e.g., [XPath]).
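The element expressions listed above can be tried out with Python's standard `xml.etree.ElementTree` module, as in the following sketch. It rests on several assumptions: the sample document is invented, the `select` helper is hypothetical, and ElementTree implements only a subset of XPath (it cannot select attributes and has no union operator, so the union is emulated by splitting on `|`, which would fail if `|` occurred inside a predicate string). A real document interpreter would need a complete XPath processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = """
<lexicon>
  <entry type="minor"><ptr/><pos>noun</pos></entry>
  <entry><pos>verb</pos><subentry><ptr/></subentry></entry>
</lexicon>
"""


def select(root, markup):
    """Return the elements under root that satisfy an element expression.

    Each alternative of a union is searched for at any depth below root.
    """
    found = []
    for alternative in markup.split("|"):
        for elem in root.findall(".//" + alternative.strip()):
            if elem not in found:  # avoid listing an element twice
                found.append(elem)
    return found


root = ET.fromstring(SAMPLE)
for markup in ["entry", "entry | subentry", "entry/ptr",
               "entry[@type='minor']", "pos[.='noun']"]:
    print(markup, "->", len(select(root, markup)), "match(es)")
```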
Many elements of the metaschema language use the concept attribute to identify the corresponding concept in the target semantics. The attribute value is always a qualified name (or QName as it is called in the XML standards). That is, the attribute value consists of a namespace prefix followed by a colon followed by the identifier defined within that namespace for the referenced concept. For instance,
concept="rdfs:label" The concept of the label property as defined in the rdfs namespace.
concept="gold:LexicalSense" The concept of the LexicalSense resource class as defined in the gold namespace.
These namespace prefixes are only abbreviations with local scope. They must be mapped by means of an XML namespace declaration to the URI for the semantic schema being referenced. For instance, rdfs is meant as an abbreviation for the RDF Schema namespace whose URI is http://www.w3.org/2000/01/rdf-schema#. Thus, the following namespace declaration must be given:
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
With this declaration in place, a document interpreter can resolve the concept named in concept="rdfs:label" to its globally unique URI, namely, http://www.w3.org/2000/01/rdf-schema#label.
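The resolution step just described amounts to replacing the prefix with the namespace URI it is bound to. A minimal sketch, assuming the namespace declarations have already been collected into a dictionary (the function name `resolve_qname` is hypothetical):

```python
def resolve_qname(qname, namespaces):
    """Expand a prefixed QName such as 'rdfs:label' to a full URI.

    `namespaces` maps each declared prefix to its namespace URI.
    """
    prefix, colon, local_name = qname.partition(":")
    if not colon or not prefix or not local_name:
        raise ValueError("not a prefixed QName: %r" % qname)
    if prefix not in namespaces:
        raise ValueError("undeclared namespace prefix: %r" % prefix)
    return namespaces[prefix] + local_name


namespaces = {"rdfs": "http://www.w3.org/2000/01/rdf-schema#"}
print(resolve_qname("rdfs:label", namespaces))
```

Raising an error for an undeclared prefix, rather than passing the value through, is a design choice of this sketch; the specification does not say how a document interpreter should react.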
The following subsections describe and illustrate each of the XML elements that comprise the metaschema language.
The root element of a metaschema document is <metaschema> which is defined as follows:
<!ELEMENT metaschema (interpret | ignore)+ > |
The only elements it may contain are directives to <ignore> markup and to <interpret> markup.
The root element, <metaschema>, is also where the namespaces for the target semantic schemas used in the metaschema should be declared (in accordance with the [XML-Names] standard). For instance, the following metaschema maps markup onto concepts defined in the RDF Schema namespace and the General Ontology for Linguistic Description namespace.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE metaschema SYSTEM "..\metaschema.dtd">
<metaschema xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
 xmlns:gold="http://www.emeld.org/GOLD-namespace#">
<!-- Directives to ignore and interpret markup -->
</metaschema>
The <ignore> element declares that matching elements from the markup vocabulary (as well as their descendants) should simply be ignored. It is defined as follows:
<!ELEMENT ignore (resource | property | literal)* >
<!ATTLIST ignore
markup CDATA #REQUIRED>
|
The result is that the document interpreter passes over the identified markup elements (including all of their attributes and embedded elements) without producing any output for the semantic interpretation. The <ignore> directive is allowed to have the same child elements as the <interpret> directive, but they are completely ignored. The purpose of this feature is to allow the metaschema developer to temporarily turn off an <interpret> directive by changing it to <ignore>.
The <interpret> element declares that matching elements or attributes in the source document should be translated into the semantic interpretation specified by the content of the <interpret> element. It is defined as follows:
<!ELEMENT interpret (resource | literal | property)* >
<!ATTLIST interpret
markup CDATA #REQUIRED>
|
The result is that the document interpreter processes the matching markup elements or attributes in the source document and produces the interpretation indicated by the elements embedded within <interpret>. (See The markup attribute above for a discussion of the expression language used for matching markup elements and attributes.)
The <interpret> directive may have empty content. This indicates that the matched markup elements do not contribute anything to the semantic interpretation, but that interpretation of the source document should proceed with the child elements and attributes. This is in contrast to <ignore> which blocks further processing of child elements and attributes.
The following sample source document is used to illustrate the difference between <ignore> and an empty <interpret> directive:
Source Document:
<!DOCTYPE body>
<body>
<entry><!-- entry 1 --></entry>
<entry><!-- entry 2 --></entry>
<entry><!-- entry 3 --></entry>
</body>
The following metaschema defines interpretations for both element types in the document:
Metaschema 1:
<metaschema xmlns:ss="SemanticSchemaNamespace">
<interpret markup="body">
<resource concept="ss:Whole"/>
</interpret>
<interpret markup="entry"><!-- Interpretation directives --></interpret>
</metaschema>
Interpretation 1:
<rdf:RDF xmlns:ss="SemanticSchemaNamespace">
<ss:Whole rdf:about="#element(/1)">
<!-- Interpretation of entry 1 -->
<!-- Interpretation of entry 2 -->
<!-- Interpretation of entry 3 -->
</ss:Whole>
</rdf:RDF>
|
If one does not wish to interpret the <body> element as being a resource in its own right, but would rather interpret the document as being a set of entries, then an empty <interpret> directive is used to simply pass through <body>, as shown in metaschema 2:
Metaschema 2:
<metaschema xmlns:ss="SemanticSchemaNamespace">
<interpret markup="body"/>
<interpret markup="entry"><!-- Interpretation directives --></interpret>
</metaschema>
Interpretation 2:
<rdf:RDF xmlns:ss="SemanticSchemaNamespace">
<!-- Interpretation of entry 1 -->
<!-- Interpretation of entry 2 -->
<!-- Interpretation of entry 3 -->
</rdf:RDF>
If <ignore> were used instead, as in metaschema 3 below, the result would be an empty interpretation since processing of all child elements is blocked.
Metaschema 3:
<metaschema xmlns:ss="SemanticSchemaNamespace">
<ignore markup="body"/>
<interpret markup="entry"><!-- Interpretation directives --></interpret>
</metaschema>
Interpretation 3:
<rdf:RDF xmlns:ss="SemanticSchemaNamespace">
</rdf:RDF>
A similar result is achieved if the metaschema simply says nothing about <body>, as in metaschema 4 below. In this case as well, the document interpreter blocks on <body> and does not process the child elements. The result differs, however, in that the document interpreter inserts a warning (as a comment) into the interpretation. Thus, if <ignore> and empty <interpret> are used to handle elements that deliberately do not contribute to the interpretation, then the metaschema developer can search for comments in the resulting interpretation in order to discover markup elements that have inadvertently failed to be accounted for.
Metaschema 4:
<metaschema xmlns:ss="SemanticSchemaNamespace">
<interpret markup="entry"><!-- meaning --></interpret>
</metaschema>
Interpretation 4:
<rdf:RDF xmlns:ss="SemanticSchemaNamespace">
<!-- Warning: No directive for body -->
</rdf:RDF>
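The four behaviors contrasted above (a non-empty <interpret>, an empty <interpret>, an <ignore>, and no directive at all) can be summarized in a short Python sketch. It is a deliberate simplification and not the reference implementation: directives are looked up by element name rather than by XPath expression, the output is a flat list of strings rather than nested RDF, and the names `IGNORE`, `walk`, and `run` are hypothetical.

```python
import xml.etree.ElementTree as ET

IGNORE = object()  # marker standing in for an <ignore> directive

SOURCE = "<body><entry/><entry/><entry/></body>"


def walk(elem, directives, out):
    """Apply name-keyed directives to elem and its descendants.

    directives maps an element name to IGNORE, to "" (an empty
    <interpret>), or to a label standing in for a real interpretation.
    """
    if elem.tag not in directives:
        # No directive at all: warn, and do not process the children.
        out.append("Warning: No directive for " + elem.tag)
        return
    rule = directives[elem.tag]
    if rule is IGNORE:
        return  # the element and all of its descendants are skipped
    if rule:
        out.append(rule)  # a non-empty <interpret> contributes output
    for child in elem:  # an empty <interpret> still reaches the children
        walk(child, directives, out)


def run(directives):
    out = []
    walk(ET.fromstring(SOURCE), directives, out)
    return out


print(run({"body": "Whole", "entry": "entry"}))  # like metaschema 1
print(run({"body": "", "entry": "entry"}))       # like metaschema 2
print(run({"body": IGNORE, "entry": "entry"}))   # like metaschema 3
print(run({"entry": "entry"}))                   # like metaschema 4
```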
The <resource> element declares that the interpretation of the source markup requires the creation of an RDF resource at this point. It is defined as follows:
<!ELEMENT resource (literal | property | embed)*>
<!ATTLIST resource
concept CDATA #REQUIRED>
|
When the document interpreter applies this directive, it creates a resource of the type named in the concept attribute and places it in the current context of the semantic interpretation. (See The concept attribute above for a discussion of concept identification.)
Rather than creating anonymous nodes in the RDF graph, a document interpreter should, whenever possible, use the rdf:about attribute to specify a URI that links the resource in the semantic interpretation back to an element in the source document. The reference implementation uses URIs based on the XPointer element() scheme [XPointer] for this purpose.
The following example illustrates the effect of processing a <resource> directive:
Document context:
<entry id="aba">
<!-- Contents of entry -->
</entry>
Metaschema directive:
<interpret markup="entry">
<resource concept="gold:LexicalItem"/>
</interpret>
Resulting interpretation:
<gold:LexicalItem rdf:about="#element(aba)">
<!-- Interpretation of contents of entry -->
</gold:LexicalItem>
|
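One way to derive such rdf:about values is sketched below. It is an assumption-laden illustration rather than a description of the reference implementation: the function name `element_uri` is hypothetical, an attribute named `id` is simply assumed to be an XML ID (in general that depends on the markup schema), and elements without an ID fall back to an element() child sequence, in which `/1` denotes the document element.

```python
import xml.etree.ElementTree as ET


def element_uri(root, target, id_attr="id"):
    """Return an element()-scheme fragment identifying target within root."""
    ident = target.get(id_attr)
    if ident:
        return "#element(" + ident + ")"
    # Map every element to its parent so we can climb from target to root.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # Child sequences count element children starting from 1.
        steps.append(list(parent).index(node) + 1)
        node = parent
    steps.append(1)  # the document element itself
    return "#element(" + "".join("/%d" % n for n in reversed(steps)) + ")"


doc = ET.fromstring('<body><entry id="aba"/><entry><form/></entry></body>')
print(element_uri(doc, doc))        # the document element
print(element_uri(doc, doc[0]))     # an element with an id attribute
print(element_uri(doc, doc[1][0]))  # an element located by child sequence
```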
The <literal> element declares that the interpretation of the source markup requires the creation of an RDF property with a literal value at this point. It is defined as follows:
<!ELEMENT literal (text-content)* >
<!ATTLIST literal
concept CDATA #REQUIRED>
|
When the document interpreter applies this directive, it creates a property of the type named in the concept attribute and places it in the current context of the semantic interpretation. (See The concept attribute above for a discussion of concept identification.) As the object of that property it creates a literal string corresponding to the contents of the current element or attribute of the source document.
The following example illustrates the effect of processing a <literal> directive:
Document context:
<def>a group of small fish (<i>malau seli</i>) when seen inside the reef</def>
Metaschema directive:
<interpret markup="def">
<literal concept="gold:definition"/>
</interpret>
Resulting interpretation:
<gold:definition>a group of small fish (malau seli) when seen inside the reef
</gold:definition>
|
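As the example shows, embedded markup such as <i> does not survive into the literal; only the text does. In Python's standard `xml.etree.ElementTree` module, that text can be obtained by concatenating all text nodes of the element, as in this small sketch (the helper name `string_value` is hypothetical):

```python
import xml.etree.ElementTree as ET


def string_value(elem):
    """Concatenate all text inside elem, dropping any embedded tags."""
    return "".join(elem.itertext())


definition = ET.fromstring(
    "<def>a group of small fish (<i>malau seli</i>) "
    "when seen inside the reef</def>"
)
print(string_value(definition))
```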
The <text-content> element may only occur embedded within a <literal> directive and is used to generate text content to include within that literal value. The <text-content> element declares that a specific literal text string or the text content of specified source markup (with optional label or punctuation) should be inserted into the literal value at this point. It is defined as follows:
<!ELEMENT text-content (#PCDATA)>
<!ATTLIST text-content
markup CDATA #IMPLIED
before CDATA #IMPLIED
after CDATA #IMPLIED>
|
When the document interpreter applies this directive, it creates a literal string. If the <text-content> element has content, then that content is the value of the generated literal string. If the <text-content> element is empty, then the value of the literal string comes from the source document; it is the text content of the element or attribute identified by the markup attribute of <text-content>, or if that attribute is absent, of the currently matching element or attribute. Optional before and after attributes may be used to specify strings (such as labels or punctuation) that are affixed to either side of the generated content, but only if there is a content value.
The following example illustrates the three main cases of the <text-content> directive: with content, with empty content and an explicit markup attribute, and with empty content and no markup attribute. The example shows <text-content> being used to construct an rdfs:comment from various sources.
Document context:
<note>a very interesting example</note>
<note type="source">Encyclopaedia Britannica</note>
Metaschema directive:
<interpret markup="note">
<literal concept="rdfs:comment">
<text-content>Note</text-content>
<text-content markup="@type" before=" [" after="]"/>
<text-content before=": "/>
</literal>
</interpret>
Resulting interpretation:
<rdfs:comment>Note: a very interesting example</rdfs:comment>
<rdfs:comment>Note [source]: Encyclopaedia Britannica</rdfs:comment>
|
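The rules for <text-content> can be checked against this example with a short Python sketch. It is an illustration under stated assumptions, not the reference implementation: the function names are hypothetical, and the markup attribute is supported only in the form `@name` (an attribute of the current element), since the standard `xml.etree.ElementTree` module cannot evaluate full XPath.

```python
import xml.etree.ElementTree as ET

LITERAL = ET.fromstring("""
<literal concept="rdfs:comment">
  <text-content>Note</text-content>
  <text-content markup="@type" before=" [" after="]"/>
  <text-content before=": "/>
</literal>
""")


def text_content_value(directive, current):
    """Evaluate one <text-content> directive against the current element."""
    if directive.text:
        value = directive.text  # explicit content wins
    else:
        markup = directive.get("markup")
        if markup is None:
            value = "".join(current.itertext())  # text of current element
        elif markup.startswith("@"):
            value = current.get(markup[1:], "")  # named attribute, if any
        else:
            raise NotImplementedError("only @attribute is handled here")
    if not value:
        return ""  # before/after are added only when there is a value
    return directive.get("before", "") + value + directive.get("after", "")


def build_literal(literal, current):
    """Concatenate the values of all <text-content> children of a <literal>."""
    return "".join(text_content_value(tc, current)
                   for tc in literal.findall("text-content"))


for xml in ['<note>a very interesting example</note>',
            '<note type="source">Encyclopaedia Britannica</note>']:
    print(build_literal(LITERAL, ET.fromstring(xml)))
```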
The <property> element declares that the interpretation of the source markup requires the creation of an RDF property with a non-literal value at this point. That is, the object of the property is another resource. It is defined as follows:
<!ELEMENT property (resource | resourceRef | embed)>
<!ATTLIST property
concept CDATA #REQUIRED>
|
When the document interpreter applies this directive, it creates a property of the type named in the concept attribute and places it in the current context of the semantic interpretation. (See The concept attribute above for a discussion of concept identification.)
The following example illustrates the effect of processing a <property> directive:
Document context:
<senses>
<!-- sense 1 -->
<!-- sense 2 -->
</senses>
Metaschema directive:
<interpret markup="senses">
<property concept="gold:sense"/>
</interpret>
Resulting interpretation:
<gold:sense>
<!-- Interpretation of sense 1 -->
<!-- Interpretation of sense 2 -->
</gold:sense>
|
It is more typical that markup elements do not represent just a property, but both a property and the resource that is its object. This is the case, for instance, when a dictionary uses <sense> elements without embedding them in a <senses> element. This case is handled by calling for both a property and a resource in the interpretation, as in the following example:
Document context:
<sense>
<!-- part of speech -->
<!-- definition -->
<!-- examples -->
</sense>
Metaschema directive:
<interpret markup="sense">
<property concept="gold:sense">
<resource concept="gold:LexicalSense"/>
</property>
</interpret>
Resulting interpretation:
<gold:sense>
<gold:LexicalSense>
<!-- Interpretation of part of speech -->
<!-- Interpretation of definition -->
<!-- Interpretation of examples -->
</gold:LexicalSense>
</gold:sense>
|
There is no limit to the number of directives that can be part of the interpretation of a markup element. For instance, in the following example, a part of speech element is interpreted as a property, embedding a resource, embedding a literal property:
Document context:
<pos>noun</pos>
Metaschema directive:
<interpret markup="pos">
<property concept="gold:category">
<resource concept="gold:SyntacticCategory">
<literal concept="rdfs:label"/>
</resource>
</property>
</interpret>
Resulting interpretation:
<gold:category>
<gold:SyntacticCategory>
<rdfs:label>noun</rdfs:label>
</gold:SyntacticCategory>
</gold:category>
|
The <resourceRef> element may only occur embedded within a <property> directive and is used to make a reference to an existing RDF resource. It declares that the object of that property is the resource that corresponds to the given XML ID. It is defined as follows:
<!ELEMENT resourceRef EMPTY>
<!ATTLIST resourceRef
markup CDATA #REQUIRED>
|
The value of the markup attribute is an XPath expression that selects the XML ID of the source document element that is the object of the containing property. When the document interpreter applies this directive, it applies the XPath expression to get the XML ID of the target element, and then enters an rdf:resource attribute with the URI that corresponds to the target element.
The following example illustrates the effect of processing a <resourceRef> directive:
Document context:
<xref type="synonym">
<ptr target="aba"/>
</xref>
Metaschema directive:
<interpret markup="xref[@type='synonym']">
<property concept="gold:synonym">
<resourceRef markup="ptr/@target"/>
</property>
</interpret>
Resulting interpretation:
<gold:synonym rdf:resource="#element(aba)"/>
|
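The lookup performed for <resourceRef> can be sketched as follows. This is a hypothetical helper, not the reference implementation: it handles only expressions of the form `path/@attribute` or `@attribute`, because the standard `xml.etree.ElementTree` module cannot select attributes directly, and it assumes that the referenced ID maps to an element() URI in the same way as in the example above.

```python
import xml.etree.ElementTree as ET


def resource_ref(current, markup):
    """Return the rdf:resource value selected by a simple markup expression.

    Returns None when the element or the attribute is not found.
    """
    path, _, attr = markup.rpartition("@")
    path = path.rstrip("/")
    # An empty path means the attribute is on the current element itself.
    node = current.find(path) if path else current
    if node is None or attr not in node.attrib:
        return None
    return "#element(" + node.attrib[attr] + ")"


xref = ET.fromstring('<xref type="synonym"><ptr target="aba"/></xref>')
print(resource_ref(xref, "ptr/@target"))
```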
The <embed> element may occur only within a <resource> or <property> directive. It declares that the contents of that resource or property in the semantic interpretation should be the interpretation of the source document elements or attributes identified in the markup attribute of the <embed> directive. It is defined as follows:
<!ELEMENT embed (resource | literal | property)* >
<!ATTLIST embed
markup CDATA #IMPLIED>
|
When the document interpreter applies this directive, it executes the XPath expression in the markup attribute against the current source document context in order to select the elements or attributes that the directive references. It then embeds the interpretation of those elements or attributes. If the <embed> directive has no other directives embedded within it, then the subelements in the source document are interpreted by means of the matching <interpret> directives. However, if the <embed> directive contains other directives, the subelements in the source document are interpreted in terms of those directives.
By default, a <resource> or <property> directive that has empty content embeds the interpretations of all source document elements that are within the content of the element matched by the <interpret> directive. The <embed> directive is used to override this default behavior. There are two major applications of this functionality. One is when the resulting interpretation is a complex structure and some of the child elements in the source document go in one part of it while others go in another; the other is when elements of the source document must be "moved" so that their interpretation goes under an element that they are not part of in the source document.
The first example below shows a case of building a complex interpretation with multiple parts. The source document has subentries embedded within the entries of a lexicon, but subentries have a very limited description, allowing only a form, a part of speech, and a definition. A subentry is to be interpreted as a LexicalItem that is related to the LexicalItem for the main entry. The semantic schema for LexicalItem has a form property and a sense property. The latter has LexicalSense as its object, and part of speech and definition are treated as properties of the LexicalSense. Thus the metaschema must interpret <subentry> as a complex structure that places the form in one place and the part of speech and definition in another. The following example shows how the <embed> directive is used to accomplish this result:
Document context:
<subentry>
<form><!-- form --></form>
<pos><!-- pos --></pos>
<def><!-- def --></def>
</subentry>
Metaschema directive:
<interpret markup="subentry">
<property concept="gold:relatedLexicalItem">
<resource concept="gold:LexicalItem">
<embed markup="form"/>
<property concept="gold:sense">
<resource concept="gold:LexicalSense">
<embed markup="pos | def"/>
</resource>
</property>
</resource>
</property>
</interpret>
Resulting interpretation:
<gold:relatedLexicalItem>
<gold:LexicalItem>
<!-- Interpretation of form -->
<gold:sense>
<gold:LexicalSense>
<!-- Interpretation of pos -->
<!-- Interpretation of def -->
</gold:LexicalSense>
</gold:sense>
</gold:LexicalItem>
</gold:relatedLexicalItem>
|
The next example illustrates the movement of a source document element. The source document has subentries following the sense they pertain to, but the semantic schema treats relatedLexicalItem as a property of a LexicalSense. Thus in building the semantic interpretation, the subentry must in essence be moved under the sense. This is achieved by instructing the document interpreter to <ignore> subentry when it is encountered in its original position, and then to <embed> it in the interpretation of sense. The markup attribute uses an XPath expression called a location path to find the subentry (which happens to be a following sibling).
Document context:
<entry>
<form><!-- form 1 --></form>
<sense><!-- sense 1 --></sense>
<subentry>
<form><!-- form 2 --></form>
<sense><!-- sense 2 --></sense>
</subentry>
</entry>
Metaschema directives:
<interpret markup="entry">
<resource concept="gold:LexicalItem"/>
</interpret>
<interpret markup="form"><!-- Directives --></interpret>
<ignore markup="subentry"/>
<interpret markup="sense">
<property concept="gold:sense">
<resource concept="gold:LexicalSense">
<embed markup="*"/>
<embed markup="following-sibling::subentry">
<property concept="gold:relatedLexicalItem">
<resource concept="gold:LexicalItem">
<embed markup="*"/>
</resource>
</property>
</embed>
</resource>
</property>
</interpret>
Resulting interpretation:
<gold:LexicalItem>
<!-- Interpretation of form 1 -->
<gold:sense>
<gold:LexicalSense>
<!-- Interpretation of sense 1 children -->
<gold:relatedLexicalItem>
<gold:LexicalItem>
<!-- Interpretation of form 2 -->
<gold:sense>
<gold:LexicalSense>
<!-- Interpretation of sense 2 children -->
</gold:LexicalSense>
</gold:sense>
</gold:LexicalItem>
</gold:relatedLexicalItem>
</gold:LexicalSense>
</gold:sense>
</gold:LexicalItem>
|
Actually the <embed markup="*"> in the last example should not be necessary, but check it out in the implementation to make sure.
What about the RDF containers (bag, sequence, alternative)? Do I need something like <container type="bag | seq | alt"> to create one of those?
Fix compile_ms to ensure that no two resources have the same rdf:about. And what about duplicate resources and properties? Since the RDF graph is a set of statements, it ought to be okay, but check it out.
Make sure that in the <senses> example, the conversion from XML/RDF to triples generates multiple instances of the sense property. (N.B. The DTD for property doesn't allow repeating inside; should it?)
At this point, PCDATA element content and CDATA attribute values can be mapped to semantic concepts by matching them literally in a @markup spec, e.g. pos[.='noun'] or re[@type='synonym']. That is enough power to handle the need, but it might be nice to develop a more expressive way to handle the mapping of string content to concepts.
I've been using "element or attribute" throughout. Consider changing these to "document node". Would "current document node" work to describe the current context matched by <interpret>?
In the XSLT documentation, find out what the text content of a node (generated by select=".") is called and use that name and definition where appropriate.
Note that <text-content> cannot be used to build arbitrarily complex comments since there is no embedding or no interpret directives containing just <text-content>. We would need one or the other (but if one, then both for the sake of symmetry) to handle the general case. E.g. serializing an <fs> with arbitrary nesting into a comment string would be a nice example.
| [OWL] | OWL Web Ontology Language Reference. W3C Working Draft 31 March 2003. <http://www.w3.org/TR/owl-ref/> |
| [RDF] | Resource Description Framework (RDF) Model and Syntax Specification. W3C Recommendation 22 February 1999. <http://www.w3.org/TR/REC-rdf-syntax/> |
| [RDFS] | RDF Vocabulary Description Language 1.0: RDF Schema. W3C Working Draft 23 January 2003. <http://www.w3.org/TR/rdf-schema/> |
| [XML] | Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation 6 October 2000. <http://www.w3.org/TR/REC-xml> |
| [XML-Names] | Namespaces in XML. W3C Recommendation 14 January 1999. <http://www.w3.org/TR/REC-xml-names/> |
| [XPath] | XML Path Language (XPath) Version 1.0. W3C Recommendation 16 November 1999. <http://www.w3.org/TR/xpath> |
| [XPointer] | XPointer element() Scheme. W3C Recommendation 25 March 2003. <http://www.w3.org/TR/xptr-element/> |
| [XSD] | XML Schema Part 1: Structures. W3C Recommendation 2 May 2001. <http://www.w3.org/TR/xmlschema-1/> |