Databases
Using Databases to Represent Linguistic Data
Database systems can be used to represent linguistic data. This appendix
gives pointers to three kinds of databases: special-purpose linguistic
databases, relational databases, and object-oriented databases.
Special-purpose linguistic databases
Some linguists have solved the problem of representing linguistic data by
building special-purpose linguistic database programs. The following are three
programs developed by the Summer Institute of Linguistics that specialize in
interlinear text analysis and lexical data management:
- IT, the Interlinear Text processor, for
DOS (1987; version 1.2, 1992) and Macintosh (1988; version 1.01r7, 1992)
- The Linguist's Shoebox: an integrated data management
and analysis tool, for DOS (1990; version 2.0, 1993), Windows (version
3.0, 1996), and Macintosh (version 3.0, 1997)
- LinguaLinks, an electronic
performance support system for field language workers (see Linguistics
Workshop for data management tools), for Windows (1996)
Relational databases
Relational databases are the most popular kind of database in use today.
Despite their popularity, they fall short when it comes to handling linguistic
data. In terms of the requirements proposed in this
chapter of the book, relational databases handle the multidimensional
and integrated nature of linguistic data well (requirements 4 and 5), but handle
its sequential and hierarchical nature poorly (requirements
2 and 3).
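The difficulty with sequence and hierarchy can be illustrated with a small sketch. It is not taken from any of the systems cited in this appendix; the table name `unit` and its columns are hypothetical. Because the rows of a relational table have no inherent order or nesting, the sketch assumes that order is recorded in an explicit `position` column and nesting in a `parent_id` column, and that the application rebuilds the structure itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: every unit of text is one row. Sequence and hierarchy
# are not built in, so they are encoded by hand in position and parent_id.
conn.execute(
    "CREATE TABLE unit (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "position INTEGER, kind TEXT, form TEXT)"
)
rows = [
    (1, None, 1, "sentence", None),
    (2, 1, 1, "word", "dogs"),
    (3, 1, 2, "word", "bark"),
    (4, 2, 1, "morpheme", "dog"),
    (5, 2, 2, "morpheme", "-s"),
    (6, 3, 1, "morpheme", "bark"),
]
conn.executemany("INSERT INTO unit VALUES (?, ?, ?, ?, ?)", rows)


def children(parent_id):
    """Return the immediate parts of a unit, sorted by the position column."""
    cur = conn.execute(
        "SELECT id, kind, form FROM unit WHERE parent_id = ? ORDER BY position",
        (parent_id,),
    )
    return cur.fetchall()


def morphemes_in_order(unit_id):
    """Walk the hierarchy depth-first and collect morpheme forms in text order."""
    result = []
    for child_id, kind, form in children(unit_id):
        if kind == "morpheme":
            result.append(form)
        else:
            result.extend(morphemes_in_order(child_id))
    return result


sentence_morphemes = morphemes_in_order(1)
print(sentence_morphemes)
```

In this sketch the ordering and the tree walk live in application code rather than in the data model, which is one way of seeing why requirements 2 and 3 sit awkwardly in a purely relational design.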
Here are some research projects which have extended the relational model to
deal with these requirements for handling text:
- Stonebraker, Michael, Heidi Stettner, Nadene Lynn, Joseph Kalash, and
Antonin Guttman. (1983) Document processing in a relational database
system, ACM Transactions on Office Information Systems,
1(2):143-188.
- Text/Relational Database Management System Project, Centre for the New
Oxford English Dictionary and Text Research, University of Waterloo.
Provides PostScript versions of many publications.
An important notion from relational database
theory is normalization. This is the process of organizing a database so
that no piece of information is stored more than once. The following
resources explain normalization in more detail:
- Database Normalization, from Reid Software Development
- Normalization, from Database Management Services, University of Texas at
Austin
- Stages of Normalization, by Oliver Burmeister, Swinburne University of
Technology
- Smith, Henry C. (1985) Database design: composing fully normalized
tables from a rigorous dependency diagram. Communications of the
ACM, 28(8):826-838 (online
review) describes an easy-to-use methodology.
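As a minimal illustration of normalization (a sketch only, not drawn from any of the resources listed above), consider a flat word list in which the language name is repeated on every row. The table names `language` and `entry` and the sample data are assumptions made for the example. Splitting the data into two tables stores each language name once, and a join recovers the flat view when it is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized input: the language name is repeated on every row.
flat_rows = [
    ("Hund", "dog", "deu", "German"),
    ("Katze", "cat", "deu", "German"),
    ("chien", "dog", "fra", "French"),
]

# Normalized schema (hypothetical names): each language is stored once and
# each entry refers to it by its code.
conn.execute("CREATE TABLE language (code TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE entry (id INTEGER PRIMARY KEY, form TEXT, gloss TEXT, "
    "language_code TEXT REFERENCES language(code))"
)

for form, gloss, code, name in flat_rows:
    # OR IGNORE skips a language that has already been stored.
    conn.execute("INSERT OR IGNORE INTO language VALUES (?, ?)", (code, name))
    conn.execute(
        "INSERT INTO entry (form, gloss, language_code) VALUES (?, ?, ?)",
        (form, gloss, code),
    )

language_count = conn.execute("SELECT COUNT(*) FROM language").fetchone()[0]

# A join reassembles the original flat view on demand.
rejoined = conn.execute(
    "SELECT e.form, e.gloss, l.code, l.name FROM entry e "
    "JOIN language l ON l.code = e.language_code ORDER BY e.id"
).fetchall()
print(language_count, rejoined)
```

After normalization, correcting a language name means changing one row in `language` rather than every entry that mentions it.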
Object-oriented databases
Object-oriented databases are a relatively recent development. They have
the advantage of inherently supporting all of requirements 2 through 5. This
section first defines some of the concepts of object-oriented databases, with
pointers to resources where you can learn more. It then introduces CELLAR, a
general-purpose object-oriented database system that has been specifically
built to support requirements 1 and 6 as well.
These are some of the key concepts of object-oriented databases:
object-oriented database
- A database system which models entities in the real world as objects and
follows the object-oriented paradigm of programming.
object-oriented
- A modern paradigm of programming which models information in terms of
objects. Computation occurs when one object receives a
message from another asking it to perform one of its built-in operations. The
object-oriented approach, in which the data and the program behavior are
encapsulated in the objects, contrasts with the conventional approach to
programming, in which a program operates on data which is completely separate.
object
- The fundamental unit of information modeling in the object-oriented
paradigm. There is a one-to-one correspondence between objects in the data
model and the entities in the real world which are being modeled. (This is not
true of data modeling in a relational database system; all of the information
about a single entity in the real world may be scattered throughout many tables
of a normalized database.) An object stores state
information (variously called properties, attributes, or instance variables;
these are like the fields of a database record). It also stores behavioral
information (typically called methods) about what computations can be performed
on an instance of the object. The information stored in an object is
encapsulated in that it is not visible directly; it can only be seen by sending
a message to the object which asks it to perform one of its methods.
object-oriented analysis
- The process of analyzing a problem domain in order to build a formal model
that can serve as the basis for an object-oriented implementation of it. The
main outcome is a description of the classes of objects in the problem domain,
along with the properties, behaviors, and relationships of each.
- Booch, Grady. (1994) Object-oriented analysis and design with
applications, 2nd ed. Benjamin/Cummings Publishing Co. (An
online overview.)
- Object Modeling Technique (OMT), described in Rumbaugh, James and others
(1991) Object-Oriented Modeling and Design, Prentice Hall. (An
online overview.)
- UML (the Unified
Modeling Language) fuses the concepts of Booch and OMT.
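The notions of object, state, method, and message defined above can be made concrete with a short sketch. It is written in Python rather than in the language of any database system mentioned in this appendix, and the class `Word` with its methods is hypothetical. Note one assumption: Python marks hidden state only by convention (a leading underscore) and does not enforce encapsulation.

```python
class Word:
    """An object modeling one word in a text: state plus behavior."""

    def __init__(self, form, morphemes):
        # State (properties, attributes, instance variables). The leading
        # underscore signals that callers should not read these directly.
        self._form = form
        self._morphemes = list(morphemes)

    # Behavior (methods). Calling a method plays the role of sending the
    # object a message that asks it to perform one of its operations.
    def form(self):
        return self._form

    def morpheme_count(self):
        return len(self._morphemes)

    def gloss_line(self, glosses):
        """Build an interlinear gloss from a morpheme-to-gloss mapping."""
        return "-".join(glosses.get(m, "?") for m in self._morphemes)


word = Word("dogs", ["dog", "s"])
glosses = {"dog": "dog", "s": "PL"}

count = word.morpheme_count()
line = word.gloss_line(glosses)
print(word.form(), count, line)
```

Here one real-world entity, the word, corresponds to one object that holds both its parts and the operations on them, which is the one-to-one correspondence described in the definition of an object.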
CELLAR: A multilingual object-oriented database system
CELLAR (Computing Environment for Linguistic, Literary, and Anthropological
Research) is a multilingual object-oriented database system developed by the
Summer Institute of Linguistics specifically to meet the six requirements for
a linguistic computing environment.
CELLAR lies at the heart of SIL's LinguaLinks product, an electronic
performance support system (EPSS) for field linguists. It provides both the
object database for storing user data and the programming language for
implementing the applications to manage and otherwise manipulate the data.
CELLAR is not currently packaged as a product in its own right; rather, the
full data modeling system and programming language are included as part of the
LinguaLinks product.
These are some articles that have been published about CELLAR:
- Rettig, Marc, Gary F. Simons, and John V. Thomson. (1993) Extended
objects. Communications of the ACM, 36(8):19-24.
- Simons, Gary F. (1997) Conceptual modeling versus visual modeling: a
technological key to building consensus. Computers and the
Humanities, 30:303-319. (The original working paper is available online.)
- Simons, Gary F. and John V. Thomson. (1998) Multilingual data
processing in the CELLAR environment. In John Nerbonne (ed.),
Linguistic Databases. Stanford, CA: Center for the Study of Language and
Information, 203-224. (The original working paper is available online.)