Using Computers in Linguistics: A Practical Guide

Chapter 1

The Nature of Linguistic Data and the Requirements of a Computing Environment for Linguistic Research

Gary F. Simons
Summer Institute of Linguistics

Online Appendix: Using Databases



Using Databases to Represent Linguistic Data

Database systems can be used to represent linguistic data. This appendix gives pointers regarding three kinds of databases:

Special-purpose linguistic databases

Some linguists have solved the problem of representing linguistic data by building special-purpose linguistic database programs. The following are three programs developed by the Summer Institute of Linguistics that specialize in interlinear text analysis and lexical data management:

  • IT, the Interlinear Text processor, for DOS (1987; version 1.2, 1992) and Macintosh (1988; version 1.01r7, 1992)
  • The Linguist's Shoebox: an integrated data management and analysis tool, for DOS (1990; version 2.0, 1993), Windows (version 3.0, 1996), and Macintosh (version 3.0, 1997)
  • LinguaLinks, an electronic performance support system for field language workers (see Linguistics Workshop for data management tools), for Windows (1996)

Relational databases

Relational databases are the most popular kind of database in use today.

In spite of their popularity, they are found wanting for handling linguistic data. In terms of the requirements proposed in this chapter of the book, relational databases handle well the multidimensional and integrated nature of linguistic data (requirements 4 and 5), but handle poorly the sequential and hierarchical nature of linguistic data (requirements 2 and 3).

Here are some research projects which have extended the relational model to deal with these requirements for handling text:

  • Stonebraker, Michael, Heidi Stettner, Nadene Lynn, Joseph Kalash, and Antonin Guttman. (1983) ‘Document processing in a relational database system,’ ACM Transactions on Office Information Systems, 1(2):143-188.
  • Text/Relational Database Management System Project, Centre for the New Oxford English Dictionary and Text Research, University of Waterloo. Provides PostScript versions of many publications.
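
One way to see why sequence and hierarchy are awkward in the relational model: a table is an unordered set of rows, so the order of words in a sentence and the containment of words within sentences must be encoded explicitly as extra columns and restored at query time. The following is a minimal sketch using Python's built-in sqlite3 module. The table and column names (sentence, word, position) are hypothetical and chosen only for illustration; this is not the approach taken by the projects cited above.

```python
import sqlite3


def build_text_db():
    """Create an in-memory database holding one three-word sentence."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sentence (
            id       INTEGER PRIMARY KEY,
            position INTEGER NOT NULL      -- order of the sentence in the text
        );
        CREATE TABLE word (
            id          INTEGER PRIMARY KEY,
            sentence_id INTEGER NOT NULL REFERENCES sentence(id),  -- hierarchy
            position    INTEGER NOT NULL,                          -- sequence
            form        TEXT NOT NULL
        );
        """
    )
    conn.execute("INSERT INTO sentence (id, position) VALUES (1, 1)")
    # Rows are inserted out of order on purpose: the table itself
    # carries no notion of word order.
    conn.executemany(
        "INSERT INTO word (sentence_id, position, form) VALUES (?, ?, ?)",
        [(1, 3, "barks"), (1, 1, "the"), (1, 2, "dog")],
    )
    return conn


def sentence_words(conn, sentence_id):
    """Return the words of one sentence, ordered by the explicit position column."""
    rows = conn.execute(
        "SELECT form FROM word WHERE sentence_id = ? ORDER BY position",
        (sentence_id,),
    )
    return [form for (form,) in rows]


conn = build_text_db()
print(sentence_words(conn, 1))  # ['the', 'dog', 'barks']
```

Every deeper level of structure (morphemes within words, sentences within paragraphs) would need another table with its own parent reference and position column, which is the kind of bookkeeping the requirements above are concerned with.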

An important notion from relational database theory is normalization. This is the process of organizing a database in such a way that no piece of information occurs more than once in the database.

  • Database Normalization from Reid Software Development
  • Normalization, from Database Management Services, University of Texas at Austin
  • Stages of Normalization, by Oliver Burmeister, Swinburne University of Technology
  • Smith, Henry C. (1985) ‘Database design: composing fully normalized tables from a rigorous dependency diagram.’ Communications of the ACM, 28(8):826-838 (online review) describes an easy-to-use methodology.
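
As a small illustration of normalization, the sketch below stores each part-of-speech label once and lets lexical entries refer to it by key, so that correcting the label touches a single row. It uses Python's built-in sqlite3 module; the schema (part_of_speech, lexeme) and the sample data are assumptions made for the example, not taken from any of the resources listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Each label is stored exactly once...
    CREATE TABLE part_of_speech (
        id    INTEGER PRIMARY KEY,
        label TEXT NOT NULL UNIQUE
    );
    -- ...and every lexeme points at it instead of repeating the text.
    CREATE TABLE lexeme (
        id     INTEGER PRIMARY KEY,
        form   TEXT NOT NULL,
        pos_id INTEGER NOT NULL REFERENCES part_of_speech(id)
    );
    """
)
conn.execute("INSERT INTO part_of_speech (id, label) VALUES (1, 'noun')")
conn.executemany(
    "INSERT INTO lexeme (form, pos_id) VALUES (?, ?)",
    [("dog", 1), ("house", 1)],
)


def entries(conn):
    """Join the two tables back together to list (form, label) pairs."""
    return conn.execute(
        """
        SELECT lexeme.form, part_of_speech.label
        FROM lexeme
        JOIN part_of_speech ON part_of_speech.id = lexeme.pos_id
        ORDER BY lexeme.id
        """
    ).fetchall()


# Changing the label updates one row, yet every entry reflects the change.
cursor = conn.execute("UPDATE part_of_speech SET label = 'n' WHERE id = 1")
rows_changed = cursor.rowcount
print(rows_changed)   # 1
print(entries(conn))  # [('dog', 'n'), ('house', 'n')]
```

Had the label been repeated as text in every lexeme row, the same correction would have required updating each entry, with the risk of leaving some inconsistent.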

Object-oriented databases

Object-oriented databases are a relatively recent development. They have the advantage of inherently supporting all of requirements 2 through 5. This section first defines some of the key concepts of object-oriented databases, with pointers to resources where you can learn more. It then introduces CELLAR, a general-purpose object-oriented database system that has been specifically built to support requirements 1 and 6 as well.

These are some of the key concepts of object-oriented databases:

object-oriented database

A database system which models entities in the real world as objects and follows the object-oriented paradigm of programming.

object-oriented

A modern paradigm of programming which models information in terms of objects. Computation occurs when one object receives a message from another asking it to perform one of its built-in operations. The object-oriented approach, in which the data and the program behavior are encapsulated in the objects, contrasts with the conventional approach to programming, in which a program operates on data which is completely separate.

object

The fundamental unit of information modeling in the object-oriented paradigm. There is a one-to-one correspondence between objects in the data model and the entities in the real world which are being modeled. (This is not true of data modeling in a relational database system; all of the information about a single entity in the real world may be scattered throughout many tables of a normalized database.) An object stores state information (variously called properties, attributes, or instance variables; these are like the fields of a database record). It also stores behavioral information (typically called methods) about what computations can be performed on an instance of the object. The information stored in an object is encapsulated in that it is not visible directly; it can only be seen by sending a message to the object which asks it to perform one of its methods.

object-oriented analysis

The process of analyzing a problem domain in order to build a formal model that can serve as the basis for an object-oriented implementation of it. The main outcome is a description of the classes of objects in the problem domain, along with the properties, behaviors, and relationships of each.
  • Booch, Grady. (1994) Object-oriented analysis and design with applications, 2nd ed. Benjamin/Cummings Publishing Co. (An online overview.)
  • Object Modeling Technique (OMT), described in Rumbaugh, James and others (1991) Object-Oriented Modeling and Design, Prentice Hall. (An online overview.)
  • UML (the Unified Modeling Language) fuses the concepts of Booch and OMT.
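
The notions of object, state, method, and encapsulation defined above can be sketched in a few lines of Python. The class LexicalEntry and its methods are hypothetical names invented for this illustration; they are not part of CELLAR or of any system mentioned here. Note also that Python marks private state only by convention (a leading underscore) rather than enforcing it.

```python
class LexicalEntry:
    """One object per real-world entity: here, a single dictionary entry."""

    def __init__(self, form, gloss):
        # State (properties / attributes / instance variables).
        # The leading underscore signals "not for direct outside access";
        # Python does not enforce this, it is only a convention.
        self._form = form
        self._senses = [gloss]

    def add_sense(self, gloss):
        """Behavior (a method): record an additional sense for this entry."""
        self._senses.append(gloss)

    def citation(self):
        """Behavior (a method): render the entry as a one-line citation."""
        senses = "; ".join(self._senses)
        return f"{self._form}: {senses}"


# "Sending a message" corresponds to calling a method on the object.
entry = LexicalEntry("bank", "financial institution")
entry.add_sense("edge of a river")
print(entry.citation())  # bank: financial institution; edge of a river
```

All information about the entry lives in one object, in contrast to a normalized relational design in which the form and its senses would typically sit in separate tables.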

CELLAR: A multilingual object-oriented database system

CELLAR (Computing Environment for Linguistic, Literary, and Anthropological Research) is a multilingual object-oriented database system that has been developed by the Summer Institute of Linguistics to specifically meet the six requirements for a linguistic computing environment.

CELLAR lies at the heart of SIL's LinguaLinks product, an electronic performance support system (EPSS) for field linguists. It provides both the object database for storing user data and the programming language for implementing the applications to manage and otherwise manipulate the data. CELLAR is not currently packaged as a product in its own right; rather, the full data modeling system and programming language are included as part of the LinguaLinks product.

These are some articles that have been published about CELLAR:

  • Rettig, Marc, Gary F. Simons, and John V. Thomson. (1993) ‘Extended objects.’ Communications of the ACM, 36(8):19-24.
  • Simons, Gary F. (1997) ‘Conceptual modeling versus visual modeling: a technological key to building consensus.’ Computers and the Humanities, 30:303-319. (The original working paper is available online.)
  • Simons, Gary F. and John V. Thomson. (1998) ‘Multilingual data processing in the CELLAR environment.’ In John Nerbonne (ed.), Linguistic Databases. Stanford, CA: Center for the Study of Language and Information, 203-224. (The original working paper is available online.)

 




This page is part of an online appendix for the book Using Computers in Linguistics: A Practical Guide, edited by John M. Lawler and Helen Aristar Dry (Routledge, 1998).

Last modified: January 9, 1999