An expanded version of a paper originally presented at the:
E-MELD Symposium on "Endangered Data vs. Enduring Practice,"
Linguistic Society of America annual meeting
8-11 January 2004, Boston, MA
Incomplete Draft: Half of this has been turned into a paper, but the rest is still glorified speaker’s notes. Not ready to be linked to (though the oral presentation can be cited).
Gary F. Simons
SIL International
One of the great ironies of writing technology is that as technologies for writing become more advanced, the products of writing become less durable. The most enduring written records from antiquity are those that were carved into stone or pressed into kiln-baked clay tablets. Writing on vellum and papyrus was a great advance in that the process was faster and the resulting product was much less bulky; but it was also a step backwards on the durability scale since the medium could be destroyed by fire or by water or even by microbes. With the modern use of paper, writing has advanced further, but it has become less durable still as the chemicals used in the manufacture of paper can cause the medium to deteriorate from within, even in the best of storage conditions.
To complete the trend, digital word processing, which is our most advanced writing technology to date, is also the most ephemeral. Whereas ink on acid-free paper will endure for centuries, the longevity of digital storage media is an order of magnitude shorter. The industry’s early answer to long-term digital storage was magnetic tape, but this has proved to have a life expectancy of only 10 to 20 years (Van Bogart 1995). The current answer, CD-R, fares better but is still ephemeral from an archival point of view. Manufacturers report that CD-R discs should have a life expectancy of 100 to 200 years, but independent tests conducted at the National Institute of Standards and Technology found the life expectancy of the CD-R discs they tested to be 30 years (Byers 2003:13). The CD-RW medium is significantly less stable; the manufacturers predict a life expectancy of only 25 years. (Note: If you want to understand how CD and DVD technologies work and how the media deteriorate over time, Byers 2003 is an excellent source.)
But the problem is even worse than this, because the hardware devices that read these media become obsolete long before the media reach the end of their life expectancy. For instance, in the last 25 years we have seen removable media on personal computers advance from 8-inch floppies to 5.25-inch floppies to 3.5-inch floppies to Zip drives to CD-Rs to DVD-Rs. Unless one is diligent about migrating all of one’s legacy data to new media each time a new technology takes hold, those data will soon become trapped on media that no available hardware can read.
And the problem is worse yet, because software is changing, too. Though software technology is not advancing as quickly as hardware technology, the effect of software change is more devastating since the migration strategy that works for keeping data files accessible on the latest media cannot ensure that the files remain usable. This is because the functionality associated with those files is tied to particular software, and when the hardware that ran the needed software ceases to be available, then the functionality associated with those files ceases to exist. The fact that software vendors may change the file formats and functionality with each new version of software only exacerbates the problem.
When the results of our word processing are entrusted to the proprietary formats of a single software vendor, then we are completely at the mercy of that vendor as to whether our work will survive into the future. For instance, the author has a number of books and articles that were produced with Microsoft Word in the 1980s. The data files have been faithfully migrated over the years so that they remain readable today. However, current versions of Word no longer support the file format, so that the documents can no longer be rendered. The text stream can still be retrieved with any plain text editor since the characters are encoded with the ASCII standard, but the formatting and layout are encoded in a proprietary binary format and thus are completely lost in the absence of software that understands that format.
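The kind of text-stream recovery described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `extract_ascii_runs`, the minimum run length of four characters, and the sample bytes are all hypothetical, and nothing here reflects the actual layout of any word processor's file format. It simply scans a binary blob for runs of printable ASCII characters.

```python
import re


def extract_ascii_runs(data: bytes, min_length: int = 4) -> list[str]:
    """Return every run of printable ASCII found in a binary blob.

    Printable ASCII is taken to be bytes 0x20-0x7E plus the tab character.
    Runs shorter than min_length are skipped, since short runs are usually
    binary noise rather than text.
    """
    pattern = re.compile(
        rb"[\x20-\x7E\t]{" + str(min_length).encode("ascii") + rb",}"
    )
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    # Hypothetical stand-in for a legacy binary document: readable text
    # interleaved with bytes that only the original software understood.
    blob = (
        bytes([0x00, 0x01, 0x9F])
        + b"To be, or not to be"
        + bytes([0x00, 0x03, 0xFF])
        + b"that is the question"
        + bytes([0x07])
    )
    for run in extract_ascii_runs(blob):
        print(run)
```

What comes back is the wording alone. The bytes between the runs, which in a real file would carry the formatting and layout, are skipped because nothing in the file explains what they mean.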
The phenomenon of digital data loss has become so prevalent that many are beginning to warn of an impending “digital dark age,” the idea that historians of the future will look back on our present age as another Dark Ages, since so much important information documenting our current civilization is recorded digitally and will have vanished (Bergeron 2002; Deegan and Tanner 2002). The Long Now Foundation (n.d.) maintains a library of well-publicized stories of digital data loss in high-profile institutions like the BBC and NASA. A recent Associated Press story quotes a technologist in the MIT library describing a state of affairs that hits closer to home for the typical academic (Jesdanun 2003):
Every now and then, a faculty member would come in in tears having some boxes of completely unreadable tapes—they've lost their life's work.
The bottom line is that in these days of short-lived computer media, hardware, and software, linguists need to be particularly careful about the way they use digital technologies lest their work be lost within a decade or two. In the absence of such diligence, our digital data records are even more endangered than the languages we are seeking to document.
To ensure that digital data endure, a linguist should do two things: (1) put the materials into an enduring file format, and (2) deposit the materials with an archive that will make a practice of migrating them to new storage media as needed. This paper addresses the first of these issues.
When considering various file formats, we can contrast three classes of forms by their functions.
Working form
The form in which information is stored as it is created and edited.
Presentation form
The form in which information is presented to the public.
Archival form
The form in which information is stored for access long into the future.
Armed with these definitions we can address the fundamental problem, namely, that popular working forms (like Microsoft Word and database applications) are not suitable archival forms. Nor are popular presentation forms (like dynamic web pages) suitable archival forms. But linguists tend to focus on working form and presentation form when they think about using digital technologies; instead, they must look beyond these forms to the archival form if they want to create work that will endure.
I now define and illustrate three levels of practice with respect to archival form.
First, there is Unacceptable Practice, in which the form that is archived is a binary working form that requires a specific piece of software. This could include a favorite commercial format, like a Word document, an Excel spreadsheet, a PowerPoint presentation, or an Access database. Equally problematic is a format that is supported only by homemade software. In either case, the information will cease to be available when the required software ceases to work on the hardware that is currently in use.
Next, there is Minimally Acceptable Practice, in which the form that is archived is a presentation form based on an open format supported by multiple vendors. HTML is probably the best known example of this; PDF is another. Even though the latter is Adobe’s format, Adobe has openly published the specification so that multiple vendors (including open source projects) have written tools that support it.
The good news with this approach is that a snapshot of how you presented the information will persist well into the future since the multiple software vendors are likely to keep the format alive. However, the bad news is that it is a dead-end format: you have just enshrined one particular presentation of the information, and the information is not repurposeable. It is not in a form that can be loaded into another program for fresh analysis. It is not in a form that can be used to create a different way of presenting the same information.
In contrast to these we have Best Practice, in which the form that is archived preserves all of the information (including its structure) in such a way that it is portable and repurposeable. The best format available for achieving this purpose is “Descriptive XML markup.” An XML archival form is not a dead end since it may be reloaded into a working form, and it may be used to generate new presentation forms.
Now I will illustrate these three kinds of practice with a sample from a dictionary of Sikaiana, in Solomon Islands. The slide shows a typical presentation form for three entries from the dictionary.
First we illustrate Unacceptable Practice. If we had developed this dictionary with Word as the working tool, and then archived the .DOC file, future generations who no longer have our current version of Word will see this when they open the file with a plain text editor:
Next is Minimally Acceptable Practice. If we developed the dictionary with a database tool, and then archived an HTML presentation of it, future generations would see something like this when they open the file with a plain text editor.
You can see that this is a vast improvement. It uses XML-style markup to identify paragraphs (with the <P> tag), boldface (with the <B> tag), and italics (with the <I> tag). Future generations could easily decipher this markup to recreate the presentation.
But Best Practice would be to archive the information in our dictionary database in descriptive XML markup. In descriptive markup, the markup tags describe not the presentation formatting, but the structure and function of the information elements. If we use this approach, this is what future generations will see:
It is clear that future generations (even though they lack our current working tools) will be able to see and understand all of the information that was in our dictionary database, to transform it into whatever form is needed to load it into their database or other working tools, and to reuse the information to create new and up-to-date presentation forms.
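To make the idea of repurposing concrete, here is a minimal Python sketch, offered as an illustration under stated assumptions rather than as the method used for the Sikaiana dictionary. The element names (`dictionary`, `entry`, `headword`, `pos`, `gloss`) and the two entries are hypothetical placeholders, not data from the dictionary. The sketch reloads a descriptive XML file into a simple working form (a list of records) and then generates one possible presentation form from it, using the same <P>, <B>, and <I> tags discussed above.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical archival form. The tag names and the entries are invented
# placeholders for illustration only.
ARCHIVED_XML = """<dictionary>
  <entry>
    <headword>alpha</headword>
    <pos>n</pos>
    <gloss>first placeholder gloss</gloss>
  </entry>
  <entry>
    <headword>beta</headword>
    <pos>v</pos>
    <gloss>second placeholder gloss</gloss>
  </entry>
</dictionary>"""


def load_entries(xml_text: str) -> list[dict[str, str]]:
    """Reload the archival form into a working form: one dict per entry."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in entry}
        for entry in root.iter("entry")
    ]


def to_html(entries: list[dict[str, str]]) -> str:
    """Generate one presentation form: a paragraph per entry."""
    lines = []
    for entry in entries:
        lines.append(
            "<P><B>{}</B> <I>{}</I> {}</P>".format(
                escape(entry["headword"]),
                escape(entry["pos"]),
                escape(entry["gloss"]),
            )
        )
    return "\n".join(lines)


if __name__ == "__main__":
    records = load_entries(ARCHIVED_XML)
    print(to_html(records))
```

Because the tags say what each piece of information is rather than how it should look, a different `to_html` (or a loader for a future database) can be written without touching the archived file.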
After hearing this pitch, a number of colleagues have responded with a question like this: “Isn’t XML just another one of those ephemeral file formats?” To this I answer, “No! It’s as rock solid as ASCII.” ASCII (or the American Standard Code for Information Interchange) is the standard that says, for instance, that the number 65 will be used to represent a capital A in a digital data stream. ASCII was adopted in 1963 (Brandel 1999), and 40 years later it is at the heart of operating systems, email, the web, and much more. This standard is not going to change.
XML uses an ASCII-based notation to essentially extend ASCII by solving two of its inherent limitations. Via Unicode, XML can encode text in virtually any language, not just English, and via tags, XML encodes the structure of information, not just the stream of characters.
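Both points can be checked with a short Python sketch. It is an illustration only: the function name `to_ascii_xml`, the element name `word`, the `lang` attribute, and the sample Greek word are hypothetical choices. The sketch confirms that capital A is number 65, then writes an element containing non-English text as a pure ASCII byte stream and reads it back unchanged.

```python
import xml.etree.ElementTree as ET


def to_ascii_xml(tag: str, text: str, lang: str) -> bytes:
    """Serialize one element as XML using only ASCII bytes.

    Characters outside ASCII are written as numeric character
    references, so the byte stream itself stays plain ASCII.
    """
    element = ET.Element(tag)
    element.set("lang", lang)
    element.text = text
    return ET.tostring(element, encoding="us-ascii")


if __name__ == "__main__":
    # ASCII assigns the number 65 to capital A.
    print(ord("A"), "A".encode("ascii") == bytes([65]))

    # Hypothetical sample: a Greek word stored in a tagged element.
    serialized = to_ascii_xml("word", "λόγος", "el")
    print(serialized)

    # Every byte is in the ASCII range, yet the original text survives.
    print(all(byte < 128 for byte in serialized))
    print(ET.fromstring(serialized).text)
```

The tags and the character references are themselves written in ASCII, which is why a plain text editor of any era can still display the file even when it cannot display the language.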
Another indicator for the near certain longevity of XML is the answer to another frequently asked question, “Is XML really practical, or is it just another nice theory?” Again, the answer to the latter is a resounding “No!” XML has become part of the fabric of the global information infrastructure. It is the centerpiece of a whole family of open standards from the World Wide Web Consortium that has been embraced by all the major software vendors (like Microsoft, IBM, Sun, and Oracle). On top of this, hundreds of small vendors and open-source projects have developed tools that implement the XML family of standards. In short, XML is fueling a level of information interchange and reuse that is unprecedented.
In conclusion, I want to go beyond the question of what a single linguist should do, to ask “What’s linguistics to do?” The E-MELD project believes that the community as a whole needs to recognize the fleeting value of digital working forms and presentation forms and to embrace enduring archival forms. Three steps in this direction would be:
· Grants should require best practice archiving of results, not just “dissemination” of ephemeral presentation forms.
· Our systems of peer review and of tenure and promotion should reward the production of good archival language documentation.
· We need to work in league with libraries and archives.
Only by taking steps like these can we ensure that our digital data will truly endure.
Bergeron, Bryan. 2002. Dark ages II: When the digital data die. Upper Saddle River, NJ: Prentice-Hall.
Brandel, Mary. 1999. 1963: ASCII debuts. Computerworld, 12 April 1999. Available online: http://www.computerworld.com/news/1999/story/0,11280,35241,00.html
Byers, Fred R. 2003. Care and handling of CDs and DVDs: A guide for librarians and archivists. Washington, DC: Council on Library and Information Resources and Gaithersburg, MD: National Institute of Standards and Technology. Available online: http://www.clir.org/pubs/abstract/pub121abst.html and http://www.itl.nist.gov/div895/carefordisc/CDandDVDCareandHandlingGuide.pdf
Deegan, Marilyn and Simon Tanner. 2002. The digital dark ages. Library and Information Update, May 2002 issue. Available online: http://www.cilip.org.uk/update/issues/may02/article2may.html
Jesdanun, Anick. 2003. Coming soon: A digital dark age? Associated Press, New York, 21 January 2003. Available online: http://www.cbsnews.com/stories/2003/01/21/tech/main537308.shtml
Long Now Foundation. n.d. Digital dark age: digital data loss and preservation resources. Available online: http://www.longnow.org/10klibrary/darkage.htm
Van Bogart, John W. C. 1995. Magnetic tape storage and handling: A guide for libraries and archives. Washington, DC: Commission on Preservation and Access and St. Paul, MN: National Media Laboratory. Available online: http://www.clir.org/pubs/reports/pub54/index.html