A Digital Library Case Study
University of South Florida
Assignment Number 3
July 10, 2002
Harvey Richmond (Rich) Ackerman
This is the final report for a project to partially satisfy requirements for LIS5937.321C02, “Digital Libraries,” at the University of South Florida. The final assignment was a paper “...dealing with an issue of your choice concerning access to information via the Web.” I chose to explore the technology used to create digital libraries: specifically, the various metadata formats and programming tools used in the creation of digital libraries. The project was envisioned as converting an existing website into a small digital library. As work progressed, focus moved towards technical issues relating to metadata transformation and harvesting rather than simply rebuilding the website. This report describes the components of the project, the tools evaluated and used, and the end results. The final website is at http://www.hray.com/aa
In 1996, I created a website featuring letters written by my grandfather during World War I. He served two tours of duty driving ambulances in France and Italy. His mother typed copies of the letters and saved them in three albums. She also made a fourth album with photographs taken in Italy. (Ackerman 1996) The website consists of an introductory page, an index page, 46 letters from France, 45 letters from Italy, and a couple of miscellaneous pages with images of passports and maps. Some images were scanned and included in web pages. While representative of web design six years ago, the website lacks quite a few elements one would find in a presentation of historical material today:
There was no search capability, nor was the full collection of photographs presented.
While the chronological index is still quite useful, other approaches to the collection could also be envisioned. Organized threads along themes like weapons, place names, battles, and a photo tour could present subsets of the collection to communities interested in these subjects.
Outside historical material could clearly enhance the educational value of the collection by providing context.
During the course of the project, many (but not all) of the initial shortcomings were addressed in the letters collection itself. An online photograph album was created with the entire collection of images. A variety of digital library metadata was created describing the collection. The metadata includes:
Encoded Archival Description Finding Aid for the collection written in XML
EAD Finding Aid in HTML
ASCII files of the original website pages (letters), used in the search engine
Dublin Core metadata for each item in the collection
OAI-MHP 1.1 repository registration
OAI-MHP 2.0 repository and mirror interface
Brief keyword descriptions of many photographs, again for search engine use.
All this is organized in a new website called “Ackerman Archives” online at http://www.hray.com/aa . (Ackerman 2002)
The primary material for this collection is contained in four albums. Three contain chronologically arranged copies of letters written during World War I. The originals of the letters have not been located and are presumed lost. A fourth album contains photographs taken in 1918 at the Italian Front. They are also arranged chronologically, starting with ship transportation to Europe and ending with a photograph of the Statue of Liberty in New York Harbor. Some of the photographs are labeled but many are not.
Secondary material was also used in this project. During an earlier project in 1996, the pages of the correspondence albums were scanned, run through OCR software, edited, and converted to HTML. Some photographs were selected from the photo album and combined with the appropriate text to make more interesting web pages. These HTML files were reused in the new website, referenced from the Encoded Archival Description (EAD) finding aid. That is, the EAD describes each page and has a URL pointing to the corresponding web page. The pages were also redesigned and presented as a separate collection from the “Archives” page. Finally, they were stripped of their HTML to produce raw text. This was loaded into a PostgreSQL database, along with appropriate file names and entry descriptions, and used in the collection's search engine.
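The tag-stripping step (handled by proc.pl, written in Perl, in the original toolchain) can be sketched in Python using the standard library's HTMLParser. This is an illustrative equivalent, not the project's actual program, and the sample page text is invented:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data from an HTML stream, discarding all markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the stripped tags.
        return " ".join(" ".join(self.chunks).split())

def strip_tags(html):
    """Return the plain text of an HTML page, suitable for loading
    into a search-engine database."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

page = "<html><body><h1>Letter 12</h1><p>Dear Mother,</p><p>We reached the front today.</p></body></html>"
print(strip_tags(page))  # Letter 12 Dear Mother, We reached the front today.
```

The extracted text can then be inserted into the database alongside the file name and entry description for each page.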
Images were prepared on an Epson 610 color flatbed scanner using Adobe Photoshop Version 4.0. Scanning was performed at 300 dots per inch with 24 bit color. A higher resolution could be used for true archiving, but I have limited disk space available on my own servers. Full size display JPEG copies of the images were prepared at 72 dpi. Thumbnails were produced as well, shrinking the images to 300 pixels wide, also at 72 dpi. For the photograph album exhibit, individual images were extracted from the page scans.
Two observations were made during this phase that would be applicable in an institutional setting. First, naming conventions for objects are an issue in the field of digital libraries, and the issue crops up even in a project like this. Establishing an institutional guideline for object identification would assure that consistent naming conventions are used across projects. My convention used book and page number in the filename of the object (e.g., b2-p34.txt for letters, bph-p12.img for images). I realized later that this was not ideal, as any change in ordering (due to a missing page being located, for instance, or a simple error when going through the book, or content found on the back of a page) would either introduce ambiguity into the system or require rework of existing names. Ideally, names should be independent of the objects they identify.
Second, time for editing is essential. In spite of previous best efforts, errors in transcription and ordering were found in the secondary material used. Digital archive project plans must include sufficient time for editing and quality control. It took significantly more time for this than I had anticipated.
The next project component was a finding aid following the Encoded Archival Description (EAD) standard. EAD is an SGML-based encoding standard for the description of archival material, used to create registers, indexes, catalogs, and other finding aids at museums, archives, and libraries. The EAD DTD was defined in the mid-1990s, and Version 1.0 of the standard was published in August 1998. Even deciding that EAD was the appropriate format took some time, but I believe the conclusion is correct. A course in cataloging and archiving would have simplified the decision-making process. (Getty 1998, IST 2002)
EAD is a rich and well supported standard. To develop finding aids in this format you need a variety of support files, all of which are available via FTP from the EAD website. “EAD Application Guidelines for Version 1.0” is a lengthy but extremely well written and complete document outlining the steps to creating an EAD finding aid. Even after assembling all the components and reading the documentation, creating an EAD finding aid is still a significant piece of work. EAD is a very flexible format, and elements can be used in many different ways. This is appropriate given the wide range of materials it will be used to describe, but it is a challenge to pick out just the pieces needed for a given collection. I happened to find the New York University page on finding aids (http://www.nyu.edu/library/bobst/collections/findingaids/ead/) and it served as a guide for much of this section's work. (SAA 1998, 1999)
An additional complexity lies in finding and following semantic conventions for entering data into your EAD. Again, this was a totally new world. I managed to find and follow content standards for names, dates, and places. This is an area where more research and consensus building is required to permit greater interoperability in the future. Lack of consistency will hinder the future usability of objects, making them harder to find. (Stedfield)
EAD can be written in either SGML or XML. I had planned on using SGML until I discovered that open source tools were not readily available. At that point I switched over to XML; a wide variety of good tools are available for XML processing. The differences between SGML and XML EADs are relatively minor, and it was easy to make this change. (Getty 2000)
Once you have an EAD encoded, you need to validate it. For this I turned to a tool written by Jim Clark called “nsgmls.” (Clark) This is a parser that validates against a specified DTD, in this case “ead.dtd”; the EAD DTD is maintained by and available from the Library of Congress. I used it from inside XEmacs, a text editor with excellent XML support. Validation ensures that the structure of your XML follows the guidelines allowed in the EAD specification. Note that it does not deal with content, just structure. (Miller)
Finally, once the XML EAD is written, it needs to be processed and a copy rendered in HTML for viewing in a browser. XSLT is the usual way of doing this.
Several XSLT stylesheets are available on the web to use as guidelines, but they all require a large degree of customization. XSLT is a node-traversal language; if you are familiar with a variety of programming languages, it is not too hard to work with. Indeed, any stylesheet you find will need to be customized to match the node design used in your XML, and I found no established “best practices” for this aspect of the work. Jim Clark's “xt” is one of many programs that can perform the transformation. (Clark) After debugging the many problems that arise, you end up with an HTML representation of your finding aid. (Fouvry, Pawson) For this project, an XSLT program, “hlw-ead.xsl,” was written to convert the EAD XML to HTML. It was based on a version from the New York Historical Society, which in turn was based on a version written by Michael Fox. (Stedfield)
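To give a sense of the shape of such a stylesheet, here is a minimal, hypothetical XSLT fragment. It is not the project's hlw-ead.xsl, and the element names assume a simplified EAD with one c02 component per item:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Walk the EAD tree and emit an HTML page listing each c02-level item. -->
  <xsl:template match="/ead">
    <html><body><xsl:apply-templates select=".//c02/did"/></body></html>
  </xsl:template>
  <!-- Render one item: a link to the web page, titled and dated. -->
  <xsl:template match="did">
    <p><a href="{dao/@href}"><xsl:value-of select="unittitle"/></a>
       (<xsl:value-of select="unitdate"/>)</p>
  </xsl:template>
</xsl:stylesheet>
```

A real finding-aid stylesheet must handle many more elements (front matter, biographical notes, container lists), which is why published examples still require heavy customization.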
After creating one EAD finding aid, subsequent ones will be easier to make. Again, institutional standardization would greatly simplify this task: it should be a content challenge, not a programming challenge, to create a new finding aid. I'm sure authoring products are available offering a form-based input system so that librarians and archivists can focus on creating excellent metadata content rather than dealing with the technical issues listed above.
The next major component of a digital library metadata repository is an interface to the Open Archives Initiative Metadata Harvesting Protocol (OAI-MHP) as defined by the Open Archives Initiative (OAI), a major research effort centered at Cornell University and Los Alamos National Laboratory. (Lagoze) It is a content-independent protocol for metadata harvesting. OAI evolved out of the e-journal world, where academics were working on ways of publishing papers electronically. Version 2 of OAI was released on June 14, 2002. For the first half of the semester I worked with Version 1.1, and I had it fully implemented when 2.0 was released; I then added support for Version 2.
A protocol is a language used for communications between computing processes. A widely known protocol is the HyperText Transfer Protocol (HTTP), which web browsers use to communicate with web servers. The OAI protocol was invented to enable communications between servers containing archives of different sorts of data and computers that want access to those archives. It is thus a means for an archive to expose its collection to outside services: the archive provides its own native user interface, and OAI-MHP serves as a secondary interface into its metadata and hence the collection overall. Version 1.1 had six verbs. Three relate to archival metadata: Identify, ListMetadataFormats, and ListSets. They provide information about the archive itself. The other three are harvesting verbs: ListIdentifiers, ListRecords, and GetRecord. They provide information about objects in the collection.
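Each verb travels as an ordinary HTTP GET request, with the verb and its arguments passed as query parameters. A minimal sketch follows; the endpoint URL and record identifier are hypothetical, invented for illustration:

```python
from urllib.parse import urlencode

BASE_URL = "http://www.example.org/oai"  # hypothetical repository endpoint

def oai_request(verb, **args):
    """Build an OAI harvesting request URL: the verb and any optional
    arguments become query parameters on the repository's base URL."""
    return BASE_URL + "?" + urlencode(dict(verb=verb, **args))

# An archive-description verb takes no required arguments:
print(oai_request("Identify"))
# http://www.example.org/oai?verb=Identify

# A harvesting verb qualifies the request, e.g. by item and metadata format:
print(oai_request("GetRecord",
                  identifier="oai:example:b2-p34",
                  metadataPrefix="oai_dc"))
```

The repository answers each such request with an XML document; a harvester is simply a client that issues these URLs and parses the responses.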
Creating an OAI-registered archive consists of several steps:
implement a server to handle OAI-MHP
create metadata in an OAI compliant form
test the implementation
register the archive
Implementing a server to handle OAI-MHP is a reasonably sized software project. For instance, Lagoze has students in his CS class at Cornell build one as a final project, with many of its components built in earlier projects. Fortunately, a variety of server components are available from several universities involved in OAI research, and I did not have to create one from scratch.
For my Version 1.1 repository server, I chose to use the “XML File-Based OAI Data Provider” from Virginia Tech, a Perl module that installs onto a UNIX web server. Any library, museum, or archive system administrator with a working knowledge of Perl would be able to get this service running. In this implementation, a unique Dublin Core file is created for each object in the collection. Other designs serve metadata from a database, which would require a different implementation of the metadata creation process.
After entering archive description information I was able to get the program running. A test harvester available at http://oai.dlib.vt.edu/cgi-bin/Explorer/oai1.1/testoai was then run to validate the installation. (Hussein) There were a few minor problems, but quite quickly my repository was responding to the harvester requests and serving out sample metadata included in the installation. Once the server was up, I had to create metadata for it to serve.
While Version 1.1 metadata registered with OAI was required to be Dublin Core (DC), additional metadata types are now supported. Since all the elements of DC are optional, this gives wide flexibility to the archiving institution. Furthermore, DC may be created dynamically upon request of an OAI harvester; there is no requirement that a collection of DC metadata be maintained. This allows institutions to use basically whatever metadata they choose, and to dynamically serve DC back in response to OAI-MHP requests.
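For illustration, a single oai_dc record for one item might look like the fragment below. The field values are invented for this sketch and are not the project's actual metadata:

```xml
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- All Dublin Core elements are optional and repeatable. -->
  <dc:title>Letter from France</dc:title>
  <dc:type>Text</dc:type>
  <dc:date>1917</dc:date>
  <dc:identifier>http://www.hray.com/wwi/b2-p34.htm</dc:identifier>
</oai_dc:dc>
```

Because every element is optional, a record this sparse is still valid, which is what makes dynamic generation from richer metadata practical.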
A number of techniques are available for parsing XML. In general, they fall into two categories: SAX (event-driven) and DOM (tree-based). A tree-based algorithm reads the entire XML structure and builds an in-memory model of it, which is then available for traversal through an application programming interface (API). XSLT and the W3C's DOM model, for instance, both take a tree-based approach. An event-based algorithm, on the other hand, reads the XML file and triggers an event for each state change in the XML stream. The programmer writes event handlers to recognize the types of nodes and process the information accordingly. SAX (“Simple API for XML”) is useful in many cases, including that of XML files too large to fit into memory.
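The two approaches can be contrasted in a few lines. This sketch uses Python's standard library rather than the Perl tools used in the project, and the sample document is invented; both parsers extract the same list of letter dates:

```python
import xml.sax
from xml.dom import minidom

DOC = ("<letters>"
       "<letter date='1918-03-02'>Dear Mother</letter>"
       "<letter date='1918-03-09'>All well</letter>"
       "</letters>")

# Tree-based (DOM): parse the whole document into memory first,
# then traverse the resulting tree through an API.
dom = minidom.parseString(DOC)
dom_dates = [e.getAttribute("date") for e in dom.getElementsByTagName("letter")]

# Event-based (SAX): the parser streams through the document and
# calls a handler method at each state change (here, element starts).
class LetterHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.dates = []
    def startElement(self, name, attrs):
        if name == "letter":
            self.dates.append(attrs["date"])

handler = LetterHandler()
xml.sax.parseString(DOC.encode(), handler)

print(dom_dates == handler.dates)  # True: both approaches yield the same dates
```

The DOM version holds the whole tree in memory; the SAX version never does, which is why SAX scales to documents too large to fit in memory.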
To produce metadata for my collection, I wrote a Perl SAX-style program that parses my EAD metadata and automatically produces a set of DC files, one file per “item” in the collection (actually, per “c02” level in the EAD). This means that each letter in the collection is represented by a DC description, as is each page from the photograph album. I used the Perl XML::Parser module, an event-driven XML parser, and wrote the event handlers and other code required to build the DC files. In the end, I had a collection of about 150 files containing Dublin Core describing the contents of my digital collection. The program “ead2dc.pl” is available on the website.
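The shape of such a converter can be sketched with Python's event-driven xml.sax in place of Perl's XML::Parser. This is an illustrative equivalent, not ead2dc.pl itself: the skeletal EAD input, the crosswalk table, and the emission of in-memory dicts (rather than one file per item) are all simplifications:

```python
import xml.sax

# A skeletal, hypothetical EAD fragment: one <c02> component per item.
EAD = """<ead><archdesc><dsc>
  <c02><did><unittitle>Letter, France</unittitle><unitdate>1917-08-04</unitdate></did></c02>
  <c02><did><unittitle>Letter, Italy</unittitle><unitdate>1918-03-02</unitdate></did></c02>
</dsc></archdesc></ead>"""

# Illustrative crosswalk from EAD elements to Dublin Core fields.
CROSSWALK = {"unittitle": "dc:title", "unitdate": "dc:date"}

class EadToDc(xml.sax.ContentHandler):
    """Emit one Dublin Core record per c02-level component.
    Records are collected as dicts here; the real design wrote
    one DC file per item."""
    def __init__(self):
        self.records, self.current, self.field = [], None, None

    def startElement(self, name, attrs):
        if name == "c02":
            self.current = {}                    # begin a new item
        elif name in CROSSWALK and self.current is not None:
            self.field = CROSSWALK[name]         # start capturing a DC field

    def characters(self, content):
        if self.field:                           # accumulate text chunks
            self.current[self.field] = self.current.get(self.field, "") + content

    def endElement(self, name):
        if name in CROSSWALK:
            self.field = None
        elif name == "c02":
            self.records.append(self.current)    # item complete
            self.current = None

handler = EadToDc()
xml.sax.parseString(EAD.encode(), handler)
print(handler.records[0])  # {'dc:title': 'Letter, France', 'dc:date': '1917-08-04'}
```

The event handlers mirror the structure of the EAD: a c02 start opens a record, crosswalked elements capture character data, and the c02 end closes the record.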
One of the interesting issues in doing this conversion is the mapping of fields from EAD to DC. Several guidelines are available for this conversion, describing where each EAD field should be placed in DC. (Bos) However, if you do not think about this when you create your EAD, you end up with some difficult parsing problems, as you may not be able to get the descriptive data you need at the time you need it. With my sample size of one EAD file, it was easy enough to make a few small modifications in my EAD to generate higher quality DC. I also made separate DC files by hand to describe overall elements of the collection; this used information in the EAD, but it was simpler to do by hand than to write code for it. An institution with a large collection of EAD finding aids would probably have to invest some time in successfully automating this process.
Testing the process is an iterative affair, ongoing as one juggles data sources, conversion programs, harvesters, and display functions. There are quite a few moving pieces and different technologies, but overall it was a manageable process. I used standard software engineering techniques to build, debug, and document the software, and that discipline led to a successful outcome.
After one has created a responsive repository, one registers it with the OAI folks. I was hesitant to do so until I read Arms's paper “A Spectrum of Interoperability: The Site for Science Prototype for the NSDL,” published in D-Lib Magazine in January 2002. He wants 10,000 archives within the next five years. I figured I should do my part, so I registered!
On June 14, 2002, Version 2.0 of the OAI-MHP specification was released. All Version 1.1 repositories will be declared obsolete on December 1, 2002, and removed from the repository registry. Since this is a new standard, the tools are still somewhat immature. I originally explored a tool from Virginia Tech called VTOAI 3.05, but a variety of installation and operational problems drove me to investigate other solutions. (DLRL) I successfully implemented a “personal OAI” solution called Kepler, which could be very useful to individuals wishing to run their own repositories. (DLG) It was not appropriate for institutional use, however, as it lacked the automation required when dealing with hundreds or thousands of objects. Finally, I installed a copy of Celestial 1.1.1. (Brody) It provides two features: its OAI-MHP 2.0 compliant repository export interface makes one's data available to OAI harvesters, and its harvest function compiles information from other repositories. As an example, I have successfully harvested DC records from Ackerman Archives, USF, MIT, and CalTech. When I attempted to harvest arXiv, a massive electronic journal archive, I ran into problems. First I ran out of space on my server. Then, while cleaning up that situation, a table in the Celestial database was corrupted. MySQL's repair utilities were unable to recover it, so I was forced to reinstall the entire database. These problems make me reluctant to use Celestial until it has matured and I have significantly more disk space available.
The new website “Ackerman Archives” and the photographic collection included in the archives were created for this project. Dreamweaver 3.0 was used as a production tool for website development. An open source project, Gallery 1.3, was used for the framework of the photographic collection. Written in PHP, this is an excellent tool for building online photograph albums, and with a working knowledge of PHP it is easy to install and customize. (Gallery)
The work delivered for this project includes a new OAI repository, “Ackerman Archives”; metadata in OAI 1.1, OAI 2.0, and EAD compatible formats; and a new photographic collection mounted on the web. Over the course of the semester, a significant amount of development work was done in XML, XSLT, MySQL, PostgreSQL, Perl, and PHP to implement the various components. The work reinforced the class notes and readings quite nicely, and I appreciate having had the opportunity to do this.
Ackerman, Rich. “Grandpa's World War I Service” 26 September 1996 http://www.hray.com/wwi/title.htm
Ackerman, Rich. “Ackerman Archives.” 1 July 2002 http://www.hray.com/aa 10 July 2002
Arms, William. “A Spectrum of Interoperability: The Site for Science Prototype for the NSDL.” D-Lib Magazine 8 (January 2002).
Bos, Bert. “Interpretation of
Brody, Tim. “oai-perl library.” http://oai-perl.sourceforge.net/ 30 June 2002
Clark, Jim. “NSGMLS.” http://www.jclark.com/sp/nsgmls.htm
Clark, Jim. “XT.” http://www.blnz.com/xt/index.html
Digital Library Research Laboratory (DLRL). “VTOAI OAI-PMH2 PERL Implementation.” 11 June 2002 http://www.dlib.vt.edu/projects/OAI/software/vtoai/vtoai.html 28 June 2002
Digital Library Group at Old Dominion (DLG). “Kepler – A Digital Library for Individuals.” http://kepler.cs.odu.edu/
EAD “Encoded Archival Description” home page. http://www.loc.gov/ead/ 6 August 2001
Fouvry, Frederik “Fixing errors with SGML tools” http://www.coli.uni-sb.de/~fouvry/kde/sgml-errors.html
Gallery. “Gallery – your photos on your website” 2002 http://gallery.jacko.com/ 1 July 2002
Getty Information Institute. “Descriptive Standards for Catalog Records.” 1998 http://www.schistory.org/getty/6_1.html 12 June 2002
Getty Research Institute. “Getty Thesaurus of Geographic Names.” 2000 http://www.getty.edu/research/tools/vocabulary/tgn/ 12 June 2002
Hussein. “Open Archives Initiative - Repository Explorer” August 2001 http://oai.dlib.vt.edu/cgi-bin/Explorer/oai1.1/testoai
IMLS. “A Framework of Guidance for Building Good Digital Collections.” 6 November 2001 http://www.imls.gov/scripts/text.cgi?/pubs/forumframework.htm
Information Society Technologies (IST). “Guide to Archiving” http://www.diffuse.org/archive_guide.html May 2002
IST, “Metadata Interchange Standards” http://www.diffuse.org/meta.html May 2002
Lagoze, Carl, and Herbert Van de Sompel. “The Open Archives Initiative: Building a low-barrier interoperability framework.” http://www.openarchives.org/documents/oai.pdf
Miller, Stephen “Parsing EAD with NSGMLS (EAD Version 1.0)” http://scriptorium.lib.duke.edu/findaids/ead/nsgmls.html 20 October 1999
Open Archives Initiative (OAI). http://www.openarchives.org
Pawson, Dave “XSL Frequently Asked Questions” 20 May 2002 http://www.dpawson.co.uk/xsl/xslfaq.html
Society of American Archivists (SAA). “EAD Tag Library for Version 1.0” http://www.loc.gov/ead/tglib/tlelem.htm 1998
Society of American Archivists (SAA). “EAD Crosswalks” http://www.loc.gov/ead/ag/agappb.html 1999
Stedfield, Eric, and Leslie Myrick. “EAD Production Guide - New York University.” February 2002 http://www.nyu.edu/library/bobst/collections/findingaids/ead/
Appendix A: List of source files
These files are available from the “Tools” page of the Ackerman Archives website:
hlw-ead.xml XML source code of HLW finding aid
hlw-ead.xsl XSLT source code transforms hlw-ead.xml into HTML
ead2dc.pl perl program to create Dublin Core files from EAD XML
proc.pl strip tags from letters