Using a Native XML Database for Encoded Archival Description Search and Retrieval
Alan Cornish
Information Technology and Libraries; Dec 2004; 23, 4; ProQuest pg. 181

Communications

The Northwest Digital Archives (NWDA) is a National Endowment for the Humanities-funded effort by fifteen institutions in the Pacific Northwest to create a finding-aids repository. Approximately 2,300 finding aids that follow the Encoded Archival Description (EAD) standard are being contributed to a union catalog by academic and archival institutions in Idaho, Montana, Oregon, and Washington. This paper provides some information on the EAD standard and on search and retrieval issues for EAD XML documents. It describes native XML technology and the issues that were considered in the selection of a native XML database, Ixiasoft's TextML, to support the NWDA project.

Pitti, one of the founders of the EAD standard, noted the primary motivation behind the creation of EAD: "To provide a tool to help mitigate the fact that the geographic distribution of collections severely limits the ability of researchers, educators, and others to locate and use primary sources."1 Pitti expanded on this need for EAD in a 1999 D-Lib article:

The logical components of archival description and their relations to one another need to be accurately identified in a machine-readable form to support sophisticated indexing, navigation, and display that provide thorough and accurate access to, and description and control of, archival materials.2

In a more recent publication, Pitti and Duff noted a key advantage offered by EAD that relates to the focus of this article, the development of an EAD union catalog:

EAD makes it possible to provide union access to detailed archival descriptions and resources in repositories distributed throughout the world. . . .
Libraries and archives will be able to easily share information about complementary records and collections, and to "virtually" integrate collections related by provenance, but dispersed geographically or administratively.3

In a 2001 American Archivist article, Roth examined EAD history and the deployment methods used up to 2001. Importantly, two of the most prominent delivery systems described by Roth, DynaText (a server-side solution) and Panorama (a client-side solution), were, by 2003, obsolete products for EAD delivery. This is indicative of the rapid pace of change in EAD deployment, in part due to the migration from SGML to XML technologies. Roth described survey results obtained on EAD deployment that underscore the recognized need at that time for a "cost-effective server-side XML delivery system." The lack of such a solution motivated institutions to choose HTML as a delivery method for EAD finding aids.4

Articles like Roth's that describe specific EAD search-and-retrieval implementation options are in short supply. One such option, the University of Michigan DLXS XPAT software, is employed for the search and retrieval of EAD and other metadata in the University of Illinois at Urbana-Champaign (UIUC) Cultural Heritage Repository.5 Another option, harvesting EAD records into machine-readable cataloging (MARC) to establish search and retrieval access in an integrated library system, was described by Fleck and Seadle in a 2002 Coalition for Networked Information Task Force briefing. Using an XML Harvester product created by Innovative Interfaces, MARC records are generated based upon MARC encoding analogs included in the EAD markup and loaded into an Innovative Interfaces INNOPAC system.6 This product has been used to create access to EAD finding aids in the catalog for Michigan State University's Vincent Voice Library.
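To make the idea of MARC encoding analogs concrete, the following is a minimal Python sketch of how analog values recorded in EAD markup might be collected before MARC records are built. It is not the XML Harvester's actual logic, which is not described here. The sample fragment, its encodinganalog values, and the function name collect_marc_analogs are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAD fragment. The elements chosen and the
# encodinganalog values are illustrative assumptions, not NWDA data.
SAMPLE_EAD = """
<ead>
  <archdesc level="collection">
    <did>
      <unittitle encodinganalog="245$a">Example Family Papers</unittitle>
      <unitdate encodinganalog="245$f">1890-1920</unitdate>
      <origination>
        <persname encodinganalog="100">Example, Jane</persname>
      </origination>
      <physdesc>3 linear feet</physdesc>
    </did>
  </archdesc>
</ead>
"""


def collect_marc_analogs(ead_xml):
    """Return (analog, text) pairs for each element carrying an encodinganalog."""
    root = ET.fromstring(ead_xml)
    pairs = []
    # iter() walks the tree in document order, starting at the root.
    for element in root.iter():
        analog = element.get("encodinganalog")
        if analog is not None:
            text = "".join(element.itertext()).strip()
            pairs.append((analog, text))
    return pairs


marc_fields = collect_marc_analogs(SAMPLE_EAD)
for analog, text in marc_fields:
    print(analog, text)
```

Elements without an analog, such as the physdesc in this sample, are simply skipped, which suggests why search in a harvested catalog is limited to whatever metadata carries a MARC analog.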
In a 2001 article, Gilliland-Swetland recommended several desirable features for an EAD search-and-retrieval system. She emphasized the challenge of EAD search and retrieval by noting the nature of finding aids themselves:

Archivists have historically been materials-centric rather than user-centric in their descriptive practices, resulting in the finding aid assuming a form quite unlike the concise bibliographic description with name and subject access most users are accustomed to using in other information systems such as library catalogs, abstracts, and indexes.7

Without describing specific software tools, Gilliland-Swetland argued for a user-centric approach to the search and retrieval of finding aids by examining the needs of specific user communities such as genealogists, K-12 teachers, and historians.8

Several initiatives similar to the NWDA effort are described in the professional literature. The Online Archive of California (OAC), which was founded in the mid-1990s, is a consortium of California special-collections repositories. A number of key consortium functions are centralized, including "monitoring to ensure consistency of EAD encoding across all OAC finding aids" according to agreed-upon best practices, a critical need in the creation of a union catalog.9 Brown and Schottlaender also describe the integration of the OAC into the California Digital Library, which enables linkages between EAD

Alan Cornish is Systems Librarian, Washington State University Libraries, Pullman.
Finally, one important development area is the possibility of integrating EAD documents into Open Archives Initiative (OAI) services in order to enhance resource discovery. A 2002 paper written by Prom and Habing, both of whom work with the UIUC Cultural Heritage Repository, explored the possibility of mapping EAD to OAI, the latter of which is based upon the fifteen-element Dublin Core Metadata Set (unqualified). While noting, "we do not propose that the full capabilities of EAD finding aids could be subsumed by OAI," Prom and Habing suggested that it is possible to map the top-level and component portions of EAD into OAI, resulting in multiple OAI records from a single EAD finding aid. In this scenario, a single OAI record is created from the collection-level information and multiple records from component-level information in an EAD document.11

Evaluation of EAD Search and Retrieval Products

In order to identify a software solution for supporting a union catalog of EAD finding aids, the consortium conducted a product evaluation. The strengths and weaknesses of the native XML technology employed by the consortium can be best understood by looking at alternative XML products and product categories. Table 1 shows the products considered during an evaluation period that consisted of both product research and actual trials. In approaching the evaluation, the consortium and its union-catalog host institution, the Washington State University Libraries, had several specific needs in mind. First, the licensing and support costs for the product needed to fit within the consortium's budget.
Second, the search-and-retrieval software had to support several basic functions: keyword searching across all union-catalog finding aids; specific field searching based upon elements or attributes in the EAD document; an ability to customize the look and feel of the interface and search-results screens; and the ability to display search term(s) in the context of the finding aid.

As noted in the table, three of the evaluated products are native XML databases. Cyrenne provides a definition of native XML as a database with these features:

• The XML document is stored intact: "the XML document is preserved as a separate, unique entity in its entirety."
• "Schema independence," that is, "any well-formed XML document can be stored and queried."
• The query language is XML-based: "native XML database vendors typically use a query language designed specifically for XML" as opposed to SQL.12

Of the three native XML products, only the licensing costs of Ixiasoft's TextML and the open-source Xindice software fell within the available project funding. Both packages were extensively tested, with TextML proving superior at handling the large (sometimes in the MB-size range) and structurally complex EAD documents created by consortium members.

One key strength of TextML that met an NWDA consortium need involved field searching. In TextML, it is possible to map a search field to one or more XPath statements, enabling the creation of search fields based upon the precise use of an element or attribute in EAD documents. The importance of this capability is shown with an EAD element that can appear both at the collection level and at the subordinate component level in a document. With TextML, using its limited XPath support, it is possible to reference a specific, contextual use of that element.
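The field-mapping idea can be sketched in Python. This is not TextML's configuration syntax, which is not reproduced here; the Python standard library's limited XPath subset stands in for it. The sample document, the field names, and the choice of unittitle as the example element are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical EAD fragment in which the same element name (unittitle)
# is used at the collection level and at the component level.
SAMPLE_EAD = """
<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Example Family Papers</unittitle>
    </did>
    <dsc>
      <c01 level="series">
        <did>
          <unittitle>Correspondence</unittitle>
        </did>
      </c01>
      <c01 level="series">
        <did>
          <unittitle>Photographs</unittitle>
        </did>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""

# Hypothetical field definitions: each search field maps to one or more
# path expressions, so the element's context decides which field it feeds.
SEARCH_FIELDS = {
    "collection_title": ["./archdesc/did/unittitle"],
    "component_title": ["./archdesc/dsc/c01/did/unittitle"],
}


def build_field_index(ead_xml, fields):
    """Return {field name: [text values]} for one document."""
    root = ET.fromstring(ead_xml)
    index = {}
    for name, paths in fields.items():
        values = []
        for path in paths:
            for element in root.findall(path):
                values.append("".join(element.itertext()).strip())
        index[name] = values
    return index


field_index = build_field_index(SAMPLE_EAD, SEARCH_FIELDS)
for name, values in field_index.items():
    print(name, values)
```

Because the two fields differ only in the path that leads to the element, the documents themselves need no recoding to keep collection-level and component-level uses apart.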
In addition to the native XML solutions, several other product types were considered. An XML query engine, Verity Ultraseek, was tested and produced good results when used for the search and retrieval of consortium documents.13 Ultraseek can be used to search discrete XML files, supports the creation of custom interfaces for the search-and-retrieval system, and has strong documentation. Probably the most obvious limitation in this XML query-engine product concerned the creation of search fields. To contrast Ultraseek with a native XML solution: Ultraseek 5.0 (used during the product trial) lacked XPath support. Instead, it required a unique element-attribute combination for the creation of a database search field. Returning to the example, contextual uses of the element could not be indexed without recoding consortium documents to create a unique element-attribute combination on which to index.

An XML-enabled database, DLXS XPAT, has been successfully used in several EAD projects, including OAC. One disadvantage of this product is that it requires a UNIX operating system for the server. Additionally, XPAT, as a supporting toolset for digital-library collection building, provides functionality that duplicates other media tools at the host institution (specifically, OCLC/DiMeMa CONTENTdm).

The use of a Relational Database Management System (RDBMS) to establish search and retrieval for EAD XML documents was considered as well. The advantage to this approach is that it would enable the use of coding techniques built up through other Web-based media delivery projects at the host institution.
The most obvious negative issue is the need to map XML elements or attributes to tables and fields in an RDBMS, which, as Cyrenne notes, "is often expensive and will most likely result in the loss of some data such as processing instructions, and comments as well as the notion of element and attribute ordering."14
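A small Python sketch suggests how this kind of loss can arise when XML is mapped naively onto a table. It is only an illustration under stated assumptions: the sample fragment, the table name did_fields, and the function shred_to_table are hypothetical, and a careful relational design could add columns to preserve more than this one does.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical fragment containing a comment alongside two elements.
SAMPLE_XML = """<did>
  <!-- processor's note: verify dates -->
  <unittitle>Example Family Papers</unittitle>
  <unitdate>1890-1920</unitdate>
</did>"""


def shred_to_table(xml_text):
    """Copy each child element into a (tag, value) row of an in-memory table."""
    # ElementTree's default parser already discards comments, so they never
    # reach the table; the table also has no column recording position.
    root = ET.fromstring(xml_text)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE did_fields (tag TEXT, value TEXT)")
    for child in root:
        conn.execute(
            "INSERT INTO did_fields VALUES (?, ?)",
            (child.tag, (child.text or "").strip()),
        )
    conn.commit()
    return conn


conn = shred_to_table(SAMPLE_XML)
# Sorting by tag returns the rows in a different order than the document;
# nothing stored in the table records which element came first.
rows = conn.execute("SELECT tag, value FROM did_fields ORDER BY tag").fetchall()
stored_text = " ".join(value for _, value in rows)
comment_survived = "verify dates" in stored_text
print(rows)
print(comment_survived)
```

Storing the document intact, as a native XML database does, sidesteps both problems because nothing has to be redistributed into rows and columns.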
The use of native XML avoids the task of exploding XML data into the table and field structures of an RDBMS.

Table 1. NWDA project: evaluated search and retrieval products

Product              Vendor                  Product category                         License
MySQL/PHP            N/A                     Relational database management system    Open source
Tamino XML Server    Software AG             Native XML database                      Commercial
TextML               Ixiasoft                Native XML database                      Commercial
Ultraseek            Verity                  XML query engine                         Commercial
Xindice              N/A                     Native XML database                      Open source
XML Harvester        Innovative Interfaces   Integrated library system                Commercial
XPAT                 DLXS                    XML-enabled database                     Commercial

Finally, another approach considered was the use of an integrated library system product. This was a realistic option for NWDA because consortium member institutions had decided to include MARC encoding analogs for selected elements in union-catalog finding aids. Innovative Interfaces produces an XML Harvester that can be used to generate MARC records from EAD finding aids that include MARC encoding analogs. For this project, a local (or self-contained) catalog could have been created and populated with MARC records containing metadata for the EAD documents, including a URL for online access. This approach offers important strengths and weaknesses. On the positive side, it is a relatively easy method for enabling search-and-retrieval access to EAD finding aids. In contrast to the interface coding requirements for TextML, the XML Harvester provided an almost turnkey approach to XML search and retrieval. On the negative side, two factors stood out during the evaluation. First, it would be difficult to fully customize search-and-retrieval interfaces as needed for the project. Second, using the XML Harvester, there is no ability to display search terms in the context of the finding aid. Search and retrieval is based upon the metadata extracted from the finding aid using the MARC analogs.
In Michigan State's Voice Library implementation of this solution, the finding aid is an external resource with no highlighting of search terms.

Strengths and Weaknesses of the TextML Approach

Each project has its own specific needs; thus, there is no single correct approach to establishing search and retrieval for EAD XML documents. In taking the needs and resources of the NWDA consortium into account, Ixiasoft's TextML, a native XML product, provided the best fit and was licensed for use. The use of TextML enables the creation of customized interfaces for an XML database (or Document Base, using the TextML terminology) and provides support for keyword and field searching of consortium documents. The qualified XPath support in TextML enables search fields to be built upon precise element or attribute combinations within EAD documents.

The existence of a major finding-aids Internet site employing TextML was a factor in the project's selection of the software. The Access to Archives (A2A) site, accessible from URL www / , provides an excellent model for a publicly searchable finding-aid site. The A2A site supports keyword searching and searching by archival facility; provides multiple views of search results (a summary records screen, search terms in context, and the full record); highlights search term(s) in the displayed finding aid; and supports the presentation of large finding-aid documents. While A2A uses General International Standard Archival Description, or ISAD(G), as opposed to EAD for its description standard, the similarities between the two standards make the A2A site a valuable example for development.15

One weakness of TextML is the implementation model supported by Ixiasoft, which assumes significant local development of the application or Web interface.
The relationship between software capabilities and local development was considered with each of the products listed in table 1. As noted, the Innovative Interfaces solution was the most straightforward approach, assuming the existence of the MARC analogs in EAD markup, but provided the least flexibility in terms of customization and establishing a true linkage between the search system and the actual document. In contrast, while Ixiasoft makes available a base set of Active Server Pages using Visual Basic Script (ASP/VBScript) code for TextML application development and provides very good training and support services, the responsibility for that development rests with the local site. For the NWDA consortium, this development, using the code base, has been manageable. The current state of interface development for the NWDA project can be reviewed at http: // nwd ulibs .ws / project_info /.

Conclusion

In selecting an EAD search-and-retrieval system, one important question for the consortium was, Which software solution had the best prospects for migration in the future? Because of the inherent strengths of native XML technology in comparison to the other product categories listed in table 1, a native XML database appeared to be the best approach, and TextML provided the best combination of licensing costs, software capabilities, and support. It is important to note that the distinctions between native XML databases and databases that support XML through extensions (XML-enabled databases) may become more difficult to discern over time, in part due to the existing expertise and investments in RDBMS technologies.16 Nevertheless, capabilities central to native XML, such as the use of an XML-based query language, are integral to the success of such hybrid systems.

References and Notes

1. Daniel Pitti, "Encoded Archival Description: The Development of an Encoding Standard for Archival Finding Aids," The American Archivist 60, no. 3 (Summer 1997): 269.
2. Daniel Pitti, "Encoded Archival Description: An Introduction and Overview," D-Lib Magazine 5, no. 11 (Nov. 1999). Accessed Nov. 2, 2004, www.dlib.org/dlib/november99/11pitti.html.
3. Daniel V. Pitti and Wendy M. Duff (eds.), "Introduction," in Encoded Archival Description on the Internet (Binghamton, N.Y.: Haworth, 2001), 3.
4. James M. Roth, "Serving Up EAD: An Exploratory Study on the Deployment and Utilization of Encoded Archival Description Finding Aids," The American Archivist 64, no. 2 (Fall/Winter 2001): 226.
5. Sarah L. Shreeves et al., "Harvesting Cultural Heritage Metadata Using the OAI Protocol," Library Hi Tech 21, no. 2 (2003): 161.
6. Nancy Fleck and Michael Seadle, "EAD Harvesting for the National Gallery of the Spoken Word" (paper presented at the Coalition for Networked Information fall 2002 Task Force meeting, San Antonio, Tex., Dec. 2002). Accessed Nov. 2, 2004, tfms/2002b.fall/handouts/H-EAD-FleckSeadle.doc.
7. Anne J. Gilliland-Swetland, "Popularizing the Finding Aid: Exploiting EAD to Enhance Online Discovery and Retrieval," in Encoded Archival Description on the Internet (Binghamton, N.Y.: Haworth, 2001), 207.
8. Ibid., 210-14.
9. Charlotte B. Brown and Brian E. C. Schottlaender, "The Online Archive of California: A Consortial Approach to Encoded Archival Description," in Encoded Archival Description on the Internet (Binghamton, N.Y.: Haworth, 2001), 99.
10. Ibid., 103-5. OAC available at: www. lib. o rg/. Accessed Nov. 2, 2004.
11. Christopher J. Prom and Thomas Habing, "Using the Open Archives Initiative Protocols with EAD," in Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries (Portland, Ore., July 2002). Accessed Nov. 2, 2004, http://dli.grainger.uiuc.edu/publications/jcdl2002/pl4prom.pdf.
12. Marc Cyrenne, "Going Native: When Should You Use a Native XML Database?" AIM E-DOC Magazine 16, no. 6 (Nov./Dec. 2002), 16. Accessed Nov. 2, 2004, www .edocmag az / ar ticle_ new.asp?ID=25421.
13. Product category decisions based upon definitions and classifications available from: Ronald Bourret, "XML Database Products." Accessed Nov. 2, 2004, www. rp / xml / XMLD a t a b a se Prods.htm.
14. Cyrenne, "Going Native," 18.
15. Bill Stockting, "EAD in A2A," Microsoft PowerPoint presentation. Accessed Nov. 2, 2004, www.agad .archiwa. ead / stocking.ppt.
16. Uwe Hohenstein, "Supporting XML in Oracle9i," in Akmal B. Chaudhri, Awais Rashid, and Roberto Zicari (eds.), XML Data Management: Native XML and XML-Enabled Database Systems (Boston: Addison-Wesley, 2003), 123-4.