Batch Loading Collections into DSpace: Using Perl Scripts for Automation and Quality Control

Maureen P. Walsh

Maureen P. Walsh (walsh.260@osu.edu) is Metadata Librarian/Assistant Professor, The Ohio State University Libraries, Columbus, Ohio.

This paper describes batch loading workflows developed for the Knowledge Bank, The Ohio State University's institutional repository. In the five years since the inception of the repository approximately 80 percent of the items added to the Knowledge Bank, a DSpace repository, have been batch loaded. Most of the batch loads utilized Perl scripts to automate the process of importing metadata and content files. Custom Perl scripts were used to migrate data from spreadsheets or comma-separated values files into the DSpace archive directory format, to build collections and tables of contents, and to provide data quality control. Two projects are described to illustrate the process and workflows.

The mission of the Knowledge Bank, The Ohio State University's (OSU) institutional repository, is to collect, preserve, and distribute the digital intellectual output of OSU's faculty, staff, and students.1 The staff working with the Knowledge Bank have sought from its inception to be as efficient as possible in adding content to DSpace. Using batch loading workflows to populate the repository has been integral to that efficiency. The first batch load into the Knowledge Bank was August 29, 2005. Over the next four years, 698 collections containing 32,188 items were batch loaded, representing 79 percent of the items and 58 percent of the collections in the Knowledge Bank. These batch loaded collections vary from journal issues to photo albums. The items include articles, images, abstracts, and transcripts. The majority of the batch loads, including the first, used custom Perl scripts to migrate data from Microsoft Excel spreadsheets into the DSpace batch import format for descriptive metadata and content files. Perl scripts have been used for data cleanup and quality control as part of the batch load process. Perl scripts, in combination with shell scripts, have also been used to build collections and tables of contents in the Knowledge Bank. The workflows using Perl scripts to automate batch import into DSpace have evolved through an iterative process of continual refinement and improvement. Two Knowledge Bank projects are presented as case studies to illustrate a successful approach that may be applicable to other institutional repositories.

■■ Literature Review

Batch ingesting is acknowledged in the literature as a means of populating institutional repositories. There are examples of specific batch loading processes minimally discussed in the literature. Branschofsky and her colleagues briefly described batch loading MARC metadata crosswalked to DSpace Dublin Core (DC) in a poster session.2 Mishra and others developed a Perl script to create the DSpace archive directory for batch import of electronic theses and dissertations (ETDs) extracted with a Java program from an in-house bibliographic database.3 Mundle used Perl scripts to batch process ETDs for import into DSpace with MARC catalog records or Excel spreadsheets as the source metadata.4 Brownlee used Python scripts to batch process comma-separated values (CSV) files exported from Filemaker database software for ingest via the DSpace item importer.5

More in-depth descriptions of batch loading are provided by Thomas; Kim, Dong, and Durden; Proudfoot et al.; Witt and Newton; Drysdale; Ribaric; Floyd; and Averkamp and Lee. However, irrespective of repository software, each describes a process to populate their repositories dissimilar to the workflows developed for the Knowledge Bank in approach or source data. Thomas describes the Perl scripts used to convert MARC catalog records into DC and to create the archive directory for DSpace batch import.6 Kim, Dong, and Durden used Perl scripts to semiautomate the preparation of files for batch loading a University of Texas Harry Ransom Humanities Research Center (HRC) collection into DSpace. The XML source metadata they used was generated by the National Library of New Zealand Metadata Extraction Tool.7 Two subsequent projects for the HRC revisited the workflow described by Kim, Dong, and Durden.8

Proudfoot and her colleagues discuss importing metadata-only records from departmental RefBase, Thomson Reuters EndNote, and Microsoft Access databases into ePrints. They also describe an experimental Perl script written to scrape lists of publications from personal websites to populate ePrints.9 Two additional workflow examples used citation databases as the data source for batch loading into repositories. Witt and Newton provide a tutorial on transforming EndNote metadata for Digital Commons with XSLT (Extensible Stylesheet Language Transformations).10 Drysdale describes the Perl scripts used to convert Thomson Reuters Reference Manager files into XML for the batch loading of metadata-only records into the University of Glasgow's ePrints repository.11 The Glasgow ePrints batch workflow is additionally described by Robertson and by Nixon and Greig.12

Several workflows were designed for batch loading ETDs into repositories. Ribaric describes the automatic preparation of ETDs from the Internet Archive (http://www.archive.org/) for ingest into DSpace using PHP utilities.13 Floyd describes the processor developed to automate the ingest of ProQuest ETDs via the DSpace item importer.14 Also using ProQuest ETDs as the source data, Averkamp and Lee described using XSLT to transform the ProQuest data to Bepress' (The Berkeley Electronic Press) schema for batch loading into a Digital Commons repository.15

The Knowledge Bank workflows described in this paper use Perl scripts to generate DC XML and create the archive directory for batch loading metadata records and content files into DSpace using Excel spreadsheets or CSV files as the source metadata.

■■ Background

The Knowledge Bank, a joint initiative of the OSU Libraries (OSUL) and the OSU Office of the Chief Information Officer, was first registered in the Registry of Open Access Repositories (ROAR) on September 28, 2004.16 As of December 2009 the repository held 40,686 items in 1,192 collections. The Knowledge Bank uses DSpace, the open-source Java-based repository software jointly developed by the Massachusetts Institute of Technology Libraries and Hewlett-Packard.17 As a DSpace repository, the Knowledge Bank is organized by communities. The fifty-two communities currently in the Knowledge Bank include administrative units, colleges, departments, journals, library special collections, research centers, symposiums, and undergraduate honors theses. The commonality of the varied Knowledge Bank communities is their affiliation with OSU and their production of knowledge in a digital format that they wish to store, preserve, and distribute.

The staff working with the Knowledge Bank includes a team of people from three OSUL areas—Technical Services, Information Technology, and Preservation—and the contracted hours of one systems developer from the OSU Office of Information Technology (OIT). The OSUL team members are not individually assigned full-time to the repository. The current OSUL team includes a librarian repository manager, two metadata librarians, one systems librarian, one systems developer, two technical services staff members, one preservation staff member, and one graduate assistant.

The Knowledge Bank is currently running DSpace 1.5.2 and the relational database PostgreSQL 8.1.11 on the Red Hat Enterprise Linux 5 operating system. The structure of the Knowledge Bank follows the hierarchical arrangement of DSpace. Communities are at the highest level and can be divided into subcommunities. Each community or subcommunity contains one or more collections. All items—the basic archival elements in DSpace—are contained within collections. Items consist of metadata and bundles of bitstreams (files). DSpace supports two user interfaces: the original interface based on JavaServer Pages (JSPUI) and the newer Manakin (XMLUI) interface based on the Apache Cocoon framework. At this writing, the Knowledge Bank continues to use the JSPUI interface.

The default metadata used by DSpace is a Qualified DC schema derived from the DC library application profile.18 The Knowledge Bank uses a locally defined extended version of the default DSpace Qualified DC schema, which includes several additional element qualifiers. The metadata management for the Knowledge Bank is guided by a Knowledge Bank application profile and a core element set for each collection within the repository derived from the application profile.19 The metadata librarians at OSUL create the collection core element sets in consultation with the community representatives. The core element sets serve as metadata guidelines for submitting items to the Knowledge Bank regardless of the method of ingest.

The primary means of adding items to collections in DSpace, and the two ways used for Knowledge Bank ingest, are (1) direct (or intermediated) author entry via the DSpace Web item submission user interface and (2) in batch via the DSpace item importer. Recent enhancements to DSpace, not yet fully explored for use with the Knowledge Bank, include new ingest options using Simple Web-service Offering Repository Deposit (SWORD), Open Archives Initiative Object Reuse and Exchange (OAI-ORE), and DSpace package importers such as the Metadata Encoding and Transmission Standard Submission Information Package (METS SIP) format. This paper describes ingest via the DSpace batch item importer.

The DSpace item importer is a command-line tool for batch ingesting items. The importer uses a simple archive format diagrammed in figure 1. The archive is a directory of items that contain a subdirectory of item metadata, item files, and a contents file listing the bitstream file names.

Figure 1. DSpace simple archive format

archive_directory/
    item_000/
        dublin_core.xml  -- qualified Dublin Core metadata
        contents         -- text file containing one line per filename
        file_1.pdf       -- files to be added as bitstreams to the item
        file_2.pdf
    item_001/
        dublin_core.xml
        file_1.pdf
        ...

Each item's descriptive metadata is contained in a DC XML file. The format used by DSpace for the DC XML files is illustrated in figure 2.

Figure 2. DSpace Qualified Dublin Core XML (example record: "Notes on the Bird Life of Cedar Point," 1901-04, Griggs, Robert F.)

Automating the process of creating the Unix archive directory has been the main function of the Perl scripts written for the Knowledge Bank batch loading workflows. A systems developer uses the test mode of the DSpace item importer tool to validate the item directories before doing a batch load. Any significant errors are corrected and the process is repeated. After a successful test, the batch is loaded into the staging instance of the Knowledge Bank and quality checked by a metadata librarian to identify any unexpected results and script or data problems that need to be corrected. After a successful load into the staging instance the batch is loaded into the production instance of the Knowledge Bank.
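To make the archive layout in figure 1 concrete, the short Perl sketch below builds one item directory. It is illustrative only and is not one of the Knowledge Bank scripts; the directory name, file names, and metadata values are hypothetical, and it assumes a single PDF bitstream in the current directory.

#!/usr/bin/perl
# Sketch: build one DSpace simple-archive item directory (illustrative only).
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(make_path);

# Escape the XML-reserved characters in metadata values.
sub xml_escape {
    my ($s) = @_;
    for ($s) {
        s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; s/"/&quot;/g; s/'/&apos;/g;
    }
    return $s;
}

my $item_dir = "archive_directory/item_000";   # hypothetical item directory
make_path($item_dir);

my %dc = (
    'title'       => 'Notes on the Bird Life of Cedar Point',
    'date.issued' => '1901-04',
    'creator'     => 'Griggs, Robert F.',
);

# Write dublin_core.xml in the layout the item importer expects.
open my $fh, '>:encoding(UTF-8)', "$item_dir/dublin_core.xml" or die $!;
print {$fh} qq{<dublin_core>\n};
for my $field (sort keys %dc) {
    my ($element, $qualifier) = split /\./, $field;
    $qualifier ||= 'none';
    printf {$fh} qq{  <dcvalue element="%s" qualifier="%s">%s</dcvalue>\n},
        $element, $qualifier, xml_escape($dc{$field});
}
print {$fh} qq{</dublin_core>\n};
close $fh;

# Copy the bitstream (if present) and list it in the contents file.
my $pdf = 'file_1.pdf';                        # hypothetical bitstream
copy($pdf, "$item_dir/$pdf") or die "copy failed: $!" if -e $pdf;
open my $contents, '>', "$item_dir/contents" or die $!;
print {$contents} "$pdf\n";
close $contents;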
Most of the Knowledge Bank batch loading workflows use Excel spreadsheets or CSV files as the source for the descriptive item metadata. The creation of the metadata contained in the spreadsheets or files has varied by project. In some cases the metadata is created by OSUL staff. In other cases the metadata is supplied by Knowledge Bank communities in consultation with a metadata librarian or by a vendor contracted by OSUL. Whether the source metadata is created in-house or externally supplied, OSUL staff are involved in the quality control of the metadata.

Several of the first communities to join the Knowledge Bank had very large retrospective collection sets to archive. The collection sets of two of those early adopters, the journal issues of the Ohio Journal of Science (OJS) and the abstracts of the OSU International Symposium on Molecular Spectroscopy, currently account for 59 percent of the items in the Knowledge Bank.20 The successful batch loading workflows developed for these two communities—which continue to be active content suppliers to the repository—are presented as case studies.
■■ Case Studies

The Issues of the Ohio Journal of Science

OJS was jointly published by OSU and the Ohio Academy of Science (OAS) until 1974, when OAS took over sole control of the journal. The issues of OJS are archived in the Knowledge Bank with a two-year rolling wall embargo. The issues for 1900 through 2003, a total of 639 issues containing 6,429 articles, were batch loaded into the Knowledge Bank. Due to rights issues, the retrospective batch loading project had two phases. The project to digitize OJS began with the 1900–1972 issues that OSU had the rights to digitize and make publicly available. OSU later acquired the rights for 1973–present, and (accounting for the embargo period) 1973–2003 became phase 2 of the project.

The two phases of batch loads were the most complicated automated batch loading processes developed to date for the Knowledge Bank. To batch load phase 1 in 2005 and phase 2 in 2006, the systems developers working with the Knowledge Bank wrote scripts to build collections, generate DC XML from the source metadata, create the archive directory, load the metadata and content files, create tables of contents, and load the tables of contents into DSpace.

The OJS community in the Knowledge Bank is organized by collections representing each issue of the journal. The systems developers used scripts to automate the building of the collections in DSpace because of the number needed as part of the retrospective project. The individual articles within the issues are items within the collections. There is a table of contents for the articles in each issue as part of the collection homepages.21 Again, due to the number required for the retrospective project, the systems developers used scripts to automate the creation and loading of the tables of contents. The tables of contents are contained in the HTML introductory text section of the collection pages. The tables of contents list title, authors, and pages. They also include a link to the item record and a direct link to the article PDF that includes the file size.

For each phase of the OJS project, a vendor contracted by OSUL supplied the article PDFs and an Excel spreadsheet with the article-level metadata. The metadata received from the vendor had not been customized for the Knowledge Bank. The OJS issues were sent to a vendor for digitization and metadata creation before the Knowledge Bank was chosen as the hosting site of the digitized journal. The OSU Digital Initiatives Steering Committee 2002 proposal for the OJS digitization project had predated the Knowledge Bank DSpace instance. OSUL staff performed quality-control checks of the vendor-supplied metadata and standardized the author names. The vendor supplied the author names as they appeared in the articles—in direct order, comma separated, and including any "and" that appeared. In addition to other quality checks performed, OSUL staff edited the author names in the spreadsheet to conform to DSpace author-entry convention (surname first). Semicolons were added to separate author names, and the extraneous ands were removed. A former metadata librarian mapped the vendor-supplied article-level metadata to Knowledge Bank DC, as illustrated in table 1.

Table 1. Mapping of vendor metadata to Qualified Dublin Core

Vendor-Supplied Metadata    Knowledge Bank Dublin Core
File                        [n/a: PDF file name]
Cover Title                 dc.identifier.citation*
ISSN                        dc.identifier.issn
Vol.                        dc.identifier.citation*
Iss.                        dc.identifier.citation*
Cover Date                  dc.identifier.citation*
Year                        dc.date.issued
Month                       dc.date.issued
Fpage                       dc.identifier.citation*
Lpage                       dc.identifier.citation*
Article Title               dc.title
Author Names                dc.creator
Institution                 dc.description
Abstract                    dc.description.abstract
n/a                         dc.language.iso
n/a                         dc.rights
n/a                         dc.type

*format: [Cover Title]. v[Vol.], n[Iss.] ([Cover Date]), [Fpage]-[Lpage]
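The composite dc.identifier.citation value follows the pattern given in the table footnote. The sketch below is illustrative only (it is not one of the project scripts); the column names come from table 1, the journal title, volume, and issue in the example come from the collection cited in note 21, and the page numbers are placeholders.

# Illustrative only: assemble dc.identifier.citation from the vendor columns
# in table 1, following the footnote pattern
# [Cover Title]. v[Vol.], n[Iss.] ([Cover Date]), [Fpage]-[Lpage]
use strict;
use warnings;

sub build_citation {
    my (%row) = @_;    # hash keyed by the vendor column headings
    return sprintf '%s. v%s, n%s (%s), %s-%s',
        @row{'Cover Title', 'Vol.', 'Iss.', 'Cover Date', 'Fpage', 'Lpage'};
}

# Example (page numbers are placeholders, not taken from the issue):
# prints "The Ohio Journal of Science. v74, n3 (May, 1974), 137-138"
print build_citation(
    'Cover Title' => 'The Ohio Journal of Science',
    'Vol.'        => '74',
    'Iss.'        => '3',
    'Cover Date'  => 'May, 1974',
    'Fpage'       => '137',
    'Lpage'       => '138',
), "\n";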
The systems developers used the mapping as a guide to write Perl scripts to transform the vendor metadata into the DSpace schema of DC. The workflow for the two phases was nearly identical, except each phase had its own batch loading scripts. Due to a staff change between the two phases of the project, a former OSUL systems developer was responsible for batch loading phase 1 and the OIT systems developer was responsible for phase 2. The phase 1 scripts were all written in Perl. The four scripts written for phase 1 created the archive directory, performed database operations to build the collections, generated the HTML introduction table of contents for each collection, and loaded the tables of contents into DSpace via the database. For phase 2, the OIT systems developer modified and added to the phase 1 batch processing scripts. This case study focuses on phase 2 of the project.

Batch Processing for Phase 2 of OJS

The annotated scripts the OIT systems developer used for phase 2 of the OJS project are included in appendix A, available on the ITALica weblog (http://ital-ica.blogspot.com/). A shell script (mkcol.sh) added collections based on a listing of the journal issues. The script performed a login as a selected user ID to the DSpace Web interface using the Web access tool Curl. A subsequent simple looping Perl script (mkallcol.pl) used the stored credentials to submit data via this channel to build the collections in the Knowledge Bank.

The metadata.pl script created the archive directory for each collection. The OIT systems developer added the PDF file for each item to Unix. The vendor-supplied metadata was saved as Unicode text format and transferred to Unix for further processing. The developer used vi commands to manually modify metadata for characters illegal in XML (e.g., "<" and "&"). (Although manual steps were used for this project, the OIT systems developer improved the Perl scripts for subsequent projects by adding code for automated transformation of the input data to help ensure XML validity.) The metadata.pl script then processed each line of the metadata along with the corresponding data file. For each item, the script created the DC XML file and the contents file and moved them and the PDF file to the proper directory. Load sets for each collection (issue) were placed in their own subdirectory, and a load was done for each subdirectory.
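Because each issue's load set sits in its own subdirectory and is imported into its own collection, the per-issue loads reduce to a loop around the DSpace item importer. The sketch below is not loaditems.pl or import.sh from appendix A; it assumes a hypothetical issues.txt listing each load-set subdirectory with its collection handle, and it reuses the importer invocation shown in appendix E.

#!/usr/bin/perl
# Illustrative loop over per-issue load sets (not the Knowledge Bank's
# loaditems.pl). Assumes a hypothetical file "issues.txt" whose lines pair a
# load-set subdirectory with the handle of the matching issue collection,
# e.g.:  v74n3  1811/22017
use strict;
use warnings;

my $eperson = 'someone@example.edu';   # placeholder submitter address

open my $list, '<', 'issues.txt' or die "cannot read issues.txt: $!";
while (my $line = <$list>) {
    next if $line =~ /^\s*(#|$)/;      # skip comments and blank lines
    my ($subdir, $collection) = split ' ', $line;
    my $mapfile = "map.$subdir";       # one map file per issue load

    system('/dspace/bin/dsrun', 'org.dspace.app.itemimport.ItemImport',
        '--add',
        "--eperson=$eperson",
        "--collection=$collection",
        "--source=./$subdir",
        "--mapfile=$mapfile") == 0
        or warn "import of $subdir failed: $?\n";
}
close $list;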
The items for each collection were loaded by a small Perl script (loaditems.pl) that used the list of issues and their collection IDs and called a shell script (import.sh) for the actual load.

The tables of contents for the issues were added to the Knowledge Bank after the items were loaded. A Perl script (intro.pl) created the tables of contents using the metadata and the DSpace map file, a stored mapping of item directories to item handles created during the load. The tables of contents were added to the Knowledge Bank using a shell script (installintro.sh) similar to what was used to create the collections. Installintro.sh used Curl to simulate a user adding the data to DSpace by performing a login as a selected user ID to the DSpace Web interface. A simple looping Perl script (ldallintro.pl) called installintro.sh and used the stored credentials to submit the data for the tables of contents.
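The map file mentioned above makes a table of contents straightforward to generate. The sketch below is not intro.pl; it assumes, without confirmation from the paper, that each map-file line pairs an item directory name with the item's handle, and it emits a much simpler HTML list than the Knowledge Bank tables of contents, which also carry authors, page ranges, direct PDF links, and file sizes. The map-file name, article list, and page numbers are placeholders.

#!/usr/bin/perl
# Illustrative table-of-contents builder (not intro.pl). The assumed map-file
# layout is one "item_directory handle" pair per line; verify locally.
use strict;
use warnings;

my %handle_for;                          # item directory => handle
open my $map, '<', 'map-ojs-v74n3' or die "cannot read map file: $!";
while (<$map>) {
    my ($dir, $handle) = split ' ';
    $handle_for{$dir} = $handle if defined $handle;
}
close $map;

# One row per article: directory, title, authors, pages (placeholder data).
my @articles = (
    [ 'item_000', 'Notes on the Bird Life of Cedar Point',
      'Griggs, Robert F.', 'pp. 1-12' ],
);

print "<ul>\n";
for my $a (@articles) {
    my ($dir, $title, $authors, $pages) = @$a;
    my $handle = $handle_for{$dir} or next;
    print qq{<li><a href="http://hdl.handle.net/$handle">$title</a> },
          qq{($authors) $pages</li>\n};
}
print "</ul>\n";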
The Abstracts of the OSU International Symposium on Molecular Spectroscopy

The Knowledge Bank contains the abstracts of the papers presented at the OSU International Symposium on Molecular Spectroscopy (MSS), which has met annually since 1946. Beginning with the 2005 Symposium, the complete presentations from authors who have authorized their inclusion are archived along with the abstracts. The MSS community in the Knowledge Bank currently contains 17,714 items grouped by decade into six collections. The six collections were created "manually" via the DSpace Web interface prior to the batch loading of the items. The retrospective years of the Symposium (1946–2004) were batch loaded in three phases in 2006. Each Symposium year following the retrospective loads was batch loaded individually.

Retrospective MSS Batch Loads

The majority of the abstracts for the retrospective loads were digitized by OSUL. A vendor was contracted by OSUL to digitize the remainder and to supply the metadata for the retrospective batch loads. The files digitized by OSUL were sent to the vendor for metadata capture. OSUL provided the vendor a metadata template derived from the MSS core element set. The metadata taken from the abstracts comprised author, affiliation, title, year, session number, sponsorship (if applicable), and a full transcription of the abstract. To facilitate searching, the formulas and special characters appearing in the titles and abstracts were encoded using LaTeX, a document preparation system used for scientific data. The vendor delivered the metadata in Excel spreadsheets as per the spreadsheet template provided by OSUL. Quality-checking the metadata was an essential step in the workflow for OSUL. The metadata received for the project required revisions and data cleanup. The vendor originally supplied incomplete files and spreadsheets that contained data errors, including incorrect numbering, data in the wrong fields, and inconsistency with the LaTeX encoding.

The three Knowledge Bank batch load phases for the retrospective MSS project corresponded to the staged receipt of metadata and digitized files from the vendor. The annotated scripts used for phase 2 of the project, which included twenty years of the OSU International Symposium between 1951 and 1999, are included in appendix B, available on the ITALica weblog. The OIT systems developer saved the metadata as a tab-separated file and added it to Unix along with the abstract files. A Perl script (mkxml2.pl) transformed the metadata into DC XML and created the archive directories for loading the metadata and abstract files into the Knowledge Bank. The script divided the directories into separate load sets for each of the six collections and accounted for the inconsistent naming of the abstract files. The script added the constant data for type and language that was not included in the vendor-supplied metadata. Unlike the OJS project, where multiple authors were on the same line of the metadata file, the MSS phase 2 script had to code for authors and their affiliations on separate lines. Once the load sets were made, the OIT systems developer ran a shell script to load them. The script (import_collections.sh) was used to run the load for each set so that the DSpace item import command did not need to be constructed each time.

Annual MSS Batch Loads

A new workflow was developed for batch loading the annual MSS collection additions. The metadata and item files for the annual collection additions are supplied by the MSS community. The community provides the Symposium metadata in a CSV file and the item files in a Tar archive file. The Symposium uses a Web form for LaTeX-formatted abstract submissions. The community processes the electronic Symposium submissions with a Perl script to create the CSV file. The metadata delivered in the CSV file is based on the template created by the author, which details the metadata requirements for the project.

The OIT systems developer borrowed from and modified earlier Perl scripts to create a new script for batch processing the metadata and files for the annual Symposium collection additions. To assist with the development of the new script, I provided the developer a mapping of the community CSV headings to the Knowledge Bank DC fields. I also provided a sample DC XML file to illustrate the desired result of the Perl transformation of the community metadata into DC XML.
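The mapping document itself is not reproduced in this paper. The hash below is an illustrative reconstruction based on the CSV column names visible in appendix E and the DC fields described in this section; the Dublin Core target for Talk_id is not stated in the paper and is marked as an assumption, and the three file-name columns drive bitstream copying rather than metadata.

# Illustrative reconstruction of a CSV-heading-to-Dublin-Core mapping for the
# annual MSS loads (assumptions noted inline; not the mapping used by OSUL).
use strict;
use warnings;

my %csv_to_dc = (
    'Talk_id'           => 'dc.identifier',          # qualifier not stated in the paper
    'Title'             => 'dc.title',
    'Creators'          => 'dc.creator',             # split on ";" into repeated values
    'Abstract'          => 'dc.description.abstract',
    'IssueDate'         => 'dc.date.issued',
    'Description'       => 'dc.description',
    'AuthorInstitution' => 'dc.description',         # prefixed "Author Institution: "
    # The remaining columns name files to load as bitstreams, not metadata:
    'Image_file_name'   => undef,
    'Talk_gifs_file'    => undef,
    'Talk_ppt_file'     => undef,
);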
For each new year of the Symposium, I create a sample DC XML result for an item to check the accuracy of the script. A DC XML example from a 2009 MSS item is included in appendix C, available on the ITALica weblog. Unlike the previous retrospective MSS loads in which the script processed multiple years of the Symposium, the new script processes one year at a time. The annual Symposiums are batch loaded individually into one existing MSS decade collection. The new script for the annual loads was tested and refined by loading the 2005 Symposium into the staging instance of the Knowledge Bank. Problems encountered with character encoding and file types were resolved by modifying the script. The metadata and files for the Symposium years 2005, 2006, and 2007 were made available to OSUL in 2007, and each year was individually loaded into the existing Knowledge Bank collection for that decade.

These first three years of community-supplied CSV files contained author metadata inconsistent with Knowledge Bank author entries. The names were in direct order, uppercase, split by either a semicolon or "and," and included extraneous data, such as an address. The OIT systems developer wrote a Perl script to correct the author metadata as part of the batch loading workflow. An annotated section of that script illustrating the author modifications is included in appendix D, available on the ITALica weblog. The MSS community revised the Perl script they used to generate the CSV files by including an edited version of this author entry correction script and were able to provide the expected author data for 2008 and 2009. The author entries received for these years were in inverted order (surname first) and mixed case, were semicolon separated, and included no extraneous data. The receipt of consistent data from the community for the last two years has facilitated the standardized workflow for the annual MSS loads.
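Appendix D (the author-entry correction script) is not reproduced here, but the kind of transformation it performs can be sketched in a few lines. The routine below is a simplified illustration rather than the OSUL script: it assumes names arrive in direct order and uppercase, separated by semicolons or "and", possibly followed by a comma and an address, and it does not handle multi-word surnames or suffixes.

# Simplified illustration of the author-entry correction described above
# (this is not the appendix D script).
use strict;
use warnings;

sub normalize_authors {
    my ($raw) = @_;
    my @out;
    for my $chunk (split /\s*(?:;|\band\b)\s*/i, $raw) {
        next unless $chunk =~ /\S/;
        # Keep only the name itself; drop a trailing ", Dept. of ..." address.
        my ($name) = split /\s*,\s*/, $chunk, 2;
        # Mixed case: capitalize the first letter of each word.
        $name = join ' ', map { ucfirst lc } split ' ', $name;
        my @words   = split ' ', $name;
        my $surname = pop @words;
        push @out, @words ? "$surname, @words" : $surname;
    }
    return join '; ', @out;
}

# Example: prints "Smith, John A.; Doe, Jane"
print normalize_authors(
    'JOHN A. SMITH, Department of Chemistry and JANE DOE'), "\n";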
The scripts used to batch load the 2009 Symposium year are included in appendix E, which appears at the end of this text. The OIT systems developer unpacked the Tar file of abstracts and presentations into a directory named for the year of the Symposium on Unix. The Perl script written for the annual MSS loads (mkxml.pl) was saved on Unix and renamed mkxml2009.pl. The script was edited for 2009 (including the name of the CSV file and the location of the directories for the unpacked files and generated XML). The CSV headings used by the community in the new file were checked and verified against the extract list in the script. Once the Perl script was up-to-date and the base directory was created, the OIT systems developer ran the Perl script to generate the archive directory set for import. The import.sh script was then edited for 2009 and run to import the new Symposium year into the staging instance of the Knowledge Bank as a quality check prior to loading into the live repository. The brief item view of an example MSS 2009 item archived in the Knowledge Bank is shown in figure 3.

Figure 3. MSS 2009 archived item example

■■ Summary and Conclusion

Each of the batch loads that used Perl scripts had its own unique features. The format of content and associated metadata varied considerably, and custom scripts to convert the content and metadata into the DSpace import format were created on a case-by-case basis. The differences between batch loads included the delivery format of the metadata, the fields of metadata supplied, how metadata values were delimited, the character set used for the metadata, the data used to uniquely identify the files to be loaded, and how repeating metadata fields were identified. Because of the differences in supplied metadata, a separate Perl script for generating the DC XML and archive directory for batch loading was written for each project. Each new Perl script borrowed from and modified earlier scripts. Many of the early batch loads were firsts for the Knowledge Bank and the staff working with the repository, both in terms of content and in terms of metadata. Dealing with community- and vendor-supplied metadata and various encodings (including LaTeX), each of the early loads encountered different data obstacles, and in each case solutions were written in Perl. The batch loading code has matured over time, and the progression of improvements is evident in the example scripts included in the appendixes.

Batch loading can greatly reduce the time it takes to add content and metadata to a repository, but successful batch loading workflows are dependent upon the quality of data and metadata loaded. Along with testing scripts and checking imported metadata by first batch loading to a development or staging environment, quality control of the supplied metadata is an integral step. The flexibility of Perl allowed testing and revising to accommodate problems encountered with how the metadata was supplied for the heterogeneous collections batch loaded into the Knowledge Bank. However, toward the goal of standardizing batch loading workflows, the staff working with the Knowledge Bank iteratively refined not only the scripts but also the metadata requirements for each project and how those were communicated to the data suppliers with mappings, explicit metadata examples, and sample desired results. The efficiency of batch loading workflows is greatly enhanced by consistent data and basic standards for how metadata is supplied.

Batch loading is not only an extremely efficient means of populating an institutional repository, it is also a value-added service that can increase buy-in from the wider campus community. It is hoped that by openly sharing examples of our batch loading scripts we are contributing to the development of an open library of code that can be borrowed and adapted by the library community toward future institutional repository success stories.

■■ Acknowledgments

I would like to thank Conrad Gratz, of OSU OIT, and Andrew Wang, formerly of OSUL. Gratz wrote the shell scripts and the majority of the Perl scripts used for automating the Knowledge Bank item import process and ran the corresponding batch loads. The early Perl scripts used for batch loading into the Knowledge Bank, including the first phase of OJS and MSS, were written by Wang. Parts of those early Perl scripts written by Wang were borrowed for subsequent scripts written by Gratz. Gratz provided the annotated scripts appearing in the appendixes and consulted with the author regarding the description of the scripts. I would also like to thank Amanda J. Wilson, a former metadata librarian for OSUL, who was instrumental to the success of many of the batch loading workflows created for the Knowledge Bank.

References and Notes

1. The Ohio State University Knowledge Bank, "Institutional Repository Policies," 2007, http://library.osu.edu/sites/kbinfo/policies.html (accessed Dec. 21, 2009). The Knowledge Bank homepage can be found at https://kb.osu.edu/dspace/ (accessed Dec. 21, 2009).
2. Margret Branschofsky et al., "Evolving Metadata Needs for an Institutional Repository: MIT's DSpace," Proceedings of the 2003 International Conference on Dublin Core and Metadata Applications: Supporting Communities of Discourse and Practice—Metadata Research & Applications, Seattle, Washington, 2003, http://dcpapers.dublincore.org/ojs/pubs/article/view/753/749 (accessed Dec. 21, 2009).
3. R. Mishra et al., "Development of ETD Repository at IITK Library using DSpace," in International Conference on Semantic Web and Digital Libraries (ICSD-2007), ed. A. R. D. Prasad and Devika P. Madalli (2007), 249–59, http://hdl.handle.net/1849/321 (accessed Dec. 21, 2009).
4. Todd M. Mundle, "Digital Retrospective Conversion of Theses and Dissertations: An In House Project" (paper presented to the 8th International Symposium on Electronic Theses & Dissertations, Sydney, Australia, Sept. 28–30, 2005), http://adt.caul.edu.au/etd2005/papers/080Mundle.pdf (accessed Dec. 21, 2009).
5. Rowan Brownlee, "Research Data and Repository Metadata: Policy and Technical Issues at the University of Sydney Library," Cataloging & Classification Quarterly 47, no. 3/4 (2009): 370–79.
6. Steve Thomas, "Importing MARC Data into DSpace," 2006, http://hdl.handle.net/2440/14784 (accessed Dec. 21, 2009).
7. Sarah Kim, Lorraine A. Dong, and Megan Durden, "Automated Batch Archival Processing: Preserving Arnold Wesker's Digital Manuscripts," Archival Issues 30, no. 2 (2006): 91–106.
8. Elspeth Healey, Samantha Mueller, and Sarah Ticer, "The Paul N. Banks Papers: Archiving the Electronic Records of a Digitally-Adventurous Conservator," 2009, https://pacer.ischool.utexas.edu/bitstream/2081/20150/1/Paul_Banks_Final_Report.pdf (accessed Dec. 21, 2009); Lisa Schmidt, "Preservation of a Born Digital Literary Genre: Archiving Legacy Macintosh Hypertext Files in DSpace," 2007, https://pacer.ischool.utexas.edu/bitstream/2081/9007/1/MJ%20WBO%20Capstone%20Report.pdf (accessed Dec. 21, 2009).
9. Rachel E. Proudfoot et al., "JISC Final Report: IncReASe (Increasing Repository Content through Automation and Services)," 2009, http://eprints.whiterose.ac.uk/9160/ (accessed Dec. 21, 2009).
10. Michael Witt and Mark P. Newton, "Preparing Batch Deposits for Digital Commons Repositories," 2008, http://docs.lib.purdue.edu/lib_research/96/ (accessed Dec. 21, 2009).
11. Lesley Drysdale, "Importing Records from Reference Manager into GNU EPrints," 2004, http://hdl.handle.net/1905/175 (accessed Dec. 21, 2009).
12. R. John Robertson, "Evaluation of Metadata Workflows for the Glasgow ePrints and DSpace Services," 2006, http://hdl.handle.net/1905/615 (accessed Dec. 21, 2009); William J. Nixon and Morag Greig, "Populating the Glasgow ePrints Service: A Mediated Model and Workflow," 2005, http://hdl.handle.net/1905/387 (accessed Dec. 21, 2009).
13. Tim Ribaric, "Automatic Preparation of ETD Material from the Internet Archive for the DSpace Repository Platform," Code4Lib Journal no. 8 (Nov. 23, 2009), http://journal.code4lib.org/articles/2152 (accessed Dec. 21, 2009).
14. Randall Floyd, "Automated Electronic Thesis and Dissertations Ingest" (Mar. 30, 2009), http://wiki.dlib.indiana.edu/confluence/x/01Y (accessed Dec. 21, 2009).
15. Shawn Averkamp and Joanna Lee, "Repurposing ProQuest Metadata for Batch Ingesting ETDs into an Institutional Repository," Code4Lib Journal no. 7 (June 26, 2009), http://journal.code4lib.org/articles/1647 (accessed Dec. 21, 2009).
16. Tim Brody, Registry of Open Access Repositories (ROAR), http://roar.eprints.org/ (accessed Dec. 21, 2009).
17. DuraSpace, DSpace, http://www.dspace.org/ (accessed Dec. 21, 2009).
18. Dublin Core Metadata Initiative Libraries Working Group, "DC-Library Application Profile (DC-Lib)," http://dublincore.org/documents/2004/09/10/library-application-profile/ (accessed Dec. 21, 2009).
19. The Ohio State University Knowledge Bank Policy Committee, "OSU Knowledge Bank Metadata Application Profile," http://library.osu.edu/sites/techservices/KBAppProfile.php (accessed Dec. 21, 2009).
20. Ohio Journal of Science (Ohio Academy of Science), Knowledge Bank community, http://hdl.handle.net/1811/686 (accessed Dec. 21, 2009); OSU International Symposium on Molecular Spectroscopy, Knowledge Bank community, http://hdl.handle.net/1811/5850 (accessed Dec. 21, 2009).
21. Ohio Journal of Science (Ohio Academy of Science), Ohio Journal of Science: Volume 74, Issue 3 (May, 1974), Knowledge Bank collection, http://hdl.handle.net/1811/22017 (accessed Dec. 21, 2009).

Appendixes A–D available at http://ital-ica.blogspot.com/

Appendix E. MSS 2009 Batch Loading Scripts

-- mkxml2009.pl --

#!/usr/bin/perl
use Encode;            # Routines for UTF encoding
use Text::xSV;         # Routines to process CSV files.
use File::Basename;

# NOTE: The angle-bracketed XML literals in this listing were lost when the
# published text was extracted; the <dublin_core> and <dcvalue> markup shown
# here is a reconstruction, and the element/qualifier names should be checked
# against the collection's core element set and the appendix C example.

# Open and read the comma separated metadata file.
my $csv = new Text::xSV;
#$csv->set_sep("\t");             # Use for tab separated files.
$csv->open_file("MSS2009.csv");
$csv->read_header();              # Process the CSV column headers.

# Constants for file and directory names.
$basedir = "/common/batch/input/mss/";
$indir = "$basedir/2009";
$xmldir = "./2009xml";
$imagesubdir = "processed_images";
$filename = "dublin_core.xml";

# Process each line of metadata, one line per item.
$linenum = 1;
while ($csv->get_row()) {
    # This divides the item's metadata into fields, each in its own variable.
    my (
        $identifier,
        $title,
        $creators,
        $description_abstract,
        $issuedate,
        $description,
        $description2,
        $abstract,
        $gif,
        $ppt,
    ) = $csv->extract(
        "Talk_id",
        "Title",
        "Creators",
        "Abstract",
        "IssueDate",
        "Description",
        "AuthorInstitution",
        "Image_file_name",
        "Talk_gifs_file",
        "Talk_ppt_file"
    );

    $creatorxml = "";
    # Multiple creators are separated by ';' in the metadata.
    if (length($creators) > 0) {
        # Create XML for each creator.
        @creatorlist = split(/;/, $creators);
        foreach $creator (@creatorlist) {
            if (length($creator) > 0) {
                $creatorxml .= '<dcvalue element="creator" qualifier="none">'
                    . $creator . '</dcvalue>' . "\n   ";
            }
        }
    } # Done processing creators for this item.

    # Create the XML string for the Abstract.
    $abstractxml = "";
    if (length($description_abstract) > 0) {
        # Convert special metadata characters for use in xml/html.
        $description_abstract =~ s/\&/&amp;/g;
        $description_abstract =~ s/\>/&gt;/g;
        $description_abstract =~ s/\</&lt;/g;
        $abstractxml = '<dcvalue element="description" qualifier="abstract">'
            . $description_abstract . '</dcvalue>';
    }

    # Create the XML string for the Description.
    $descriptionxml = "";
    if (length($description) > 0) {
        # Convert special metadata characters for use in xml/html.
        $description =~ s/\&/&amp;/g;
        $description =~ s/\>/&gt;/g;
        $description =~ s/\</&lt;/g;
        $descriptionxml = '<dcvalue element="description" qualifier="none">'
            . $description . '</dcvalue>';
    }

    # Create the XML string for the Author Institution.
    $description2xml = "";
    if (length($description2) > 0) {
        # Convert special metadata characters for use in xml/html.
        $description2 =~ s/\&/&amp;/g;
        $description2 =~ s/\>/&gt;/g;
        $description2 =~ s/\</&lt;/g;
        $description2xml = '<dcvalue element="description" qualifier="none">'
            . 'Author Institution: ' . $description2 . '</dcvalue>';
    }

    # Convert special characters in title.
    $title =~ s/\&/&amp;/g;
    $title =~ s/\>/&gt;/g;
    $title =~ s/\</&lt;/g;

    # (Reconstructed; the original lines were lost in extraction.) Create the
    # load-set subdirectory for this item.
    $subdir = "$xmldir/item_$linenum";
    mkdir("$basedir/$subdir");

    # Write the dublin_core.xml file for this item.
    open($fh, ">:encoding(UTF-8)", "$basedir/$subdir/$filename");
    print $fh <<"XML";
<dublin_core>
   <dcvalue element="identifier" qualifier="other">$identifier</dcvalue>
   <dcvalue element="title" qualifier="none">$title</dcvalue>
   <dcvalue element="date" qualifier="issued">$issuedate</dcvalue>
   $abstractxml
   $descriptionxml
   $description2xml
   <dcvalue element="type" qualifier="none">Article</dcvalue>
   <dcvalue element="language" qualifier="iso">en</dcvalue>
   $creatorxml
</dublin_core>
XML
    close($fh);

    # Create contents file and move files to the load set.
    # Copy item files into the load set.
    if (defined($abstract) && length($abstract) > 0) {
        system "cp $indir/$abstract $basedir/$subdir";
    }
    $sourcedir = substr($abstract, 0, 5);
    if (defined($ppt) && length($ppt) > 0) {
        system "cp $indir/$sourcedir/$sourcedir/*.* $basedir/$subdir/";
    }
    if (defined($gif) && length($gif) > 0) {
        system "cp $indir/$sourcedir/$imagesubdir/*.* $basedir/$subdir/";
    }

    # Make the 'contents' file and fill it with the file names.
    system "touch $basedir/$subdir/contents";
    if (defined($gif) && length($gif) > 0 && -d "$indir/$sourcedir/$imagesubdir") {
        # Sort items in reverse order so they show up right in DSpace.
        # This is a hack that depends on how the DB returns items
        # in unsorted (physical) order. There are better ways to do this.
        system "cd $indir/$sourcedir/$imagesubdir/;"
            . " ls *[0-9][0-9].* | sort -r >> $basedir/$subdir/contents";
        system "cd $indir/$sourcedir/$imagesubdir/;"
            . " ls *[a-zA-Z][0-9].* | sort -r >> $basedir/$subdir/contents";
    }
    if (defined($ppt) && length($ppt) > 0 && -d "$indir/$sourcedir/$sourcedir") {
        system "cd $indir/$sourcedir/$sourcedir/;"
            . " ls *.* >> $basedir/$subdir/contents";
    }
    # Put the Abstract in last, so it displays first.
    system "cd $basedir/$subdir; basename $abstract >>"
        . " $basedir/$subdir/contents";

    $linenum++;
} # Done processing an item.

-- import.sh --

#!/bin/sh
#
# Import a collection from files generated on dspace
#
COLLECTION_ID=1811/6635
EPERSON=[name removed]@osu.edu
SOURCE_DIR=./2009xml
BASE_ID=`basename $COLLECTION_ID`
MAPFILE=./map-dspace03-mss2009.$BASE_ID

/dspace/bin/dsrun org.dspace.app.itemimport.ItemImport --add --eperson=$EPERSON --collection=$COLLECTION_ID --source=$SOURCE_DIR --mapfile=$MAPFILE
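The body text notes that item directories are validated with the importer's test mode before a real load. The sketch below shows what such a dry run might look like for the 2009 load set, reusing the options from import.sh; the --test option is intended to run the importer without adding anything, but its behavior should be verified against the installed DSpace version, and the e-person address and map-file name are placeholders.

#!/usr/bin/perl
# Sketch of a pre-load validation run (illustrative only): invoke the same
# item importer used by import.sh with --test so nothing is added.
use strict;
use warnings;

my $collection = '1811/6635';                 # target collection handle
my $eperson    = 'someone@example.edu';       # placeholder submitter address
my $source     = './2009xml';                 # archive directory from mkxml2009.pl
my $mapfile    = './map-test.' . (split m{/}, $collection)[-1];

system('/dspace/bin/dsrun', 'org.dspace.app.itemimport.ItemImport',
    '--add', '--test',
    "--eperson=$eperson",
    "--collection=$collection",
    "--source=$source",
    "--mapfile=$mapfile") == 0
    or die "test import failed: $?";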