The Code4Lib Journal

The Code4Lib Journal
	Editorial: For Pandemic Times Such as This
A pandemic changes the world and changes libraries.
	Open Source Tools for Scaling Data Curation at QDR
This paper describes the development of services and tools for scaling data curation services at the Qualitative Data Repository (QDR). Through a set of open-source tools, semi-automated workflows, and extensions to the Dataverse platform, our team has built services for curators to efficiently and effectively publish collections of qualitatively derived data. The contributions we seek to make in this paper are as follows: 

1. We describe ‘human-in-the-loop’ curation and the tools that facilitate this model at QDR;

2. We provide an in-depth discussion of the design and implementation of these tools, including applications specific to the Dataverse software repository, as well as standalone archiving tools written in R; and

3. We highlight the role of providing a service layer for data discovery and accessibility of qualitative data.

Keywords: Data curation; open-source; qualitative data
	From Text to Map: Combing Named Entity Recognition and Geographic Information Systems
This tutorial shows readers how to leverage the power of named entity recognition (NER) and geographic information systems (GIS) to extract place names from text, geocode them, and create a public-facing map. This process is highly useful across disciplines. For example, it can be used to generate maps from historical primary sources, works of literature set in the real world, and corpora of academic scholarship. In order to lead the reader through this process, the authors work with a 500 article sample of the COVID-19 Open Research Dataset Challenge (CORD-19) dataset. As of the date of writing, CORD-19 includes 45,000 full-text articles with metadata. Using this sample, the authors demonstrate how to extract locations from the full-text with the spaCy library in Python, highlight methods to clean up the extracted data with the Pandas library, and finally teach the reader how to create an interactive map of the places using ArcGIS Online. The processes and code are described in a manner that is reusable for any corpus of text
	Using Integrated Library Systems and Open Data to Analyze Library Cardholders
The Harrison Public Library in Westchester County, New York operates two library buildings in Harrison: The Richard E. Halperin Memorial Library Building (the library’s main building, located in downtown Harrison) and a West Harrison branch location. As part of its latest three-year strategic plan, the library sought to use existing resources to improve understanding of its cardholders at both locations. 

To do so, we needed to link the circulation data in our integrated library system, Evergreen, to geographic data and demographic data. We decided to build a geodemographic heatmap that incorporated all three aforementioned types of data. Using Evergreen, American Community Survey (ACS) data, and Google Maps, we plotted each cardholder’s residence on a map, added census boundaries (called tracts) and our town’s borders to the map, and produced summary statistics for each tract detailing its demographics and the library card usage of its residents. In this article, we describe how we acquired the necessary data and built the heatmap. We also touch on how we safeguarded the data while building the heatmap, which is an internal tool available only to select authorized staff members. Finally, we discuss what we learned from the heatmap and how libraries can use open data to benefit their communities.
	Update OCLC Holdings Without Paying Additional Fees: A Patchwork Approach
Accurate OCLC holdings are vital for interlibrary loan transactions. However, over time weeding projects, replacing lost or damaged materials, and human error can leave a library with a catalog that is no longer reflected through OCLC. While OCLC offers reclamation services to bring poorly maintained collections up-to-date, the associated fee may be cost prohibitive for libraries with limited budgets. This article will describe the process used at Austin Peay State University to identify, isolate, and update holdings using OCLC Collection Manager queries, MarcEdit, Excel, and Python. Some portions of this process are completed using basic coding; however, troubleshooting techniques will be included for those with limited previous experience.
	Data reuse in linked data projects: a comparison of Alma and Share-VDE BIBFRAME networks
This article presents an analysis of the enrichment, transformation, and clustering used by vendors Casalini Libri/@CULT and Ex Libris for their respective conversions of MARC data to BIBFRAME. The analysis considers the source MARC21 data used by Alma then the enrichment and transformation of MARC21 data from Share-VDE partner libraries. The clustering of linked data into a BIBFRAME network is a key outcome of data reuse in linked data projects and fundamental to the improvement of the discovery of library collections on the web and within search systems.
	CollectionBuilder-CONTENTdm: Developing a Static Web ‘Skin’ for CONTENTdm-based Digital Collections
Unsatisfied with customization options for CONTENTdm, librarians at University of Idaho Library have been using a modern static web approach to creating digital exhibit websites that sit in front of the digital repository. This "skin" is designed to provide users with new pathways to discover and explore collection content and context. This article describes the concepts behind the approach and how it has developed into an open source, data-driven tool called CollectionBuilider-CONTENTdm. The authors outline the design decisions and principles guiding the development of CollectionBuilder, and detail how a version is used at the University of Idaho Library to collaboratively build digital collections and digital scholarship projects.
	Automated Collections Workflows in GOBI: Using Python to Scrape for Purchase Options
The NC State University Libraries has developed a tool for querying GOBI, our print and ebook ordering vendor platform, to automate monthly collections reports. These reports detail purchase options for missing or long-overdue items, as well as popular items with multiple holds. GOBI does not offer an API, forcing staff to conduct manual title-by-title searches that previously took up to 15 hours per month. To make this process more efficient, we wrote a Python script that automates title searches and the extraction of key data (price, date of publication, binding type) from GOBI. This tool can gather data for hundreds of titles in half an hour or less, freeing up time for other projects.

This article will describe the process of creating this script, as well as how it finds and selects data in GOBI. It will also discuss how these results are paired with NC State’s holdings data to create reports for collection managers. Lastly, the article will examine obstacles that were experienced in the creation of the tool and offer recommendations for other organizations seeking to automate collections workflows.
	Testing remote access to e-resource with CodeceptJS
At the Badische Landesbibliothek Karlsruhe (BLB) we offer a variety of e-resources with different access requirements. On the one hand, there is free access to open access material, no matter where you are. On the other hand, there are e-resources that you can only access when you are in the rooms of the BLB. We also offer e-resources that you can access from anywhere, but you must have a library account for authentication to gain access. To test the functionality of these access methods, we have created a project to automatically test the entire process from searching our catalogue, selecting a hit, logging in to the provider's site and checking the results. For this we use the End 2 End Testing Framework CodeceptJS.
	Editorial
An abundance of information sharing.
	Leveraging Google Drive for Digital Library Object Storage
This article will describe a process at the University of Kentucky Libraries for utilizing an unlimited Google Drive for Education account for digital library object storage.  For a number of recent digital library projects, we have used Google Drive for both archival file storage and web derivative file storage.  As a part of the process, a Google Drive API script is deployed in order to automate the gathering of of Google Drive object identifiers.  Also, a custom Omeka plugin was developed to allow for referencing web deliverable files within a web publishing platform via object linking and embedding.

For a number of new digital library projects, we have moved toward a small VM approach to digital library management where the VM serves as a web front end but not a storage node.  This has necessitated alternative approaches to storing web addressable digital library objects.  One option is the use of Google Drive for storing digital objects.  An overview of our approach is included in this article as well as links to open source code we adopted and more open source code we produced.
	Building a Library Search Infrastructure with Elasticsearch
This article discusses our implementation of an Elastic cluster to address our search, search administration and indexing needs, how it integrates in our technology infrastructure, and finally takes a close look at the way that we built a reusable, dynamic search engine that powers our digital repository search. We cover the lessons learned with our early implementations and how to address them to lay the groundwork for a scalable, networked search environment that can also be applied to alternative search engines such as Solr.
	How to Use an API Management platform to Easily Build Local Web Apps
Setting up an API management platform like DreamFactory can open up a lot of possibilities for potential projects within your library. With an automatically generated restful API, the University Libraries at Virginia Tech have been able to create applications for gathering walk-in data and reference questions, public polling apps, feedback systems for service points, data dashboards and more. This article will describe what an API management platform is, why you might want one, and the types of potential projects that can quickly be put together by your local web developer.
	Git and GitLab in Library Website Change Management Workflows
Library websites can benefit from a separate development environment and a robust change management workflow, especially when there are multiple authors.  This article details how the Oakland University William Beaumont School of Medicine Library use Git and GitLab in a change management workflow with a serverless development environment for their website development team.  Git tracks changes to the code, allowing changes to be made and tested in a separate branch before being merged back into the website. GitLab adds features such as issue tracking and discussion threads to Git to facilitate communication and planning. Adoption of these tools and this workflow have dramatically improved the organization and efficiency of the OUWB Medical Library web development team, and it is the hope of the authors that by sharing our experience with them others may benefit as well.
	Experimenting with a Machine Generated Annotations Pipeline
The UCLA Library reorganized its software developers into focused subteams with one, the Labs Team, dedicated to conducting experiments. In this article we describe our first attempt at conducting a software development experiment, in which we attempted to improve our digital library’s search results with metadata from cloud-based image tagging services. We explore the findings and discuss the lessons learned from our first attempt at running an experiment.
	Leveraging the RBMS/BSC Latin Place Names File with Python
To answer the relatively straight-forward question “Which rare materials in my library catalog were published in Venice?” requires an advanced knowledge of geography, language, orthography, alphabet graphical changes, cataloging standards, transcription practices, and data analysis. The imprint statements of rare materials transcribe place names more faithfully as it appears on the piece itself, such as Venetus, or Venetiae, rather than a recognizable and contemporary form of place name, such as Venice, Italy. Rare materials catalogers recognize this geographic discoverability and selection issue and solve it with a standardized solution. To add consistency and normalization to imprint locations, rare materials catalogers utilize hierarchical place names to create a special imprint index. However, this normalized and contemporary form of place name is often missing from legacy bibliographic records. This article demonstrates using a traditional rare materials cataloging aid, the RBMS/BSC Latin Place Names File, with programming tools, Jupyter Notebook and Python, to retrospectively populate a special imprint index for 17th-century rare materials. This methodology enriched 1,487 MAchine Readable Cataloging (MARC) bibliographic records with hierarchical place names (MARC 752 fields) as part of a small pilot project. This article details a partially automated solution to this geographic discoverability and selection issue; however, a human component is still ultimately required to fully optimize the bibliographic data.
	Tweeting Tennessee’s Collections: A Case Study of a Digital Collections Twitterbot Implementation
This article demonstrates how a Twitterbot can be used as an inclusive outreach initiative that breaks down the barriers between the web and the reading room to share materials with the public. These resources include postcards, music manuscripts, photographs, cartoons and any other digitized materials. Once in place, Twitterbots allow physical materials to converge with the technical and social space of the Web. Twitterbots are ideal for busy professionals because they allow librarians to make meaningful impressions on users without requiring a large time investment. This article covers the recent implementation of a digital collections bot (@UTKDigCollBot) at the University of Tennessee, Knoxville (UTK), and provides documentation and advice on how you might develop a bot to highlight materials at your own institution.
	Building Strong User Experiences in LibGuides with Bootstrapr and Reviewr
With nearly fifty subject librarians creating LibGuides, the LibGuides Management Team at Notre Dame needed a way to both empower guide authors to take advantage of the powerful functionality afforded by the Bootstrap framework native to LibGuides, and to ensure new and extant library guides conformed to brand/identity standards and the best practices of user experience (UX) design. To accomplish this, we developed an online handbook to teach processes and enforce styles; a web app to create Twitter Bootstrap components for use in guides (Bootstrapr); and a web app to radically speed the review and remediation of guides, as well as better communicate our changes to guide authors (Reviewr). This article describes our use of these three applications to balance empowering guide authors against usefully constraining them to organizational standards for user experience. We offer all of these tools as FOSS under an MIT license so that others may freely adapt them for use in their own organization.
	IIIF by the Numbers
The UCLA Library began work on building a suite of services to support IIIF for their digital collections. The services perform image transformations and delivery as well as manifest generation and delivery. The team was unsure about whether they should use local or cloud-based infrastructure for these services, so they conducted some experiments on multiple infrastructure configurations and tested them in scenarios with varying dimensions.
	Trust, But Verify: Auditing Vendor-Supplied Accessibility Claims
Despite a long-overdue push to improve the accessibility of our libraries’ online presences, much of what we offer to our patrons comes from third party vendors: discovery layers, OPACs, subscription databases, and so on. We can’t directly affect the accessibility of the content on these platforms, but rely on vendors to design and test their systems and report on their accessibility through Voluntary Product Accessibility Templates (VPATS). But VPATs are self-reported. What if we want to verify our vendors’ claims? We can’t thoroughly test the accessibility of hundreds of vendor systems, can we? In this paper, we propose a simple methodology for spot-checking VPATs. Since most websites struggle with the same accessibility issues, spot checking particular success criteria in a library vendor VPAT can tip us off to whether the VPAT as a whole can be trusted. Our methodology combines automated and manual checking, and can be done without any expensive software or complex training. What’s more, we are creating a repository to share VPAT audit results with others, so that we needn’t all audit the VPATs of all our systems.
	Editorial
on diversity and mentoring
	Scraping BePress: Downloading Dissertations for Preservation
This article will describe our process developing a script to automate downloading of documents and secondary materials from our library's BePress repository. Our objective was to collect the full archive of dissertations and associated files from our repository into a local disk for potential future applications and to build out a preservation system. 

Unlike at some institutions, our students submit directly into BePress, so we did not have a separate repository of the files; and the backup of BePress content that we had access to was not in an ideal format (for example, it included "withdrawn" items and did not effectively isolate electronic theses and dissertations). Perhaps more importantly, the fact that BePress was not SWORD-enabled and lacked a robust API or batch export option meant that we needed to develop a data-scraping approach that would allow us to both extract files and have metadata fields populated. Using a CSV of all of our records provided by BePress, we wrote a script to loop through those records and download their documents, placing them in directories according to a local schema. We dealt with over 3,000 records and about three times that many items, and now have an established process for retrieving our files from BePress. Details of our experience and code are included.
	Persistent identifiers for heritage objects
Persistent identifiers (PID’s) are essential for getting access and referring to library, archive and museum (LAM) collection objects in a sustainable and unambiguous way, both internally and externally. Heritage institutions need a universal policy for the use of PID’s in order to have an efficient digital infrastructure at their disposal and to achieve optimal interoperability, leading to open data, open collections and efficient resource management.

Here the discussion is limited to PID’s that institutions can assign to objects they own or administer themselves. PID’s for people, subjects etc. can be used by heritage institutions, but are generally managed by other parties.

The first part of this article consists of a general theoretical description of persistent identifiers. First of all, I discuss the questions of what persistent identifiers are and what they are not, and what is needed to administer and use them. The most commonly used existing PID systems are briefly characterized. Then I discuss the types of objects PID’s can be assigned to. This section concludes with an overview of the requirements that apply if PIDs should also be used for linked data.

The second part examines current infrastructural practices, and existing PID systems and their advantages and shortcomings. Based on these practical issues and the pros and cons of existing PID systems a list of requirements for PID systems is presented which is used to address a number of practical considerations. This section concludes with a number of recommendations.
	Dimensions & VOSViewer Bibliometrics in the Reference Interview
The VOSviewer software provides easy access to bibliometric mapping using data from Dimensions, Scopus and Web of Science. The properly formatted and structured citation data, and the ease in which it can be exported open up new avenues for use during citation searches and reference interviews. This paper details specific techniques for using advanced searches in Dimensions, exporting the citation data, and drawing insights from the maps produced in VOS Viewer. These search techniques and data export practices are fast and accurate enough to build into reference interviews for graduate students, faculty, and post-PhD researchers. The search results derived from them are accurate and allow a more comprehensive view of citation networks embedded in ordinary complex boolean searches.
	Automating Authority Control Processes
Authority control is an important part of cataloging since it helps provide consistent access to names, titles, subjects, and genre/forms. There are a variety of methods for providing authority control, ranging from manual, time-consuming processes to automated processes. However, the automated processes often seem out of reach for small libraries when it comes to using a pricey vendor or expert cataloger. This paper introduces ideas on how to handle authority control using a variety of tools, both paid and free. The author describes how their library handles authority control; compares vendors and programs that can be used to provide varying levels of authority control; and demonstrates authority control using MarcEdit.
	Managing Electronic Resources Without Buying into the Library Vendor Singularity
Over the past decade, the library automation market has faced continuing consolidation. Many vendors in this space have pushed towards monolithic and expensive Library Services Platforms. Other vendors have taken "walled garden" approaches which force vendor lock-in due to lack of interoperability. For these reasons and others, many libraries have turned to open-source Integrated Library Systems (ILSes) such as Koha and Evergreen. These systems offer more flexibility and interoperability options, but tend to be developed with a focus on public libraries and legacy print resource functionality. They lack tools important to academic libraries such as knowledge bases, link resolvers, and electronic resource management systems (ERMs). Several open-source ERM options exist, including CORAL and FOLIO. This article analyzes the current state of these and other options for libraries considering supplementing their open-source ILS either alone, hosted or in a consortial environment.
	Shiny Fabric: A Lightweight, Open-source Tool for Visualizing and Reporting Library Relationships
This article details the development and functionalities of an open-source application called Fabric. Fabric is a simple to use application that renders library data in the form of network graphs (sociograms). Fabric is built in R using the Shiny package and is meant to offer an easy-to-use alternative to other software, such as Gephi and UCInet. In addition to being user friendly, Fabric can run locally as well as on a hosted server. This article discusses the development process and functionality of Fabric, use cases at the New College of Florida's Jane Bancroft Cook Library, as well as plans for future development.
	Analyzing and Normalizing Type Metadata for a Large Aggregated Digital Library
The Illinois Digital Heritage Hub (IDHH) gathers and enhances metadata from contributing institutions around the state of Illinois and provides this metadata to the Digital Public Library of America (DPLA) for greater access. The IDHH helps contributors shape their metadata to the standards recommended and required by the DPLA in part by analyzing and enhancing aggregated metadata. In late 2018, the IDHH undertook a project to address a particularly problematic field, Type metadata. This paper walks through the project, detailing the process of gathering and analyzing metadata using the DPLA API and OpenRefine, data remediation through XSL transformations in conjunction with local improvements by contributing institutions, and the DPLA ingestion system’s quality controls.
	Scaling IIIF Image Tiling in the Cloud
The International Archive of Women in Architecture, established at Virginia Tech in 1985, collects books, biographical information, and published materials from nearly 40 countries that are divided into around 450 collections. In order to provide public access to these collections, we built an application using the IIIF APIs to pre-generate image tiles and manifests which are statically served in the AWS cloud. We established an automatic image processing pipeline using a suite of AWS services to implement microservices in Lambda and Docker. By doing so, we reduced the processing time for terabytes of images from weeks to days.

In this article, we describe our serverless architecture design and implementations, elaborate the technical solution on integrating multiple AWS services with other techniques into the application, and describe our streamlined and scalable approach to handle extremely large image datasets. Finally, we show the significantly improved performance compared to traditional processing architectures along with a cost evaluation.
	Where Do We Go From Here: A Review of Technology Solutions  for Providing Access to Digital Collections
The University of Toronto Libraries is currently reviewing technology to support its Collections U of T service. Collections U of T provides search and browse access to 375 digital collections (and over 203,000 digital objects) at the University of Toronto Libraries. Digital objects typically include special collections material from the university as well as faculty digital collections, all with unique metadata requirements. The service is currently supported by IIIF-enabled Islandora, with one Fedora back end and multiple Drupal sites per parent collection (see attached image). Like many institutions making use of Islandora, UTL is now confronted with Drupal 7 end of life and has begun to investigate a migration path forward. This article will summarise the Collections U of T functional requirements and lessons learned from our current technology stack. It will go on to outline our research to date for alternate solutions. The article will review both emerging micro-service solutions, as well as out-of-the-box platforms, to provide an overview of the digital collection technology landscape in 2019. Note that our research is focused on reviewing technology solutions for providing access to digital collections, as preservation services are offered through other services at the University of Toronto Libraries.