The Code4Lib Journal

Editorial: Closer to 100 than to 1
With the publication of Issue 51, the Code4Lib Journal is now closer to Issue 100 than to Issue 1. Also, we are developing a name change policy.

Adaptive Digital Library Services: Emergency Access Digitization at the University of Illinois at Urbana-Champaign During the COVID-19 Pandemic
This paper describes how the University of Illinois at Urbana-Champaign Library provided access to circulating library materials during the 2020 COVID-19 pandemic. Specifically, it details how the library adapted existing staff roles and digital library infrastructure to offer on-demand digitization of and limited online access to library collection items requested by patrons working in a remote teaching and learning environment. The paper also provides an overview of the technology used, details how dedicated staff with strong local control of technology were able to scale up a university-wide solution, reflects on lessons learned, and analyzes nine months of usage data to shed light on library patrons' changing needs during the pandemic.

Assessing High-volume Transfers from Optical Media at NYPL
NYPL's workflow for transferring optical media to long-term storage was met with a challenge: an acquisition of a collection containing thousands of recordable CDs and DVDs. Many programs take a disk-by-disk approach to imaging or transferring optical media, but to deal with a collection of this size, NYPL developed a workflow using a Nimbie AutoLoader and a customized version of KBNL's open-source IROMLAB software to batch disks for transfer. This workflow prioritized quantity, but, at the outset, it was difficult to tell whether every transfer was as accurate as it could be. We discuss the process of evaluating the success of the mass transfer workflow and the improvements we made to identify and troubleshoot errors that could occur during the transfer. A background of the institution and other institutions' approaches to similar projects is given, followed by an in-depth discussion of the process of gathering and analyzing data. We finish with a discussion of our takeaways from the project.

Better Together: Improving the Lives of Metadata Creators with Natural Language Processing
DC Public Library has long held digital copies of the full run of the local alternative weekly Washington City Paper, but had no official status as a rights grantor to enable use. That recently changed when a full agreement was reached with the publisher. One condition of that agreement, however, was that issues become available with usable descriptive metadata and subject access in time to celebrate the publication's upcoming 40th anniversary, at that point only six months away. One of the most time-intensive tasks our metadata specialists work on is assigning description to digital objects. This paper details how we applied Python's Natural Language Toolkit and OpenRefine's reconciliation functions to the collection's OCR text to simplify subject selection for staff with no background in programming.
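
As a rough illustration of this kind of processing (a sketch, not the library's actual pipeline; the input filename is hypothetical), the following Python uses NLTK to surface the most frequent meaningful words in an issue's OCR text as candidate subject terms for reconciliation:

    # Illustrative sketch: surface candidate subject terms from an issue's
    # OCR text with NLTK before reconciling them in OpenRefine. Not the
    # library's actual pipeline; the input filename is hypothetical.
    import nltk
    from nltk.corpus import stopwords
    from nltk.probability import FreqDist

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    def candidate_terms(ocr_text, top_n=25):
        tokens = nltk.word_tokenize(ocr_text.lower())
        stops = set(stopwords.words("english"))
        words = [t for t in tokens if t.isalpha() and t not in stops and len(t) > 3]
        return FreqDist(words).most_common(top_n)

    with open("city_paper_issue_0001.txt") as f:  # hypothetical OCR file
        for term, count in candidate_terms(f.read()):
            print(term, count)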

Choose Your Own Educational Resource: Developing an Interactive OER Using the Ink Scripting Language
Learning games are games created with the purpose of educating, as well as entertaining, players. This article describes the potential of interactive fiction (IF), a type of text-based game, to serve as learning games. After summarizing the basic concepts of interactive fiction and learning games, the article describes common interactive fiction programming languages and tools, including Ink, a simple markup language that can be used to create choice-based text games that play in a web browser. The final section of the article includes code putting the concepts of Ink, interactive fiction, and learning games into action, using part of an interactive OER created by the author in December 2020.

Enhancing Print Journal Analysis for Shared Print Collections
The Western Regional Storage Trust (WEST) is a distributed shared print journal repository program serving research libraries, college and university libraries, and library consortia in the Western Region of the United States. WEST solicits serial bibliographic records and related holdings biennially, which are evaluated and identified as candidates for shared print archiving using a complex collection analysis process. California Digital Library's Discovery & Delivery WEST operations team (WEST-Ops) supports the functionality behind this collection analysis process used by WEST program staff (WEST-Staff) and members. For WEST, proposals for shared print archiving have historically been predicated on what is known as an Ulrich's journal family, which pulls together related serial titles, for example, succeeding and preceding serial titles, their supplements, and foreign-language parallel titles. Ulrich's, while invaluable, proves problematic in several ways, resulting in the omission of approximately half of the journal titles submitted for collection analysis. Part of WEST's effectiveness in archiving hinges upon its ability to analyze local serials data across its membership as holistically as possible. The process that enables this analysis, and subsequent archiving proposals, is dependent on Ulrich's journal family, for which ISSN has traditionally been used to match and cluster all related titles within a particular family. As such, the process is limited: many journals, especially older publications, have never been assigned ISSNs, and member bibliographic records may lack ISSNs that do exist in the corresponding OCLC primary records. Building a mechanism for matching on ISSNs that goes beyond the base set of primary, former, and succeeding titles expands the number of eligible ISSNs that facilitate Ulrich's journal family matching. Furthermore, when no matches in Ulrich's can be made based on ISSN, other types of control numbers within a bibliographic record may be used to match with records that have been previously matched with an Ulrich's journal family via ISSN, resulting in a significant increase in the number of titles eligible for collection analysis. This paper will discuss problems in Ulrich's journal family matching, the improved matching methodologies developed to address those problems, and potential strategies to improve serial title clustering in the future.
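
A minimal sketch of the two-stage matching the abstract describes, with hypothetical lookup tables and record layout standing in for WEST's actual data:

    # Sketch of the two-stage matching described above; the lookup tables
    # and record layout are hypothetical stand-ins for WEST's actual data.
    def match_to_family(record, issn_to_family, control_number_to_family):
        # Stage 1: try every ISSN on the record, not just the primary one.
        for issn in record.get("issns", []):
            if issn in issn_to_family:
                return issn_to_family[issn]
        # Stage 2: no ISSN match, so try other control numbers that appear
        # on records previously matched to a family via ISSN.
        for number in record.get("control_numbers", []):
            if number in control_number_to_family:
                return control_number_to_family[number]
        return None  # still ineligible for family-based analysis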

How We Built a Spatial Subject Classification Based on Wikidata
From the fall of 2017 to the beginning of 2020, a project was carried out to upgrade spatial subject indexing in the North Rhine-Westphalian Bibliography (NWBib) from uncontrolled strings to controlled values. For this purpose, a spatial classification with around 4,500 entries was created from Wikidata and published as a SKOS (Simple Knowledge Organization System) vocabulary. The article gives an overview of the initial problem and outlines the different implementation steps.

Institutional Data Repository Development, a Moving Target
At the end of 2019, the Research Data Service (RDS) at the University of Illinois at Urbana-Champaign (UIUC) completed its fifth year as a campus-wide service. In order to gauge the effectiveness of the RDS in meeting the needs of Illinois researchers, RDS staff developed a five-year review consisting of a survey and a series of in-depth focus group interviews. As a result, Illinois Data Bank, our institutional data repository developed in-house by University Library IT staff, was recognized as our unit's most useful service offering. When launched in 2016, storage resources and web servers for Illinois Data Bank and supporting systems were hosted on-premises at UIUC. As anticipated, researchers increasingly need to share large and complex datasets. To leverage potentially more reliable, highly available, cost-effective, and scalable storage with ready access to computation resources, we migrated our item bitstreams and web services to the cloud. Our efforts have met with success, but also with painful bumps along the way. This article describes how we supported data curation workflows through the transition from on-premises to cloud resource hosting. It details our approaches to ingesting, curating, and offering access to dataset files up to 2 TB in size, which may be archive files (e.g., .zip or .tar) containing complex directory structures.

On the Nature of Extreme Close-Range Photogrammetry: Visualization and Measurement of North African Stone Points
Image acquisition, visualization, and measurement are examined in the context of extreme close-range photogrammetric data analysis. Manual measurements commonly used in traditional stone artifact investigation are used as a starting point to better gauge the usefulness of high-resolution 3D surrogates and the flexible digital tool sets that can work with them. The potential of various visualization techniques is also explored in the context of future teaching, learning, and research in virtual environments.

Optimizing Elasticsearch Search Experience Using a Thesaurus
The Belgian Art Links and Tools (BALaT) (http://balat.kikirpa.be/) is the continuously expanding online documentary platform of the Royal Institute for Cultural Heritage (KIK-IRPA), Brussels (Belgium). BALaT contains over 750,000 images of KIK-IRPA's unique collection of photo negatives on the cultural heritage of Belgium, but also the library catalogue, PDFs of articles from KIK-IRPA's Bulletin and other publications, an extensive persons and institutions authority list, and several specialized thematic websites; each of these collections is multilingual, as Belgium has three official languages. All of these are interlinked to give the user easy access to freely available information on the Belgian cultural heritage. In recent years, KIK-IRPA has been working on a detailed and inclusive data management plan. Through this data management plan, a new project, HESCIDA (Heritage Science Data Archive), will upgrade BALaT to BALaT+, enabling access to searchable registries of KIK-IRPA datasets and data interoperability. BALaT+ will be a building block of DIGILAB, one of the future pillars of the European Research Infrastructure for Heritage Science (E-RIHS), which will provide online access to scientific data concerning tangible heritage, following the FAIR principles (Findable, Accessible, Interoperable, Reusable). It will include and enable access to searchable registries of specialized digital resources (datasets, reference collections, thesauri, ontologies, etc.). In the context of this project, Elasticsearch has been chosen as the technology powering the search component of BALaT+. An essential feature of this search functionality is linguistic equivalency: a query for a term in French should also return the matching results containing the equivalent term in Dutch. Another important feature is a mechanism to broaden the search to include more precise terminology: a term like "furniture" should also match records describing chairs, tables, etc. This article will explain how a thesaurus developed in-house at KIK-IRPA was used to obtain these functionalities, from the processing of that thesaurus to the production of the configuration needed by Elasticsearch.
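
As an illustration of the kind of Elasticsearch configuration such a thesaurus can drive (a sketch with an invented index name, sample synonym rules, and a local URL, not KIK-IRPA's actual setup), the following creates an index whose analyzer maps cross-language equivalents together and keeps narrower terms searchable while also indexing them under their broader term:

    # Sketch: create an index whose analyzer applies thesaurus-derived
    # synonym rules. Index name, URL, and sample rules are invented.
    import requests

    index_settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "thesaurus_synonyms": {
                        "type": "synonym_graph",
                        "synonyms": [
                            # cross-language equivalents expand to each other
                            "furniture, meubilair, mobilier",
                            # narrower terms stay searchable and are also
                            # indexed under their broader term
                            "chaise, stoel => chaise, stoel, furniture",
                            "table, tafel => table, tafel, furniture",
                        ],
                    }
                },
                "analyzer": {
                    "thesaurus_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "thesaurus_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "description": {"type": "text", "analyzer": "thesaurus_analyzer"}
            }
        },
    }

    requests.put("http://localhost:9200/balat-demo", json=index_settings)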

Pythagoras: Discovering and Visualizing Musical Relationships Using Computer Analysis
This paper presents an introduction to Pythagoras, an in-progress digital humanities project using Python to parse and analyze XML-encoded music scores. The goal of the project is to use recurring patterns of notes to explore existing relationships among musical works and composers. An intended outcome of this project is to give music performers, scholars, librarians, and anyone else interested in digital humanities new insights into musical relationships as well as new methods of data analysis in the arts.

Editorial
Resuming our publication schedule.

Managing an institutional repository workflow with GitLab and a folder-based deposit system
Institutional Repositories (IRs) exist in a variety of configurations and in various states of development across the country. Each organization with an IR has a workflow that can range from explicitly documented and codified sets of software and human workflows to ad hoc assortments of methods for working with faculty to acquire, process, and load items into a repository. The University of North Texas (UNT) Libraries has managed an IR called UNT Scholarly Works for the past decade but until recently relied on ad hoc workflows. Over the past six months, we have worked to improve our processes in a way that is extensible and flexible while also providing a clear workflow for our staff to process submitted and harvested content. Our approach makes use of GitLab and its associated tools to track and communicate priorities for a multi-user team processing resources. We paired this web-based management with a folder-based system for moving deposited resources through the sequential set of processes necessary to describe, upload, and preserve each resource. This strategy can be used in a number of different applications and can serve as a set of building blocks that can be configured in different ways. This article will discuss which components of GitLab are used together as tools for tracking deposits from faculty as they move through different steps in the workflow. Likewise, the folder-based workflow queue will be described as implemented at UNT, with examples of how we have used it in different situations.

Customizing Alma and Primo for Home & Locker Delivery
Like many Ex Libris libraries in Fall 2020, our library at California State University, Northridge (CSUN) was not physically open to the public during the 2020-2021 academic year, but we wanted to continue to support the research and study needs of our over 38,000 university students and 4,000 faculty and staff. This article explains our Alma and Primo implementation to allow for home mail delivery of physical items, including policy decisions, workflow changes, customization of request forms through labels and delivery skins, customization of Alma letters, and a Python solution to add the "home" address type to patron addresses to make it all work, with relevant code samples in Python, XSL, CSS, XML, and JSON. In Spring 2021, we will add an on-site locker delivery option in addition to home delivery, and this article includes the new system changes made for that option.
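
A rough sketch of such an address update against the Ex Libris Alma Users REST API (the API key is a placeholder, and the exact JSON field names should be verified against the Alma documentation; this is not CSUN's actual script):

    # Sketch using the Ex Libris Alma Users REST API; not CSUN's actual
    # script. Verify field names against the Alma documentation.
    import requests

    BASE = "https://api-na.hosted.exlibrisgroup.com/almaws/v1/users"
    API_KEY = "..."  # placeholder

    def add_home_address_type(user_id):
        params = {"apikey": API_KEY, "format": "json"}
        user = requests.get(f"{BASE}/{user_id}", params=params).json()
        for address in user["contact_info"]["address"]:
            types = address.setdefault("address_type", [])
            if not any(t.get("value") == "home" for t in types):
                types.append({"value": "home", "desc": "Home"})
        # Write the modified patron record back.
        requests.put(f"{BASE}/{user_id}", params=params, json=user)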

GaNCH: Using Linked Open Data for Georgia's Natural, Cultural and Historic Organizations' Disaster Response
In June 2019, the Atlanta University Center Robert W. Woodruff Library received a LYRASIS Catalyst Fund grant to support the creation of a publicly editable directory of Georgia's Natural, Cultural and Historic Organizations (NCHs), allowing for quick retrieval of location and contact information for disaster response. By the end of the project, over 1,900 entries for NCH organizations in Georgia were compiled, updated, and uploaded to Wikidata, the linked open data database from the Wikimedia Foundation. These entries include directory contact information and GIS coordinates that appear on a map presented on the GaNCH project website (https://ganch.auctr.edu/), allowing emergency responders to quickly search for NCHs by region and county in the event of a disaster. In this article we discuss the design principles, methods, and challenges encountered in building and implementing this tool, including the impact the tool has had on statewide disaster response since implementation.

Archive This Moment D.C.: A Case Study of Participatory Collecting During COVID-19
When the COVID-19 pandemic brought life in Washington, D.C. to a standstill in March 2020, staff at DC Public Library began looking for ways to document how this historic event was affecting everyday life. Recognizing the value of first-person accounts for historical research, staff launched Archive This Moment D.C. to preserve the story of daily life in the District during the stay-at-home order. Materials were collected from public Instagram and Twitter posts submitted through the hashtag #archivethismomentdc. In addition to social media, creators also submitted materials through an Airtable webform set up for the project and through email. Over 2,000 digital files were collected. This article will discuss the planning, professional collaboration, promotion, selection, access, and lessons learned from the project, as well as the technical setup, collection strategies, and metadata requirements. In particular, this article will include a discussion of the evolving collection scope of the project and the need for clear ethical guidelines surrounding privacy when collecting materials in real time.

Advancing ARKs in the Historical Ontology Space
This paper presents the application of Archival Resource Keys (ARKs) for persistent identification and resolution of concepts in historical ontologies. Our use case is the 1910 Library of Congress Subject Headings (LCSH), which we have converted to the Simple Knowledge Organization System (SKOS) format and will use for representing a corpus of historical Encyclopedia Britannica articles. We report on the steps taken to assign ARKs in support of the Nineteenth-Century Knowledge Project, where we are using the HIVE vocabulary tool to automatically assign subject metadata from both the 1910 LCSH and the contemporary LCSH faceted, topical vocabulary to enable the study of the evolution of knowledge.
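
As a small illustration of pairing ARKs with SKOS (a sketch built with rdflib, using the reserved test NAAN 99999 and an invented name, not the project's actual identifiers):

    # Sketch: an ARK-identified SKOS concept built with rdflib. Uses the
    # reserved test NAAN 99999 and an invented name; the project's real
    # ARKs and labels will differ.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import SKOS

    g = Graph()
    g.bind("skos", SKOS)

    scheme = URIRef("https://n2t.net/ark:/99999/lcsh1910")               # hypothetical
    concept = URIRef("https://n2t.net/ark:/99999/lcsh1910-aeronautics")  # hypothetical

    g.add((concept, SKOS.prefLabel, Literal("Aeronautics", lang="en")))
    g.add((concept, SKOS.inScheme, scheme))

    print(g.serialize(format="turtle"))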

Considered Content: a Design System for Equity, Accessibility, and Sustainability
The University of Minnesota Libraries developed and applied a principles-based design system to their Health Sciences Library website. With the design system at its center, the revised site was able to achieve accessible, ethical, inclusive, sustainable, responsible, and universal design. The final site was built with elegantly accessible, semantic, HTML-focused code on Drupal 8 with highly curated and considered content, meeting and exceeding WCAG 2.1 AA guidance and addressing cognitive and learning considerations through the use of plain language, templated pages for consistent page-level organization, and no hidden content. As a result, the site better supports all users regardless of their abilities, attention level, mental status, reading level, and the reliability of their internet connection, all of which are especially critical now as an elevated number of people experience crises, anxieties, and depression.

Robustifying Links To Combat Reference Rot
Links to web resources frequently break, and linked content can change at unpredictable rates. These dynamics of the Web are detrimental when references to web resources provide evidence or supporting information. In this paper, we highlight the significance of reference rot, provide an overview of existing techniques and their characteristics to address it, and introduce our Robust Links approach, including its web service and underlying API. Robustifying links offers a proactive, uniform, and machine-actionable way to combat reference rot. In addition, we discuss the reasoning behind our approach and how we aim to keep it functional for the long term. To showcase our approach, we have robustified all links in this article.

Machine Learning Based Chat Analysis
The BYU library implemented a machine learning-based tool to perform various text analysis tasks on transcripts of chat-based interactions between patrons and librarians. These text analysis tasks included estimating patron satisfaction and classifying queries into various categories such as Research/Reference, Directional, Tech/Troubleshooting, Policy/Procedure, and others. An accuracy of 78% or better was achieved for each category. This paper covers the implementation details and explores potential applications for the text analysis tool.
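
As a generic illustration of this sort of query classification (a sketch with toy training data using scikit-learn, not the BYU tool or its transcripts):

    # Generic sketch of chat category classification with scikit-learn;
    # toy training data, not the BYU tool or its transcripts.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    transcripts = [
        "Where can I find peer-reviewed articles on soil chemistry?",
        "What floor are the group study rooms on?",
        "My PDF won't download from off campus.",
    ]
    labels = ["Research/Reference", "Directional", "Tech/Troubleshooting"]

    # TF-IDF features feeding a linear classifier.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(transcripts, labels)

    print(model.predict(["How do I cite a government report in APA?"]))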

Open Source Tools for Scaling Data Curation at QDR
This paper describes the development of services and tools for scaling data curation services at the Qualitative Data Repository (QDR). Through a set of open-source tools, semi-automated workflows, and extensions to the Dataverse platform, our team has built services for curators to efficiently and effectively publish collections of qualitatively derived data. The contributions we seek to make in this paper are as follows: (1) we describe "human-in-the-loop" curation and the tools that facilitate this model at QDR; (2) we provide an in-depth discussion of the design and implementation of these tools, including applications specific to the Dataverse software repository as well as standalone archiving tools written in R; and (3) we highlight the role of providing a service layer for data discovery and accessibility of qualitative data.
Keywords: data curation; open source; qualitative data

From Text to Map: Combining Named Entity Recognition and Geographic Information Systems
This tutorial shows readers how to leverage the power of named entity recognition (NER) and geographic information systems (GIS) to extract place names from text, geocode them, and create a public-facing map. This process is highly useful across disciplines. For example, it can be used to generate maps from historical primary sources, works of literature set in the real world, and corpora of academic scholarship. To lead the reader through this process, the authors work with a 500-article sample of the COVID-19 Open Research Dataset Challenge (CORD-19) dataset. As of the date of writing, CORD-19 includes 45,000 full-text articles with metadata. Using this sample, the authors demonstrate how to extract locations from the full text with the spaCy library in Python, highlight methods to clean up the extracted data with the Pandas library, and finally teach the reader how to create an interactive map of the places using ArcGIS Online. The processes and code are described in a manner that is reusable for any corpus of text.
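
A condensed sketch of the extraction step, assuming spaCy's small English model is installed (python -m spacy download en_core_web_sm); the cleanup, geocoding, and mapping steps in the tutorial go further:

    # Condensed sketch of the extraction step: pull GPE/LOC entities with
    # spaCy and tally them in a Pandas DataFrame for later cleanup and
    # geocoding. Assumes: python -m spacy download en_core_web_sm
    from collections import Counter

    import pandas as pd
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_places(texts):
        counts = Counter()
        for doc in nlp.pipe(texts):
            counts.update(ent.text for ent in doc.ents
                          if ent.label_ in ("GPE", "LOC"))
        return pd.DataFrame(counts.most_common(), columns=["place", "count"])

    print(extract_places(["Cases rose in Wuhan, then in Lombardy and New York."]))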

Using Integrated Library Systems and Open Data to Analyze Library Cardholders
The Harrison Public Library in Westchester County, New York, operates two library buildings in Harrison: the Richard E. Halperin Memorial Library Building (the library's main building, located in downtown Harrison) and a West Harrison branch location. As part of its latest three-year strategic plan, the library sought to use existing resources to improve its understanding of its cardholders at both locations. To do so, we needed to link the circulation data in our integrated library system, Evergreen, to geographic and demographic data. We decided to build a geodemographic heatmap that incorporated all three aforementioned types of data. Using Evergreen, American Community Survey (ACS) data, and Google Maps, we plotted each cardholder's residence on a map, added census boundaries (called tracts) and our town's borders to the map, and produced summary statistics for each tract detailing its demographics and the library card usage of its residents. In this article, we describe how we acquired the necessary data and built the heatmap. We also touch on how we safeguarded the data while building the heatmap, which is an internal tool available only to select authorized staff members. Finally, we discuss what we learned from the heatmap and how libraries can use open data to benefit their communities.

Update OCLC Holdings Without Paying Additional Fees: A Patchwork Approach
Accurate OCLC holdings are vital for interlibrary loan transactions. However, over time, weeding projects, replacement of lost or damaged materials, and human error can leave a library with a catalog that is no longer accurately reflected in OCLC. While OCLC offers reclamation services to bring poorly maintained collections up to date, the associated fee may be cost-prohibitive for libraries with limited budgets. This article describes the process used at Austin Peay State University to identify, isolate, and update holdings using OCLC Collection Manager queries, MarcEdit, Excel, and Python. Some portions of this process are completed using basic coding; however, troubleshooting techniques are included for those with limited previous experience.
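
As an illustration of the identification step (a sketch assuming hypothetical CSV exports that share an oclc_number column, not the exact reports Austin Peay used):

    # Sketch of the identification step: diff OCLC numbers from a local
    # catalog export against a Collection Manager report. File names and
    # the oclc_number column are hypothetical.
    import pandas as pd

    local = pd.read_csv("local_catalog_export.csv", dtype=str)
    worldcat = pd.read_csv("collection_manager_report.csv", dtype=str)

    local_nums = set(local["oclc_number"].dropna())
    worldcat_nums = set(worldcat["oclc_number"].dropna())

    # Holdings set in WorldCat for items no longer owned (e.g., weeded).
    pd.Series(sorted(worldcat_nums - local_nums)).to_csv("to_unset.csv", index=False)
    # Owned items that WorldCat does not show as held.
    pd.Series(sorted(local_nums - worldcat_nums)).to_csv("to_set.csv", index=False)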

Data reuse in linked data projects: a comparison of Alma and Share-VDE BIBFRAME networks
This article presents an analysis of the enrichment, transformation, and clustering used by the vendors Casalini Libri/@CULT and Ex Libris for their respective conversions of MARC data to BIBFRAME. The analysis considers the source MARC21 data used by Alma, then the enrichment and transformation of MARC21 data from Share-VDE partner libraries. The clustering of linked data into a BIBFRAME network is a key outcome of data reuse in linked data projects and fundamental to improving the discovery of library collections on the web and within search systems.

CollectionBuilder-CONTENTdm: Developing a Static Web 'Skin' for CONTENTdm-based Digital Collections
Unsatisfied with customization options for CONTENTdm, librarians at the University of Idaho Library have been using a modern static web approach to create digital exhibit websites that sit in front of the digital repository. This "skin" is designed to provide users with new pathways to discover and explore collection content and context. This article describes the concepts behind the approach and how it has developed into an open source, data-driven tool called CollectionBuilder-CONTENTdm. The authors outline the design decisions and principles guiding the development of CollectionBuilder, and detail how a version is used at the University of Idaho Library to collaboratively build digital collections and digital scholarship projects.

Automated Collections Workflows in GOBI: Using Python to Scrape for Purchase Options
The NC State University Libraries has developed a tool for querying GOBI, our print and ebook ordering vendor platform, to automate monthly collections reports. These reports detail purchase options for missing or long-overdue items, as well as popular items with multiple holds. GOBI does not offer an API, forcing staff to conduct manual title-by-title searches that previously took up to 15 hours per month. To make this process more efficient, we wrote a Python script that automates title searches and the extraction of key data (price, date of publication, binding type) from GOBI. This tool can gather data for hundreds of titles in half an hour or less, freeing up time for other projects. This article describes the process of creating this script, as well as how it finds and selects data in GOBI. It also discusses how these results are paired with NC State's holdings data to create reports for collection managers. Lastly, the article examines obstacles encountered in the creation of the tool and offers recommendations for other organizations seeking to automate collections workflows.

Testing remote access to e-resources with CodeceptJS
At the Badische Landesbibliothek Karlsruhe (BLB) we offer a variety of e-resources with different access requirements. On the one hand, there is free access to open access material, no matter where you are. On the other hand, there are e-resources that you can only access when you are on the premises of the BLB. We also offer e-resources that you can access from anywhere, but you must have a library account and authenticate to gain access. To test the functionality of these access methods, we have created a project to automatically test the entire process: searching our catalogue, selecting a hit, logging in to the provider's site, and checking the results. For this we use the end-to-end testing framework CodeceptJS.
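
CodeceptJS scenarios are written in JavaScript; to keep the examples in this listing in a single language, here is a comparable end-to-end check sketched in Python with Playwright instead. The URL and selectors are placeholders, not the BLB catalogue's actual markup.

    # A comparable end-to-end check sketched in Python with Playwright
    # rather than CodeceptJS; URL and selectors are placeholders, not the
    # BLB catalogue's actual markup.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://catalogue.example.org")         # placeholder URL
        page.fill("#search-input", "open access journal")  # placeholder selector
        page.click("#search-submit")
        page.click(".result-list a >> nth=0")              # open the first hit
        page.click("text=Access online")
        # Expect the provider page rather than an authentication error.
        assert "login failed" not in page.content().lower()
        browser.close()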