PAL: Toward a Recommendation System for Manuscripts

Scott Ziegler and Richard Shrake

INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2018

Scott Ziegler (sziegler1@lsu.edu) is Head of Digital Programs and Services, Louisiana State University Libraries. Prior to this position, Ziegler was the Head of Digital Scholarship and Technology, American Philosophical Society. Richard Shrake (shraker13@gmail.com) is a Library Technology Consultant based in Burlington, Vermont.

ABSTRACT

Book-recommendation systems are increasingly common, from Amazon to public library interfaces. However, for archives and special collections, such automated assistance has been rare. This is partly due to the complexity of descriptions (finding aids describing whole collections) and partly due to the complexity of the collections themselves (what is this collection about and how is it related to another collection?). The American Philosophical Society Library is using circulation data collected through the collection-management software package, Aeon, to automate recommendations. In our system, which we’re calling PAL (People Also Liked), recommendations are offered in two ways: based on interests (“You’re interested in X, other people interested in X looked at these collections”) and on specific requests (“You’ve looked at Y, other people who looked at Y also looked at these collections”). This article will discuss the development of PAL and plans for the system. We will also discuss ongoing concerns and issues, how patron privacy is protected, and the possibility of generalizing beyond any specific software solution.

INTRODUCTION

The American Philosophical Society Library (APS) is an independent research library in Philadelphia. Founded in 1743, the library houses a wide variety of material in early American history, history of science, and Native American linguistics.
The majority of the library’s holdings are manuscripts, with a large amount of audio material, maps, and graphics, nearly all of which are described in finding aids created using Encoded Archival Description (EAD) standards.

Like similar institutions, the APS has long struggled to find new ways to help library users discover material relevant to their research. In addition to traditional in-person, email, and phone reference, the APS has spent years creating search and browse interfaces, subject guides, and web exhibitions to promote the collections.1

As part of these ongoing efforts to connect users with collections, the APS is working on an automated recommendation system to reuse circulation data gathered through Aeon. Developed by Atlas Systems, Aeon is a “request and workflow management software specifically designed for special collections libraries and archives,” and it enables the APS to gather statistics on both the use of our manuscript collections and on aspects of the library’s users.2

The automated recommendation system, which we’re calling PAL, for “People Also Liked,” is an ongoing effort. This article presents a snapshot of current work.

PAL: TOWARD A RECOMMENDATION SYSTEM FOR MANUSCRIPTS | ZIEGLER AND SHRAKE https://doi.org/10.6017/ital.v37i3.10357

LITERATURE REVIEW

The benefits of recommendations in library OPACs have long been recognized. Writing in 2008 about the library recommendation system BibTip, itself started in the early 2000s, Mönnich and Spiering observe that “library services are well suited for the adoption of recommendation systems, especially services that support the user in search of literature in the catalog.” By 2011 OCLC Research and the Information School at the University of Sheffield began exploring a recommendation system for OCLC’s WorldCat.3

Recommendations for library OPACs commonly fall into one of two categories, content-based or collaborative filtering.
Content-based recommendations pair specific users to library items based on the metadata of the item and what is known about the user. For example, if a user indicates in some way that they enjoy mystery novels, items identified as mystery novels might be recommended to them. Collaborative filtering combines users in some way and creates recommendations for one user based on the preferences of another user. There can be a dark side to recommendations. The algorithms that determine which users are similar and thus which recommendations to make are not often understood. Writing about algorithms in library discovery systems broadly, Reidsma points out that “in librarianship over the past few decades, the profession has had to grapple with the perception that computers are better at finding relevant information then people.”4 The algorithms that are doing the finding, however, often carry the same hidden biases that their programmers have. Reidsma encourages a broader understanding of algorithms in general and deeper understanding of recommendation algorithms in particular. The history of recommendation systems in libraries has informed the ongoing development of PAL. We use both the content-based and the collaborative filtering approach to offering recommendations to users. For the purposes of communicating them to nontechnical patrons, we refer to them as “interest-based” and “request-based,” respectively. Furthermore, we are cautious about the role algorithms play in determining which recommendations users see. Our help text reinforces the continued importance of working directly with in-house experts, and we promote PAL as one tool among the many offered by the library. We are not aware of any literature on the development of recommendation tools for archives or special-collections libraries. The nature of the material held in these institutions presents special challenges. 
For example, unlike book collections, many manuscript and archival collections are described in aggregate: one description might refer to many letters. These issues are discussed in detail below.

PUTTING DATA TO USE: RECOMMENDATIONS BASED ON INTERESTS AND REQUESTS

The use of Aeon allows the APS to gather and store data, including both data that users supply through the registration form and data concerning which collections are requested. PAL uses both types of data to create recommendations.

Interest-Based Recommendations

The first type of recommendation uses self-identified research interest data that researchers supply when creating an Aeon account. When registering, a user has the option to select from a list of sixty-four topics grouped into seven broad categories (figure 1). The APS selected these interests based on suggestions from researchers as well as categories common in the field of academic history. Upon signing in, a registered user sees a list of links (figure 2); each link leads to a full-page view of collection recommendations (figure 3). These recommendations follow the model, “You’re interested in X, other people interested in X looked at these collections.”

Request-Based Recommendations

Using the circulation data that Aeon collects, we are able to automate recommendations in PAL based on request information. Upon clicking a request link in a finding aid, the user is presented with a list of recommendations on the sidebar in Aeon (figure 4). Each link opens the finding aid for the collection listed.

Figure 1. List of interests a user sees when registering for the first time. A user can also revisit this list to modify their choices at any point by following links through the Aeon interface. The selected interests generate recommendations.

Figure 2. List of links appearing on the right-hand sidebar, based on interests that users select.

Figure 3. Recommended collections, based on interest, showing collection name (with a link to finding aid), call number, number of requests, and number of users who have requested from the collections. The user sees this list after clicking on an option from the sidebar, as shown in figure 2.

Figure 4. Request-based recommendation links appearing on the right-hand sidebar after a patron requests an item from a finding aid.

THE PROCESS

Currently, the data that drives these two functions is obtained from a semidynamic process via daily, automated SQL query exports. Usernames are employed to tie together requests and interests but are subsequently purged from the data before the results are presented to users and staff. This section explains the process in detail and presents code snippets where available. All code is available on GitHub.5

Interest-Based Recommendations

For interest-based recommendations, we employ two queries. The first query pulls every collection requested by a user for each topic for which that user has expressed an interest. The second aggregates the data for every user in the system. The following queries pull data, via a Microsoft Access intermediary, from the Microsoft SQL database that Aeon uses to store data. Because of the number of interest options in the registration form, and the character length of some of them (“Early America - Colonial History,” for example), we encode the interests in shortened form. “Early America - Colonial History” becomes “EA-ColHist” so as not to run into character limits in the database. This section explores each of these queries in more detail and provides example code.
The first query gathers research topics for all users who are not staff (user status is ‘Researcher’), and where at least one research topic is chosen (‘ResearchTopics’ is not null). The data is exported into an XML file that we call “aeonMssReq.”

    SELECT AeonData.dbo.Users.ResearchTopics,
           AeonData.dbo.Transactions.CallNumber,
           AeonData.dbo.Transactions.Location
    FROM AeonData.dbo.Transactions
    INNER JOIN AeonData.dbo.Users
        ON (AeonData.dbo.Users.UserName = AeonData.dbo.Transactions.Username)
        AND (AeonData.dbo.Transactions.Username = AeonData.dbo.Users.UserName)
    WHERE (((AeonData.dbo.Users.ResearchTopics) Is Not Null)
        AND ((AeonData.dbo.Transactions.CallNumber) Like 'mss%'
            Or (AeonData.dbo.Transactions.CallNumber) Like 'aps.%')
        AND ((AeonData.dbo.Users.Status)='Researcher'))
    FOR XML RAW ('aeonMssReq'), ROOT ('dataroot'), ELEMENTS;

The second query combines all data for all users and exports an XML file, “aeonMssUsers.”

    SELECT DISTINCT AeonData.dbo.Users.ResearchTopics,
           AeonData.dbo.Transactions.CallNumber,
           AeonData.dbo.Transactions.Location,
           AeonData.dbo.Transactions.Username
    FROM AeonData.dbo.Transactions
    INNER JOIN AeonData.dbo.Users
        ON (AeonData.dbo.Users.UserName = AeonData.dbo.Transactions.Username)
        AND (AeonData.dbo.Transactions.Username = AeonData.dbo.Users.UserName)
    WHERE (((AeonData.dbo.Users.ResearchTopics) Is Not Null)
        AND ((AeonData.dbo.Transactions.CallNumber) Like 'mss%'
            Or (AeonData.dbo.Transactions.CallNumber) Like 'aps.%')
        AND ((AeonData.dbo.Users.Status)='Researcher'))
    FOR XML RAW ('aeonMssUsers'), ROOT ('dataroot'), ELEMENTS;

Each query produces an XML file. These files are parsed using XSL stylesheets into subsets for each research interest. The stylesheets also generate counts of users requesting a collection and number of total requests for a collection by users sharing an interest.
An example is the following stylesheet for the topic “Early America - Colonial History,” which pulls from the XML file “aeonMssReq”:

[XSLT listing not reproduced; the code is available in the GitHub repository cited in note 5.]

This process is repeated for each interest. The data from the query that we modify with XSLT is presented as HTML that we insert into Aeon templates. This HTML includes the collection name (linked to finding aid), call number, number of requests, and number of users in a table. See figure 3 for how this appears to the user. The following shows how XSL is wrapped in HTML; only the text content of the listing is reproduced here.

    The collections most frequently requested from researchers who expressed an interest in [research topic] are listed below with links to each collection's finding aid and the number of times each collection has been requested.

    Collection | Call Number | # of Requests | # of Users
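
The per-interest counts that the stylesheets produce (total requests and distinct requesting users for each collection) can be illustrated with a short JavaScript sketch. This is not the authors' XSLT: it assumes hypothetical in-memory rows with `topics`, `callNumber`, and `username` fields, one row per request, and the function name `countsForInterest` and all sample values are invented for illustration.

```javascript
// Hypothetical rows, one per request. Field names and values are invented;
// they loosely mirror ResearchTopics, CallNumber, and Username in the exports.
const rows = [
  { topics: "EA-ColHis,EA-AmRev", callNumber: "mss.A", username: "u1" },
  { topics: "EA-ColHis,EA-AmRev", callNumber: "mss.A", username: "u1" },
  { topics: "EA-ColHis,EA-AmRev", callNumber: "mss.A", username: "u1" },
  { topics: "EA-ColHis", callNumber: "mss.B", username: "u2" },
  { topics: "EA-ColHis", callNumber: "mss.A", username: "u2" },
  { topics: "EA-AmRev", callNumber: "mss.B", username: "u3" },
];

// For one interest code, count total requests and distinct users per collection.
function countsForInterest(allRows, interest) {
  const byCollection = new Map();
  for (const row of allRows) {
    // Topics are stored as one comma-separated string per user.
    const topics = row.topics.split(",").map((t) => t.trim());
    if (!topics.includes(interest)) continue;
    if (!byCollection.has(row.callNumber)) {
      byCollection.set(row.callNumber, { requests: 0, users: new Set() });
    }
    const entry = byCollection.get(row.callNumber);
    entry.requests += 1;
    entry.users.add(row.username);
  }
  // Usernames are used only for counting and are dropped from the output.
  return [...byCollection.entries()]
    .map(([callNumber, e]) => ({
      callNumber,
      requests: e.requests,
      users: e.users.size,
    }))
    .sort((a, b) => b.requests - a.requests);
}

console.log(countsForInterest(rows, "EA-ColHis"));
```

Reporting both numbers matters because a single patron can generate many requests from one collection; the distinct-user count gives a second view of how widely a collection is used.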
To ensure a user only sees the links that match the interests they have selected, we use JavaScript to determine the expressed interests of the current user and display the corresponding links to the HTML pages in a sidebar. This approach works well, but we must account for two quirks. The first is that many interests in the database do not conform to the current list of options because many users predate our current registration form and wrote in free-form interests. The second is that Aeon stores the research information as an array rather than in a separate table, so we must account for the fact that the Aeon database contains an array of values that includes both controlled and uncontrolled vocabulary.

First, we set the array as a variable so we can look for a value that matches our controlled vocabulary and separate the array into individual values for manipulation:

    // Use var message to check for presence of controlled list of topics
    var message = "<#USER field='ResearchTopics'>";
    // Use var values to separate topics that are collected in one string
    var values = "<#USER field='ResearchTopics'>".split(",");

We also create variables to generate the HTML entries and links out when we have extracted our research topics:

    var open = ""

Next we set a conditional to determine if one of our controlled vocabulary terms appears in the array:

    // Determine if user has an interest topic from the controlled list
    if ((message.indexOf("EA-ColHis") > -1) ||
        (message.indexOf("EA-AmRev") > -1) ||
        (message.indexOf("EA-EarlyNat") > -1) ||
        (message.indexOf("EA-Antebellum") > -1) || …

If the array contains a value from our controlled vocabulary, we generate a link and translate our internal code back into a human-friendly research topic (“EA-ColHist,” for example, becomes once again “Early America - Colonial History”):

    for (var i = 0; i < values.length; ++i) {
        if (values[i]=="EA-ColHis"){
            document.getElementById("topic").innerHTML += (open + values[i] + middle + "Early America-Colonial History" + close);}
        else if (values[i]=="EA-AmRev"){
            document.getElementById("topic").innerHTML += (open + values[i] + middle + "Early America- American Revolution" + close);}
        else if (values[i]=="EA-EarlyNat"){
            document.getElementById("topic").innerHTML += (open + values[i] + middle + "Early America- Early National" + close);}
        else if (values[i]=="EA-Antebellum"){
            document.getElementById("topic").innerHTML += (open + values[i] + middle + "Early America- Antebellum" + close);}
        …

See figure 2 for how this appears to the user. Users only see the links that correspond to their stated interest.

If the array does not contain a value from our controlled vocabulary, we display the research-topic interests associated with the user account, note that we don’t currently have a recommendation, and provide a link to update the research topics for the account.

    else {
        // Only the text content of this message is reproduced here
        document.getElementById("notopic").innerHTML =
            "You expressed interest in: " +
            "<#USER field='ResearchTopics'> " +
            "We are unable to provide a specific collection recommendation for you. " +
            "Please visit our User Profile page to select from our list of research topics.";
    }

Request-Based Recommendations

In addition to interest-based recommendations, PAL supplies recommendations based on past requests a user has made. This section details how these recommendations are generated.

Aeon allows users to request materials directly from a finding aid (see figure 6). To generate our request-based recommendations we employ a query that returns the call number and username of every request in the system and export the results to an XML file called “aeonLikeCollections.”

    SELECT subquery.CallNumber,
           subquery.Username,
           IIf(Right(subquery.trimLocation,1)='.',
               Left(subquery.trimLocation,Len(subquery.trimLocation)-1),
               subquery.trimLocation) AS finallocation
    FROM (
        SELECT DISTINCT AeonData.dbo.Transactions.CallNumber,
               AeonData.dbo.Transactions.Username,
               IIf(CHARINDEX(':',[Location])>0,
                   Left([Location],CHARINDEX(':',[Location])-1),
                   [Location]) AS trimLocation
        FROM AeonData.dbo.Transactions
        INNER JOIN AeonData.dbo.Users
            ON (AeonData.dbo.Users.UserName = AeonData.dbo.Transactions.Username)
            AND (AeonData.dbo.Transactions.Username = AeonData.dbo.Users.UserName)
        WHERE (((AeonData.dbo.Transactions.CallNumber) Like 'mss%'
                Or (AeonData.dbo.Transactions.CallNumber) Like 'aps.%')
            AND ((AeonData.dbo.Transactions.Location) Is Not Null)
            AND ((AeonData.dbo.Users.Status)='Researcher'))
    ) subquery
    ORDER BY subquery.CallNumber
    FOR XML RAW ('aeonLikeCollections'), ROOT ('dataroot'), ELEMENTS;

We then process the “aeonLikeCollections” file through a series of XSLT stylesheets, creating lists of every other collection that every user of the current collection has requested. First the stylesheets remove collections that have only been requested once. Then we count the number of times each collection has been requested:

[XSLT listing not reproduced; the code is available in the GitHub repository cited in note 5.]

We sort on the collection name and username and then re-sort to combine groups of requested collections with users who have requested each collection.
We then create a new XML file that is organized by our collection groupings. The following snippet shows a populated XML file generated by the XSLT stylesheet above; only the element values are reproduced here.

    Mss.497.3.B63c
    Mss.497.3.B63c - American Council of Learned Societies
    …
    94

    Mss.Ms.Coll.200
    Mss.Ms.Coll.200 - Miscellaneous Manuscripts Collection
    …
    92

We use JavaScript to determine the call number of the user’s current request and display the list of other collections that users who have requested the current collection have also requested. See figure 4 for how these links appear to the user.

All of the exports and processing are handled automatically through a daily scheduled task. The only personally identifiable data contained in these processes is the username, which is used for counting purposes. Usernames are removed from the final products through the XSLT processing on an internal administrative server, are never stored in the Aeon web directory, and are never available for other library users or staff to see.

POTENTIAL PITFALLS AND WHAT TO DO ABOUT THEM

PAL allows us to see new things about our users, and we hope that our users are able to see new collections in the library. However, there are potential pitfalls to the way we’ve been working on this project. We’re calling the two biggest pitfalls the “bias toward well-described collections” and the “problem of aboutness.”

The Bias toward Well-Described Collections

The bias toward well-described collections is best understood by examining how the APS integrates Aeon into our finding aids. We offer request links at every available level of description: collection, series, folder, and item.
If a patron spends all day in our reading room and looks at the entirety of an item-level collection, they could have made between twenty and one hundred individual requests from that collection. For our statistics, each request will be counted as that collection being used. Figure 6 shows a collection described at the item level; each item can be individually requested, giving the impression that this collection is very heavily used even if it is only one patron doing all the requesting.

Figure 6. Finding aid of collection described at the item level. A patron making their way through this collection could make as many as one hundred individual requests.

For collections described at the collection level, however, the patron has only one link to click to see the entire collection. To PAL, it then looks as though that collection was used only once, as shown in figure 7. A patron sitting all day in our reading room looking at a collection with little description might use the collection more heavily than a patron clicking select items in a well-described collection. However, when we review the numbers, all we see is that the well-described collections get more clicks.

Figure 7. Screenshot of finding aid with only collection-level description. This collection has only one request link, the “Special Request” link at the top right. A patron looking through the entirety of this collection will only log a single request from the point of view of our statistics.

The Problem of Aboutness

When we speak of the problem of aboutness, we draw attention to the fact that manuscript collections can be about many different things. One researcher might come to a collection for one reason, another researcher for another reason.
A good example at the APS Library is the William Parker Foulke Papers.6 This collection contains approximately three thousand items and represents a wide variety of the interests of the eponymous Mr. Foulke. He discovered the first full dinosaur skeleton, promoted prison reform, worked toward abolition, and championed arctic exploration. A patron looking at this collection could be interested in any of these topics, or others. PAL, however, isn’t able to account for these nuances. If a researcher interested in prison reform requests items from the Foulke Papers, they’ll see the same suggestions as a researcher who came to the collection for arctic exploration.

What to Do about This

Identifying these pitfalls is a good first step to avoiding them, but it’s only a first step. There are technical solutions, and we’ll continue to explore them. For example, the bias toward well-described collections is mitigated by showing both the number of requests and the number of users who have requested from a collection (see figure 3). We hope that by presenting both numbers, we move a little toward overcoming this bias.

However, we’re also interested in the nontechnical approaches to these issues. As mentioned in the introduction, the APS relies heavily on traditional reference service, both remote and in-house. Nontechnical solutions acknowledge the shortcomings of any constructed solution and inject a healthy amount of humility into our work. Additionally, the subject guides, search tools, and web exhibitions all form an ecosystem of discovery and access to supplement PAL.

FUTURE STEPS

Using Data Outside of Aeon

We have begun exploring options for using the recommendation data outside of Aeon. One early prototype surfaces a link in our primary search interface. For example, searching for the William Parker Foulke Papers shows a link to what people who requested from this collection also looked at. See figures 8 and 9.

Generalizing for Other Repositories

There are many ways to integrate the use of Aeon with EAD finding aids. The systems that the APS has developed to collect data for automated recommendations take advantage of our infrastructure. We’d like for other repositories to be able to use PAL. It is our hope that an institution using Aeon in a different way will help us generalize this system.

Generalizing beyond Aeon

PAL is currently configured to pull data out of the Microsoft SQL database used by Aeon. However, all the manipulation is done outside of Aeon and is therefore generalizable to data collected in other ways. Because archives and special collections have long held statistics in different types of systems, we hope to be able to generalize beyond the Aeon use case if there is any interest in this from other repositories.

Integrating PAL into Aeon

Conversations with Atlas staff about PAL have been positive, and there is interest in building many of the features into future releases of Aeon. As of this writing, an open Uservoice forum topic is taking votes and comments about this integration.7

Figure 8. A link in the search returns that leads to recommendations based on finding aid search. Clicking on the link “PAL Recommendations: Patrons who used Henry Howard Houston, II Papers also used these collections” will open an HTML page with a list of links to finding aids.

Figure 9. HTML link of recommended finding aids based on search.

CONCLUSION

The APS is trying to add to the already robust options for users to find relevant manuscript collections. In addition to traditional reference, web exhibitions, and online search and browse tools, we have started reusing circulation data and self-identified user interests to automate recommendations. This new system fits within the ecosystem of tools we already supply.
This is a snapshot of where the PAL recommendation project is as of this writing, and we hope to work with other special collections libraries and archives to continue to grow the tool. If you are interested, we hope you reach out.

ENDNOTES

1 “Subject Guides and Bibliographies,” American Philosophical Society, accessed February 27, 2018, https://amphilsoc.org/library/guides; “Exhibitions,” American Philosophical Society, accessed February 27, 2018, https://amphilsoc.org/library/exhibit; “Galleries,” American Philosophical Society, accessed February 27, 2018, https://diglib.amphilsoc.org/galleries.

2 “Aeon,” Atlas Systems, accessed February 27, 2018, https://www.atlas-sys.com/aeon/.

3 Michael Mönnich and Marcus Spiering, “Adding Value to the Library Catalog by Implementing a Recommendation System,” D-Lib Magazine 14, no. 5/6 (2008), https://doi.org/10.1045/may2008-monnich.

4 Matthew Reidsma, “Algorithmic Bias in Library Discovery Systems,” Matthew Reidsma (blog), March 11, 2016, https://matthew.reidsrow.com/articles/173.

5 “AmericanPhilosophicalSociety/PAL,” American Philosophical Society, last modified September 11, 2017, https://github.com/AmericanPhilosophicalSociety/PAL.

6 “William Parker Foulke Papers, 1840–1865,” American Philosophical Society, accessed February 27, 2018, https://search.amphilsoc.org/collections/view?docId=ead/Mss.B.F826-ead.xml.

7 “Recommendation System to Suggest Items to Researchers Based on Users with the Same Research Topic,” Atlas Systems, accessed February 27, 2018, https://uservoice.atlas-sys.com/forums/568075-aeon-ideas/suggestions/18893335-recommendation-system-to-suggest-items-to-research.