HathiTrust as a Data Source for Researching Early Nineteenth-Century Library Collections: Identification, Coverage, and Methods

Julia Bauder

INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2019

Julia Bauder (bauderj@grinnell.edu) is Associate Professor and Social Studies and Data Services Librarian, Grinnell College.

ABSTRACT

An intriguing new opportunity for research into the nineteenth-century history of print culture, libraries, and local communities is performing full-text analyses on the corpus of books held by a specific library or group of libraries. Creating corpora using books that are known to have been owned by a given library at a given point in time is potentially feasible because digitized records of the books in several hundred nineteenth-century library collections are available in the form of scanned book catalogs: a book or pamphlet listing all of the books available in a particular library. However, there are two potential problems with using those book catalogs to create corpora. First, it is not clear whether most or all of the books that were in these collections have been digitized. Second, the prospect of identifying the digital representations of the books listed in the catalogs is daunting, given the diversity of cataloging practices at the time. This article will report on progress towards developing an automated method to match entries in early nineteenth-century book catalogs with digitized versions of those books, and will also provide estimates of the fractions of the library holdings that have been digitized and made available in the Google Books/HathiTrust corpus.

INTRODUCTION

Digital libraries such as Google Books and HathiTrust have created tantalizing opportunities for research into the history of American culture: automated analyses of the entire corpus of books published at a given point in time. The attraction of this prospect is most clearly demonstrated by the avalanche of papers written using the Google Books Ngram data, which provides counts over time of the words and phrases used in the works that make up the Google Books corpus. As soon as this data became available in 2010, it was used to make arguments about social, linguistic, and other changes over time as reflected in changes in the words used in print.1 However, for nearly as long, other researchers have been cautioning that the Google Books corpus is not a representative sample of publishing output, let alone of what the public at large was actually reading in a given year, and that its unrepresentativeness makes it dangerous to draw sweeping conclusions from this data.2

One potentially feasible solution to the problem of unrepresentativeness in the Google Books corpus would be to use corpora based on the holdings of a specific library or a group of libraries.
Using library holdings to form corpora helps to remedy some known issues with using the Google Books corpus as an indicator of social change, such as the fact that many books did not become popular and/or widely available until well after their official publication date, and that some prolific authors who contributed hundreds of thousands of words to the Google Books corpus were never as widely purchased and read as authors who wrote a single, short, best-selling work.3 Using books held by a set of libraries at a given time as the corpus has its own problems of unrepresentativeness (particularly, for long-established libraries, the fact that the books on the shelf at a given time represent not only works of interest to current users but also those of interest to users from decades past). Even so, triangulating this data with that provided by the Google Books Ngram data would at least give some sense of whether and where these different corpora disagree.4

HATHITRUST AS A DATA SOURCE | BAUDER https://doi.org/10.6017/ital.v38i4.11251

Creating corpora using books that are known to have been owned by a given library at a given point in time is potentially feasible because digitized records of the books in several hundred nineteenth-century library collections are available in the form of scanned book catalogs: a book or pamphlet listing all of the books available in a particular library. However, there are two potential problems with using those book catalogs to create corpora. First, it is not clear whether most or all of the books that were in these collections have been digitized, incorporated into Google Books and HathiTrust, and hence made available for Ngram analyses. Second, the prospect of identifying the digital representations of the books listed in the catalogs is daunting, as neither widely agreed-upon cataloging standards nor universal identifiers were adopted until late in the nineteenth century.
This article will report on progress towards developing a fully automated method to match entries in early nineteenth-century book catalogs with digitized versions of those books, and will also provide estimates of the fractions of the library holdings that have been digitized and made available in the Google Books/HathiTrust corpus.

METHODS

Practical considerations dictated using data from HathiTrust rather than from Google Books for this research. The HathiTrust corpus, although not perfectly coextensive with the Google Books corpus, has very substantial overlap with it. The HathiTrust digital archive was founded in 2008, when a group of large academic libraries formed a collaboration to archive and disseminate their digitized books. The vast majority of those digitized books (around 95 percent, as of mid-2017) had originally been scanned as part of the Google Books project; the agreements that Google Books entered into with the libraries typically stipulated that Google had to provide the library with a digital copy of each book scanned from that library.5 It was necessary to use HathiTrust rather than Google Books as the comparison corpus because the metadata for the titles in HathiTrust is readily available in ways that the Google Books metadata is not, including as bulk MARC-data downloads.

The libraries included in this analysis are social libraries, which were a type of quasi-public library that predated the now-standard, tax-supported public library in the United States. These libraries were privately owned and operated, but were open to some large portion of the population of a particular area who were willing and able to pay a fee or buy a share to belong to the library.
Although the presence or absence of a book in social library collections is not a perfect indicator of the book’s popularity (most social libraries pointedly refused to purchase the “trashy” but widely read sensational fiction of the day), it is a defensible proxy, with the caveats noted above, for the popularity of the “serious” literature and nonfiction works that made up the bulk of these libraries’ collections.

Roughly one hundred social library book catalogs published between 1800 and 1860 can be found in HathiTrust.6 For the purposes of the present study, attention was focused on the thirteen library catalogs from ten different American libraries that were published between 1776 and 1825. (A list of these catalogs can be found in appendix A.) These catalogs were chosen because they are likely to present the worst-case scenario in terms of both of the challenges mentioned above: the highest percentage of rare and extremely old books, which Google’s partner libraries would have been least likely to permit to be scanned by Google, and, presumably, the most primitive and eclectic cataloging practices.

To the extent that it was possible to do so, this analysis focused on book-length monographs. When serials or pamphlets were listed in a separate section of the catalog, those catalog pages were excluded from the process by which entries were extracted from the catalogs and parsed into CSV files. Serials present particularly intractable matching problems: not only are the original catalogs often unclear about which specific volumes were held, but HathiTrust’s MARC data also does not always clearly indicate which volumes are available in HathiTrust. Pamphlets have limited coverage in HathiTrust.

The selected catalogs were downloaded from HathiTrust as PDFs, and the pdftotext software was used to extract the OCR data from the relevant pages of the scans as hOCR (a file format for OCR that includes information about where each word is located on the page in addition to the words themselves).7 Then cleaning scripts were created that parsed the hOCR data into CSV files for analysis, with one catalog entry per line of the CSV file.8 Given the widely varied cataloging practices of the early nineteenth century, several different cleaning scripts were written, each tailored to a particular catalog format. For example, many of the catalogs had entries that spanned multiple lines (see figures 1 and 2), so the scripts for those catalogs had to be able to identify when each new entry started. Many catalogs had extraneous information, such as the name of the donor of the book or the size of the book, that had to be filtered out (see figure 1; F, Q, O, and D refer to the size of the book: folio, quarto, octavo, or duodecimo). In addition, various forms of dittoes were frequently used in these catalogs (see figures 1, 2, and 3), so one of the tasks for the cleaning scripts was to identify the dittoes and replace them with the correct words from the previous entry.

Figure 1. Library Company of Philadelphia, A Catalogue of the Books Belonging to the Library Company of Philadelphia: To Which Is Prefixed, A Short Account of the Institution, with the Charter, Laws, and Regulations (Philadelphia, PA: Printed by Bartram & Reynolds, 1807), 5.

Figure 2. Library Company of Baltimore, A Catalogue of the Books, &c. Belonging to the Library Company of Baltimore: To Which Are Prefixed the Act for the Incorporation of the Company, Their Constitution, Their By-Laws, and an Alphabetical List of the Members (Baltimore, MD: printed by Eades and Leakin, 1809), 46.

Figure 3. Washington Library Company, Catalogue of Books in the Washington Library (Washington, DC: printed by Anderson and Meehan, 1822), 17.

Unfortunately, the horizontal-line dittoes seen in figures 1 and 2, a type of ditto that is used in seven of the thirteen catalogs, are represented inconsistently or not at all in the hOCR, so they cannot reliably be used to identify places where words need to be carried down from the previous entry. For the catalog of the Library Company of Philadelphia, from which figure 1 was taken, the numbers after the horizontal-line dittoes (which identify the books’ locations on the shelves) can be used to distinguish between a line that is indented because it is a continuation of the entry above and a line that is indented but is the start of a new entry. In theory, a cleaning script for the catalog of the Library Company of Baltimore (figure 2) could use a similar process to identify the last line of an entry by watching for the right-justified count of volumes at the end of each entry. When a right-justified digit was encountered, the script could then carry down the first word from that entry if the first word in the next entry was indented. However, these isolated digits were also not handled well by the OCR process: many do not appear in the hOCR file at all, and those that do are as likely to be OCRed as a colon, an exclamation point, a capital I, etc., as they are to be a digit. Hence, the three catalogs of the Library Company of Baltimore, which use this format and have this OCR issue, were not analyzed for this project.

Table 1. Results of verification

Column key:
A = Date founded if known, or incorporated ("Inc.") if not known (note 9)
B = Date catalog printed
C = Number of spreadsheet entries
D = Number of entries hand-verified
E = Hand-verified entries that cannot be positively identified
F = Hand-verified, positively identifiable entries that are not in HathiTrust
G = Positively identifiable entries successfully matched when work was in HathiTrust

Library | A | B | C | D | E | F | G
Library Company of Philadelphia | 1731 | 1807 | 7619 | 128 | 0% | 16.9% | 79.8%
Horsham Library Company | 1808 | 1810 | 143 | 143 | 28.4% | 5.1% | 79.8%
Salem (MA) Athenaeum | Inc. 1810 | 1811 | 1585 | 130 | 0.8% | 11.3% | 72.3%
New York Society Library | 1754 | 1813 | 4522 | 135 | 5.7% | 17.9% | 76.1%
Providence Library Company | 1753 | 1818 | 688 | 688 | 17.1% | 9.4% | 87.2%
Apprentices’ Library (New York, NY) (note 10) | 1820 | 1820 | 1811 | 124 | 34.4% | 15.0% | 69.7%
Washington (DC) Library Company | Inc. 1814 | 1822 | 900 | 124 | 12.9% | 3.2% | 83.7%
Boston Library | Inc. 1794 | 1824 | 2273 | 138 | 4.1% | 11.1% | 82.5%
Mercantile Library (New York, NY) | 1820 | 1825 | 1386 | 138 | 0% | 11.3% | 86.0%

The catalogs of the other nine libraries could all be parsed with an acceptable success rate and, with one exception, were included. The exception was the Salem Athenaeum’s 1818 catalog, which was identical in format and nearly identical in content to the Athenaeum’s 1811 catalog. Given the overwhelming similarity, it was decided to include only one of the catalogs; given that the goal of this analysis was to try to use the worst-case-scenario catalogs, the older of the two catalogs was chosen for inclusion.

Once the catalogs were parsed into CSV files, they were run through another script that attempted to match each entry in the catalog against metadata from HathiTrust.
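
Because the catalog text contains OCR and typesetting errors, this matching step has to compare words approximately rather than exactly. As a rough illustration only, the following Python sketch computes the edit distance between two words and treats them as a match when they differ by at most two single-character edits. The function names `levenshtein` and `words_match`, and the sample words, are hypothetical and are not taken from the project's code; the study itself relied on a search index for this comparison rather than on hand-written code like this.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # prev[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning i characters of a into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def words_match(catalog_word: str, record_word: str, max_edits: int = 2) -> bool:
    """Hypothetical helper: treat two words as the same if they are
    within max_edits edits of each other, ignoring case."""
    return levenshtein(catalog_word.lower(), record_word.lower()) <= max_edits


if __name__ == "__main__":
    # An OCR-damaged word still matches its intended form...
    print(words_match("Histoiy", "History"))
    # ...but short, unrelated words can also fall within the threshold.
    print(words_match("Rome", "Home"))
```

The second call shows the trade-off in a tolerant threshold: it recovers damaged words, but it also lets short, unrelated words match, which is one reason that very compressed catalog entries are hard to match reliably.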
In February 2019, MARC records containing metadata for 2,824,875 public-domain titles in HathiTrust were downloaded from HathiTrust via their OAI feed and ingested into a local Apache Solr index for searching and matching, using code from the SolrMarc and VuFind projects.11 Because of OCR errors in the catalog files and mistakes in the original catalogs, many of the words in the entries have one or more character-level errors. Therefore, Solr’s fuzzy searching option was used, which allows words to match as long as the Levenshtein distance between them is two or less. (The Levenshtein distance is the number of edits, such as changing one letter to another or adding or deleting a letter, it would take to turn one word into the other.)

No attempt was made to match specific editions; as can be seen from the excerpts in figures 2 and 3, many of the catalogs do not contain sufficient detail to do so, even if it were desirable. The goal was merely to determine whether the text of that work, from any edition, was contained in the HathiTrust corpus.

Once the catalogs had been checked against HathiTrust, a sample of the entries was hand-verified. For the two smallest catalogs, the Horsham Library Company and the Library Company of Providence, all entries were hand-verified. For the other catalogs, a random sample of approximately 130 items (+/- 10) was selected. Microsoft Excel’s random-number generator was used to assign each line in the CSV file a number between 0 and 1, and then the lowest 1.5 percent to 12.5 percent (depending on the number of items in the catalog) were examined.

RESULTS

Percentage of Works Included in HathiTrust

At least four of the books in every catalog examined were missing from HathiTrust. As can be seen in table 1, the fraction of books from the hand-verified sample that was missing from HathiTrust ranged from 3.2 percent for the Washington Library Company to just shy of 18 percent for the New York Society Library.
The Library Company of Philadelphia, at 16.9 percent missing, had the second-highest percentage of missing works. It is not surprising that these two libraries, as two of the oldest and most venerable libraries in the United States at the time, owned the most books that are not represented in HathiTrust, as both have a high percentage of very old and rare works. However, not all of the books from these collections that are not represented in HathiTrust fall into that category. Only six of the twenty missing works from the Library Company of Philadelphia sample, and no more than eight of twenty-two from the New York Society Library, were published before 1700, for example.12

Percentage of Works That Cannot Be Positively Identified

As can be seen in figures 1 through 3, some catalogs provided relatively full titles (figures 1 and 2), while others described the works in only two or three words each (figure 3). As might be expected, it is much easier to positively identify the works when fuller titles are provided, although two or three words proved to be enough to identify the work unambiguously the majority of the time. (All of the titles shown in figure 3 can be positively identified, for example.) In the samples taken from the nine catalogs, the percentage of titles that were unidentifiably ambiguous ranged from 0 percent (Library Company of Philadelphia, Mercantile Library of New York) to more than one in four (Apprentices’ Library of New York, 34.4 percent; Horsham Library Company, 27.9 percent). The Apprentices’ Library of New York and the Horsham Library Company were particularly problematic because they frequently omitted the name of the author, in addition to greatly compressing the title; without an author name, titles such as Modern Geography (Apprentices’ Library) and History of Rome (Horsham Library Company) present far too many potential matches.
However, even including the author’s name does not make all greatly compressed entries identifiable. One particularly egregious example comes from the Library Company of Providence’s 1818 catalog, which contains an entry reading “Bell’s Inquiry.” The list of candidates for this work includes A Practical Inquiry into the Authority, Nature, and Design of the Lord’s Supper, by William Bell; An Inquiry into the Causes Which Produce, and the Means of Preventing Diseases Among British Officers, Soldiers, and Others in the West Indies, by John Bell; and Inquiry into the Policy and Justice of the Prohibition of the Use of Grain in Distilleries, by Archibald Bell.

Figure 4. New York Society Library, A Catalogue of the Books Belonging to the New-York Society Library (New York: printed by C. S. Van Winkle, 1813), 139.

Success Rates for the Parsing and Matching Scripts

When there was a single, identifiable work that matched the catalog entry, and that work was in HathiTrust, the matching scripts identified it roughly 70 percent of the time or more for every individual catalog (69.7 to 87.2 percent; see table 1). Unsurprisingly, catalogs such as those of the Horsham Library Company and the Apprentices’ Library of New York that had entries that were difficult to positively identify were also more difficult for the script to properly match, although the matching script still succeeded between roughly 70 and 80 percent of the time.

For two other libraries with below-average matching results (the Library Company of Philadelphia and the New York Society Library), many of the matching problems were caused by issues with the scanned catalogs that the data-cleaning scripts did not handle well.
The New York Society Library catalog listed out the contents of multivolume sets in a way that was difficult for the cleaning script to identify and remove (see figure 4); instead, it was common for each volume of the set to end up with its own entry in the dataset. Since the HathiTrust records generally do not list out the contents of each volume, it was very rare for the matching script to correctly match a set based on an entry for one volume in the set. Twenty-seven percent (six out of twenty-two) of the missed matches from that sample failed because of this table-of-contents issue.

For the Library Company of Philadelphia, the problem lies with a quirk in the hOCR where the character heights for many of the horizontal-line dittoes are extremely high: around twenty pixels, when the text around those dittoes is typically around ten pixels high. It appears as if the OCR program may have treated each horizontal-line ditto as an em dash and assigned it a height that would be proportional for an em dash of that length. These extra-tall line heights for the first “word” on the line cause issues with the algorithm that processes the text line-by-line, causing some entries to be inappropriately divided across two entries in the data spreadsheets. Unsurprisingly, the matching script had difficulty identifying the correct work in HathiTrust when it was trying to match based on only half of the book’s title.

CONCLUSIONS

Although not a complete success, the results of this study provide hope that it might be possible to create full-text corpora based on the works in individual libraries with minimal manual labor, with a few caveats. The first caveat is that the digitized catalogs of those libraries must meet certain specifications:

1) The catalog is formatted, and has been OCRed, in such a way that it is consistently possible to parse the catalog line-by-line and to identify algorithmically where each entry starts and ends.
2) The catalog provides at least the authors’ last names, if not their full names, plus a more-or-less complete and accurate transcription of the title proper.
3) Either the catalog contains minimal extraneous information (such as tables of contents or donors’ names), or the extraneous information is consistently formatted in a way that it can be algorithmically identified and removed.

The second caveat is that even if all of these conditions are met, the full-text corpora that can be created will probably still be missing some small percentage of the books available in that library. One potential direction for future research could be more closely examining the books that are absent from HathiTrust to see if there are any commonalities among them that might bias research done using these corpora, or if the missing works can safely be treated as random omissions.

On the other hand, as was noted above, the catalogs used in this study represent a likely worst-case scenario for being able to positively identify the works listed in the catalogs and for those works being present in HathiTrust. Another promising avenue for future research would be to repeat this analysis on catalogs from the mid-to-late nineteenth century to see if the works in those catalogs are in fact more likely to exist in the HathiTrust corpus.

APPENDIX A: AMERICAN LIBRARY CATALOGS FROM 1776 TO 1825 INCLUDED IN HATHITRUST

Boston Library, Catalogue of Books in the Boston Library, June, 1824. Boston: Munroe and Francis, 1824, http://hdl.handle.net/2027/hvd.32044080249337.

General Society of Mechanics and Tradesman of the City of New York, Catalogue of the Apprentices’ Library, Instituted by the Society of Mechanics and Tradesman of the City of New-York, on the 25th November, 1820: With the Names of the Donors: To Which Is Added, an Address Delivered on the Opening of the Institution by Thomas R. Mercein, a Member of the Society. New York: printed by William A. Mercein, no. 93 Gold-Street, 1820, http://hdl.handle.net/2027/nnc2.ark:/13960/t8md1cv2t.

Horsham Library Company, The Constitution, Bye-Laws, and Catalogue of Books, of the Horsham Library Company. Philadelphia, PA: J. Rakestraw, 1810, http://hdl.handle.net/2027/nnc1.cu55910696.

Library Company of Baltimore, A Catalogue of the Books, &c. Belonging to the Library Company of Baltimore: To Which Are Prefixed the Act for the Incorporation of the Company, Their Constitution, Their By-Laws, and an Alphabetical List of the Members. Baltimore, MD: printed by Eades and Leakin, 1809, http://hdl.handle.net/2027/nyp.33433069263907.

Library Company of Baltimore, A Supplement to the Catalogue of Books, &c. Belonging to the Library Company of Baltimore. Baltimore, MD: printed by J. Robinson, 1816, http://hdl.handle.net/2027/nyp.33433069263899.

Library Company of Baltimore, A Supplement to the Catalogue of Books, &c. Belonging to the Library Company of Baltimore. Baltimore, MD: printed by J. Robinson, 1823, http://hdl.handle.net/2027/nyp.33433069263899.

Library Company of Philadelphia, A Catalogue of the Books Belonging to the Library Company of Philadelphia: To Which Is Prefixed, A Short Account of the Institution, with the Charter, Laws, and Regulations. Philadelphia, PA: Printed by Bartram & Reynolds, 1807, http://hdl.handle.net/2027/nyp.33433075914816.

Mercantile Library Association of the City of New York, Catalogue of the Books Belonging to the Mercantile Library Association of the City of New-York: To Which Are Prefixed, the Constitution and the Rules and Regulations of the Same. New York: printed by Hopkins & Morris, 1825, http://hdl.handle.net/2027/nyp.33433057517090.

New York Society Library, A Catalogue of the Books Belonging to the New-York Society Library. New York: printed by C. S. Van Winkle, 1813, http://hdl.handle.net/2027/mdp.39015023478822.

Providence Library Company, Charter and By Laws of the Providence Library Company, and a Catalogue of the Books of the Library. Providence, RI: printed by Miller and Hutchens, 1818, http://hdl.handle.net/2027/nyp.33433059555346.

Salem Athenaeum, Catalogue of the Books Belonging to the Salem Athenæum, with the By-Laws and Regulations. Salem, MA: Printed by Thomas C. Cushing, 1811, http://hdl.handle.net/2027/hvd.32044080252174.

Salem Athenaeum, Catalogue of the Books Belonging to the Salem Athenæum, with the By-Laws and Regulations. Salem, MA: Printed by W. Palfray, 1818, http://hdl.handle.net/2027/hvd.32044080252174.

Washington Library Company, Catalogue of Books in the Washington Library, July 20, 1822. Washington, DC: printed by Anderson and Meehan, 1822, http://hdl.handle.net/2027/chi.098498263.

REFERENCES

1 See, e.g., Jean-Baptiste Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331, no. 6014 (January 14, 2011): 176-82, https://doi.org/10.1126/science.1199644; Jean M. Twenge, W. Keith Campbell, and Brittany Gentile, “Male and Female Pronoun Use in U.S. Books Reflects Women’s Status, 1900-2008,” Sex Roles 67, nos. 9-10 (November 2012): 488-93, https://doi.org/10.1007/BF00287963; Patricia M. Greenfield, “The Changing Psychology of Culture from 1800 through 2000,” Psychological Science 24, no. 9 (September 2013): 1722-31, https://doi.org/10.1177/0956797613479387.

2 Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds, “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-cultural and Linguistic Evolution,” PLOS One 10, no. 10 (October 7, 2015): e0137041, https://doi.org/10.1371/journal.pone.0137041; Alexander Koplenig, “The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII,” Digital Scholarship in the Humanities 32, no. 1 (April 2017): 169-88, https://doi.org/10.1093/llc/fqv037.

3 Pechenick et al., 2015; Lindsay DiCuirci, Colonial Revivals: The Nineteenth-Century Lives of Early American Books (Philadelphia: University of Pennsylvania Press, 2019).

4 Robert A. Gross, “Reconstructing Early American Libraries: Concord, Massachusetts, 1795-1850,” Proceedings of the American Antiquarian Society 97, no. 1 (January 1, 1987): 331-451.

5 Jennifer Howard, “What Ever Happened to Google’s Effort to Scan Millions of University Library Books?,” EdSurge, August 10, 2017, https://www.edsurge.com/news/2017-08-10-what-happened-to-google-s-effort-to-scan-millions-of-university-library-books.

6 Book catalogs fell out of favor in the latter half of the nineteenth century as library collections became larger and more dynamic, making book catalogs much more difficult and expensive to compile and to keep up to date. By the end of the nineteenth century, book catalogs had largely been replaced by the card catalog system that remained in use through most of the twentieth century. Although card catalogs were far superior for their primary purposes (maintaining an inventory of books presently owned by the library and allowing library users to locate the books that they wanted), they leave no permanent record of the books listed in the catalog at any particular point in time.

7 Information about pdftotext can be found at https://manpages.debian.org/testing/poppler-utils/pdftotext.1.en.html.

8 The cleaning scripts, as well as data and other code used in this project, are available at https://github.com/julia-bauder/library-catalog-analysis-public.

9 The founding and incorporation dates were taken from the prefatory texts in the book catalogs used in this analysis, as listed in appendix A.

10 The scan of this catalog that is available from HathiTrust is missing pages 3-6.

11 Apache Solr is a widely used, open-source search platform. SolrMarc is a utility that can be used to index MARC records into Solr. VuFind is an open-source library discovery layer built in part on Solr and SolrMarc. For more information, see http://lucene.apache.org/solr/, https://github.com/solrmarc/solrmarc, and https://vufind.org/vufind/, respectively. The HathiTrust OAI feed is available at https://www.hathitrust.org/oai.

12 Five of the missing works from the New York Society Library sample were undated in the catalog.