# htid2books

Given an access key, secret token, and a HathiTrust identifier, output plain text as well as PDF versions of a book.

## Synopsis

```
./bin/htid2txt.sh
./bin/htid2pdf.sh
./bin/htid2books.sh
./bin/collection2books.sh
```

## Introduction

The HathiTrust is a really cool and, in my humble opinion, under-utilized information resource. Nowhere else can one have so much free & full-text access to such a large collection of books, except maybe from the Internet Archive. If I remember correctly, the 'Trust was created in response to the venerable Google Books Project. Google digitized huge swaths of partner libraries' collections, and the results were amalgamated into a repository, primarily for the purposes of preservation. "An elephant never forgets." [1]

As the 'Trust matured, so did its services. It created a full-text catalog of holdings. It enabled students & scholars to create personal collections. It offered libraries the opportunity to deposit digitized materials into the centralized repository. It created the HathiTrust Research Center, which "enables computational analysis" of works in the collection. And along the way it implemented a few application programmer interfaces (APIs).

Given one of any number of identifiers, the Bibliographic API empowers a person to download metadata describing items in the collection. Given authentication credentials and one of any number of identifiers, the Data API empowers a person to download the items themselves, in plain (optical character recognition, OCR) text as well as in image formats. This system -- htid2books -- is intended to make it easy to download HathiTrust items. It outputs plain text files suitable for text mining as well as PDF files for printing & traditional reading.

## Requirements

To use htid2books you will first need to acquire authentication credentials. These include an access token and a "secret key", freely available from the 'Trust to just about anyone upon request. You will then need to download the code itself.

The code requires quite a bit of infrastructure. First of all, it is implemented as a set of six Bash and Perl scripts. People using Linux and Macintosh computers will have no problem here, but folks using Windows will need to install Bash and Perl. ("Sorry, really!") Second, in order to authenticate, a Perl library called OAuth::Lite is needed. This is most easily installed with some variation of the `cpan install OAuth::Lite` command. Lastly, in order to build a PDF file from sets of downloaded images, you will need a suite of software called ImageMagick. Installing ImageMagick is best done with some sort of package manager. People using Linux will run a variation of the `yum install imagemagick` or `apt-get install imagemagick` commands. People using Macintoshes might "brew" the installation -- `brew install imagemagick`.
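For the impatient, the whole setup can be done in one go. This is a minimal sketch assuming a Debian-flavored Linux; substitute yum or brew as appropriate for your system:

```bash
#!/usr/bin/env bash
# one-time setup, assuming a Debian-flavored Linux

# install ImageMagick, used to bind page images into a PDF
sudo apt-get install imagemagick

# install the Perl library used for authentication
sudo cpan OAuth::Lite
```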
## Usage

### ./bin/htid2txt.sh

To download a plain text version of a HathiTrust item, first change directories to the system's root and run `./bin/htid2txt.sh`. The script requires three inputs:

1. token - your access token
2. key - your secret key
3. identifier - a HathiTrust... identifier

For example:

```
./bin/htid2txt.sh 194dfe2bg3 xa5350f0c44548487778e942518a nyp.33433082524681
```

In this case, the script will do the tiniest bit of validation, repeatedly run a Perl script (htid2txt.pl) to get the OCR of an individual page, cache the result, and, when there are no more pages in the given book, concatenate the cache into a text file saved in the directory named ./books.

### ./bin/htid2pdf.sh

Similarly, to create a PDF version of a given HathiTrust item, run `./bin/htid2pdf.sh`. It requires four inputs:

1. token - your access token
2. key - your secret key
3. identifier - a HathiTrust... identifier
4. length - the number of page images to download

Like above, htid2pdf.sh will repeatedly run htid2pdf.pl, cache the image files, and, when done, concatenate them into a PDF file saved in the ./books directory. For example:

```
./bin/htid2pdf.sh 194dfe2bg3 xa5350f0c44548487778e942518a nyp.33433082524681 28
```

### ./bin/htid2books.sh

Finally, ./bin/htid2books.sh is one script to rule them all. Given a token, secret, and identifier, htid2books.sh will sequentially run htid2txt.sh and htid2pdf.sh.
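To make the mechanics concrete, here is a minimal sketch of the sort of loop htid2txt.sh implements. The calling convention of htid2txt.pl (token, key, identifier, page number) and the empty-result end-of-book signal are assumptions based on the description above, not the script's documented interface:

```bash
#!/usr/bin/env bash
# sketch of the htid2txt.sh loop; the htid2txt.pl calling convention
# (token, key, identifier, page number) is assumed, not documented

TOKEN=$1; KEY=$2; ID=$3
PAGE=1

# request pages one at a time; assume an empty result signals the end
while true; do
    TEXT=$( ./bin/htid2txt.pl "$TOKEN" "$KEY" "$ID" "$PAGE" )
    [ -z "$TEXT" ] && break

    # cache the page; zero-pad the number so the shell glob sorts correctly
    echo "$TEXT" > "./pages/$ID-$( printf '%05d' "$PAGE" ).txt"
    PAGE=$(( PAGE + 1 ))
done

# concatenate the cache into a single book
cat "./pages/$ID-"*.txt > "./books/$ID.txt"
```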
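And here is a similar sketch of the htid2pdf.sh workflow, where ImageMagick's convert command does the work of binding the cached page images into a single PDF. Again, the calling convention of htid2pdf.pl is an assumption:

```bash
#!/usr/bin/env bash
# sketch of the htid2pdf.sh workflow; htid2pdf.pl's calling convention
# is assumed, but the ImageMagick usage is standard

TOKEN=$1; KEY=$2; ID=$3; LENGTH=$4

# fetch the given number of page images, one at a time
for PAGE in $( seq 1 "$LENGTH" ); do
    ./bin/htid2pdf.pl "$TOKEN" "$KEY" "$ID" "$PAGE" \
        > "./tmp/$ID-$( printf '%05d' "$PAGE" ).jpg"
done

# bind the images into a single PDF
convert "./tmp/$ID-"*.jpg "./books/$ID.pdf"
```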
## Sample identifiers

Some interesting HathiTrust identifiers include:

- hvd.32044018865758 - Henry D. Thoreau : Emerson's obituary by Ralph Waldo Emerson (1904)
- mdp.39015024076484 - More nonsense, pictures, rhymes, botany, etc. by Edward Lear (1872)
- nyp.33433075812416 - The pearl of the Andes; a tale of love and adventure by Gustave Aimard ([186-?])
- nyp.33433082524681 - Three little kittens by R. M. Ballantyne (1859)
- uc1.b3322717 - The blessed damozel by Dante Gabriel Rossetti (1898)

## Advanced usage

Given a delimited text file, such as a HathiTrust collection file, it is more than possible to loop through the file, feed HathiTrust identifiers to htid2books.sh, and ultimately create a "library". A script named collection2books.sh is included just for this purpose. Usage:

```
./bin/collection2books.sh
```

A collection file named ./etc/collection.tsv -- four works by Charlotte Bronte -- can be used as sample input. A sketch of this sort of loop follows.
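The sketch below assumes the HathiTrust identifier lives in the first column of the collection file; that is just an assumption, so adjust the cut invocation to match your file:

```bash
#!/usr/bin/env bash
# sketch of a collection2books.sh-style loop; assumes the identifier
# is the first column of the tab-delimited collection file

TOKEN='194dfe2bg3'
KEY='xa5350f0c44548487778e942518a'

# skip the header row, extract the identifier column, process each item
tail -n +2 ./etc/collection.tsv | cut -f 1 | while read -r ID; do
    ./bin/htid2books.sh "$TOKEN" "$KEY" "$ID"
done
```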
## Discussion

Again, the HathiTrust is a wonderful resource. As a person who is employed to provide text mining services to the students, faculty, and staff of a university, I find the HathiTrust's collections a boon to scholarship.

The HathiTrust is a rich resource, but it is not easy to use at the medium scale. Reading & analyzing a few documents is easy. It is entirely possible to manually generate PDF files, download them, print them (gasp!), extract their underlying plain (OCR) text, and use both traditional as well as non-traditional methods (text mining) to read their content. At the other end of the scale, I might be able to count & tabulate all of the adjectives used in the 19th Century, or see when the word "ice cream" first appeared in the lexicon. On the other hand, I believe more realistic use cases exist: analyzing the complete works of Author X, comparing & contrasting Author X with Author Y, learning how the expression or perception of gender may have changed over time, determining whether or not there are themes associated with specific places, etc.

I imagine the following workflow:

1. create a HathiTrust collection
2. download the collection as a TSV file
3. use something like Excel, a database program, or OpenRefine to create subsets of the collection
4. programmatically download items' content & metadata
5. update the TSV file with the downloaded & gleaned information
6. do analysis against the result
7. share results & analysis

Creating the collection (#1) is easy: search the 'Trust, mark items of interest, and repeat until done (or tired). Downloading (#2) is trivial: mash the button. Creating subsets (#3) is easier than one might expect. Yes, there are many duplicates in a collection, but OpenRefine is really great at normalizing ("clustering") data, and once the data is normalized, duplicates can be removed confidently. In the end, a "refined" set of HathiTrust identifiers can be output.

Given a set of identifiers & APIs, it ought to be easy to programmatically download (#4) the many flavors of 'Trust items: PDF, OCR plain text, bibliographic metadata, and the cool JSON files with embedded part-of-speech analysis. This is the part which is giving me the most difficulty:

- slowness; download speeds of 1000 bytes/minute
- access control & authentication, which I sincerely understand & appreciate
- multiple data structures; for example, the bibliographic metadata is presented as a stream of JSON, and embedded in it is an escaped XML file, which, in turn, is the manifestation of a MARC bibliographic record. Yikes!

After the many flavors of content are downloaded, more interesting information can be gleaned: sentences, parts-of-speech, named entities, readability scores, sentiment measures, log-likelihood ratios, "topics" & other types of clusters, definitive characteristics of similarly classified documents, etc. In the end, the researcher would have created a rich & thorough dataset (#5). Through traditional reading as well as through statistics, the researcher can then do the analysis (#6) against the dataset(s) and PDF files.

Again, the HathiTrust is really cool, but getting content out of it is not easy. Then again, maybe I'm trying to use it when my use case is secondary to the 'Trust's primary purpose. After all, maybe the 'Trust is primarily about preservation. "An elephant never forgets."

## Caveats

There are a number of limitations to this system.

First of all, not all HathiTrust items are available for downloading. In general, anything dated before 1923 is fair game, but even then Google "owns" some of the items, and those are not accessible.

Second, HathiTrust identifiers often contain characters reserved by computer file systems, most notably the slash (/) and the period (.). By default, this system saves files using the HathiTrust identifiers as names, so these reserved characters may confuse your file system.

Third, the collection files from the HathiTrust and the collection files from the HathiTrust Research Center manifest different data structures. The files from the 'Trust are tab-delimited (TSV) files, and collection2books.sh is designed to use them as input. On the other hand, the collection files from the Center are comma-separated value (CSV) files. Moreover, the number of fields in the respective collection files differs. I suppose collection2books.sh could be hacked so either type of input were acceptable, but call me lazy. On the other hand, collection2books.sh is really intended to be a template for reading any number of delimited files and doing subsequent processing. [2]

Fourth, and most significantly, the system is not fast. This is not because htid2books is doing a lot of work; it is not. It is not because the network is slow, nor because of the volume of data being transferred. Instead, it is because content is distributed one page at a time; there does not seem to be any sort of bulk downloading option. Still, there are a couple of possible solutions. For example, maybe authentication needs to happen only once? If so, then the system could be refactored. Alternatively, multi-threading and/or parallel processing could be employed: download a single page for each processor on a computer. Such improvements are left as an exercise for the reader; one possible approach is sketched below.
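For what it is worth, here is one way the parallel processing might look, using xargs to run one download per core. It relies on the same assumed htid2txt.pl interface as the sketches above, and it is only a sketch, not a drop-in replacement:

```bash
#!/usr/bin/env bash
# hypothetical speed-up: download pages in parallel, one job per core;
# relies on the same assumed htid2txt.pl interface as the sketches above

TOKEN=$1; KEY=$2; ID=$3; LENGTH=$4
CORES=$( getconf _NPROCESSORS_ONLN )

# fetch and cache a single page
fetch() {
    ./bin/htid2txt.pl "$1" "$2" "$3" "$4" \
        > "./pages/$3-$( printf '%05d' "$4" ).txt"
}
export -f fetch

# run up to $CORES downloads concurrently
seq 1 "$LENGTH" | xargs -P "$CORES" -I {} bash -c 'fetch "$@"' _ "$TOKEN" "$KEY" "$ID" {}
```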
## Notes

[1] For more detail about the Google Books Project, see the Wikipedia article.

[2] Software is never done. If it were, then it would be called "hardware".

Eric Lease Morgan  
February 16, 2019