ericleasemorgan/reader: Distant Reader, a tool for using & understanding a corpus (GPL-2.0 License)

# Distant Reader CORD

The Distant Reader CORD is a high performance computing (HPC) system which: 1) takes an almost arbitrary amount of unstructured data (text) as input and outputs a set of structured data for analysis, and 2) does this work against a specific data set called CORD-19. (Reader CORD is based on a different software suite called Distant Reader Classic, which is designed for more generic sets of input.)

To do this work, the Distant Reader CORD first caches the data set. Second, it transforms the content into a set of plain text files. Third, the Reader does text mining and natural language processing against the text files for the purpose of feature extraction: n-grams, parts-of-speech, named-entities, etc. The result of this process is a set of tab-delimited text files. The whole of the tab-delimited text files is then distilled into a relational database. A set of tabular and narrative reports is then generated against the database. The cache, transformed plain text files, tab-delimited files, relational database, and reports are then compressed into a single (zip) file, and returned to the... reader. [1]

The returned file is affectionately called a "study carrel". The student, researcher, or scholar is intended to peruse the study carrel for the purpose of supplementing the more traditional reading process.
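
The flow described above (plain text, then extracted features, then tab-delimited rows, then a relational database, then a report) can be sketched in a few lines. This is only an illustration under stated assumptions, not the Reader's actual code: it uses the Python standard library in place of the natural language processing tools the Reader relies on, it extracts only bigrams, and the function names, the `ngrams` table, and its columns are all hypothetical.

```python
import csv
import io
import re
import sqlite3
from collections import Counter


def to_tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bigrams(tokens):
    """Count adjacent token pairs (2-grams)."""
    return Counter(zip(tokens, tokens[1:]))


def to_tsv(doc_id, counts):
    """Serialize the counts as tab-delimited rows: id, ngram, frequency."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for (first, second), frequency in sorted(counts.items()):
        writer.writerow([doc_id, f"{first} {second}", frequency])
    return buffer.getvalue()


def distill(tsv, connection):
    """Load tab-delimited rows into a (hypothetical) relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS ngrams (id TEXT, ngram TEXT, frequency INTEGER)"
    )
    rows = csv.reader(io.StringIO(tsv), delimiter="\t")
    connection.executemany("INSERT INTO ngrams VALUES (?, ?, ?)", rows)
    connection.commit()


def report(connection, limit=3):
    """Return the most frequent n-grams across everything loaded so far."""
    query = (
        "SELECT ngram, SUM(frequency) AS total FROM ngrams "
        "GROUP BY ngram ORDER BY total DESC, ngram ASC LIMIT ?"
    )
    return connection.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    # A made-up sentence pair standing in for one transformed plain text file.
    text = "The reader reads the corpus. The reader reports on the corpus."
    connection = sqlite3.connect(":memory:")
    distill(to_tsv("doc-1", bigrams(to_tokens(text))), connection)
    for ngram, total in report(connection):
        print(f"{ngram}\t{total}")
```

Keeping the tab-delimited step separate from the database step mirrors the description above: the intermediate files remain usable on their own, and the database is simply a distillation of them.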

For more detail, links of possible interest include:

* home page - https://cord.distantreader.org
* fledgling study carrels - https://cord.distantreader.org/carrels/
* Guide to the code - GUIDE.md
* blog postings - http://sites.nd.edu/emorgan/category/distant-reader/
* Slack channel - http://bit.ly/distantreader-on-slack
* Twitter feed - http://twitter.com/readerdistant

As an HPC system, the Distant Reader CORD is not a single computer program but instead a suite of software comprised of many individual scripts and applications. Personally, I see the scripts and applications as akin to a collection of poems used to make the output of human expression more cogent. Really. Seriously.

As a collection of scripts and applications, the Distant Reader has only been built by "standing on the shoulders of giants". Cited here in no particular order, and not necessarily completely, they include the following and more:

* the Perl-based LWP modules - this software is a significant part of the harvesting process
* Wget - an absolutely wonderful Internet spidering application
* Tika - a Java-based library which transforms just about any file into plain text
* spaCy - a Python module which simplifies natural language processing operations
* Gensim - another Python module for natural language processing
* Textacy - a Python module building on the good work of spaCy
* SQLite - a cross-platform, SQL-compliant relational database library/application
* OpenStack - a tool for building virtual machines
* Slurm - a tool for instantiating a cluster of computer nodes and what runs on them
* Airavata - a Web-based suite of software used to monitor computing jobs on a cluster
* Other Python Libraries - sqlalchemy, pandas, itertools, wordcloud, scipy, sklearn, networkx, textatistic, nltk
* Other Perl Modules - DBI, JSON, Archive::Zip, WebService::Solr, XML::XPath, CGI, File::Basename, File::Copy, HTML::Entities, HTML::Escape
* JavaScript Libraries - bootstrap, jquery
* Other Programs - csvstack

If you have any questions, then please don't hesitate to ask.

"Happy reading!"

[1] Just like GNU, the Distant Reader's definition is rather recursive.

Eric Lease Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame
574/631-8604

Created: June 28, 2018
Updated: May 31, 2020

# cord-19

This suite of software will prepare a data set called "CORD-19" for processing with the Distant Reader.

CORD-19 is a set of more than 50,000 full text scholarly journal articles surrounding the topic of COVID-19. Each "article" is really a JSON file containing (very) rudimentary bibliographic information, a set of paragraphs, and bibliographic citations. As a pre-processing step for the Distant Reader, the suite processes the CORD-19 metadata and its associated JSON files.

To get this software to work for you, `pip install -r requirements.txt`, configure `./bin/cache.sh`, and then run `./bin/build.sh`. The system will then:

1. download a zip file and its associated metadata file
2. uncompress the zip file
3. move all the JSON files to a single directory
4. initialize a database
5. pour the metadata into the database
6. output a simple narrative report summarizing the content of the metadata file

Depending on the network connection, the build process typically takes less than 7 minutes.

The next steps are the creation of two scripts:

1. Given an SQL SELECT statement, return a list of keys, and use them to initialize a Distant Reader study carrel
2. Given a JSON file, output a more human-readable version of the same

Wish us luck.

Eric Lease Morgan
May 14, 2020
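
The first of the two planned scripts, "given an SQL SELECT statement, return a list of keys", might look something like the sketch below. It is a guess at the shape of such a script, not the project's implementation: the `documents` table, its columns, the sample rows, and both function names are hypothetical, and the real metadata database may be organized quite differently.

```python
import sqlite3


def create_database(connection, records):
    """Initialize a tiny stand-in for the metadata database and fill it.

    The table name and columns are assumptions made for this example only.
    """
    connection.execute(
        "CREATE TABLE documents (document_id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    connection.executemany("INSERT INTO documents VALUES (?, ?, ?)", records)
    connection.commit()


def select_keys(connection, statement, parameters=()):
    """Run a SELECT statement and return the first column of each row as keys."""
    if not statement.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are accepted")
    return [row[0] for row in connection.execute(statement, parameters)]


if __name__ == "__main__":
    # Made-up rows; they do not come from the CORD-19 metadata file.
    sample = [
        ("key-001", "An example article about transmission", 2020),
        ("key-002", "An example article about vaccines", 2020),
        ("key-003", "An older example article", 2015),
    ]
    connection = sqlite3.connect(":memory:")
    create_database(connection, sample)
    keys = select_keys(
        connection,
        "SELECT document_id FROM documents WHERE year = ? ORDER BY document_id",
        (2020,),
    )
    # Each key could then be used to locate a JSON file for a study carrel.
    for key in keys:
        print(key)
```

Passing values as parameters, rather than pasting them into the statement, keeps the query safe when the selection criteria come from user input.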