Issues in Science and Technology Librarianship
Spring 2017
DOI: 10.5062/F4K0729W

Science and Technology Resources on the Internet

Text Mining

Kristen Cooper
Plant Sciences Librarian
University of Minnesota Libraries
University of Minnesota
Minneapolis, Minnesota
coope377@umn.edu

Table of Contents

Overview
Audience
Scope and Methods
Vocabulary
Introductory Resources
Sources of Text
  Library Databases
  Online Sources
Tools
  Web-Based
  Desktop
  Apps
  Programming
Visualization
References

Overview

As defined by Bernard Reilly (2012), president of the Center for Research Libraries, text mining is "the automated processing of large amounts of digital data or textual content for the purpose of information retrieval, extraction, interpretation, and analysis." The first step is to find or build a corpus, or the collection of text that a researcher wishes to work with. Most often researchers will need to download this corpus to their own computers or to an alternative storage platform. Once this has been done, different tools can be used to find patterns, biases, and other trends present in the text (Reilly 2012).

Within higher education, text mining is most often found in the digital humanities and linguistics; however, it is growing in popularity in the science and technology fields, and it is possible to find many examples of how it is beginning to be used in the sciences. Text mining allows users to search across a set of documents far too large to read individually and to find connections among them. One example comes from the biomedical sciences, where Frijters et al. (2010) used text mining to search MEDLINE for drugs that could interfere with cell proliferation. Another comes from the EXFOR library, which contains experimental nuclear reaction data: Hirdt and Brown (2016) used text mining to build a graph of the relationships between the reactions in the library and were then able to identify reactions that are important to researchers but have been understudied.

Text mining can also discover themes and relationships within a corpus through a technique called topic modeling. In a 2016 study, the authors used topic modeling to determine what proportion of the analyzed text discussed a specific phenomenon, in this case forest fragmentation, and to identify the concepts most strongly associated with that phenomenon (Nunez-Mir et al. 2016). In an example from environmental science, Grubert and Siders (2016) used topic modeling to find empirical support for the theory that climate change has become an increasingly important topic in environmental life cycle assessment over time, with the secondary finding that this increase has come at the expense of attention to human health.

Finally, the sheer amount of information available to researchers, educators, and scholars makes it increasingly difficult to stay current on a particular topic or field. Ann Okerson (2013) points out that text mining can be a useful, time-saving aid in conducting a systematic review. Text mining therefore presents librarians with the opportunity to develop skills in a new area that has the potential to be of great use to patrons.
Audience

This webliography is intended as an overview for those in higher education, such as academic librarians, researchers, educators, and graduate students, who are interested in text mining but have little to no experience with it.

Scope and Methods

Because text mining is flexible and can be tailored to fit the analysis needs of many different types of research, this work is not intended to be a comprehensive overview of text mining. Instead, it is a selective look at the information and tools appropriate for those new to text mining. The following sections are included: vocabulary to introduce unfamiliar terms, introductory resources, sources of text, and available tools.

Resources for the webliography were compiled from the author's previous research, communications with fellow librarians experienced in text mining, and the text mining library guides of the University of Southern California (2016), the University of California San Diego (2016), the University of Illinois at Urbana-Champaign (2016), Duke University (2016), and the University of Chicago (2016). To be included, resources had to meet the following criteria:

Available in English;
Available via library subscription or freely available;
Published, maintained, and/or hosted by a reputable source; and
Explanations and examples are clear and appropriate for beginners.

Vocabulary

Discussions of text mining and the tools used for it will most likely include terms that are unfamiliar to newcomers. The most common terms used in the resources this webliography draws on are given below to aid reader understanding. Where possible, the definitions were pulled from the selected resources; if not found there, definitions were taken from Wikipedia. A short script illustrating several of these terms follows the list.

API: Application programming interface, which can be used to access information in a machine-readable format. Several open access and commercial publishers and databases, such as Elsevier, have developed their own APIs for text mining.

Collocation: A listing of words that commonly appear near each other.

Concordance: The context of a given word or phrase.

Corpus: A collection of text.

CSV file: A plain text format in which data values are separated by commas. CSV files can be opened in spreadsheet programs such as Microsoft Excel.

Entity recognition: The identification of named entities such as people, places, or time periods.

Extensibility: A design principle that takes future growth of a program into consideration. A program is extensible if its functionality can be added to or modified.

Lemmatization: The reduction of the various forms of a word to its dictionary base; for example, counting runs, ran, and running as instances of run.

Natural language: Language used for daily human communication, such as English or Spanish, as opposed to the artificial languages of programming.

N-grams: Common two-, three-, or n-word phrases, which can be checked for frequency and context.

Parts of speech: Any of the groups into which words are divided depending on their use, such as verbs, nouns, and adjectives.

Stemming: The removal of affixes to reduce a word to its base or root form.

Stop words: The most common words of a language, such as the, a, and is. Many text mining tools remove these words when doing analysis.

Tokenization: Breaking text into individual words and punctuation marks (tokens).

Topic modeling: Looking for patterns in the words of a corpus in order to identify its topics and the relationships among them.

Word frequency: How often each word appears in the text.

XML: A markup language for encoding documents so that they are both human and machine readable.
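As a concrete illustration of several of the terms above (tokenization, stop words, stemming, word frequency, and n-grams), here is a minimal sketch using the NLTK library described later in this webliography. It assumes NLTK is installed and that the "punkt" and "stopwords" data packages have been downloaded; the sample sentence is invented for illustration.

```python
import nltk
from nltk import FreqDist, bigrams, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("punkt"); nltk.download("stopwords")  # one-time data downloads

text = ("Researchers ran the analysis twice; running it again "
        "confirmed that both runs produced the same results.")

tokens = word_tokenize(text.lower())            # tokenization
words = [t for t in tokens if t.isalpha()]      # drop punctuation tokens

stops = set(stopwords.words("english"))         # stop words
content_words = [w for w in words if w not in stops]

stemmer = PorterStemmer()                       # stemming: runs/running -> run
stems = [stemmer.stem(w) for w in content_words]

print(FreqDist(stems).most_common(5))           # word frequency
print(list(bigrams(content_words)))             # two-word n-grams (bigrams)
```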
Introductory Resources

The following resources provide an introduction to text mining. They include discussions of the different ways text mining can be used, overviews of the tools involved, and tutorials and lessons on tools such as the programming language Python and the topic modeling tool MALLET. Because text mining is currently most common in the digital humanities, most of these resources were not created by researchers in the sciences; they are included because they provide a solid introduction to the basics of text mining that can be applied across disciplines.

John Laudun
http://johnlaudun.org/
Dr. John Laudun is a faculty member in the Department of English at the University of Louisiana at Lafayette. His blog covers his interest in both the shape of stories and the tools he uses in his research. He has several posts on text mining and natural language processing, as well as links to the Python scripts he has developed in his research.

LearnPython.org
https://www.learnpython.org/
An online resource supported by DataCamp, this site provides a series of tutorials on how to use Python that build from basic to advanced skills. Each topic includes several interactive examples to aid understanding, and each ends with an exercise that can be checked for accuracy. The site also has a public Facebook group for questions about Python; the link can be found on the site's welcome page.

Natural Language Processing with Python
http://www.nltk.org/book/
Written by Dr. Steven Bird (University of Melbourne), Dr. Ewan Klein (University of Edinburgh), and Dr. Edward Loper (BBN Technologies), Natural Language Processing with Python is a freely available introductory text on using the Natural Language Toolkit (NLTK) with the programming language Python. Intended as a practical introduction, the book contains numerous examples and exercises.

The Programming Historian
http://programminghistorian.org/
This open access journal provides beginner-friendly, peer-reviewed tutorials on many different digital tools. The editorial board is made up of history and digital humanities professors and professionals from across the country. Examples of lesson topics include APIs and MALLET.

Sapping Attention
http://sappingattention.blogspot.com/
Sapping Attention is written by Ben Schmidt, an assistant professor of history at Northeastern University. Topics covered in the blog include digital humanities, text mining, and data visualization. Dr. Schmidt frequently discusses the visualization tool Bookworm and the different ways it can be used.

The Stone and the Shell
https://tedunderwood.com/
The Stone and the Shell is written by Ted Underwood, who teaches English literature at the University of Illinois at Urbana-Champaign. His blog focuses on his research into text analysis and the digital humanities. The entry titled "Where to start with text mining" is an excellent overview of text mining and its potential uses.

Text Mining, Analytics & More
http://text-analytics101.rxnlp.com/
Written by Dr. Kavita Ganesan, a senior data scientist at GitHub, this blog contains posts on text mining from basic through advanced topics. In addition to her own advice, Dr. Ganesan includes a list of tutorials and articles related to text mining as well as a list of text mining resources.
Sources of Text

Library Databases

Elsevier
https://www.elsevier.com/about/company-information/policies/text-and-data-mining
Elsevier allows text mining of library-subscribed content for non-commercial use. Its data and full text in XML format can be accessed through the ScienceDirect API, for which researchers can register on the Elsevier developer web site.

JSTOR Data for Research
http://dfr.jstor.org/
Data for Research is a free service from JSTOR, the digital library of academic articles and texts from the non-profit ITHAKA. Through JSTOR's interface, users can download metadata, word frequencies, citations, key terms, and n-grams for up to 1,000 documents. Researchers who have special requirements, or who would like to work with more than 1,000 documents, can contact support@ithaka.org for more information.

ProQuest Historical Newspapers
http://www.proquest.com/about/terms-and-conditions.html
ProQuest allows text mining of select newspapers in the historical newspapers collection within specific time ranges. Researchers interested in using these works should discuss it with their institution's library, as the library will need to mediate access according to its license agreement with ProQuest.

SpringerLink
http://www.springer.com/gp/rights-permissions/springer-s-text-and-data-mining-policy/29056
SpringerLink allows text mining of library-subscribed content for non-commercial purposes. While SpringerLink does have a metadata API, researchers are encouraged to download content directly from the SpringerLink platform.

Online Sources

Historical Online Sources

The following resources contain text that can be mined to include historical views on a topic. This allows researchers to see how a topic of interest has changed over time, to identify when it became popular in the literature, or to trace increases or decreases in its popularity over time.

The American Presidency Project
http://www.presidency.ucsb.edu/
The American Presidency Project is a database of presidential documents ranging from President Washington's inaugural address and State of the Union messages through President Obama's. The database also contains executive orders, proclamations, and many other documents of past American presidents. Users could text mine these resources to discover when a particular science or technology entered the national dialogue.

Chronicling America
http://chroniclingamerica.loc.gov/
Chronicling America is a database of American historical newspapers produced by the National Digital Newspaper Program. It includes the newspaper title directory, which contains information about papers published between 1690 and the present, as well as over 2,000 full-text-searchable digitized newspapers. Chronicling America has its own free-to-use API (a minimal request sketch appears after the HathiTrust entry below). Potential text mining projects could look for when science and technology topics began appearing in newspapers and how discussions of those topics have changed over time.

HathiTrust
https://www.hathitrust.org/data
HathiTrust is an online repository that began in 2008 and provides long-term preservation of, and access to, both public domain and in-copyright items from a variety of sources, including Google, the Internet Archive, and partner institutions. The public domain works of the collection are available for research purposes and can be downloaded in small sets with HathiTrust's data API, as the complete data set, or as a custom data set. Examples of resources that could be of interest to science researchers include the Proceedings of the Ocean Drilling Program and botany textbooks going back to 1840.
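Several of the sources in this section expose simple web APIs. As one illustration, here is a minimal sketch of a keyword search against the Chronicling America API mentioned above; the endpoint and parameter names follow the site's "About the API" documentation as the author understands it, and the search term is an arbitrary example.

```python
import requests

# Search digitized newspaper pages for a keyword and request JSON results.
url = "https://chroniclingamerica.loc.gov/search/pages/results/"
params = {"andtext": "radium", "format": "json", "rows": 5}

response = requests.get(url, params=params)
response.raise_for_status()
data = response.json()

print("Total matching pages:", data.get("totalItems"))
for item in data.get("items", []):
    # Each item describes one digitized newspaper page.
    print(item.get("date"), "-", item.get("title"))
```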
Project Gutenberg
http://www.gutenberg.org/wiki/Main_Page
Project Gutenberg was the first provider of free e-books. It contains over 50,000 out-of-copyright works covering a broad range of topics, with most dating from before 1923. Under book categories, users can find the science and technology bookshelves, which are further broken down by topic; example topics include botany and chemistry. There is also a periodicals bookshelf, which lists the different periodicals found at Project Gutenberg.

Current Online Sources

The following resources are for those focused on more current information.

arXiv
http://arxiv.org/help/bulk_data
arXiv is an open access repository of e-prints in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics. The database has its own API, which returns results in XML format. arXiv is owned and operated by Cornell University.

BioMed Central
http://old.biomedcentral.com/about/datamining
BioMed Central is an open access collection of peer-reviewed research articles with a focus on STM (science, technology, and medicine). Its license allows free distribution and re-use of both the full text articles and the highly structured XML versions.

Corpus of Contemporary American English (COCA)
http://corpus.byu.edu/coca/
This database is "the largest freely-available corpus of English, and the only large and balanced corpus of American English. It contains more than 520 million words from fiction, popular magazines, newspapers, academic texts and spoken word." Potential uses for this corpus include seeing how a particular science or technology topic is discussed in more popular media such as TV and radio. Additionally, users could explore how science and technology are translated for newspapers and popular magazines. Word and Phrase (see the Web-Based Tools section) is designed to analyze the works of COCA.

Internet Archive - eBooks and Texts
https://archive.org/details/texts
A non-profit online repository, founded in 1996, holding over 10,000,000 texts and books available for download. The collection can be browsed by topic/subject, examples of which include pollution, the Environmental Protection Agency, agriculture, and natural history. It can also be browsed by collection, such as the Biodiversity Heritage Library collection. Available download formats vary by item but can include plain text, ZIP, torrent, PDF, and EPUB.

New York Times Developer Network
http://developer.nytimes.com/
A set of APIs created for use with the New York Times; users must request an API key for each API. Examples include the Article Search API, which allows retrieval of headlines and lead paragraphs, and the Community API, which retrieves registered users' comments on New York Times articles.

Digital Public Library of America (DPLA) API
https://dp.la/info/developers/codex/
The Digital Public Library of America began in 2010 with funding from Harvard's Berkman Center for Internet & Society and the Alfred P. Sloan Foundation, and is based at the Boston Public Library. The DPLA API collects metadata on two types of resources, items and collections; items are single items indexed by the data provider, and collections are logical groups of items. Use of the API requires requesting an API key from the DPLA.
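To give a sense of what working with a keyed API of this kind looks like, here is a minimal sketch of a DPLA keyword search. It assumes the v2 "items" endpoint described in the DPLA developer codex; "YOUR_API_KEY" is a placeholder for a key requested from the DPLA, and the query term is an arbitrary example.

```python
import requests

# Keyword search against the DPLA items endpoint.
url = "https://api.dp.la/v2/items"
params = {"q": "soil conservation", "page_size": 5, "api_key": "YOUR_API_KEY"}

response = requests.get(url, params=params)
response.raise_for_status()
data = response.json()

print("Matching items:", data.get("count"))
for doc in data.get("docs", []):
    # sourceResource holds the descriptive metadata supplied by the provider.
    print(doc.get("sourceResource", {}).get("title"))
```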
Public Library of Science (PLOS)
https://www.plos.org/
PLOS is an open access publisher of scientific research with a focus on the biological sciences and medicine. Additionally, PLOS aggregates and curates related content from across its journals in the PLOS Collections. PLOS has two APIs: a search API that works with the terms of the PLOS search function, and an article-level metrics (ALM) API.

PubMed Central Open Access Subset
http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
PubMed Central is the free archive of biomedical and life sciences journal literature from the U.S. National Institutes of Health's National Library of Medicine. Started in 2000 with two journals, it now holds several million articles. While access to PMC is free, use of the material is still subject to copyright and licensing terms, which vary for each article.

Tools

Web-Based

This section contains tools that are online and free to use. They are less powerful than downloadable tools and for the most part cannot handle large corpora; they also do not require users to have programming experience. Users either copy and paste or upload selected texts in compatible formats and then run the program. These tools are helpful for users who wish to explore text mining or for simple projects. Functions well suited to web tools include word frequency and concordance.

BYU Google Books Viewer
http://googlebooks.byu.edu/x.asp
An interface that works with the Google Books corpus. Functions include frequencies, synonyms, collocates, and parts of speech. It also allows comparison of different periods of the corpus, such as the 1960s-2000s and the 1800s-1900s.

Netlytic
https://netlytic.org/home/
An open access tool focused on text and network analysis of social media data. It currently works with data from Twitter, Facebook, Instagram, YouTube, RSS feeds, and .csv or text files from Dropbox and Google Drive.

TAPoRware
http://taporware.ualberta.ca/~taporware/about.shtml
TAPoRware is a web-based text analysis tool developed with support from the Canada Foundation for Innovation and the McMaster University Faculty of Humanities. Supported file types include HTML, XML, and plain text. Functions include word frequency, concordance, collocation, and tokenization.

Voyant
http://voyant-tools.org/
Voyant is an open source, web-based tool for text mining and analysis. Features include word visualizations, frequency counts, and contexts for the most frequent terms of the document. Users can open existing corpora, upload files, or use the text box on the main page to type in URLs or paste in full text. Supported file types include plain text, HTML, XML, PDF, and MS Word.

Word and Phrase
http://www.wordandphrase.info/analyzeText.asp
This tool is designed to work either with a user's own texts, which can be copied and pasted into the software, or with the Corpus of Contemporary American English (COCA) (see Sources of Text - Current Online Sources). Initial results include a list of the words in the text and their frequencies. Additionally, selecting a word from the list will show concordance lines and definitions of the word, and give an option to search collocates. As mentioned in the COCA description, this tool could be used to discover how science and technology topics are covered in newspapers and popular magazines.

Desktop

Desktop programs allow users to work with larger corpora than online-only tools.
Most require users to have the files that make up their corpus saved on their computer or another drive. Once again, programming knowledge is not required for these tools, though some are operating-system specific. Functions well suited to desktop tools include word frequency, concordance, and in some cases collocation.

AntConc
http://www.laurenceanthony.net/software/antconc/
AntConc is a freeware text analysis and concordance toolkit. Features include a clustering/n-gram tool, a collocation tool, and word list tools. AntConc runs on Windows, Macintosh, and Linux, and its preferred file format is plain text.

AntFileConverter
http://www.laurenceanthony.net/software/antfileconverter/
AntFileConverter is a tool for converting PDF and Word documents to plain text for use in text mining.

TextSTAT
http://neon.niederlandistik.fu-berlin.de/en/textstat/
Hosted at the Free University of Berlin, TextSTAT is an analysis program that allows users to build a corpus and find word frequencies and concordances. Acceptable file formats include plain text, HTML (directly from the internet), MS Word, and OpenOffice files.

CasualConc
https://sites.google.com/site/casualconc/
CasualConc is a concordance program designed for Mac OS X 10.5 or later. Features include concordances, word clusters, collocation analysis, and word counts. The preferred file format is plain text, but it will also work with MS Word, PDF, and HTML. The developer notes that they have used CasualConc exclusively with English and would appreciate feedback from anyone who uses it with other languages.

Paper Machines
http://papermachines.org/
Paper Machines is an add-on to the citation management software Zotero; its features include topic modeling and word clouds.

Apps

Smartphone applications are also available for mobile text analysis.

Textal
http://textal.org/
Textal is free for iOS. It allows for the analysis of web sites, tweets, and documents. Features include a word cloud, word frequency, and collocation; for fun, it also includes word scores for the game Scrabble. The app requires association with a Twitter account to function properly.

Programming

For users without programming experience, this approach requires a larger time commitment to learn both a programming language and the specifics of using that language for text mining; the many scripts available online can be very helpful. Programming is, however, the most flexible way to text mine and allows users to tailor the analysis closely to their needs. It is the best way to use functions such as entity recognition, lemmatization, part-of-speech tagging, and topic modeling.

NLTK (Natural Language Toolkit) with Python
http://www.nltk.org/
NLTK is a platform for working with human language data in the programming language Python. Files should be plain text. It has several text processing libraries that can be used for classification, tokenization, lemmatization, and stemming of text. Functions for text mining include word count, word frequency, collocations, named entity identification, and more. The platform is available for Windows, Mac OS X, and Linux. Finally, NLTK is an open source project and can be downloaded for free.

MALLET
http://mallet.cs.umass.edu/index.php
MALLET is open source software for topic modeling and statistical natural language processing that runs on both Windows and Mac. It allows users to gain insight into the topic trends of a large corpus.
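MALLET itself is run from the command line, but the same idea can be illustrated in Python. The following is a minimal sketch of topic modeling using gensim, a separate Python library not covered above, on a tiny invented corpus; a real project would use many more documents and topics.

```python
from gensim import corpora, models

# Toy corpus: each document is already tokenized, lowercased, and stripped of stop words.
documents = [
    ["forest", "fragmentation", "habitat", "species", "loss"],
    ["habitat", "species", "forest", "conservation"],
    ["fragmentation", "landscape", "forest", "edge"],
    ["nuclear", "reaction", "neutron", "cross", "section"],
    ["neutron", "capture", "reaction", "isotope"],
    ["cross", "section", "isotope", "nuclear", "data"],
]

dictionary = corpora.Dictionary(documents)               # map words to integer ids
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

# Fit a two-topic LDA model; random_state makes the toy run repeatable.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=1)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```

With a corpus this small, the two topics should roughly separate the forest-ecology documents from the nuclear-reaction documents, which is the kind of thematic grouping topic modeling is used for at scale.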
R
https://www.r-project.org/
R is an open source programming language that can be used for data manipulation, calculation, and graphing. It is highly extensible and can be adapted to fit users' needs through packages such as the tm package for text mining. R runs on many platforms, including Linux, Windows, and Mac OS.

Visualization

Visualizations can range from simple word clouds to more complex graphs and charts. These tools are not for extensive analysis; they are more often used to show the results of the different text mining functions. Word clouds, for example, are a visualization of word frequency.

Wordle
http://www.wordle.net/
Wordle is a word cloud generating web site. Features include the ability to customize the fonts, layouts, and color schemes of the word cloud. Because Wordle uses the Java browser plug-in, it requires Java to be installed and the browser to be configured to use it.

Bookworm
http://bookworm.culturomics.org/
Bookworm is an interface tool that works with the Google Books corpus. It allows users to track changes in a word or phrase in the corpus over a selected period of time.

Infogr.am
https://infogr.am/
A web site for creating graphs and charts, Infogr.am has more than 30 chart types to choose from. Data can be entered directly into the web site, imported from sources such as cloud storage, Google Drive, and Dropbox, or uploaded as a spreadsheet.

Tagxedo
http://www.tagxedo.com/
Tagxedo is a web site for generating word clouds. Clouds can be created by uploading plain text files or by entering a URL, Twitter ID, RSS feed, or search term. Features include the ability to customize the font, colors, theme, and shape of the word cloud.
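Word clouds can also be generated with a few lines of code rather than a web site. Here is a minimal sketch using the third-party Python wordcloud package (not one of the tools listed above), which, like the sites above, simply draws words sized by their frequency; the input file name stands in for a researcher's own corpus.

```python
from wordcloud import WordCloud

# Any plain text will do; in practice this would be the text of a corpus.
text = open("corpus.txt", encoding="utf-8").read()

# Build a word cloud image in which word size reflects word frequency.
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("wordcloud.png")
```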
References

Duke University Libraries. 2016. Introduction to text analysis. [Internet]. Available from: http://guides.library.duke.edu/text_analysis

Frijters, R., van Vugt, M., Smeets, R., van Schaik, R., de Vlieg, J. & Alkema, W. 2010. Literature mining for the discovery of hidden connections between drugs, genes and diseases. PLOS Computational Biology. DOI: 10.1371/journal.pcbi.1000943

Grubert, E. & Siders, A. 2016. Benefits and applications of interdisciplinary digital tools for environmental meta-reviews and analyses. Environmental Research Letters 11(9). DOI: 10.1088/1748-9326/11/9/093001

Hirdt, J. & Brown, D. 2016. Identifying understudied nuclear reactions by text-mining the EXFOR experimental nuclear reaction library. Nuclear Data Sheets 131:377-399. DOI: 10.1016/j.nds.2015.12.008

Nunez-Mir, G.C., Iannone, B.V., Pijanowski, B.C., Kong, N., Fei, S. & Fitzjohn, R. 2016. Automated content analysis: addressing the big literature challenge in ecology and evolution. Methods in Ecology and Evolution 7(11):1262-1272. DOI: 10.1111/2041-210X.12602

Okerson, A. 2013. Text & data mining - a librarian overview. IFLA World Congress: 1-6. Available from: http://library.ifla.org/252/1/165-okerson-en.pdf

Reilly, B.F. 2012. CRL Reports. Charleston Advisor 14:75-76. DOI: 10.5260/chara.14.2.75

University of California, San Diego. 2016. Finding data & statistics: text mining. [Internet]. [Accessed 2016 June 10]. Available from: http://ucsd.libguides.com/data-statistics/textmining

University of Chicago. 2016. Text and data mining. [Internet]. [Accessed 2016 June 12]. Available from: http://guides.lib.uchicago.edu/textmining#s-lg-box-6075047

University of Illinois at Urbana-Champaign. 2016. Text mining tools. [Internet]. [Accessed 2016 June 12]. Available from: http://guides.library.illinois.edu/c.php?g=405110&p=2757860

University of Southern California. 2016. Text & data mining. [Internet]. [Accessed 2016 June 10]. Available from: http://libguides.usc.edu/textmining

This work is licensed under a Creative Commons Attribution 4.0 International License.