Tiny list of part-of-speech taggers

Posted on November 21, 2013 by Eric Lease Morgan

This is a tiny list of part-of-speech (POS) taggers, where taggers are tools used to denote which words in a sentence are nouns, verbs, adjectives, etc. Once parts-of-speech are denoted, a reader can begin to analyze a text on a dimension beyond the simple tabulating of words. The list is posted here for my own edification, and just in case it can be useful to someone else in the future:

CLAWS – From the website, “Part-of-speech (POS) tagging, also called grammatical tagging, is the commonest form of corpus annotation, and was the first form of annotation to be developed by UCREL at Lancaster. Our POS tagging software for English text, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to POS tag c.100 million words of the British National Corpus (BNC).” After obtaining a licence, the reader can download CLAWS and use it accordingly. The site is also interesting because it includes a simple Web interface allowing the reader to supply a text and have it tagged. The input is limited to 100,000 words.
Junk Tagger – This tool’s underlying model is documented in “A Maximum Entropy Tagger with Unsupervised Hidden Markov Models” by Jun’ichi Kazama, Yusuke Miyao and Jun’ichi Tsujii. It comes with a number of pre-compiled Linux binaries and quite a number of Ruby scripts. I’m sure it runs, but the instructions were terse and my experience with Ruby is less than limited.
Linguistic Tools – Here you will find a simple Web-based interface allowing you to parse and tag pasted text or text at the other end of a URL. The output is good for demonstration purposes, but not necessarily for computing against. The service is really intended to be used through its SOAP interface.
OpenNLP – This tool set seems to be up-and-coming. Based on Java, there is a command line interface, but I believe the input needs to be a list of tokenized lines (sentences), and the output resembles the output of the Stanford tool — a sentence with POS tags appended to tokens. The command line interface is intended for demonstration purposes.
Stanford Log-linear Part-Of-Speech Tagger – Written in Java, this too seems to be an increasingly popular tagger. The output from the command line interface reads like sentences with a POS tag appended to the end of each token.
TreeTagger – TreeTagger seems to be the granddaddy of POS tools. It is available for many operating systems and many languages. Read the installation instructions carefully because they matter. I like TreeTagger the best because there is a Perl module, Lingua::TreeTagger, that goes hand-in-hand with it. There are a few different output styles from the command line interface: XML-ish, a table, etc. To make my life easier, I wrote a Perl script called pos-summarize.pl. Its purpose is to tabulate TreeTagger’s tabled output, listing the number of times different parts-of-speech occurred. This way it is relatively easy to see whether there is a preponderance of adjectives, gender-specific nouns, etc.
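
The tabulation idea can be sketched in a few lines. This is not pos-summarize.pl itself (its source is not reproduced here) but a minimal Python sketch, assuming the tagger's tabled output has one token per line with tab-separated token, tag, and lemma columns. The function name summarize_pos and the sample data are made up for illustration.

```python
from collections import Counter

def summarize_pos(tagged_text):
    """Count how often each POS tag occurs in tab-delimited tagger output."""
    counts = Counter()
    for line in tagged_text.splitlines():
        fields = line.split("\t")
        # Skip blank lines and lines without a tag column (e.g., markup lines)
        if len(fields) < 2 or not fields[1]:
            continue
        counts[fields[1]] += 1
    return counts

# Hypothetical sample in the assumed token<TAB>tag<TAB>lemma layout
sample = (
    "The\tDT\tthe\n"
    "quick\tJJ\tquick\n"
    "brown\tJJ\tbrown\n"
    "fox\tNN\tfox\n"
    "jumps\tVBZ\tjump\n"
    ".\tSENT\t.\n"
)

# List the tags from most to least frequent
for tag, count in summarize_pos(sample).most_common():
    print(f"{tag}\t{count}")
```

Sorting by frequency makes a preponderance of any one part-of-speech easy to spot at a glance.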

Quick And Dirty Website Analysis

Posted on November 12, 2013 by Eric Lease Morgan

This posting describes a quick & dirty way to begin doing website content analysis.

A student here at Notre Dame wants to use computing and text mining to analyze a set of websites. After a bit of discussion and investigation, I came up with the following recipe:

create a list of websites to analyze
save the list of website URLs in a text file
feed the text file to wget and mirror the sites locally [1]
for each website, strip the HTML to create sets of plain text files [2]
for each website, concatenate the resulting text files into a single file [3]
feed each concatenated text file to any number of text analysis tools [4, 5]

This is just a beginning, but it is something one can do on their own and without special programming. Besides, it is elegant.

[1] mirror – wget -r -i urls.txt
[2] convert – find . -name "*.html" -exec textutil -convert txt {} \;
[3] concatenate – find . -name "*.txt" -exec cat {} >> site-01.log \;
[4] tool #1 (Voyant Tools) – http://voyant-tools.org
[5] set of tools – http://taporware.ualberta.ca/~taporware/textTools/
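
For readers without textutil (it ships with Mac OS X), steps [2] and [3] can be approximated with nothing but the Python standard library. The following is a minimal sketch, not the recipe above: the names TextExtractor, html_to_text, and concatenate_site are hypothetical, and it assumes the mirrored pages are UTF-8 files ending in .html.

```python
from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

def concatenate_site(site_dir, out_file):
    """Write the plain text of every .html file under site_dir into one file."""
    pages = sorted(Path(site_dir).rglob("*.html"))
    with open(out_file, "w", encoding="utf-8") as out:
        for page in pages:
            html = page.read_text(encoding="utf-8", errors="replace")
            out.write(html_to_text(html) + "\n")
    return len(pages)

# Example call; the directory name is a placeholder for a mirrored site
# concatenate_site("www.example.org", "site-01.log")
```

The resulting file can then be fed to the text analysis tools in the same way as the output of the find commands.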

Beth Plale, Yiming Sun, and the HathiTrust Research Center

Posted on May 21, 2013 by Eric Lease Morgan

Beth, Matt, & Yiming

Beth Plale and Yiming Sun, both from the HathiTrust Research Center, came to Notre Dame on Tuesday (May 7) to give the digital humanities group an update of some of the things happening at the Center. This posting documents some of my take-aways.

As you may or may not know, the HathiTrust Research Center is a part of the HathiTrust. And in the words of Plale, “the purpose of the Center is to develop the cyberinfrastructure of the Trust as well as to provide cutting edge software applied against the Trust’s content.” The latter was of greatest interest to the presentation’s twenty or so participants.

The Trust is a collection of close to 11 million digitized books. Because close to 70% of these books are not in the public domain, any (digital humanities) computing must be “non-consumptive” in nature. What does this mean? It means it must not be possible to reassemble a book’s original form from the results of any computing process. (It is interesting to compare & contrast the definition of non-consumptive research with the “non-expressive” research of Matt Sag.) What types of research/analysis does this leave? According to Plale, there are a number of different things including but not necessarily limited to: classification, statistical analysis, network graphing, trend tracking, and maybe even information retrieval (search). Again, according to Plale, “We are looking to research that can be fed back into the system, perhaps to enhance the metadata, correct the OCR, remove duplicate items, date items according to when they were written, or possibly do gender detection… We want the Trust to be a community resource.”

After describing the goals behind the Center, Sun demonstrated some of the functionality of the Center’s interactive interface — their portal:

1. Ideally, log in, but this is not always necessary.
2. Create or choose a “workset” (a collection of documents) by searching the Trust with a simple faceted interface. Alternatively, a person can select any one of the existing worksets.
3. Choose an algorithm to apply against a workset. Many of the algorithms have been created using Meandre and output things like tag clouds and named entities. There is presently an algorithm to download all the metadata (MARCXML records) of a set.
4. Download the results of Step #3 to further one’s own analysis.
5. Go to Step #2.

Interaction with the Center in this manner is very much like interaction with JSTOR’s Data For Research. Search content. Run job. Download results. Do further analysis. See a blog posting called JSTOR Tool as an example.

Unlike JSTOR’s Data For Research site, the HathiTrust Research Center has future plans to allow investigators to upload a virtual machine image to the Center. Researchers will then be able to run their own applications through a shell on the virtual machine. (No, hackers will not be enabled to download copyrighted materials through such an interface.) This type of interaction with the Center is a perfect example of moving an application to the data instead of the other way around.

Sun also highlighted the Center’s wiki where there is documentation describing the query and data APIs. The query API is based on Solr allowing you to search the Trust. The data API provides a means for downloading metadata and associated content.

As the presentation was winding down I thought of a number of ways the underlying metadata and reading experience could be improved through a series of relatively easy applications. They include:

add the length of a book in terms of words allowing a person to search for a “short” book
add one or more “readability” scores allowing a person to search for “easy” books
allow search results to be plotted graphically (visualized) using readability and lengths
compute a short list of statistically significant words for each book to supplement “aboutness”
enhance the reader’s experience by supplementing it with concordances, frequency tables, word clouds, etc.
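
The first two suggestions are cheap to compute. As a rough illustration only, the sketch below counts words and calculates the Automated Readability Index (ARI), which is just one of many readability scores and not one named in the presentation. The function names are hypothetical, and the sentence splitting is deliberately naive.

```python
import re

def text_stats(text):
    """Return (word count, sentence count, letter-and-digit count) for a text."""
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)
    # Naive sentence split on runs of ., !, and ?; abbreviations will fool it
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(ch.isalnum() for word in words for ch in word)
    return len(words), len(sentences), chars

def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43"""
    words, sentences, chars = text_stats(text)
    if words == 0 or sentences == 0:
        return 0.0
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

sample = "The cat sat. The dog ran away quickly!"
length, _, _ = text_stats(sample)
print(length, round(automated_readability_index(sample), 2))
```

Stored alongside each record, the word count and score could serve as the facets for finding “short” or “easy” books.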

This was the last of our sponsored digital humanities presentations for the academic year. Matthew Wilkens and I sincerely appreciate the time and effort Beth Plale and Yiming Sun spent in coming to visit us. It was very interesting to learn about and discuss ways the content of HathiTrust can be used to expand our knowledge of the human condition. “Thank you, Beth and Yiming! Fun with the digital humanities”.

JSTOR Tool: A Programmatic sketch

Posted on May 17, 2013 by Eric Lease Morgan

JSTOR Tool is a “programmatic sketch”, a simple and rudimentary investigation of what might be done with datasets dumped from JSTOR’s Data For Research.

More specifically, a search was done against JSTOR for English language articles dealing with Thoreau, Emerson, Hawthorne, Whitman, and transcendentalism. A dataset of citations, n-grams, frequently used words, and statistically significant key words was then downloaded. A Perl script was used to list the articles, provide access to them, and visualize some of their characteristics. These visualizations include wordclouds, a timeline, and a concordance.

Why do this? Because we suffer from information overload and computers provide a way to read things from a “distance”. Indexes and search engines are great, but no matter how sophisticated your query, the search results are going to be large. Given a corpus of materials, computers can be used to evaluate, analyze, and measure content in ways that are not feasible for humans. This page begins to illustrate how a cosmos can be created from an apparent chaos of content — it is a demonstration of how librarianship can go beyond ~~find & get~~ and move towards use & understand.

Give JSTOR Tool a whirl, and tell me how you think the data from JSTOR could be exploited for use & understanding.

Matt Sag and copyright

Posted on April 29, 2013 by Eric Lease Morgan

Eric, Matt, and Matt

Matt Sag (Loyola University Chicago) came to visit Notre Dame on Friday, April 12 (2013). His talk was on copyright and the digital humanities. In his words, “I will explain how practices such as text mining present a fundamental challenge to our understanding of copyright law and what this means for scholars in the digital humanities.”

The presentation was well-attended, and here are a few of my personal take-aways:

Sag enumerated a number of technologies for presenting media (photographs, phonographs, radios, photocopiers, televisions, tape recorders, etc.), and then he said, “Just about all new technologies required a re-thinking of the ideas of copyright…” This is/was interesting because I imagined how the copyright laws have changed along with the advent of new devices, but…
A popular phrase used to describe the way digital humanists investigate content is “non-consumptive”, meaning the results do not really use up the resource. Sag prefers a different phrase — “non-expressive”.
He went on to say, “…but copyright did not really change.” Furthermore, “The ‘non-expressive’ use of content is not really copyrightable. Or is it?”
To answer his own question, Sag does not believe processes like text mining violate copyright because the results are generated automatically — created by machines. The results are algorithmically determined and are not dissimilar to the way Internet search engines work. Copyright claims against search engines have not stood up in court. Maybe it could be put this way. Text mining is an automated process similar to Internet search engine indexing. Internet search engine indexing has not been determined to be in violation of copyright. Therefore text mining is not in violation of copyright either. (A equals B. B is not C. Therefore A is not C either.)

Okay. So this particular mini-travelogue may not be one of my greatest, but Sag was a good speaker, and a greater number of people than usual came up to me after the event expressing their appreciation to hear him share his ideas. Matt Sag, thank you!

Copyright And The Digital Humanities

Posted on April 10, 2013 by Eric Lease Morgan

This Friday (April 12) the Notre Dame Digital Humanities group will be sponsoring a lunchtime presentation by Matthew Sag called Copyright And The Digital Humanities:

I will explain how practices such as text mining present a fundamental challenge to our understanding of copyright law and what this means for scholars in the digital humanities.

Matthew Sag is a faculty member at the law school at Loyola University Chicago. [1, 2] If you would like to attend, then please drop me (Eric Lease Morgan <emorgan@nd.edu>) a note so I can better plan. Free food.

Who: Matthew Sag
What: A talk and discussion on copyright and humanities research
Where: LaFortune Gold Room (3rd floor)
When: Friday, April 12, 11:45 am – 1:00 pm

[1] Sag’s personal Web page
[2] Sag’s professional Web page

Digital humanities and the liberal arts

Posted on March 5, 2013 by Eric Lease Morgan

Galileo demonstrates the telescope

The abundance of freely available full text combined with ubiquitous desktop and cloud computing provides a means to inquire on the human condition in ways not possible previously. Such an environment offers a huge number of opportunities for libraries and liberal arts colleges.

Much of the knowledge created by humankind is manifest in the arts: literature, music, sculpture, architecture, painting, etc. Once the things of the arts are digitized it is possible to analyze them in the same way physical scientists analyze the natural world. This analysis almost always takes the shape of measurement. Earthquakes have a measurable magnitude and geographic location. Atomic elements have a measurable charge and behave in predictable ways. With the use of computers, Picasso’s paintings can be characterized by color, and Shakespeare’s plays can be classified according to genre. The arts can be analyzed similarly, but this type of analysis is in no way a predeterminer of truth or meaning. The results are only measurements and observations.

Libraries and other cultural heritage institutions (the homes of many artistic artifacts) can play a central role in the application of the digital humanities. None of it happens without digitization. This is the first step. The next step is the amalgamation and assimilation of basic digital humanities tools so they can be used by students, instructors, and researchers for learning, teaching, and scholarship. This means libraries and cultural heritage institutions will need to go beyond basic services like find and get; they will want to move to other things such as annotate, visualize, compare & contrast, etc.

This proposed presentation elaborates on the ideas outlined above, and demonstrates some of them through the following investigations:

Digital humanities simply applies computing techniques to the liberal arts. Its use is similar to the use of the magnifying glass by Galileo. Instead of turning it down to count the number of fibers in a cloth (or to write an email message), it is being turned up to gaze at the stars (or to analyze the human condition). What he finds there is not so much truth as new ways to observe. Digital humanities computing techniques hold similar promises for students, instructors, and scholars of the liberal arts.

Introduction to text mining

Posted on March 4, 2013 by Eric Lease Morgan

Starry Night by Van Gogh

Text mining is a process for analyzing textual information. It can be used to find both patterns and anomalies in a corpus of one or more documents. Sometimes this process is called “distant reading”. It is very important to understand that this process is akin to a measuring device and should not be used to make value judgements regarding a corpus. Computers excel at counting (measuring), which is why they are used in this context. Value judgements (evaluations) are best done by humans.

Text mining starts with counting words. Feed a computer program a text document. The program parses the document into words (tokens), and a simple length of the document can be determined. Relatively speaking, is the document a long work or a short work? After the words are counted, they can be tabulated to determine frequencies. One set of words occurs more frequently than others. Some words only occur once. Documents with a relatively high number of unique words are usually considered more difficult to read.
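
To make the counting concrete, here is a minimal Python sketch of the steps just described: tokenize, measure length, tabulate frequencies, and list the words occurring only once. The function names and the sample sentence are made up for illustration, and the tokenizer is deliberately simple (lowercase ASCII letters with an optional straight apostrophe).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def profile(text, top=5):
    """Return basic counts: length, unique words, frequencies, one-time words."""
    tokens = tokenize(text)
    freq = Counter(tokens)
    return {
        "length": len(tokens),
        "unique": len(freq),
        # Share of unique words; comparable only between texts of similar length
        "type_token_ratio": len(freq) / len(tokens) if tokens else 0.0,
        "most_common": freq.most_common(top),
        "hapaxes": sorted(word for word, count in freq.items() if count == 1),
    }

for key, value in profile("The cat and the hat and the bat.").items():
    print(key, value)
```

The ratio of unique words to total words is one rough way to compare the vocabulary of two documents of similar size.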

The positions of words in a document can also be calculated. Where in a document are particular words used? Towards the beginning? The middle? The end? In the form of a histogram, plotting frequencies of words against relative positions can highlight the introduction, discussion, and conclusion of themes. Plotting multiple words on the same histogram, whether they be synonyms or antonyms, may literally illustrate ways they are used in conjunction. Or not? If a single word holds some meaning, then do pairs of words hold twice as much meaning? The answer is, “Sometimes”. Phrases (n-grams) are easy to count and tabulate once the positions of words are determined, and since meaning is often not determined solely in single words but multi-word phrases, n-grams are interesting to observe.
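
Both ideas, positions and phrases, fall out of a list of tokens. The following is a small Python sketch under the assumption that the text has already been split into tokens. The function names are hypothetical, and the histogram is simply a list of counts, one per equal-width slice of the document.

```python
from collections import Counter

def position_histogram(tokens, word, bins=10):
    """Count occurrences of word in each of `bins` equal slices of the text."""
    counts = [0] * bins
    for index, token in enumerate(tokens):
        if token == word:
            # Map the token's relative position onto a bin number
            counts[min(index * bins // len(tokens), bins - 1)] += 1
    return counts

def ngrams(tokens, n=2):
    """Tabulate every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "a b a b c a b".split()
print(position_histogram(tokens, "a", bins=7))
print(ngrams(tokens, 2).most_common(3))
```

Calling position_histogram once per word and plotting the lists side by side gives the multi-word comparison described above.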

Each human language adheres to a set of rules and conventions. If it didn’t, then nobody would be able to understand anybody else. A written language has syntax and semantics. Such rules in the English language include: all sentences start with a capital letter and end with a defined set of punctuation marks. Proper nouns begin with capital letters. Gerunds end in “ing”, and adverbs end in “ly”. Furthermore, we know certain words carry gender connotations or connotations regarding singularity or plurality. Given these rules (which are not necessarily hard and fast) it is possible to write computer programs to do some analysis. This is called natural language processing. Is this book more or less male or female? Are there many people in the book? Where does it take place? Over what time periods? Is the text full of action verbs or are things rather passive? What parts-of-speech predominate in the text or corpus?
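
As a toy illustration of rule-based analysis, the sketch below applies two of the rules above (“ing” and “ly” endings) and counts a handful of gendered pronouns. It is a crude heuristic and not real natural language processing: the suffix rules over-match (“ring”, “family”), the pronoun lists are my own assumption, and a real tagger should be preferred for serious work.

```python
import re

MALE = {"he", "him", "his", "himself"}        # assumed, incomplete list
FEMALE = {"she", "her", "hers", "herself"}    # assumed, incomplete list

def crude_features(text):
    """Count tokens matching a few hand-written surface rules."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {
        "tokens": len(tokens),
        "ing_endings": sum(token.endswith("ing") for token in tokens),
        "ly_endings": sum(token.endswith("ly") for token in tokens),
        "male_pronouns": sum(token in MALE for token in tokens),
        "female_pronouns": sum(token in FEMALE for token in tokens),
    }

print(crude_features("She was running quickly. He gave her the ring, sadly."))
```

Note that “ring” is counted among the “ing” endings, which is exactly why such rules are not hard and fast.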

All of the examples from the preceding paragraphs describe the beginnings of text mining in the digital humanities. There are many Web-based applications allowing one to do some of this analysis, and there are many others that are not Web-based, but there are few, if any, doing everything the scholar will want to do. That is the definition of scholarship. Correct? Most digital humanities investigations will require team efforts — the combined skills of many different people: domain experts, computer programmers, graphic designers, etc.

The following links point to directories of digital humanities tools. Browse the content of the links to get an idea of what sorts of things can be done relatively quickly and easily:

The following is a link to a particular digital humanities tool. Also included are a few links making up a tiny corpus. Use the tool to do some evaluation against the texts. What sort of observations are you able to discern using the tool? Based on those observations, what else might you want to discover? Are you able to make any valid judgments about the texts or about the corpus as a whole?

Use some of your own links (build your own corpus) to do some analysis from your own domain. What new things did you learn? What things did you know previously that were brought to light quickly? Would a novice in your domain be able to see these things as quickly as you?

Text mining is a perfect blend between the humanities and the sciences. It epitomizes a bridge between the two cultures of C. P. Snow. [1] Science does not explain. Instead it merely observes, describes, and predicts. Moreover, it does this in a way that can be verified and repeated by others. Through the use of a computer, text mining offers the same observation processes to the humanist. In the end text mining — and other digital humanities endeavors — can provide an additional means for accomplishing the goals of the humanities scholar — to describe, predict, and ultimately understand the human condition.

The digital humanities simply apply computing techniques to the liberal arts. Their use is similar to the use of the magnifying glass by Galileo. Instead of turning it down to count the number of fibers in a cloth (or to write an email message), it is being turned up to gaze at the stars (or to analyze the human condition). What he finds there is not so much truth as new ways to observe.

[1] Snow, C. P., 1963. The two cultures ; and, A second look. New York: New American Library.

Genderizing names

Posted on January 29, 2013 by Eric Lease Morgan

I was wondering what percentage of subscribers to the Code4Lib mailing list were male and female, and consequently I wrote a hack. This posting describes it — the hack that is, genderizing names.

I own/moderate a mailing list called Code4Lib. The purpose of the list is to provide a forum for the discussion of computers in libraries. It started out as a place to discuss computer programming, but it has evolved into a community surrounding the use of computers in libraries in general. I am also interested in digital humanities computing techniques, and I got to wondering whether or not I could figure out the degree to which the list is populated by men and women. To answer this question, I:

extracted a list of all subscribers
removed everything from the list except the names
changed the case of all the letters to lower case
parsed out the first word of the name and assumed it was a given name
tabulated (counted) the number of times that name occurred in the list
queried a Web Service called Gendered Names to determine… gender
tabulated the results
output the tabulated genders
output the tabulated names
used the tabulated genders to create a pie chart
used the tabulated names to create a word cloud
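
The counting steps in this list can be sketched as follows. This is not names.pl. It is a minimal Python illustration in which a small hard-coded dictionary stands in for the Gendered Names Web Service, whose interface is not described here. The subscriber names, the lookup table, and the function names are all invented.

```python
from collections import Counter

def given_names(subscribers):
    """Lowercase each entry and count its first word as the given name."""
    counts = Counter()
    for entry in subscribers:
        words = entry.lower().split()
        counts[words[0] if words else "(no name)"] += 1
    return counts

def tabulate_genders(name_counts, lookup):
    """Sum name counts by gender; names missing from lookup are 'unknown'."""
    genders = Counter()
    for name, count in name_counts.items():
        genders[lookup.get(name, "unknown")] += count
    return genders

def percentages(counts):
    """Convert raw counts into whole-number percentages of the total."""
    total = sum(counts.values())
    if not total:
        return {}
    return {key: round(100 * value / total) for key, value in counts.items()}

# Invented data; a real run would query a name-to-gender service instead
subscribers = ["Alice Smith", "alice jones", "Bob Brown", "", "Zxq Unknown"]
lookup = {"alice": "female", "bob": "male"}

names = given_names(subscribers)
print(names.most_common())
print(percentages(tabulate_genders(names, lookup)))
```

The two tabulations map directly onto the last steps: the gender percentages feed a pie chart and the name counts feed a word cloud.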

In my opinion, the results were not conclusive. About a third of the names are “ungenderizable” because no name was supplied by a mailing list subscriber or the Gendered Names service was not able to determine gender. That aside, most of the genderized names are male (41%) and just over a quarter (26%) of the names are female. See the chart:

To illustrate how the names are represented in the subscriber base, I also created a word cloud. The cloud does not include the “no named” people, the unknown genders, nor the names where there was only one occurrence. (The latter have been removed to protect the innocent.) Here is the word cloud:

While I do not feel comfortable giving away the original raw data, I am able to make available the script used to do these calculations as well as the script’s output:

names.pl – Perl script that does the tabulations
data.txt – the aggregated results (output) of the script

What did I learn? My understanding of the power of counting was reinforced. I learned about a Web Service called Gendered Names. (“Thank you, Misty De Meo!”). And I learned a bit about the make-up of the Code4Lib mailing list, but not much.

Visualization and GIS

Posted on December 19, 2012 by Eric Lease Morgan

The latest “digital humanities” lunch presentations were on the topics of visualization and GIS.

Kristina Davis (Center for Research Computing) gave our lunchtime crowd a tour of online resources for visualization. “Visualization is about transforming data into visual representations in order to analyze and understand… It is about seeing the invisible, and it facilitates communication… Visualization has changed from a luxury to a necessity because of the volume of available information… And the process requires the skills of many people, at least: content specialists, graphic designers, computer programmers, and cognitive psychologists… The latter is important because different people and cultures bring along different perceptions and assumptions. Red and green may mean stop and go to one group of people, but they mean something completely different to a different group of people. Color choice is not random.” Davis then proceeded to use her website as well as others to show examples of visualization:

A stacked diagram from Information Zoo

“Visualization is not eye-candy but rather a way to convey information.”

Rumana Reaz Arifin (Center for Research Computing) then shared with the audience an overview of geographic information systems (GIS) for the humanities. She described it as a process of mapping data with reference to some spatial location. It therefore consists of two parts: a spatial part denoting place, and attributes (information about the spatial features). Oftentimes the spatial parts are in vector (shapefiles) or raster formats (images), and the attributes are contained in text files or database application files. The two file types are joined together to illustrate characteristics of a place using latitudes and longitudes, addresses, or relational keys. One of the things I found most interesting was the problem of projection. In other words, considering the fact that the Earth is round, maps of large areas of the world need to be bent in order to compensate for the flat surface of paper or a computer screen. Arifin then gave an overview of various GIS applications, both commercial (ArcGIS, Mapinfo, etc.) and open source (GRASS, GeoDA, etc.), as well as some of the functionality of each. Finally, she demonstrated some real world GIS solutions.

Hospitals around Dallas (Texas)

“GIS is not just mapping, [the] map is the end product of analysis.”
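
The projection problem mentioned above can be made concrete with a little arithmetic. The sketch below implements the spherical Mercator projection, chosen only because it is well known and not because it was featured in the presentation. The function name is hypothetical, and the radius is the WGS 84 equatorial value.

```python
import math

EARTH_RADIUS_M = 6378137.0  # equatorial radius in meters (WGS 84)

def mercator(lat_deg, lon_deg, radius=EARTH_RADIUS_M):
    """Project latitude/longitude in degrees to flat x, y in meters."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * lon
    # Not defined at the poles (latitude of +90 or -90 degrees)
    y = radius * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y

# One degree of latitude takes up more of the map the farther it is from the equator
near_equator = mercator(1, 0)[1] - mercator(0, 0)[1]
far_north = mercator(61, 0)[1] - mercator(60, 0)[1]
print(round(near_equator), round(far_north))
```

The growing gap between the two numbers is the “bending” needed to lay a round Earth flat, and it is why areas far from the equator look larger on such maps than they really are.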

DH Blog @ Notre Dame

Learning about human expression through the use of computers