This is a short list of part-of-speech (POS) taggers, which are tools that mark which words in a sentence are nouns, verbs, adjectives, etc. Once parts of speech are marked, a reader can begin to analyze a text on a dimension beyond the simple tabulation of words. The list is posted here for my own edification, and in case it proves useful to someone else in the future:
- CLAWS – From the website, “Part-of-speech (POS) tagging, also called grammatical tagging, is the commonest form of corpus annotation, and was the first form of annotation to be developed by UCREL at Lancaster. Our POS tagging software for English text, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to POS tag c.100 million words of the British National Corpus (BNC).” After obtaining a licence, the reader can download CLAWS and use it accordingly. The site is also interesting because it includes a simple Web interface allowing the reader to supply a text and have it tagged. The input is limited to 100,000 words.
- Junk Tagger – This tool’s underlying model is documented in “A Maximum Entropy Tagger with Unsupervised Hidden Markov Models” by Jun’ichi Kazama, Yusuke Miyao and Jun’ichi Tsujii. It comes with a number of pre-compiled Linux binaries and quite a number of Ruby scripts. I’m sure it runs, but the instructions were terse and my experience with Ruby is less than limited.
- Linguistic Tools – Here you will find a simple Web-based interface allowing you to parse and tag pasted text or text at the other end of a URL. The output is good for demonstration purposes, but not necessarily for computing against. The service is really intended to be used through its SOAP interface.
- OpenNLP – This tool set seems to be up-and-coming. It is written in Java and has a command line interface, but I believe the input needs to be a list of tokenized lines (sentences). The output resembles the output of the Stanford tool: a sentence with POS tags appended to tokens. The command line interface is intended for demonstration purposes.
- Stanford Log-linear Part-Of-Speech Tagger – Written in Java, this too seems to be an increasingly popular tagger. The output from the command line interface reads like sentences with a POS tag appended to the end of each token.
- TreeTagger – TreeTagger seems to be the granddaddy of POS tools. It is available for many operating systems and many languages. Read the installation instructions carefully because they matter. I like TreeTagger the best because there is a Perl module (Lingua::TreeTagger) that goes hand-in-hand with it. There are a few different output styles from the command line interface: XML-ish, a table, etc. To make my life easier, I wrote a Perl script called pos-summarize.pl. Its purpose is to tabulate TreeTagger’s tabled output, listing the number of times different parts of speech occurred. This way it is relatively easy to see whether there is a preponderance of adjectives, gender-specific nouns, etc.
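
For readers who do not use Perl, the kind of tabulation described above can be sketched in a few lines of Python. This is not pos-summarize.pl itself, only a minimal illustration under two assumptions: that the tabled output has one token per line with tab-separated columns and the POS tag in the second column, and that the "appended" style joins token and tag with an underscore (as in `cat_NN`). The function names and the sample data are hypothetical, made up for this example.

```python
from collections import Counter


def tags_from_table(text):
    """Yield the POS tag from each tab-delimited line.

    Assumes columns of token, tag, and (optionally) lemma. Lines without
    a tab, such as blank lines or XML-ish markup, are skipped.
    """
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1]:
            yield fields[1]


def tags_from_suffixed(text, separator="_"):
    """Yield the POS tag from whitespace-delimited tokens shaped like word_TAG.

    The separator is an assumption; adjust it to match the tagger's output.
    """
    for token in text.split():
        word, sep, tag = token.rpartition(separator)
        if sep and word and tag:
            yield tag


def summarize(tags):
    """Return (tag, count) pairs, most frequent first."""
    return Counter(tags).most_common()


# Hypothetical sample data in the two assumed formats.
sample_table = (
    "The\tDT\tthe\n"
    "cat\tNN\tcat\n"
    "sat\tVBD\tsit\n"
    "on\tIN\ton\n"
    "the\tDT\tthe\n"
    "mat\tNN\tmat\n"
    ".\tSENT\t.\n"
)
sample_suffixed = "The_DT cat_NN sat_VBD on_IN the_DT mat_NN ._."

if __name__ == "__main__":
    for tag, count in summarize(tags_from_table(sample_table)):
        print(f"{tag}\t{count}")
```

Feeding `tags_from_suffixed(sample_suffixed)` to `summarize` gives the same kind of tally for output in the appended style. The tag names themselves depend on the tagger and its tag set, so counts from different tools are not necessarily comparable.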