A Simple Scheme for Book Classification Using Wikipedia

Andromeda Yelton

Andromeda Yelton (andromeda.yelton@gmail.com) graduated from the Graduate School of Library and Information Science, Simmons College, Boston, in May 2010.

Editor's note: This article is the winner of the LITA/Ex Libris Student Writing Award, 2010.

Because the rate at which documents are being generated outstrips librarians' ability to catalog them, an accurate, automated scheme of subject classification is desirable. However, simplistic word-counting schemes miss many important concepts; librarians must enrich algorithms with background knowledge to escape basic problems such as polysemy and synonymy. I have developed a script that uses Wikipedia as context for analyzing the subjects of nonfiction books. Though a simple method built quickly from freely available parts, it is partially successful, suggesting the promise of such an approach for future research.

As the amount of information in the world increases at an ever-more-astonishing rate, it becomes both more important to be able to sort out desirable information and more egregiously daunting to manually catalog every document. It is impossible even to keep up with all the documents in a bounded scope, such as academic journals; there were more than twenty thousand peer-reviewed academic journals in publication in 2003.1 Therefore a scheme of reliable, automated subject classification would be of great benefit.

However, there are many barriers to such a scheme. Naive word-counting schemes isolate common words, but not necessarily important ones. Worse, the words for the most important concepts of a text may never occur in the text.

How can this problem be addressed? First, the most characteristic (not necessarily the most common) words in a text need to be identified—words that particularly distinguish it from other texts. Some corpus that connects words to ideas is required—in essence, a way to automatically look up ideas likely to be associated with some particular set of words.

Fortunately, there is such a corpus: Wikipedia. What, after all, is a Wikipedia article, but an idea (its title) followed by a set of words (the article text) that characterize that title? Furthermore, the other elements of my scheme were readily available. For many books, Amazon lists Statistically Improbable Phrases (SIPs)—that is, phrases that are found "a large number of times in a particular book relative to all Search Inside! books."2 And Google provides a way to find pages highly relevant to a given phrase. If I used Google to query Wikipedia for a book's SIPs (using the query form "site:en.wikipedia.org SIP"), would Wikipedia's page titles tell me something useful about the subject(s) of the book?
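To make the query form concrete, the following is a minimal sketch of how one such Google query string could be built per phrase. The phrase list and the use of urlencode() are illustrative assumptions; the script actually used in this study appears in the appendix.

<?php
// Illustrative sketch only: build one Google query per phrase, restricted to
// the English Wikipedia with the site: operator described in the text above.
// The phrases here are stand-ins; a real run would use a book's SIPs and CAPs.
$phrases = array('thermodynamic arrow', 'primordial black holes');
foreach ($phrases as $phrase) {
    // urlencode() makes multiword phrases safe to embed in a URL.
    $query = 'site:en.wikipedia.org ' . $phrase;
    echo 'http://www.google.com/search?q=' . urlencode($query) . "\n";
}
?>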
■■ Background

Hanne Albrechtsen outlines three types of strategies for subject analysis: simplistic, content-oriented, and requirements-oriented.3 In the simplistic approach, "subjects [are] absolute objective entities that can be derived as direct linguistic abstractions of documents." The content-oriented model includes an interpretive step, identifying subjects not explicitly stated in the document. Requirements-oriented approaches look at documents as instruments of communication; thus they anticipate users' potential information needs and consider the meanings that documents may derive from their context. (See, for instance, the work of Hjørland and Mai.4) Albrechtsen posits that only the simplistic model, which has obvious weaknesses, is amenable to automated analysis.

The difficulty in moving beyond a simplistic approach, then, lies in the ability to capture things not stated, or at least not stated in proportion to their importance. Synonymy and polysemy complicate the task. Background knowledge is needed to draw inferences from text to larger meaning. These would be insuperable barriers if computers were limited to simple word counts. However, thesauri, ontologies, and related tools can help computers as well as humans in addressing these problems; indeed, a great deal of research has been done in this area. For instance, enriching metadata with Princeton University's WordNet and the National Library of Medicine's Medical Subject Headings (MeSH) is a common tactic,5 and the Yahoo! category structure has been used as an ontology for automated document classification.6 Several projects have used Library of Congress Classification (LCC), Dewey Decimal Classification (DDC), and similar library tools for automated text classification, but their results have not been thoroughly reported.7

All of these tools have had problems, though, with issues such as coverage, currency, and cost. This has motivated research into the use of Wikipedia in their stead. Since Wikipedia's founding in 2001, it has grown prodigiously, encompassing more than 3 million articles in its English edition alone as of this writing; this gives it unparalleled coverage. Wikipedia also has many thesaurus-like features. Redirects function as "see" references by linking synonyms to preferred terms. Disambiguation pages deal with homonyms. The polyhierarchical category structure provides broader and narrower term relationships; the vast majority of pages belong to at least one category. Links between pages function as related-term indicators.

Because of this thesaurus-like structure, all of which can be harvested and used automatically, many researchers have used Wikipedia for metadata enrichment, text clustering and classification, and the like.

For example, Han and Zhao wanted to automatically disambiguate names found online but faced many problems familiar to librarians: "The traditional methods measure the similarity using the bag of words (BOW) model. The BOW, however, ignores all the semantic relations such as social relatedness between named entities, associative relatedness between concepts, polysemy and synonymy between key terms. So the BOW cannot reflect the actual similarity." To counter this, they constructed a semantic model from information on Wikipedia about the associative relationships of various ideas. They then used this model to find relationships between information found in the context of the target name on different pages. This enabled them to accurately group pages pertaining to particular individuals.8

Carmel, Roitman, and Zwerdling used page categories and titles to enhance labeling of document clusters. Although many algorithms exist for sorting large sets of documents into smaller, interrelated clusters, there is less work on labeling those clusters usefully. By extracting cluster keywords, using them to query Wikipedia, and algorithmically analyzing the results, they created a system whose top five recommendations contained the human-generated cluster label more than 85 percent of the time.9

Schönhofen looked at the same problem I examine—identifying document topics with Wikipedia data—but he used a different approach. He calculated the relatedness between categories and words from the titles of pages belonging to those categories. He then used that relatedness to determine how strongly words from a target document predicted various Wikipedia categories. He found that although his results were skewed by how well-represented topics were on Wikipedia, "for 86 percent of articles, the top 20 ranked categories contain at least one of the original ones, with the top ranked category correct for 48 percent of articles."10

Wikipedia has also been used as an ontology to improve clustering of documents in a corpus,11 to automatically generate domain-specific thesauri,12 and to improve Wikipedia itself by suggesting appropriate categories for articles.13

In short, Wikipedia has many uses for metadata enrichment. While text classification is one of these potential uses, and one with promise, it is under-explored at present. Additionally, this exploration takes place almost entirely in the proceedings of computer science conferences, often without reference to library science concepts and rarely in venues where librarians would be likely to benefit from it. This paper aims to bridge that gap.
■■ An Initial Test Case

To explore whether my method was feasible, I needed to try it on a test case. I chose Stephen Hawking's A Brief History of Time, a relatively accessible meditation on the origin and fate of the universe, classified under "cosmology" by the Library of Congress. I began by looking up its SIPs on Amazon.com. Noticing that Amazon also lists Capitalized Phrases (CAPs)—"people, places, events, or important topics mentioned frequently in a book"—I included those as well (see table 1).14

Table 1. SIPs and CAPs for A Brief History of Time
SIPs: grand unification energy, complete unified theory, thermodynamic arrow, psychological arrow, primordial black holes, boundary proposal, hot big bang model, big bang singularity, more quarks, contracting phase, sum over histories
CAPs: Alpha Centauri, Solar System, Nobel Prize, North Pole, United States, Edwin Hubble, Royal Society, Richard Feynman, Milky Way, Roger Penrose, First World War, Weak Anthropic Principle

I then queried Wikipedia via Google for each of these phrases, using queries such as "site:en.wikipedia.org 'grand unification theory.'" I selected the top three Wikipedia article hits for each phrase. This yielded a list of sixty-one distinct items with several interesting properties:

■■ Four items appeared twice (Arrow of time, Entropy [arrow of time], Inflation [cosmology], Richard Feynman). However, nothing appeared more than twice; that is, nothing definitively stood out.
■■ Many items on the list were clearly relevant to Brief History, although often at too small a level of granularity to be good subject headings (e.g., Black hole, Second law of thermodynamics, Time in physics).
■■ Some items, while not unrelated, were wrong as subject classifications (e.g., List of Solar System objects by size, Nobel Prize in Physics).
■■ Some items were at best amusingly, and at worst bafflingly, unrelated (e.g., Alpha Centauri [Doctor Who], Electoral district [Canada], James K. Polk, United States men's national soccer team).
■■ In addition, I had to discard some of the top Google hits because they were not articles but Wikipedia special pages, such as "talk" pages devoted to discussion of an article.

This test showed that I needed an approach that would give me candidate subject headers at a higher level of granularity. I also needed to be able to draw a brighter line between candidates and noncandidates. The presence of noncandidates was not in itself distressing—any automated approach will consider avenues a human would not—but not having a clear basis for discarding low-probability descriptors was a problem.

As it happens, Wikipedia itself offers candidate subject headers at a higher level of granularity via its categories system. Most articles belong to one or more categories, which are groups of pages belonging to the same list or topic.15 I hoped that by harvesting categories from the sixty-one pages I had discovered, I could improve my method.

This yielded a list of more than three hundred categories. Unsurprisingly, this list mostly comprised irrelevant candidates ("wars involving the states and peoples of Asia," "video games with expansion packs," "organizations based in Sweden," among many others). Many categories played a clear role in the Wikipedia ecology of knowledge but were not suitable as general-purpose subject headers ("living people," "1849 deaths"). Strikingly, though, the vast majority of candidates occurred only once. Only forty-two occurred twice, fifteen occurred three times, and one occurred twelve times: "physical cosmology." Twelve occurrences, four times as many as the next candidate, looked like a bright line. And "physical cosmology" is an excellent description of Brief History—arguably better than LCSH's "cosmology." The approach looked promising.
■■ Automating Further Test Cases

The next step was to test an extensive variety of books to see if the method was more broadly applicable. However, running searches and collating queries for even one book is tedious; investigating a large number by hand was prohibitive. Therefore I wrote a categorization script (see appendix) that performs the following steps:16

■■ reads in a file of statistically improbable phrases17
■■ runs Google queries against Wikipedia for all of them18
■■ selects the top hits after filtering out some common Wikipedia nonarticles, such as "category" and "user" pages
■■ harvests these articles' categories
■■ sorts these categories by their frequency of occurrence

This algorithm did not filter out Wikipedia administrative categories, as creating a list of them would have been prohibitively time-consuming. However, it would be computationally trivial to do so, given such a list. (The list need not be exhaustive as long as it exhaustively described category types; for instance, the same regular expression could filter out both "articles with unsourced statements from October 2009" and "articles with unsourced statements from May 2008.") At this stage of research, however, I simply ignored these categories in analyzing my results.

To find a variety of books to test, I used older New York Times nonfiction bestseller lists because brand-new books are less likely to have SIPs available on Amazon.19 These lists were heavily slanted toward autobiography, but also included history, politics, and social science topics.
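As noted above, filtering administrative categories is computationally trivial once a list of patterns exists. The sketch below illustrates that idea in the same language as the appendix script; the filter_admin_categories() helper, the particular patterns, and the example counts are hypothetical, not part of the published script.

<?php
// Hypothetical helper: drop categories matching known administrative patterns.
// A single regular expression covers every dated variant of
// "articles with unsourced statements from ...".
function filter_admin_categories($category_counts) {
    $admin_patterns = array(
        '/^Articles_with_unsourced_statements/i',
        '/^All_articles/i',
        '/_stubs$/i',
        '/^Living_people$/i',
    );
    $kept = array();
    foreach ($category_counts as $category => $count) {
        foreach ($admin_patterns as $pattern) {
            if (preg_match($pattern, $category)) {
                continue 2;  // administrative category; skip to the next one
            }
        }
        $kept[$category] = $count;
    }
    return $kept;
}

// Example with invented counts, in the same shape the appendix script produces.
$counts = array('Physical_cosmology' => 12, 'Articles_with_unsourced_statements_from_May_2008' => 3);
print_r(filter_admin_categories($counts));
?>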
■■ Results

Of the thirty books I examined (the top fifteen each from the paperback and hardback nonfiction lists), twenty-one had SIPs and CAPs available on Amazon. I ran my script against each of these phrase sets and calculated three measures for each resulting category list:

■■ Precision (P): of the top categories, how many were synonyms or near-synonyms of the book's LCSHs?
■■ Recall (R): of the book's LCSHs, how many had synonyms or near-synonyms among the top categories?
■■ Right-but-wrongs (RbW): of the top categories, how many were reminiscent of the LCSHs without actually being synonymous? These included narrower terms (e.g., the category "African_American_actors" when the LCSHs included "Actors—United States—Biography"), broader terms (e.g., "American_folk_singers" vs. "Dylan, Bob, 1941–"), related terms (e.g., "The_Chronicles_of_Narnia_books" vs. "Lion, the Witch and the Wardrobe (Motion picture)"), and examples ("Killian_documents_controversy" vs. "United States—Politics and government—2001–2009").

I considered the "top categories" for each book to be the five that most commonly occurred (excluding Wikipedia administrative categories), with the following exceptions:

■■ Because I had no basis to distinguish between them, I included all equally popular categories, even if that would bring the total to more than five. Thus, for example, for the book Collapse, the most common category occurred seven times, followed by two categories with five appearances and six categories with four. Rather than arbitrarily selecting two of the six four-occurrence categories to bring the total to five, I examined all nine top categories.
■■ If there were more than five LCSHs, I expanded the number of categories accordingly, so as not to misleadingly increase recall statistics.
■■ I did not consider any categories with fewer than four occurrences, even if that left me with fewer than five top categories to consider. The lists of three-, two-, and one-occurrence categories were very long and almost entirely composed of unrelated items.

I also considered, subjectively, the degree of overlap between the LCSHs and the top Wikipedia categories. I chose four degrees of overlap:

■■ strong: the top categories were largely relevant and included synonyms or near-synonyms for the LCSH
■■ near miss: some categories suggested the LCSH but missed its key points, such as "Continental_Army_generals" vs. "United States—History—Revolution, 1775–1783"
■■ weak: some categories treated the same subject as the LCSH but not at all in the same way
■■ wrong: the categories were actively misleading

The results are displayed in table 2.

Table 2. Results (sorted by percentage of relevant categories)

Book | P | R | RbW | Subjective Quality
Chronicles (Bob Dylan) | 0.2 | 0.5 | 0.8 | strong
The Chronicles of Narnia: The Lion, the Witch and the Wardrobe Official Illustrated Movie Companion (Perry Moore) | 0.25 | 1 | 0.625 | strong
1776 (David McCullough) | 0 | 0 | 0.8 | near miss
100 People Who Are Screwing Up America (Bernard Goldberg) | 0 | 0 | 0.625 | weak
The Bob Dylan Scrapbook, 1956–1966 (with text by Robert Santelli) | 0.2 | 0.5 | 0.4 | strong
Three Weeks With My Brother (Nicholas Sparks) | 0 | 0 | 0.57 | weak
Mother Angelica (Raymond Arroyo) | 0.07 | 0.33 | 0.43 | near miss
Confessions of a Video Vixen (Karrine Steffans) | 0.25 | 0.33 | 0.25 | weak
The Fairtax Book (Neal Boortz and John Linder) | 0.17 | 0.33 | 0.33 | strong
Never Have Your Dog Stuffed (Alan Alda) | 0 | 0 | 0.43 | weak
The World is Flat (Thomas L. Friedman) | 0.4 | 0.5 | 0 | near miss
The Tender Bar (J. R. Moehringer) | 0 | 0 | 0.2 | wrong
The Tipping Point (Malcolm Gladwell) | 0 | 0 | 0.2 | wrong
Collapse (Jared Diamond) | 0 | 0 | 0.11 | weak
Blink (Malcolm Gladwell) | 0 | 0 | 0 | wrong
Freakonomics (Steven D. Levitt and Stephen J. Dubner) | 0 | 0 | 0 | wrong
Guns, Germs, and Steel (Jared Diamond) | 0 | 0 | 0 | weak
Magical Thinking (Augusten Burroughs) | 0 | 0 | 0 | wrong
A Million Little Pieces (James Frey) | 0 | 0 | 0 | wrong
Worth More Dead (Ann Rule) | 0 | 0 | 0 | wrong
Tuesdays With Morrie (Mitch Albom) | No category with more than 4 occurrences
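As an illustration of how the precision and recall figures above are computed once a human judge has decided which top categories count as synonyms or near-synonyms of which LCSHs, the sketch below tallies those judgments; the judged match map and all of its values are invented stand-ins for that human step, not output of the published script.

<?php
// A minimal sketch of the P and R calculations defined above. Which category
// matches which LCSH is a human judgment, represented here by a hand-built map;
// every value is invented for illustration.
$top_categories = array('Physical_cosmology', 'Black_holes', 'Fellows_of_the_Royal_Society');
$lcsh = array('Cosmology');

// Map each top category to the LCSH it matches, or to null if it matches none.
$judged_matches = array(
    'Physical_cosmology' => 'Cosmology',       // near-synonym
    'Black_holes' => null,                      // related but too narrow (an RbW)
    'Fellows_of_the_Royal_Society' => null,
);

$matching_categories = 0;
$matched_lcsh = array();
foreach ($judged_matches as $category => $heading) {
    if ($heading !== null) {
        $matching_categories++;
        $matched_lcsh[$heading] = true;
    }
}

$precision = $matching_categories / count($top_categories);  // matching categories / top categories
$recall = count($matched_lcsh) / count($lcsh);                // matched LCSHs / all LCSHs
echo "P = $precision, R = $recall\n";
?>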
■■ Discussion

The results of this test were decidedly more mixed than those of my initial test case. On some books the Wikipedia method performed remarkably well; on others, it performed very poorly. However, there are several patterns here.

Many of these books were autobiographies, and the method was ineffective on nearly all of these.20 A key feature of autobiographies, of course, is that they are typically written in the first person, and thus lack any term for the major subject—the author's name. Biography, by contrast, is rife with this term. This suggests that including titles and authors along with SIPs and CAPs may be wise. Additionally, it might require making better use of Wikipedia as an ontology to look for related concepts (rather in the manner that Han and Zhao used it for name disambiguation).21

Books that treat a single, well-defined subject are easier to analyze than those with more sprawling coverage. In particular, books that treat a concept via a sequence of illustrative essays (e.g., Tipping Point, Freakonomics) do not work well at all. SIPs may apply only to particular chapters rather than to the book as a whole, and the algorithm tends to pick out topics of particular chapters (e.g., for Freakonomics, the fascinating chapter on Sudhir Venkatesh's work on "Gangs_in_Chicago,_Illinois"22) rather than the connecting threads of the entire book (e.g., "Economics—Sociological aspects"). The tactics suggested for autobiography might help here as well.

My subjective impressions were usually, but not always, borne out by the statistics. This is because some of the RbWs were strongly related to one another and suggested to a human observer a coherent narrative, whereas others picked out minor or dissimilar aspects of the book.

There was one more interesting, and promising, pattern: my subjective impressions of the quality of the categories were strongly predicted by the frequency of the most common category. Remember that in the Brief History example, the most common category, "physical cosmology," occurred twelve times, conspicuously more than any of its other categories. Therefore I looked at how many times the top category for each book occurred in my results. I averaged this number for each subjective quality group; the results are in table 3.

Table 3. Category Frequency and Subjective Quality

Subjective Quality of Categories | Frequencies of Most Common Category | Average Frequency of Most Common Category
strong | 6, 12, 16, 19 | 13.25
near miss | 5, 5, 7, 10 | 6.75
weak | 4, 5, 6, 7, 8 | 6
wrong | 3, 4, 4, 5, 5, 5, 7, 7 | 5

In other words, the easier it was to draw a bright line between common and uncommon categories, the more likely the results were to be good descriptions of the work. This suggests that a system such as this could be used with very little modification to streamline categorization. For example, it could automatically categorize works when it met a high confidence threshold (when, for instance, the most common category has double-digit occurrence), suggest categories for a human to accept or reject at moderate confidence, and decline to help at low confidence.

It was also interesting to me that—unlike my initial test case—none of the bestsellers were scientific or technical works. It is possible that the jargon-intensive nature of science makes it easier to categorize accurately, hence my method's success with A Brief History of Time. I tested another technical, jargon-intensive work (N. Gregory Mankiw's Macroeconomics textbook) and found that the method also worked very well, giving categories such as "macroeconomics" and "economics terminology" with high frequency. Therefore a system of this nature, even if not usable for a broad-based collection, might be very useful for scientific or other jargon-intensive content such as a database of journal articles.
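The confidence-threshold triage described above could look something like the following sketch. The triage_by_top_frequency() helper and the specific cutoffs (10 and 6) are illustrative assumptions; the paper reports only that double-digit top frequencies corresponded to strong results.

<?php
// Sketch of the triage policy: act automatically when the most common category
// stands far ahead of the rest, ask a human at moderate confidence, decline otherwise.
// The thresholds are illustrative assumptions, not values from the study.
function triage_by_top_frequency($category_counts) {
    if (empty($category_counts)) {
        return 'decline';
    }
    $top_frequency = max($category_counts);
    if ($top_frequency >= 10) {
        return 'auto-categorize';     // bright line: double-digit occurrences
    } elseif ($top_frequency >= 6) {
        return 'suggest to a human';  // moderate confidence
    }
    return 'decline';                 // low confidence
}

// Example with invented counts, in the same shape as $mastercatarray in the appendix script.
$counts = array('Physical_cosmology' => 12, 'Black_holes' => 3, 'Fellows_of_the_Royal_Society' => 2);
echo triage_by_top_frequency($counts) . "\n";  // prints "auto-categorize"
?>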
■■ Future Research

The method outlined in this paper is intended to be a proof of concept using readily available tools. The following work might move it closer to a real-world application:

■■ A configurable system for providing statistically improbable phrases; there are many options.23 This would provide the user with more control over, and understanding of, SIP generation (instead of the Amazon black box), as well as providing output that could integrate directly with the script.
■■ A richer understanding of the Wikipedia category system. Some categories (e.g., "all articles with unsourced statements") are clearly useful only for Wikipedia administrative purposes, not as document descriptors; others (e.g., "physical cosmology") are excellent subject candidates; others have unclear value as subjects or require some modification (e.g., "environmental non-fiction books," "macroeconomics stubs"). Many of these could be filtered out or reformatted automatically.
■■ Greater use of Wikipedia as an ontology. For example, a map of the category hierarchies might help locate headers at a useful level of granularity, or find the overarching meaning suggested by several headers by finding their common broader terms. A more thorough understanding of Wikipedia's relational structure might help disambiguate terms.24 (A sketch of one way to walk the category hierarchy follows this list.)
■■ A special-case system for handling books and authors that have their own article pages on Wikipedia.

In addition, a large-scale project might want to work from downloaded snapshots of Wikipedia (via http://download.wikimedia.org/), which could be run on local hardware rather than burdening Wikipedia's servers. This would require using something other than Google for relevance ranking (there are many options), with a corresponding revision of the categorization script.
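One concrete way to begin building such a map of the category hierarchy is to ask Wikipedia for the parent categories of a category page through the MediaWiki API (action=query with prop=categories). The sketch below does this; the get_parent_categories() helper and the idea of intersecting two headers' parents to find a candidate broader term are my illustrative suggestions, not steps in the published method.

<?php
// Sketch: fetch the parent categories of a Wikipedia category page via the
// MediaWiki API, as one building block for locating common broader terms.
// Helper name and the intersection step are illustrative assumptions.
function get_parent_categories($category) {
    $url = 'https://en.wikipedia.org/w/api.php?action=query&format=json'
         . '&prop=categories&cllimit=500&titles=' . urlencode('Category:' . $category);
    $data = json_decode(file_get_contents($url), true);
    $parents = array();
    foreach ($data['query']['pages'] as $page) {
        if (!isset($page['categories'])) {
            continue;
        }
        foreach ($page['categories'] as $parent) {
            // Titles come back as "Category:Physics"; strip the namespace prefix.
            $parents[] = substr($parent['title'], strlen('Category:'));
        }
    }
    return $parents;
}

// Parents shared by two headers are candidates for an overarching broader term.
$shared = array_intersect(get_parent_categories('Physical cosmology'),
                          get_parent_categories('Black holes'));
print_r($shared);
?>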
■■ Conclusions

Even a simple system, quickly assembled from freely available parts, can have modest success in identifying book categories. Although my system is not ready for real-world applications, it demonstrates that an approach of this type has potential, especially for collections limited to certain genres. Given the staggering volume of documents now being generated, automated classification is an important avenue to explore.

I close with a philosophical point. Although I have characterized this work throughout as automated classification, and it certainly feels automated to me when I use the script, it does in fact still rely on human judgment. Wikipedia's category structure and its articles linking text to title concepts are wholly human-created. Even Google's PageRank system for determining relevancy rests on human input, using web links to pages as votes for them (like a vast citation index) and the texts of these links as indicators of page content.25 My algorithm therefore does not operate in lieu of human judgment. Rather, it lets me leverage human judgment in a dramatically more efficient, if also more problematic, fashion than traditional subject cataloging. With the volume of content spiraling ever further beyond our ability to individually catalog documents—even in bounded contexts like academic databases, which strongly benefit from such cataloging—we must use human judgment in high-leverage ways if we are to have a hope of applying subject cataloging everywhere it is expected.
References and Notes

1. Carol Tenopir, "Online Databases—Online Scholarly Journals: How Many?" Library Journal (Feb. 1, 2004), http://www.libraryjournal.com/article/CA374956.html (accessed Mar. 13, 2010).
2. "Amazon.com Statistically Improbable Phrases," Amazon.com, http://www.amazon.com/gp/search-inside/sipshelp.html/ref=sib_sip_help (accessed Mar. 13, 2010).
3. Hanne Albrechtsen, "Subject Analysis and Indexing: From Automated Indexing to Domain Analysis," The Indexer 18, no. 4 (1993): 219.
4. Birger Hjørland, "The Concept of Subject in Information Science," Journal of Documentation 48, no. 2 (1992): 172; Jens-Erik Mai, "Classification in Context: Relativity, Reality, and Representation," Knowledge Organization 31, no. 1 (2004): 39; Jens-Erik Mai, "Actors, Domains, and Constraints in the Design and Construction of Controlled Vocabularies," Knowledge Organization 35, no. 1 (2008): 16.
5. Xiaohua Hu et al., "Exploiting Wikipedia as External Knowledge for Document Clustering," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009 (New York: ACM, 2009): 389.
6. Yannis Labrou and Tim Finin, "Yahoo! as an Ontology—Using Yahoo! Categories to Describe Documents," in Proceedings of the Eighth International Conference on Information and Knowledge Management, Kansas City, MO, USA, 1999 (New York: ACM, 1999): 180.
7. Kwan Yi, "Automated Text Classification Using Library Classification Schemes: Trends, Issues, and Challenges," International Cataloging & Bibliographic Control 36, no. 4 (2007): 78.
8. Xianpei Han and Jun Zhao, "Named Entity Disambiguation by Leveraging Wikipedia Semantic Knowledge," in Proceedings of the 18th ACM Conference on Information and Knowledge Management, Hong Kong, China, 2–6 November 2009 (New York: ACM, 2009): 215.
9. David Carmel, Haggai Roitman, and Naama Zwerdling, "Enhancing Cluster Labeling Using Wikipedia," in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA (New York: ACM, 2009): 139.
10. Peter Schönhofen, "Identifying Document Topics Using the Wikipedia Category Network," in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong, China, 18–22 December 2006 (Los Alamitos, Calif.: IEEE Computer Society, 2007).
11. Hu et al., "Exploiting Wikipedia."
12. David Milne, Olena Medelyan, and Ian H. Witten, "Mining Domain-Specific Thesauri from Wikipedia: A Case Study," in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, 22–26 December 2006 (Washington, D.C.: IEEE Computer Society, 2006): 442.
13. Zeno Gantner and Lars Schmidt-Thieme, "Automatic Content-Based Categorization of Wikipedia Articles," in Proceedings of the 2009 Workshop on the People's Web Meets NLP, ACL-IJCNLP 2009, 7 August 2009, Suntec, Singapore (Morristown, N.J.: Association for Computational Linguistics, 2009): 32.
14. "Amazon.com Capitalized Phrases," Amazon.com, http://www.amazon.com/gp/search-inside/capshelp.html/ref=sib_caps_help (accessed Mar. 13, 2010).
15. For more on the epistemological and technical roles of categories in Wikipedia, see http://en.wikipedia.org/wiki/Wikipedia:Categorization.
16. Two sources greatly helped the script-writing process: William Steinmetz, Wicked Cool PHP: Real-World Scripts that Solve Difficult Problems (San Francisco: No Starch, 2008); and the documentation at http://php.net.
17. Not all books on Amazon.com have SIPs, and books that do may only have them for one edition, although many editions may be found separately on the site. There is not a readily apparent pattern determining which edition features SIPs. Therefore this step cannot be automated.
18. Be aware that running automated queries without permission is an explicit violation of Google's Terms of Service. See Google Webmaster Central, "Automated Queries," http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66357 (accessed Mar. 13, 2010). Before using this script, obtain an API key, which confers this permission. AJAX web search API keys can be instantly and freely obtained via http://code.google.com/apis/ajaxsearch/web.html.
19. "Hardcover Nonfiction," New York Times, Oct. 9, 2005, http://www.nytimes.com/2005/10/09/books/bestseller/1009besthardnonfiction.html?_r=1 (accessed Mar. 13, 2010); "Paperback Nonfiction," New York Times, Oct. 9, 2005, http://www.nytimes.com/2005/10/09/books/bestseller/1009bestpapernonfiction.html?_r=1 (accessed Mar. 13, 2010).
20. For the purposes of this discussion I consider the problematic A Million Little Pieces to be autobiography, as it has that writing style, and as its LCSH treats it thus.
21. Han and Zhao, "Named Entity Disambiguation."
22. Sudhir Venkatesh, Off the Books: The Underground Economy of the Urban Poor (Cambridge: Harvard Univ. Pr., 2006).
23. See Karen Coyle, "Machine Indexing," The Journal of Academic Librarianship 34, no. 6 (2008): 530. She gives as examples PhraseRate (http://ivia.ucr.edu/projects/PhraseRate/), KEA (http://www.nzdl.org/Kea/), and Extractor (http://extractor.com/).
24. Per Han and Zhao, "Named Entity Disambiguation."
25. Lawrence Page et al., "The PageRank Citation Ranking: Bringing Order to the Web," Stanford InfoLab (1999), http://ilpubs.stanford.edu:8090/422/ (accessed Mar. 13, 2010). This paper precedes the launch of Google; as the title indicates, the citation index is one of Google's foundational ideas.
Appendix. PHP Script for Automated Classification

<?php
/* Usage: php classify.php <sip_file> <hits_per_phrase>
   where <sip_file> is a comma-separated list of SIPs and CAPs and
   <hits_per_phrase> is the number of top Google hits to keep per phrase. */

// The Google AJAX Search API returns four results per query, so cap the request at 4.
if ($argv[2] > 4) {
    echo "I'm sorry; the number specified cannot be more than 4.";
    die;
}

// Next, turn our comma-separated list into an array.
$sip_temp = fopen($argv[1], 'r');
$sip_list = '';
while (!feof($sip_temp)) {
    $sip_list .= fgets($sip_temp, 5000);
}
fclose($sip_temp);
$sip_array = explode(', ', $sip_list);

/* Here we access Google search results for our SIPs and CAPs. It is a violation
   of the Google Terms of Service to run automated queries without permission.
   Obtain an AJAX API key via http://code.google.com. */
$apikey = 'your_key_goes_here';
$wikipages = array();

foreach ($sip_array as $query) {
    /* In multiword terms, change spaces to + so as not to break the Google search. */
    $query = str_replace(' ', '+', $query);
    $googresult = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Aen.wikipedia.org+$query&key=$apikey";
    $googdata = file_get_contents($googresult);

    // Pick out the URLs we want and put them into the array $links.
    preg_match_all('|"url":"[^"]*"|i', $googdata, $links);

    /* Strip out some crud from the JSON syntax to get just the URLs. */
    $links[0] = str_replace('"url":"', '', $links[0]);
    $links[0] = str_replace('"', '', $links[0]);

    /* Here we step through the links Google returned to us and find the top
       Wikipedia articles among the results. */
    $i = 0;
    foreach ($links[0] as $testlink) {
        /* These variables test to see if we have hit a Wikipedia special page
           instead of an article. There are many more flavors of special page,
           but these are the most likely to show up in the first few hits. */
        $filetest = strpos($testlink, 'wiki/File:');
        $cattest = strpos($testlink, 'wiki/Category:');
        $usertest = strpos($testlink, 'wiki/User');
        $talktest = strpos($testlink, 'wiki/Talk:');
        $disambtest = strpos($testlink, '(disambiguation)');
        $templatetest = strpos($testlink, 'wiki/Template_');
        if (!$filetest && !$cattest && !$usertest && !$talktest && !$disambtest && !$templatetest) {
            $wikipages[] = $testlink;
            $i++;
        }
        /* Once we've accumulated as many article pages as the user asked for,
           stop adding links to the $wikipages array. */
        if ($i == $argv[2]) {
            break;
        }
    } // This closes the foreach loop which steps through $links.
} // This closes the foreach loop which steps through $sip_array.

/* For each page that we identified in the above step, let's find the categories
   it belongs to. */
$mastercatarray = array();
foreach ($wikipages as $targetpage) {
    // Scrape category information from the article page.
    $wikiscrape = file_get_contents($targetpage);
    preg_match_all('|/wiki/Category:[^"]+|', $wikiscrape, $categories);
    foreach ($categories[0] as $catstring) {
        /* Strip out the "/wiki/Category:" at the beginning of each string. */
        $catstring = substr($catstring, 15);
        /* Keep count of how many times we've seen this category. */
        if (array_key_exists($catstring, $mastercatarray)) {
            $mastercatarray[$catstring]++;
        } else {
            $mastercatarray[$catstring] = 1;
        }
    }
}

// Sort by value: most popular categories first.
arsort($mastercatarray);
echo "The top categories are:\n";
print_r($mastercatarray);
?>
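For orientation, an invocation of the script might look like the following; the script and input file names are assumptions for illustration, since the article does not name them. The script then prints "The top categories are:" followed by the print_r() output of the tallied category counts, most frequent first.

php classify.php briefhistory_sips.txt 3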