Edinburgh Research Explorer Adapting the Edinburgh Geoparser for Historical Georeferencing Citation for published version: Alex, B, Byrne, K, Grover, C & Tobin, R 2015, 'Adapting the Edinburgh Geoparser for Historical Georeferencing', International Journal of Humanities and Arts Computing, vol. 9, no. 1, pp. 15-35. https://doi.org/10.3366/ijhac.2015.0136 Digital Object Identifier (DOI): 10.3366/ijhac.2015.0136 Link: Link to publication record in Edinburgh Research Explorer Document Version: Peer reviewed version Published In: International Journal of Humanities and Arts Computing General rights Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights. Take down policy The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact openaccess@ed.ac.uk providing details, and we will remove access to the work immediately and investigate your claim. Download date: 06. Apr. 2021 https://doi.org/10.3366/ijhac.2015.0136 https://doi.org/10.3366/ijhac.2015.0136 https://www.research.ed.ac.uk/portal/en/publications/adapting-the-edinburgh-geoparser-for-historical-georeferencing(59afd055-a05e-44fb-802e-41d77f50a251).html ADAPTING THE EDINBURGH GEOPARSER FOR HISTORICAL GEOREFERENCING BEATRICE ALEX, KATE BYRNE, CLAIRE GROVER AND RICHARD TOBIN Abstract Place name mentions in text may have more than one potential referent (e.g. Peru, the country vs. Peru, the city in Indiana). The Edinburgh Language Technology Group (LTG) has developed the Edinburgh Geoparser, a system that can automatically recognise place name men- tions in text and disambiguate them with respect to a gazetteer. The recognition step is required to identify location mentions in a given piece of text. The subsequent disambiguation step, generally referred to as georesolution, grounds location mentions to their corresponding gazetteer entries with latitude and longitude values, for example, to visualise them on a map. Geoparsing is not only useful for mapping purposes but also for making document collections more accessible as it can provide additional metadata about the geographical content of documents. Combined with other information mined from text such as person names and date expressions, complex relations between such pieces of information can be identified. The Edinburgh Geoparser can be used with several gazetteers including Unlock and GeoNames to process a variety of input texts. The orig- inal version of the Geoparser was a demonstrator configured for modern text. Since then, it has been adapted to georeference historic and ancient text collections as well as modern-day news- paper text. 1–4 Currently, the LTG is involved in three research projects applying the Geoparser to historical text collections of very different types and for a variety of end-user applications. This paper discusses the ways in which we have customised the Geoparser for specific datasets and applications relevant to each project. Keywords: Georeferencing, georesolution, text mining, domain adaptation International Journal of Humanities and Arts Computing ... , Edinburgh University Press DOI: ... c© Edinburgh University Press and the Association for History and Computing 2014 http://www.euppublishing.com/ijhac 1 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin INTRODUCTION: THE EDINBURGH GEOPARSER The Edinburgh Geoparser is a Natural Language Processing (NLP) system designed to analyse text in order to identify occurrences of locations and ‘pin’ them to a map by determining their correct latitude and longitude. This involves disambiguation wherever a location has more than one possible interpretation. An detailed introduction to geoparsing and the steps involved can be found in the paper by Kalev H. Leetaru (2012). 5 The Geoparser’s functionality is comparable to other software such as CLAVIN 6, OpenCalais 7 or Yahoo PlaceSpotter. 8 The Geoparser described here is an update of the system which we reported on previously. 2,9 In those papers, we evaluated its performance against the SpatialML corpus 10 as well as against historical English documents. The software can be downloaded from the LTG website 11 and a version of the Geoparser is also the backend of EDINA’s Unlock Text, 12 a RESTful API for geoparsing texts on the Web. There are two main components in the Geoparser, a named entity recognition (NER) or geotag- ging component and a georesolution component. The former uses NLP techniques to identify named entities in text, specifically location, person and date entities. The latter looks up the lo- cation names in a gazetteer and resolves ambiguities to suggest the most likely interpretation (i.e. latitude/longitude, country and type) for each location given its context in the text being processed. The recognition component is a pipeline of sub-components built using XML tools in combi- nation with Unix shell scripting. The XML tools are LT-XML2 13 and LT-TTT2 14 which we have been specifically designed for NLP applications. The pipeline converts input text to XML, performs low-level analysis such as tokenisation and sentence-splitting and then applies part-of- speech (POS) tagging and lemmatisation, syntactic chunking and NER. For POS tagging and lemmatisation we use third party software. 15,16 The chunking and NER steps are rule-based as op- posed to components using machine learning to make predictions. The output of the recognition component is a linguistically annotated version of the input text with the location, person and date entities marked up in XML format. The georesolution component takes this as input and looks up the location names in a gazetteer. The standard version of the Geoparser allows for the use of a two gazetteers, namely GeoNames 17 and Ordnance Survey data (both available in Unlock 18). In the projects we report on here, we have extended the Geoparser to allow the use of historical 2 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin gazetteers or adjusted its feature set used for the georesolution. Queries to a gazetteer return all the records which match the input location name and the job of georesolution is to rank these records in order of likelihood in the given context. The georesolver uses heuristics combined with weighting of information to arrive at its rankings. For example, populated places are preferred over places described in a gazetteer as facilities, and larger places (by population) are weighted more highly than smaller ones. The possible interpretations of other locations in the document are used so that all locations mutually constrain one another to be as close together as possible. Thus a document mentioning Portsmouth, Southampton and Bournemouth will be analysed so that the place-names resolve to towns on the south coast of England, while a document containing Portsmouth, Hampton and Chesapeake will resolve the names to places in Virginia, USA. The Edinburgh Geoparser has been in development for a number of years and has been used in several projects with good effect. The rule sets for recognition and resolution have been tuned and tested in many contexts and are reasonably stable. However, potential users of the Geoparser are very wide-ranging and the texts that they wish to analyse are of all kinds, in many formats and put to many different purposes. This makes it extremely difficult to create a robust Geoparser which will please all of the users at all times. The easiest access to the Geoparser is via the Unlock Text API which accepts a range of parameters but cannot be fully customised for a particular purpose, and we expect that many users will want to download the Geoparser source and adjust it for their own needs, much as we have done in the projects described in this paper. The named entity recogniser in the Geoparser has advantages and disadvantages: its behaviour is more transparent than supervised machine learning NER systems and rule sets can be altered relatively straight-forwardly; however, because it relies on lexicons and hand-written rules, it will stumble in cases not foreseen by the authors. The pipeline architecture of the system allows for the user to completely replace the NER component with a component of their own choosing, or to input documents where the named entities have been annotated by hand. Similarly, users may have documents with very specific formats such as tabular data where they may wish to restrict which parts of the document get processed. The georesolution component has been tuned primarily to modern newswire text. Texts from 3 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin newswire are typically quite short and deal with a fairly specific topic with a fairly specific geo- graphic focus. When processing longer documents, the user needs to be aware of the way in which the location names influence each other’s interpretations and they should consider segmenting the document into smaller, geographically coherent pieces. Furthermore, the weights that we have chosen for the features that contribute to the resolution have been optimised for newswire and in different settings users may want to adjust these weights. Often, a user will know what the geo- graphical focus of their document is, whether this focus is a continent, a country or a smaller area. We provide command line options to allow this area to be specified either as a bounding circle or a bounding box so that interpretations inside the area can be more highly ranked according to a user specified weight. Places outside the bounding circle or box may still be selected, so users wishing for an absolute constraint would need to filter the results to exclude the outliers. ADAPTING THE GEOPARSER In this section, we report on adjustments made to the Edinburgh Geoparser for Trading Conse- quences, GAP and DEEP, three research projects all processing historical text of different kinds. (1) The Trading Consequences Project: Georeferencing Nineteenth Century Text In Trading Consequences, 19 a Digging into Data II project (CIINN01), the aim was to assist his- torians in understanding economic and environmental consequences of commodity trading in the nineteenth century British Empire. We applied text mining to large quantities of digitised histori- cal text, which when combined with visualisations presented in a web interface enables historians to analyse trends in commodity trading for a broad range of commodities (see Figure 1). 20 We analysed textual data from major British and Canadian datasets, including the House of Commons Parliamentary Papers available through ProQuest, 22 the Early Canadiana Online data archive 23 and a sub-part of the Foreign and Commonwealth Office Collection from JSTOR. 24 We are also analysing Adam Matthew’s Confidential Print collections, 25 the Directors’ Corre- spondence Collection from the Archives at Kew Gardens available at JSTOR Global Plants 26 and several hundred manually selected titles relevant to this domain. With the exception of the Kew data, all datasets were digitised via Optical Character Recognition (OCR) and text quality varies 4 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin Figure 1. Web interface to the interlinked visualisation of Trading Consequences. 21 considerably for and within each collection. Together these sources amount to over 10 million pages of text and over 7 billion word tokens. We used the Edinburgh Geoparser combined with the GeoNames gazetteer as part of the text mining component to identify and ground locations in these collections. From previous experience, we knew early on that we had to make some adjustments to the Geoparser to process historical collections relevant to Trading Consequences. At the recognition step, for example, we found that identifying person names and location names in parallel, even though there is no need to extract person name information for the intended ap- plication, helped to improve the overall quality of the text mined output. There are many location names which are made up of person names or which are similar to them. For example, there is the location Markham in Ontario and the person name Clements Markham, the British official who was responsible for collecting cinchona plants from their native Peruvian forests and transplant- ing them to India. In the initial Trading Consequences prototype the person name Markham was wrongly identified as a location mention and grounded accordingly. Consequently, this error ap- peared in the map visualisation for the commodity cinchona, the plant whose bark was processed into quinine. We therefore switched on the named entity recognition step for person names to 5 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin avoid such entity type confusion. This approach is not specific to this project but works equally well for other datasets and applications. In the experiments presented next we evaluate the geo- resolution step of the Edinburgh Geoparser for adjustments we made to its feature set specifically for Trading Consequences. A. Gold Standard Data In order to evaluate the effect of the changes we made to the Edinburgh Geoparser, we created a gold standard dataset containing manually annotated location mentions georeferenced to GeoN- ames. The gold standard is made up of document extracts from 25 randomly selected documents for each of the five collections processed in Trading Consequences and for the manually selected documents. Extracts were created to reduce the load of the annotator by splitting the document into equal sized chunks of 5 KB and randomly selecting one extract per document. The gold stan- dard therefore contains a total of 150 document extracts. The annotation was performed in two steps. We firstly asked an annotator to mark up the entire gold standard with location mentions even if they contained errors introduced through the digitisation process. In total, the gold stan- dard contains 4,373 manually identified location mentions. We then ran the Edinburgh Geoparser over the manually annotated data without applying a cut-off to the number of locations returned by GeoNames and without ranking the results. The annotator then carried out the georesolution anno- tation using the Edinburgh Geo-Annotator 27 by selecting one of the suggested candidates. He was able to do that for 3,109 locations. He selected none of the suggested candidates for 283 locations; and for 981 locations GeoNames did not return any candidate, so no candidate resolution could be made. One of the reason for the high number of locations without any GeoNames candidates is that 14.8% of location mentions in the gold standard contain OCR errors. For example, all men- tions referring to the location name Montreal containing at least one error are listed in Figure 2 along with the number of times they occur in the gold standard. OCR errors affect named entities worse than common vocabulary, as this percentage decreases, for example, to 9.1% for commod- ity mentions in text. 28 This is most likely because OCR engines used to digitise documents rely on a dictionary or language model which does not contain many proper nouns. More detailed information on the effect that OCR errors have on named entity recognition for the historical texts 6 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin processed in Trading Consequences can be found in Alex and Burns (2014). 29 Figure 2. Forms of Montreal containing OCR errors and their counts in the gold standard. B. Georesolution Experiments The georesolution step of the Geoparser uses a combination of heuristics such as location feature type, population size, contextual information of location mentions combined with location clus- tering to disambiguate between multiple locations with the same name in the gazetteer. 1 In the prototype Geoparser integrated at the start of Trading Consequences, features and parameters had been applied based on empirical analysis of georeferenced newspaper text but without methodi- cal parameter tuning for performance optimisation. For example, a cut-off parameter was applied to consider the top 20 locations returned for a given GeoNames search in the case where more results were returned. We first processed the gold standard using the Geoparser with its default settings and compared the output to the manual annotations (see Table 1). Of the 3,109 locations which were resolved by the annotator, 2,586 (83.2%) were correctly resolved (exact match of the GeoNames identifier) and 2,626 (84.5%) fall within a 5km radius of the gold resolution. A large majority of trading during the nineteenth century was carried out by ship, making loca- tions with ports extremely important in this context. We therefore gave the Geoparser access to a 7 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin Exact Match Within a 5km Radius Feature settings # Correct Score # Correct Score Default settings 2,586 83.2% 2,626 84.5% 1. port feature 2,528 81.3% 2,565 82.5% 2. increase country feature 2,585 83.1% 2,625 84.4% 3. decrease spot feature 2,585 83.1% 2,625 84.4% Combination of 1. to 3. 2,601 83.7% 2,638 84.9% Combination of 1. to 3. and optimised cut-off 2,608 83.9% 2,645 85.1% Table 1. Georesolution performance of the Edinburgh Geoparser for its default settings, new features and a combination of them on the Trading Consequences gold standard. We report number of correct locations (# Correct) and accuracy scores for two types of evalua- tion (exact match of GeoNames identifier and occurrence within a 5km radius). Figure 3. Top of page 125 from the Ships’ Reports of the House of Commons Parliamentary papers from 1836. 30 gazetteer of ports (with latitude and longitude values). 31 It contains a list of 1,646 ports collected from early-mid 20th century Royal Navy logs, provided to us by Philip Brohan at the Met Office Hadley Centre in Exeter, which we manually supplemented with 136 additional ports listed in the gazetteer of Colonial and Foreign ports. 32 We adjusted the Geoparser by assigning a higher weight to location candidates within 0.1 degree to a port. For example, the location mention Dalhousie, is clearly referring to a port when mentioned in the From whence column of a table in the Ships’ Re- ports of the House of Commons Parliamentary papers from 1836 shown in Figure 3. Incidentally, tables, as shown in this example, also have a negative effect of the performance of our text mining tools which are optimised for running text but we will explore this problem in future work. 8 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin The previous version of the Geoparser grounded the mention of Dalhousie wrongly to Dalhousie in India (GeoNames ID: 1273648; lat: 32.5333300, long: 75.9833300) as a result of the popu- lation size heuristic and other factors, such as locations in context and location clustering. The ports-based adjustment means that the correct Dalhousie in Canada (GeoNames ID: 6943599; lat: 48.0550200, long: -66.3847200) is ranked as the top candidate by the georesolution component. However, Table 1 shows that the ports-based adjustment (port feature) deteriorated the resolution on the gold standard. Error analysis showed that the port feature gave too much weight to smaller locations stored in Geonames, which is why we added two new features to overcome this prob- lem. We increased the weight for GeoNames locations of type PCLI (independent political entity) which usually refer to countries (see 2. increase country feature). We also reduced the weight of GeoNames locations of class S (spot), including buildings, facilities and farms (see 3. decrease spot feature). Both features do not damage the performance of the default Geoparser when applied in isolation, but in combination with the port feature they result in a small improvement of 0.5% exact match accuracy. Figure 4. Georesolution accuracy with cut-off values varying between 0 and ∞. We also optimised the cut-off parameter applied in the Geoparser when retrieving multiple entries from the GeoNames database for one location mention. Figure 4 shows the results we obtained for 9 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin varying the cut-off between 1 and ∞. In the default settings, the cut-off was set to 20. The graph illustrates that selecting the first entry extracted from the GeoNames database is not an adequate method to perform georesolution. Not applying a cut-off and considering all possible locations for a given mention also does not result in an optimal performance and it means the resolver needs to work a lot harder when ranking the candidates returned for highly ambiguous location names. The best performance for both types of evaluation is achieved when limiting the number of entries returned from the GeoNames database to 15 before ranking. This results in an overall accuracy of 83.9% for exact match evaluation and 85.1% for evaluation within a 5 km radius. Given the quality of the OCRed text and the historical nature of the Trading Consequences data, these scores are surprisingly high. To put them into perspective, Catherine D’Ignazio et al. (2014) report georesolution scores of 96.3% using the Yahoo Placespotter, 90.3% using OpenCalais and 89.9% using CLAVIN when processing modern news article data from the New York Times, Huffington Post and the BBC. 33 (2) The GAP Project: Georeferencing Classical Texts In 2010, the Language Technology Group was approached by the Google Ancient Places (GAP) team who were looking for a tool capable of georeferencing English translations of Greek and Ro- man classical texts, available as Google Books. The GAP project, 34 funded under the Google Dig- ital Humanities programme, aimed to identify place name references in works such as Herodotus’ Histories, Livy’s History of Rome and Tacitus’ Annals, and create a map-based visualisation tool to be used by students and researchers of the ancient world. This project was the beginning of a collaboration with the members of the GAP team that has spanned several related projects and is still continuing. 3,35,36 The team is international and interdisciplinary, comprising specialists from classics, archaeology, language engineering and visualisation. The first adaptation needed was to enable the Geoparser to use a gazetteer of the ancient rather than the modern world, namely Pleiades, 37 a freely available scholarly resource run by Sean Gillies and Tom Elliott, of the Institute for the Study of the Ancient World at New York University. The Pleiades team allowed us to take a copy of their entire dataset, which we turned into a relational 10 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin database with a schema approximately mirroring that of GeoNames, as this minimised the cus- tomisation required in the Geoparser code. Drawing on the expert knowledge of the classicists on the GAP team, the dataset was expanded to create “Pleiades+” by matching, where possible, the ancient places to their modern equivalents in GeoNames. This provided much more precise latitude/longitude positioning and also added alter- native spellings or representations of the place-names in many cases. At run time we introduced a further enhancement using GeoNames (the “Pleiades++” step), for cases where a place name candidate found by the Geoparser was not present in Pleiades+. In these cases we checked the candidate against GeoNames, to collect alternative names that could then be sourced in Pleiades+. In all cases Pleiades+ was the sole source for successful candidate place names, as we only want places existing in the ancient world. An example may make the Pleiades++ step clearer. Trans- lators will often replace the names of well known places with their modern equivalents, so a Google Book text in translation might mention Egypt. However, Pleiades only contains Aegyptus, the equivalent ancient name. Looking up Egypt in GeoNames produces Aegyptus as one of the alternative names, and hence we are led to the correct entry in Pleiades+. This project raised other issues that are relevant to how feasible it is to adapt the Geoparser for widely varying texts. Just as in the Trading Consequences work described above, it proved nec- essary to disambiguate personal and location names. In the geotagging phase of the Geoparser pipeline, lexicon lists of personal names and location names are used to help determine whether a candidate entity should be categorised as a place or a person. For the GAP project both of these lexicons had to be tailored for ancient texts. For example, Paris, Priam and Medea are obviously people in this context, whereas in a modern text they are probably places. This means not only that suitable lexicons of common ancient personal names had to be used but that the standard lexicons in the Geoparser had to be switched off as they reduced classification performance when included. The input texts for GAP were mainly Google Books, though some Open Library 38 texts were in- cluded to test the adaptability of the pipeline. These texts are typically quite untidy, being scanned and OCRed on a large scale. Some pre-processing was done to remove extraneous characters and the books were divided into smaller chunks (typically chapters). These processes were made as 11 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin generic as possible, but it is difficult to split an arbitrary text into smaller pieces in a coherent man- ner without some hand-tailoring. The successor projects to GAP wish to process complete books with minimal user intervention, which raises yet further questions. As explained in the introduc- tion, the clustering algorithms of the georesolution step may not be appropriate if the context is unreasonably large: an entire book rather than a single chapter, say. Because the GAP project worked with raw unannotated text it was not possible to produce norma- tive evaluations of the geoparser’s performance over the texts processed, nor was such evaluation one of the objectives of this humanities project. However, some form of benchmark was required, in order to test improvements during the configuration phase. For this we used the output of an earlier project, Hestia 39,40 to gauge accuracy over comparable ancient text. The Hestia project used a hand-annotated version of Herodotus’ Histories from the Perseus Digital Library. 41 The precision and recall scores for place name recognition over this text were 77.74% and 95.58% respectively, giving F-score of 85.74%. 42 It was only possible to evaluate the geoparser’s first step of geotagging by this method, as we had no gold standard for the georesolution step. One of the products of GAP was the GapVis online interface 43 illustrated in Figure 5. This presents a selection of classical texts and is intended to assist scholarly interpretation of the ancient world. The user can choose from the “Book Summary” or “Reading” views, or examine a chosen place in detail. The summary view shows the distribution of place names throughout the text, giving an overview of the key locations relevant to the text. The Reading view is that shown in Figure 5, where the text is presented beside a map showing the locations of places mentioned. A scrolling bar beneath the map allows the user to move forwards and backwards through the pages of the text, seeing the places come in and out of focus on the map as they are mentioned in the narrative. The “Place Detail” option gives a network diagram showing possible relationships between the chosen location and others, based on co-occurrence frequency of the place names in a moving text window of a fixed size. The interface builds on earlier visualisation work in the Hestia project. The GapVis interface has recently been evaluated qualitatively by using it in an undergraduate course on the Ancient World at the University of Texas. The georeferenced text makes it possi- ble to set student exercises with detailed questions about where events happened – questions it 12 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin Figure 5. The GapVis web interface. would be unreasonable to expect to be answered following a traditional book-based reading of the text. Detailed analyses of the results of this case study have been published on the Hestia project blog. 44–46 These posts provide valuable insights into the advantages and shortcomings of auto- mated spatial annotation such as geoparsing, from a humanities perspective. As with all software projects involving a user interface, it is proving difficult to test the underlying functionality as distinct from the user experience – many of the students’ queries relate to issues that are not part of the research project, such as problems with operating the interface on a touch-screen. This set of projects has been an interesting application of the Geoparser. The priorities of a hu- manities led project have been different, with less interest in formal performance against gold standards, and more in practical use in real-world situations. The fact that high-performing au- tomatic text-processing tools typically achieve precision and recall scores somewhere in the 80s means that up to 20% of the target is mis-identified, and this inaccuracy is sometimes hard for inexperienced users to deal with. Even sophisticated users tend to expect the results shown on screen to be totally correct, and may lose confidence in the entire methodology if they spot an obvious error. 13 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin (3) The DEEP Project: Georeferencing Historical English Place-names The Digital Exposure of English Place-names project (2011-2013) was a JISC-funded collabora- tion between ourselves, the Institute for Name-Studies in Nottingham, the Centre for Data Digiti- sation and Analysis in Belfast, the Centre for e-Research at King’s College London and EDINA. The project has digitised all 86 volumes of the Survey of English Place-Names (SEPN), the ulti- mate authority on historic place-names in England. These volumes were compiled over a period of nine decades by the English Place-Name Society and work is still ongoing on outstanding coun- ties. One outcome of the DEEP project is an immensely detailed historical gazetteer for most of the counties in England which can be accessed as a gazetteer service via EDINA’s Unlock. It can be also browsed and searched independently. 47 As described earlier EDINA also hosts Unlock Text, a means to access the Edinburgh Geoparser, and we have modified the Geoparser to allow georeferencing of historical documents against the DEEP gazetteer. 48 The Edinburgh Geoparser has thus been used in two ways in the project, firstly to assist in calculating coordinates for all the parishes and other place-names in the DEEP gazetteer, and secondly to allow access to the resulting gazetteer for historical text georeferencing. In the following sections we describe the modifications needed for each of these in turn. A. Adding Georeferences to the DEEP Data The LTG’s main role in the DEEP project was to transform the output of the OCR process into structured data which can be used for a variety of purposes. Our focus in this section is on the use we have made of the Edinburgh Geoparser to assign georeferences to DEEP place-names and to provide links between historical gazetteer records and their counterparts in the Unlock and GeoNames gazetteers. The first county survey to be published by SEPN was Buckinghamshire in 1925 with the remaining eighty plus volumes appearing regularly up until the present day. The surveys follow broadly the same format but their appearance over such a long time-span means that there is considerable variation in the type, amount and formatting of information in the volumes. Given the nature and layout of the text, the geotagging component of the Edinburgh Geoparser would have been inappropriate for identifying the place-names so we have instead developed specialised rule sets 14 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin for identifying all the relevant pieces of information in the SEPN volumes. We make extensive use of an adapted version of the georesolution component. Figure 6. Excerpt from Survey for Fleet in Dorset. An typical entry from one of the most recent volumes (Dorset Part 4 published in 2010) is shown in Figure 6. This is the entry for the township of Fleet in the parish of the same name. An OS grid reference (SY 634805) is provided. The entry starts with a list of historical variants of the name where each variant is associated with at least one attestation indicating a historical source in which the name occurred and the date of that source. Thus the first attestation for Flete shows it occurring in the Domesday Book (DB) in 1086. It occurs in several other sources up to the last one in 1428 in a source abbreviated ‘FA’ (Feudal Aids in the Public Record Office). The entry goes on to discuss etymology and then lists smaller places in the vicinity, including East & West Fleet, the inlet alluded to in the extract in Figure 6, Bagwell Barn, Bagwell Barn Cottages, Crook Hill and Fleet Common. After that there is a list of modern field-names followed by a list of historical field-names. Dated, attested historical variants of modern names are provided at all levels from county name through hundreds/wards/wapentakes etc., to parishes, townships, minor names, street names and field-names. These historical names are converted to records in the DEEP gazetteer along with their date, source and latitude/longitude. The modern names in SEPN are also included in the DEEP gazetteer. 15 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin The example given above is one where the volume itself provides authoritative georeferencing but, in fact, only a minority of SEPN volumes contain grid references. There is however, a sec- ond authoritative source of this information, the Key to English Place-Names (KEPN) database developed and maintained by INS. 49 In creating the DEEP gazetteer we have used the Edinburgh Geoparser to aggregate information from the volumes, the KEPN database, Unlock and GeoN- ames in order to provide highly accurate, multi-faceted georeferencing focused on the parishes and the major places within them. By preserving the containment relationships between the larger and smaller places, we can allow smaller places without authoritative georeferences to share the georeference of their containing place. Figure 7. Extract of MADS record for Fleet. 16 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin The SEPN-text-to-structured-data process results in output files in MADS. 50 A cut-down version of the MADS for the example in Figure 6 is shown in Figure 7. The elements in the element contain the georeferencing for the subparish Fleet, and for its historical variants. This place has the maximum number of elements: one derived from the grid reference in Figure 6 (source=“epns”), one derived from the KEPN database (source=“kepn”), and two more created by using the Geoparser to select the most likely records from Unlock and GeoNames (source=“unlock” and source=“geonames”). The coordinates are all slightly different but they each approximate the position of the historical names associated with Fleet. When the DEEP data is ingested into the Unlock service, one of the sets of coordinates has to be treated as primary, and the preference order for selecting the source of the primary coordinates is epns, kepn, unlock, geonames. In cases where there are no elements, an entry is given the coordinates of the closest containing element in the hierarchy. Note that the presence of coordinates from multiple sources provides a sort of linking between the sources and it would be relatively straightforward to convert the MADS format of the DEEP data into proper linked data. In order to achieve multiple georeferencing, we needed to make a number of extensions to the Geoparser for the DEEP system, including implementing a mapping from modern OS grid ref- erences as well as older OS sheet-number grid references to latitude/longitude coordinates. We have implemented "known-lat", "known-long" and "known-gridref" parameters and heuristics to allow the georesolution component to be provided with known coordinates and to weight the rank- ing of Unlock or GeoNames records to strongly prefer those close to the known coordinates. In addition, we have extended the gazetteer look-up output to include information about distance to the known coordinates so that we can discard any Unlock or Geonames records that are not within a reasonable distance of the KEPN record that is our authoritative source of information. In this way we compute highly accurate links between the historical gazetteer and entries in modern gazetteers. Moreover, where KEPN lacks information, the links to Unlock obtained via geores- olution can provide the missing information and for smaller places can sometimes provide more accurate coordinates. 17 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin B. Using the DEEP Gazetteer in the Geoparser In order to use the DEEP gazetteer as the source of information for georeferencing historical documents, it was necessary to make alterations to both the place-name recognition and the geo- resolution components of the Geoparser. To give a flavour of some of the issues involved we illustrate the discussion of this work with reference to Figure 8. This figure shows a visualisation of the results of the Geoparser on an input text which is a sample taken from Farrer and Curwen (1923), 51 a collection of summaries and transcripts of documents for townships of the parish of Kendal, accessed via British History Online. 52 Figure 8. Visualisation of Geoparser output for subsection of Records relating to the Barony of Kendale. The place-name recognition component needs to be able to recognise DEEP historical names in English historical texts, for example Banerhowe and Hoggehalebek in Figure 8. These names do not occur in the lists of modern place-names that we use in the location recognition part of the Geoparser and if run in ‘modern’ mode, many of the places are not recognised. In addition, 18 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin many subparts of the person names are mistaken for place-names. To address these issues, we first converted all of the DEEP modern and historical names into a lexicon to be used by the recogniser. This step is similar to the way lexicons had to be tailored for ancient texts in the GAP project described above. We used the DEEP lexicon instead of the other location lexicons but left the remainder of the NER component in place. As with the other projects described here, we found it essential to recognise person names in tandem with locations and we also tailored the person name rules to deal more effectively with names such as Walter de Lyndesey and Peter de Brus. As can be seen from the lower left frame in Figure 8, many of the place-names have been recognised, but some have not. The names Foulbarg, Wodewardehowe, Thwaytlenkyld and Hethementer are all field-names in the relevant SEPN volume (vol. XLII, part 1 of Westmorland). Fayrhayt and Whystoner have been missed by the recogniser. The SEPN volume, and therefore the DEEP data, has the field Fayrhayk, instead of Fayrhayt. Whystoner is not in SEPN but there is a field Whystan’ mentioned in the same section as Wodewardehowe. The georesolution component looks up the recognised names in the Unlock ingest of the DEEP gazetteer, accessed through the same API as is used for Unlock but with “gazetteer=deep” as part of the query. The run shown in Figure 8 used a prototype version of DEEP in Unlock which does not include the field-names, so they have not been georeferenced. In the visualisation these are the location mentions without links, as links are created from the relevant placenames.org.uk URL returned as part of the response from Unlock. We have implemented a new feature to be used with the DEEP gazetteer that allows the user to specify which SEPN county (or counties) the document is about. In our example we specified “Westmorland” and this caused the gazetteer look-up to reject any records outside of this area. If the user does not wish to use such an absolute constraint, the alternative is to use the standard Geoparser mechanism of weighting more highly those entries which are inside a bounding circle or bounding box. The place-names in our example are so distinct that there is very little ambiguity for the geo- resolution component to resolve. Staveley matches more than one Westmorland record, Staveley Chapelry and the settlements Over Staveley and Nether Staveley as well as the minor places Stave- ley stone, Staveley Head Fell, Staveley Park and Staveley-gate. Levens appears twice with the 19 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin same coordinates as it occurs in the SEPN volume as both a modern name and a recorded histori- cal variant of that modern name (in 1352 and 1376, Inquisitions post mortem). A version of the Geoparser adapted to use DEEP is accessible in Unlock Text. We have attempted to fine-tune it on the basis of a small number of test documents chosen because they are among the sources cited by the SEPN editors and are therefore known to contain historical names. It has not been possible to perform a formal evaluation of this version of the Geoparser though we suspect that the range of possible historical input documents is so wide that a one-size-fits-all version in Unlock Text is unlikely to lead to high performance for many users. It may be necessary for users to adapt the Geoparser source for their own needs and they may also benefit from using it in an assisted-curation scenario where the output is manually post-edited. SUMMARY AND CONCLUSION The Edinburgh Geoparser has been in development for a number of years and has now become a practical and useful tool for georeferencing many kinds of texts. As the back-end to Unlock Text it is now available to a wide range of users. The API for Unlock Text is evolving in response to requests from projects such as GAP and more of the underlying functionality is gradually being exposed in the API. However, the Geoparser itself is evolving as we, its developers, put it to use in various projects, as illustrated above. It is becoming clear that customisation of the Geoparser is frequently needed to achieve optimal performance in a particular context and this means that there is an issue as to how we can provide a tool that meets everybody’s needs. As we take development forward we will need to address this issue. However, the Edinburgh Geoparser has shown its flexibility over very disparate texts and we are optimistic that future versions will continue to support scholars working with a range of texts. The need to disambiguate places from people in different types of text has been found to be an important step throughout our research. Not every geoparsing system may be set up to deal with this task prior to georesolution. We can conclude that producing a general purpose geoparsing tool that works "off the shelf" with any type of text is difficult given the current state-of-the-art. Developing a geoparser which can be easily adapted to new domains and types of text by users who do not always want to delve deep into the code is therefore crucial for such technology to be used widely in new and 20 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin emerging digital humanities research. As there are many benefits of geoparsing texts, it is starting to be recognised as an important method to analyse text in humanities and social science research. Locations are key for connecting separate datasets and can add a new dimension to longitudinal studies. Plotting place mentions on a map givers users a visual connection between quite separate source documents. Geoparsing can also be a very efficient shortcut to linking big datasets, which is notoriously challenging to achieve through close reading of documents, even for domain experts. ACKNOWLEDGEMENTS We are greatly indebted to our project partners on Trading Consequences, GAP and DEEP and the project funders (Jisc, AHRC, SSHRC and Google) for making this research possible. Further information can be found on each respective project website. 19,34,47 END NOTES 1 C. Grover, S. Givon, R. Tobin, and J. Ball. Named entity recognition for digitised historical texts. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), pages 1343–1346, Marrakech, Morocco, 2008. 2 C. Grover, R. Tobin, K. Byrne, M. Woollard, J. Reid, S. Dunn, and J. Ball. Use of the Edinburgh Geoparser for georeferencing digitised historical collections. Philosophical Transactions of the Royal Society A, 368(1925):3875–3889, 2010. 3 L. Isaksen, E. Barker, E.C. Kansa, and K. Byrne. GAP: A NeoGeo Approach to Classical Resources. Leonardo Transactions, 45(1), 2011. 4 B. Alex and C. Grover. Labelling and spatio-temporal grounding of news events. In Proceedings of the workshop on Computational Linguistics in a World of Social Media at NAACL 2010, Los Angeles, CA, 2010. 5 K. H. Leetaru. Fulltext geocoding versus spatial metadata for large text archives: Towards a geographi- cally enriched wikipedia. D-Lib Magazine, 18(9/10), 2012. 6 Cartographic Location And Vicinity INdexer (CLAVIN). http://clavin. bericotechnologies.com. 7 OpenCalais. http://www.opencalais.com. 8 Yahoo PlaceSpotter,Yahoo BOSS Geo Services. http://developer.yahoo.com/boss/geo. 9 R. Tobin, C. Grover, K. Byrne, J. Reid, and J. Walsh. Evaluation of georeferencing. In Proceedings of 21 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin the 6th Workshop on Geographic Information Retrieval, GIR ’10, pages 7:1–7:8, New York, US, 2010. ACM. 10 I. Mani, J. Hitzeman, J. Richer, D. Harris, R. Quimby, and B. Wellner. SpatialML: Annotation scheme, corpora, and tools. In Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), 2008. 11 Edinburgh Language Technology Group. http://www.ltg.ed.ac.uk. 12 Unlock Text. http://edina.ac.uk/unlock/texts. 13 LT-XML2. http://www.ltg.ed.ac.uk/software/ltxml2. 14 LT-TTT2. http://www.ltg.ed.ac.uk/software/lt-ttt2. 15 J.R. Curran and S. Clark. Investigating GIS and smoothing for maximum entropy taggers. In Pro- ceedings of the 11th Meeting of the European Chapter of the Association for Computational Linguistics (EACL-03), pages 91–98. Budapest, Hungary, 2003. 16 G. Minnen, J. Carroll, and D. Pearce. Robust, applied morphological generation. In Proceedings of the 1st International Natural Language Generation Conference, Mitzpe Ramon, Israel, 2000. 17 GeoNames. http://www.geonames.org. 18 Unlock. http://edina.ac.uk/unlock. 19 Trading Consequences. http://tradingconsequences.blogs.edina.ac.uk. 20 U. Hinrichs, B. Alex, J. Clifford, and A. Quigley. Trading Consequences: A Case Study of Combin- ing Text Mining and Visualisation to Facilitate Document Exploration. In Proceedings of DH2014. Lausanne, Switzerland, 2014. 21 Trading Consequences’ Interlinked Visualisation. http://http://tcqdev.edina.ac.uk/ vis/tradConVis. 22 House of Commons Parliamentary Papers, ProQuest. http://parlipapers.chadwyck.co. uk/home.do. 23 Early Canadiana Online. http://eco.canadiana.ca. 24 Foreign and Commonwealth Office Collection, JSTOR. http://www.jstor.org. 25 Confidential Print collections, Adam Matthrew. http://www.amdigital.co.uk. 26 Directors’ Correspondence Collection from the Archives, Kew Gardens, available at JSTOR Global Plants. http://plants.jstor.org. 27 B. Alex, Byrne K, C. Grover, and R. Tobin. A web-based geo-resolution annotation and evaluation tool. In Proceedings of the 8th Linguistic Annotation Workshop (LAW VIII). Dublin, Ireland, 2014. 28 Commodity. A commodity is defined as a natural resource or a lightly processed product. 22 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin 29 B. Alex and J. Burns. Estimating and rating the quality of optically character recognised text. In Pro- ceedings of the First International Conference on Digital Access to Textual Cultural Heritage (DATeCH 2014), pages 97–102, Madrid, Spain, 2014. ACM. 30 Ships’ reports. Return to an order of the Honourable the House of Commons, dated 31 May 1836;–for, a return of the number of ships’ reports that required amendment during the two years ending 5th January 1836; the date of each ship’s arrival; and the date at which the amended report was completed; stating the nature of the error in each case. In House of Commons Parliamentary Papers, 1836. Document id: 1836-016588. 31 Ports gazetteer used in Trading Consequences available on GitHub. https://github.com/ digtrade/digtrade/blob/master/lexical-resources/ports.csv. 32 F. Miltoun. Ships and Shipping. Alexander Moring Ltd., De La More Press, 1903. 33 C. D’Ignazio, R. Bhargava, and E. Zuckerman. CLIFF-CLAVIN: Determining Geographic Focus for News Articles. In Proceedings of NewsKDD 2014. New York, US, 2014. 34 Google Ancient Places. http://googleancientplaces.wordpress.com. 35 L. Isaksen, E. Barker, E.C. Kansa, and K. Byrne. Googling Ancient Places. In Proceedings of Digital Humanities 2011 (DH2011), Stanford, CA, 2011. 36 E. Barker, K. Byrne, L. Isaksen, E. Kansa, and N. Rabinowitz. The Geographic Annotation Platform – a Framework for Unlocking the Places in Free-text Corpora. In NeDiMAH workshop at Digital Human- ities 2012 Conference (DH2012), Hamburg, Germany, 2012. 37 Pleiades. http://pleiades.stoa.org/home. 38 Open Library. http://openlibrary.org. 39 The Herodotus Encoded Space-Text-Image Archive. http://hestia.open.ac.uk. 40 E. Barker, S. Bouzarovski, C. Pelling, and L. Isaksen. Mapping an Ancient Historian in a Digital Age: the Herodotus Encoded Space-Text-Image Archive (HESTIA). Leeds International Classical Journal, 9:1–24, 2010. 41 Perseus Digital Library. http://www.perseus.tufts.edu/hopper. 42 K. Byrne. Matching lexicons to gazetteers. GAP project blog post. April 18, 2011. http://googleancientplaces.wordpress.com/2011/04/18/ matching-lexicons-to-gazetteers. 43 GapVis online interface. http://nrabinowitz.github.io/gapvis. 44 Reading Herodotus spatially in the undergraduate classroom, Part I. Hes- tia project blog post. June 5, 2014. http://hestia.open.ac.uk/ 23 Beatrice Alex, Kate Byrne, Claire Grover and Richard Tobin reading-herodotus-spatially-in-the-undergraduate-classroom-part-i. 45 Reading Herodotus spatially in the undergraduate classroom, Part II. Hes- tia project blog post. June 8, 2014. http://hestia.open.ac.uk/ reading-herodotus-spatially-in-the-undergraduate-classroom-part-ii. 46 Reading Herodotus spatially in the undergraduate classroom, Part III. Hes- tia project blog post. June 22, 2014. http://hestia.open.ac.uk/ reading-herodotus-spatially-in-the-undergraduate-classroom-part-iii. 47 The Historical Gazetteer of England’s Place-Names. http://placenames.org.uk. 48 C. Grover and R. Tobin. A gazetteer and georeferencing for historical English documents. In Pro- ceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2014), pages 119–127, Gothenburg, Sweden, 2014. 49 Key to English Place-Names (KEPN). http://kepn.nottingham.ac.uk. 50 Metadata Authority Description Standard (MADS). http://www.loc.gov/standards/mads/ mads-doc.html. 51 W. Farrer and J.F. Curwen, editors. Records relating to the Barony of Kendale: volume 1. Cumberland and Westmorland Antiquarian and Archaeological Society, 1923. 52 Records relating to the Barony of Kendale: volume 1. Available at British History Online. http: //www.british-history.ac.uk/report.aspx?compid=49295. 24