Design Challenges for Entity Linking Xiao Ling Sameer Singh University of Washington, Seattle WA {xiaoling,sameer,weld}@cs.washington.edu Daniel S. Weld Abstract Recent research on entity linking (EL) has in- troduced a plethora of promising techniques, ranging from deep neural networks to joint in- ference. But despite numerous papers there is surprisingly little understanding of the state of the art in EL. We attack this confusion by analyzing differences between several versions of the EL problem and presenting a simple yet effective, modular, unsupervised system, called VINCULUM, for entity linking. We con- duct an extensive evaluation on nine data sets, comparing VINCULUM with two state-of-the- art systems, and elucidate key aspects of the system that include mention extraction, candi- date generation, entity type prediction, entity coreference, and coherence. 1 Introduction Entity Linking (EL) is a central task in information extraction — given a textual passage, identify entity mentions (substrings corresponding to world entities) and link them to the corresponding entry in a given Knowledge Base (KB, e.g. Wikipedia or Freebase). For example, JetBlue begins direct service between Barnstable Airport and JFK International. Here, “JetBlue” should be linked to the en- tity KB:JetBlue, “Barnstable Airport” to KB:Barnstable Municipal Airport, and “JFK International” to KB:John F. Kennedy International Airport1. The links not only 1We use typewriter font, e.g., KB:Entity, to indicate an entity in a particular KB, and quotes, e.g., “Mention”, to denote textual mentions. provide semantic annotations to human readers but also a machine-consumable representation of the most basic semantic knowledge in the text. Many other NLP applications can benefit from such links, such as distantly-supervised relation extraction (Craven and Kumlien, 1999; Riedel et al., 2010; Hoffmann et al., 2011; Koch et al., 2014) that uses EL to create training data, and some coreference systems that use EL for disambiguation (Hajishirzi et al., 2013; Zheng et al., 2013; Durrett and Klein, 2014). Unfortunately, in spite of numerous papers on the topic and several published data sets, there is surprisingly little understanding about state-of-the-art performance. We argue that there are three reasons for this con- fusion. First, there is no standard definition of the problem. A few variants have been studied in the liter- ature, such as Wikification (Milne and Witten, 2008; Ratinov et al., 2011; Cheng and Roth, 2013) which aims at linking noun phrases to Wikipedia entities and Named Entity Linking (aka Named Entity Dis- ambiguation) (McNamee and Dang, 2009; Hoffart et al., 2011) which targets only named entities. Here we use the term Entity Linking as a unified name for both problems, and Named Entity Linking (NEL) for the subproblem of linking only named entities. But names are just one part of the problem. For many variants there are no annotation guidelines for scor- ing links. What types of entities are valid targets? When multiple entities are plausible for annotating a mention, which one should be chosen? Are nested mentions allowed? Without agreement on these is- sues, a fair comparison is elusive. Secondly, it is almost impossible to assess ap- proaches, because systems are rarely compared using the same data sets. For instance, Hoffart et al. (2011) 315 Transactions of the Association for Computational Linguistics, vol. 3, pp. 315–328, 2015. Action Editor: Kristina Toutanova. 
Submission batch: 11/2014; Revision batch 3/2015; Published 6/2015. c©2015 Association for Computational Linguistics. Distributed under a CC-BY-NC-SA 4.0 license. developed a new data set (AIDA) based on the CoNLL 2003 Named Entity Recognition data set but failed to evaluate their system on MSNBC previ- ously created by (Cucerzan, 2007); Wikifier (Cheng and Roth, 2013) compared to the authors’ previous system (Ratinov et al., 2011) using the originally se- lected datasets but didn’t evaluate using AIDA data. Finally, when two end-to-end systems are com- pared, it is rarely clear which aspect of a system makes one better than the other. This is especially problematic when authors introduce complex mech- anisms or nondeterministic methods that involve learning-based reranking or joint inference. To address these problems, we analyze several sig- nificant inconsistencies among the data sets. To have a better understanding of the importance of various techniques, we develop a simple and modular, un- supervised EL system, VINCULUM. We compare VINCULUM to the two leading sophisticated EL sys- tems on a comprehensive set of nine datasets. While our system does not consistently outperform the best EL system, it does come remarkably close and serves as a simple and competitive baseline for future re- search. Furthermore, we carry out an extensive ab- lation analysis, whose results illustrate 1) even a near-trivial model using CrossWikis (Spitkovsky and Chang, 2012) performs surprisingly well, and 2) in- corporating a fine-grained set of entity types raises that level even higher. In summary, we make the following contributions: • We analyze the differences among several versions of the entity linking problem, compare existing data sets and discuss annotation inconsistencies between them. (Sections 2 & 3) • We present a simple yet effective, modular, unsu- pervised system, VINCULUM, for entity linking. We make the implementation open source and pub- licly available for future research.2 (Section 4) • We compare VINCULUM to 2 state-of-the-art sys- tems on an extensive evaluation of 9 data sets. We also investigate several key aspects of the system including mention extraction, candidate genera- tion, entity type prediction, entity coreference, and coherence between entities. (Section 5) 2http://github.com/xiaoling/vinculum 2 No Standard Benchmark In this section, we describe some of the key differ- ences amongst evaluations reported in existing litera- ture, and propose a candidate benchmark for EL. 2.1 Data Sets Nine data sets are in common use for EL evaluation; we partition them into three groups. The UIUC group (ACE and MSNBC datasets) (Ratinov et al., 2011), AIDA group (with dev and test sets) (Hoffart et al., 2011), and TAC-KBP group (with datasets rang- ing from the 2009 through 2012 competitions) (Mc- Namee and Dang, 2009). Their statistics are summa- rized in Table 1 3. Our set of nine is not exhaustive, but most other datasets, e.g. CSAW (Kulkarni et al., 2009) and AQUAINT (Milne and Witten, 2008), annotate com- mon concepts in addition to named entities. As we argue in Sec. 3.1, it is extremely difficult to define an- notation guidelines for common concepts, and there- fore they aren’t suitable for evaluation. For clarity, this paper focuses on linking named entities. Sim- ilarly, we exclude datasets comprising Tweets and other short-length documents, since radically differ- ent techniques are needed for the specialized corpora. 
Table 2 presents a list of recent EL publications showing the data sets that they use for evaluation. The sparsity of this table is striking — apparently no system has reported the performance data from all three of the major evaluation groups. 2.2 Knowledge Base Existing benchmarks have also varied considerably in the knowledge base used for link targets. Wikipedia has been most commonly used (Milne and Wit- ten, 2008; Ratinov et al., 2011; Cheng and Roth, 2013), however datasets were annotated using dif- ferent snapshots and subsets. Other KBs include Yago (Hoffart et al., 2011), Freebase (Sil and Yates, 2013), DBpedia (Mendes et al., 2011) and a subset of Wikipedia (Mayfield et al., 2012). Given that al- most all KBs are descendants of Wikipedia, we use Wikipedia as the base KB in this work.4 3An online appendix containing details of the datasets is avail- able at https://github.com/xiaoling/vinculum/ raw/master/appendix.pdf. 4Since the knowledge bases for all the data sets were around 2011, we use Wikipedia dump 20110513. 316 http://github.com/xiaoling/vinculum https://github.com/xiaoling/vinculum/raw/master/appendix.pdf https://github.com/xiaoling/vinculum/raw/master/appendix.pdf Group Data Set # of Mentions Entity Types KB # of NILs Eval. Metric UIUC ACE 244 Any Wikipedia Topic Wikipedia 0 BOC F1 MSNBC 654 Any Wikipedia Topic Wikipedia 0 BOC F1 AIDA AIDA-dev 5917 PER,ORG,LOC,MISC Yago 1126 Accuracy AIDA-test 5616 PER,ORG,LOC,MISC Yago 1131 Accuracy TAC KBP TAC09 3904 PERT ,ORGT ,GPE TAC ⊂ Wiki 2229 Accuracy TAC10 2250 PERT ,ORGT ,GPE TAC ⊂ Wiki 1230 Accuracy TAC10T 1500 PERT ,ORGT ,GPE TAC ⊂ Wiki 426 Accuracy TAC11 2250 PERT ,ORGT ,GPE TAC ⊂ Wiki 1126 B3+ F1 TAC12 2226 PERT ,ORGT ,GPE TAC ⊂ Wiki 1049 B3+ F1 Table 1: Characteristics of the nine NEL data sets. Entity types: The AIDA data sets include named entities in four NER classes, Person (PER), Organization (ORG), Location (LOC) and Misc. In TAC KBP data sets, both Person (PERT ) and Organization entities (ORGT ) are defined differently from their NER counterparts and geo-political entities (GPE), different from LOC, exclude places like KB:Central California. KB (Sec. 2.2): The knowledge base used when each data was being developed. Evaluation Metric (Sec. 2.3): Bag-of-Concept F1 is used as the evaluation metric in (Ratinov et al., 2011; Cheng and Roth, 2013). B3+ F1 used in TAC KBP measures the accuracy in terms of entity clusters, grouped by the mentions linked to the same entity. Data Set ACE MSNBC AIDA-test TAC09 TAC10 TAC11 TAC12 AQUAINT CSAW Cucerzan (2007) x Milne and Witten (2008) x Kulkarni et al. (2009) x x Ratinov et al. (2011) x x x Hoffart et al. (2011) x Han and Sun (2012) x x He et al. (2013a) x x He et al. (2013b) x x x Cheng and Roth (2013) x x x x Sil and Yates (2013) x x x Li et al. (2013) x x Cornolti et al. (2013) x x x TAC-KBP participants x x x x Table 2: A sample of papers on entity linking with the data sets used in each paper (ordered chronologically). TAC-KBP proceedings comprise additional papers (McNamee and Dang, 2009; Ji et al., 2010; Ji et al., 2010; Mayfield et al., 2012). Our intention is not to exhaust related work but to illustrate how sparse evaluation impedes comparison. NIL entities: In spite of Wikipedia’s size, there are many real-world entities that are absent from the KB. When such a target is missing for a mention, it is said to link to a NIL entity (McNamee and Dang, 2009) (aka out-of-KB or unlinkable entity (Hoffart et al., 2014)). 
In the TAC KBP, in addition to deter- mining if a mention has no entity in the KB to link, all the mentions that represent the same real world entities must be clustered together. Since our focus is not to create new entities for the KB, NIL clustering is beyond the scope of this paper. The AIDA data sets similarly contain such NIL annotations whereas ACE and MSNBC omit these mentions altogether. We only evaluate whether a mention with no suitable entity in the KB is predicted as NIL. 2.3 Evaluation Metrics While a variety of metrics have been used for evalu- ation, there is little agreement on which one to use. However, this detail is quite important, since the choice of metric strongly biases the results. We de- scribe the most common metrics below. Bag-of-Concept F1 (ACE, MSNBC): For each document, a gold bag of Wikipedia entities is evalu- ated against a bag of system output entities requiring exact segmentation match. This metric may have its historical reason for comparison but is in fact flawed since it will obtain 100% F1 for an annotation in which every mention is linked to the wrong entity, but the bag of entities is the same as the gold bag. Micro Accuracy (TAC09, TAC10, TAC10T): For a list of given mentions, the metric simply measures 317 the percentage of correctly predicted links. TAC-KBP B3+ F1 (TAC11, TAC12): The men- tions that are predicted as NIL entities are required to be clustered according to their identities (NIL cluster- ing). The overall data set is evaluated using a entity cluster-based B3+ F1. NER-style F1 (AIDA): Similar to official CoNLL NER F1 evaluation, a link is considered correct only if the mention matches the gold boundary and the linked entity is also correct. A wrong link with the correct boundary penalizes both precision and recall. We note that Bag-of-Concept F1 is equivalent to the measure for Concept-to-Wikipedia task proposed in (Cornolti et al., 2013) and NER-style F1 is the same as strong annotation match. In the experiments, we use the official metrics for the TAC data sets and NER-style F1 for the rest. 3 No Annotation Guidelines Not only do we lack a common data set for evalua- tion, but most prior researchers fail to even define the problem under study, before developing algorithms. Often an overly general statement such as annotat- ing the mentions to “referent Wikipedia pages” or “corresponding entities” is used to describe which entity link is appropriate. This section shows that failure to have a detailed annotation guideline causes a number of key inconsistencies between data sets. A few assumptions are subtly made in different papers, which makes direct comparisons unfair and hard to comprehend. 3.1 Entity Mentions: Common or Named? Which entities deserve links? Some argue for re- stricting to named entities. Others argue that any phrase that can be linked to a Wikipedia entity adds value. Without a clear answer to this issue, any data set created will be problematic. It’s not fair to pe- nalize a NEL system for skipping a common noun phrases; nor would it be fair to lower the precision of a system that “incorrectly” links a common concept. However, we note that including mentions of com- mon concepts is actually quite problematic, since the choice is highly subjective. Example 1 In December 2008, Hoke was hired as the head football coach at San Diego State Uni- versity. (Wikipedia) At first glance, KB:American football seems the gold-standard link. 
However, there is another entity KB:College football, which is clearly also, if not more, appropriate. If one argues that KB:College football should be the right choice given the context, what if KB:College football does not exist in the KB? Should NIL be returned in this case? The question is unanswered.5 For the rest of this paper, we focus on the (better defined) problem of solely linking named entities.6 AQUAINT and CSAW are therefore not used for eval- uation due to an disproportionate number of common concept annotations. 3.2 How Specific Should Linked Entities Be? It is important to resolve disagreement when more than one annotation is plausible. The TAC- KBP annotation guidelines (tac, 2012) specify that different iterations of the same organization (e.g. the KB:111th U.S. Congress and the KB:112th U.S. Congress) should not be con- sidered as distinct entities. Unfortunately, this is not a common standard shared across the data sets, where often the most specific possible entity is preferred. Example 2 Adams and Platt are both injured and will miss England’s opening World Cup qualifier against Moldova on Sunday. (AIDA) Here the mention “World Cup” is labeled as KB:1998 FIFA World Cup, a specific occur- rence of the event KB:FIFA World Cup. It is indeed difficult to decide how specific the gold link should be. Given a static knowledge base, which is often incomplete, one cannot always find the most specific entity. For instance, there is no Wikipedia page for the KB:116th U.S. Congress be- cause the Congress has not been elected yet. On the other hand, using general concepts can cause troubles for machine reading. Consider president-of relation extraction on the following sentence. Example 3 Joe Biden is the Senate President in the 113th United States Congress. 5Note that linking common noun phrases is closely related to Word Sense Disambiguation (Moro et al., 2014). 6We define named entity mention extensionally: any name uniquely referring to one entity of a predefined class, e.g. a specific person or location. 318 Person Common Concepts E.g. Brain_Tumor, Desk, Water, etc. Misc.Organization Location TAC GPE (Geo- political Entities) TAC Organization TAC Person Figure 1: Entities divided by their types. For named enti- ties, the solid squares represent 4 CoNLL(AIDA) classes; the red dashed squares display 3 TAC classes; the shaded rectangle depicts common concepts. Failure to distinguish different Congress iterations would cause an information extraction system to falsely extracting the fact that KB:Joe Biden is the Senate President of the KB:United States Congress at all times! 3.3 Metonymy Another situation in which more than one annotation is plausible is metonymy, which is a way of referring to an entity not by its own name but rather a name of some other entity it is associated with. A common example is to refer to a country’s government using its capital city. Example 4 Moscow’s as yet undisclosed propos- als on Chechnya’s political future have , mean- while, been sent back to do the rounds of various government departments. (AIDA) The mention here, “Moscow”, is labeled as KB:Government of Russia in AIDA. If this sentence were annotated in TAC-KBP, it would have been labeled as KB:Moscow (the city) instead. Even the country KB:Russia seems to be a valid label. However, neither the city nor the country can ac- tually make a proposal. The real entity in play is KB:Government of Russia. 3.4 Named Entities, But of What Types? 
Even in the data sets consisting of solely named entities, the types of the entities vary and therefore the data distribution differs. TAC-KBP has a clear definition of what types of entities require links, namely Person, Organization and Geo-political entities. AIDA, which adopted the NER data set from the CoNLL shared task, includes entities from 4 classes, Person, Organization, Location and Misc.7 Compared to the AIDA entity types, it is obvious that TAC-KBP is more restrictive, since it does not have Misc. entities (e.g. KB:FIFA World Cup). Moreover, TAC entities don't include fictional characters or organizations, such as KB:Sherlock Holmes. TAC GPEs include some geographical regions, such as KB:France, but exclude those without governments, such as KB:Central California, or locations such as KB:Murrayfield Stadium.8 Figure 1 summarizes the substantial differences between the two type sets.

7 http://www.cnts.ua.ac.be/conll2003/ner/annotation.txt
8 http://nlp.cs.rpi.edu/kbp/2014/elquery.pdf

3.5 Can Mention Boundaries Overlap?

We often see one entity mention nested in another. For instance, a U.S. city is often followed by its state, such as “Portland, Oregon”. One can split the whole mention into individual ones, “Portland” for the city and “Oregon” for the city's state. AIDA adopts this segmentation. However, annotations in an early TAC-KBP dataset (2009) select the whole span as the mention. We argue that all three mentions make sense. In fact, knowing the structure of the mention would facilitate the disambiguation (i.e. the state name provides enough context to uniquely identify the city entity). Besides the mention segmentation, the links for the nested entities may also be ambiguous.

Example 5 Dorothy Byrne, a state coordinator for the Florida Green Party, said she had been inundated with angry phone calls and e-mails from Democrats, but has yet to receive one regretful note from a Nader voter.

The gold annotation from ACE is KB:Green Party of Florida even though the mention doesn't contain “Florida” and can arguably be linked to KB:US Green Party.

4 A Simple & Modular Linking Method

In this section, we present VINCULUM, a simple, unsupervised EL system that performs comparably to the state of the art. As input, VINCULUM takes a plain-text document d and outputs a set of segmented mentions with their associated entities $A_d = \{(m_i, l_i)\}$. VINCULUM begins with mention extraction. For each identified mention m, candidate entities $C_m = \{c_j\}$ are generated for linking. VINCULUM assigns each candidate a linking score $s(c_j|m,d)$ based on the entity type compatibility, its coreference mentions, and other entity links around this mention. The candidate entity with the maximum score, i.e. $l = \arg\max_{c \in C_m} s(c|m,d)$, is picked as the predicted link of m.

[Figure 2: The process of finding the best entity for a mention. All possible entities are sifted through as VINCULUM proceeds at each stage (candidate generation, entity type, coreference, coherence) with a widening range of context (mention phrase, sentence, document, world knowledge) in consideration.]

Figure 2 illustrates the linking pipeline that follows mention extraction. For each mention, VINCULUM ranks the candidates at each stage based on an ever widening context.
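To make this control flow concrete, the following is a minimal Python sketch of the sieve-style pipeline. It is not the released VINCULUM implementation; the four stage functions passed in (candidate_prior, type_probability, coref_representative, coherence_score) and the way their scores are combined are illustrative assumptions only.

```python
def link_mention(mention, sentence, document, candidate_prior,
                 type_probability, coref_representative, coherence_score):
    """Return the best entity for `mention`, or None for NIL.

    The four callables stand in for the pipeline stages and are assumptions of
    this sketch, not the released VINCULUM interfaces:
      candidate_prior(m)        -> {entity: p(c|m)}, e.g. from a CrossWikis-style dictionary
      type_probability(c, m, s) -> probability that c's type fits the mention in sentence s
      coref_representative(m,d) -> a less ambiguous mention from m's coreference cluster
      coherence_score(c, d)     -> agreement of c with the other entities linked in d
    """
    # Coreference: link via the representative mention of the cluster.
    m = coref_representative(mention, document) or mention

    # Candidate generation from the mention string alone.
    scores = dict(candidate_prior(m))
    if not scores:
        return None  # no candidate in the KB: predict NIL

    # Entity types: re-weight candidates by type compatibility with the sentence.
    for c in scores:
        scores[c] *= type_probability(c, m, sentence)

    # Coherence: re-rank using agreement with the document's other entity links.
    # (A simple average is used here; the paper's exact combination rule is not shown.)
    for c in scores:
        scores[c] = 0.5 * scores[c] + 0.5 * coherence_score(c, document)

    # l = argmax_c s(c | m, d)
    return max(scores, key=scores.get)
```

Because each stage only re-weights the surviving candidates, any stage can be dropped or swapped independently, which is what the ablations in Section 5 rely on.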
For example, candidate generation (Section 4.2) merely uses the mention string, entity typing (Section 4.3) uses the sentence, while coreference (Section 4.4) and coherence (Section 4.5) use the full document and Web respectively. Our pipeline mimics the sieve structure introduced in (Lee et al., 2013), but instead of merging coreference clusters, we adjust the probability of candidate entities at each stage. The modularity of VINCULUM enables us to study the relative impact of its subcomponents.

4.1 Mention Extraction

The first step of EL extracts potential mentions from the document. Since VINCULUM restricts attention to named entities, we use a Named Entity Recognition (NER) system (Finkel et al., 2005). Alternatively, an NP chunker may be used to identify the mentions.

4.2 Dictionary-based Candidate Generation

While in theory a mention could link to any entity in the KB, in practice one sacrifices little by restricting attention to a subset (dozens) precompiled using a dictionary. A common way to build such a dictionary D is by crawling Web pages and aggregating anchor links that point to Wikipedia pages. The frequency with which a mention (anchor text), m, links to a particular entity (anchor link), c, allows one to estimate the conditional probability p(c|m). We adopt the CrossWikis dictionary, which was computed from a Google crawl of the Web (Spitkovsky and Chang, 2012). The dictionary contains more than 175 million unique strings with the entities they may represent. In the literature, the dictionary is often built from the anchor links within the Wikipedia website (e.g., (Ratinov et al., 2011; Hoffart et al., 2011)).

In addition, we employ two small but precise dictionaries for U.S. state abbreviations and demonyms when the mention satisfies certain conditions. For U.S. state abbreviations, a comma before the mention is required. For demonyms, we ensure that the mention is either an adjective or a plural noun.

4.3 Incorporating Entity Types

For an ambiguous mention such as “Washington”, knowing that the mention denotes a person allows an EL system to promote KB:George Washington while lowering the rank of the capital city in the candidate list. We incorporate this intuition by combining it probabilistically with the CrossWikis prior:

$$p(c|m,s) = \sum_{t \in T} p(c,t|m,s) = \sum_{t \in T} p(c|m,t,s)\,p(t|m,s),$$

where s denotes the sentence containing this mention m and T represents the set of all possible types. We assume the candidate c and the sentential context s are conditionally independent if both the mention m and its type t are given. In other words, $p(c|m,t,s) = p(c|m,t)$, the RHS of which can be estimated by renormalizing $p(c|m)$ w.r.t. type t:

$$p(c|m,t) = \frac{p(c|m)}{\sum_{c \mapsto t} p(c|m)},$$

where $c \mapsto t$ indicates that t is one of c's entity types.9 The other part of the equation, $p(t|m,s)$, can be estimated by any off-the-shelf Named Entity Recognition system, e.g. Finkel et al. (2005) and Ling and Weld (2012).

9 We notice that an entity often has multiple appropriate types, e.g. a school can be either an organization or a location depending on the context. We use Freebase to provide the entity types and map them appropriately to the target type set.

4.4 Coreference

It is common for entities to be mentioned more than once in a document. Since some mentions are less ambiguous than others, it makes sense to use the most representative mention for linking. To this end, VINCULUM applies a coreference resolution system (e.g. Lee et al. (2013)) to cluster coreferent mentions.
The representative mention of a cluster is chosen for linking.10 While there are more sophisticated ways to integrate EL and coreference (Hajishirzi et al., 2013), VINCULUM's pipeline is simple and modular.

4.5 Coherence

When KB:Barack Obama appears in a document, it is more likely that the mention “Washington” represents the capital KB:Washington, D.C. as the two entities are semantically related, and hence the joint assignment is coherent. A number of researchers found inclusion of some version of coherence is beneficial for EL (Cucerzan, 2007; Milne and Witten, 2008; Ratinov et al., 2011; Hoffart et al., 2011; Cheng and Roth, 2013). For incorporating it in VINCULUM, we seek a document-wise assignment of entity links that maximizes the sum of the coherence scores between each pair of entity links predicted in the document d, i.e. $\sum_{1 \le i < j \le |A_d|} \mathrm{coh}(l_i, l_j)$.

CrossWikis provides high recall in candidate generation (> 80% recall for all but one data set).15 Note that CrossWikis itself can be used as a context-insensitive EL system by looking up the mention string and predicting the entity with the highest conditional probability. The second row in Table 4 presents the results using this simple baseline. CrossWikis alone, using only the mention string, has a fairly reasonable performance.

15 We also compared to another intra-Wikipedia dictionary (Table 3 in (Ratinov et al., 2011)). A recall of 86.85% and 88.67% is reported for ACE and MSNBC, respectively, at a cut-off level of 20. CrossWikis has a recall of 90.1% and 93.3% at the same cut-off.

Approach         TAC09  TAC10  TAC10T  TAC11  TAC12  AIDA-dev  AIDA-test  ACE   MSNBC
CrossWikis only   80.4   85.6   86.9    78.5   62.4    62.6      60.4     87.7   70.3
+NER              79.2   83.3   85.1    76.6   61.1    66.4      66.2     77.0   71.8
+FIGER            81.0   86.1   86.9    78.8   63.5    66.7      64.6     87.7   75.4
+NER(GOLD)        85.7   87.4   88.0    80.1   66.7    72.6      72.0     89.3   83.3
+FIGER(GOLD)      84.1   88.8   89.0    81.6   66.1    76.2      76.5     91.8   87.4

Table 4: Performance (%) after incorporating entity types, comparing two sets of entity types (NER and FIGER). Using a set of fine-grained entity types (FIGER) generally achieves better results.

5.3 Incorporating Entity Types

Here we investigate the impact of the entity types on the linking performance. The most obvious choice is the traditional NER types ($T_{\mathrm{NER}} = \{\text{PER, ORG, LOC, MISC}\}$). To predict the types of the mentions, we run Stanford NER (Finkel et al., 2005) and set the predicted type $t_m$ of each mention m to have probability 1 (i.e. $p(t_m|m,s) = 1$). As to the types of the entities, we map their Freebase types to the four NER types.16

A more appropriate choice is the 112 fine-grained entity types introduced by Ling and Weld (2012) in FIGER, a publicly available package.17 These fine-grained types are not disjoint, i.e. each mention is allowed to have more than one type. For each mention, FIGER returns a set of types, each of which is accompanied by a score, $t_{\mathrm{FIGER}}(m) = \{(t_j, g_j) : t_j \in T_{\mathrm{FIGER}}\}$. A softmax function is used to probabilistically interpret the results as follows:

$$p(t_j|m,s) = \begin{cases} \frac{1}{Z}\exp(g_j) & \text{if } (t_j, g_j) \in t_{\mathrm{FIGER}}(m), \\ 0 & \text{otherwise,} \end{cases} \qquad \text{where } Z = \sum_{(t_k, g_k) \in t_{\mathrm{FIGER}}(m)} \exp(g_k).$$

We evaluate the utility of entity types in Table 4, which shows that using NER typically worsens the performance. This drop may be attributed to the rigid binary values for type incorporation; it is hard to output the probabilities of the entity types for a mention given the chain model adopted in Stanford NER.
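To illustrate how the two formulas above fit together (the type-based renormalization of the prior from Section 4.3 and the softmax over FIGER scores), here is a small, self-contained sketch. The prior, the FIGER scores, and the entity-to-type map are toy values invented for the example, not output of CrossWikis, FIGER, or Freebase.

```python
import math

# Toy inputs standing in for the real resources: a CrossWikis-style prior
# p(c|m) for the mention "Washington", FIGER scores for the mention in its
# sentence, and Freebase-derived types for each candidate entity.
prior = {"George_Washington": 0.3, "Washington,_D.C.": 0.5, "Washington_(state)": 0.2}
figer_scores = {"/person": 2.0, "/location/city": 0.5}          # (t_j, g_j) pairs
entity_types = {
    "George_Washington": {"/person"},
    "Washington,_D.C.": {"/location/city"},
    "Washington_(state)": {"/location"},
}

# p(t|m,s): softmax over the scores FIGER returned (zero for all other types).
Z = sum(math.exp(g) for g in figer_scores.values())
p_type = {t: math.exp(g) / Z for t, g in figer_scores.items()}

# p(c|m,t): renormalize the prior over the candidates that carry type t.
def p_cand_given_type(c, t):
    denom = sum(prior[c2] for c2, ts in entity_types.items() if t in ts)
    return prior[c] / denom if t in entity_types[c] and denom > 0 else 0.0

# p(c|m,s) = sum_t p(c|m,t) * p(t|m,s)
p_final = {c: sum(p_cand_given_type(c, t) * p_type[t] for t in p_type) for c in prior}
print(max(p_final, key=p_final.get))  # "George_Washington": the person sense wins here
```

With these toy numbers the strong /person score overrides the dictionary prior, which is exactly the behavior the example at the start of Section 4.3 describes.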
We also notice that FIGER types consistently improve the results across the data sets, indicating that a finer-grained type set may be more suitable for the entity linking task. To further confirm this assertion, we simulate the scenario where the gold types are provided for each mention (the oracle types of its gold entity). The performance is significantly boosted with the assistance from the gold types, which suggests that a better performing NER/FIGER system can further improve performance. Similarly, we notice that the results using FIGER types almost consistently outperform the ones using NER types. This observation endorses our previous recommendation of using fine-grained types for EL tasks.

16 The Freebase types “/person/*” are mapped to PER, “/location/*” to LOC, “/organization/*” plus a few others like “/sports/sports team” to ORG, and the rest to MISC.
17 http://github.com/xiaoling/figer

5.4 Coherence

Two coherence measures suggested in Section 4.5 are tested in isolation to better understand their effects in terms of the linking performance (Table 5). In general, the link-based NGD works slightly better than the relational facts in 6 out of 9 data sets (comparing row “+NGD” with row “+REL”). We hypothesize that the inferior results of REL may be due to the incompleteness of Freebase triples, which makes it less robust than NGD. We also combine the two by taking the average score, which in most data sets performs the best (“+BOTH”), indicating that the two measures provide complementary sources of information.

5.5 Overall Performance

To answer the last question of how well VINCULUM performs overall, we conduct an end-to-end comparison against two publicly available systems with leading performance:18

AIDA (Hoffart et al., 2011): We use the recommended GRAPH variant of the AIDA package (Version 2.0.4) and are able to replicate their results when gold-standard mentions are given.

18 We are also aware of other systems such as TagMe-2 (Ferragina and Scaiella, 2012), DBpedia Spotlight (Mendes et al., 2011) and WikipediaMiner (Milne and Witten, 2008). A trial test on the AIDA data set shows that both Wikifier and AIDA top the performance of other systems reported in (Cornolti et al., 2013) and therefore it is sufficient to compare with these two systems in the evaluation.

Approach  TAC09  TAC10  TAC10T  TAC11  TAC12  AIDA-dev  AIDA-test  ACE   MSNBC
no COH     80.9   86.2   87.0    78.6   59.9    68.9      66.3     87.7   86.6
+NGD       81.8   85.7   86.8    79.7   63.2    69.5      67.7     88.1   86.8
+REL       81.2   86.3   87.0    79.3   63.1    69.1      66.4     88.5   86.1
+BOTH      81.4   86.8   87.0    79.9   63.7    69.4      67.5     88.5   86.9

Table 5: Performance (%) after re-ranking candidates using coherence scores, comparing two coherence measures (NGD and REL). “no COH”: no coherence-based re-ranking is used. “+BOTH”: an average of the two scores is used for re-ranking. Coherence in general helps: a combination of both measures often achieves the best effect and NGD has a slight advantage over REL.
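The definitions of the two coherence measures compared in Table 5 are not reproduced in the text above, so the sketch below is an assumption-laden illustration only: it computes a link-based similarity using the standard Normalized Google Distance over Wikipedia in-link sets and averages it with an optional relation-triple score, mirroring the “+NGD”, “+REL” and “+BOTH” configurations. The in-link sets and the Wikipedia size constant are invented for the example.

```python
import math

# Toy in-link sets standing in for Wikipedia's link graph; whether VINCULUM uses
# exactly this normalization is not shown in the text, so treat the formula as
# the standard NGD definition only.
WIKI_SIZE = 4_000_000  # rough number of Wikipedia pages (assumption)
inlinks = {
    "Barack_Obama": {"p1", "p2", "p3", "p4"},
    "Washington,_D.C.": {"p2", "p3", "p5"},
    "George_Washington": {"p6"},
}

def ngd_similarity(e1, e2):
    """1 - NGD(e1, e2); higher means the two entities are more related."""
    a, b = inlinks.get(e1, set()), inlinks.get(e2, set())
    common = a & b
    if not a or not b or not common:
        return 0.0
    ngd = (math.log(max(len(a), len(b))) - math.log(len(common))) / \
          (math.log(WIKI_SIZE) - math.log(min(len(a), len(b))))
    return max(0.0, 1.0 - ngd)

def coherence(candidate, other_links, rel_score=None):
    """Average link-based similarity to the other entities linked in the document,
    optionally averaged with a relation-triple score (the '+BOTH' setting)."""
    if not other_links:
        return 0.0
    ngd = sum(ngd_similarity(candidate, e) for e in other_links) / len(other_links)
    return ngd if rel_score is None else 0.5 * (ngd + rel_score)

print(coherence("Washington,_D.C.", ["Barack_Obama"]))    # high: shared in-links
print(coherence("George_Washington", ["Barack_Obama"]))   # 0.0: no shared in-links
```

With these toy in-links, KB:Washington, D.C. scores much higher coherence with KB:Barack Obama than KB:George Washington does, which is the intuition behind the re-ranking.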
Approach               TAC09  TAC10  TAC10T  TAC11  TAC12  AIDA-dev  AIDA-test  ACE   MSNBC  Overall
CrossWikis              80.4   85.6   86.9    78.5   62.4    62.6      62.4     87.7   70.3    75.0
+FIGER                  81.0   86.1   86.9    78.8   63.5    66.7      64.5     87.7   75.4    76.7
+Coref                  80.9   86.2   87.0    78.6   59.9    68.9      66.3     87.7   86.6    78.0
+Coherence = VINCULUM   81.4   86.8   87.0    79.9   63.7    69.4      67.5     88.5   86.9    79.0
AIDA                    73.2   78.6   77.5    68.4   52.0    71.9      74.8     77.8   75.4    72.2
WIKIFIER                79.7   86.2   86.3    82.4   64.7    72.1      69.8     85.1   90.1    79.6

Table 6: End-to-end performance (%): We compare VINCULUM in different stages with two state-of-the-art systems, AIDA and WIKIFIER. The column “Overall” lists the average performance over the nine data sets for each approach. CrossWikis appears to be a strong baseline. VINCULUM is 0.6% shy of WIKIFIER, each winning in four data sets; AIDA tops both VINCULUM and WIKIFIER on AIDA-test.

WIKIFIER (Cheng and Roth, 2013): We are able to reproduce the reported results on ACE and MSNBC and obtain a close enough B3+ F1 number on TAC11 (82.4% vs 83.7%). Since WIKIFIER overgenerates mentions and produces links for common concepts, we restrict its output on the AIDA data to the mentions that Stanford NER predicts.

Table 6 shows the performance of VINCULUM after each stage of candidate generation (CrossWikis), entity type prediction (+FIGER), coreference (+Coref) and coherence (+Coherence). The column “Overall” displays the average of the performance numbers over the nine data sets for each approach. WIKIFIER achieves the highest overall performance. VINCULUM performs quite comparably, only 0.6% shy of WIKIFIER, despite its simplicity and unsupervised nature. Looking at the performance per data set, VINCULUM and WIKIFIER are each superior in 4 out of 9 data sets, while AIDA tops the performance only on AIDA-test. The performance of all the systems on TAC12 is generally lower than on the other data sets, mainly because of a low recall in the candidate generation stage.

We notice that even using CrossWikis alone works pretty well, indicating a strong baseline for future comparisons. The entity type prediction provides the highest boost in performance, an absolute 1.7% increase, among the subcomponents. The coreference stage and the coherence stage also give a reasonable lift.

In terms of running time, VINCULUM runs reasonably fast. For a document with 20-40 entity mentions on average, VINCULUM takes only a few seconds to finish the linking process on a single thread.

5.6 System Analysis

We outline the differences between the three system architectures in Table 7. For identifying mentions to link, both VINCULUM and AIDA rely solely on NER-detected mentions, while WIKIFIER additionally includes common noun phrases, and trains a classifier to determine whether a mention should be linked. For candidate generation, CrossWikis provides better coverage of entity mentions. For example, in Figure 3, we observe a recall of 93.2% at a cut-off of 30 by CrossWikis, outperforming 90.7% by AIDA's dictionary. Further, Hoffart et al. (2011) report a precision of 65.84% using gold mentions on AIDA-test, while CrossWikis achieves a higher precision at 69.24%. Both AIDA and WIKIFIER use coarse NER types as features, while VINCULUM incorporates fine-grained types that lead to dramatically improved performance, as shown in Section 5.3. The differences in Coreference and Coherence are not cru-
The differences in Coreference and Coherence are not cru- 324 VINCULUM AIDA WIKIFIER Mention Extraction NER NER NER, noun phrases Candidate Generation CrossWikis an intra-Wikipedia dictionary an intra-Wikipedia dictionary Entity Types FIGER NER NER Coreference find the representative mention - re-rank the candidates Coherence link-based similarity, relation triples link-based similarity link-based similarity, relation triples Learning unsupervised trained on AIDA trained on a Wikipedia sample Table 7: Comparison of entity linking pipeline architectures. VINCULUM components are described in detail in Section 4, and correspond to Figure 2. Components found to be most useful for VINCULUM are highlighted. cial to performance, as they each provide relatively small gains. Finally, VINCULUM is an unsupervised system whereas AIDA and WIKIFIER are trained on labeled data. Reliance on labeled data can often hurt performance in the form of overfitting and/or incon- sistent annotation guidelines; AIDA’s lower perfor- mance on TAC datasets, for instance, may be caused by the different data/label distribution of its train- ing data from other datasets (e.g. CoNLL-2003 con- tains many scoreboard reports without complete sen- tences, and the more specific entities as annotations for metonymic mentions). We analyze the errors made by VINCULUM and categorize them into six classes (Table 8). “Metonymy” consists of the errors where the men- tion is metonymic but the prediction links to its lit- eral name. The errors in “Wrong Entity Types” are mainly due to the failure to recognize the correct en- tity type of the mention. In Table 8’s example, the link would have been right if FIGER had correctly predicted the airport type. The mistakes by the coref- erence system often propagate and lead to the errors under the “Coreference” category. The “Context” cat- egory indicates a failure of the linking system to take into account general contextual information other than the fore-mentioned categories. “Specific Labels” refers to the errors where the gold label is a specific instance of a general entity, includes instances where the prediction is the parent company of the gold en- tity or where the gold label is the township whereas the prediction is the city that corresponds to the town- ship. “Misc” accounts for the rest of the errors. In the example, usually the location name appearing in the byline of a news article is a city name; and VINCULUM, without knowledge of this convention, mistakenly links to a state with the same name. The distribution of errors shown in Table 9 pro- vides valuable insights into VINCULUM’s varying performance across the nine datasets. First, we ob- serve a notably high percentage of metonymy-related errors. Since many of these errors are caused due to incorrect type prediction by FIGER, improvements in type prediction for metonymic mentions can provide substantial gains in future. The especially high per- centage of metonymic mentions in the AIDA datasets thus explains VINCULUM’s lower perfomance there (see Table 6). Second, we note that VINCULUM makes quite a number of “Context” errors on the TAC11 and TAC12 datasets. One possible reason is that when highly ambiguous mentions have been intentionally selected, link-based similarity and relational triples are insufficient for capturing the context. For exam- ple, in “... while returning from Freeport to Port- land. 
(TAC)”, the mention “Freeport”is unbounded by the state, one needs to know that it’s more likely to have both “Freeport” and “Portland” in the same state (i.e. Maine) to make a correct prediction 19. Another reason may be TAC’s higher percentage of Web documents; since contextual information is more scattered in Web text than in newswire docu- ments, this increases the difficulty of context model- ing. We leave a more sophisticated context model for future work (Chisholm and Hachey, 2015; Singh et al., 2012). Since “Specific Labels”, “Metonymy”, and “Wrong Entity Types” correspond to the annotation issues discussed in Sections 3.2, 3.3, and 3.4, the distribution of errors are also useful in studying annotation inconsistencies. The fact that the er- rors vary considerably across the datasets, for in- stance, VINCULUM makes many more “Specific Labels” mistakes in ACE and MSNBC, strongly suggests that annotation guidelines have a consid- erable impact on the final performance. We also observe that annotation inconsistencies also cause reasonable predictions to be treated as a mistake, 19e.g. Cucerzan (2012) use geo-coordinates as features. 325 Category Example Gold Label Prediction Metonymy South Africa managed to avoid a fifth successive defeat in 1996 at the hands of the All Blacks ... South Africa national rugby union team South Africa Wrong Entity Types Instead of Los Angeles International, for example, consider flying into Burbank or John Wayne Airport ... Bob Hope Airport Burbank, California Coreference It is about his mysterious father, Barack Hussein Obama, an imperious if alluring voice gone distant and then missing. Barack Obama Sr. Barack Obama Context Scott Walker removed himself from the race, but Green never really stirred the passions of former Walker supporters, nor did he garner out- sized support “outstate”. Scott Walker (politician) Scott Walker (singer) Specific Labels What we like would be Seles , ( Olympic champion Lindsay ) Davenport and Mary Joe Fernandez . 1996 Summer Olympics Olympic Games Misc NEW YORK 1996-12-07 New York City New York Table 8: We divide linking errors into six error categories and provide an example for each class. Error Category TAC09 TAC10 TAC10T TAC11 TAC12 AIDA-dev AIDA-test ACE MSNBC Metonymy 16.7% 0.0% 3.3% 0.0% 0.0% 60.0% 60.0% 5.3% 20.0% Wrong Entity Types 13.3% 23.3% 20.0% 6.7% 10.0% 6.7% 10.0% 31.6% 5.0% Coreference 30.0% 6.7% 20.0% 6.7% 3.3% 0.0% 0.0% 0.0% 20.0% Context 30.0% 26.7% 26.7% 70.0% 70.0% 13.3% 16.7% 15.8% 15.0% Specific Labels 6.7% 36.7% 16.7% 10.0% 3.3% 3.3% 3.3% 36.9% 25.0% Misc 3.3% 6.7% 13.3% 6.7% 13.3% 16.7% 10.0% 10.5% 15.0% # of examined errors 30 30 30 30 30 30 30 19 20 Table 9: Error analysis: We analyze a random sample of 250 of VINCULUM’s errors, categorize the errors into six classes, and display the frequencies of each type across the nine datasets. for example, AIDA predicts KB:West Virginia Mountaineers football for “..., Alabama of- fered the job to Rich Rodriguez, but he decided to stay at West Virginia. (MSNBC)” but the gold label is KB:West Virginia University. 6 Related Work Most related work has been discussed in the earlier sections; see Shen et al. (2014) for an EL survey. Two other papers deserve comparison. Cornolti et al. (2013) present a variety of evaluation measures and experimental results on five systems compared head- to-head. In a similar spirit, Hachey et al. (2014) pro- vide an easy-to-use evaluation toolkit on the AIDA data set. 
In contrast, our analysis focuses on the prob- lem definition and annotations, revealing the lack of consistent evaluation and a clear annotation guide- line. We also show an extensive set of experimental results conducted on nine data sets as well as a de- tailed ablation analysis to assess each subcomponent of a linking system. 7 Conclusion and Future Work Despite recent progress in Entity Linking, the com- munity has had little success in reaching an agree- ment on annotation guidelines or building a standard benchmark for evaluation. When complex EL sys- tems are introduced, there are limited ablation studies for readers to interpret the results. In this paper, we examine 9 EL data sets and discuss the inconsisten- cies among them. To have a better understanding of an EL system, we implement a simple yet effective, unsupervised system, VINCULUM, and conduct ex- tensive ablation tests to measure the relative impact of each component. From the experimental results, we show that a strong candidate generation component (CrossWikis) leads to a surprisingly good result; us- ing fine-grained entity types helps filter out incorrect links; and finally, a simple unsupervised system like VINCULUM can achieve comparable performance with existing machine-learned linking systems and, therefore, is suitable as a strong baseline for future research. There are several directions for future work. We hope to catalyze agreement on a more precise EL an- notation guideline that resolves the issues discussed in Section 3. We would also like to use crowdsourc- ing (Bragg et al., 2014) to collect a large set of these annotations for subsequent evaluation. Finally, we hope to design a joint model that avoids cascading errors from the current pipeline (Wick et al., 2013; Durrett and Klein, 2014). 326 Acknowledgements The authors thank Luke Zettle- moyer, Tony Fader, Kenton Lee, Mark Yatskar for constructive suggestions on an early draft and all members of the LoudLab group and the LIL group for helpful discussions. We also thank the action edi- tor and the anonymous reviewers for valuable com- ments. This work is supported in part by the Air Force Research Laboratory (AFRL) under prime con- tract no. FA8750-13-2-0019, an ONR grant N00014- 12-1-0211, a WRF / TJ Cable Professorship, a gift from Google, an ARO grant number W911NF-13- 1-0246, and by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA. Any opinions, findings, and conclusion or recommenda- tions expressed in this material are those of the au- thor(s) and do not necessarily reflect the view of DARPA, AFRL, or the US government. References Jonathan Bragg, Andrey Kolobov, and Daniel S Weld. 2014. Parallel task routing for crowdsourcing. In Sec- ond AAAI Conference on Human Computation and Crowdsourcing. Xiao Cheng and Dan Roth. 2013. Relational inference for wikification. In EMNLP. Andrew Chisholm and Ben Hachey. 2015. Entity disam- biguation with web links. Transactions of the Associa- tion for Computational Linguistics, 3:145–156. Marco Cornolti, Paolo Ferragina, and Massimiliano Cia- ramita. 2013. A framework for benchmarking entity- annotation systems. In Proceedings of the 22nd interna- tional conference on World Wide Web, pages 249–260. International World Wide Web Conferences Steering Committee. Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. 
In Proceedings of the Seventh Inter- national Conference on Intelligent Systems for Molecu- lar Biology (ISMB-1999), pages 77–86. S. Cucerzan. 2007. Large-scale named entity disam- biguation based on wikipedia data. In Proceedings of EMNLP-CoNLL, volume 2007, pages 708–716. Silviu Cucerzan. 2012. The msr system for entity linking at tac 2012. In Text Analysis Conference 2012. Greg Durrett and Dan Klein. 2014. A joint model for en- tity analysis: Coreference, typing, and linking. Trans- actions of the Association for Computational Linguis- tics, 2:477–490. Paolo Ferragina and Ugo Scaiella. 2012. Fast and ac- curate annotation of short texts with wikipedia pages. IEEE Software, 29(1):70–75. J.R. Finkel, T. Grenager, and C. Manning. 2005. Incor- porating non-local information into information extrac- tion systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Compu- tational Linguistics, pages 363–370. Association for Computational Linguistics. Ben Hachey, Joel Nothman, and Will Radford. 2014. Cheap and easy entity evaluation. In ACL. Hannaneh Hajishirzi, Leila Zilles, Daniel S. Weld, and Luke Zettlemoyer. 2013. Joint Coreference Resolution and Named-Entity Linking with Multi-pass Sieves. In EMNLP. Xianpei Han and Le Sun. 2012. An entity-topic model for entity linking. In Proceedings of the 2012 Joint Confer- ence on Empirical Methods in Natural Language Pro- cessing and Computational Natural Language Learn- ing, pages 105–115. Association for Computational Linguistics. Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. 2013a. Learning entity rep- resentation for entity disambiguation. Proc. ACL2013. Zhengyan He, Shujie Liu, Yang Song, Mu Li, Ming Zhou, and Houfeng Wang. 2013b. Efficient collective entity linking with stacking. In EMNLP, pages 426–435. Johannes Hoffart, Mohamed A. Yosef, Ilaria Bordino, Ha- gen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. As- sociation for Computational Linguistics. Johannes Hoffart, Yasemin Altun, and Gerhard Weikum. 2014. Discovering emerging entities with ambiguous names. In Proceedings of the 23rd international confer- ence on World wide web, pages 385–396. International World Wide Web Conferences Steering Committee. Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge- based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th An- nual Meeting of the Association for Computational Lin- guistics: Human Language Technologies, volume 1, pages 541–550. Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Grif- fitt, and Joe Ellis. 2010. Overview of the tac 2010 knowledge base population track. In Text Analysis Con- ference (TAC 2010). Mitchell Koch, John Gilmer, Stephen Soderland, and Daniel S Weld. 2014. Type-aware distantly supervised relation extraction with linked arguments. In EMNLP. Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. 2009. Collective annotation of Wikipedia entities in web text. In Proceedings of the 327 15th ACM SIGKDD international conference on Knowl- edge discovery and data mining, pages 457–466. ACM. Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. 
Deterministic coreference resolution based on entity- centric, precision-ranked rules. Computational Linguis- tics, pages 1–54. Yang Li, Chi Wang, Fangqiu Han, Jiawei Han, Dan Roth, and Xifeng Yan. 2013. Mining evidences for named entity disambiguation. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge dis- covery and data mining, pages 1070–1078. ACM. Xiao Ling and Daniel S Weld. 2012. Fine-grained entity recognition. In AAAI. James Mayfield, Javier Artiles, and Hoa Trang Dang. 2012. Overview of the tac2012 knowledge base popu- lation track. Text Analysis Conference (TAC 2012). P. McNamee and H.T. Dang. 2009. Overview of the tac 2009 knowledge base population track. Text Analysis Conference (TAC 2009). Pablo N Mendes, Max Jakob, Andrés Garcı́a-Silva, and Christian Bizer. 2011. Dbpedia spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. ACM. David Milne and Ian H. Witten. 2008. Learning to link with wikipedia. In Proceedings of the 17th ACM con- ference on Information and knowledge management, pages 509–518. ACM. Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics, 2. Lev-Arie Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for dis- ambiguation to wikipedia. In ACL, volume 11, pages 1375–1384. Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In ECML/PKDD (3), pages 148–163. Wei Shen, Jianyong Wang, and Jiawei Han. 2014. Entity linking with a knowledge base: Issues, techniques, and solutions. TKDE. Avirup Sil and Alexander Yates. 2013. Re-ranking for joint named-entity recognition and linking. In Pro- ceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 2369–2374. ACM. Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2012. Wikilinks: A large- scale cross-document coreference corpus labeled via links to wikipedia. Technical report, University of Massachusetts Amherst, CMPSCI Technical Report, UM-CS-2012-015. Valentin I Spitkovsky and Angel X Chang. 2012. A cross- lingual dictionary for english wikipedia concepts. In LREC, pages 3168–3175. 2012. Tac kbp entity selection. http://www.nist. gov/tac/2012/KBP/task_guidelines/ TAC_KBP_Entity_Selection_V1.1.pdf. Michael Wick, Sameer Singh, Harshal Pandya, and An- drew McCallum. 2013. A joint model for discovering and linking entities. In CIKM Workshop on Automated Knowledge Base Construction (AKBC). Jiaping Zheng, Luke Vilnis, Sameer Singh, Jinho D. Choi, and Andrew McCallum. 2013. Dynamic knowledge- base alignment for coreference resolution. In Confer- ence on Computational Natural Language Learning (CoNLL). 328 http://www.nist.gov/tac/2012/KBP/task_guidelines/TAC_KBP_Entity_Selection_V1.1.pdf http://www.nist.gov/tac/2012/KBP/task_guidelines/TAC_KBP_Entity_Selection_V1.1.pdf http://www.nist.gov/tac/2012/KBP/task_guidelines/TAC_KBP_Entity_Selection_V1.1.pdf