Edinburgh Research Explorer

Citation for published version: Cuzzola, J, Jovanovic, J, Bagheri, E & Gasevic, D 2015, 'Evolutionary Fine-Tuning of Automated Semantic Annotation Systems', Expert Systems with Applications, vol. 42, no. 20, pp. 6864-6877. https://doi.org/10.1016/j.eswa.2015.04.054

Document Version: Peer reviewed version
Published In: Expert Systems with Applications

Evolutionary Fine-Tuning of Automated Semantic Annotation Systems

John Cuzzola^a (jcuzzola@ryerson.ca), Jelena Jovanovic^b (jeljov@gmail.com), Ebrahim Bagheri^a (bagheri@ryerson.ca), Dragan Gasevic^c (dgasevic@acm.org)

a Laboratory for Systems, Software and Semantics (LS3), http://ls3.rnet.ryerson.ca, Ryerson University, Ontario, Canada
b Faculty of Organizational Sciences (FOS), http://www.fon.bg.ac.rs, University of Belgrade, Belgrade, Serbia
c Schools of Education and Informatics, http://www.ed.ac.uk/schools-departments/informatics, University of Edinburgh, Scotland, United Kingdom

Abstract. Considering the ever-increasing speed at which new textual content is generated, an efficient and effective use of large text corpora requires automated natural language processing and text analysis tools. A subset of such tools, namely automated semantic annotation tools, are capable of interlinking syntactical forms of text with their underlying semantic concepts. The optimal performance of automated semantic annotation tools often depends on tuning the values of the tools' adjustable parameters to the specificities of the annotation task, and particularly to the characteristics of the text to be annotated. Such characteristics include the text domain, terseness or verbosity level, text length, structure and style.
Since the default configuration of annotation tools is not suitable for the large variety of input texts that different combinations of these attributes can produce, users often need to adjust the annotators' tunable parameters in order to get the best results. However, the configuration of semantic annotators is presently a tedious and time-consuming task as it is primarily based on a manual trial-and-error process. In this paper, we propose a Parameter Tuning Architecture (PTA) for automating the task of configuring parameter values of semantic annotation tools. We describe the core fitness functions of PTA that operate on the quality of the annotations produced, and offer a solution, based on a genetic algorithm, for searching the space of possible parameter values. Our experiments demonstrate that PTA enables effective configuration of parameter values of many semantic annotation tools.

Keywords. semantic annotation, automated configuration, genetic algorithm, parameter learning

I. Introduction

The quantity and variety of unstructured textual content have rapidly increased over the last few years, leading large and small organizations to seek solutions that enable effective and efficient use of both internally produced textual content and content originating from the Web [1]. Considering the amount of textual content and the speed at which it has to be processed, it is gradually becoming evident that automated machine comprehension of text is a necessity if the objectives of efficiency and effectiveness are to be reached.
This has led to an increased research focus, both in academia and industry, on text mining, natural language processing and other related Artificial Intelligence fields (Hovy, Navigli, & Ponzetto, 2013), and resulted in numerous proposals and specific software solutions for addressing some aspects of text comprehension through, for example, named entity extraction (Ratinov & Roth, 2013; Atdağ & Labatut, 2013), relation extraction (Yan, Okazaki, Matsuo, Yang, & Ishizuka, 2009; Weston, Bordes, Yakhnenko, & Usunier, 2013), and sentiment analysis (Liu, 2012).

[1] http://blog.digitalinsights.in/social-media-users-2014-stats-numbers/05205287.html

Automated semantic annotation of textual content addresses an important aspect of text comprehension, namely, the extraction and disambiguation of entities and topics mentioned in or related to a given piece of text (Uren et al., 2005). Each identified entity is disambiguated, i.e., unambiguously defined, by establishing a link to an appropriate entry (concept or instance) in a knowledge base that uniquely identifies the entity and provides further information about it.
This task, also known as entity linking (Hachey, Radford, Nothman, Honnibal, & Curran, 2013), typically relies on large, general-purpose, Web-based knowledge bases, such as Wikipedia and other more structured knowledge bases such as DBpedia (http://dbpedia.org), YAGO (http://www.mpi-inf.mpg.de/yago-naga/yago/), and Wikidata (http://wikidata.org).

Tools and services for automated semantic annotation of text are offered by a constantly increasing number of companies and research groups (Jovanovic et al., 2014). Major Internet players are also very active in this area. For instance, to fulfill its well known mission of "organizing the world's information", Google is continuously evolving its proprietary knowledge base, the Knowledge Graph, and according to one Google executive, "every piece of information that we [Google] crawl, index, or search is analyzed in the context of Knowledge Graph" [2]. In addition, Google has been working on a probabilistic knowledge base, named Knowledge Vault, that combines automated extraction of facts from the Web and prior knowledge derived from existing knowledge bases (Dong et al., 2014). Similarly, Microsoft is developing its own knowledge repository called Satori and using it to semantically index content and thus improve both its search engine Bing and the applications running on Windows [3].

In (Jovanovic et al., 2014), we have provided a comprehensive descriptive comparison of the state-of-the-art semantic annotation tools by considering numerous features, especially those that could be relevant for selecting the right tool(s) to use in a specific application case. One common characteristic of all the reviewed tools is that they need to be optimally configured in order to give their best results when working with different kinds of texts, such as texts of diverse levels of formality, length, domain-specificity, and use of jargon. While the examined annotators provide a default configuration of their parameters suitable for some annotation tasks, to our knowledge, no single annotator can reach its best performance on all kinds of text with one single configuration. Furthermore, the quality of an annotator's output is not a category that could be assessed in absolute terms; instead, it depends on the application case, i.e., on the specificities of the requirements that stem from a particular context of use (Maynard, 2008). For instance, in some cases a very detailed annotation would be required and highly valued, whereas in other cases a terse annotation of only the most relevant entities would be considered the best output.
This indicates that in order to get the best from a semantic annotation tool, one should configure it according to the specificities of the intended context of use, including both the characteristics of the text to be annotated and the requirements of the annotation task (e.g., precision/recall trade-off).

[2] http://goo.gl/mZ7a9H
[3] http://goo.gl/iDwP2x

Configuration of semantic annotators is not an easy task, for at least two reasons. First, since an annotator's configuration parameters are closely tied to the tool's internal functioning, it is difficult to expose them in a manner that would enable users to effectively and efficiently use the tool without having to know the details of the tool's inner logic. In other words, the first challenge is in enabling users to tune the annotator with respect to the key issues such as specificity and comprehensiveness of annotations, without them being concerned with the details of the tool's parameters. The second challenge stems from the fact that those configuration parameters are not mutually independent but interact with one another, so that one has to find an optimal combination of parameter values for a specific application case. Moreover, annotators may have many parameters, and some of those parameters are continuous variables, thus making the tuning task very time consuming.
As the state-of-the-art annotators do not provide support for finding an optimal parameter combination for a specific annotation task, this search is often done manually, through a trial-and-error process. For example, consider the commercial semantic annotator TextRazor, whose best practices state the following:

"Experiment with different confidence score thresholds...If you prefer to avoid false-positives in your application you may want to ignore results below a certain threshold. The best way to find an appropriate threshold is to run a sample set of your documents through the system and then manually inspect the results." [4]

To our knowledge, no solution to the above stated problem of parameter configuration has been reported in the literature. Therefore, in this paper, we make the following contributions:

● Parameter Tuning Architecture (PTA) to automate the task of parameter value selection for a user-supplied testing set, resulting in performance that is better than or at least equal to the tool's performance with its default parameter values.
● Five variations of the fitness function that emphasize different aspects of annotation quality (namely most annotations produced, most known correct annotations, least unknown annotations, best recall/precision), and a means to identify which variation performed the best for a particular testing set.
● A method to efficiently search the solution space of possible parameter values using a Genetic algorithm.
The proposed Parameter Tuning Architecture (PTA) is applicable to a variety of automated semantic annotators, since its core component, the fitness function, is not concerned with any textual or annotator-specific features but rather with metrics based on the known correct, known incorrect, or unknown annotations produced. To search the space of possible solutions, i.e., possible configurations of parameter values, we rely on a Genetic algorithm (for the reasons given in Section IV), although PTA can also be applied with other methods for searching a large solution space (e.g., other evolutionary algorithms or probabilistic methods). Our experiments with PTA have demonstrated that it can be used as an effective configurator for automated semantic annotators.

After more precisely defining and illustrating the problem of parameter tuning in the context of semantic annotation tools (Sect. II), and the associated challenges (Sect. III), we present PTA in detail in Section IV. Section V reports on the experiments that we performed in order to evaluate PTA's ability to find a set of parameter values that provides an adequate level of the annotator's output while minimizing annotation errors. The experimental results and the overall proposal are further critically discussed and summarized in Section VI, while Section VII positions the contributions of our work with respect to related research work. Lastly, we acknowledge the limitations of our solution and propose future experiments before we conclude our paper (Sections VIII and IX).
[4] https://www.textrazor.com/docs/rest#optimization

II. Problem Definition

As indicated in the Introduction, today's automated semantic annotators offer a variety of tunable parameters in order to produce results accordant with the desired level of granularity, precision, and recall. There is no single "best" configuration as this is a function of various factors: is the text we are annotating restricted to a specific topic or domain such as history, food, or politics? Is the input text descriptive with verbose and meaningful wording or is it terse with numerous empty (stop) words? What is the length, style, and structure of the input text: paragraph, single sentence, or tweet? Therefore, we must decide on a gold standard, a manually labelled training or testing set, that contains such factor-specific target questions that the annotator will be exposed to. Observe, as in the TextRazor example from the Introduction, that an annotator would already be trained with default parameter values; thus, PTA uses the gold standard as an evaluation/testing set to tailor the annotator's parameters to the kind of input text represented by the gold standard.

Further difficulties arise when a parameter configuration comprises many individual or continuous floating point parameters, resulting in an exponential number of possible combinations that makes a complete search of this space unrealistic.
To illustrate this, consider Table 1 showing how the TagME semantic annotator (Ferragina & Scaiella, 2012) chooses to disambiguate the same spot ("field") when different values of the tool's two tunable parameters (epsilon and long_text) are used. As the table indicates, even slight changes to the values of epsilon and long_text can have significant impact on the disambiguation process and lead to absurd results.

Table 1. Disambiguation of the spot "field" in the sentence "Gesture or salute? A soccer star who made the sign on the field says otherwise." for different values of TagME's tunable parameters epsilon and long_text.

| Configuration | Disambiguation of the spot "field" by TagME |
|---|---|
| epsilon = 0.39, long_text = 0 | Wikipedia Reference: Field (agriculture). Abstract: In agriculture, the word field refers generally to an area of land enclosed or otherwise and used for agricultural purposes. |
| epsilon = 0.30, long_text = 1 | Wikipedia Reference: Field (mathematics). Abstract: In abstract algebra, a field is a ring whose nonzero elements form a commutative group under multiplication. |
| epsilon = 0.52, long_text = 4 | Wikipedia Reference: Academic discipline. Abstract: An academic discipline, or field of study, is a branch of knowledge that is taught and researched at the college or university level. |
| epsilon = 0.55, long_text = 6 | Wikipedia Reference: Field (physics). Abstract: In physics, a field is a physical quantity associated with each point of spacetime. |

It is this difficulty of choosing appropriate parameter values that provides the motivation for the work presented in this paper.
To our knowledge, the problem of selecting appropriate values for configuration parameters of semantic text annotators has not yet been reported in the literature. In the following sections, we first further explain the challenges associated with tuning parameters of semantic annotators, and then proceed to introduce our Parameter Tuning Architecture (PTA) and a method to find suitable parameter values in reasonable time.

III. Challenges in Tuning Parameters of Semantic Annotators

Our proposed method involves a fitness function to evaluate the output of an annotator given an arbitrary configuration. PTA's fitness function scores the annotator's output against a testing set containing oracle-identified correct (C) and/or incorrect (E) annotations. Table 2 gives the four possible combinations of labelled annotations within the testing set. NOT (¬) indicates that the corresponding correct or incorrect annotations are absent from the testing set.

Table 2. The four kinds of testing sets based on the presence/absence of correct (C) and incorrect (E) labels.

| Combination | Correct Annotations (positive labels) | Incorrect Annotations (negative labels) | Testing Model |
|---|---|---|---|
| #1 | C | E | two-class supervised |
| #2 | C | ¬E | one-class supervised, positive labels only |
| #3 | ¬C | E | one-class supervised, negative labels only |
| #4 | ¬C | ¬E | unsupervised |

Combination #4 assumes a testing set with no labels, and thus is an unsupervised model that PTA cannot take advantage of; hence, it will not be further considered. In contrast, combinations #1, #2, and #3 are supervised models that variants of the PTA fitness function can utilize.
However, supervised configuration of semantic annotators has caveats when relying on testing sets. Namely:

caveat (i): A testing set that depends on negative labels (#1, #3) cannot feasibly cover all possible mistakes that an annotator could output for an input text.

caveat (ii): Due to (i), publicly available testing sets with labelled errors are difficult to find, particularly one-class (error-only) testing sets (#3).

caveat (iii): Testing sets with correct labels (#1, #2) often specify only a few correct annotations that the annotator should identify from the input text. However, the output could contain many more correct annotations not mentioned in the testing set.

caveat (iv): An oracle may specify correct but not necessarily ideal annotations. The annotation tool may find an annotation that is better than or of equal semantic quality to that of the oracle-recommended annotation.

caveat (v): Regardless of the form of the testing set used (#1, #2, #3), an annotator will often produce annotations that are not explicitly identified in the testing set, thus forcing any method of evaluation to make assumptions for the unidentified annotations (assume correct, assume incorrect, or ignore). Consequently, the quality of the annotator's output in relation to the testing set provided is questionable.

To illustrate the above given statements, we obtained the wiki-annot30 dataset of A3 labs from the University of Pisa Computer Science Department, available under a Creative Commons License [5]. This set contains 186,000 short text fragments from Wikipedia with correct-only identified annotations (combination #2) and was constructed using the procedure described in Ferragina and Scaiella (2012). Table 3 summarizes the output on one of these text fragments produced by the TagME semantic annotator with default configuration. The output is given alongside the A3 dataset gold standard. The table gives the Wikipedia pages that are linked to the identified spot. The TagME column also shows whether the linked Wikipedia page for the spot is truly correct (C) or a true error (E).

Table 3. Wikipedia entities produced by TagME and compared with the A3 gold standard for the text fragment "It is home to the American University Eagles basketball and volleyball teams. The arena, named for Washington DC philanthropists, Howard and Sondra Bender". TagME links are identified as (C)orrect or (E)rror.

| Spot | A3 dataset | TagME |
|---|---|---|
| volleyball | wikipedia:Volleyball | wikipedia:Volleyball (C) |
| washington dc | wikipedia:Washington,D.C. | wikipedia:Washington,D.C. (C) |
| basketball | wikipedia:Basketball | wikipedia:Basketball (C) |
| the american university (eagles) | wikipedia:American University | wikipedia:American Eagles (C) |
| arena | wikipedia:Arena | wikipedia:Arena (C) |
| home | Not available | wikipedia:Home(sports) (C) |
| teams | Not available | wikipedia:NHL (E) |
| howard | Not available | wikipedia:Howard University (E) |
| bender | Not available | wikipedia:Gary Bender (E) |

From the output we can see that TagME correctly linked more spots than what was listed in the A3 dataset (caveat iii). Specifically, the 'home' spot is a correct entity mention absent from A3.
Further, the A3 dataset provides no link for the spot 'teams', which TagME linked incorrectly, nor for 'howard' and 'bender', which were also linked incorrectly. Consequently, to assess the accuracy of the output, one must first consider how these unknown spots should be treated (caveat v). Lastly, consider the spot 'the american university (eagles)'. A3 recommends the Wikipedia link of 'American University' while TagME prefers the link 'American Eagles'. Both suggestions are correct since the official sports team of American University is in fact the American Eagles. However, in the context of the text fragment, it is clear that the link to American Eagles is a better (i.e., more precise) choice than the gold standard (caveat iv). A similar situation occurs with the 'arena' spot, in which TagME agrees with the wiki-annot30 dataset although the Wikipedia entity for 'Bender Arena' would clearly be superior.

[5] http://acube.di.unipi.it/tagme-dataset

In Section IV we detail our five variations of the Parameter Tuning Architecture (PTA) and how they map to the testing set combinations given in Table 2. We also explain how each variation of PTA addresses the caveats.

IV. The Parameter Tuning Architecture (PTA)

In this section, we derive our Parameter Tuning Architecture as a mathematical model. Let $p_i$ be a tunable parameter for a semantic annotator. Let $v_n = [p_1, p_2, \ldots, p_n]$ be a vector of $n$ tunable parameters for a semantic annotator configuration, and let $V$ be the space of all possible configurations.
Let $T$ be a testing or validation set of any of the supervised combinations of Table 2 (#1, #2, or #3). We wish to find a vector $v \in V$ that maximizes the following fitness function:

$$\mathrm{FIT}(T) = \arg\max_{v \in V} \; [f_c(T)]^2 - [f_e(T)]^2 \qquad \text{(eq. 1)}$$

where $\mathrm{FIT}(T) \in [-1, 1]$ and $f_c(T), f_e(T) \in [0, 1]$.

FIT(T) is a real-valued ranking function in the interval [-1,1] consisting of a reward component $f_c(T)$ and a penalty $f_e(T)$. The reward is a real-valued scaled measure in the interval [0,1] representing the correct annotations discovered by an annotator. Conversely, $f_e(T)$ is a measure of the incorrect annotations within the same interval [0,1]. Consequently, FIT(T) is a trade-off between the correct annotations found by the annotator and the errors the annotator produces. We square $f_c$ and $f_e$ to place more importance on many correct and incorrect annotations over a few correct answers and infrequent mistakes. The exact formula for computation of $f_c$ and $f_e$ is specific to the variant of FIT(T) proposed and is given in sections IV.1 through IV.5.

However, equation 1 faces a key obstacle. Namely, the space of all possible solution vectors $V$ is very large, making an exhaustive search of this space intractable. Furthermore, this space may be infinite if any of the tunable parameters are real-valued. Consequently, we need to decide on how to efficiently search a potentially enormous solution space. Our approach is to use a genetic algorithm (GA) for this task.
Genetic algorithms are a class of evolutionary algorithms inspired by nature (Chiong & Beng, 2007). They are an easy-to-implement technique that works well in optimization problems where potential solutions to the given problem can be abstracted to the "chromosomes" model of GA. In our case, the vector of configuration parameters $v$ can be seen as the required chromosome.

Figure 1 outlines our PTA method with GA, which begins with a set of randomly created configurations (a). In the next step (b), we randomly select a subset of samples of size n from the validation set, then annotate the n samples (c) using each of the configurations initially created in step (a). The configurations are evaluated against the samples by our fitness function (d). The highest ranking configurations generate offspring through crossover (g) and gene mutation (h). The initial n samples from the validation set are updated by removing the oldest x samples and replacing them with another random set of x samples (i). We partially re-sample the validation set at each generation as a form of bootstrap aggregation to minimize overfitting and improve generalization. The process repeats with the next generation of configurations until some stopping condition, such as convergence, is satisfied (f).

Figure 1. The Genetic Algorithm applied in the Parameter Tuning Architecture (PTA)

Although GA does not guarantee an optimal solution, it often converges to near optimal answers in a relatively short period of time (Szczerbicka, Becker & Syrjakow, 1998).
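The generational loop described for Figure 1 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' released implementation: the function name `tune`, the callables `annotate` and `fitness`, the population size, the parent-selection rule (top half), uniform crossover, the mutation rate, and the fixed generation budget used as the stopping condition are all assumptions made for the example.

```python
import random
from collections import deque


def tune(annotate, fitness, bounds, validation, pop_size=20, n=30, x=5,
         generations=50, mutation_rate=0.1, seed=0):
    """Hypothetical GA loop; returns (best configuration, its fitness)."""
    rng = random.Random(seed)
    # (a) Random initial configurations: one gene per tunable parameter.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    # (b) Random subset of n validation samples, kept oldest-first.
    window = deque(rng.sample(validation, n))
    best, best_score = None, float("-inf")
    for _ in range(generations):  # (f) assumed stopping rule: fixed budget
        samples = list(window)
        # (c) + (d) Annotate the samples with every configuration and rank.
        scored = sorted(((fitness(cfg, samples, annotate), cfg)
                         for cfg in population),
                        key=lambda pair: pair[0], reverse=True)
        # Scores from different sample windows are only roughly comparable.
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        parents = [cfg for _, cfg in scored[:max(2, pop_size // 2)]]
        children = []
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # (g) Uniform crossover: each gene comes from either parent.
            child = [rng.choice(genes) for genes in zip(mother, father)]
            # (h) Mutation: occasionally redraw a gene within its bounds.
            child = [rng.uniform(lo, hi) if rng.random() < mutation_rate else g
                     for g, (lo, hi) in zip(child, bounds)]
            children.append(child)
        population = children
        # (i) Replace the oldest x samples with x freshly drawn ones.
        for _ in range(x):
            window.popleft()
        window.extend(rng.sample(validation, x))
    return best, best_score
```

Here `bounds` is a list of `(low, high)` pairs, one per tunable parameter (for TagME this might be one pair each for epsilon and long_text), and `fitness(cfg, samples, annotate)` is expected to return a value in [-1, 1] in the spirit of equation 1. Keeping the samples in a deque makes step (i), dropping the oldest x samples, a constant-time operation.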
A GA also has the capability to search multiple regions of the solution space simultaneously since each member of a population occupies a different area than its peers (Grefenstette, 1992). This is particularly useful when there are many globally optimal solutions that could satisfy the problem. Finally, GA is an incremental method that allows for starting/stopping the search from its currently best known solutions without beginning from scratch. This trait is useful in reinforcement learning or re-tuning the annotator as more examples become available (Moriarty, Schultz & Grefenstette, 1999).

It is worth noting that our emphasis is on the fitness functions of PTA and less on the choice of evolutionary algorithm for the traversal of the search space. Our future work will investigate the comparative effectiveness of other evolutionary algorithms such as swarm optimization. In order to provide the basis for a comparative analysis and provide GA implementation details, we have made the source code and data available for use under an open source license at: http://ls3.rnet.ryerson.ca/annotator/PTA.

Algorithm 1 outlines the PTA process joining our fitness function (equation 1) with the GA of Figure 1.

Input: i) A one-class positive-labeled validation set from the gold standard.
ii) A one-class positive-labeled testing set from the gold standard. iii) A semantic annotator.

Output: A recommended configuration for input annotator (iii).

Algorithm:

1. For each of these PTA variants: pessimistic (eq. 5), apathetic (eq. 6), delta (eq. 7), optimistic (eq. 8) and stochastic (eq. 9)
   1.1. Find recommended configuration on the validation set using GA of Figure 1.
2. For each recommended configuration from 1.1:
   2.1. Annotate the testing set using the recommended configuration.
   2.2. Calculate F1 area under the curve (F1-AUC) (figures 2,3,4,5).
3. Best recommended solution is the configuration with highest F1-AUC.

Algorithm 1: PTA algorithm for automated evolutionary fine-tuning of semantic annotators.

We now derive various forms of our PTA fitness function, for use within step 1 of Algorithm 1. We focus on the one-class positive labels testing set (Table 2, combination #2) since this is the most prevalent type among the available gold-standard datasets for the entity linking task.

To begin, we define A(X) as the output of a semantic annotator using text fragments from set X as input. Let function C(X) return the set of correct annotations from the set of input text X. Similarly, let E(X) return incorrect annotations from the input text set X. We use the generic term 'annotation' broadly to refer to any kind of output produced by a semantic annotation system (an entity link, a related topic, a keyword) as per the examples provided within the validation or testing set.
We define function $f_c(X)$ as a numerical score for the correct annotations of set X, while $f_e(X)$ does the same for the errors of X. With these definitions in place, we can identify subsets such as: (i) annotations that match the gold standard A(T)∩C(T), (ii) gold standard annotations the annotator failed to identify C(T)\A(T), and (iii) additional annotations with unknown label A(T)\C(T). In the following subsections, we present five adaptations of PTA (pessimistic, apathetic, delta, optimistic, stochastic) that pair different combinations of these subsets to meet the challenge of constructing a suitable fitness function in the presence of uncertainty A(T)\C(T). In section V, we examine the effectiveness of these five PTA forms in tuning a variety of annotators.

IV.1 Pessimistic PTA

The basic form of our fitness function is:

$$\mathrm{FIT}(T) = f_c(C(A(T)))^2 - f_e(E(A(T)))^2 \tag{eq. 2}$$

This basic form is suitable for a two-class supervised testing set (Table 2 #1) because it assumes that we are provided with known correct and incorrect annotations that can be exploited:

$$\mathrm{FIT}(T) = f_c(C(A(T)))^2 - f_e(E(A(T)))^2 = f_c(A(T) \cap C(T))^2 - f_e(A(T) \cap E(T))^2 \tag{eq. 3}$$

Specifically, we find the correct annotations $C(A(T))$ by considering only those annotations that are explicitly known to us as correct within the testing set, $A(T) \cap C(T)$. We do the same for the error component of equation 2 with $A(T) \cap E(T)$. In this variant of PTA the annotations that are not part of the testing set (unrecognized) are ignored (Section II caveat v).
Although this form operates for a two-class labeled testing set, it cannot be used for the commonly encountered one-class positive labels testing set (Table 2 #2) due to the absence of any defined errors. To compensate, we modify $f_e$ to assume that unrecognized annotations are most likely errors by default:

$$\mathrm{FIT}_P(T) = f_c(A(T) \cap C(T))^2 - f_e(A(T) \setminus C(T))^2 \tag{eq. 4}$$

We call equation 4 pessimistic because of the assumption that annotations not explicitly labeled as correct should be considered erroneous. The final step in the formulation is to define $f_c$ and $f_e$ as:

$$f_c(A(T) \cap C(T)) = \frac{|A(T) \cap C(T)|}{|A(T) \cap C(T)|_{max}} \quad \text{and} \quad f_e(A(T) \setminus C(T)) = \frac{|A(T) \setminus C(T)|}{|A(T) \setminus C(T)|_{max}} \tag{eq. 5}$$

where the denominator $|Y|_{max}$ of $f\left(\frac{|Y|}{|Y|_{max}}\right)$ is the maximum observed count from all of the previously encountered sets of Y; it is used to normalize the numerator into a value between 0 and 1 inclusive.

A shortcoming of pessimistic PTA is the possibility that the testing set identifies a smaller number of correct annotations than the annotator might generate (Section III caveat iii), as demonstrated in Table 3. Consequently, those unidentified correct annotations would be missed by $f_c$, thus undermining the true quality of the output produced, and would be unfairly counted against the annotator in the error calculation of $f_e$ (equation 5) or ignored entirely (equation 3).

IV.2 Apathetic PTA

In this section, we introduce a version of PTA that allows for ignoring unknown annotations when only a one-class positive testing set is available.
Simply, we change the penalty component $f_e$ of PTA to look at the number of missed known gold standard annotations rather than the total number of unrecognized annotations generated. Specifically:

$$\mathrm{FIT}_A(T) = f_c(A(T) \cap C(T))^2 - f_e(C(T) \setminus A(T))^2 \tag{eq. 6}$$

where $f_e$ is defined as the normalized number of missed gold standard annotations, $\frac{|C(T) \setminus A(T)|}{|C(T) \setminus A(T)|_{max}}$, and $f_c$ is the normalized number of matching answers as before, $\frac{|A(T) \cap C(T)|}{|A(T) \cap C(T)|_{max}}$. We call this variant apathetic because its concern is only with the number of known correct and recognized versus correct but missed annotations, regardless of the number of uncertain annotations.

IV.3 Delta PTA

With apathetic PTA, the reward component of matching annotations from the gold standard is paired with the penalty of missing annotations from the same. In this adaptation, we compute $\Delta$ as the absolute difference between the total number of annotations and the number of gold-standard annotations desired.

$$\Delta = \big|\, |A(T)| - |C(T)| \,\big|$$

We want $\Delta$ to be as close as possible to zero to minimize the number of unrecognized annotations. Consequently, we define the reward component to encourage this result: $f_c = 1 - \frac{\Delta}{\Delta_{max}}$. Furthermore, we want A(T) to have many matching C(T) answers; thus we penalize missing gold standard annotations using the same $f_e$ as in apathetic PTA.

$$\mathrm{FIT}_D(T) = \left[1 - \frac{\Delta}{\Delta_{max}}\right]^2 - \left[\frac{|C(T) \setminus A(T)|}{|C(T) \setminus A(T)|_{max}}\right]^2 \tag{eq. 7}$$

IV.4 Optimistic PTA

The former pessimistic, apathetic, and delta versions of PTA reward known correct annotations (equations 3,4,6) while penalizing known errors (equation 3), missed answers (equations 6,7) and uncertainty (equation 4). With optimistic PTA we assume (optimistically) that unknown annotations are more often right than wrong. Consequently, we reward configurations that produce more annotations over those that output fewer. To this end we redefine the reward component $f_c$ of the function as the normalized ratio of the number of annotations produced, $f_c = \frac{|A(T)|}{|A(T)|_{max}}$:

$$\mathrm{FIT}_O(T) = \left[\frac{|A(T)|}{|A(T)|_{max}}\right]^2 - f_e(E(A(T)))^2 \tag{eq. 8}$$

Equation 8 is available for a two-class testing set and one-class negative label testing set (Table 2 #1, #3). However, for the one-class positive labeled testing set, the term $f_e$ needs to be calculated without assistance from negative testing samples. The $f_e$ term of equation 5 could be used in this circumstance but works against the optimistic assumption that unknown annotations are most likely correct. In section IV.5, we compensate for this rigidness.

IV.5 Stochastic Optimistic PTA

Optimistic PTA in its current form (equation 8) cannot use a one-class positive label testing set without 'borrowing' $f_e$ from its pessimistic counterpart (equation 5). While equation 4 assumes 100% of the unknown (unlabeled) annotations are errors, stochastic optimistic PTA tries to estimate the expected number of errors within this unknown set.
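The four deterministic variants introduced so far differ only in which counts they reward and penalize, which a short Python sketch makes explicit. It is an illustration under assumptions rather than the authors' code: `pta_fitness`, `normalized` and the `maxima` dictionary (holding the running $|Y|_{max}$ values) are hypothetical names, annotations are modelled as plain Python sets, and the optimistic branch borrows the pessimistic $f_e$ of equation 5, as the text notes is possible for a one-class positive testing set.

```python
def normalized(count, max_count):
    # |Y| / |Y|_max; returns 0 when no maximum has been observed yet.
    return count / max_count if max_count else 0.0

def pta_fitness(variant, produced, gold, maxima):
    """produced = A(T), gold = C(T); maxima holds the observed |Y|_max values."""
    matched = len(produced & gold)            # |A(T) ∩ C(T)|
    missed = len(gold - produced)             # |C(T) \ A(T)|
    unknown = len(produced - gold)            # |A(T) \ C(T)|
    delta = abs(len(produced) - len(gold))    # | |A(T)| - |C(T)| |
    if variant == "pessimistic":              # eq. 4 with eq. 5
        reward = normalized(matched, maxima["matched"])
        penalty = normalized(unknown, maxima["unknown"])
    elif variant == "apathetic":              # eq. 6
        reward = normalized(matched, maxima["matched"])
        penalty = normalized(missed, maxima["missed"])
    elif variant == "delta":                  # eq. 7
        reward = 1.0 - normalized(delta, maxima["delta"])
        penalty = normalized(missed, maxima["missed"])
    elif variant == "optimistic":             # eq. 8 with f_e borrowed from eq. 5
        reward = normalized(len(produced), maxima["produced"])
        penalty = normalized(unknown, maxima["unknown"])
    else:
        raise ValueError(f"unknown variant: {variant}")
    return reward ** 2 - penalty ** 2

# Made-up example: 4 annotations produced, 3 in the gold standard, 2 in common.
produced = {"a", "b", "c", "d"}
gold = {"a", "b", "e"}
maxima = {"matched": 4, "missed": 4, "unknown": 4, "delta": 4, "produced": 8}
scores = {v: pta_fitness(v, produced, gold, maxima)
          for v in ("pessimistic", "apathetic", "delta", "optimistic")}
print(scores)
```

On this made-up input the pessimistic and optimistic scores cancel to zero because the two unknown annotations are penalized in full, while the apathetic and delta scores stay positive because only the single missed gold-standard annotation counts against the configuration.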
Stochastic Optimistic PTA assumes that if oracle-specific annotations are missing from the annotator's output, then this is an indicator of poor performance of the annotation tool. However, it might also happen that a tool produces more correct annotations than what the testing set mentions, with perhaps better semantic quality (Section III caveats iv, v and Table 3). On the other hand, one might assume that although the gold standard might not list all the possible annotations (caveat iii), those annotations that are listed are likely to be the most important in the context of the given text fragment. In other words, the annotations that form the gold standard are the results the annotator should produce. Consequently, Stochastic Optimistic PTA introduces an adjustment factor $(\varphi)$ to equation 5 to soften the 100% error assumption while still emphasizing the importance of the gold standard. First, precision (P) and recall (R) are computed:

$$P(T) = \frac{|A(T) \cap C(T)|}{|A(T)|} \quad \text{and} \quad R(T) = \frac{|A(T) \cap C(T)|}{|C(T)|} \tag{eq. 9}$$

Then, a likelihood of error is calculated using:

$$L_\varphi(T) = 1 - \left[\frac{(1 + \beta^2) \cdot P(T) \cdot R(T)}{\beta^2 \cdot P(T) + R(T)}\right] \tag{eq. 10}$$

The ratio in brackets [] is the familiar Fβ-score metric often used to balance precision and recall. When β<1, unknown annotations are more likely considered errors, whereas β>1 puts the emphasis on matching answers to the gold standard. We use β=1 to equally weight precision and recall. To avoid a division-by-zero error, we consider the case when precision is zero. In such a circumstance the likelihood of error is 1.
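A small Python sketch of the likelihood-of-error computation follows. It assumes that equation 10 takes the form of one minus the Fβ-score, a reading consistent with the description above; the function name `likelihood_of_error` and the illustrative counts (9 annotations produced, 5 gold-standard annotations, 4 matches) are not part of the original method description.

```python
def likelihood_of_error(n_produced, n_gold, n_matched, beta=1.0):
    # With no matches precision is zero, so the likelihood of error is 1
    # (this also avoids dividing by zero below).
    if n_produced == 0 or n_gold == 0 or n_matched == 0:
        return 1.0
    precision = n_matched / n_produced        # eq. 9, P(T)
    recall = n_matched / n_gold               # eq. 9, R(T)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return 1.0 - f_beta                       # assumed form of eq. 10

# Illustrative counts: 9 produced, 5 in the gold standard, 4 in common.
likelihood = likelihood_of_error(n_produced=9, n_gold=5, n_matched=4)
print(round(likelihood, 3))
```

With β=1 the bracketed ratio reduces to 2·|A(T)∩C(T)| / (|A(T)| + |C(T)|), so for these counts the likelihood is 1 − 8/14 = 3/7.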
Next, the expected number of errors is determined for an estimate of the actual number of errors of A(T).

$$\varphi(T) = L_\varphi(T) \times |A(T) \setminus C(T)| \tag{eq. 11}$$

To illustrate, consider the example of Table 3. TagME produced nine annotations (|A(T)|) of which four matched the testing set (|A(T)∩C(T)|) out of a possible five known correct (|C(T)|). What remained was five annotations of unknown classification {washington dc, home, teams, howard, bender} (|A(T)∖C(T)|). Substituting these values for equations 9, 10 and 11 gives a likelihood of error of $L_\varphi(T) = 0.429$ with an expected error count of $\varphi(T) = 2.15$. The true error count of the unknown set is 3 {teams, howard, bender}.

The last step is to compute $f_e$ by normalizing $\varphi(T)$, then substituting into equation 8.

$$\mathrm{FIT}_S(T) = \left[\frac{|A(T)|}{|A(T)|_{max}}\right]^2 - \left[\frac{\varphi(T)}{\varphi(T)_{max}}\right]^2 \tag{eq. 12}$$

V. Experimentation

In this section, we evaluate the different forms of PTA against our one-class positive labeled gold standard under varying assumptions. Specifically, we examine how each form of PTA performs when the unrecognized annotations are believed to be mostly wrong (pessimistic), mostly right (optimistic), partially correct (stochastic/delta), or unimportant (apathetic).

We used a genetic algorithm (GA) to search the solution space as outlined in Figure 1 of Section IV. The GA parameters were defined as follows: an initial population of 10 random configurations, with a maximum surviving population of 30 configurations per generation and a mutation rate of 0.05.
We began with a testing set of 15 randomly selected text fragments from our gold standard set of wiki-annot30 (Section III), with a sample replacement rate of 5 text fragments per generation. The GA would terminate once the configurations within the surviving population stabilize. The top-ranked configurations for each PTA variant were then evaluated against a random set of 1000 text fragments from the gold standard and compared to the default out-of-the-box configuration of four popular semantic annotators: TagME, Wikipedia Miner, DBpedia Spotlight, and Yahoo Content Analysis. We also considered numerous other semantic annotators that were ultimately not evaluated due to unavailability of a Web service or a restrictive end-user license agreement. Lastly, we focused our tests around Wikipedia-centric annotation tools, leaving other knowledge-based annotation tools for future work. Table 4 lists all semantic annotators we considered for PTA evaluation.

Table 4. List of tested and considered semantic annotators for PTA evaluation.

| Annotator | Tested | URL and notes for considered but not tested annotators |
|---|---|---|
| TagME | YES | http://tagme.di.unipi.it/tagme_help.html#tagging |
| DBPedia Spotlight | YES | https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service#Candidates |
| Wikipedia Miner | YES | http://wikipedia-miner.cms.waikato.ac.nz/services/?wikify |
| Yahoo Content Analysis | YES | https://developer.yahoo.com/contentanalysis/ |
| AIDA | NO | Authors' Web service is for demonstration purposes only. They explicitly request not to use their service for research purposes. http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/webservice/ |
| Denote | NO | Denote RESTful Web service is in beta and not available for use at the time of writing this paper. http://inextweb.com/denote_demo |
| AlchemyAPI | NO | Restrictive Terms of Use. Alchemy cannot be used for the purpose of benchmarking and/or comparing with other annotators, especially for those having a competing annotation service. http://www.alchemyapi.com/api/register.html |
| Aylien | NO | Restrictive Terms of Service. Terms prohibit publication of any results obtained using their annotator without permission. No response from Aylien when contacted to obtain consent. http://aylien.com/text-api-tos |
| TextRazor | NO | Trial account allows for only 500 annotations per day, which is insufficient for training and testing. https://www.textrazor.com/plans |
| OntoText | NO | Formerly known as KIM - The Semantic Annotation Platform. No accessible Web service without first consultation with the commercial company. http://www.ontotext.com/products/ontotext-semantic-platform/ |

V.1 PTA with TagME

The TagME annotation service offers fast execution and high accuracy, particularly with short text fragments (Chiong and Beng, 2007; Cornolti, Ferragina and Ciaramita, 2013). Table 5 provides summary statistics and the recommended configuration for TagME using each of the five variants of PTA, including TagME's default configuration for side-by-side comparison. Summary statistics include the average of the following measures on the 1000 gold standard text fragments: number of annotations A(T), number of annotations matching the gold standard A(T)∩C(T), number of unmatched gold standard annotations C(T)\A(T), and count of unrecognized annotations A(T)\C(T).
Derived from these measures are precision, recall, and F1-score, also included in Table 5. These measures provide competing dimensions that impact each variant of PTA differently. For instance, TagME's default configuration produces the highest number of annotations with an average of 9.56, but also produces the highest number of unrecognized annotations, 6.22 on average. Comparatively, the pessimistic PTA solution produces the fewest unrecognized annotations (0.228 on average), but does so by allowing only a few annotations (1.63 on average). Table 5 also includes an EMPTY% metric defined as the percentage of text fragments that returned no annotations and/or no gold standard annotations, A(T)∩C(T)=∅. Pessimistic PTA appears to do well in the precision dimension, but its scarce annotator output produces a large share of empty annotations, at 33.8%.

Figure 2 provides graphs of the recall, precision, and F1 scores on all 1000 individual text fragments, sorted from lowest to highest score, instead of the average-only values shown in Table 5. Mean values of recall, precision, and F1 equate to the scaled [0,1] area-under-the-curve (AUC) for Figure 2. Precision and recall (equation 9) were calculated as the ratio of matched gold standard annotations to the total number of annotations produced (precision), and to the total number of gold standard annotations (recall).
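The relationship between the per-fragment curves and the reported means can be illustrated with a short Python sketch. The helper names (`fragment_scores`, `scaled_auc`, `empty_rate`) and the three toy fragments are hypothetical. The sketch assumes each fragment is a pair of sets (A(T), C(T)) and treats the sorted curve as N bars of width 1/N, which is why the scaled area equals the mean.

```python
def fragment_scores(produced, gold):
    # Precision, recall and F1 for one text fragment (eq. 9 with beta = 1).
    matched = len(produced & gold)
    precision = matched / len(produced) if produced else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def scaled_auc(scores):
    # Sort ascending as in the figures; N bars of width 1/N give an area
    # equal to the arithmetic mean of the scores.
    ordered = sorted(scores)
    return sum(ordered) / len(ordered)

def empty_rate(fragments):
    # Share of fragments with no matched annotations, A(T) ∩ C(T) = ∅.
    return sum(1 for produced, gold in fragments if not (produced & gold)) / len(fragments)

# Toy data: three fragments given as (A(T), C(T)) pairs.
fragments = [({"a", "b"}, {"a"}), (set(), {"x"}), ({"x", "y"}, {"x", "y"})]
f1_scores = [fragment_scores(a, c)[2] for a, c in fragments]
f1_auc = scaled_auc(f1_scores)
empty = empty_rate(fragments)
print(f1_auc, empty)
```

Sorting does not change the mean, so it only affects how the curve is drawn; the ranking of configurations by F1-AUC is the same as the ranking by mean F1.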
The recall graph shows that TagME's default configuration and apathetic PTA performed identically as the best configurations for matching the largest number of gold standard annotations, followed closely by stochastic, delta, and optimistic PTA. Pessimistic PTA performed the worst with respect to matching gold standard annotations due to its conservative nature of only providing 1.63 annotations on average. With regard to precision, the solutions offered by optimistic, delta, and stochastic PTA outperform the default and apathetic configurations. Finally, when both precision and recall are considered together through the F1 measure, we see that the stochastic, delta, and optimistic configurations perform the best with average F1-scores of 0.683, 0.677, and 0.672, respectively, compared with apathetic and default scores of 0.521 and 0.493, respectively.

Table 5. Default and recommended values for TagME's tunable parameters (first 3 rows), followed by summary statistics (mean values) for comparison metrics computed on the 1000 text fragments of the gold standard using the tunable parameters' default values (1st column) and values recommended by different forms of PTA.

| | Default | Pessimistic (fc: ↑ A(T)∩C(T); fe: ↓ A(T)\C(T)) | Apathetic (fc: ↑ A(T)∩C(T); fe: ↓ C(T)\A(T)) | Optimistic (fc: ↑ A(T); fe: ↓ A(T)\C(T)) | Stochastic (fc: ↑ A(T); fe: ↓ E[A(T)\C(T)]) | Delta (fc: ↑ ±A(T)-C(T); fe: ↓ C(T)\A(T)) |
|---|---|---|---|---|---|---|
| *epsilon | 0.30 | 0.494 | 0.282 | 0.156 | 0.427 | 0.357 |
| *long_text | 0 | 6 | 0 | 10 | 10 | 7 |
| *rho | 0.0 | 0.551 | 0.0221 | 0.2662 | 0.1613 | 0.2429 |
| C(T) | 4.031 | 4.031 | 4.031 | 4.031 | 4.031 | 4.031 |
| A(T) | 9.567 | 1.631 | 8.778 | 3.721 | 4.915 | 4.029 |
| A(T)∩C(T) | 3.347 | 1.403 | 3.346 | 2.757 | 3.128 | 2.861 |
| C(T)\A(T) | 0.684 | 2.628 | 0.685 | 1.274 | 0.903 | 1.17 |
| A(T)\C(T) | 6.22 | 0.228 | 5.432 | 0.964 | 1.787 | 1.168 |
| PRECISION | 0.375 | 0.607 | 0.407 | 0.728 | 0.659 | 0.710 |
| RECALL | 0.827 | 0.324 | 0.827 | 0.673 | 0.768 | 0.699 |
| F-SCORE | 0.493 | 0.401 | 0.521 | 0.672 | 0.683 | 0.677 |
| EMPTY% | 1.6% | 33.8% | 1.6% | 6.9% | 2.8% | 5.6% |

Figure 2. Graphs of recall (top left), precision (bottom left), and F1-score (right) for each variant of PTA including default configuration used by TagME on 1000 text fragments.

The results indicate that if the objective is to match as many gold standard annotations as possible, without concern for unrecognized annotations, TagME's default or PTA's apathetic solution would be the best option, as they most closely achieve C(T) ⊆ A(T). However, if the goal is to avoid uncertainty, i.e., A(T) ⊆ C(T), then either optimistic, delta, or stochastic PTA would be the best candidates. Finally, for the behaviour that most closely resembles the gold standard output, i.e., C(T)=A(T), stochastic PTA would be a good choice.

V.2 PTA with WikipediaMiner

WikipediaMiner is a toolkit that provides semantic services, including semantic annotation, through a downloadable software library or Web service.
The default configuration along with the solutions recommended by the five variants of PTA are given in Table 6. Pessimistic PTA discovered a solution very similar to WikipediaMiner's default configuration. Apathetic PTA produced the highest number of annotations per text fragment A(T) with an average of 14.23 annotations, but at the expense of the highest number of unknowns A(T)\C(T) (11.2 on average). This resulted in a high recall of 0.755, but a low precision of 0.245 and an F1-score of 0.350.

Table 6. Default and recommended values for WikipediaMiner's tunable parameters (first 3 rows), followed by summary statistics (mean values) for comparison metrics computed on the 1000 text fragments of the gold standard using the tunable parameters' default values (1st column) and values recommended by different forms of PTA.

| | Default | Pessimistic (fc: ↑ A(T)∩C(T); fe: ↓ A(T)\C(T)) | Apathetic (fc: ↑ A(T)∩C(T); fe: ↓ C(T)\A(T)) | Optimistic (fc: ↑ A(T); fe: ↓ A(T)\C(T)) | Stochastic (fc: ↑ A(T); fe: ↓ E[A(T)\C(T)]) | Delta (fc: ↑ ±A(T)-C(T); fe: ↓ C(T)\A(T)) |
|---|---|---|---|---|---|---|
| *minProbability | 0.50 | 0.53 | 0.023 | 0.33 | 0.23 | 0.46 |
| *disambiguationPolicy | strict | strict | loose | strict | loose | strict |
| *weight | 0.0 | 0.205 | 0.0427 | 0.427 | 0.663 | 0.199 |
| C(T) | 4.031 | 4.031 | 4.031 | 4.031 | 4.031 | 4.031 |
| A(T) | 3.862 | 3.862 | 14.229 | 4.435 | 6.027 | 4.185 |
| A(T)∩C(T) | 2.099 | 2.099 | 3.029 | 2.303 | 2.718 | 2.217 |
| C(T)\A(T) | 1.932 | 1.932 | 1.002 | 1.728 | 1.313 | 1.814 |
| A(T)\C(T) | 1.763 | 1.763 | 11.2 | 2.132 | 3.309 | 1.968 |
| PRECISION | 0.525 | 0.525 | 0.245 | 0.519 | 0.468 | 0.519 |
| RECALL | 0.511 | 0.511 | 0.755 | 0.566 | 0.674 | 0.543 |
| F-SCORE | 0.485 | 0.485 | 0.350 | 0.508 | 0.525 | 0.498 |
| EMPTY% | 15.3% | 16.8% | 2.9% | 11.6% | 5.1% | 13.4% |

The recall graph of Figure 3 demonstrates that apathetic and stochastic PTA perform the best in this metric, followed by optimistic, delta, default, and pessimistic, which perform about the same. The results were the opposite for precision, with pessimistic, default, delta, and optimistic performing well, followed closely by stochastic and lastly by apathetic due to its high unknown count. Stochastic PTA gave the best overall solution with comparable precision to the default configuration, but with significantly better recall. In addition, stochastic PTA generated the second-lowest share of empty annotations with 5.1%, compared to the second-worst score of 15.3% from the default configuration.

Figure 3. Graphs of recall (top left), precision (bottom left), and F1-score (right) for each variant of PTA including default configuration used by WikipediaMiner on 1000 text fragments.

V.3 PTA with DBpedia Spotlight

We tested PTA on the DBPedia Spotlight candidates Web service. The candidates service is an annotation service that returns a ranked list of annotations per mention. This ranking includes output statistics entitled contextual score, percentage of second rank, and final score, with the intent that the application using the annotator service will prune the ranked list based on the chosen thresholds for these statistics. Consequently, the goal of PTA is to discover what threshold values to use. Therefore, it was not surprising that the default configuration gave the best results for recall (0.52, Table 7) but poor precision (0.19), and the highest number of annotations per text fragment (14.8).
Pessimistic PTA had the most difficulty, with the lowest F1-score (0.02) caused by a high number of unmatched gold standard annotations (3.976 on average) and a low number of matching gold standard answers. Stochastic PTA would have provided the best precision if not for the many empty solutions it gave (Figure 4 and Table 7). Overall, optimistic PTA gave the second best recall with the best precision and thus is the recommended solution with the top F1-score of 0.38.

V.4 PTA with Yahoo Content Analysis (YCA)

The Yahoo Content Analysis API provides named entity linking to Wikipedia entities through its Web service. PTA mean statistics show apathetic PTA leads in seven of the eight metrics (Table 8). However, the difference is statistically inconsequential and did not translate to a significant improvement over the default configuration. The AUC graphs of Figure 5 reveal that the stochastic, apathetic, and delta forms of PTA perform as well as YCA's default configuration using different local maxima solutions. Optimistic PTA was a close second while pessimistic struggled.

Table 7. Default and recommended values for Spotlight's tunable parameters (first 5 rows), followed by summary statistics (mean values) for comparison metrics computed on the 1000 text fragments of the gold standard using the tunable parameters' default values (1st column) and values recommended by different forms of PTA.

| | Default | Pessimistic (fc: ↑ A(T)∩C(T); fe: ↓ A(T)\C(T)) | Apathetic (fc: ↑ A(T)∩C(T); fe: ↓ C(T)\A(T)) | Optimistic (fc: ↑ A(T); fe: ↓ A(T)\C(T)) | Stochastic (fc: ↑ A(T); fe: ↓ E[A(T)\C(T)]) | Delta (fc: ↑ ±A(T)-C(T); fe: ↓ C(T)\A(T)) |
|---|---|---|---|---|---|---|
| *confidence | 0.20 | 0.14 | 0.029 | 0.29 | 0.62 | 0.020 |
| *support | 20 | 48 | 5 | 1 | 1 | 355 |
| *contextualScore | 0.0 | 0.132 | 0.0508 | 0.0 | 0.0 | 0.115 |
| *percentageOfSecondRank | 0.0 | 0.0290 | 0.596 | 0.0 | 0.0 | 0.459 |
| *finalScore | 0.0 | 0.491 | 0.128 | 0.0 | 0.228 | 0.004 |
| C(T) | 4.031 | 4.031 | 4.031 | 4.031 | 4.031 | 4.031 |
| A(T) | 14.89 | 0.072 | 4.227 | 5.042 | 1.496 | 3.153 |
| A(T)∩C(T) | 2.085 | 0.055 | 1.477 | 1.8 | 0.878 | 1.022 |
| C(T)\A(T) | 1.946 | 3.976 | 2.554 | 2.231 | 3.153 | 3.009 |
| A(T)\C(T) | 12.8 | 0.017 | 2.75 | 3.242 | 0.618 | 2.131 |
| PRECISION | 0.190 | 0.0373 | 0.403 | 0.406 | 0.403 | 0.334 |
| RECALL | 0.522 | 0.0157 | 0.372 | 0.451 | 0.218 | 0.263 |
| F-SCORE | 0.251 | 0.0211 | 0.345 | 0.384 | 0.259 | 0.262 |
| EMPTY% | 12.4% | 96% | 22.6% | 17.6% | 48.5% | 36% |

Figure 4. Graphs of recall (top left), precision (bottom left), and F1-score (right) for each variant of PTA including default configuration used by DBpedia Spotlight on 1000 text fragments.

Table 8: Default and recommended values for YCA's tunable parameters (first 2 rows), followed by summary statistics (mean values) for comparison metrics computed on the 1000 text fragments of the gold standard using the tunable parameters' default values (1st column) and values recommended by different forms of PTA.

| | Default | Pessimistic (fc: ↑ A(T)∩C(T); fe: ↓ A(T)\C(T)) | Apathetic (fc: ↑ A(T)∩C(T); fe: ↓ C(T)\A(T)) | Optimistic (fc: ↑ A(T); fe: ↓ A(T)\C(T)) | Stochastic (fc: ↑ A(T); fe: ↓ E[A(T)\C(T)]) | Delta (fc: ↑ ±A(T)-C(T); fe: ↓ C(T)\A(T)) |
|---|---|---|---|---|---|---|
| *max | 100 | 9 | 18 | 6 | 99 | 98 |
| *score | 0.0 | 0.917 | 0.455 | 0.594 | 0.192 | 0.007 |
| C(T) | 4.031 | 4.031 | 4.031 | 4.031 | 4.031 | 4.031 |
| A(T) | 1.439 | 0.243 | 1.457 | 1.233 | 1.444 | 1.443 |
| A(T)∩C(T) | 1.071 | 0.199 | 1.082 | 0.938 | 1.075 | 1.070 |
| C(T)\A(T) | 2.960 | 3.832 | 2.949 | 3.093 | 2.956 | 2.961 |
| A(T)\C(T) | 0.368 | 0.044 | 0.375 | 0.295 | 0.369 | 0.373 |
| PRECISION | 0.520 | 0.160 | 0.529 | 0.511 | 0.525 | 0.521 |
| RECALL | 0.268 | 0.0491 | 0.271 | 0.238 | 0.268 | 0.266 |
| F-SCORE | 0.331 | 0.0719 | 0.335 | 0.306 | 0.332 | 0.329 |
| EMPTY% | 39.9% | 83.6% | 38.7% | 42.4% | 39.2% | 39.6% |

Figure 5: Graphs of recall (top left), precision (bottom left), and F1-score (right) for each variant of PTA including default configuration used by YCA on 1000 text fragments.

VI. Result Summary

Table 9 provides a summary of the experimental results. For each examined annotator, the table presents the forms of PTA that evaluated as the best (+) and the worst (-) performing for mean measures of: most annotations, most annotations matching the gold standard, least unrecognized annotations, least empty answers and best precision/recall. Also included is the overall (recommended) PTA variant based on AUC graphs. Noteworthy is that the best PTA variant per individual dimension does not necessarily indicate the best performing overall solution. As an example, consider WikipediaMiner.
Apathetic PTA scored highest for most annotations, most matched, least empty, and best recall;                            however, the recommended solution is stochastic which was not the top (nor bottom) of any single                                measure. The table also shows that no single variant of PTA is best as this is a function of annotator                                        behavior, testing set, and user assumptions (unknown annotations are mostly right, mostly wrong, or                            partially correct). Nonetheless, it would appear pessimistic PTA is excessively conservative and is                          often surpassed by its probabilistic counterpart: stochastic. Consequently, the pessimistic PTA variant                        should be avoided. From Table 9, the following summary observations can  be made:    ● Apathetic PTA emerged as a good choice for most annotations, most matched and best recall.   ● Optimistic PTA performed well when precision is the primary concern.   ● Stochastic PTA was the prevailing general­purpose strategy with an overall best solution for                          three of the four annotators tested.       Table 9: Summary table indicating the best(+) and the worst(­) solutions for individual measures by averages plus                                  overall recommended solution based on AUC graphs​.     
| Annotator | Most annotations, A(T) | Most matched, A(T)∩C(T) | Least unknown, A(T)\C(T) | Least empty answers, EMPTY % | Best recall | Best precision | Overall best AUC-F1 |
|---|---|---|---|---|---|---|---|
| TagME | +default -pessimistic | +default +apathetic -pessimistic | +pessimistic -default | +default +apathetic -pessimistic | +default +apathetic -pessimistic | +optimistic -default | +stochastic -pessimistic |
| WikipediaMiner | +apathetic -default -pessimistic | +apathetic -default -pessimistic | +default +pessimistic -apathetic | +apathetic -pessimistic | +apathetic -default -pessimistic | +default +pessimistic -apathetic | +stochastic -apathetic |
| DBpedia Spotlight | +default -pessimistic | +default -pessimistic | +pessimistic -default | +default -pessimistic | +default -pessimistic | +optimistic -pessimistic | +optimistic -pessimistic |
| Yahoo (YCA) | +apathetic -pessimistic | +apathetic -pessimistic | +pessimistic -apathetic | +apathetic -pessimistic | +apathetic -pessimistic | +apathetic -pessimistic | +default +apathetic +stochastic +delta -pessimistic |

Our experimental results demonstrate that PTA is effective. In all annotators tested, PTA either: 1) suggested an improved configuration, or 2) validated the default configuration as a local maximum solution. PTA successfully found alternative configurations that better fitted the testing set than the default parameters of TagME, WikipediaMiner, and DBpedia Spotlight.

VII. Related Literature

Semantic annotation tools offer the possibility of deeper analysis of textual content by disambiguating terms and phrases present in the text and linking them to appropriate concepts from a knowledge base, hence enabling more efficient and accurate classification, organization, search and retrieval of textual content.
The research community has already developed a critical mass of both research prototypes and usable software products in this area. Automated semantic annotation tools such as those examined in this paper, TagMe (Ferragina & Scaiella, 2012), Denote (Cuzzola et al., 2013), DBpedia Spotlight (Mendes, Jakob, García-Silva & Bizer, 2011), and Wikipedia Miner (Milne & Witten, 2013), among others, provide the means to automatically process textual content semantically and identify relevant semantic concepts.

Recently, these tools have been increasingly referred to as entity linking tools since they link entity mentions in the text with the corresponding entry/entries in a knowledge base. Shen, Wang & Han (2015) have provided a comprehensive and detailed survey of the state-of-the-art automated entity linking (i.e., semantic annotation) tools. They identified three modules that such tools consist of, namely candidate entity generation, candidate entity ranking and unlinkable mention prediction, and for each module presented and analyzed the different methods and techniques that were proposed in the literature and applied in existing annotation tools. They also reported that the examined tools “differ along multiple dimensions and are evaluated over different data sets”.

The large majority of today’s semantic annotators rely on a general-purpose knowledge base (KB) such as Wikipedia, or on more structured and semantically rich KBs like DBpedia, YAGO, and Wikidata.
However, there are also domain-specific semantic annotators, many of them in the biomedical domain, e.g., MetaMap (Aronson & Lang, 2010) and NCBO annotator (Whetzel et al., 2013), that rely on domain-specific KBs such as the Unified Medical Language System (UMLS) (Bodenreider, 2004), DrugBank (Law et al., 2014) or the medical ontologies available at NCBO BioPortal (Whetzel et al., 2011). In order to assure good domain coverage, both in terms of breadth and depth of the entities covered, some annotation tools rely on more than one KB. For instance, the semantic annotation method proposed by Berlanga, Nebot & Pérez (2015) relies on the use of several KBs of arbitrary size and domain specificity, and is also independent of any specific characteristic of a KB (e.g., disambiguation pages, internal links and other Wikipedia-specific features). Some semantic annotators, such as TagMe and the one developed by WalmartLabs (Gattani et al., 2013), are specifically designed and developed for semantic annotation of social media content, which is typically short, fragmented, poorly spelled and grammatically incorrect. Since such an annotator requires a global and almost real-time KB, Gattani et al. (2013) expanded Wikipedia with data from various structured sources (e.g., Adam (health), MusicBrainz (albums), City DB, and Yahoo Stocks), as well as with new interesting events extracted from the Twitter data stream.

The algorithms developed as a part of semantic annotation systems, e.g., spot identification and disambiguation, often rely on user-dependent fine-tuning of input parameters. These parameters allow the user to customize the algorithms for the specific topic or textual content type (Jovanovic et al., 2014). While annotation tools do provide suggested default values for these parameters, these defaults are by no means optimal for all scenarios. To our knowledge, no existing related work or implemented software has attempted to address the problem of automated parameter fine-tuning for semantic annotation tools; therefore, our proposed PTA framework serves as a first foundational step in this direction.

Having said that, it is important to point out that while researchers have not addressed the problem of parameter fine-tuning for semantic annotators, there has been some fruitful work on the systematic evaluation of semantic annotation systems. For instance, Cornolti, Ferragina & Ciaramita (2013) have developed a framework consisting of metrics and standard datasets for measuring the efficiency and effectiveness of existing semantic annotation tools. The major contribution of their work is its novel approach to the systematic classification of the different tasks of a semantic annotator and the development of suitable metrics for each of these specific tasks.
The work by Steinmetz, Knuth & Sack (2013) is also focused on the evaluation of semantic annotation tools; however, their attention is centered more on the statistical analysis of different benchmark and dictionary datasets that can be used in the evaluation process. Heuss, Humm, Henninger & Rippl (2014) compared the performance of several state-of-the-art semantic annotation tools on domain-specific texts (namely texts about museum collections). The study found that, on average, each tool achieved on these domain-specific texts only about a third of the F1 score it achieves on texts covering general/common topics. The results also showed very high standard deviations for all performance measures (recall, precision and F1), indicating not only lower performance than in the common case, but also a wider distribution of the results. This is consistent with the findings of Shen, Wang & Han (2015), who reported that the tools examined in their survey tend to perform very differently for different data sets and domains. It is worth pointing out that existing comparative studies of semantic annotation tools have all relied on the suggested default parameter settings for the compared systems; therefore, they do not necessarily reflect the best-case performance of the annotation tool on the objects of the experiment. It could very well be the case that, had the optimal parameter values been chosen instead of the default values, the obtained results would have been significantly different.
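The per-text measures discussed above (precision, recall and F1, together with their mean and standard deviation across texts) can be illustrated with a minimal Python sketch. It is not taken from any of the cited studies: the function names and the toy corpus are hypothetical, and it assumes that every annotation outside the gold standard counts as an error, which is only one of several possible assumptions about unknown annotations.

```python
from statistics import mean, pstdev


def precision_recall_f1(annotated: set, gold: set) -> tuple:
    """Score one text: `annotated` plays the role of A(T), `gold` of C(T).

    Assumption: every annotation outside the gold standard is an error.
    """
    matched = len(annotated & gold)
    precision = matched / len(annotated) if annotated else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def summarize(texts: list) -> dict:
    """Mean and population standard deviation of each measure over all texts."""
    columns = list(zip(*(precision_recall_f1(a, c) for a, c in texts)))
    names = ("precision", "recall", "f1")
    return {n: (mean(col), pstdev(col)) for n, col in zip(names, columns)}


# Hypothetical toy corpus: (annotator output, gold standard) per text.
toy_texts = [
    ({"Paris", "Seine"}, {"Paris", "France", "Louvre"}),
    (set(), {"Berlin"}),  # an empty answer scores zero on all three measures
    ({"Rome", "Italy", "Tiber"}, {"Rome", "Italy"}),
]

for name, (avg, spread) in summarize(toy_texts).items():
    print(f"{name}: mean={avg:.3f} std={spread:.3f}")
```

Because the measures are averaged per text, the mean precision is in general not equal to the ratio of the mean number of matched annotations to the mean number of produced annotations.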
It should be mentioned that besides automated semantic annotation tools, there are also semi-automated semantic annotators that allow for user intervention during the annotation process. This intervention often takes the form of choosing the best option from a list of candidate annotations, or removing some of the proposed annotations that the user considers incorrect or irrelevant. While considerable research and development efforts were put into the design and development of semi-automated annotation tools (Uren et al., 2005; Oliveira & Rocha, 2013), their reliance on active human participation impacts their efficiency, and thus they have been largely superseded by fully automated tools. Still, there are some usage scenarios, e.g., scholarly reading, where, due to their mixed-initiative annotation approach, semi-automated annotators are preferred. For instance, based on a study of scholarly annotation practices, Müller-Birn, Klüwer, Breitenfeld, Schlegel & Benedix (2015) have designed and developed Neonion, a lightweight annotation tool for creating, sharing and reusing annotation data. Neonion users can accept, reject or modify annotations recommended by the tool; annotations take the form of references to appropriate Wikidata (Vrandečić & Krötzsch, 2014) entities. The feedback that users provide is leveraged for improving subsequent recommendations. Annotation of medical texts is another domain where human involvement is often needed to assure the accuracy of automatically produced annotations.
For example, RapTAT is a semi-automated semantic annotation tool based on an interactive and iterative machine learning approach, and aimed at assisting end users with the annotation of various kinds of medical texts (Gobbel et al., 2014). In each iteration, the tool annotates potentially relevant phrases within a document, presents the annotations to a reviewer for correction, and then uses the obtained feedback (i.e., corrected annotations) to re-train its machine learning model before annotating the next document.

Although work on finding optimal parameter values for semantic annotation tools is novel, it is important to point out that the use of evolutionary algorithms such as Genetic Algorithms for optimal parameter estimation in control systems has been commonplace (Chang, 2006). For instance, Seng et al. (1999) used Genetic Algorithms to simultaneously tune the parameters of a self-tuning fuzzy logic control system. Similarly, Yao & Sethares (1994) used Genetic Algorithms for optimizing the structure and parameters of feedforward and recurrent neural networks, and showed that this approach can reduce the estimation error to zero in probability. Genetic Algorithms have also been widely used for parameter tuning in Proportional-Integral-Derivative (PID) controllers (Panda, 2011), power system optimization (Kothari, 2012) and HVAC systems (Kusiak, Tang & Xu, 2011). They were also used to deal with the challenges of image annotation (Bahrami & Abadeh, 2014).
In the Semantic Web domain, Genetic Algorithm-based approaches can be observed in a variety of applications (Chen, Wu & Cudré-Mauroux, 2012), such as finding optimal ontology alignments between multiple ontologies (Martinez-Gil, Alba & Aldana-Montes, 2008); identification and alignment of datasets of the Linked Open Data cloud (Gunaratna, Lalithsena & Sheth, 2014); RDF query answering (Oren, Guéret & Schlobach, 2008); and semantic Web service discovery (Sangers, Frasincar, Hogenboom & Chepegin, 2013) and composition (Fanjiang & Syu, 2014), among others, and have been shown to be effective for such optimization problems.

VIII. Limitations and Future Work

We have proposed a method, called Parameter Tuning Architecture (PTA), for finding suitable values for tunable parameters of semantic annotators (Section IV) and demonstrated this method on four semantic annotators (Section V). Note that the annotators themselves were not trained by PTA; instead, alternative parameter values were proposed that better suited the targeted evaluation/testing set (wiki-annot30). PTA could be extended for semantic annotator training, namely, as a tool to decide on the fallback parameter defaults when no alternative values are supplied. This would simply require the use of a training set rather than a testing set with PTA.
However, since we were unable to acquire the exact training set used for each of the four annotators in our experiments, we were not able to compare how general-purpose default values suggested by PTA would perform relative to the defaults currently chosen by the annotators’ designers. Even if the training sets were available, there may be internal tunable parameters that are only accessible to the designers, and not exposed through the annotator’s public interface. Moreover, since each of the tested annotators was trained with a different gold standard on (most likely) a different version of Wikipedia, subsequent work should include experimentation with other datasets such as Microsoft’s “Entity Recognition and Disambiguation Challenge” (http://web-ngram.research.microsoft.com/ERD2014/) and/or ClueWeb (http://lemurproject.org/clueweb09/FACC1/) for further validation of our PTA framework.

Future work may also include validation with other knowledge-based annotation systems (non-Wikipedia) that rely on life sciences, biomedical, or other ontologies. Additionally, we propose an extension to our fitness function (equation 1) with the introduction of weights $(\delta_c, \delta_e)$:

$$\mathrm{FIT}(T) = \arg\max_{v \in V} \; \delta_c \, [f_c(T)]^2 - \delta_e \, [f_e(T)]^2, \quad \text{where } \delta_c + \delta_e = 1$$

This would allow for an emphasis toward either finding more correct annotations $f_c$ or fewer annotations in error $f_e$. The weights $(\delta_c, \delta_e)$ themselves could be part of the tunable parameter vector $v$, in which case PTA would be tasked not only with finding suitable parameter values for the annotator in question but also with suggesting parameter values for itself. Finally, our future work will investigate other evolutionary algorithms, such as swarm optimization, and compare their performance with the currently used genetic algorithm (Figure 1).

IX. Conclusion

In this paper we presented Parameter Tuning Architecture (PTA), a general method to determine “best-fit” values of configuration parameters for semantic annotators. We explained the caveats of supervised testing specific to semantic annotators and devised PTA variants to tackle the uncertainty of unlabeled annotations. We tested these variants on four well-known semantic annotators and provided a method for selecting the best solution using a genetic algorithm and an area-under-the-curve metric. Experimental results indicate that PTA is capable of suggesting configurable parameters that improve upon the specific individual areas of most annotations, most matched gold standard answers, and least uncertainty. Finally, our tests demonstrate that PTA can consistently find a configuration that provides an overall best solution, i.e., a solution with the best precision versus recall trade-off, for many semantic annotators. We balance our findings by acknowledging the limitations of our work and propose five future research directions for further study (Section VIII).
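As an illustration of the weighted fitness extension proposed in Section VIII, the following minimal Python sketch shows how the weight $\delta_c$ can shift the selected configuration. It is a sketch under stated assumptions rather than the PTA implementation: the function names, the exhaustive search over a small candidate set (standing in for the genetic algorithm), and the toy $(f_c, f_e)$ scores are all hypothetical.

```python
from typing import Callable, Iterable, Tuple


def weighted_fitness(f_c: float, f_e: float, delta_c: float) -> float:
    """Return delta_c * f_c**2 - delta_e * f_e**2, with delta_c + delta_e = 1."""
    if not 0.0 <= delta_c <= 1.0:
        raise ValueError("delta_c must lie in [0, 1]")
    delta_e = 1.0 - delta_c
    return delta_c * f_c ** 2 - delta_e * f_e ** 2


def best_configuration(
    candidates: Iterable[float],
    evaluate: Callable[[float], Tuple[float, float]],
    delta_c: float,
) -> float:
    """Arg max over candidate parameter values.

    `evaluate(v)` is a hypothetical callback returning (f_c, f_e) for the
    annotator configured with parameter value v. A plain exhaustive search
    is used here in place of the genetic algorithm.
    """
    return max(candidates, key=lambda v: weighted_fitness(*evaluate(v), delta_c))


# Hypothetical scores: parameter value -> (f_c, f_e). Not experimental data.
toy_scores = {0.1: (1.4, 0.40), 0.5: (1.0, 0.10), 0.9: (0.2, 0.04)}

balanced = best_configuration(toy_scores, toy_scores.__getitem__, delta_c=0.5)
cautious = best_configuration(toy_scores, toy_scores.__getitem__, delta_c=0.05)
print(balanced, cautious)  # prints: 0.1 0.5
```

With equal weights the configuration producing the most correct annotations wins; once errors are weighted much more heavily, a more conservative configuration is selected instead.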
Since the PTA fitness function, the core component of the proposed method, does not rely on any annotator-specific feature, our PTA method is applicable to any semantic annotator, and can be used to enhance the annotator’s performance on any specific annotation task. Besides being directly beneficial for semantic annotation tools, the proposed method might also be indirectly useful to any intelligent system that relies on semantically rich annotation of textual content, such as text search and retrieval systems, various content-based recommender systems, systems that rely on the semantics of textual content to support personal or business decision making, and the like.

References

A. R. Aronson and F.-M. Lang (2010). An overview of MetaMap: historical perspective and recent advances, 17(3), pp. 229-236, JAMIA. http://metamap.nlm.nih.gov/

S. Atdağ and V. Labatut (2013). A Comparison of Named Entity Recognition Tools Applied to Biographical Texts, pp. 228-233, 2nd International Conference on Systems and Computer Science.

S. Bahrami and M.S. Abadeh (2014). Automatic image annotation using an evolutionary algorithm (IAGA), pp. 320-325, 7th International Symposium on Telecommunications (IST 2014).

R. Berlanga, V. Nebot and M. Pérez (2015). Tailored semantic annotation for semantic search, 30, pp. 69-81, Web Semantics: Science, Services and Agents on the World Wide Web.

O. Bodenreider (2004). The Unified Medical Language System (UMLS): integrating biomedical terminology, 32 (Database Issue), pp. 267-270, Nucleic Acids Research. http://www.nlm.nih.gov/research/umls

W.D. Chang (2006). An improved real-coded genetic algorithm for parameters estimation of nonlinear systems, 20(1), pp. 236-246, Mechanical Systems and Signal Processing.

H. Chen, Z. Wu and P. Cudré-Mauroux (2012). Semantic Web Meets Computational Intelligence: State of the Art and Perspectives, 7(2), pp. 67-74, IEEE Computational Intelligence Magazine.

R. Chiong and O.K. Beng (2007). A Comparison between Genetic Algorithms and Evolutionary Programming based on Cutting Stock Problem, vol. 14, pp. 72-77, Engineering Letters.

M. Cornolti, P. Ferragina and M. Ciaramita (2013). A Framework for Benchmarking Entity-Annotation Systems, pp. 249-260, 22nd International World Wide Web Conference.

J. Cuzzola, D. Gasevic, E. Bagheri, Z. Jeremic, J. Jovanovic and R. Bashash (2013). Semantic Tagging with Linked Open Data, vol. 1054, pp. 52-53, 4th Canadian Semantic Web Symposium (CSWS 2013).

X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, et al. (2014). Knowledge vault: a web-scale approach to probabilistic knowledge fusion, pp. 601-610, 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14).

Y. Fanjiang and Y. Syu (2014). Semantic-based automatic service composition with functional and non-functional requirements in design time: A genetic algorithm approach, 56(3), pp. 352-373, Information and Software Technology.

P. Ferragina and U. Scaiella (2012). Fast and Accurate Annotation of Short Texts with Wikipedia Pages, 29(1), pp. 70-75, IEEE Software.

A. Gattani, D. S. Lamba, N. Garera, M. Tiwari, X. Chai, et al. (2013). Entity extraction, linking, classification, and tagging for social media: a wikipedia-based approach, pp. 1126-1137, Proc. VLDB Endow. 6, 11 (August 2013).

G. Gobbel, J. Garvin, R. Reeves, R.M. Cronin, J. Heavirland, et al. (2014). Assisted annotation of medical free text using RapTAT, 21(5), pp. 833-841, J Am Med Inform Assoc.

J. Grefenstette (1992). Genetic Algorithms for Changing Environments, pp. 139-146, Parallel Problem Solving from Nature 2.

K. Gunaratna, S. Lalithsena and A.P. Sheth (2014). Alignment and Dataset Identification of Linked Data in Semantic Web, 4(2), pp. 139-151, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.

B. Hachey, W. Radford, J. Nothman, M. Honnibal and J.R. Curran (2013). Evaluating Entity Linking with Wikipedia, 194, pp. 130-150, Journal of Artificial Intelligence.

T. Heuss, B. Humm, C. Henninger and T. Rippl (2014). A comparison of NER tools w.r.t. a domain-specific vocabulary, pp. 100-107, Proceedings of the 10th International Conference on Semantic Systems (SEM '14), H. Sack, A. Filipowska, J. Lehmann and S. Hellmann (Eds.). ACM, New York, NY, USA.

E. Hovy, R. Navigli and S.P. Ponzetto (2013). Collaboratively built semi-structured content and Artificial Intelligence: The story so far, 194, pp. 2-27, Journal of Artificial Intelligence.

J. Jovanovic, E. Bagheri, J. Cuzzola, D. Gasevic, Z. Jeremic and R. Bashash (2014). Automated Semantic Annotation of Textual Content, 16(6), pp. 38-46, IEEE IT Professional.

D.P. Kothari (2012). Power system optimization, pp. 18-21, 2nd National Conference on Computational Intelligence and Signal Processing (CISP).

A. Kusiak, F. Tang and G. Xu (2011). Multi-objective optimization of HVAC system with an evolutionary computation algorithm, 36(5), pp. 2440-2449, Energy.

V. Law, C. Knox, Y. Djoumbou, T. Jewison, A.C. Guo, et al. (2014). DrugBank 4.0: shedding new light on drug metabolism, 42(1), pp. D1091-7, Nucleic Acids Res.

B. Liu (2012). Sentiment Analysis and Opinion Mining (Synthesis Lectures on Human Language Technologies). Morgan & Claypool Publishers.

J. Martinez-Gil, E. Alba and J.F. Aldana-Montes (2008). Optimizing ontology alignments by using genetic algorithms, Workshop on Nature based Reasoning for the Semantic Web.

D. Maynard (2008). Benchmarking Textual Annotation Tools for the Semantic Web, 6th International Conference on Language Resources and Evaluation.

P.N. Mendes, M. Jakob, A. García-Silva and C. Bizer (2011). DBpedia spotlight: shedding light on the web of documents, pp. 1-8, 7th International Conference on Semantic Systems. ACM.

D. Milne and I.H. Witten (2013). An open-source toolkit for mining Wikipedia, vol. 194, pp. 222-239, Journal of Artificial Intelligence.

D. Moriarty, A. Schultz and J. Grefenstette (1999). Evolutionary Algorithms for Reinforcement Learning, 11, pp. 241-276, Journal of Artificial Intelligence Research (JAIR).

C. Müller-Birn, T. Klüwer, A. Breitenfeld, A. Schlegel and L. Benedix (2015). Neonion: Combining Human and Machine Intelligence, pp. 223-226, Proceedings of the 18th ACM Conference Companion on Computer Supported Cooperative Work & Social Computing (CSCW'15 Companion).

P. Oliveira and J. Rocha (2013). Semantic annotation tools survey, pp. 301-307, IEEE Symposium on Computational Intelligence and Data Mining (CIDM).

E. Oren, C. Guéret and S. Schlobach (2008). Anytime query answering in RDF through evolutionary algorithms, pp. 98-113, 7th International Semantic Web Conference (ISWC 08).

S. Panda (2011). Multi-objective PID controller tuning for a FACTS-based damping stabilizer using Non-dominated Sorting Genetic Algorithm-II, 33(7), pp. 1296-1308, International Journal of Electrical Power & Energy Systems.

L. Ratinov and D. Roth (2009). Design challenges and misconceptions in named entity recognition, pp. 147-155, Thirteenth Conference on Computational Natural Language Learning (CoNLL '09).

J. Sangers, F. Frasincar, F. Hogenboom and V. Chepegin (2013). Semantic Web service discovery using natural language processing techniques, 40(11), pp. 4660-4671, Expert Systems with Applications.

T.L. Seng, M. Bin Khalid and R. Yusof (1999). Tuning of a neuro-fuzzy controller by genetic algorithm, 29(2), pp. 226-236, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics.

W. Shen, J. Wang and J. Han (2015). Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions, 27(2), pp. 443-460, IEEE Transactions on Knowledge & Data Engineering.

N. Steinmetz, M. Knuth and H. Sack (2013). Statistical Analyses of Named Entity Disambiguation Benchmarks, pp. 21-25, 1st International Workshop on NLP and DBpedia.

H. Szczerbicka, M. Becker and M. Syrjakow (1998). Genetic Algorithms: a Tool for Modelling, Simulation, and Optimization of Complex Systems, 29(7), pp. 639-659, Cybernetics and Systems.

V. Uren, P. Cimiano, J. Iria, S. Handschuh, M. Vargas-Vera, E. Motta and S. Ciravegna (2005). Semantic annotation for knowledge management: Requirements and a survey of the state of the art, 4(1), pp. 14-28, Journal of Web Semantics.

D. Vrandečić and M. Krötzsch (2014). Wikidata: A Free Collaborative Knowledge Base, 57(10), pp. 78-85, Communications of the ACM.

J. Weston, A. Bordes, O. Yakhnenko and N. Usunier (2013). Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction, pp. 1366-1371, Conference on Empirical Methods in Natural Language Processing.

P.L. Whetzel, N.F. Noy, N.H. Shah, P.R. Alexander, C. Nyulas, et al. (2011). BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications, 39, pp. W541-5, Nucleic Acids Res. http://bioportal.bioontology.org/

P.L. Whetzel and NCBO Team (2013). NCBO Technology: powering semantically aware applications, 4 (Suppl 1), S8, J. Biomed. Semantics.

Y. Yan, N. Okazaki, Y. Matsuo, Z. Yang and M. Ishizuka (2009). Unsupervised relation extraction by mining Wikipedia texts using information from the web, vol. 2, pp. 1021-1029, Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP.

L. Yao and W. A. Sethares (1994). Nonlinear parameter estimation via the genetic algorithm, 42(4), pp. 927-935, IEEE Transactions on Signal Processing.