doi:10.1016/j.eswa.2005.07.034 XKey: A tool for the generation of identification keys M. Delgado Calvo-Flores a , W. Fajardo Contreras a , E.L. Gibaja Galindo b,*, R. Pérez-Pérez a a E.T.S. de Ingenierı́a Informática, Departamento de Ciencias de la Computación e Inteligencia Artificial, Universidad de Granada, C/Periodista Daniel Saucedo Aranda, 18071 Granada, Spain b Escuela Politécnica Superior, Departamento de Informática y Análisis Numérico, Universidad de Córdoba, Campus de Rabanales, Edificio Albert Einstein, 3a planta, 14071 Córdoba, Spain Abstract This paper presents the development of XKey, a tool for generating taxonomical identification keys by means of decision tree construction. The tool is based on an XML standard for the representation of general taxonomical information, which makes it ideal for different fields of application. The article analyses the problem by examining the adaptation of machine learning techniques to the sphere of biology so as to incorporate the viewpoints of biologists and computer science experts. It also analyses the effect of using various division criteria on a set of real data: the Gymnosperm plant groups present in the Iberian peninsula. q 2005 Elsevier Ltd. All rights reserved. Keywords: Dichotomous keys; Decision trees; XML; Division criteria; Interactivity 1. Introduction Identification keys are fundamental in the study of biodiversity. This term, which arises from concern for the environment, and the need for conservation and the sustainable use of natural resources, is defined as the variability of living organisms on earth and includes all organizational levels: from the simplest of individuals to ecosystems (even the entire planet). The first step in its study is to identify the types of organisms present in the biotype being analyzed (in the field, the researcher finds an organism, but in a database register, its identification is stored). This identifi- cation (or determination) consists in recognizing characters in a sample, so as to give it a name which was previously given to a similar organism. Given the age of taxonomy as a science, the most used identification tool in the last 200 years is the printed key; a tool which must also be included in field guides and floras. In general, experts design their keys manually, without the use of any computer support tool. The development of a key is 0957-4174/$ - see front matter q 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2005.07.034 * Corresponding author. E-mail addresses: mdelgado@decsai.ugr.es (M. Delgado Calvo- Flores), egibaja@uco.es (W. Fajardo Contreras), egibaja@uco.es (E.L. Gibaja Galindo). therefore time-consuming, and also extremely costly when an error is detected: the later the error is detected, the more it will cost to rectify it. Since the computer tools developed to support this process have been suggested by specialist biologists, we find that considerable improvements could be made. This article presents XKey, a tool for generating identification keys, and is organized in the following way: Section 2 details the background of the problem, focusing on the aspects related to the representation of taxonomical knowledge and the tools for the generation of previously designed keys; Section 3 analyses the problem and describes the division criteria used by the tool; Section 4 examines the characteristics of XKey and its contribution to its field of application; Section 5 outlines the test procedure and the results obtained; and finally, Section 6 presents a series of conclusions and the references consulted to produce this work. 2. Background 2.1. Identification keys An identification key is a tree-shaped structure which presents the user with a series of choices at each step. We can distinguish between: † Printed keys. Most printed keys have a treelike structure (see Fig. 1). They are very powerful tools when dealing Expert Systems with Applications 30 (2006) 337–351 www.elsevier.com/locate/eswa http://www.elsevier.com/locate/eswa Single leaves Leaves which are green on both sizes Leaves which are tomentous on the back Myrtus communis (myrtle) Olea europaea (wild olive) Composite leaves Trifoliate leaves Pinnate-composite leaves Jasminum fructicans (jasmine) Pistacia lentiscus (lentisk) Fig. 1. Example of an identification key (Morales, Quesada, & Baena, 2001). M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351338 with primary identification characters (Fortuner, 1989), or rather characters which are capable of differentiating between species or groups of species with a very small risk of error. They proceed by elimination, and, for this reason, identification is prevented as they do not have the necessary information to go on to the next level. In order to avoid this, multi-entry or tabular keys are used whereby the elimination of species or groups depends on several characters. The problem of this type of key is that it is difficult to use with large genera. † Computerized keys. This type of key is characterized by its interactivity: a computer enables a multi-entry key to be easily used. Within this group, we should also include the graphic key, in which the choices are presented in image format to represent the options. It is not possible to say that one type of key is better than another; this aspect will depend on the sphere of its use. We shall focus our study on the development of tree-like printed keys, which are widely used and whose development is not particularly automated. Fig. 2. General structure of an SDD document 1.0. In TDWG-SDD (2005). 2.2. Digital representation of taxonomical knowledge The automatic generation of identification keys requires highly structured descriptive information to be used. In spite of having described certain models such as Lif (UBio, 2003), Delta (Dallwitz, 1974; Dallwitz, Paine, & Zurcher, 2000) and Nexus (Maddison, Swofford, & Maddison, 1997), there is no widely accepted and used standard model, and this prevents a greater use of digital taxonomical infor- mation and leads to substantial inefficiency for taxonomy as a whole. At the present moment in time, information globalization is being pursued so that it may be used by different authors and with different ends (identification, key production, floras, field guides, etc.). Delta has been standard of the IUBIS-TDWG since 1991, and in September 1998, after analyzing the new challenges for the scientific community, the subgroup SDD was set up to develop an XML-based standard for representing and handling descrip- tive information of organisms. The aim is to design a sufficiently standardized knowledge representation model which is accepted by the entire community of biologists, whatever their discipline of origin (zoology, botany, etc.). SDD shall attempt to provide a flexible and independent platform to facilitate the exchange of data sets without loss of information between applications, in addition to using a same description for different purposes. Fig. 2 shows the general structure of an SDD document. 2.3. Tools for the generation of identification keys There are certain tools for generating interactive keys manually. This is the case of LucID (CBIT, 1994), Linnaeus II (Schalk & Heijman, 1996; Schalk & Troost, 1999), Xid (Intelsys, Inc., 2000***–2001), Meka (Duncan & Meacham, 1986; Meacham, 1986–1996), NaviKey (Bartley & Cross, 2000; University of Toronto, Department of Botany, 1996). This type of tool does not give any information about the suitability of the selected characters for branching nor does it check the existence of inconsistencies in the key (unclassified objectives, dead ends, etc.). Pankey (Pankhurst, 1991; Pankhurst & Pullan, 1994) is a commercial system based on Delta and available for MS-Dos which incorporates two tools: Key3m3 (to generate keys totally automatically), and Kconi (to generate keys under the user’s supervision). It is capable of showing the partial key resulting from the choice of characters made by the user and to suggest the best option at each step. We should also mention Delta which is a general system for the field of systematics based on the model with the same name which includes Key, a printed key generation program M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351 339 which uses text files to register the user’s preferences and thereby offer some interactivity. This review reveals the shortage of tools, and also the fact that until recently, they were developed by biologists and have become obsolete. In addition, they incorporate few facilities for interactivity, and can be improved from the computing point of view. Consequently, we have developed XKey, an SDD-based tool which combines proven artificial intelligence techniques with the specific functional needs of taxonomy. It has been designed and implemented within the framework of a multidisciplinary team and from the study of previously developed tools, and therefore represents an integratory vision from the viewpoints of biologists and computer specialists. 3. Analysis of the problem 3.1. Construction of identification keys An identification key is a tree-like structure in which each node represents a character, and the arcs its possible values. Because of its structure, it is constructed by performing recursive partitions of the set of training cases in order to finally obtain a decision tree. In short, when a node is constructed or expanded, the subset of training cases belonging to each class is considered. If all the examples belong to one class, or if some stopping rule is verified, the node is a leaf of the tree. Otherwise, the training set is divided into subsets (using a rule of heuristic division), and the same procedure is applied to each subset of the training set. Due to the special characteristics of biology, an identification key does not totally fit the definition of a decision tree in the classic sense. Below, we shall describe the characteristics of these keys and how they are adapted by our algorithm to generate identification keys, as outlined in Fig. 3. 3.2. Objectives of the division criterion The general aim is to obtain a compact decision tree. In accordance with the principle of Occam’s razor, a small decision tree enables a better understanding of the classification model obtained. This principle, although allowing for the construction of easily understandable models, does not guarantee that these are better than other apparently more complex ones (Domingos, 1998, 1999). From the point of view of taxonomy, a good division rule (Dallwitz et al., 2000): 1. divides the taxa into uniform subgroups; 2. selects characters with high reliability and low intra- taxon variability (we understand intra-taxon variability to be the set of characters which can distinguish between individuals or populations belonging to a same taxonomical category); 3. discriminates the individuals which are going to be identified more frequently in the first steps (this frequency is also called abundance). Although many division criteria have been proposed, we shall focus our study on four of these: the first three because they are classic and widely used, and the fourth because it has been proposed from the area of taxonomy. 3.3. Classic division criteria Within this first group of criteria, we can distinguish: † Entropy (Quinlan, 1986). This measures the impurity of a node of the tree and reaches its minimum when all the elements of the node are the same. It satisfies the requirements established by Dallwitz, i.e. it divides the taxa into more uniform groups and minimizes the intra- taxon variability. It also considers the abundance of an individual, so that the most frequent will appear in the first steps of the key. The entropy normally constructs decision trees with a high degree of branching, which favors questions with the most possible results. This aspect does not always reflect the work methodology of the expert biologist, who, depending on the objective with which the key is developed, may have other criteria which are different from the minimization of the length. † The gain ratio criterion (Quinlan, 1993). This measure- ment is based on the entropy which normalizes the information gain obtained in order to avoid the construc- tion of decision trees which classify the cases using their keys. It has been observed that, like the entropy, it favors partitions of the training set which are very uneven in size when any of them is very pure (all the cases which it includes correspond to the same class) even if it is not particularly significant (covering very few training cases). † The Gini diversity index (Breiman, Friedman, Olshen, & Stone, 1984). This is a measurement for class diversity in a tree node which attempts to minimize the impurity existing in the subsets of training cases generated when the decision tree is branched. 3.4. Division criteria in taxonomy In the area of taxonomy, very few criteria have been described for generating identification keys, and one such example is that used by Dallwitz et al. (2000). This criterion calculates the cost of using a certain character i according to the expression in Formula 1. Since it concerns a cost measurement, the aim is to minimize it, and consequently characters with a low cost will be preferred. where: † s is the number of values for the attribute; † nj is the number of taxa in the jth subgroup; START Are there cases in the training set? Can tree branching continue? Calculate division criterion and select best attribute Do all cases belong to the same class c Generate leaf node labeled class c END No No No Yes Yes Yes For each subset resturn to START Generate intermediate node labeled with th selected attribute Add the confirmatory characters to the node Subdivide the training set into as many subsets as the number of values in the attribute If an example presents null values and the interpretation is: “unknown”, add the example to all subsets “inapplicable”, do not add the example to any subset If several attributes ar candidates, determine their priority: from the information in the data set by asking the user Interactive mode? Ask the user for the attribute and the branching node No .. .. Fig. 3. Algorithm for the generation of identification keys. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351340 † c is the cost of a character. This is related to the reliability using the expression cZRbase5Kr, where: B r is the value ascribed to the reliability of the character (a value fixed by the expert) and takes values in [0–10]. B Rbase is a constant which takes values in the interval [1,5]. † cmin is the lowest cost for the characters being considered; † fj is the total frequency of the items in the jth group. (0) Flexible leaves; Discolorous leaves Yes; Exserted bracts; Color of the branches ash-gray; Pubescent branches .............Species Abies alba (0) Rigid leaves; Discolorous leaves No; Included bracts; Color of the branches reddish-brown; Glabrous branches .............Species Abies pinsapo main character confirmatory characters Fig. 4. Key with confirmatory characters. Author: Eva Gibaja Galindo. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351 341 The frequency f of an item is given by the expression where: B a is the abundance of the item (a value fixed by the expert) and takes values in [0–10] B Abase is a constant which takes values in the interval [1,5]. † V controls the effect of the intra-taxon variability. In its expression: B Varywt is a constant value. If its value is 0, those characters which have any intra-taxon variability are excluded from the key. If, on the other hand, the value of this parameter is 1, this aspect is not penalized. It takes values in the interval [0–1]. B n is the number of taxa. 3.5. Treatment of null values When a key is generated, it is common for null values to appear in the data set. The final result of the process depends on the interpretation made by the algorithm of these. In taxonomy, there could be three different interpretations for the appearance of null values: 1. The value does not appear because the attribute is inapplicable for a certain taxon. In this case, the taxon is not added to any branch of the tree. It should be observed that when this interpretation of the null values is used, the taxon to which the null value corresponds could remain unclassified. 2. The value does not appear because it is indifferent. This means that, for a certain taxon, the attribute can take any valid value. If this character is used for branching the decision tree, it is necessary to add the taxon with the null value to each branch of the tree. 3. The value does not appear because it is unknown. In this case, and in order to avoid loss of information, the same procedure is followed as in the previous case. 3.6. Reliability of a character In taxonomy, not all the characters have the same relevance: some are more important than others, for example, owing to the ease with which they are observed, because they are distinguishing characters, etc. This aspect is called the reliability of a character (Dallwitz, 1974). If the expert provides information about the most reliable character for branching, it is possible to combine the measurement given by a division criterion and expert knowledge about the domain, in order to decide in those cases in which several characters present the same value of the division criterion. In this way, the key will adapt better to the characteristics of each taxonomical group and will be more satisfactory. 3.7. Confirmatory characters It is normal in a key to include more than one character in the same node of the classification tree (see Fig. 4). This means that the node contains a main character and set of characters called confirmatory characters (Dallwitz et al., 2000). This is a fundamental aspect since it enables: † identification to be continued when the main character is not available; † confirmation of the decision to continue identification by a certain node. The main character and the confirmatory characters are equivalent, and any of these can be used to branch the tree. This means that, within the taxa group being considered, the two characters: 1. have the same value for the division criterion; 2. have the same number of possible values; 3. generate identical taxa groups since they are used as the branching character. 3.8. Construction of knowledge bases By taking advantage of the construction of the decision tree, it is possible to generate a knowledge base comprising rules from this. In botany, the classification problem and subsequent identification has singular characteristics: a fold does not, in every case, have the complete individual for its identification, and the samples gathered frequently depend on seasonality, age of the individual, etc. This can lead to a very well-informed consultation not offering conclusions if there is no value for any character. In order to tackle this aspect, various decision trees are constructed (which shall subsequently be converted into rules), taking a different generator node each time. For this reason, it is necessary to include a consistency reinforcer, and since the syntax of a rule is sufficiently restrictive, it systematically analyses each rule in the knowledge base to detect and correct unnecessary if conditions, redundant rules, conflictive rules, and subsumed rules (see Fig. 5). 3.9. Treatment of uncertainty Given the characteristics of taxonomical identification, the application of some method for the treatment of Fig. 5. Algorithm for the XKey consistency reinforcer. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351342 uncertainty is extremely useful in order to enrich knowledge base consultation. There are various theories for treating uncertainty such as the theory of evidence (Dempster, 1967; Shafer, 1976), or the probability or fuzzy set theory (Zadeh, 1965). Due to the absence of historical data and the difficulties of defining fuzzy variables and probability mass functions, we have chosen the theory of certainty factors (Shortliffe & Buchanan, 1975). This model has been used successfully in many other recently published systems (Cabrero-Carnosa, Castro-Pereiro, Graña-Ramos, Hernández-Pereira, Moret-Bonillo and Martı́n-Egaña, 2003; Mahaman, Harizanis, Filis, Antonopoulou, Yialouris and Sideridis, 2002). Furthermore, the theory of certainty factors is a computationally simple and more natural model for a non-mathematical user than other approaches (Harrison & Kovalchic, 1998). Taking all this into consideration, each rule in the knowledge base has an associated certainty factor (CF). This value, which is determined during generation and remains unchanged unless redefined by an expert, is ascribed according to the following criteria: † if a pure leaf node is generated (all the elements belong to the same class), the associated rule has an associated certainty factor with a value of C1; † if the leaf node is not pure, the result is that of a rule with disjunction in the consequent, which is fractionized into rules with simple consequents. The certainty factor of the original rule is also fractionized by the following Formula 2 where † p(x) is the probability of the objective in question occurring in the initial table; † p(x/a) is the probability of objective x occurring knowing that antecedent a occurs, regardless of whether this is simple or compound. 4. Characteristics of XKey 4.1. Use of different division criteria XKey enables the division criterion to be selected, and includes the four criteria presented in Sections 3.3 and 3.4: entropy, the gain ratio criterion, the Gini diversity index, and the division rule proposed by Dallwitz. This allows generated keys to be compared from a same set of data but using different criteria and aids a better adaptation to each specific problem. The criterion is selected from the user options menu (Fig. 6). 4.2. Treatment of null values In its operation mode, XKey reflects the three different interpretations which we can give to the appearance of null values in taxonomy (see Fig. 5) and enables the distinction to be made between: 1. not applicable (the taxon with the null value is not added to any tree branch); 2. indifferent (XKey adds the taxon with the null value to each of the tree branches); 3. unknown (XKey acts as in Case 2). SDD also enables explicit representation of null values. For this, it uses the global states ‘Unknown’ and ‘NotApplicable’. An attribute with the special value ‘Unknown’ will be treated by XKey as unknown (Case 3), whereas an attribute with the special value ‘NotApplicable’ will be treated as inapplicable (Case 1) independently of the general interpretation with which XKey has been configured. 4.3. Control of the reliability of a character The SDD data sets do not include information about the reliability of a character, so when two attributes have Fig. 6. Execution options of XKey. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351 343 the same value for the division criterion, the construction of the tree depends on the order in which the data are represented. In order to avoid this situation, XKey shows the user the different alternatives, the current state of the tree, and the objectives which must still be classified so that he/she may select the attribute which they consider to be the most suitable. It is also possible to ascribe a priority value in Fig. 7. Attribute selection execution time to a specific attribute. This weight will be remembered by the system throughout execution and will be used to decide between several branching attributes with the same value for the division criterion (see Fig. 7). At certain times, the data set can be too large to apply this strategy, or it could be that the expert is not interested in making this type of choice. For this reason, users can tell in execution time. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351344 XKey at any time that they no longer want to be asked about this aspect. 4.4. Inclusion of confirmatory characters In order to adjust to this feature, XKey includes an option in the execution options menu so that users can indicate whether they want to include this type of character and the maximum number of confirmatory characters per tree node. The choice of the value will depend on the user’s criterion and on the final use of the key; the final quantity included will depend on the characteristics of each data set. It is therefore possible to indicate that the inclusion of confirmatory characters is not required. 4.5. Operation modes One difference between the learning of traditional classification models and identification keys is interactivity in character selection. The best character from the point of view of the division criterion may not be the best character from the expert’s point of view: it might be difficult to visualize, at times it is preferable to eliminate the exceptions in the first steps of the key, etc. For these reasons, XKey offers various operation modes: † Automatic mode. This applies the algorithm for the generation of decision trees without the user’s interven- tion and uses the branching criterion selected in the options menu. The algorithm for the automatic gener- ation of XKey matches the general guidelines of a TDIDT algorithm (Top-Down Induction On Decision Trees). What XKey offers is its capacity to add confirmatory characters, to treat null values according to the interpretation established by the user, and to select, in execution time, the best branching attribute (if several attributes have the same value for the division criterion). † Semi-automatic mode. This applies the algorithm for the generation of decision trees automatically, but it consults the user in those cases where there are ties in the selection of the branching attribute. XKey shows the list of all the characters which can be used for branching, ordered according to the division criterion. In this way, it provides very useful statistical information which the expert does not have access to when generating keys manually with other tools. † Interactive mode (supervised by the user). This follows the schema for automatic generation, to which inter- action facilities are added. The operations permitted during execution in interactive mode are: † Add node. The user selects the node to branch and the branching attribute, and generates the subtree corresponding to this node. XKey again shows the list of all the characters which can be used for branching, ordered according to the division criterion. † Delete node. In this case, the node and all its descendants are eliminated. For this, it is necessary to maintain an execution memory for each node so that it may return to previous states. † End. This option enables the user to establish where certain characters will appear so that the system can subsequently finish the execution automatically. When this operation is selected, XKey analyzes the classification model in order to detect the leaf nodes which do not contain classes of the example (intermediary nodes) so as to finish construction of these nodes automatically. 4.6. Diversity of output formats Once the identification key has been generated, it is presented to the user in tree form (see Fig. 8). It is possible to save this key in the formats described below: † Text format. The key is saved in a plain text file. This format facilitates the modification of the key with other more sophisticated text editors, and in turn, does not limit the use of any of these. † XML format. The key is saved in an XML format file, with a similar structure to the expected format of the SDD keys. This format enables the keys to be edited once they have been generated and for information to be easily exchanged. 4.7. Construction of a knowledge base XKey incorporates the generation of a knowledge base into its functionality from the data set as described in Sections 3.8 and 3.9. If generation of a knowledge base is chosen, this can be saved in: † CLIPS format (Giarratano, 1998). The key is saved in the form of rules for this well-known expert system shell, with which we can intercommunicate the standard XML model for taxonomical description with one of the most popular shells within the field of expert systems. † GREEN format (Fajardo, Gibaja, Bailon, & Moral, 2003). We have mentioned that as well as generating keys, XKey can generate sets of rules which can be used directly by the GREEN system. 4.8. Execution results In addition to the keys, the execution of XKey returns a set of measurements which facilitate analysis of the results obtained. This information includes: † average length of the key; † typical deviation of the length of the key (in order to compare some keys with others, the system returns Pearson’s skewness coefficient); Fig. 8. Interactive selection of attributes with XKey. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351 345 † maximum and minimum length of the key; † number of terminal and non-terminal nodes; † number of OR terminal and exclusive nodes (the appearance of OR nodes indicates that it is not possible to clearly identify a particular node, and, therefore, that the data set is incomplete); † number of total attributes in the data set and number of attributes used in the key; † number of confirmatory attributes included in the key. Table 1 Subgroups of Gymnospermae being studied Division Gymnospermae Family Araucariaceae Family Cephalotaxaceae Family Cupressaceae Family Cycadaceae Family Ephedraceae Family Ginkgoaceae Family Pinaceae Family Taxaceae Family Taxodiaceae Genus Abies Genus Calocedrus Genus Cedrus Genus Chamaecyparis Genus Cryptomeria Genus Cupressus Genus Cycas Genus Ephedra Genus Juniperus Genus Larix Genus Picea Genus Pinus Genus Platycladus Genus Pseudotsuga Genus Sequoia Genus Sequoiadendron Genus Taxodium Genus Tetraclinis 5. Test of the XKey tool 5.1. Execution data for the XKey tool and methodology followed The information provided by XKey enables the keys obtained for the same data set to be compared by using, for example, different division criteria. In our case, experiments have been carried out with a real group: Gymnosperms present in the Iberian peninsula. The evaluation of the keys generated with XKey for this group offers us a sufficiently wide set of tests for determining what criteria, in what cases or conditions, and why they enable the most ideal key to be generated. It should be noted that in order to increase the diversity of casuistries which can be presented, not only have wild or naturalized taxa been included, but also ones which have been cultivated and which are widely used in gardening. The subgroups of Gymnosperms used are described in Table 1. In order to carry out this experiment, a knowledge acquisition process has been performed with the partici- pation of two expert botanists. Bibliographical sources have also been consulted, and in particular (López Gonźlez, 2001; López Gonźlez & Do Amaral, 1986), which are essential reference books for the taxonomical group being studied. In our study, we have analyzed a set of qualitative aspects which are directly related to the reliability of the generated key: 0 5 10 15 20 25 30 35 40 G im n o sp e rm a s P in u s Ju n ip e ru s C u p re ss a ce a e P in a ce a e T a xo d ia ce a e C u p re ss u s E p h e d ra A b ie s C yc a s C e d ru s Entropy Gain Gini Dallwitz Expert Average 0 10 20 30 40 50 60 70 p e rm a s P in u s n ip e ru s ss a ce a e in a ce a e d ia ce a e p re ss u s p h e d ra A b ie s C yc a s C e d ru s Entropy Gain Gini Dallwitz Expert Average A B M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351346 † Average length of the key. This is an indicator of the average number of steps for identification. This is an important aspect since the expert often searches from the key which leads to the identification in the least number of steps. † Typical deviation of the average length. This is an indicator of the equilibrium of the keys: if the typical deviation is large, the different paths of the tree have very different lengths. † Terminal nodes/number of objectives ratio. This is an indicator of the degree of branching of the keys: if there are many more leaf nodes than objectives to be classified, the tree is extremely branched. † Internal nodes/confirmatory characters ratio. This is an indicator of the power of generating confirmatory characters of a certain division criterion. Those criteria which identify a greater quantity of confirmatory characters are best. † Characters used/total number of characters ratio. With this, it is possible to determine the ratio of attributes in the data set which have been used to generate the key and what these attributes are. † In addition to the quantitative interpretation of the results, another important aspect has been the analysis of the adaptation to the reality of the keys obtained. G im n o s Ju C u p re P T a xo C u E Fig. 10. Number of external (A) and internal nodes (B). 5.2. Observations about the average length of the keys generated From the expert botanist’s point of view, the common factor is the inadequacy of the character selection of the keys when the Gini diversity index is used as the division criterion. This criterion always produces longer keys (see Fig. 9) and trees with a greater number of internal and external nodes (see Fig. 10). If we consider the sum of differences in relation to the minimum average length (Fig. 11), the keys generated with the entropy criterion are practically in all the cases of minimum length. In second place is the profit gain ratio, followed by Dallwitz’s criterion, which does not produce as good results due to the fact that the measurement it uses for 0 1 2 3 4 5 6 G ym n o sp e rm a e P in u s Ju n ip e ru s C u p re ss a ce a e P in a ce a e T a xo d ia ce a e C u p re ss u s E p h e d ra A b ie s C yc a s C e d ru s Entropy Gain Gini Dallwitz Expert Minimum Fig. 9. Average length of the keys. counting the intra-taxon variability is cruder than that used by the two previous criteria: Dallwitz only considers the number of different taxa in each partition, whereas the other two also count the frequency of each taxon. 5.3. Observations about the equilibrium of the keys The deviation of the length of the branches in relation to the average is always greater for the Gini measure, which indicates that the length of the tree branches generated with 0 1 2 3 4 5 6 7 8 Entropy Gain Gini Dallwitz Fig. 11. Sum of the differences in relation to the minimum average length. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 G ym n o sp e rm a e P in u s Ju n ip e ru s C u p re ss a ce a e P in a ce a e T a xo d ia ce a e C u p re ss u s E p h e d ra A b ie s C yc a s C e d ru s Entropy Gain Gini Dallwitz Expert Minimum Fig. 12. Typical deviation of the average length. 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 G ym n o sp e rm a e P in u s Ju n ip e ru s C u p re ss a ce a e P in a ce a e T a xo d ia ce a e C u p re ss u s E p h e d ra A b ie s C yc a s C e d ru s Entropy Gain Gini Dallwitz Minimum Fig. 14. Terminal/objective ratio. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351 347 this criterion is more variable. The most balanced keys are those generated with the gain ratio criterion, followed by those generated with entropy and Dallwitz’s division criterion (Fig. 12). The sum of the differences in relation to the minimum typical deviation (Fig. 13) is less for the gain ratio, followed by entropy. 5.4. Observations regarding the degree of branching The gain ratio and entropy criteria generate the least branched key, i.e. they tend to detect the smallest number of possible paths for a certain objective and, therefore, to generate more compact models than Gini and Dallwitz’s criteria (Figs. 14 and 15). 5.5. Observations about the number of confirmatory characters included The entropy criterion produces a better number of confirmatory attributes/number of internal nodes ratio, followed by the profit gain criterion (see Fig. 16). If we observe the differences regarding the maximum of the confirmatory/internal nodes ratio in each case, we can see 0 0.5 1 1.5 2 Entropy Gain Gini Dallwitz Fig. 13. Sum of the differences in relation to the minimum typical deviation. how the entropy is the one which presents a smaller difference, followed by the profit gain criterion and, some way behind, by the division criteria of Dallwitz and Gini (see Fig. 17). 5.6. Observations about the number of characters used The entropy uses the least number of characters in the generation of keys, i.e. it locates the smallest amount of information in order to carry out identification. In second place is the gain ratio criterion, followed by Dallwitz’s criterion. The Gini diversity index, in addition to producing longer keys, includes a greater number of attributes in the generation of keys (Figs. 18 and 19). 5.7. Qualitative comparison of the criteria The tests carried out show that entropy and gain ratio offer the best results for all the measures proposed. We can see a summary in Table 2: 4 is the score given to the best criterion, 1 is the score for the worst criterion, and 2 and 3 are the scores for intermediary cases. Table 2 shows that it is precisely entropy and profit gain which are the best criteria. The measure used by Dallwitz also produces good results, but in some unwanted cases since it measures the intra-taxon variability more crudely. The difference with Dallwitz’s division criterion can be clearly seen in the case 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 Entropy Gain Gini Dallwitz Fig. 15. Sum of the differences in relation to the minimum terminal/objec- tive ratio. 0 0.5 1 1.5 2 2.5 G ym n o sp e rm a e P in u s Ju n ip e ru s C u p re ss a ce a e P in a ce a e T a xo d ia ce a e C u p re ss u s E p h e d ra A b ie s C yc a s C e d ru s Entropy Gain Gini Dallwitz Expert Maximum Fig. 16. Number of confirmatory characters/internal node. 0 10 20 30 40 50 60 70 80 90 100 G ym n o sp e rm a e P in u s Ju n ip e ru s C u p re ss a ce a e P in a ce a e T a xo d ia ce a e C u p re ss u s E p h e d ra A b ie s C yc a s C e d ru s Entropy Gain Gini Dallwitz Minimum Fig. 18. Ratio of total number of characters/characters used. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351348 of the keys of the genus Cedrus: in order to separate two taxa, Dallwitz gives five paths for C. atlantica and three for C. deodara (Table 3). With the entropy, we have two classes of Cedrus and one path to arrive at each species (Table 4). 5.8. Discussion and biological interpretation of the results obtained/effect of interactivity The entropy produces more optimized keys in terms of the number of questions, but in the field of biology, this aspect must be qualified. At times, there is no reason why the distinguishing character from the point of view of information theory need be the best character from the biological point of view, due, for example, to the difficulty in its observation (need for microscopic observation, temporality, size, layout, etc.). This is a fundamental aspect as the identification process is interrupted if no character is available. It is therefore necessary for those characters which as well as having a high distinguishing power can be easily observed to be included in the first steps of the key. All this entails considering not only the division criteria but also the characteristics of the taxonomical group being studied and the future receptors of the keys. The possibility of choosing between the four criteria set out in our article enables different types of keys to be 0 0.2 0.4 0.6 0.8 1 1.2 Entropy Gain Gini Dallwitz Fig. 17. Sum of the differences in relation to the maximum number of confirmatory characters/internal node. produced according to the type of work, research, and end user: † If the key is going to be all the information about the taxa which shall be available in the end, we shall need long keys which gather the greatest number of possible characters. † If the keys are the tool to access information and, once the taxon has been identified, this is accompanied by an exhaustive description, we can use more precise keys with a high level of discrimination. † The generation of keys for experts, which assumes an advanced knowledge of the group and of the plant morphology and characteristics, prefers criteria such as that of the entropy. † The case of considering keys as a tool for an intermediary user would probably entail using the ideal criterion according to the information theory, but combined with interactivity. It might be that the attributes on which the entropy has generated its key are not the ones which an expert in the field would consider to be the most suitable. In these cases, the combination of interactivity with a division criterion 0 20 40 60 80 100 120 Entropy Gain Gini Dallwitz Fig. 19. Sum of the differences in relation to the minimum of the total number of characters/characters used ratio. Table 2 Comparison of different division criteria on the Gymnospermae group Length Balance Branching Confirmatory No. of characters Total Entropy 4 3 3 4 4 18 Profit 3 4 4 3 3 17 Gini 1 1 1 1 1 5 Dallwitz 2 2 2 2 2 10 M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351 349 such as the entropy produces very satisfactory results: the entropy suggests and orders the characters, but it is the expert who ultimately decides which character to select by using his/her knowledge about the specific taxonomical group. In the case of the keys for Gymnosperm families, we have managed to reduce the average length of the key in relation to the entropy by means of the use of expert knowledge (see Fig. 20). A similar case occurs with the keys of the genus Pinus. In this case, the character in which the key is based is ‘characteristics of the apophysis’, which rapidly separates the different species of pine trees. It is not easy, however, to see this character as it refers to the pine cone, a character which is not always present and which supposes knowledge of the plant morphology which is not within the reach of any user. According to the experts’ opinion, a good attribute is ‘leaf shape’. The interactivity can be used on other levels. Continuing with the example of Pinus, the second attribute was made to be the ‘presence/absence of pine nuts’. The final result is shown in Table 5, and this became the key which best responded to the expert’s expectations (Fig. 21). 6. Conclusions and final notes Throughout this article, we have observed that the results can be better adapted to the biological reality by adding Table 3 Key for the genus Cedrus generated according to Dallwitz’s criterion Genus Cedrus /Dallwitz (0) Size of the leaves (long) between 2 and 2.5 cm 1 (0) Size of the pine cone (long) between 8 and 12 cm Species Cedrus deodara (0) Size of the pine cone (long) between 4 and 6 cm Species Cedrus atlantica (1) Size of the pine cone (long) between 6 and 8 cm Species Cedrus atlantica (0) Size of the leaves (long) between 2.5 and 3 cm 2 (2) Size of the pine cone (long) between 8 and 12 cm Species Cedrus deodara (2) Size of the pine cone (long) between 4 and 6 cm Species Cedrus atlantica (2) Size of the pine cone (long) between 6 and 8 cm Species Cedrus atlantica (0) Size of the leaves (long) between 3 and 5 cm Species Cedrus deodara (0) Size of the leaves (long) between 1 and 2 cm Species Cedrus atlantica meta-knowledge. For example, by ascribing a reliability value for the characters, it is possible to decide in those cases in which the system does not have the necessary information to make a decision. This alternative has its disadvantages: firstly, all this information must be gathered in the knowledge representation model and this makes design more complicated; secondly, the large number of characters normally being dealt with; and thirdly, all the additional meta-knowledge must be entered by the expert developing the data set. It is necessary to find a balance between the quantity of the additional information and the functionality of the tools. We should also highlight that the adaptation of the results also depends on the characteristics of the data set and on the specific problem being tackled. A good selection of identification characters will result in keys which are much more satisfactory and better adapted to reality (Fig. 22). We shall end this description of our work with a set of conclusions: 1. We have developed XKey, a tool for generating identification keys which operate directly from data sets developed with SDD. XKey can also operate with descriptions in DeltaAccess and Delta format (by means of the use of translation utilities). 2. We have shown the adaptation of artificial intelligence techniques for automatic learning and treatment of uncertainty to the area of taxonomical identification. The expert’s satisfaction with the keys generated by XKey endorse the suitability of the techniques used. 3. The output of the tool is presented in various formats: text format, XML format, and CLIPS format, and is supplemented with statistical information which facili- tates the study and comparison of the results obtained. 4. XKey generates identification keys rapidly and con- veniently, which means a considerable saving in time for the expert. It is also versatile, since it enables different division criteria to be selected, the meaning of null Table 4 Key for the genus Cedrus generated according to minimum entropy criterion Genus Cedrus/entropy (0) Size 6000 cm; Tree guide: recurved; Hanging branches: Yes Species Cedrus deodara (0) Size 5000 cm; Tree guide: not recurved; Hanging branches: Yes Species Cedrus atlantica 0 1 2 3 4 5 6 Gymnospermae Pinus Entropy Gain Gini Dallwitz Expert 1 Expert 2 Average Fig. 20. Average length of the keys for the Gymnospermae division and the genus Pinus. Fig. 21. Division criterion proposed by Dallwitz. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351350 values to be configured, distinguishing characters to be included, and weights to be ascribed to the characters in execution time. This added functionality produces keys with a more suitable biological meaning than those generated with classic decision trees. 5. In addition to generating identification keys, Xkey also generates complete, consistent knowledge bases which are compatible with the CLIPS system for subsequent use in expert systems. Table 5 Key for the genus Pinus generated according to the expert’s criterion Genus Pinus/expert criterion (0) Number of leaves per fascicle 2 1 (1) With pine nut: No; Crown shape: pyramidal; Persistent, winged seed: Yes 2 (2) Characteristics of the apophysis: prominent and sharp Species Pinus pinaster (2) Characteristics of the apophysis: not very prominent 3 (3) Colour of the bark (rhytidome): Ash-grey; Shiny pine cones: Yes; Leaf colour: intense green; Leaves: flexible Species Pinus nigra subsp. Salzmannii (3) Colour of the bark (rhytidome): Reddish-brown; Shiny pine cones: No; Leaf colour: light green; Leaves: rigid Species Pinus sylvestris (2) Characteristics of the apophysis: very prominent, hooked Species Pinus uncinata (2) Characteristics of the apophysis: not very convex Species Pinus halepensis (2) Characteristics of the apophysis: very prominent and sharp Species Pinus radiata (1) With pine nut: Yes; Crown shape: umbrella-shaped; Persistent, winged seed: No Species Pinus pinea (0) Number of leaves per fascicle 3 4 (4) Size 2500 cm; Characteristics of the apophysis: very prominent, hooked Species Pinus uncinata (4) Size 4000 cm; Characteristics of the apophysis: very prominent and sharp Species Pinus radiata (4) Size 6000 cm; Characteristics of the apophysis: prominent Species Pinus canariensis 6. The adaptation of the key to reality has been shown to depend to a large extent on the attributes selected for its branching. In view of various equivalent branching alternatives, XKey stops its execution and resorts to the user’s criterion. This semi-automatic mode of execution can be deactivated in order to operate totally automati- cally. 7. In addition to the automatic and semi-automatic execution modes, XKey has been provided with the capability of generating keys interactively. In this way, the user can select what node to branch and what attribute to use at any given moment; it is also possible to eliminate nodes from the tree, and to change from interactive to automatic mode at any time so as to finish the key generation. 8. In order to help the user, XKey presents the available characters in each step, ordered according to the division criterion. This aspect is particularly important as it enables the discrimination capacity of division rules (such as the entropy) to be combined with the human expert’s criterion and for the keys obtained to have a biological content which is more satisfactory to the expert. 9. Finally, a comparative study has been carried out of the effect of several division criteria on the generation of dichotomic keys with a sufficiently complex, real taxonomical group: Iberian Gymnosperms. This study reveals that the division criterion of the entropy is the one which produces the shortest keys and which favors the inclusion of a greater number of distinguishing characters. The gain ratio criterion generates somewhat longer keys, but in return, they are more balanced (the length of the paths is less variable). These two criteria detect the most important information for the classifi- cation of a set of taxa. Dallwitz’s criterion does not offer as good results in terms of the average length of the key, power of generating confirmatory characters, etc. Fig. 22. Calculation of the certainty factor of a rule with OR in the consequent. M. Delgado Calvo-Flores et al. / Expert Systems with Applications 30 (2006) 337–351 351 In spite of this, it can be useful in cases where the objective is not to minimize the length of the key. The same happens with the Gini diversity index, with the observation that this last criterion is not advisable with large data sets since it produces excessively complicated keys. References Bartley, M., & Cross, N. (2000). Navikey 2.0 (En linea, Consulta: 03/06/2003, http://www.huh.harvard.edu/databases/legacy/navikey/ index.html). Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. California: Wadsworth. Cabrero-Carnosa, M., Castro-Pereiro, M., Graña-Ramos, M., Hernández- Pereira, E., Moret-Bonillo, V., Martı́n-Egaña, M., et al. (2003). An intelligent system for the detection and interpretation of sleep apneas. Expert Systems with Applications, 24, 335–349. CBIT (Centre for Biological Information Technology, University of Queensland) (1994). LucID version 2.1 (En lı́nea. Consulta: 25/05/2003. http://www.lucidcentral.com/default.htm). Dempster, A. (1967). Upper and lower probabilities induced by a multi- valued mapping. Annals of Mathematical Statistics, 38(2), 325–399. Dallwitz, M. J. (1974). A flexible computer program for generating identification keys. Systematic Zoology, 1, 50–57. Dallwitz, M. J., Paine, T. A., & Zurcher, E. J. (2000). User’s guide to the DELTA system: A general system for processing taxonomic descrip- tions, 4.12 edition. Canberra: CSIRO Division of Entomology (En lı́nea, Consulta: 25/05/2003, http://biodiversity.uno.edu/delta/). Domingos, P. (1998). Occams’s two razors: The sharp and the blunt Proceedings of the fourth international conference of knowledge discovery and data mining (KDD-98), August 27–31, New York City, USA pp. 34–37. Domingos, P. (1999). The role of occams’s raxor in knowledge disvorvery. Data Mining and Knowledge Discovery Volume, 3, 409–425. Duncan, T., & Meacham, C. A. (1986). Multiple-entry-keys for the identification of angiosperm families using a microcomputer. Taxon, 35, 492–494. Fajardo, W., Gibaja, E., Bailon, A., & Moral, P. (2003). An application of expert systems to botanical taxonomy. Expert Systems with Appli- cations, 25/3, 425–430 (Elsevier/2003). Fortuner, R. (1989). Nematode identification and expert system technology. New York: Plenum Publishing Corp.. Giarratano, J. E. (1998). CLIPS user’s guide. NASA, Artificial Intelligence Section, L.B. Johnson Space Center. Harrison, P. R., & Kovalchik, J. G. (1998). Expert systems and uncertainty. In J. Liebowitz (Ed.), Handbook of applied expert systems (pp. 1–11). West Palm Beach, FL: CRC Press. Intelsys Inc. (2000–2001). XID authoring system 3.0 (Demo Version) (En lı́nea, Consulta: 3/06/2003, http://www.xidservices.com). López González, G. (2001). Los árboles y arbustos de la Penı́nsula Ibérica e islas Baleares Tomo I. Madrid: Ediciones Multiprensa. 861 pp. López González, G., & Do Amaral Franco, J. (1986). In S. Castroviejo, et al., Gymnospermae. Flora Ibérica (Vol. I) (pp. 161–195). Madrid: Real Jardı́n Botánico, CSIC. Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). NEXUS: An extensible file format for systematic information. Systematic Biology, 46. Mahaman, B. D., Harizanis, P., Filis, I., Antonopoulou, E., Yialouris, C. P., & Sideridis, A. B. (2002). A diagnostic expert system for honeybee pests. Computers and electronics in agriculture, 36, 17–31. Meacham, C. A. (1986–1996). Meka version 3.0 (En lı́nea, Consulta: 3/06/2003, http://ucjeps.berkeley.edu/meacham/meka/). Morales, C., Quesada, C., & Baena, L. (2001) Guı́as de la Naturaleza. Árboles y Arbustos. Diputación de Granada, Granada. Pankhurst, R. J. (1991). Practical taxonomic computing. Cambridge: Cambridge University Press. Pankhurst, R. J., & Pullan, M. (1994). Pandora user guide version 3.1 (En lı́nea, Consulta:8/06/2003, http://www.ibiblio.org/pub/academic/ biology/ecologyCevolution/software/pandora). Quinlan, J. R. (1986). Induction on decision trees. Machine Learning, 1, 81–106. Quinlan, J. R. (1993). C4.5: Programs for machine learning. Los Altos, CA: Morgan Kaufmann1-55860-238-0. Shafer, G. (1976). Mathematical theory of evidence. Princeton, NJ: Princeton University Press. Schalk, P. H., & Heijman, R. P. (1996). ETI’s linnaeus II taxonomic software: A new tool for interactive education, Vol. 3. UniSer- ve.Science News, University of Sydney (En lı́nea, Consulta 7/5/2003, http://www.eti.uva.nl/Home/Articles.html). Schalk, P. H., & Troost, D. G. (1999). Computer tools for accessing biodiversity information. Nature and Resources, 35(3), 31–38. Shortlife, E., & Buchanan, B. (1975). A model of inexact reasoning in medicine. Mathematical Biosciences, 23, 351–379. TDWG-SDD (2005). SDD part 0: Introduction and primer to the SDD standard (En lı́nea, Consulta 7/02/2005, http://160.45.63.11/Projects/ TDWG-SDD/Primer/index.htm). UBio (2003). X:ID, version (En lı́nea, Consulta 30/7/2003, http://ubio.org/ offerings/applications/key/index.html). University of Toronto. Department of Botany; the University of Toronto Libraries; the Royal Ontario Museum (1996). Pollyclave a multi-entry identification key version 1.6 (En lı́nea, Consulta 6/06/2003, http://prod. library.utoronto.ca/pollyclave/index.html). Zadeh, L. A. (1965). Fuzzy Sets. Information and Control, 8(3), 338–353. http://www.elsevier.com/locate/eswa http://www.elsevier.com/locate/eswa http://www.lucidcentral.com/default.htm http://www.biodiversity.uno.edu/delta/ http://www.xidservices.com http://www.ucjeps.berkeley.edu/meacham/meka/ http://www.ibiblio.org/pub/academic/biology/ecology+evolution/software/pandora http://www.ibiblio.org/pub/academic/biology/ecology+evolution/software/pandora http://www.eti.uva.nl/Home/Articles.html http://www.160.45.63.11/Projects/TDWG-SDD/Primer/index.htm http://www.160.45.63.11/Projects/TDWG-SDD/Primer/index.htm http://www.ubio.org/offerings/applications/key/index.html http://www.ubio.org/offerings/applications/key/index.html http://www.prod.library.utoronto.ca/pollyclave/index.html http://www.prod.library.utoronto.ca/pollyclave/index.html XKey: A tool for the generation of identification keys Introduction Background Identification keys Digital representation of taxonomical knowledge Tools for the generation of identification keys Analysis of the problem Construction of identification keys Objectives of the division criterion Classic division criteria Division criteria in taxonomy Treatment of null values Reliability of a character Confirmatory characters Construction of knowledge bases Treatment of uncertainty Characteristics of XKey Use of different division criteria Treatment of null values Control of the reliability of a character Inclusion of confirmatory characters Operation modes Diversity of output formats Construction of a knowledge base Execution results Test of the XKey tool Execution data for the XKey tool and methodology followed Observations about the average length of the keys generated Observations about the equilibrium of the keys Observations regarding the degree of branching Observations about the number of confirmatory characters included Observations about the number of characters used Qualitative comparison of the criteria Discussion and biological interpretation of the results obtained/effect of interactivity Conclusions and final notes References