S. Levachkine & A Guzman-Arenas 1 of 31

Hierarchy as a new data type for qualitative variables

Serguei Levachkine, Adolfo Guzman-Arenas*
Centro de Investigación en Computación, Instituto Politécnico Nacional
“Lopez Mateos” Campus, Mexico City, MEXICO
sergei@cic.ipn.mx, aguzman@ieee.org

* Corresponding author.

SUMMARY. Qualitative variables take symbolic values such as cat, orange, California, Africa. Often these values can be arranged in levels of deeper detail. For example, the variable place_of_birth takes as level-1 values Africa, Asia...; as level-2 values Nigeria, Japan...; as level-3 values California, Massachusetts... These values are organized in a hierarchy H, a mathematical construct among these values. Over H, the following are defined: (1) the function confusion, which measures the error resulting from using one symbolic value instead of another; (2) the closeness with which an object o fulfills a predicate P; (3) a method that allows precision-controlled retrieval in relational databases whose objects have symbolic values.

INDEX TERMS: hierarchy, ontology, approximate queries, confusion, knowledge representation.

1. Introduction

What is the capital of Germany? Berlin is the correct answer; Frankfurt is a close miss, Madrid a fair error, and sausage a gross error. What is closer to a cat, a dog or an orange? Can we measure these errors and similarities? Can we retrieve objects in a database that are close to a desired item? Yes, by arranging these symbolic (that is, non-numeric) values in a hierarchy. For the sake of completeness, four different definitions of hierarchy are given in §2. At least one of them is original. These definitions are important for understanding what a hierarchy is. However, confusion does not depend on the particular definition.

1.
This arrangement allows the definition of confusion (§3.1) to measure the error when using one symbolic value associated with a node in a hierarchy in place of the (intended, correct) symbolic value associated with another node in the hierarchy. Variations of the definition are:
   a. When the values represent sets with some size, as population in {France, Italy, Spain, Sweden}, we talk about percentage hierarchies (§3.1.1).
   b. When the values can be ordered, as temperature in {frigid, cold, warm, hot, burning}, we talk about ordered hierarchies (§3.1.2).
2. Confusion is also defined (§3.3) for hierarchies whose nodes are associated with predicates (called variables in the paper), in addition to hierarchies whose nodes are associated with values of a variable (item 1).
3. Confusion among values (item 1) and among variables (item 2) enables measurement of how closely a given object o fulfils a given predicate P (§3.2.2); we write Pε(o) for this measure.

The main contributions are (1)-(3), pertaining to symbolic values. Of course, errors, distances and approximate answers are well understood and developed for numerical values.

The rest of the introduction discusses related work. Section 2 gives several definitions for hierarchies. Section 3.1 introduces confusion. Section 3.2 presents predicates on hierarchies. The paper's overall results are discussed in Section 4, and Section 5 presents the conclusions.

Related work. Artificial Intelligence, Natural Language and Knowledge Representation communities have been gauging the distance, proximity or “relatedness” between symbolic values. Relevant efforts:

A. Hierarchies. The concept of a (generalization) hierarchy is not new. Hierarchies are used in data warehousing and data mining; see, for instance, the H-sets of Bhin [2]. A practical use of hierarchies in symbolic processing is Clasitex [9], which finds the themes of an article written in Spanish or English.
It uses the concept tree, and a word (not in the tree) suggests the topic of one or more concepts in the tree. BiblioDigital© [4], a recent development, uses a large taxonomy (although not a hierarchy) to classify text documents; a (distributed) crawler in it retrieves “external” documents residing elsewhere on the Web. If a document is about war, Iraq and President Bush (cf. Clasitex), its URL will be stored in these three nodes of the concept tree. Hierarchies are simpler than ontologies, albeit very useful [13, 21]. The data modeling community, through the entity-relationship model, also organizes items by their nature, properties and the relations among them.

B. Natural Language. Linguists (see, for instance, Proceedings of CICLING 04, LNCS 2945, as referenced in [11]) have proposed many versions of semantic closeness, similarity, and other measures among words. Everett [6] identifies conceptually similar documents using a single ontology. Sidorov [8] does the same using a topic hierarchy: a kind of ontology. Montes y Gómez [19] builds trees of words, and by graph matching retrieves similar texts. Another common idea is to equip the representation space with a “universal” measure of proximity among its elements and then attempt to adapt it to different subject domains [16] [24]. Comments on this are in §4.

WordNet [26] organizes information in logical groupings called synsets; each synset is a list of synonymous words or collocations (e.g., “fountain pen”, “take in”), and pointers that describe the relations between this synset and other synsets. A word or collocation may appear in more than one synset, and in more than one part of speech. The words in a synset are logically grouped such that they are interchangeable in some context. Nouns and verbs are organized into hierarchies based on the hypernymy/hyponymy relation between synsets.
Two kinds of relations are represented by pointers: lexical and semantic. Lexical relations hold between word forms; semantic relations hold between word meanings. These relations include (but are not limited to) antonymy, entailment, and meronymy/holonymy. Additional pointers are used to indicate other relations.

Budanitsky [3] compares five measures of similarity or semantic distance in WordNet: Jiang and Conrath's measure (the best in the comparison: a spelling-corrector on real data); that of Hirst-St-Onge (seriously over-related), that of Resnik (seriously under-related [23]), and those of Lin [16] and of Leacock-Chodorow (in between). Note that all the measures except that of Hirst and St-Onge are similarity (not relatedness) measures considering only the hyponymy hierarchy of WordNet. The main problem (§4) with these approaches is that they use distances, thus obeying the symmetric property d(a,b) = d(b,a), while conf (§3.1) does not.

C. Ontologies. At least three approaches appear when measuring similarity or relatedness of concepts (nodes in the ontology):

1. Syntactic approach. Methods that take into account only the organization of the tree or data structure of the ontology; for instance [19], those based on XML, or the “ontology merging” of Protégé [20].
2. Standard ontology. Use of a common or agreed-upon ontology. Clearly, if different people (or agents) use the same ontology, similarities among concepts will be consistently measured across users. CYC [7] was an early attempt to build the concept tree for common concepts. A common ontology is predicted in [11]; conceptually similar documents are identified in [6] by using a single ontology. In contrast, point (3) following shows use of different ontologies.
3. Measuring similarity across ontologies.
LIA, a language for agent interaction [10, 13, 21], has an ontology comparator COM that maps a concept from one ontology into the closest corresponding concept in another ontology. COM is used in sim of §3.4. By repeated use of sim, the degree of understanding du(B, OA) of agent B (with ontology OB) about ontology OA is found in [22].

Instead of using ontologies, this paper works on arbitrary hierarchies (§2). Why? Because problem-oriented interaction can be easier to maintain if the hierarchical structure is not a priori rigid, as it is in the case of common hierarchies or ontologies.

D. Pattern Classifiers. Our predicates with controlled precision or confusion (§3.2.1) are similar to Pattern Classifiers [18], but these classify objects according to the values of their properties, whereas hierarchies help to classify these values when they are non-numeric.

E. Distances and ultradistances. Traditionally [1, 25], the representation space is regarded as a metric space with some “exotic” distance (e.g., an ultrametric distance to measure the “distances” between members of a hierarchy). Thus, §2.4 develops ultrametric distances for hierarchies. However, it is often not the case that such a distance meets the needs of the classification problem under consideration. Thus, we lean towards functions like conf (§3.1) that are not distances.

2. Theory

This section continues the focus on distances of item (E) of §1: we show how to build an ultradistance from a hierarchy (§2.4.1) and how to build a hierarchy from an ultradistance (§2.4.2), whereas in Section 3 we move to a new approach that does not use distances.

Element set E. A set whose elements are explicitly defined. ♦¹ Example: {red, blue, white, black, pale}.

Ordered set. An element set whose values are ordered by a < (“less than”) relation. ♦ Example: {very_cold, cold, warm, hot, very_hot}.

Covering.
K is a covering for set E if K is a set of subsets si ⊂ E, such that ∪ si = E. ♦ Every element of E is in some subset si ∈ K. If K is not a covering of E, we can make it so by adding a new sj to it, named “others”, that contains all other elements of E that do not belong to any of the previous si.

Exclusive set. K is an exclusive set if si ∩ sj = ∅ for every pair of distinct si, sj ∈ K. ♦ Its elements are mutually exclusive. If K is not an exclusive set, we can make it so by replacing every two overlapping si, sj ∈ K with three: si - sj, sj - si, and si ∩ sj.

Partition. K is a partition of set E if it is both a covering for E and an exclusive set. ♦

¹ This symbol means: end of definition.

Symbolic value. A value that is not numerical, vector or quantitative. ♦ Example: red.

Representation. A symbolic value v represents a set E, written v ∝ E, if v can be considered a name or a depiction of E. ♦ v is associated with E. Example: strings ∝ {violin, viola, cello, guitar}.

Qualitative variable. A single-valued variable that takes symbolic values. ♦ Its value cannot be a set,² although such a value may represent a set.

father_of (v). In a tree, f is the father_of v if f is the node immediately following v in the path from v to the root. ♦ f is “the node from which v hangs.” We say that v is a son_of (f). ♦ Similarly, grand_father_of (v), brothers_of, aunt, ascendants, descendants... are defined, when they exist. ♦ The root is the only node that has no father.

2.1 Hierarchy

Definition 1. A hierarchy H of an element set E is a tree whose root is E and if a node has sons then these form a partition of their father. ♦

Definition 2. For an element set E, a hierarchy H of E is a tree of nodes; each node is either an element of E or a set of symbolic values vi, for i=1,…,n, where vi ∝ Ei, and {E1, E2,…, En} is a partition of E.
♦ Example: for E = {chair table bed shirt loafer moccasin hammer paintbrush broom saw}, a hierarchy is (Figure 1) H1 = {furniture ∝ {chair table bed} apparel ∝ {shirt shoe ∝ {loafer moccasin}} tool ∝ {hammer brush ∝ {paintbrush broom} saw}}. A hierarchy groups E into smaller sets of alike symbolic values.

² Variable, attribute and property are used interchangeably. Some objects have an attribute (such as weight) while others do not: the weight of blue does not make sense, does not exist. A variable (color, height) describes an aspect of an object; its value (blue, 2 Kg) is such description (symbolic value) or measurement (numeric value).

merchandise
   furniture: chair, table, bed
   apparel: shirt, shoe (loafer, moccasin)
   tool: hammer, brush (paintbrush, broom), saw

Fig. 1. Hierarchy H1 of articles for sale.

Definition 1 emphasizes that the nodes of H are sets, while for Definition 2 the nodes are symbolic values (such as furniture). To reconcile them, we use the earlier definition of representation, v ∝ E, where a symbolic value v represents a set E.

Sometimes we add a node “others” to a certain level of a hierarchy, when we are not sure whether the nodes already present at that level collectively exhaust their father. Thus, for instance, in Figure 1, we could add to the second level the node “other_merchandise” if we are not sure that furniture, apparel and tool comprise all the merchandise we are interested in.

A hierarchical variable is a single-valued qualitative variable whose values belong to a hierarchy. ♦ The data type of a hierarchical variable is hierarchy. Example: trades_in, which takes values from H1, as in trades_in = furniture, trades_in = broom.

2.2 Partitions of a finite set

Let E be a set of n elements. A partition P of E is a set of k subsets Ci of E (its classes) such that (1) Ci ∩ Cj = ∅ for i ≠ j; (2) ∪i Ci = E. ♦ Two elements x and y of E are equivalent in a partition P if they belong to the same class Ci; this is denoted by xPy. ♦
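The two partition conditions, the equivalence xPy, and the requirement of Definition 1 that the sons of a node partition their father can all be checked mechanically. The Python sketch below is illustrative only: the nested-dict encoding of H1 and the helper names (is_partition, equivalent, elements, is_hierarchy, father_of) are assumptions made here, not constructs taken from the paper.

```python
from itertools import combinations

# Assumed encoding of H1 (Fig. 1): an inner node maps each son's name to its
# subtree; a leaf (an element of E) maps to an empty dict.
H1 = {
    "merchandise": {
        "furniture": {"chair": {}, "table": {}, "bed": {}},
        "apparel": {"shirt": {}, "shoe": {"loafer": {}, "moccasin": {}}},
        "tool": {"hammer": {}, "brush": {"paintbrush": {}, "broom": {}}, "saw": {}},
    }
}


def is_partition(classes, E):
    """Conditions (1) and (2): classes pairwise disjoint, and their union is E."""
    classes = [set(c) for c in classes]
    disjoint = all(a.isdisjoint(b) for a, b in combinations(classes, 2))
    covers = set().union(*classes) == set(E)
    return disjoint and covers


def equivalent(x, y, classes):
    """xPy: x and y belong to the same class of the partition."""
    return any(x in c and y in c for c in classes)


def elements(name, subtree):
    """Elements of E represented by a node; a leaf represents itself."""
    if not subtree:
        return {name}
    return set().union(*(elements(n, s) for n, s in subtree.items()))


def is_hierarchy(name, subtree):
    """Definition 1: the sons of every inner node form a partition of it.

    With this encoding a father is the union of its sons by construction,
    so the check that can actually fail is disjointness of the sons.
    """
    if not subtree:
        return True
    sons = [elements(n, s) for n, s in subtree.items()]
    return is_partition(sons, elements(name, subtree)) and all(
        is_hierarchy(n, s) for n, s in subtree.items()
    )


def father_of(v, name, subtree):
    """Name of the node from which v hangs; None if v is the root or absent."""
    for son, sub in subtree.items():
        if son == v:
            return name
        found = father_of(v, son, sub)
        if found is not None:
            return found
    return None


ROOT, TREE = next(iter(H1.items()))
P = [{"a"}, {"b", "c", "d"}, {"e", "f"}]

print(is_partition(P, set("abcdef")))      # True
print(equivalent("b", "d", P))             # True
print(is_hierarchy(ROOT, TREE))            # True
print(father_of("loafer", ROOT, TREE))     # shoe
```

A nested dict keeps the sketch short; a hierarchy in which the same element hangs from two different sons would fail is_hierarchy, because those sons would no longer be mutually exclusive.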
Let P(E) be the set of all partitions of E; an order relation among the members of P(E), denoted by <, can be defined thus: for any two partitions P and P’ with k and k’ classes respectively, P < P’ if P is finer than P’, that is, every class of P is contained in a class of P’, so that k > k’. ♦ Example: let E = {a,b,c,d,e,f}. Then the one-class partition {E} is less fine (i.e. coarser) than {{a},{b,c,d},{e,f}}, which in turn is less fine than {{a},{b,c},{d},{e},{f}}.

A lattice structure for P(E) can be based on the order relation. For every pair of partitions P and P’ there is a least upper bound (l.u.b.) P ∨ P’ and a greatest lower bound (g.l.b.) P ∧ P’. ♦

Let us call Pk a partition of k classes, where k is the level of Pk. A partition P’ is said to cover a partition P if and only if P’ results from combining two classes of P. ♦ Note that P’ = bcd,a does not cover P = ab,c,d, because P’ cannot be obtained from the union of two classes of P, which would in fact give P’1 = abc,d, P’2 = abd,c and P’3 = ab,cd, but not P’.

A chain in the lattice is a sequence of partitions in order, e.g. (P1, P2,...,Pj) where P1