Similarity measure models and algorithms for hierarchical cases

Dianshuang Wu, Jie Lu, Guangquan Zhang
Decision Systems & e-Service Intelligence (DeSI) Lab
Centre for Quantum Computation & Intelligent Systems (QCIS)
Faculty of Engineering and Information Technology, University of Technology, Sydney, P.O. Box 123, Broadway, NSW 2007, Australia
Corresponding author: Jie Lu, Tel.: +61 2 9514 1838
E-mail: sd_wds@hotmail.com (D. Wu), jielu@it.uts.edu.au (J. Lu), zhangg@it.uts.edu.au (G. Zhang)

Abstract

Many business situations, such as events, products and services, are described in a hierarchical structure. When we use case-based reasoning (CBR) techniques to support business decision-making, we require a hierarchical-CBR technique that can effectively compare and measure the similarity between two hierarchical cases. This study first defines hierarchical case trees (HC-trees) and discusses their features. It then develops a similarity evaluation model that takes into account all the information on nodes' structures, concepts, weights, and values in order to comprehensively compare two hierarchical case trees. A similarity measure algorithm for HC-trees is proposed, which includes a node concept correspondence degree computation algorithm and a maximum correspondence tree mapping construction algorithm. We provide two illustrative examples to demonstrate the effectiveness of the proposed hierarchical case similarity evaluation model and algorithms, and their possible applications in CBR systems.

Keywords: Hierarchical similarity, Hierarchical cases, Tree similarity measuring, Case-based reasoning

1. Introduction

Case-based reasoning (CBR) is the process of solving new problems based on the solutions of similar past problems (Aamodt & Plaza, 1994). CBR provides a powerful learning ability to use past experiences as a basis for dealing with new problems, and facilitates the knowledge acquisition process by reducing the time required to elicit solutions from experts. It is represented by a four-step (4Rs) cycle: retrieve, reuse, revise and retain (Aamodt & Plaza, 1994). In the first 'R' stage, when a new problem is input, CBR retrieves the most similar case from the case base. Obviously, designing an effective case similarity evaluation method to identify the most similar cases is a key issue in CBR. Many models and algorithms have been developed to measure the similarity between two cases described by a set of attributes (Falkman, 2000). In practice, however, some cases can only be described by hierarchical tree structures, so we need to explore effective similarity measures for hierarchical cases in order to apply CBR systems to them.

Fig. 1 shows an example of an avian flu case describing the infection situation of birds in an area at a given time (Zhang, Lu & Zhang, 2009). It is a hierarchical case and can be viewed as a tree structure. This tree case has seven nodes, labeled "wild birds", "farm poultry", ..., "water bird" and "no water bird". The "water bird" node indicates that 40% of water birds were infected. The weight on its incoming edge shows that 70% of farm poultry are water birds; similarly, the weight 0.6 on the edge below the root shows that 60% of the birds in the area are farm poultry.

[Figure omitted.]
Fig. 1. A hierarchical case: bird flu
From this example, we can summarize the following features of tree cases: (1) every node is associated with a concept; (2) the concepts represented by the nodes of a tree case form a hierarchical structure, in which nodes at different depths represent concepts at different abstraction levels, and child nodes can be viewed as refinements of the concept expressed by their parent node; (3) all leaves of a tree case are assigned values, and the values of the other nodes can be assessed by aggregating their children's; (4) every node is assigned a weight to represent its importance relative to its parent node. As different cases may arise from different sources at different times, the tree structures, node concepts, weights and values of different trees are probably not all the same. To evaluate the similarity between such tree-structured hierarchical cases, all of this information should be considered.

The research in this paper is related to work on tree similarity measures and structured case similarity measures. Tree-structured data are used in many fields, such as e-business (Bhavsar, Boley, & Yang, 2004), bioinformatics (Tran, Nguyen, & Hoang, 2007), XML schema matching (Jeong, Lee, Cho, & Lee, 2008), document classification and organization (Rahman & Chow, 2010) and case-based reasoning (Ricci & Senter, 1998). The similarity measure of tree-structured data is essential for many applications. One widely used tree similarity measure is the tree edit distance (Zhang, 1993; Kailing, Kriegel, Schonauer, & Seidl, 2004), in which edit operations (insertion, deletion and re-labeling) with associated costs are defined, and the least cost of a sequence of edit operations needed to transform one tree into another is used as the similarity measure between the two trees (Bille, 2005). The main difference between the various tree edit distance algorithms lies in the set of allowed edit operations and their related cost definitions (Yang, Kalnis, & Tung, 2005). In (Xue, Wang, Ghenniwa, & Shen, 2009), a conceptual similarity measure between labels was introduced into the cost of edit operations to compare the concept trees of ontologies. Another kind of tree similarity measure is based on a maximum common sub-tree (MCS) or sub-tree isomorphism between two trees (Akutsu & Halldorsson, 2000). This method uses the size of the MCS between two trees, or metrics defined by the MCS, as the similarity measure. In (Torsello, Hidovic, & Pelillo, 2004), four novel distance measures for attributed trees, based on the notion of a maximum similarity sub-tree isomorphism, were proposed. In (Lin, Wang, McClean, & Liu, 2008), the number of all common embedded sub-trees between two trees was used as the measure of similarity. The methods mentioned above mostly deal with node-labeled trees. In (Bhavsar, Boley, & Yang, 2004), node-labeled, arc-labeled, arc-weighted trees were used as product/service descriptions to represent the hierarchical relationships between attributes. To compare these trees, a recursive algorithm performing a top-down traversal of the trees and a bottom-up computation of similarity was designed. However, their trees had to conform to the same standard schema, i.e. the trees had to have the same structure and use the same labels, although some sub-trees were allowed to be missing (Yang, Sarker, Bhavsar, & Boley, 2005). As the trees for hierarchical cases in our research differ from these previous ones, we need to develop a new similarity measure method for them.
Structured case similarity measures in the literature are usually based on the maximal common sub-graph or sub-graph isomorphism (Burke, MacCarthy, Petrovic, & Qu, 2000; Sanders, Kettler, & Hendler, 1997). In (Ricci & Senter, 1998), a similarity measure on tree-structured cases was studied which takes into account both the tree structures and the semantics of node labels. A sub-tree isomorphism with the minimum semantic distance was constructed, and this minimum semantic distance was used as the similarity measure. That research is closely related to ours. However, the positions of corresponding nodes are not restricted in their sub-tree isomorphism, which is not suitable for our hierarchical cases because nodes at different depths represent concepts at different abstraction levels. Moreover, nodes' values are not involved in their similarity measure.

In this paper, we present a comprehensive similarity evaluation model that considers all the information on nodes' structures, concepts, weights and values to compare tree-structured hierarchical cases. To express the concept correspondence between nodes in different trees, a concept correspondence degree is defined. A maximum correspondence tree mapping based on nodes' structures and concepts is constructed to identify the corresponding nodes between two trees. Based on the mapping, the values of corresponding nodes are compared. Finally, the similarity measure of the trees is evaluated by aggregating both the conceptual and the value similarities.

This paper is organized as follows. In Section 2 we describe the features of hierarchical case trees using mathematical formulas. Section 3 presents a similarity evaluation model to compare any two hierarchical case trees. A set of algorithms to compute the similarity between hierarchical cases is provided in Section 4. Section 5 presents two examples to demonstrate the effectiveness of the proposed hierarchical case similarity evaluation model and algorithms, and their possible applications in CBR systems; it also compares the proposed HC-tree similarity model with other approaches. Section 6 concludes the paper and discusses tasks for our further study.

2. Hierarchical case trees

A tree is defined as a directed graph $T = (V, E)$ whose underlying undirected graph has no cycles, and which has a distinguished root node in $V$, denoted $root(T)$, such that for all nodes $v \in V$ there is a path in $T$ from $root(T)$ to $v$ (Valiente, 2002). In real applications, this definition can be extended to represent practical objects. To express the concepts, values and weights associated with the nodes of hierarchical cases, and the hierarchical relationships between nodes, the original tree structure is enriched and a hierarchical case tree (HC-tree) is defined.

Definition 2.1: HC-tree. An HC-tree is a structure $T = (V, E, A, W, R)$, in which $V$ is a finite set of nodes; $E$ is a binary relation on $V$ in which each pair $(u, v) \in E$ represents the parent-child relationship between two nodes $u, v \in V$; $A$ is a set of attributes assigned to the nodes in $V$; $W$ is a function assigning each node a weight that represents its degree of importance relative to its siblings, such that the weights of all the children of one node sum to 1; and $R$ is a function assigning a value to every leaf node to describe the degree of its associated attribute.

Two features of the HC-tree should be highlighted. First, all nodes in an HC-tree represent concepts, which are derived from their attributes.
In the hierarchical structure, the concept of a node depends not only on its own attribute but also on its children's. Therefore, nodes at different depths represent concepts at different abstraction levels, and nodes at higher layers represent more significant concepts than lower nodes. Secondly, every node in an HC-tree has a value. The leaves' values are given by $R$, and the internal nodes' values can be computed by aggregating their children's.

[Figure omitted.]
Fig. 2. Two examples of HC-trees

Two examples of HC-trees, both describing the situation of bird flu, are illustrated in Fig. 2. The labels beside the nodes represent their attributes, the number beside each edge is the weight of the child, and the number under each leaf is its value. In $T_1$, "A bird flu case" is described by two aspects, "migratory bird" and "resident bird", both with the same weight. Similarly, "migratory bird" is described by two sub-aspects, "long distance migratory" and "short distance migratory". From "long distance migratory", we can see that 30% of these birds were infected. As $T_1$ and $T_2$ come from different sources, their structures and node weights are different, and their attribute terms are also not identical. To evaluate the conceptual correspondence between attributes in different HC-trees, a conceptual similarity measure between attributes is introduced, as in (Xue, Wang, Ghenniwa, & Shen, 2009).

Definition 2.2: Attribute Conceptual Similarity Measure. An attribute conceptual similarity measure $sc_{A_1,A_2}$ is a mapping from two attribute sets $A_1$, $A_2$ used in different HC-trees to the interval [0, 1], $sc_{A_1,A_2}: A_1 \times A_2 \to [0, 1]$, which gives the conceptual similarity between two attributes. For convenience, the subscript $A_1, A_2$ is omitted where there is no confusion. For $a_1 \in A_1$ and $a_2 \in A_2$, we say $a_1$ and $a_2$ are similar if $sc(a_1, a_2) > 0$, and the larger $sc(a_1, a_2)$ is, the more similar the two attributes are.

Conceptual similarity between two attributes can be given by domain experts or calculated by linguistic analysis methods. As an example, we define the conceptual similarity between the attributes of $T_1$ and $T_2$ in Fig. 2 as follows: sc(migratory bird, wild birds) = 0.7, sc(migratory bird, farm poultry) = 0.1, sc(resident bird, wild birds) = 0.6, sc(resident bird, farm poultry) = 0.8, sc(resident bird, water bird) = 0.4, sc(resident bird, no water bird) = 0.4, sc(long distance migratory, water bird) = 0.1, sc(long distance migratory, no water bird) = 0.2, sc(short distance migratory, water bird) = 0.2, sc(short distance migratory, no water bird) = 0.2.
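For concreteness, the following is a minimal sketch of how an HC-tree node and a table-based attribute conceptual similarity measure might be represented in Python. The Node class, SC_TABLE and sc names are illustrative, not part of the model; pairs missing from the table are taken to have similarity 0.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)  # identity-based hashing, so nodes can serve as dict keys
class Node:
    attribute: str                    # the attribute in A assigned to this node
    weight: float = 1.0               # W: importance relative to siblings
    value: Optional[float] = None     # R: assigned to leaf nodes only
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

# Expert-given conceptual similarities (part of the Fig. 2 example), stored
# once per unordered pair and looked up symmetrically.
SC_TABLE = {
    ("migratory bird", "wild birds"): 0.7,
    ("migratory bird", "farm poultry"): 0.1,
    ("resident bird", "wild birds"): 0.6,
    ("resident bird", "farm poultry"): 0.8,
    ("long distance migratory", "water bird"): 0.1,
    ("short distance migratory", "no water bird"): 0.2,
}

def sc(a1: str, a2: str) -> float:
    """Symmetric attribute conceptual similarity measure (Definition 2.2)."""
    if a1 == a2:
        return 1.0
    return SC_TABLE.get((a1, a2), SC_TABLE.get((a2, a1), 0.0))
```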
3. A similarity evaluation model for HC-trees

A similarity evaluation model for HC-trees is proposed in this section. In the model, a maximum correspondence tree mapping is constructed to identify the corresponding node pairs of two HC-trees based on the nodes' structures and concepts, and the conceptual similarity between the two HC-trees is evaluated. Based on the mapping, the value similarity between the two HC-trees is evaluated, and the final similarity measure is assessed as a weighted sum of the conceptual and value similarities.

3.1 Maximum correspondence tree mapping

To identify two corresponding nodes in different HC-trees, both their structures and their concepts should be considered. There are two structural restrictions. First, as nodes at different depths represent concepts at different abstraction levels, it is reasonable to require that corresponding nodes in the mapping be at the same depth; in particular, the roots of the two HC-trees should be in the mapping. Secondly, as child nodes can be viewed as refinements of the concept expressed by their parent node, two separate sub-trees in one tree should be mapped to two separate sub-trees in the other. In addition to satisfying the structural restrictions, corresponding nodes should have a high conceptual similarity degree. To express the concept correspondence between two nodes of two HC-trees, the following definition is introduced.

Definition 3.1: Node Concept Correspondence Degree. Let $V_1$ and $V_2$ be the node sets of $T_1$ and $T_2$ respectively. A node concept correspondence degree $cord$ is a mapping from $V_1 \times V_2$ to the interval [0, 1], $cord: V_1 \times V_2 \to [0, 1]$, which gives the concept correspondence between two nodes of the two HC-trees. $cord$ is symmetric, i.e. for $v \in V_1$ and $u \in V_2$ we have $cord(v, u) = cord(u, v)$.

Let $v$ and $u$ be two nodes of $T_1$ and $T_2$ respectively. There are three cases: (1) both $v$ and $u$ are leaves; (2) $v$ is a leaf and $u$ is an internal node (or vice versa, by symmetry); (3) both $v$ and $u$ are internal nodes. In the first case, as a node's concept is derived from the attribute assigned to it, the concept correspondence degree between $v$ and $u$ can be defined as the conceptual similarity of their attributes. In the other two cases, as an internal node's concept is also affected by its children, the children's concepts should be considered, and the definitions are therefore recursive. The definitions of $cord$ for the three cases are as follows.

Definition 3.2: Concept Correspondence Degree between Two Leaves. Let $v$ and $u$ be two leaves of $T_1$ and $T_2$ respectively. The concept correspondence degree between $v$ and $u$, $cord(v, u)$, is defined as:

$cord(v, u) = sc(v.a, u.a)$    (3.1)

where $v.a$ and $u.a$ denote the attributes of $v$ and $u$ respectively.

For example, $v_4$ and $u_6$ are leaves of $T_1$ and $T_2$ respectively in Fig. 2. The attribute of $v_4$ is "long distance migratory" and that of $u_6$ is "water bird". The concept correspondence degree between $v_4$ and $u_6$ is given by sc(long distance migratory, water bird), so $cord(v_4, u_6) = 0.1$.

Definition 3.3: Concept Correspondence Degree between a Leaf and an Internal Node. Let $v$ be a leaf of $T_1$, $u$ be an internal node of $T_2$, and $C(u) = \{u_1, u_2, \ldots, u_q\}$ be $u$'s set of children. The concept correspondence degree between $v$ and $u$, $cord(v, u)$, is defined as:

$cord(v, u) = \alpha \cdot sc(v.a, u.a) + (1 - \alpha) \cdot \sum_{i=1}^{q} w_{2i} \cdot cord(v, u_i)$    (3.2)

where $\alpha$ is the influence factor of the parent node and $w_{2i}$ is the weight of $u_i$.

For example, $v_3$ is a leaf of $T_1$ and $u_3$ is an internal node of $T_2$ in Fig. 2. The concept correspondence degree between $v_3$ and $u_3$ is computed as $cord(v_3, u_3) = \alpha \cdot sc(v_3.a, u_3.a) + (1 - \alpha) \cdot (0.7 \cdot cord(v_3, u_6) + 0.3 \cdot cord(v_3, u_7))$. Here $sc(v_3.a, u_3.a) = 0.8$, and $cord(v_3, u_6)$ and $cord(v_3, u_7)$ are computed by Definition 3.2 as $cord(v_3, u_6) = 0.4$ and $cord(v_3, u_7) = 0.4$. With $\alpha = 0.5$, we obtain $cord(v_3, u_3) = 0.6$.
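As a quick check of this example, the computation of Formula (3.2) can be reproduced directly; the helper name below is illustrative only.

```python
def cord_leaf_internal(sc_parent, weighted_child_cords, alpha=0.5):
    # Formula (3.2): alpha * sc(v.a, u.a) + (1 - alpha) * sum_i w_2i * cord(v, u_i)
    return alpha * sc_parent + (1 - alpha) * sum(
        w * c for w, c in weighted_child_cords)

# cord(v3, u3): sc = 0.8; u3's children weighted 0.7 and 0.3, each with cord 0.4
assert abs(cord_leaf_internal(0.8, [(0.7, 0.4), (0.3, 0.4)]) - 0.6) < 1e-9
```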
Definition 3.4: Concept Correspondence Degree between Two Internal Nodes. Let $v$ and $u$ be two internal nodes of $T_1$ and $T_2$ respectively, and let $C(v) = \{v_1, v_2, \ldots, v_p\}$ and $C(u) = \{u_1, u_2, \ldots, u_q\}$ be their sets of children. Let $G_{v,u} = (V, E)$ denote the bipartite graph induced by $v$ and $u$, constructed as follows: $V = C(v) \cup C(u)$, $E = \{(s, t): s \in C(v), t \in C(u)\}$, and the weight of each edge is $weight_{s,t} = cord(s, t)$. Let $MWM_{v,u}$ be the maximum weighted bipartite matching of $G_{v,u}$. Then the concept correspondence degree between $v$ and $u$, $cord(v, u)$, is defined as:

$cord(v, u) = \alpha \cdot sc(v.a, u.a) + (1 - \alpha) \cdot \sum_{(v_i, u_j) \in MWM_{v,u}} \frac{1}{2}(w_{1i} + w_{2j}) \cdot cord(v_i, u_j)$    (3.3)

where $w_{1i}$ is the weight of $v_i$ in $T_1$ and $w_{2j}$ is the weight of $u_j$ in $T_2$.

In Definition 3.4, the maximum weighted bipartite matching $MWM_{v,u}$ identifies the most corresponding node pairs among $v$'s and $u$'s children, so the contribution of the children is fully considered when evaluating the concept correspondence degree.

For example, $v_2$ and $u_3$ are internal nodes of $T_1$ and $T_2$ respectively in Fig. 2. To compute their concept correspondence degree, the bipartite graph $G_{v_2,u_3}$ is constructed as in Fig. 3(a), in which the numbers beside the edges are their weights. The maximum weighted bipartite matching of $G_{v_2,u_3}$ is illustrated in Fig. 3(b). The concept correspondence degree between $v_2$ and $u_3$ is then computed as $cord(v_2, u_3) = \alpha \cdot sc(v_2.a, u_3.a) + (1 - \alpha) \cdot (((0.5 + 0.3)/2) \cdot cord(v_4, u_7) + ((0.5 + 0.7)/2) \cdot cord(v_5, u_6))$. Here $sc(v_2.a, u_3.a) = 0.1$, and $cord(v_4, u_7)$ and $cord(v_5, u_6)$ are computed by Definition 3.2 as $cord(v_4, u_7) = 0.2$ and $cord(v_5, u_6) = 0.2$. With $\alpha = 0.5$, we obtain $cord(v_2, u_3) = 0.15$.

[Figure omitted.]
Fig. 3. The bipartite graph $G_{v_2,u_3}$ and its maximum weighted bipartite matching

With the above definitions, the concept correspondence degree of any node pair between two HC-trees can be evaluated. The maximum correspondence tree mapping, which considers both the structural restrictions and the nodes' concept correspondence, is defined as follows.

Definition 3.5: Maximum Correspondence Tree Mapping. Let $V_1$ and $V_2$ be the node sets of HC-trees $T_1$ and $T_2$, respectively. A mapping $M \subseteq V_1 \times V_2$ is a maximum correspondence tree mapping if it satisfies the following conditions:
1. $v_1 = v_2 \Leftrightarrow u_1 = u_2$ for any pairs $(v_1, u_1), (v_2, u_2) \in M$;
2. $(root(T_1), root(T_2)) \in M$;
3. $(parent(v), parent(u)) \in M$ for all non-root nodes $v \in V_1$ and $u \in V_2$ with $(v, u) \in M$;
4. $cord(v, u) > 0$ for all nodes $v \in V_1$ and $u \in V_2$ with $(v, u) \in M$;
5. $MWM_{v,u} \subset M$ for all nodes $v \in V_1$ and $u \in V_2$ with $(v, u) \in M$, where $MWM_{v,u}$ is the maximum weighted bipartite matching of the bipartite graph $G_{v,u}$ constructed from $v$'s and $u$'s children, with edges weighted by the children's concept correspondence degrees.

In Definition 3.5, the first condition ensures that the mapping is one-to-one. Conditions 2 and 3 ensure that the mapping satisfies the structural restrictions. The last two conditions represent the conceptual restrictions; in particular, condition 5 ensures that the most corresponding node pairs are in the mapping. As an example, the maximum correspondence tree mapping between $T_1$ and $T_2$ in Fig. 2 is illustrated in Fig. 4, in which corresponding nodes are connected by dashed lines. The construction process of the mapping is described in Section 5.1.

[Figure omitted.]
Fig. 4. Maximum correspondence tree mapping between $T_1$ and $T_2$
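The three cases of the $cord$ recursion can be sketched as follows, reusing the hypothetical Node class and sc function from the Section 2 sketch. The maximum weighted bipartite matching is delegated to SciPy's linear_sum_assignment (valid here because all degrees are non-negative), and the optional dictionary B records the local matches needed later for the mapping construction; this is a sketch under those assumptions, not the paper's reference implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cord(v, u, sc, alpha=0.5, B=None):
    """Node concept correspondence degree (Definitions 3.2-3.4)."""
    if v.is_leaf() and u.is_leaf():                       # Definition 3.2
        return sc(v.attribute, u.attribute)
    if v.is_leaf() or u.is_leaf():                        # Definition 3.3
        leaf, internal = (v, u) if v.is_leaf() else (u, v)
        rest = sum(c.weight * cord(leaf, c, sc, alpha, B)
                   for c in internal.children)
        return alpha * sc(v.attribute, u.attribute) + (1 - alpha) * rest
    # Definition 3.4: both internal; match children by max weighted matching.
    C = np.array([[cord(vi, uj, sc, alpha, B) for uj in u.children]
                  for vi in v.children])
    rows, cols = linear_sum_assignment(-C)                # maximize total cord
    rest = 0.0
    for i, j in zip(rows, cols):
        if C[i, j] > 0:
            vi, uj = v.children[i], u.children[j]
            if B is not None:             # remember local matches for Algorithm 3
                B.setdefault(vi, []).append(uj)
            rest += 0.5 * (vi.weight + uj.weight) * C[i, j]
    return alpha * sc(v.attribute, u.attribute) + (1 - alpha) * rest
```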
From the recursive definitions of the node concept correspondence degree, it is clear that $cord(root(T_1), root(T_2))$ is computed by aggregating the $cord$ values of all corresponding node pairs, and thus reflects the conceptual similarity between the two HC-trees. We therefore define the conceptual similarity between two HC-trees as follows.

Definition 3.6: Conceptual Similarity between HC-trees. Let $T_1$ and $T_2$ be two HC-trees. The conceptual similarity between $T_1$ and $T_2$, $sct(T_1, T_2)$, is defined as $sct(T_1, T_2) = cord(root(T_1), root(T_2))$.

Taking $T_1$ and $T_2$ in Fig. 2 as an example, their conceptual similarity $sct(T_1, T_2)$ is computed as $cord(v_1, u_1)$.

3.2 Value similarity between HC-trees

Based on the maximum correspondence tree mapping $M$, the values of two HC-trees can be compared. The value similarity between two corresponding nodes in $M$ is evaluated first. As only leaf nodes are assigned values in HC-trees initially, for any $(v, u) \in M$ there are two cases: (1) $v$ is a leaf node, or none of $v$'s children are in $M$; (2) some of $v$'s children are in $M$. We give the computation formulas for the value similarity between $v$ and $u$, $sv_M(v, u)$, for the two cases respectively.

For case 1, $sv_M(v, u)$ is computed as:

$sv_M(v, u) = s(value(v), value(u))$    (3.4)

where $value(v)$ denotes $v$'s value and $s(\cdot)$ denotes a value similarity measure. If $v$ is a leaf node, $value(v)$ is assigned initially; otherwise, it is computed by aggregating its children's values. $s(\cdot)$ can be defined according to the specific application. For example, if two attributes' values are $a_1$ and $a_2$ and their value range is $r$, their similarity can be defined as $s(a_1, a_2) = 1 - |a_1 - a_2| / r$. In the example in Fig. 4, as the node values are all within [0, 1], the similarity between two values is calculated as one minus the distance between them. For $v_3$ and $u_3$ in Fig. 4, $value(v_3)$ is initially assigned as 0.8, and $value(u_3)$ can be computed as 0.3; the value similarity between $v_3$ and $u_3$ is then 0.5.

In case 2, let $v_1, v_2, \ldots, v_p$ be $v$'s children and $u_1, u_2, \ldots, u_q$ be $u$'s children. $sv_M(v, u)$ is computed as:

$sv_M(v, u) = \sum_{(v_i, u_j) \in M} \frac{1}{2}(w_{1i} + w_{2j}) \cdot sv_M(v_i, u_j)$    (3.5)

where $w_{1i}$ is the weight of $v_i$ in $T_1$ and $w_{2j}$ is the weight of $u_j$ in $T_2$.

Take $v_2$ and $u_2$ in Fig. 4 as an example. Their value similarity is computed as $sv_M(v_2, u_2) = ((0.5 + 0.4)/2) \cdot sv_M(v_4, u_4) + ((0.5 + 0.6)/2) \cdot sv_M(v_5, u_5)$. In this formula, $sv_M(v_4, u_4)$ and $sv_M(v_5, u_5)$ are calculated by Formula (3.4), giving $sv_M(v_2, u_2) = 0.725$.

With Formulas (3.4) and (3.5), the value similarity between any corresponding nodes in $M$ can be computed. Owing to the recursive nature of Formula (3.5), the value similarity between the roots of two HC-trees is computed by aggregating the value similarities of all corresponding node pairs, and thus represents the value similarity of the two HC-trees. We therefore define the value similarity between two HC-trees as follows.

Definition 3.7: Value Similarity between HC-trees. Let $T_1$ and $T_2$ be two HC-trees, and let $M$ be their maximum correspondence tree mapping. The value similarity between $T_1$ and $T_2$, $svt(T_1, T_2)$, is defined as $svt(T_1, T_2) = sv_M(root(T_1), root(T_2))$.

Taking $T_1$ and $T_2$ in Fig. 4 as an example, their value similarity $svt(T_1, T_2)$ is computed as $sv_M(v_1, u_1)$.
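The value side can be sketched under the same assumptions as the earlier sketches (the hypothetical Node class, and a dictionary M mapping nodes of $T_1$ to their corresponding nodes of $T_2$); values are taken to lie in [0, 1], so $s$ is one minus the absolute distance.

```python
def value(v):
    """A node's value: assigned for leaves, weight-aggregated for internal nodes."""
    if v.value is not None:
        return v.value
    return sum(c.weight * value(c) for c in v.children)

def sv(v, u, M):
    """Value similarity between corresponding nodes (Formulas 3.4 and 3.5)."""
    mapped = [(vi, M[vi]) for vi in v.children if vi in M]
    if not mapped:                                   # case 1, Formula (3.4)
        return 1.0 - abs(value(v) - value(u))        # values assumed in [0, 1]
    return sum(0.5 * (vi.weight + uj.weight) * sv(vi, uj, M)  # Formula (3.5)
               for vi, uj in mapped)
```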
3.3 Similarity measure of HC-trees

Based on the conceptual similarity and the value similarity of two HC-trees, the similarity measure of HC-trees is defined as follows.

Definition 3.8: Similarity Measure of HC-trees. The similarity between $T_1$ and $T_2$ is defined as:

$sim(T_1, T_2) = \alpha_1 \cdot sct(T_1, T_2) + \alpha_2 \cdot svt(T_1, T_2)$    (3.6)

where $\alpha_1 + \alpha_2 = 1$.

In this definition, both the concepts and the values of the two HC-trees are comprehensively considered. $\alpha_1$ and $\alpha_2$ are the weights of the two parts, which can be set according to the specific application.

4. Similarity measurement algorithms for HC-trees

Algorithms to compute the similarity between two HC-trees are presented in this section. The flowchart in Fig. 5 shows the entire process: $sct(T_1, T_2)$ is first computed by calling $cord(root(T_1), root(T_2), B)$, where $B$ is a node set list indexed by the nodes of $T_1$; all the maximum weighted bipartite matching solutions found while computing $cord(root(T_1), root(T_2))$ are recorded in $B$. The maximum correspondence tree mapping $M$ is then constructed from $B$ by calling $ConstructMap(B, T_1, T_2)$, and $svt(T_1, T_2)$ is computed based on $M$. Finally, the similarity of $T_1$ and $T_2$ is returned by aggregating their conceptual and value similarities. The overall procedure is given by Algorithm 1.

[Figure omitted.]
Fig. 5. Flowchart to compute the similarity between two HC-trees

Algorithm 1. Similarity measure algorithm for HC-trees

similarity(T1, T2)
input: two HC-trees T1 and T2
output: similarity between T1 and T2
1  for all v in V1: B(v) <- {}
2  sct <- cord(root(T1), root(T2), B)
3  M <- ConstructMap(B, T1, T2)
4  svt <- sv_M(root(T1), root(T2))
5  return α1 * sct + α2 * svt

The concept correspondence degree computation function cord(v, u, B) is given by Algorithm 2.

Algorithm 2. Node concept correspondence degree computation algorithm

cord(v, u, B)
input: two nodes v and u
output: concept correspondence degree between v and u
1   if both v and u are leaves
2     return sc(v.a, u.a)
3   else if v is a leaf and u is an internal node with children u1, ..., uq
4     return α * sc(v.a, u.a) + (1 - α) * Σ_{i=1..q} w_2i * cord(v, ui, B)
5   else C(v) <- v's children v1, ..., vp
6     C(u) <- u's children u1, ..., uq
7     for i = 1 to p
8       for j = 1 to q
9         c_ij <- cord(vi, uj, B)
10    m <- ComputeMatching(C(v) ∪ C(u), c)
11    for each (vk, ul) in m, if c_kl > 0
12      B(vk) <- B(vk) ∪ {ul}
13    return α * sc(v.a, u.a) + (1 - α) * Σ_{(vk,ul) in m} ((w_1k + w_2l)/2) * c_kl

Algorithm 2 is a recursive procedure that follows directly from the definition of the concept correspondence degree. The most important part is lines 5-13, where both v and u are internal nodes. A bipartite graph is constructed, taking their children as nodes and the correspondence degrees between the children as the edge weights. The function ComputeMatching (Jungnickel, 2008) returns the maximum weighted bipartite matching, which identifies the most corresponding node pairs among v's and u's children. The matches, which are local maximum correspondence matches, are recorded in B.
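Under the same assumptions as the earlier sketches, Algorithm 1 amounts to a few lines of glue; construct_map below is a sketch of the mapping reconstruction of Algorithm 3 (presented next), rewritten as a top-down walk that keeps only candidates whose parents are already mapped, so that explicit parent pointers are not needed.

```python
def construct_map(B, t1_root, t2_root):
    """Sketch of Algorithm 3: rebuild the maximum correspondence tree mapping
    from the local matches in B, keeping parent-consistent candidates."""
    M = {t1_root: t2_root}
    stack = [t1_root]
    while stack:
        v = stack.pop()
        u = M[v]
        for vc in v.children:
            for cand in B.get(vc, []):
                if cand in u.children:      # parent consistency (condition 3)
                    M[vc] = cand
                    stack.append(vc)
                    break
    return M

def similarity(t1_root, t2_root, sc, alpha=0.5, alpha1=0.5, alpha2=0.5):
    """Sketch of Algorithm 1, composing the earlier cord and sv sketches."""
    B = {}                                       # local matches recorded by cord
    sct = cord(t1_root, t2_root, sc, alpha, B)   # conceptual similarity
    M = construct_map(B, t1_root, t2_root)       # maximum correspondence mapping
    svt = sv(t1_root, t2_root, M)                # value similarity
    return alpha1 * sct + alpha2 * svt
```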
For one node in $T_1$, more than one node may be matched to it during the computation process. However, as proved in (Valiente, 2002), there is a unique maximum correspondence tree mapping $M \subseteq V_1 \times V_2$ such that $M \subseteq B$. Given $B$, the corresponding maximum correspondence tree mapping $M$ can be reconstructed as follows: set $M(root(T_1))$ to $root(T_2)$ and, for all nodes $v \in V_1$ in pre-order, set $M(v)$ to the unique node $u$ with $(v, u) \in B$ and $(parent(v), parent(u)) \in M$. The reconstruction procedure is given by Algorithm 3 (Valiente, 2002).

Algorithm 3. Maximum correspondence tree mapping construction algorithm

ConstructMap(B, T1, T2)
input: node set list B, two HC-trees T1 and T2
output: maximum correspondence tree mapping M from T1 to T2
1  M(root(T1)) <- root(T2)
2  L <- preorder_traversal(T1)
3  for all v in L
4    if v is non-root and B(v) is not empty
5      for all u in B(v)
6        if M(parent(v)) == parent(u)
7          M(v) <- u
8          break
9  return M

5. Two illustrative examples and comparison with other approaches

The proposed HC-tree similarity model and algorithms are intended for use in CBR systems, such as CBR-based warning systems (Zhang, Lu & Zhang, 2009), CBR-based recommender systems (Lu et al., 2010) and web mining systems (Wang, Lu & Zhang, 2007). To show the effectiveness of our model, two examples are provided in this section. In the first example, the process of computing the similarity between the two HC-trees in Fig. 2 is presented to show the behavior of the algorithms proposed in Section 4. In the second example, our similarity model is used in the retrieve stage of a simple CBR system to demonstrate the effectiveness of the model. The proposed model is then compared with other tree similarity evaluation methods.

5.1 Similarity measure computation between two HC-trees

The similarity between $T_1$ and $T_2$ in Fig. 2 is computed by the proposed similarity measurement algorithms as follows. First, the conceptual similarity between $T_1$ and $T_2$, $sct(T_1, T_2)$, is computed by calling $cord(v_1, u_1, B)$. With the coefficient $\alpha = 0.5$, $sct(T_1, T_2)$ is computed as 0.856. During the recursive computation, several maximum weighted bipartite matching problems are solved, and the solutions are recorded in $B$: $B(v_2) = \{u_2\}$, $B(v_3) = \{u_3\}$, $B(v_4) = \{u_4, u_7\}$, $B(v_5) = \{u_5, u_6\}$. Secondly, given $B$, the maximum correspondence tree mapping $M$ between $T_1$ and $T_2$ is constructed by calling $ConstructMap(B, T_1, T_2)$: $M(v_1) = u_1$, $M(v_2) = u_2$, $M(v_3) = u_3$, $M(v_4) = u_4$, $M(v_5) = u_5$. The mapping is illustrated in Fig. 4. Based on the mapping $M$, the value similarity between $T_1$ and $T_2$, $svt(T_1, T_2)$, is evaluated by computing $sv_M(v_1, u_1)$; using Formulas (3.4) and (3.5), $svt(T_1, T_2)$ is 0.6. Finally, with the weights $\alpha_1$ and $\alpha_2$ both set to 0.5, the final similarity between $T_1$ and $T_2$ is $sim(T_1, T_2) = 0.5 \cdot sct(T_1, T_2) + 0.5 \cdot svt(T_1, T_2) = 0.73$.
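In terms of the earlier sketches, this whole computation is a single call; the tree construction below is an illustrative fragment of Fig. 2's $T_1$ built with the hypothetical Node class, not a complete reproduction of the example.

```python
# Build T1 from Fig. 2 and score it against T2 (built the same way).
t1 = Node("A bird flu case", children=[
    Node("migratory bird", weight=0.5, children=[
        Node("long distance migratory", weight=0.5, value=0.3),
        Node("short distance migratory", weight=0.5, value=0.7),
    ]),
    Node("resident bird", weight=0.5, value=0.8),
])
# t2 would be constructed analogously from Fig. 2's second tree; then:
# print(similarity(t1, t2, sc))   # ~0.73 with alpha = alpha1 = alpha2 = 0.5
```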
5.2 Similar cases retrieval

The proposed similarity model is used to retrieve similar cases in a CBR system in the following example.

[Figure omitted.]
Fig. 6. A new case $T_a$ and five existing cases in a case base

As illustrated in Fig. 6, HC-tree $T_a$ represents a new problem to be solved and $T_1, \ldots, T_5$ represent five solved problems in a case base. The conceptual similarity between their attributes is defined as follows: sc(r,R)=0.7, sc(a,A)=0.9, sc(a,B)=0.6, sc(b,A)=0.5, sc(b,B)=0.8, sc(d,D)=1, sc(d,E)=0.5, sc(d,G)=0.4, sc(e,D)=0.5, sc(e,E)=0.9, sc(e,H)=0.4, sc(f,F)=1, sc(f,H)=0.6, sc(f,G)=0.7, sc(g,G)=0.9, sc(g,H)=0.7, sc(r,R')=0.6, sc(a,A')=0.7, sc(a,B')=0.6, sc(b,A')=0.5, sc(b,B')=0.7, sc(d,D')=0.9, sc(d,E')=0.4, sc(d,G')=0.3, sc(e,D')=0.4, sc(e,E')=0.8, sc(e,H')=0.3, sc(f,F')=0.7, sc(f,H')=0.6, sc(f,G')=0.6, sc(g,G')=0.7, sc(g,H')=0.6.

To retrieve the cases most similar to $T_a$, the similarities between $T_a$ and the cases in the case base are evaluated using the similarity model proposed in this paper. Let the coefficients $\alpha$, $\alpha_1$ and $\alpha_2$ all be 0.5, and let the similarity between two values be calculated as one minus the distance between them. The results are shown in Table 1. As can be seen, $T_1$ is the most similar to $T_a$, so $T_1$ is retrieved.

Table 1
Similarity between $T_a$ and the cases in the case base

                 T1     T2     T3     T4     T5
sct(Ta, Ti)    0.703  0.703  0.600  0.623  0.548
svt(Ta, Ti)    0.745  0.304  0.745  0.537  0.365
sim(Ta, Ti)    0.724  0.504  0.672  0.580  0.456

As seen from Fig. 6, $T_2$ and $T_1$ are identical except for their values, so the conceptual similarities $sct(T_a, T_1)$ and $sct(T_a, T_2)$ are equal. However, as $T_1$'s values are much closer to $T_a$'s than $T_2$'s, $svt(T_a, T_1)$ is larger than $svt(T_a, T_2)$, which makes $T_1$ more similar to $T_a$ than $T_2$. $T_3$ and $T_1$ differ in their attributes: the concepts of $T_1$'s attributes are more similar to $T_a$'s than $T_3$'s, which makes $T_1$ more similar to $T_a$ than $T_3$. $T_4$ and $T_1$ have different attribute weights: the weights of the nodes corresponding to $T_a$ are smaller in $T_4$ than in $T_1$, which makes $T_4$ less similar to $T_a$ than $T_1$. The example shows that our similarity model takes into account all the information on nodes' structures, concepts, weights and values, and that it can be used to retrieve the most similar cases effectively in CBR systems.
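In code, the retrieval step is simply a ranking by the similarity sketch above; retrieve is an illustrative helper, not part of the paper's algorithms.

```python
def retrieve(new_case, case_base, sc, **params):
    """Return (score, case) for the case most similar to new_case."""
    return max(((similarity(new_case, c, sc, **params), c) for c in case_base),
               key=lambda pair: pair[0])
```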
5.3 Comparison with other approaches

From the above examples, it can be seen that the proposed similarity evaluation model for HC-trees has five features: (1) it considers nodes' conceptual similarities; (2) it considers the hierarchical relations between concepts; (3) it compares corresponding nodes' values; (4) it considers the influence of nodes' weights; (5) it considers the semantics of nodes' structures. We compare our method with other tree similarity evaluation methods on these five aspects. We consider the methods of Ricci & Senter (1998), Xue, Wang, Ghenniwa, & Shen (2009) and Bhavsar, Boley, & Yang (2004), as they represent different types of methods. The comparison results are shown in Table 2, where "√" indicates that the method has the related feature. Table 2 demonstrates that none of the earlier methods compares tree-structured data as comprehensively as our method, yet all five features are essential to evaluate the similarity between complex tree-structured hierarchical cases. As different HC-trees usually have different structures and attribute terms, the corresponding nodes between two HC-trees must be identified by evaluating their conceptual similarity. As the attributes in hierarchical cases form a hierarchical structure, the hierarchical relations between concepts and the semantics of nodes' structures must be considered. Nodes' values and their weights are essential to describe a case, so they must be compared when comparing two cases. With the above five features, HC-trees can be compared comprehensively and accurately, and the most similar cases can be retrieved. Therefore, the proposed HC-tree similarity evaluation model is highly suitable for retrieving similar cases in CBR systems.

Table 2
Comparison between our proposed method and other methods

Method             Feature 1  Feature 2  Feature 3  Feature 4  Feature 5
Our method             √          √          √          √          √
Ricci's method         √                                           √
Xue's method           √          √
Bhavsar's method                             √          √          √

6. Conclusion and future work

This paper defines hierarchical case trees (HC-trees) to represent hierarchical cases. A similarity evaluation model to compare HC-trees is proposed and the related algorithms are presented. In the model, the concept correspondence degree between nodes is defined and the conceptual similarity between trees is evaluated; a maximum correspondence tree mapping based on nodes' structures and concepts is proposed to identify the corresponding nodes of two trees; the value similarity between two trees is computed based on the mapping; and the final similarity measure between two trees is evaluated by aggregating the conceptual and value similarities. Two illustrative examples show that our method is highly effective for use in CBR systems. Our future research includes: (1) defining fuzzy HC-trees and a fuzzy similarity evaluation model based on our previous study (Lu, Zhang, Ruan & Wu, 2007), and proposing related algorithms, in order to improve inference accuracy in CBR systems; (2) developing software based on the proposed similarity evaluation model for integration into real CBR systems, such as our BizSeeker recommender system (Lu et al., 2010), which would help measure the similarity between two businesses by their product trees, and our CBR-based avian influenza risk early warning system (Zhang, Lu & Zhang, 2009).

Acknowledgements

The work presented in this paper was supported by the Australian Research Council (ARC) under Discovery Project DP0880739 and by the China Scholarship Council.

References

Aamodt, A., & Plaza, E. (1994). Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 39-59.

Akutsu, T., & Halldorsson, M. M. (2000). On the approximation of largest common subtrees and largest common point sets. Theoretical Computer Science, 233, 33-50.

Bhavsar, V. C., Boley, H., & Yang, L. (2004). A weighted-tree similarity algorithm for multi-agent systems in e-business environments. Computational Intelligence, 20(4), 584-602.

Bille, P. (2005). A survey on tree edit distance and related problems. Theoretical Computer Science, 337(1-3), 217-239.

Burke, E., MacCarthy, B., Petrovic, S., & Qu, R. (2000). Structured cases in case-based reasoning--re-using and adapting cases for time-tabling problems. Knowledge-Based Systems, 13(2-3), 159-165.

Falkman, G. (2000). Similarity measures for structured representations: a definitional approach. In E. Blanzieri & L. Portinale (Eds.), Advances in Case-Based Reasoning: 5th European Workshop, EWCBR 2000, Trento, Italy, September 6-9, 2000, Proceedings. Lecture Notes in Artificial Intelligence, 1898, 380-392. Springer-Verlag, Berlin Heidelberg.

Jeong, B., Lee, D., Cho, H., & Lee, J. (2008).
A novel method for measuring semantic similarity for XML schema matching. Expert Systems with Applications, 34(3), 1651-1658.

Jungnickel, D. (2008). Graphs, Networks and Algorithms. Berlin: Springer, 419-430.

Kailing, K., Kriegel, H. P., Schonauer, S., & Seidl, T. (2004). Efficient similarity search for hierarchical data in large databases. In E. Bertino et al. (Eds.), Advances in Database Technology - EDBT 2004: 9th International Conference on Extending Database Technology, Heraklion, Crete, Greece, March 14-18, 2004, Proceedings. Lecture Notes in Computer Science, 2992, 676-693. Springer-Verlag, Berlin Heidelberg.

Lin, Z., Wang, H., McClean, S., & Liu, C. (2008). All common embedded subtrees for measuring tree similarity. 2008 International Symposium on Computational Intelligence and Design, 1, 29-32.

Lu, J., Zhang, G., Ruan, D., & Wu, F. (2007). Multi-Objective Group Decision Making: Methods, Software and Applications with Fuzzy Set Techniques. London: Imperial College Press.

Lu, J., Shambour, Q., Xu, Y., Lin, Q., & Zhang, G. (2010). A hybrid semantic recommendation system for personalized government-to-business e-services. Internet Research (acceptance date: 30 January 2010).

Rahman, M., & Chow, T. W. (2010). Content-based hierarchical document organization using multi-layer hybrid network and tree-structured features. Expert Systems with Applications, 37(4), 2874-2881.

Ricci, F., & Senter, L. (1998). Structured cases, trees and efficient retrieval. In B. Smyth & P. Cunningham (Eds.), Advances in Case-Based Reasoning: 4th European Workshop, EWCBR-98, Dublin, Ireland, September 23-25, 1998, Proceedings. Lecture Notes in Artificial Intelligence, 1488, 88-99. Springer-Verlag, Berlin Heidelberg New York.

Sanders, K. E., Kettler, B. P., & Hendler, J. A. (1997). The case for graph-structured representations. In D. B. Leake & E. Plaza (Eds.), Case-Based Reasoning Research and Development: Second International Conference on Case-Based Reasoning, ICCBR-97, Providence, RI, USA, July 25-27, 1997, Proceedings. Lecture Notes in Artificial Intelligence, 1266, 245-254. Springer, Berlin Heidelberg.

Torsello, A., Hidovic, D., & Pelillo, M. (2004). Four metrics for efficiently comparing attributed trees. 17th International Conference on Pattern Recognition (ICPR'04), 2, 467-470.

Tran, T., Nguyen, C. C., & Hoang, N. M. (2007). Management and analysis of DNA microarray data by using weighted trees. Journal of Global Optimization, 39(4), 623-645.

Valiente, G. (2002). Algorithms on Trees and Graphs. New York: Springer, 16-22, 206-224.

Wang, C., Lu, J., & Zhang, G. (2007). Mining key information of web pages: a method and its application. Expert Systems with Applications, 33, 425-433.

Xue, Y., Wang, C., Ghenniwa, H. H., & Shen, W. (2009). A new tree similarity measuring method and its application to ontology comparison. Journal of Universal Computer Science, 15(9), 1766-1781.

Yang, L., Sarker, B. K., Bhavsar, V. C., & Boley, H. (2005). A weighted-tree simplicity algorithm for similarity matching of partial product descriptions. Proceedings of the ISCA 14th International Conference on Intelligent and Adaptive Systems and Software Engineering, 55-60.

Yang, R., Kalnis, P., & Tung, A. (2005). Similarity evaluation on tree-structured data. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, 754-765.

Zhang, J., Lu, J., & Zhang, G.
(2009). Case-based reasoning in avian influenza risk early warning. The Second Conference on Risk Analysis and Crisis Response (RACR), October 2009, Beijing, China. Paris: Atlantis Press, 246-251.

Zhang, K. (1993). A new editing based distance between unordered labeled trees. In A. Apostolico et al. (Eds.), Combinatorial Pattern Matching: 4th Annual Symposium, CPM 93, Padova, Italy, June 2-4, 1993, Proceedings. Lecture Notes in Computer Science, 684, 254-265. Springer-Verlag, London, UK.