doi:10.1016/j.eswa.2007.05.048 Available online at www.sciencedirect.com www.elsevier.com/locate/eswa Expert Systems with Applications 34 (2008) 3021–3032 Expert Systems with Applications An approach to mining the multi-relational imbalanced database Chien-I Lee a, Cheng-Jung Tsai b,*, Tong-Qin Wu a, Wei-Pang Yang c a Department of Information and Learning Technology, National University of Tainan, Tainan, Taiwan, ROC b Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, ROC c Department of Information Management, National Dong Hwa University, Hualien, Taiwan, ROC Abstract The class imbalance problem is an important issue in classification of Data mining. For example, in the applications of fraudulent telephone calls, telecommunications management, and rare diagnoses, users would be more interested in the minority than the majority. Although there are many proposed algorithms to solve the imbalanced problem, they are unsuitable to be directly applied on a multi- relational database. Nevertheless, many data nowadays such as financial transactions and medical anamneses are stored in a multi-rela- tional database rather than a single data sheet. On the other hand, the widely used multi-relational classification approaches, such as TILDE, FOIL and CrossMine, are insensitive to handle the imbalanced databases. In this paper, we propose a multi-relational g-mean decision tree algorithm to solve the imbalanced problem in a multi-relational database. As shown in our experiments, our approach can more accurately mine a multi-relational imbalanced database. � 2007 Elsevier Ltd. All rights reserved. Keywords: Data mining; Classification; Imbalance; Relational database 1. Introduction Most structural data nowadays, like financial transac- tions and medical anamneses, are stored in the multi-rela- tional database. A multi-relational database usually consists of numerous data sheets and each of them would contain a number of tuples. A data sheet is also called a relation and there are usually one target relation and several non-target relations in a multi-relational database. Each tuple in the target relation is composed of several attribute values and a target class, but a tuple in the non-target rela- tions contains only the attribute values. Therefore, a tuple in the target relation is also called target tuple. No matter the target or non-target relation, they all possess a primary key and some foreign keys. The foreign key usually con- 0957-4174/$ - see front matter � 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2007.05.048 * Corresponding author. Tel.: +886 6 2133111x777; fax: +886 6 3017137. E-mail addresses: leeci@mail.nutn.edu.tw (C.-I. Lee), tsaicj@cis.nctu. edu.tw (C.-J. Tsai), wpyang@cis.nctu.edu.tw, wpyang@mail.ndhu.edu.tw (W.-P. Yang). nects to the primary or foreign key in other relations to make the database operations such as join or select feasible. There are mainly two kinds of connection between the rela- tions, which are (a) the connections between a primary key K and some foreign keys pointing to K; (b) the connections between two foreign keys K1 and K2 which connect to an identical primary key K. Unfortunately, the classification techniques of tradi- tional data mining, such as decision trees (Quinlan, 1993), neural networks (Mitchell, 1997) and support vector machines (Burges, 1998), are only applied for a singular data sheet. In order to extract out the practical and valu- able information from the relational databases, many multi-relational classification approaches had been pro- posed in recent years. The most widely used category of approaches to multi-relational classification is Inductive Logic Programming (Lavrac & Dzeroski, 1994) such as FOIL (Quinlan et al., 1993), Golem (Muggleton & Feng, 1990), Progol (Muggleton, 1995), TILDE (Blockeel, De Raedt, & Ramon, 1998), and CrossMine (Yin, Han, Yang, & Yu, 2004). However, the rules mined by these methods will be guided by the negative/major tuples. That is, mailto:leeci@mail.nutn.edu.tw mailto:tsaicj@cis. nctu.edu.tw mailto:tsaicj@cis. nctu.edu.tw mailto:wpyang@cis.nctu.edu.tw mailto:wpyang@mail.ndhu. edu.tw Fig. 1. A typical decision tree. 3022 C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 although it can generate a classifier with high accuracy, it is unable to provide the rules precisely for the positive/minor tuples. For some practical applications such as fraudulent telephone calls and rare diagnoses, users would be more interested in the minor tuples. Although there are also many proposed solutions for the imbalanced problem (Jap- kowicz & Stephen, 2002; Tsai, Lee, Chen, & Yang, 2007), they are unsuitable to be directly applied in a multi-rela- tional database. Thus, in this paper, we propose a multi- relational g-mean decision tree, called Mr.G-Tree, as the new solutions for an imbalanced multi-relational database. As it is named, Mr.G-Tree is a decision-tree-based algo- rithm. We focus on the decision-tree-based approaches since decision tree has several advantages (Rastogi & Shim, 2000), which are (a) decision tree is more efficient for large training data than neural networks which would spend a lots time on thousands of iterations; (b) a decision tree algorithm does not require the domain knowledge or prior knowledge; (c) decision tree displays the good classification accuracy compared to other techniques. For clear explana- tion, in the paper we will use ‘‘the positive’’ or ‘‘the minor- ity’’ to denote the minor but interesting tuples, and on the contrary ‘‘the negative’’ or ‘‘the majority’’ to represent the major but trivial target classes in a database. The remainder of this paper is organized as follows. Sec- tion 2 is the review of the related works. In Section 3, multi- relational g-mean decision tree will be introduced. The performance evaluation of will be shown in Section 4. Finally, the conclusion and future research directions will be demonstrated in Section 5. 2. Related work In this section, we first introduce the main principle for inducing a decision tree in Section 2.1. We then review some approaches which were proposed to mine a multi- relational database in Section 2.2. The imbalanced problem will be discussed in Section 2.3. 2.1. Decision tree A decision tree (Han & Kamber, 2001; Quinlan, 1993) is a flow-chart-like tree structure, which is constructed by a recursive divide-and-conquer algorithm. In a decision tree, each internal node denotes a test on an attributes, each branch represents an outcome of the test, and each leaf node has an associated target class. The top-most node in a tree is called root and each path from the root to a leaf node represent a rule. A typical decision tree is shown in Fig. 1. To classify an unknown example, beginning with the root node, successive internal nodes are visited until this example reaches a leaf node. The class of this leaf node is then assigned to this example as a prediction. For instance, the decision in Fig. 1 will approve a golden credit card application if the applicant has a salary higher than 85000 and his repayment record is good. 2.2. Multi-relational classifier In the current field of the multi-relational classification, the most common one is Inductive Logic Programming (ILP). The formal definition of the ILP problem is: ‘‘Given background knowledge B, positive tuples TP, and negative tuples TN, find a hypothesis H, which is a set of Horn clauses, such that "tp 2 TP : H [ Bj=tp and "tn 2 TN : H [ Bj5tn’’. The well known ILP systems are FOIL (Quinlan et al., 1993), Golem (Muggleton & Feng, 1990), Progol (Muggle- ton, 1995), TILDE (Blockeel et al., 1998), and CrossMine (Yin et al., 2004). First-order inductive learning (FOIL) takes CN2 algorithm (Clark & Niblett, 1989) as the basis and applies top-down and general-to-specific search to establish numerous rules. Each rule will include as many the positive as possible. On the contrary, Golem adopts bottom-up and specific-to-general search, it uses the tech- niques of RLGG (Relative Least General Generalization) to undertake generalization among a number of specific rules. Regarding with Progol, it is based on the AQ algo- rithm (Michalski, 1969) and integrates the searching meth- ods in FOIL and Golem. However, higher calculating cost will be caused when the database is enlarged as the com- mon disadvantage for traditional ILP approaches. In order to reduce the time cost, TOP-down induction of first order logical decision trees (TILDE) is proposed. TILDE is a binary logical decision tree and based on the C4.5 algorithm. Experiments also show that its accuracy is definitely better than the traditional ILP approaches (Boström, 1995). However, TILDE is inapplicable the large-scale databases. To solve the scalability problem, based on FOIL, CrossMine algorithm (Yin et al., 2004) is proposed. In order to reduce the requirement of memory, CrossMine virtually joins target relation and non-target relation together. It propagates the primary key and the target class in target relation to all non-target relations. Accordingly, there are two additional columns ‘‘IDs’’ and ‘‘class labels’’ in each non-target relation. By this propaga- Table 1 The confusion matrix Predicted negative Predicted positive True negative a b True positive c c C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 3023 tion approach, CrossMine can calculate the foil gain in each relation without joining all the relations into an indi- vidual table and therefore the memory cost can be well reduced. Owing to the well consideration of connections among all relations, its accuracy performs much better than FOIL. 2.3. Class imbalanced problem The proposed techniques aiming at the class imbalanced problem so far could be classified into three categories as follows (Barandela, Sanchez, Garcia, & Rangel, 2003): 2.3.1. Sampling-based Over-sampling and under-sampling are two main tech- niques in this category. Over-sampling could be further classified into random over-sampling and focused over-sam- pling (Aha, Kibler, & Albert, 1991; Han, Wang, & Mao, 2005). Random over-sampling approach over-samples the minority class at random until it matches the size of the majority class. Focused over-sampling approach over-sam- ples the minority class only with data close to the bound- aries between the minority class and the majority class. Similarly, under-sampling could be also classified into ran- dom under-sampling and focused under-sampling (Deh- meshki, Karakoy, & Casique, 2003; Derouin, Brown, Beck, Fausett, & Schneider, 1991; Lewis & and Catlett, 1997). The former approach removes the majority class at random until it contains as many examples as the minor- ity class, and the latter one removes the majority examples lying further away. The main idea of focused under-sam- pling is to remove the noise or outlier data and to reduce the size of majority class by sampling. The combination of over-sampling and under-sampling has also been pro- posed (Cohen, Hilario, Sax, Hugonnet, & Geissbühler, 2006; Zhou & Liu, 2006). However, over-sampling will increase the training set size and therefore enlarge the com- putational burden and the impact of noise data; under- sampling has been proven to be ineffective since it results in excluding some useful information (Barandela et al., 2003; Japkowicz & Stephen, 2002). 2.3.2. Cost-based Cost-Modifying approach (Pazzani et al., 1994; Zadrozny & Elkan, 2001) reduces the relative misclassifi- cation cost of the majority class (or increasing that of the minority class) to make it correspond to the size of the minority class. However, it is hard for a user to assign a proper cost when he/she is unfamiliar with the domain knowledge (Japkowicz & Stephen, 2002). 2.3.3. Imbalance-insensitive This approach is more attractive and has been proven to be more effective than the above two approaches (Jap- kowicz & Stephen, 2002). The main idea of this technique is to develop an approach that is insensitive to the imbal- ance problem. Proposed techniques including example weighting, rule removing, attribute correlation analysis, etc. (Anto, Susumu, & Akira, 2000; Ezawa, Singh, & Nor- ton, 1996; Fawcett & Provost, 1996; Kubat, Holte, & Mat- win, 1998; Lawrence, Burns, Back, Tsoi, & Giles, 1998). Among the proposed imbalance-insensitive approaches, some of them are limited to specific dataset and some take a lot of training time due to the natural property of neural network. Comparatively, SHRINK (Kubat et al., 1998) is applicable to most applications with numeric attribute data and eliminates the disadvantage described above. SHRINK was developed by the principle of BRUTE (Rid- dle, Segal, & Etzioni, 1994), it searches for the most accu- racy for not only the minority, but also the majority simultaneously by the use of g-mean to reach the best par- tition. However, its shortcoming is to only take care of numeric attribute and search for the best interval in a single one to each attribute. Once if the positive are distributed in two extremities of attribute values, SHRINK will have a poor accuracy. 3. Multi-relational g-mean decision tree algorithm Although there are many multi-relational database min- ing classifiers as described in Section 2.1, they can not solve the imbalanced problem. Furthermore, the proposed algo- rithms to solve the imbalanced dataset are unsuitable to be directly applied on a multi-relational database either. Thus, in this paper, we propose the Multi-relational g-mean deci- sion tree, called Mr.G-Tree, to solve this problem. Without loss of generality, only the case that there are two target classes in a database will be discussed in our paper. If more than two target classes are involved, the target class with the smallest ratio or the interesting one can be identified as the minority, and the rest will be regarded as the majority. 3.1. G-mean based classification model The confusion matrix in Table 1 was a widely used matrix to illustrate different classification measurements. In this matrix, a means the number of the negative that is accurately classified, b is the number that the negative is classified into the positive, c denotes the number that the positive is classified into the negative, and d signifies the number of the positive that is accurately classified. The traditional accuracy is defined as (a + d)/ (a + b + c + d). However, to handle the imbalanced prob- lem, g-mean in Eq. (1) was introduced to represent the 3024 C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 average proportion for the accuracy of the minority and the majority (Rastogi & Shim, 2000). g ¼fða � dÞ=½ða þ bÞ�ðc þ dÞ�g1=2: ð1Þ As stated in Section 2.3, the shortcomings of SHRINK are (a) it establishes only a best interval for each attribute; (b) it is inapplicable to the categorical attribute. These two properties would reduce the predicted accuracy when there are categorical attributes in the training data and when the minority distribute over several intervals of an attribute (Tsai et al., 2007). In order to build a classifier which can handle the imbalanced data more accurately, based on g-mean measurement, we establish a best interval for each numeric attribute and a best subset for each categorical attribute in every internal node o. Each best interval/best subset of the attribute i denotes a test for a tuple. To give higher importance to a test with smaller error, each test is associated with a weight as in Eq. (2) according to its g-mean value gi wi ¼ log½gi=ð1 � giÞ�: ð2Þ Then all of them are combined as a splitting function as shown in Function 3 for an internal node of Mr.G-Tree. SFðoÞ¼ X i hi � wi; ð3Þ In Eq. (3), hi = 1 if the value of attribute i falls in this inter- val and hi = �1 otherwise. However, to ensure that all weight is larger than 0, the test with gi < 0.5 is discarded. The best interval of a numeric attribute is defined as [1/2(aip + minai), 1/2(aiq + maxai)], where minai and maxai, respectively denotes the minimal and maximal attribute value of the interval with the maximal g-mean, aip is the attribute value of tuple p with the order prior to the tuple with attribute value minai, and aip is the attribute value of tuple q with the order posterior to the tuple with attribute value maxai. The reason is that when predicting the unseen data, the output hi of an example with attribute value lies in the [aip, minai] and [aiq, maxai] is unclear. To deal with cat- egorical attribute, we adopt Set Theory to establish the best subset. First, for each categorical attribute, we classify all tuples by their target classes and establish a corresponding power set. Each power set would contain 2k-2 subsets, where k is the number of values in this categorical attri- bute. For each subset, the tuples belonging to it are regard Procedure Leaf_Node(n) Begin For each leaf node If NP ≤(NN ×TP) / TN then The target class of this node is negati Else The target class of this node is positiv End if End Fig. 2. The pseudo code of the hand as the positive, and are regard as the negative otherwise. Finally, the g-mean of each subset is computed and the sub- set with the maximal g-mean is selected as the best subset. However, when a categorical attribute contains many val- ues, the computational cost for a categorical attribute by our approach would be nontrivial. To solve this problem, when the number of values in a categorical attribute is lar- ger than 10, we use a simpler approach. That is, for each value, the tuples belonging to it are regard the positive, and are regard as the negative otherwise. As a result, there are only k subsets when k is larger than 10. In addition, when a leaf node is not pure, most decision tree algorithms will assign this leaf node a negative class if there are more negative instances in this node. However, such a method may reduce the accuracy of the positive that the users are interested in. Thus, in order to mine the posi- tive more accurately, when assigning the target class to a leaf node, Mr.G-Tree adds in the consideration of the imbalanced problem as shown in Definition 2. Fig. 2 repre- sents the corresponding pseudo code. Definition 1. Given a training dataset T that includes both the minority and majority. For any leaf node Ni on Mr. G-Tree, assume that NP represents the number of the minority in Ni, NN represents the number of the majority in Ni, TP represents the number of all the majority in T, and TN represents the number of all the majority in T; if NP 6 (NN · TP)/TN, then the target class of leaf node Ni will be negative, and vice versa. Example 1. To make readers understand our approach mentioned in this section more clearly, here we take the sin- gle-relational database in Table 2 as an example. This data- base contains 20 tuples in which 2 tuples are the positive and 18 ones are the negative and the column BMI denotes ‘‘body mass index’’. The Mr.G-Tree trained from Table 2 is illustrated in Fig. 3. In the Node 1 of Fig. 3, the best inter- val of Attribute ‘‘age’’ is [48, 66] and its g-mean is [(13/ 18) · (2/2)]1/2 = 0.85, consequently the weight of this best interval is log [0.85/(1 � 0.85)] = 0.75. Similarly, the best interval, g-mean, and weight of Attribute ‘‘BMI’’ is [13, 18], 0.85, and 0.75, respectively. As for the subsets and the corresponding g-mean of Attribute ‘‘sport’’ and ‘‘smoking’’ in Node 1, we detail them in Table 3 in which S(i) denote a subset that contains the attribute-value i. Obviously, the best subset Attribute ‘‘sport’’ is [occasional, ve; e; ling of leaf nodes in Mr.G-Tree. Table 2 A training database Patient Age BMI Sport Smoking Diagnosis 1 41 12 Daily Yes Health 2 64 14 Occasional Yes Cancer 3 57 17 Occasional No Health 4 27 30 Daily Yes Health 5 31 25 Daily Yes Health 6 35 28 Daily Yes Health 7 37 15 Daily Yes Health 8 49 17 Never Yes Cancer 9 52 15 Never No Health 10 60 15 Occasional No Health 11 47 28 Daily Yes Health 12 23 26 Daily Yes Health 13 33 32 Never Yes Health 14 28 33 Occasional Yes Health 15 45 19 Never Yes Health 16 68 26 Occasional Yes Health 17 72 19 Never Yes Health 18 62 21 Daily Yes Health 19 56 16 Occasional No Health 20 24 20 Occasional Yes Health Fig. 3. The classification model constructed by using Table 2. C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 3025 never] and the g-mean and weight of this subset is [(8/ 18) · (2/2)]1/2 = 0.67 and log [0.67/ (1 � 0.67)] = 0.31, respectively. Note that since the g-mean of the best subset S(yes) in Attribute ‘‘smoking’’ is less than 0.5, it is not used in Node 1. Finally, we can get the splitting function SF(o) in each node o in Fig. 3 is Table 3 The power set and corresponding g-mean of Attribute (a) ‘‘sport’’ and (b) ‘‘smoking’’ in Node 1 of Fig. 3 (a) Subset S(d, o) S(o, n) S(d, n) S(d) S(o) S(n) G-mean ffiffiffiffiffiffiffiffiffiffiffi 4 18 � 1 2 q ffiffiffiffiffiffiffiffiffiffiffi 8 18 � 2 2 q ffiffiffiffiffiffiffiffiffiffiffi 6 18 � 1 2 q ffiffiffiffiffiffiffiffiffiffiffi 9 18 � 0 2 q ffiffiffiffiffiffiffiffiffiffiffi 12 18 � 1 2 q ffiffiffiffiffiffiffiffiffiffiffi 14 18 � 1 2 q (b) Subset S(yes) S(no) G-mean ffiffiffiffiffiffiffiffiffiffiffi 4 18 � 2 2 q ffiffiffiffiffiffiffiffiffiffiffi 10 18 � 0 2 q SFð1Þ: 0:75 � hage½48; 66�þ 0:75 � hBMI ½13; 18� þ 0:31 � hsport½occasional; never�; SFð2Þ: 0:66 � hsmoking½yes�: 3.2. Handling an imbalanced multi-relational database In order to mine the multi-relational database and avoid the shortcomings of waste memory by joining all data- sheets to a single one, Mr.G-Tree extended the concepts of propagation introduced in CrossMine to virtually link the target relation and non-target relation together. Yet, CrossMine will generate redundant target classes in each non-target relation, which will result in the incorrect g-mean calculated. For example, assuming there are two tuples t1 and t2 in target relation R1 and both of them link to the tuple a1 in non-target relation R2. If applying Cross- Mine, tuple a1 will possess two target classes simulta- neously, affecting the original distribution of target classes in non-target relation R2. If g-mean is calculated by such propagation outcomes, the unreal values for all attributes in R2 will be obtained owing to two target classes of a1 are considered. Take the attribute ‘‘smoking’’ in Table 4 as the example, there are only 9 tuples in non- target relation ‘‘Physical’’, but if taking the propagation method proposed in CrossMine, the number of tuples will be increased to 18 and then the original class distribution in non-target relation ‘‘Physical’’ will be changed. In order to solve this problem, Mr.G-Tree introduces the g-mean Tuple ID propagation algorithm, also known as GTIP algorithm. GTIP maintains the original data dis- tribution in each non-target relation by restoring the num- ber of target classes of each tuple to a single one as described in Definition 2. Definition 2. Given a non-target relation R0 and assume that each tuple ti in R 0 contain two additional columns ‘‘IDs’’ and ‘‘class labels’’ after the propagation. If the class labels of a tuple are all the positive, this tuple will be regarded as a positive tuple; if the class labels of a tuple are all the negative, it will be regarded as a negative tuple; if tuple ti contains both positive and negative class labels, this tuple would contains CP/(CP + CN) positive class and CN/(CP + CN) negative class, where CP and CN, respec- tively represents the numbers of the positive and that of the negative in the class labels column. By Definition 2, GTIP can prevent the original distribu- tion of target class in non-target relation from the distor- tion and the inaccurate g-mean. Fig. 4 is the pseudo code of GTIP algorithm. Here we again take Table 4c as an example. The result of GTIP is shown in Table 5. In addition, semantic links is always a major issue in the research field of Multi-relational database. The research shows that the relations with longer semantic links will become less important in the Multi-relational database (Liu, Yin, & Han, 2005). Thus, in order to design a classi- fier suitable for the multi-relational database, Mr.G-Tree Table 4 A relational database in which (a) the connection between two relations ‘‘Patient’’ and ‘‘Physical’’; (b) the details of relation ‘‘Patient’’; (c) the details of relation ‘‘Physical’’ (a) Patient Physical Patient-id Physical-id Condition-id ! Condition-id Age BMP Sport Smoking Class label (b) Patient (c) Physical Patient-id Condition-id Age Sport Class Physical-id Condition-id BMP Smoking 1 12 41 Daily Health 21 24 22 Yes 2 14 64 Occasional Cancer 22 14 16 Yes 3 17 57 Occasional Health 23 17 16 Yes 4 26 27 Daily Health 24 14 18 Yes 5 26 31 Daily Health 25 15 24 No 6 28 35 Daily Health 26 15 21 No 7 15 37 Daily Health 27 28 20 No 8 17 49 Never Cancer 28 28 25 No 9 15 52 Never Health 29 26 14 No 10 15 60 Occasional Health 11 24 47 Daily Health 12 24 23 Daily Health 13 24 33 Never Health 14 33 28 Occasional Health 15 19 45 Never Health 16 26 68 Occasional Health 17 19 72 Never Health 18 14 62 Daily Health 19 16 56 Occasional Health 20 20 24 Occasional Health 3026 C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 takes the semantic link into account and defines a non-tar- get relation as the effective relation or non-effective relation as shown in Definition 3. While constructing each node, Mr.G-Tree will only propagate the IDs and class labels to an effective relation and then calculate the best inter- vals/subsets of all attributes in the effective relation. In other words, the tuples used to build the node in an upper layer will be closer to the target relation, which can solve the problem of weaker semantic links. Definition 3. Assume that the length of semantic links between a non-target relation and the target relation is l, and the depth of Mr.G-Tree is d (the depth of root node is 0). While constructing a node in k-layer of Mr.G-Tree, a Procedure GTIP(R’) Begin For each non-effective relation R’ /* a non-eff For each primary key and foreign-key k If R’ can connect to a relation R th R’.k then Propagate IDs and class label End if For each tuple ti in relation R’ Assign tuple ti with CP / (CP+CN) posi class; Return the new target class to each tupl End Fig. 4. The pseudo code of g-mean non-target relation is defined as an effective relation if l 6 d, and is defined as a non-effective relation otherwise. 3.3. The detailed step of Mr.G-Tree Integrating the methods mentioned in Sections 3.1 and 3.2, we detail the steps of multi-relational g-mean decision tree as follows. Fig. 5 is the pseudo code of Mr.G-Tree. 1. Performing GTIP algorithm to all effective relation. 2. Calculating the best interval or best subset of each attribute in every effective relation and compute the g-mean. ective will be defined in Definition 3 in R’ at contain the column “class labels” via s from R to R’; tive class and CN / (CP+CN) negative e; tuple ID propagation algorithm. Procedure Traintree(D) Begin Initial Mr. G-Tree’s current depth d = 0; If D is pure or each weight wi = 0 in D then Return; Else If the Mr.G-Tree get deeper then d = d + Initial Left_D = φ , Right_D = φ ; End if For each non-effective relation R’ Call GTIP(R’); For each effective relation R For every attribute i If the attribute is numeric then Sort all examples according the v For each positive example j Calculate the interval; Calculate the g-mean gi of the i Select the best interval whose g If gi < 0.5 then discard this Elseif gi = 1 then wi = 1; Else wi = log[gi / (1-gi)] End if Else Building the corresponding power For each subset Classify all examples to the co Calculate the g-mean gi of this Select the best subset whose g If gi < 0.5 then discard this Elseif gi = 1 then wi = 1; Else wi = log[gi / (1-gi)] End if End if Classification function SF = hi ×wi; For each training example t∈ D If SF(t) ≥ 0 then Left_D = Left_D ∪t; Else Right_D = Right_D ∪t; End if Traintree (Left_D); Traintree (Right_D); Call Leaf_Node(D); End Fig. 5. The pseudo co Table 5 The propagation results of Table 4c by using GTIP Physical Physical-id Condition-id BMP Smoking IDs Class labels 21 24 22 Yes 11, 12, 13 c:0, h:1 22 14 16 Yes 2, 18 c:1/2, h:1/2 23 17 16 Yes 3, 8 c:1/2, h:1/2 24 14 18 Yes 2, 18 c:1/2, h:1/2 25 15 24 No 7, 9, 10 c:0, h:1 26 15 21 No 7, 9, 10 c:0, h:1 27 28 20 No 6 c:0, h:1 28 28 25 No 6 c:0, h:1 29 26 14 No 4, 5, 16 c:0, h:1 C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 3027 3. Discarding the attribute if its best g-mean is less than 0.5. 4. Calculating the weight of every attribute as wi= log [gi/(1 � gi)]. 5. Let hidenote the output function of attribute i. If an attribute value lies in the best interval then hi = 1; hi = �1 otherwise. 6. Combining the output function and weight of all the attributes to generate the splitting function in each node o as SF(oÞ¼ P ihi � wi. 7. A tuple is allotted to the left child node if the SF is greater than 0; otherwise, to the right one. 8. When other relations are able to link to an effective relation through primary key and foreign key, the data in the column of IDs and class labels will be 1; alue of this attribute; nterval [1/2(aip + minai), 1/2(aiq + maxai); i is the maximal; attribute; set; rresponding subset by their class; subset; i is the maximal; attribute; de of Mr.G-Tree. Fig. 6. The Mr.G-Tree constructed by using the relational database in Table 4. Table 6 The details of two real multi-relational databases Id Database Financial (1) Mutagenesis (2) # Non-target relation 7 3 # Attribute in target relation 4 4 # Positive tuples in target relation 76 64 # Total tuples in target relation 400 2 # Tuples in all relations 75,982 15,218 3028 C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 propagated to this relation. After that, setting the relation as the effective one. 9. Repeating the steps from 1 to 8 until all the nodes are pure or can not be further spilt. 10. Assigning the target class to each leaf node according to Definition 2. Table 7 The parameters of synthetic multi-relational database generator Name Description Parameter jRj Number of relation x T Number of tuples in each relation y Imbalance The percentage of positive tuples in target relation z% Tmin Minimal number of tuples in each relation 50 Amin Minimal number of attributes in each relation 2 Amax Maximal number of attributes in each relation 5 Vmin Minimal value in each attribute 2 Vmax Maximal value in each relation 1000 Fmin Minimal number of foreign keys in each relation 2 Fmax Maximal number of foreign keys in each relation 5 3.4. An example of Mr.G-Tree Here, we clearly introduce the process of Mr.G-Tree by using Table 4 again. In the beginning, Mr.G-Tree calcu- lates the best interval/subset and the corresponding g-mean of Attribute ‘‘age’’ and ‘‘sport’’ in the effective relation ‘‘Patient’’. As described in Example 1, the best interval of the numeric attribute age sport is [48, 66] and the corre- sponding g-mean is 0.85; as for the categorical attribute ‘‘sport’’, its best subset is [occasional, never] and 0.67 for g-mean. The weight for those two attributes is respectively wage = 0.75 and wsport = 0.31. Hence, the splitting function in the root node of Mr.G-Tree is SFð1Þ¼ 0:75�hage½48;66�þ0:31�hsport½occasional; never�: Since the non-target relation ‘‘Physical’’ can link to effec- tive relation ‘‘Patient’’ through foreign key ‘‘condition- id’’, Mr.G-Tree will propagate the data in the column ‘‘IDs’’ and ‘‘class labels’’ to the relation ‘‘Physical’’ in the next loop. The propagation results have been shown in Table 5. In the meanwhile, the relation ‘‘Physical’’ will be set as effective one. Next, Mr.G-Tree calculates the g-mean of Attribute ‘‘age’’, ‘‘sport’’, ‘‘BMP’’, and ‘‘smoking’’ from all effective relation. The result comes that the g-mean of Attribute ‘‘age’’ and ‘‘sport’’ are both less than 0.5, so the two attributes are discarded. However, the best interval of the numeric attribute ‘‘BMP’’ is [15, 19] and the corre- sponding g-mean is 0.84; as for the categorical attribute ‘‘smoking’’, its best subset is [yes] and 0.77 for g-mean. The weight for those two attributes is wsmoking = 0.52 and wBMP = 0.72, respectively. Hence, the splitting function in Node 2 of Mr.G-Tree is SFð2Þ¼ 0:52 � hsmoking½yes�þ 0:72 � hBMP½15; 19�: Finally, taking use of the rules described in Definition 2 to assign the target class to all leaf nodes, we can get the Mr.G-Tree as illustrated in Fig. 6. 4. Experimental analyses In this section, the software and hardware environment and the experimental databases for our experiments will be shown in Section 4.1. The measurement for evaluation of an imbalanced database will be introduced in Section 4.2. Finally, the comparisons among TILDE, CrossMine and Mr.G-Tree are illustrated in Sections 4.3 and Section 4.4. 4.1. The experimental environment and databases We implemented our Mr.G-Tree and two state-of-the- art multi-relational classifiers TILDE and CrossMine in Microsoft Visual C++ 6.0 for performance analysis. All experiments in this paper are implemented under the Win- dows 2000 professional on a PC equipped with Intel Pen- tium IV 3.0 GHz CPU and 512 MB DDR memory. Our experimental databases can be divided into 2 parts as follows. 4.1.1. Real multi-relational databases Two real multi-relational databases: Financial database and Mutagenesis database which had been used in Cross- Mine (Yin et al., 2004) are again applied. Financial data- base includes 1 target relation, 7 non-target relations and 75982 tuples; Mutagenesis database contains 1 target rela- 90 100 TILDE C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 3029 tion, 3 non-target relations and 15,218 tuples. The details of the two databases are listed in Table 6. 0 10 20 30 40 50 60 70 80 A cc ur ac y of t he p os it iv e (% ) CrossMine Mr.G-Tree 4.1.2. Synthetic multi-relational databases To have further experimental analysis, we code a data generator to generate proper synthetic multi-relational dat- abases (Quinlan et al., 1993). Each synthetic multi-rela- tional database contains one target relation and several non-target relations. Table 7 is the related parameter of our data generator. Five parameters: Tmin, Amin, Amax, Vmin and Fmin, will be fixed by referencing the settings applied in CrossMine. 2 Databases 1 Fig. 7. The comparison of the accuracy of the positive in two real relational databases. 4.2. The measurement Since the accuracy which is widely used to evaluate a classifier is not a proper metric (Joshi, 2002); in this paper, we use the accuracy of the minority as shown in Eq. (4), where c and d have been described in Table 1, to evaluate the accuracy of a classifier on the minority. accuracyþ ¼ d=ðc þ dÞ; ð4Þ Furthermore, ZR and ZP (Joshi, 2002) in Eqs. (5) and (6) are utilized to evaluate the overall accuracy between Clas- sifier A and B. In Eqs. (5) and (6), RA, RB, PA, and PB, respectively represents the recall of Classifier A, the recall of Classifier B, the precision of A and that of B; nc denotes the total number of the positive, nA0 is the number of the po- sitive predicted by A, nB0 is the number of the positive pre- dicted of B, R = (RA + RB)/2, and P = [(n A 0 � P AÞ + (nB0 � P BÞ]/(nA0 þ nB0Þ. Z R ¼ RA � RBffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2Rð1 � RÞ=nc p ð5Þ Z P ¼ P A � P Bffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Pð1 � PÞð1=nA0 þ 1=nB0Þ p : ð6Þ If ZR P 1.96, the overall accuracy of RA is better than RB(denoted by RA � RB). If jZRj < �1.96, the overall accuracy of RB is better than RA(denoted by RA � RB). If jZRj<1.96, there is no significant difference between Clas- sifier A and B and will be denoted by RA � RB. The way to use the ZP to evaluate two classifiers is similar. By using ZR and ZP, we can compare two classifiers in the aspect of overall accuracy as follows. Table 8 The comparison of overall accuracy in two real relational databases Database-id Classifier ZR 1 Mr.G vs. TILDE 2.348 Mr.G vs. CrossMine 2.159 2 Mr.G vs. TILDE 2.07 Mr.G vs. CrossMine 1.054 (a) If (RA � RB and PA � PB) or (RA � RB and PA � PB) or (RA � RB and PA � PB), then the over- all accuracy of Classifier A is better than the Classifier B (A > B). (b) The method of judgment for (B > A) is identical to the one mentioned above. (c) If (RA � RB and PA � PB), that will be A � B. (d) It may happen that one metric is significantly better but the other is significantly worse. If (RA � RB and PA � PB) or (RA � RB and PA � PB), F-mea- sure as shown in Eq. (7) would be utilized as the way for the accuracy judgment. However, since F-measure does not have any probabilistic interpre- tation, we cannot apply any significance test to its value. So we use a heuristic such that the improve- ment in F-measure by at least 1% is required to call a classifier is better. F -measure ¼ð2 � P A � RAÞ=ðP A � RAÞ ð7Þ 4.3. The comparison among TILDE, CrossMine and Mr.G-Tree on real multi-relational databases In this section, we apply two real multi-relational dat- abases to compare the accuracy of the positive and the overall accuracy among TILDE, CrossMine and Mr. G-Tree. Fig. 7 is the comparison of the accuracy of the positive in two real relational databases and Table 8 is the comparison of overall accuracy. In Table 8, the com- parison is marked in bold if the Condition d mentioned ZP F-measure Overall accuracy �0.722 40 vs. 31.69 Mr.G > TILDE �2.149 40 vs. 41.32 Mr.G � CrossMine �1.181 71.43 vs. 57.14 Mr.G > TILDE �0.731 71.43 vs. 69.72 Mr.G � CrossMine 0 10 20 30 40 50 60 70 80 90 100 10 20 50 100 Number of relations A cc ur ac y of t he p os it iv e (% ) TILDE CrossMine Mr.G-Tree Fig. 8. The comparison of the accuracy of the positive on different number of relations. 0 10 20 30 40 50 60 70 80 200 500 1000 2000 5000 Number of tuples A cc ur ac y of t he p os it iv e (% ) TILDE CrossMine Mr.G-Tree Fig. 9. The comparison of the accuracy of the positive on different number of tuples. 3030 C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 in Section 4.2 is met and consequently the F-measure is used to compare the classifiers. All experiments related to the overall accuracy in the following will be presented in the same way. From Fig. 7, it is clear that Mr.G-Tree per- forms much better than TILDE and CrossMine in predict- ing the positive. From Table 8, it can be noted that the overall accuracy of Mr.G-Tree is greater than TILDE and is comparable to CrossMine. 4.4. The comparison among TILDE, CrossMine and Mr.G-Tree on synthetic multi-relational database In Section 4.3, the number of relations in two real dat- abases is less than 10. However, in a real case, there might be a lot of relations in a multi-relational database. In order to evaluate if the accuracy of each classifier will be influ- enced when the number of relations is increased, we fix the variable y and z in Table 7, respectively as 500 and 10% to generate 4 synthetic multi-relational databases. Fig. 8 shows the comparison for the accuracy of the posi- tive; Table 9 is the comparison of overall accuracy. Note that, since TILDE combines all relations into a single one, when the number of relations is large, it will cause the shortage of memory space and then be infeasible. Such a condition is occurred in our experiments when the Table 9 The comparison of overall accuracy on different number of relations Number of relations Classifier ZR 10 Mr.G vs. TILDE 3.141 Mr.G vs. CrossMine 1.714 20 Mr.G vs. TILDE 3.262 Mr.G vs. CrossMine 2.824 50 Mr.G vs. TILDE 3.079 Mr.G vs. CrossMine 2.67 100 Mr.G vs. TILDE – Mr.G vs. CrossMine 3.231 number of relation is over 50 and we mark this condition by ‘‘–’’ in Table 9 since ZR and ZP are both incomputable. All experiments related to the comparison of overall accu- racy between Mr.G-Tree and TILDE in the following will be presented in the same way. From Fig. 8, it is found that the accuracy of the positive in Mr.G-Tree is always higher than TILDE and CrossMine. In Table 9, it is worth to note that as the number of relations is larger than 10, the overall accuracy of Mr.G-Tree is always better than TILDE and CrossMine. Next, we turn to fix the variable x and z, respectively as 20 and 10% to generate 5 databases for the analysis of the accuracy in each classifier when the tuples are getting increased. The results are illustrated in Fig. 9 and Table 10. As from Fig. 9, it is found that the accuracy of the posi- tive in Mr.G-Tree is always higher than TILDE and Cross- Mine no matter how many tuples there are. From Table 10, it is shown that among the overall accuracy of TILDE, CrossMine and Mr.G-Tree, none is always better than the other two. In other words, Mr.G-Tree displays a com- parable overall accuracy to TILDE, CrossMine. Similarly, when the tuples in every relation are over 1000, we are unable to show the accuracy of TILDE in Fig. 9 and Table 10. ZP F-measure Overall accuracy �7.487 23.48 vs. 25.34 Mr.G � TILDE �2.853 23.48 vs.26.11 Mr.G < CrossMine 0.328 24.13 vs. 13.47 Mr.G > TILDE 0.754 24.13 vs. 14.12 Mr.G > CrossMine �0.708 18.6 vs. 16.49 Mr.G > TILDE 1.285 18.6 vs. 8.43 Mr.G > CrossMine – 36.78 vs. – Mr.G > TILDE 1.22 36.78 vs. 20.74 Mr.G > CrossMine Table 10 The comparison of overall accuracy on different number of tuples Number of tuples Classifier ZR ZP F-measure Overall accuracy 200 Mr.G vs. TILDE 1.803 �0.663 20.3 vs. 14.99 Mr.G � TILDE Mr.G vs. CrossMine 1.035 �0.152 20.3 vs.17.74 Mr.G � CrossMine 500 Mr.G vs. TILDE 2.881 �0.167 28.78 vs. 16.82 Mr.G > TILDE Mr.G vs. CrossMine 2.582 0.091 28.78 vs. 21.23 Mr.G > CrossMine 1000 Mr.G vs. TILDE 6.548 �2.105 20.83 vs. 21.53 Mr.G � TILDE Mr.G vs. CrossMine 5.521 �2.861 20.83 vs. 22.66 Mr.G � CrossMine 2000 Mr.G vs. TILDE – – 24.52 vs. – Mr.G > TILDE Mr.G vs. CrossMine 3.523 �4.246 24.52 vs. 26.2 Mr.G � CrossMine 5000 Mr.G vs. TILDE – – 21.19 vs. – Mr.G > TILDE Mr.G vs. CrossMine 6.065 �11.77 21.49 vs. 32.75 Mr.G > CrossMine 0 20 40 60 80 100 1 10 20 40 Imbalance level (%) A cc ur ac y of t he p os it iv e (% ) TILDE CrossMine Mr.G-Tree 5 Fig. 10. The comparison of the accuracy of the positive on different imbalanced levels. C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 3031 Ultimately, for the evaluation of the imbalanced prob- lem, we fix variable x and y in Table 8, respectively as 20 and 500 to produce 5 databases. The corresponding imbal- anced level of 5 databases is 1%, 5%, 10%, 20% and 40%. As illustrated in Fig. 10, the accuracy of the positive in Mr.G-Tree is better than TILDE and CrossMine apart from 40% as its imbalanced level. This result is not sur- Table 11 The comparison of overall accuracy on different imbalanced levels Imbalance (%) Classifier ZR 1 Mr.G vs. TILDE 0.982 Mr.G vs. CrossMine 0.982 5 Mr.G vs. TILDE 2.45 Mr.G vs. CrossMine 2.139 10 Mr.G vs. TILDE 3.601 Mr.G vs. CrossMine 3.331 20 Mr.G vs. TILDE 2.9 Mr.G vs. CrossMine 2.362 40 Mr.G vs. TILDE �7.681 Mr.G vs. CrossMine �7.66 prised and demonstrates the ability of Mr.G-Tree to mine an imbalanced multi-relational database. Table 11 shows that among the overall accuracy of TILDE, Cross- Mine and Mr.G-Tree, none is always superior to the other two. 5. Conclusion and future research directions Currently, the structural data are all stored in the rela- tional database, so the techniques of traditional data min- ing are not workable any more. Although there have been a number of multi-relational data mining classifiers pro- posed, such as TILDE, FOIL, CrossMine and so on, those approaches are unable to well handle the imbalanced prob- lem. Hence, Mr.G-Tree algorithm is proposed in this paper as the solutions. In order to build a classifier which can handle the imbalanced problem more accurately, Mr. G-Tree hierarchically pick up the positive by using g-mean mentioned in Section 3.1. Then, Mr.G-Tree makes the min- ing in a multi-relational database feasible by using GTIP algorithm proposed in Section 3.2 to propagate necessary information to all non-target relations. More importantly, GTIP algorithm keeps the g-mean reasonable after the ZP F-measure Overall accuracy �0.843 10.44 vs. 8.69 Mr.G > TILDE �0.713 10.44 vs.8.13 Mr.G > CrossMine �4.358 23.01 vs. 27.14 Mr.G < TILDE �2.874 23.01 vs. 24.42 Mr.G � CrossMine �6.073 37.66 vs. 42.15 Mr.G < TILDE �4.135 37.66 vs. 39.97 Mr.G � CrossMine �6.288 46.15.vs. 53.35 Mr.G < TILDE �3.835 46.15 vs. 49.34 Mr.G � CrossMine 0.122 62.14 vs. 85.29 Mr.G < TILDE 1.636 62.14 vs. 81.68 Mr.G < CrossMine 3032 C.-I. Lee et al. / Expert Systems with Applications 34 (2008) 3021–3032 propagation. Finally, by taking the semantic link into account, Mr.G-Tree can accurately and efficiently mine the imbalanced multi-relational databases. In the aspect of experimental analysis, since the accuracy which was widely used to evaluate a classifier is not a proper metric for the imbalanced datasets; in this paper, not only the accuracy of the positive is applied to evaluate the predic- tion ability of a classifier on the minority, but also ZR and ZP and F-measure is utilized to evaluate the overall accuracy. The results shown in Section 4 demonstrate that Mr.G-Tree can reach better classification accuracy of the positive than TILDE and CrossMine. Also, it shows com- parable consequence on the overall accuracy. However, Mr.G-Tree currently is only suitable for the database containing two target classes. Although Mr.G-Tree can set the most minor target class as the posi- tive and the rest as the negative under the condition of mul- tiple target classes, this way will change the original data distribution and consequently influences the mining results. Therefore, we will expand Mr.G-Tree in future to make it can accurately manage the databases contain multiple tar- get classes. Another line of future work is incremental Mr.G-Tree that considers the additions, deletions, and updates of tuples. References Aha, D., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1), 37–66. Anto, S. N., Susumu, K., & Akira, I. (2000). Fog forecasting using self growing neural network CombNET-II – A solution for imbalanced training sets problem. In Proceedings of the international joint conference on neural networks (pp. 429–434). Italy. Barandela, R., Sanchez, J. S., Garcia, V., & Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 849–851. Blockeel, H., De Raedt, L., & Ramon, J. (1998). Top-down induction of logical decision trees. Artificial Intelligence, 101, 285–297. Boström, H. (1995). Covering vs. divide-and-conquer for top-down induction of logic programs. In Proceedings of the fourteenth interna- tional joint conference on artificial intelligence (pp. 1194–1200). San Mateo. Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–168. Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4), 261–283. Cohen, G., Hilario, M., Sax, H., Hugonnet, S., & Geissbühler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infec- tion. Artificial Intelligence in Medicine, 37(1), 7–18. Dehmeshki, J., Karakoy, M., & Casique, M. V. (2003). A rule-based scheme for filtering examples from majority class in an imbalanced training set. In Proceedings of the 13th international conference on machine learning and data mining in pattern recognition (pp. 215–223). Germany. Derouin, E., Brown, J., Beck, H., Fausett, L., & Schneider, M. (1991). Neural network training on unequally represented classes. Intelligent Engineering Systems Through Artificial Neural Networks, 135–145. Ezawa, K. J., Singh, M., & Norton, S. W. (1996). Learning goal oriented bayesian networks for telecommunications management. In Proceed- ings of the 13th international conference on machine learning (pp. 139– 147). San Francisco. Fawcett, T., & Provost, F. (1996). Combining data mining and machine learning for effective user profiling. In Proceedings of 12th international conference on knowledge discovery and data mining (pp. 8–13). Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufman. Han, H., Wang, W., & Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of international conference on intelligent computing (pp. 878–887). Hefei, China. Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis Journal, 6(5), 429–450. Joshi, M. V. (2002). On evaluating performance of classifiers for rare classes. In Proceedings of 12th IEEE international conference on data mining. Japan. Kubat, M., Holte, R., & Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2/ 3), 195–215. Lavrac, N., & Dzeroski, S. (1994). Inductive logic programming: techniques and applications. New York: Ellis Horwood. Lawrence, S., Burns, I., Back, A. D, Tsoi, A. C., & Giles, C. L. (1998). Neural network classification and prior class probabilities. Neural Networks: Tricks of the Trade 1998, Lecture Notes in Computer Science (pp. 299–314). Lewis, D. D., & Catlett, J. (1997). Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the 11th international conference on machine learning (pp. 148–156). San Francisco. Liu, H., Yin, X., & Han, J. (2005). An efficient multi-relational naı̈ve bayesian classifier based on semantic relationship graphs. In Proceed- ings of ACM-SIGKDD workshop on multi-relational data mining. Chicago. Michalski, R. S. (1969). On the quasi-minimal solution of the general covering problem. In Proceedings of the 5th international symposium on information processing (pp. 125–128). Yugoslavia. Mitchell, T. M. (1997). Machine learning. McGraw Hill. Muggleton, S., & Feng, C. (1990). Efficient induction of logic programs. In Proceedings of the 1st conference on algorithmic learning theory (pp. 368–381). Tokyo. Muggleton, S. (1995). Inverse entailment and progol. In New Generation Computing. Special issue on Inductive Logic Programming, 13, 245–286. Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., & Brunk, C. (1994). Reducing misclassification costs. Proceedings of the 11th international conference on machine learning (pp. 217–225). Quinlan, J. R., & Cameron-Jones, R. M. (1993). FOIL: A midterm report. In Proceedings of the European conference on machine learning (pp. 3– 20). Vienna, Austria. Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgann Kaufmann. Rastogi, R., & Shim, K. (2000). PUBLIC: A decision tree classifier that integrates building and pruning. Data Mining and Knowledge Discov- ery, 4(4), 315–344. Riddle, P., Segal, R., & Etzioni, O. (1994). Representation design and brute force induction in a boeing manufacturing domain. Applied Artificial Intelligence, 8, 125–147. Tsai, C. J., Lee, C. I., Chen, C. T., & Yang, W. P. (2007). A multivariate decision tree algorithm to mine imbalanced data. WSEAS Transactions on Information Science and applications, 4(1), 50–58. Yin, X., Han, J., Yang, J., & Yu, P. S. (2004). Crossmine: Efficient classification across multiple database relations. In Proceedings of the 20th international conference on data engineering (pp. 399–411). Boston, MA. Zadrozny, B., & Elkan, C. (2001). Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the 7th international conference on knowledge discovery and data mining (pp. 204–213). San Francisco. Zhou, Z. H., & Liu, X. Y. (2006). Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans- action on Knowledge and Data Engineering, 18(1), 63–77. An approach to mining the multi-relational imbalanced database Introduction Related work Decision tree Multi-relational classifier Class imbalanced problem Sampling-based Cost-based Imbalance-insensitive Multi-relational g-mean decision tree algorithm G-mean based classification model Handling an imbalanced multi-relational database The detailed step of Mr.G-Tree An example of Mr.G-Tree Experimental analyses The experimental environment and databases Real multi-relational databases Synthetic multi-relational databases The measurement The comparison among TILDE, CrossMine and Mr.G-Tree on real multi-relational databases The comparison among TILDE, CrossMine and Mr.G-Tree on synthetic multi-relational database Conclusion and future research directions References