Extended decision template presentation for combining classifiers

Mehdi Salkhordeh Haghighi, Abedin Vahedian, Hadi Sadoghi Yazdi
Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran

Expert Systems with Applications 38 (2011) 8414–8418

Keywords: Decision template; Classifiers fusion; Classifiers combiner

Abstract

In this paper, a new method in classifier fusion is introduced for decision making based on the internal structure of the base classifiers. Among the methods used in combining classifiers, some work on the decision template as a tool for modeling the behavior of the base classifiers in order to label data. This tool models their behavior only on the basis of their final outputs. Our new method introduces a special structure for the decision template such that the internal behavior of a neural network base classifier can be modeled in a manner suitable for classifier fusion. The new method builds a decision template for each layer of the neural network, including all hidden layers. Therefore, the process by which each base classifier reaches its decision is also available to the fusion stage. Efficiency of the new method is evaluated on some known benchmark datasets to show how it can improve the efficiency of classifier fusion.

1. Introduction

Classification is one of the most frequently encountered decision making processes in a wide range of applications. A classification problem occurs when an object needs to be assigned to a predefined group or class based on various properties or features. Many problems in business, science, industry, military, security and medicine can be treated as classification problems.

Many classification techniques have been used in various applications over the last decade. Moreover, many optimization techniques have also been introduced to enhance the efficiency and accuracy of classification. Among these, ensemble techniques, or multiple classifier systems (MCSs), also named classifier fusion or combining classifiers, are of special interest. As a general rule, combinational methods offer more power, robustness, resistance, accuracy and generality than a single classifier (Analoui, 2008). The motivation for this procedure is the intuitive idea that by combining the outputs of several individual predictors one might improve on the performance of a single generic one (Krogh & Vedelsdy, 1995).

However, the idea of performance improvement in an MCS has been proved to hold only when the combined base classifiers are accurate and diverse enough, which requires an adequate tradeoff between these two conflicting conditions (Navone, Granitto, Verdes, & Ceccatto, 2001). Also, Kuncheva (2005), using the Condorcet Jury theorem (Shapley & Grofman, 1984), has shown that a combination of classifiers can usually operate better than a single classifier.
This means that if classifiers with greater diversity are used in the ensemble, the total error can be reduced considerably. Drucker, Schapire, and Simard (1993) attempted to achieve a good compromise between these properties, including elaborations of the bagging (Breiman, 1996), boosting (Freund & Schapire, 1995) and stacking (Wolpert, 1992) techniques.

In general, three types of strategies for combining classifiers can be identified (Xu, Krzyzak, & Suen, 1992). In the first type, each classifier produces a single class label as output, and these labels have to be combined to make the final decision (Battiti & Colla, 1994). In the second type, the outputs of the classifiers are sets of class labels ranked in order of likelihood (Tumer & Ghosh, 1995). The third type involves combining real valued outputs produced for each class by the respective classifiers, most often posterior probabilities (Jacobs, 1995) and sometimes evidences (Rogova, 1994).

Nevertheless, there are generally two types of combination, named classifier selection and classifier fusion (Woods, Kegelmeyer, & Bowyer, 1997). The presumption in classifier selection is that each classifier is an expert in some local area of the feature space. When a feature vector x is submitted, the classifier designated for the vicinity of x is given the highest credit in assigning the class label to x. Therefore, either exactly one classifier is responsible for the final decision, as in Ng and Abramson (1992), or more than one base classifier is, as in Jacobs, Jordan, Nowlan, and Hinton (1991) and Alpaydin and Jordan (1996). Classifier fusion assumes that all classifiers are trained over the whole input feature space and are thereby considered competitive rather than complementary (Rastrigin & Erenstein, 1982; Xu et al., 1992).

Among all the methods introduced for combining classifiers, DT based methods have special features suitable for a wide range of applications. The DT method models the behavior of the base classifiers in making decisions on input data. This modeling scheme is constructed from the final outputs of the base classifiers, and the operation of the method is based on a set of DT matrices. DTs make a robust classifier fusion scheme that combines classifier outputs by comparing them with a characteristic template for each class. DT based fusion uses all classifier outputs to calculate the final support for each class, which is in sharp contrast to most other fusion methods, which use only the support for that particular class to make their decision.

DT based methods have been used in a wide range of applications for many years. In our research, the DT structure is used for modeling the behavior of the base classifiers in an MCS not only from their output results but also from their internal processing and the partial paths taken toward the final decision. Using this method to describe in depth the partial decisions made inside the structure of the base classifiers can make the classifier combiner more robust and efficient.
The motivation for these properties lies in the fact that the classifier combiner then has more partially processed data at its disposal and more options when making decisions.

2. Problem definition

As mentioned, the primary goal of an MCS is to improve the overall efficiency of the base classifiers in making decisions over input data. One of the methods widely used in MCSs is the DT method. The basic DT method constructs a decision template array for each class of training data based only on the final outputs of the base classifiers. The combiner then uses these arrays to make the final decisions for test data. To give the combiner more primary tools for making decisions, a new structure for constructing the DT is proposed here. The DT is constructed not only from the final outputs of the base classifiers, but also from the internal decision path each base classifier follows to reach its final decision. In the next section, more details of the new DT are discussed.

2.1. Preliminaries

One of the aspects that deeply affects the performance of an MCS is whether the behavior of the base classifiers is modeled well enough that it is known over almost all of the input space. However, since such an exact model is hard to find, the base classifiers should be trained such that each of them models part of the input space more efficiently. Moreover, this type of modeling is more effective when the internal path of decision making in each base classifier is also modeled.

Therefore, the primary goal of this research is to focus on the internal behavior of the base classifiers over training and test data such that the combiner part of the MCS possesses more effective tools for labeling input data. This information is used to improve the overall generalization capability of the system. To this end, neural network based classifiers are used as base classifiers. In addition, different strategies for the final combiner are tested to find a suitable test bed for comparison.

2.1.1. Standard DT method

Generally, in the standard DT method, a DT is constructed for each class of training data. Fig. 1 shows the structure and process of building a standard DT. To build a DT for class I training data, a DP is constructed for each data item in this class, according to Fig. 2. The average of these DPs is then called the DT of class I. The same procedure is carried out for all other classes of training data. The idea behind this DT is to remember the most typical DP for each class. In the test phase, a DP is constructed for the test data and then compared with each of the DTs using some similarity measure. The final decision, i.e. the label of the test data, is the closest match.

Fig. 1. Construction of the DT array for class I data.
Fig. 2. Basic DP structure.

The process exhibits some weaknesses. For instance, the final decision is made based on some distance measure between the DP of the input test data and each one of the DTs. These distance measures compare the DP and each DT as two representative points. Using a single point as the representative of a group of data neither yields the required accuracy and robustness nor describes the properties of the data properly.
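To make the standard DT procedure concrete, the following minimal sketch (in Python) builds the class templates by averaging decision profiles and labels a test sample by its nearest template under a squared Euclidean measure. The predict_proba interface and all function names are illustrative assumptions of this sketch, not code or an API from the paper.

```python
import numpy as np

def decision_profile(classifiers, x):
    """Stack the soft outputs of all B base classifiers into a B x C matrix (the DP).

    Each classifier is assumed to expose a predict_proba(x) method returning a
    length-C support vector; this interface is an assumption of the sketch.
    """
    return np.vstack([clf.predict_proba(x) for clf in classifiers])

def build_decision_templates(classifiers, X_train, y_train, n_classes):
    """DT_m = mean decision profile over the training samples of class m."""
    templates = []
    for m in range(n_classes):
        dps = [decision_profile(classifiers, x) for x in X_train[y_train == m]]
        templates.append(np.mean(dps, axis=0))   # one B x C template per class
    return templates

def classify(classifiers, templates, x):
    """Label x by the class whose template is closest to DP(x) (squared Euclidean)."""
    dp = decision_profile(classifiers, x)
    dos = [1.0 - np.mean((dp - dt) ** 2) for dt in templates]
    return int(np.argmax(dos))
```

The single mean template per class is exactly the "representative point" criticized above: all information about how the profiles of a class are spread is lost once they are averaged.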
2.1.2. Fusion system structure

Fig. 3 shows the structure of our classifier fusion system from a general operational view. In this figure, X is a K featured input data vector, B is the number of base classifiers, and C is the number of classes. The base classifiers used in the system are neural networks, all with the same internal structure, as shown in Fig. 4. This means that the number of hidden layers and the number of neurons in each layer are the same for all the base classifiers. To preserve diversity, the base classifiers are made to behave differently over the input space by using different training data.

Fig. 3. Structure of the new system.
Fig. 4. Structure of a NN based classifier.

During the test step, the input vector X is fed to the base classifiers and DPs are formed for each layer. Since the number of hidden layers and the number of neurons in each layer are the same for all base classifiers, the DP of a hidden layer with N neurons is a B × N matrix. However, the special structure designed for the DP and DT would also make it possible to use base classifiers with different numbers of neurons in their corresponding hidden layers. The DTs themselves are formed during base classifier training. At the end, the combiner makes the final decision from the DTs and the DPs formed; the output of the combiner is a label for the input data. Different strategies for combining these DTs and DPs can be used. In the standard DT method, some distance measure is used to determine the similarity between each DT and DP.

The combining mechanism used in the MCS is based on the Euclidean distance measure. In the test step, after forming the DPs of all layers from the input data X, the degree of support (DOS) for class m in layer k is calculated by Eq. (1). In the equation, DT^{k,m} is the decision template for class m in layer k, N_k is the number of neurons in layer k, DP^k is the decision profile of layer k, and B is the total number of base classifiers. After computing the DOS for each class in each layer, the maximum DOS per layer is determined. Finally, voting among the maximum values of all layers determines the final decision.

DOS_{m,k}(X) = 1 - \frac{\sum_{i=1}^{B}\sum_{j=1}^{N_k}\left(DT^{k,m}_{i,j} - DP^{k}_{i,j}\right)^{2}}{B \times N_k}   (1)

As a general rule, diversity is a key element in the efficiency of an MCS. By using a number of different classifiers in an MCS, an increase in the accuracy and efficiency of the overall system is expected. It is intuitively accepted that the classifiers to be combined should be diverse: if they were identical, no improvement would result from combining them. Therefore, diversity among the team has been recognized as a key point.
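As a rough illustration of the layer-wise profiles and of Eq. (1), the sketch below assumes that every base network exposes a hypothetical layer_activations(x) accessor returning one activation vector per layer (hidden layers plus the output layer) and that all networks share the same layer sizes, so each layer-k profile is a B × N_k matrix. None of these names come from the paper; they are assumptions of the sketch.

```python
import numpy as np

def layer_decision_profiles(networks, x):
    """Collect one B x N_k activation matrix per layer k for input x.

    net.layer_activations(x) is a hypothetical accessor assumed to return a list
    of per-layer activation vectors (hidden layers plus the output layer).
    """
    per_net = [net.layer_activations(x) for net in networks]   # B lists of L vectors
    n_layers = len(per_net[0])
    return [np.vstack([acts[k] for acts in per_net]) for k in range(n_layers)]

def dos_per_layer(class_layer_dts, layer_dps):
    """Eq. (1): DOS_{m,k} = 1 - mean squared difference between DT^{k,m} and DP^k."""
    return [1.0 - np.mean((dt_k - dp_k) ** 2)
            for dt_k, dp_k in zip(class_layer_dts, layer_dps)]
```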
Since the main reason for combining classifiers is to improve their performance, there is clearly no advantage to be gained from an ensemble composed of identical classifiers, or of classifiers that show the same patterns of generalization. Therefore, different initial values and different training sets with different training parameters are used for the base classifiers. As a result, different behavior is expected from these classifiers.

In the new method introduced here, not only is a new structure for the DT proposed, but a new structure for the DP is also used so that it matches the new DT structure in the combiner. The combining criterion is given in Eq. (2), in which L is the number of layers.

DOS_{m}(X) = \max_{k=1,\dots,L} DOS_{m,k}(X), \qquad Label(X) = \arg\max_{i=1,\dots,C} DOS_{i}(X)   (2)

In our new method, DPs are constructed for each of the hidden and output layers of the base classifiers. Therefore, during training, a special structure is needed for the DPs and DTs such that in the test step the output combiner can easily use them for making decisions. The special DP structure used is shown in Fig. 5. In this figure, for each classifier, the outputs of the neurons of each layer are saved as a vector in the corresponding cell of the DP array. Since each layer may have a different number of neurons, each cell of the DP array is designed so that vectors of different lengths can be stored at the same time. This structure gives the DP more flexibility for use in an MCS whose base classifiers have different structures. Fig. 6 shows the details of the structure of the new DT. In this figure, each cell of the DT array stores a DP like structure which is constructed from the DPs in the training step.

Fig. 5. DP structure used in the new system.
Fig. 6. Structure of the new DT.

3. Implementation and results analysis

Performance of our new system is compared with other methods on the known benchmark datasets listed in Table 1.

Table 1
Properties of some standard datasets.

Dataset   # of samples   # of features   # of classes   Address
OCR       1435           64              10             Haghighi@ieee.org
Breast    699            9               2              http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
Iris      150            4               3              http://archive.ics.uci.edu/ml/machine-learning-databases/iris/
Glass     214            9               6              http://archive.ics.uci.edu/ml/machine-learning-databases/glass/
Wine      178            13              3              http://archive.ics.uci.edu/ml/machine-learning-databases/wine/

It should be noted that in this system the structure of all the base classifiers should be the same. Therefore, the number of hidden layers and the number of neurons in each hidden layer are identical, and as a result the DTs of corresponding layers can be compared directly. Eq. (2) describes the labeling process using the DOS obtained from Eq. (1).

For each of the datasets listed in Table 1, different groups of data are selected randomly for training and testing. In the first step, the base classifiers are trained so that each produces its minimum error. Next, using the training data, DTs are built for each layer of the base classifiers and for each class of training data. These DTs are stored in the data structure shown in Fig. 6. After the training step, the performance of the MCS is tested against other fusion methods. Performance is measured by the error rate obtained on the test data. As seen in Table 2, the comparison covers the following fusion methods: voting (VT), Dempster–Shafer (DS), decision template with Euclidean distance (DTED), decision template with symmetric difference (DTSD), simple mean (SM), maximum (MAX), product (PT), and minimum (MIN).

Table 2
Error rate (%) obtained on test data.

Dataset   VT     DS     DTED   DTSD   SM     MAX    PT     MIN    New method
OCR       3.5    3.75   3.75   3.5    3.5    4.75   4.0    5.75   3.15
Breast    3.33   3.33   3.33   3.33   3.33   3.83   3.33   3.83   3.17
Iris      3.57   3.57   2.86   2.86   3.57   3.57   3.57   3.57   2.86
Glass     25     20.67  20.67  26     25     26     35     21.5   19.33
Wine      6.87   6.87   7.5    6.87   6.87   6.87   6.87   6.87   6.87
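Tying the earlier sketches together, the test-time labeling of Eqs. (1) and (2) can be read as a two-stage reduction: per class, take the best layer-wise DOS; then take the class with the largest DOS. The function below reuses the hypothetical helpers sketched in Section 2 and is illustrative only; it follows the argmax form of Eq. (2) rather than a separate voting stage.

```python
import numpy as np

def combine_and_label(networks, layer_dts, x, n_classes):
    """Eqs. (1) and (2): best layer-wise DOS per class, then best class overall.

    layer_dts[m] holds the per-layer templates DT^{k,m} for class m built during
    training; layer_decision_profiles and dos_per_layer are the sketches above.
    """
    layer_dps = layer_decision_profiles(networks, x)            # one DP per layer
    dos = [max(dos_per_layer(layer_dts[m], layer_dps)) for m in range(n_classes)]
    return int(np.argmax(dos))
```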
4. Conclusion

In the new method presented, the fusion of the decisions made by the base classifiers is affected by the internal path each base classifier follows to produce its final decision. Where the base classifiers are neural networks, the intermediate steps each base classifier follows are represented by the outputs of its hidden layers. In this work, the outputs of all hidden layers were preserved in a new DT structure specially designed for this purpose. During the test step, these preserved data were used in the fusion step so that the combiner could better assign a label to the input data. It is expected that the more information is available for fusion, the more robust the decisions are. In general, the efficiency of the combiner is expected to increase compared with methods that use only the outputs of the base classifiers for fusion.

References

Alpaydin, E., & Jordan, M. I. (1996). Local linear perceptrons for classification. IEEE Transactions on Neural Networks, 7(3), 788–792.
Analoui (2008). CCHR: Combination of classifiers using heuristic retraining. In International conference on networked computing and advanced information management (NCM 2008), Korea, September.
Battiti, R., & Colla, A. M. (1994). Democracy in neural nets: Voting schemes for classification. Neural Networks, 7(4), 691–707.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Drucker, H., Schapire, R., & Simard, P. (1993). Improving performance in neural networks using a boosting algorithm. In S. J. Hanson, J. D. Cowen, & C. L. Giles (Eds.), Advances in neural information processing systems (Vol. 5, pp. 42–49).
Freund, Y., & Schapire, R. (1995). A decision theoretic generalization of on-line learning and an application to boosting. In Proceedings of the second European conference on computational learning theory (pp. 23–37). Springer Verlag.
Jacobs, R. (1995). Method for combining experts' probability assessments. Neural Computation, 7(5), 867–888.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Krogh, A., & Vedelsdy, J. (1995). Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems (Vol. 7, pp. 231–238). Cambridge, MA: MIT Press.
Kuncheva, L. I. (2005). Combining pattern classifiers: Methods and algorithms. New York: Wiley.
Navone, H. D., Granitto, P. M., Verdes, P. F., & Ceccatto, H. A. (2001). A learning algorithm for neural network ensembles. Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial, (12), 70–74.
Ng, K.-C., & Abramson, B. (1992). Consensus diagnosis: A simulation study. IEEE Transactions on Systems, Man and Cybernetics, 22, 916–928.
Rastrigin, L. A., & Erenstein, R. H. (1982). Method of collective recognition. Moscow: Energoizdat (in Russian).
Rogova, G. (1994). Combining the results of several neural network classifiers. Neural Networks, 7(5), 777–781.
Shapley, L., & Grofman, B. (1984). Optimizing group judgemental accuracy in the presence of interdependencies. Public Choice, 43, 329–343.
Tumer, K., & Ghosh, J. (1995). Order statistics combiners for neural classifiers. In Proceedings of the world congress on neural networks (pp. I:31–34). Washington, DC: INNS Press.
Wolpert, D. (1992). Stacked generalization. Neural Networks, 5, 241–259.
Woods, K., Kegelmeyer, W. P., & Bowyer, K. (1997). Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 405–410.
Xu, L., Krzyzak, A., & Suen, C. Y. (1992). Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22, 418–435.