Cluster center initialization algorithm for K-modes clustering

Shehroz S. Khan (David R. Cheriton School of Computer Science, University of Waterloo, Canada)
Amir Ahmad (Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh, Saudi Arabia)

Keywords: K-modes clustering, Cluster center initialization, Prominent attributes, Significant attributes

Abstract

Partitional clustering of categorical data is normally performed by using the K-modes clustering algorithm, which works well for large datasets. Even though the design and implementation of the K-modes algorithm are simple and efficient, it has the pitfall of randomly choosing the initial cluster centers on every new execution, which may lead to non-repeatable clustering results. This paper addresses the randomized center initialization problem of the K-modes algorithm by proposing a cluster center initialization algorithm. The proposed algorithm performs multiple clusterings of the data based on the attribute values in different attributes and yields deterministic modes that are to be used as initial cluster centers. In the paper, we propose a new method for selecting the most relevant attributes, namely Prominent attributes, compare it with an existing method that finds Significant attributes for unsupervised learning, and perform multiple clusterings of the data to find initial cluster centers. The proposed algorithm ensures fixed initial cluster centers and thus repeatable clustering results. The worst-case time complexity of the proposed algorithm is log-linear in the number of data objects. We evaluate the proposed algorithm on several categorical datasets, compare it against random initialization and two other initialization methods, and show that the proposed method performs better in terms of accuracy and time complexity. The initial cluster centers computed by the proposed approach are close to the actual cluster centers of the different data we tested, which leads to faster convergence of the K-modes clustering algorithm in conjunction with better clustering results.

1. Introduction

Cluster analysis is a form of unsupervised learning that aims at finding underlying structures in unlabeled data. The objective of a clustering algorithm is to partition a multi-attribute dataset into homogeneous groups (or clusters) such that the data objects in one cluster are more similar to each other (based on some similarity measure) than to those in other clusters.
Clustering is an active research topic in pattern recognition, data mining, statistics and machine learning, with diverse applications such as image analysis (Matas & Kittler, 1995), medical applications (Petrakis & Faloutsos, 1997) and web documentation (Boley et al., 1999). Partitional clustering algorithms such as K-means (Anderberg, 1973) are very efficient for processing large numeric datasets. Data mining applications require handling and exploration of data that contains numeric attributes, categorical attributes, or both. The K-means clustering algorithm fails to handle datasets with categorical attributes because it minimizes the cost function by calculating means and distances. The traditional way of treating categorical attributes as numeric does not always produce meaningful results because categorical domains are generally not ordered.

Several approaches based on the K-means paradigm have been reported for clustering categorical datasets. Ralambondrainy (1995) presents an approach that uses the K-means algorithm to cluster categorical data by converting multiple-category attributes into binary attributes (using 0 and 1 to represent a category being absent or present) and treating the binary attributes as numeric in the K-means algorithm. Gowda and Diday (1991) use a similarity coefficient and other dissimilarity measures to process data with categorical attributes. CLARA (Clustering LARge Applications) (Kaufman & Rousseeuw, 1990) is a combination of a sampling procedure and the clustering program Partitioning Around Medoids (PAM). Guha, Rastogi, and Shim (1999) present a robust hierarchical clustering algorithm, ROCK, that uses links to measure the similarity/proximity between pairs of data objects with categorical attributes, which are then used to merge clusters. However, this algorithm has worst-case quadratic time complexity.

Huang (1997) presents the K-modes clustering algorithm, introducing a new dissimilarity measure to cluster categorical data. The algorithm replaces the means of clusters with modes (the most frequent attribute value in an attribute) and uses a frequency-based method to update the modes in the clustering process to minimize the cost function. The algorithm is shown to achieve convergence with linear time complexity with respect to the number of data objects. Huang (1998) also points out that, in general, the K-modes algorithm is faster than the K-means algorithm because it needs fewer iterations to converge. In principle, the K-modes clustering algorithm functions similarly to the K-means clustering algorithm except for the cost function it minimizes, and hence it suffers from the same drawbacks. Like the K-means clustering algorithm, the K-modes clustering algorithm assumes that the number of clusters, K, is known in advance. A fixed number of K clusters can make it difficult to predict the actual number of clusters in the data, which may mislead the interpretation of the results.
The K-means/K-modes clustering algorithm runs into problems when clusters are of differing sizes, densities and non-globular shapes. The K-means clustering algorithm does not guarantee a unique clustering due to the random choice of initial cluster centers, which may yield different groupings for different runs (Jain & Dubes, 1988). Similarly, the K-modes algorithm is also very sensitive to the choice of initial cluster centers, and an improper choice may result in highly undesirable cluster structures. Random initialization is widely used as a seed for the K-modes algorithm because of its simplicity; however, this may lead to non-repeatable clustering results. Machine learning practitioners find it difficult to rely on the results thus obtained, and several re-runs of the K-modes algorithm may be required to arrive at a meaningful conclusion.

In this paper, we extend the work of Khan and Ahmad (2012) and present a multiple clustering approach that infers cluster structure information from multiple attributes by using the attribute values present in the data for computing initial cluster centers. This approach focuses only on Prominent attributes (discussed in Section 4.2) that are important for finding cluster structures. We also use another unsupervised learning method to find Significant attributes (Ahmad & Dey, 2007a, 2007b) and compare it with the proposed approach. The proposed algorithm performs multiple clusterings based on the distinct attribute values present in different attributes to generate multiple clustering views of the data, which are utilized to obtain fixed initial cluster centers (modes) for the K-modes clustering algorithm. The proposed algorithm has worst-case log-linear time complexity with respect to the number of data objects. The present paper extends the previous work in terms of:

• Using an unsupervised method to compute significant attributes and comparing their clustering performance and the quality of the initial cluster centers against the centers computed by prominent attributes.
• Comparing the quality of initial cluster centers obtained by using all attributes and by using prominent attributes.
• Analyzing the closeness of the initial cluster centers to the actual centers by using prominent, significant and all attributes.
• Performing comprehensive experiments, presentation of results, time scalability analysis, inclusion of more datasets, and extended discussions on the multiple clustering challenges from the perspective of the proposed approach.

The rest of the paper is organized as follows. In Section 2, we present a short survey of the research work on cluster center initialization for the K-modes algorithm. Section 3 briefly discusses the K-modes clustering algorithm. In Section 4, we present the proposed multiple attribute clustering approach to compute initial cluster centers, along with three different approaches to choose different numbers of attributes to generate multiple clustering views. Section 5 presents the detailed experimental analysis of the proposed method on various categorical datasets and compares it with other cluster center initialization methods. Section 6 concludes the paper with pointers to future work.

2. Related work

The K-modes algorithm (Huang, 1997) extends the K-means paradigm to cluster categorical data and requires random selection of initial cluster centers or modes. As discussed earlier, a random choice of initial cluster centers leads to non-repeatable clustering results that may be difficult to comprehend.
The random initialization of cluster centers may only work well when one or more randomly selected initial cluster centers are similar to the actual cluster centers present in the data. In the most trivial case, the K-modes algorithm keeps no control over the choice of initial cluster centers, and therefore repeatability of the clustering results is difficult to achieve. Moreover, an inappropriate choice of initial cluster centers can lead to undesirable clustering results. The results of partitional clustering algorithms are better when the initial partitions are close to the final solution (Jain & Dubes, 1988). Hence, it is important to invoke K-modes clustering with fixed initial cluster centers that are similar to the true representative centers of the actual clusters in order to get better results. Several research papers have been reported for computing initial cluster centers for the K-modes algorithm; however, most of these methods suffer from one of the following two drawbacks:

(a) The initial cluster center computation methods are quadratic in time complexity with respect to the number of data objects – these types of methods offset the advantage of the linear time complexity of the K-modes algorithm and are not scalable for large datasets.
(b) The initial cluster centers are not fixed and have randomness in the computation steps – these types of methods fare no better than random initialization methods.

We present a short review of the research work done to compute initial cluster centers for the K-modes clustering algorithm and discuss the associated problems.

Khan and Ahmad (2003) use a density-based multiscale data condensation approach (Mitra, Murthy, & Pal, 2002) with the Hamming distance to extract K initial cluster centers from the datasets; however, their method has quadratic complexity with respect to the number of data objects. Huang (1998) proposes two approaches for initializing the cluster centers for the K-modes algorithm. In the first method, the first K distinct data objects are chosen as the initial K modes, whereas the second method calculates the frequencies of all categories for all attributes and assigns the most frequent categories equally to the initial K modes. The first method may only work if the top K data objects come from K disjoint clusters. The second method is aimed at choosing diverse cluster centers that may improve clustering results; however, a uniform criterion for selecting the K initial cluster centers is not provided. Sun, Zhu, and Chen (2002) present an experimental study on applying Bradley and Fayyad's iterative initial-point refinement algorithm (Bradley & Fayyad, 1998) to K-modes clustering to improve the accuracy and repetitiveness of the clustering results. Their experiments show that the K-modes clustering algorithm using refined initial cluster centers leads to higher-precision results that are more reliable than those of the random selection method without refinement. This method is dependent on the number of cases with refinements, and the accuracy value varies. Khan and Kant (2007) propose a method based on the idea of evidence accumulation for combining the results of multiple clusterings (Fred & Jain, 2002); it focuses only on those data objects that are less vulnerable to the random selection of modes and chooses the most diverse set of modes among them.
Their experiments suggest that the computed initial cluster centers outperform the random choice; however, the method does not guarantee a fixed choice of initial cluster centers. He (2006) presents two farthest-point heuristics for computing initial cluster centers for the K-modes algorithm. The first heuristic is equivalent to random selection of initial cluster centers, and the second uses a deterministic method based on a scoring function that sums the frequency counts of the attribute values of all data objects. This heuristic does not explain how to choose a point when several data objects have the same score, and if it randomly breaks ties, then fixed centers cannot be guaranteed. The method only considers the distance between the data points, due to which outliers can be selected as cluster centers. Wu, Jiang, and Huang (2007) develop a density-based method to compute the K initial cluster centers, which has quadratic complexity. To reduce the worst-case complexity to O(n^1.5), they randomly select the square root of the total number of points as a sub-sample of the data; however, this step introduces randomness in the final results, and repeatability of the clustering results may not be achieved. Cao, Liang, and Bai (2009) present an initialization method that considers the distance between objects and the density of the objects. Their method selects the object with the maximum average density as the first initial cluster center. For computing the other cluster centers, the distance between the object and the already known clusters, and the average density of the object, are considered. A shortcoming of this method is that a boundary point may be selected as the first center, which can affect the quality of the selection of subsequent initial cluster centers. Bai, Liang, Dang, and Cao (2012) propose a method to compute initial cluster centers on the basis of a density function (defined by using the average distance of all the other points from a point) and a distance function. The first cluster center is decided by the density function. The remaining cluster centers are computed by using the density function and the distance between the already calculated cluster centers and the probable new cluster center. In order to calculate the density of a point, they use a summary of all the other points. Hence, there is information loss that may lead to improper density calculation, which can affect the results. A major problem with this research paper lies in the evaluation of the results. For at least two datasets, the accuracy, precision and recall values are computed incorrectly. From the confusion matrices presented in the paper, the accuracy, precision and recall values for:

• the Dermatology data should be 0.6584, 0.6969 and 0.6841, and not 0.7760, 0.8527 and 0.7482;
• the Zoo data should be 0.7425, 0.7703 and 0.8654, and not 0.9208, 0.8985 and 0.8143.

The confusion matrix mis-classifies almost half of the data objects of the first cluster, and thus the accuracy cannot reach the value indicated in the paper.

In comparison to the above-stated research works, the proposed algorithm (see Section 4 for details) for finding initial cluster centers for categorical datasets circumvents both drawbacks discussed earlier, i.e., its worst-case time complexity is log-linear in the number of data objects and it provides deterministic (fixed) initial cluster centers.
3. K-modes algorithm for clustering categorical data

Due to the limitation of the dissimilarity measure used by the traditional K-means algorithm, it cannot be used to cluster categorical datasets. The K-modes clustering algorithm is based on the K-means paradigm but removes the numeric data limitation whilst preserving its efficiency. The K-modes algorithm (Huang, 1998) extends the K-means paradigm to cluster categorical data by removing the barrier imposed by K-means through the following modifications:

1. Using a simple matching dissimilarity measure, or the Hamming distance, for categorical data objects.
2. Replacing the means of clusters by their modes (cluster centers).

The simple matching dissimilarity measure (Hamming distance) can be defined as follows. Let X and Y be two categorical data objects described by m categorical attributes. The dissimilarity measure d(X, Y) between X and Y is defined by the total number of mismatches of the corresponding attribute categories of the two objects. The smaller the number of mismatches, the more similar the two objects are. Mathematically,

  d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j)    (1)

where \delta(x_j, y_j) = 0 if x_j = y_j and \delta(x_j, y_j) = 1 if x_j \neq y_j, so that d(X, Y) gives equal importance to each category of an attribute.

Let N be a set of n categorical data objects described by m categorical attributes M1, M2, ..., Mm. When the distance function defined in Eq. (1) is used as the dissimilarity measure for categorical data objects, the cost function becomes

  C(Q) = \sum_{i=1}^{n} d(N_i, Q_i)    (2)

where N_i is the ith element and Q_i is the nearest cluster center of N_i. The K-modes algorithm minimizes the cost function defined in Eq. (2). The K-modes algorithm assumes that the knowledge of the number of natural groupings of the data (i.e., K) is available and consists of the following steps (taken from Huang (1997)):

1. Select K initial cluster centers, one for each cluster.
2. Allocate data objects to the cluster whose cluster center is nearest to them according to Eq. (2). Update the K clusters based on the allocation of data objects and compute K new modes of all clusters.
3. Retest the dissimilarity of objects against the current modes. If an object is found such that its nearest mode belongs to another cluster rather than its current one, reallocate the object to that cluster and update the modes of both clusters.
4. Repeat step 3 until no data object has changed cluster membership.
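The following is a minimal Python sketch of the operations above: the simple matching dissimilarity of Eq. (1), the column-wise mode computation, and a basic K-modes loop. It is an illustration under stated assumptions rather than the authors' WEKA/Java implementation: data objects are assumed to be equal-length tuples of categorical values, the function names are ours, and modes are recomputed once per pass (a batch-style simplification of the incremental update in step 3).

from collections import Counter

def hamming(x, y):
    # Simple matching dissimilarity of Eq. (1): number of mismatched attribute values.
    return sum(1 for a, b in zip(x, y) if a != b)

def mode_of(objects, m):
    # Column-wise mode: the most frequent value of each of the m attributes.
    return tuple(Counter(obj[j] for obj in objects).most_common(1)[0][0] for j in range(m))

def k_modes(data, centers, max_iter=100):
    # Batch-style K-modes: assign every object to its nearest mode, then recompute
    # the modes, until no object changes cluster membership.
    k, m = len(centers), len(data[0])
    labels = [None] * len(data)
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: hamming(obj, centers[c])) for obj in data]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [obj for obj, lab in zip(data, labels) if lab == c]
            if members:
                centers[c] = mode_of(members, m)
    return labels, centers

Seeded with a fixed set of K initial centers, such as those produced by the approach proposed in Section 4, a loop of this kind returns the same partition on every run.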
4. Proposed approach for computing initial cluster centers using multiple attribute clustering

Khan and Ahmad (2004) show that for partitional clustering algorithms, such as K-means:

• Some of the data objects are very similar to each other, which is why they share the same cluster membership irrespective of the choice of initial cluster centers, and
• An individual attribute may also provide some information about the initial cluster centers.

He, Xu, and Deng (2005) present a unified view of categorical data clustering and cluster ensembles for the creation of new clustering algorithms for categorical data. Their intuition is that the attributes present in a categorical dataset contribute to the final cluster structure. They consider the distinct attribute values of an attribute as cluster labels giving a "best clustering" without considering other attributes, and create a cluster ensemble.

Müller, Günnemann, Färber, and Seidl (2010) define multiple clustering as setting up multiple sets of clusters for every data object in a dataset with respect to multiple views on the data. The basic objective of multiple clustering is to represent different perspectives on the data and to utilize the variation among the clustering results to gain additional knowledge about the structure in the data. They discuss several challenges that arise due to multiple clustering of data and the merging of their results. One of the major challenges is related to the detection of the different clusterings revealed by multiple views on the data. This problem of multiple views has been studied in the original data space (Caruana, Elhawary, Nguyen, & Smith, 2006), in orthogonal spaces (Davidson & Qi, 2008) and in subspace projections (Agrawal, Gehrke, Gunopulos, & Raghavan, 1998). Other challenges include the use of given knowledge about known clusterings, processing schemes for clustering, the number of multiple clusterings, and flexibility.

We take motivation from these research works and propose a new cluster initialization algorithm for categorical datasets that performs multiple clustering on different attributes (in the original data space) and uses the distinct attribute values of an attribute as cluster labels. These multiple views provide new insights into the hidden structures of the data, serve as a cue to find a consistent cluster structure, and aid in computing better initial cluster centers. In the following subsections, we present three approaches to select different attribute spaces that can help in generating different clustering views from the data. It is to be noted that all the proposed approaches assume that the desired number of clusters, K, is known in advance.

4.1. Vanilla approach

A Vanilla approach is to consider all the attributes (m) present in the data and generate m clustering views that can be used for further analysis (see Section 4.4 for the follow-up steps).

4.2. Prominent attributes

Khan and Ahmad (2012) show that only a few attributes may be useful to generate multiple clustering views that can help in computing initial cluster centers for the K-modes algorithm. These relevant attributes are extracted based on the following experimental observations:

1. There may be some attributes in the dataset whose number of attribute values is less than or equal to K. Due to the few attribute values per cluster, these attributes possess higher discriminatory power and will play a significant role in deciding the initial cluster centers as well as the cluster structures. The set of these relevant attributes is called the Prominent attributes (P).
2. For the other attributes in the dataset, whose number of attribute values is greater than K, the numerous attribute values will be spread out per cluster. These attributes add little to determining a proper cluster structure and contribute less to deciding the initial representative modes of the clusters.

Algorithm 1 shows the steps to compute Prominent attributes from a dataset. The number of attributes in the set P is defined as p = |P|. In the algorithm, p = 0 refers to a situation where there are no prominent attributes in the data, and p = m means that all attributes are prominent. In both of these scenarios, all the attributes in the data are considered prominent; otherwise a reduced set P of p prominent attributes is chosen.
Algorithm 1. Computation of Prominent attributes

Input: N = data objects, M = set of attributes in the data, m = |M| = number of attributes in the data, p = 0
Output: P = set of Prominent attributes

P = ∅
for i = 1 to m do
  if 1 < (number of distinct attribute values in Mi) ≤ K then
    add Mi to P
    increment p
  end if
end for
if p = 0 or p = m then
  use all attributes and call computeInitialModes(Attributes M)
else
  use the reduced set of prominent attributes and call computeInitialModes(Attributes P)
end if

4.3. Significant attributes

As discussed in the previous section, we select prominent attributes because we expect these attributes to play an important role in clustering. The following discussion is based on the work of Ahmad and Dey (2007a), who present an approach to rank the important attributes in a dataset; we use their method to find significant attributes in the dataset.

Ahmad and Dey (2007a, 2007b) propose an unsupervised learning method to compute the significance of attributes. On the basis of their significance, important attributes can be selected. In this method, the most important step is to find the distance between any two categorical values of an attribute. The distance between two distinct attribute values is computed as a function of their overall distribution and co-occurrence with other attributes. The distance between the pair of values x and y of attribute Mi with respect to attribute Mj, for a particular subset w of the values of attribute Mj, is defined as follows:

  \phi_{ij}^{w}(x, y) = p(w|x) + p(\sim w|y) - 1    (3)

where p(w|x) denotes the probability that elements of the dataset with attribute Mi equal to x have an attribute Mj value contained in w, and p(\sim w|y) denotes the probability that elements of the dataset with attribute Mi equal to y have an attribute Mj value not contained in w.

The distance between attribute values x and y for Mi with respect to attribute Mj is denoted by \phi_j(x, y) and is given by

  \phi_j(x, y) = p(W|x) + p(\sim W|y) - 1    (4)

where W is the subset of values of Mj that maximizes the quantity p(w|x) + p(\sim w|y). The distance between x and y is computed with respect to every other attribute, and the average of these values is the distance \phi(x, y) between x and y in the dataset. The average value of all the attribute value pair distances is taken as the significance of the attribute. Algorithm 2 shows the steps to compute the significance of the attributes in the data.

Algorithm 2. Computation of the significance of attributes

Input: D = categorical dataset, N = data objects, M = set of attributes in the data, m = |M| = number of attributes in the data
Output: S = set of attributes sorted in order of their significance

for every attribute Mi do
  for every pair of categorical attribute values (x, y) of Mi do
    Sum = 0
    for every other attribute Mj do
      \phi_j(x, y) = max_w (p(w|x) + p(\sim w|y) - 1), where w is a subset of the values of attribute Mj
      Sum = Sum + \phi_j(x, y)
    end for
    distance \phi(x, y) between the categorical values (x, y) = Sum / (m - 1)
  end for
  the average value of all the pair distances is taken as the significance of attribute Mi
end for
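To complement the pseudocode of Algorithms 1 and 2, a small Python sketch of both attribute-selection routines is given below, under the same assumptions as the earlier sketch (a dataset held as a list of equal-length tuples of categorical values). The function names are ours, and the maximization over subsets w in Eq. (4) is realised greedily by keeping every value of Mj that is at least as frequent given x as given y; these are illustrative choices, not the authors' implementation.

from itertools import combinations

def prominent_attributes(data, k):
    # Algorithm 1: attributes whose number of distinct values lies in (1, K].
    m = len(data[0])
    prominent = [j for j in range(m) if 1 < len({row[j] for row in data}) <= k]
    # if no attribute (or every attribute) qualifies, fall back to all attributes
    return prominent if 0 < len(prominent) < m else list(range(m))

def significance(data):
    # Algorithm 2: significance of an attribute = average distance between its value pairs,
    # each pair distance being the average over the other attributes of Eq. (4).
    m = len(data[0])
    sig = []
    for i in range(m):
        pair_dists = []
        for x, y in combinations(sorted({row[i] for row in data}), 2):
            x_rows = [row for row in data if row[i] == x]
            y_rows = [row for row in data if row[i] == y]
            dists = []
            for j in range(m):
                if j == i:
                    continue
                vj = {row[j] for row in data}
                p_x = {v: sum(r[j] == v for r in x_rows) / len(x_rows) for v in vj}
                p_y = {v: sum(r[j] == v for r in y_rows) / len(y_rows) for v in vj}
                # maximising subset W of Eq. (4): keep values at least as likely given x as given y
                w = {v for v in vj if p_x[v] >= p_y[v]}
                dists.append(sum(p_x[v] for v in w) + sum(p_y[v] for v in vj - w) - 1.0)
            pair_dists.append(sum(dists) / len(dists))   # phi(x, y), assuming m >= 2
        sig.append(sum(pair_dists) / len(pair_dists) if pair_dists else 0.0)
    return sig  # higher value = more significant attribute

On the small dataset of Table 1, significance(data)[0] evaluates to about 0.58 for attribute M1, matching the worked example that follows.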
We provide an example below to illustrate Algorithm 2. Consider a pure categorical dataset with three attributes M1, M2 and M3, as shown in Table 1.

Table 1. Categorical dataset.

  M1   M2   M3
  L    C    E
  L    C    F
  T    C    F
  T    K    F
  T    D    F

We compute the significance of attribute M1 by calculating the distance of each pair of attribute values with respect to every other attribute. In this case there is only one pair, (L, T). The distance between L and T with respect to M2 is

  max(p(W|L) + p(\sim W|T) - 1) = p(C|L) + p(\sim C|T) - 1 = 1 + 2/3 - 1 = 2/3

where W is the maximizing subset of the values of M2. Similarly, the distance between L and T with respect to M3 is

  max(p(W|L) + p(\sim W|T) - 1) = p(E|L) + p(\sim E|T) - 1 = 1/2 + 1 - 1 = 1/2

The average distance between L and T is

  \phi(L, T) = (1/2)(2/3 + 1/2) ≈ 0.58

As there is only one pair of values in attribute M1, the significance of attribute M1 (i.e., the average of the distances of all pairs) is 0.58.

This method to compute the significance of attributes has been used in various K-means type clustering algorithms for mixed numeric and categorical datasets (Ahmad & Dey, 2007a, 2007b, 2011). Generally, the cost function of K-means type algorithms gives equal importance to all the attributes. Ahmad and Dey (2007a, 2007b, 2011) show that by incorporating the significance of attributes in the cost function, better clustering results can be achieved. Ji, Pang, Zhou, Han, and Wang (2012) show that this approach is also useful for fuzzy clustering of categorical datasets. In this paper, we use this approach to select the significant attributes from the datasets.

4.4. Computation of initial cluster centers

In the preceding sections, we discussed three methods of choosing attributes that can be used for the computation of initial cluster centers. The Vanilla approach chooses all the attributes, whereas the prominent attributes approach (see Section 4.2) has the ability to choose a smaller number of attributes depending upon the distribution of attribute values in the different attributes of the data. We discuss the potential problem of choosing all the attributes in Section 5.3. The method to compute significant attributes (see Section 4.3) provides a ranking of all the attributes in order of their significance in the dataset. However, there is no straightforward way to choose the most significant attributes in the data except to use an arbitrary cut-off value. For the experimentation, we choose the number of significant attributes to be the same as the number of prominent attributes and discard the rest; if all the attributes in the dataset are prominent, then all of them are considered significant.

The main idea of the proposed algorithm is to partition the data into clusters that correspond to the distinct attribute values of the Vanilla/Prominent/Significant attributes, and to generate a cluster label for every data object present in the dataset. This cluster labeling is essentially a clustering view of the original data in the original space. Repeating this process for all the Vanilla/Prominent/Significant attributes yields a number of cluster labels that represent multiple partition views of every data object. The cluster labels assigned to a data object over these multiple clusterings are termed a cluster string, and the total number of cluster strings is equal to the number of data objects present in the dataset. As noted in Section 4, some data objects will not be affected by choosing different initial cluster centers and their cluster strings will remain the same. The number of distinct cluster strings represents the number of distinguishable clusters in the data. The algorithm assumes that the knowledge of the natural clusters in the data, i.e., K, is available, and if the number of distinct cluster strings is more than K, then it merges them into K clusters of cluster strings, such that the cluster strings within a cluster are more similar to each other than to the others.
Lastly, the cluster strings within each of the K clusters are replaced by their corresponding data objects, and the mode of each of the K clusters is computed; these modes serve as the initial cluster centers for the K-modes algorithm. In summary, the proposed method finds the dense localized regions in the dataset in the form of distinguishable clusters. If their count is greater than K, it merges them into K clusters (and has the ability to ignore the infrequent clusters) and finds their group modes to be used as initial cluster centers. This process helps in preventing outliers from contributing to the computation of the initial cluster centers.

Algorithm 3. computeInitialModes(Attributes A)

Input: dataset N, with n = |N| = the number of data objects, and A, the set of categorical attributes with a = |A| = number of attributes. If all the attributes are considered, then A = M and a = m; if prominent/significant attributes are considered, a ≤ m, i.e., a = p. |a_i| is the cardinality of attribute a_i, and K is a user-defined number that represents the number of clusters in the data.
Output: K cluster centers

Generation of cluster strings:
for i = 1 ... a do
  1. Divide the dataset into |a_i| clusters on the basis of the |a_i| attribute values, such that data objects with different values of attribute a_i fall into different clusters. Compute the cluster centers of these |a_i| clusters.
  2. Partition the data by performing K-modes clustering that uses the cluster centers computed in the above step as initial cluster centers.
  3. Assign a cluster label to every data object. S_ti defines the cluster label of the tth data object computed by using attribute a_i, where t = 1, 2, ..., n.
end for
The cluster labels assigned to a data object are considered as a cluster string, resulting in the generation of n cluster strings.
4. Find the distinct cluster strings among the n strings, count their frequencies, and sort them in descending order. Their count, K′, is the number of distinguishable clusters.
5. If K′ = K: get the data objects corresponding to these K cluster strings and compute the cluster centers of these K clusters. These will be the required initial cluster centers.
6. If K′ > K: merge similar distinct cluster strings of the K′ strings into K clusters (more details in Section 4.4.1) and compute the cluster centers. These will be the required initial cluster centers.
7. If K′ < K: reduce the value of K and repeat the complete process.

The steps to find initial cluster centers by using the proposed approach are presented in Algorithm 3. The computational efficiency of step 4 of the proposed algorithm can be improved by using other approaches such as on-line suffix trees (Ukkonen, 1995), which can perform string comparisons in time linear in the length of the string. A sketch of the procedure is given below, followed by a descriptive example.
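The cluster-string generation and the handling of steps 4–7 of Algorithm 3 can be sketched in Python as follows, reusing the hamming, mode_of and k_modes helpers from the earlier sketch and the merge_strings routine sketched in Section 4.4.1 below. All names are illustrative; keeping at least K strings around the n^0.5 cut and signalling the K′ < K case with an exception are our own simplifications.

from collections import Counter

def compute_initial_modes(data, attrs, k):
    # Sketch of computeInitialModes(Attributes A): one clustering view per selected
    # attribute, cluster strings, frequency counting, and steps 5-7.
    n, m = len(data), len(data[0])
    views = [[] for _ in range(n)]
    for j in attrs:
        values = sorted({row[j] for row in data})
        # step 1: one seed cluster per distinct value of attribute j; step 2: refine with K-modes
        seeds = [mode_of([row for row in data if row[j] == v], m) for v in values]
        labels, _ = k_modes(data, seeds)
        for t in range(n):
            views[t].append(labels[t])        # step 3: label of object t for attribute j
    strings = ['-'.join(map(str, s)) for s in views]
    freq = Counter(strings).most_common()     # step 4: distinct strings, most frequent first
    k_prime = len(freq)
    if k_prime < k:                           # step 7: fewer distinguishable clusters than K
        raise ValueError('fewer distinct cluster strings than K; reduce K and repeat')
    top = [s for s, _ in freq[:max(k, int(n ** 0.5))]]
    groups = merge_strings(top, k) if k_prime > k else [[s] for s in top]   # steps 5 and 6
    return [mode_of([data[t] for t in range(n) if strings[t] in group], m) for group in groups]

In the 10-object example that follows, the corresponding cluster strings are those of Table 2; the three most frequent strings are merged into K = 2 groups, and the modes of {D1, D3, D8} and {D2, D4, D7, D9} serve as the initial centers.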
To illustrate Algorithm 3, we present a descriptive example. Suppose we have 10 data objects D1, D2, ..., D10, described by 4 categorical attributes, with K = 2. Let the cardinalities of M1, M2, M3 and M4 be 2, 2, 4 and 2. For the Vanilla approach, we consider all the attributes and first divide the data objects on the basis of attribute M1, calculating 2 cluster centers because the cardinality of M1 is 2. We run the K-modes algorithm using these initial cluster centers. Every data object is assigned a cluster label (either 1 or 2), and the same process is repeated for all the other attributes. As there are 4 attributes, each data object will have a cluster string that consists of 4 labels. For example, suppose data object D1 has 1-2-2-1 as its cluster string. This means that in the first run (using M1 to create the initial clusters) the data object D1 is placed in cluster 1, in the second run (using M2 to create the initial clusters) the data object D1 is placed in cluster 2, and so on. We will get 10 cluster strings, one for each data object. Suppose we get the cluster strings for the different data objects shown in Table 2.

Table 2. Cluster strings of different data objects.

  Data object   Cluster string
  D1            1-1-3-2
  D2            2-2-1-1
  D3            1-1-3-2
  D4            2-2-1-1
  D5            1-2-4-2
  D6            2-1-4-1
  D7            2-2-2-1
  D8            1-1-3-2
  D9            2-2-2-1
  D10           1-2-3-1

We calculate the frequency of all the distinct strings, as shown in Table 3.

Table 3. Example of the frequency computation of distinct cluster strings.

  String    Frequency   Data objects
  1-1-3-2   3           D1, D3, D8
  2-2-1-1   2           D2, D4
  2-2-2-1   2           D7, D9
  1-2-4-2   1           D5
  2-1-4-1   1           D6
  1-2-3-1   1           D10

We take the 10^0.5 ≈ 3 most frequent cluster strings (details on this step are given in Section 4.4.1) and cluster them by using hierarchical clustering with K = 2. The similar strings 2-2-1-1 and 2-2-2-1 are merged into one cluster. This leads to two clusters containing the cluster strings {1-1-3-2} and {2-2-1-1, 2-2-2-1} with their corresponding data objects, i.e.,

  Cluster1 = {D1, D3, D8}
  Cluster2 = {D2, D4, D7, D9}

The data objects belonging to these clusters are used to compute the required 2 cluster centers, as K = 2. The other, infrequent cluster strings and their corresponding data objects are assumed to be outliers that do not contribute to computing the initial cluster centers. The centers of these clusters serve as the initial cluster centers for the K-modes algorithm. For the prominent attributes approach, the attributes with numbers of attribute values less than or equal to the number of clusters are selected. In the example shown, attributes M1, M2 and M4 are selected, and the same procedure is followed with these three attributes to compute the initial cluster centers. In the significant attributes approach, the significant attributes are calculated first and are then used to calculate the initial cluster centers. For example, if M1, M2 and M3 are the most significant attributes, then they are used to calculate the initial cluster centers following the above procedure.

Algorithm 3 may give rise to an obscure case where the number of distinct cluster strings is less than the chosen K (assumed to represent the natural clusters in the data). This case can happen when the partitions created based on the attribute values of the A attributes group the data into almost the same clusters every time. Another possible scenario is when the attribute values of all attributes follow almost the same distribution, which is normally not the case in real data. This case also suggests that the chosen K probably does not correspond to the natural grouping and should be changed to a different value. The role of attributes with numbers of attribute values greater than K has to be investigated in this case. Generally, in K-modes clustering, the number of desired clusters (K) is selected without knowledge of the natural clusters in the data. The number of natural clusters may be less than the number of desired clusters. If the number of cluster strings (K′) obtained is less than K, a viable solution is to reduce the value of K and then apply the proposed algorithm to calculate the initial cluster centers. However, this particular case is outside the scope of the present paper.
4.4.1. Merging clusters

As discussed in step 6 of Algorithm 3, there may arise a case where K′ > K, which means that the number of distinguishable clusters obtained by the algorithm is greater than the desired number of clusters in the data. Therefore, the K′ clusters must be merged to arrive at K clusters. As these K′ clusters represent distinguishable clusters, a trivial approach could be to sort them in order of cluster string frequency and pick the top K cluster strings. A problem with this method is that it cannot be ensured that the top K most frequent cluster strings are representative of K different clusters. If more than one cluster string comes from the same cluster, then the K-modes algorithm will give undesirable clustering results. This fact was also verified experimentally and holds true. Keeping this issue in mind, we propose to use hierarchical clustering (Hall et al., 2009) to merge the K′ distinct cluster strings into K clusters.

Hierarchical clustering generates more informative cluster structures than the unstructured set of clusters returned by non-hierarchical clustering methods (Jain & Dubes, 1988). Most hierarchical clustering algorithms are deterministic and stable in comparison to their partitional counterparts. However, hierarchical clustering has the disadvantage of quadratic time complexity with respect to the number of data objects. In general, the number of cluster strings K′ will be less than n. However, to avoid extreme cases such as K′ ≈ n, we only choose the n^0.5 most frequent distinct cluster strings. This makes the hierarchical step log-linear in the number of data objects (K′ or n^0.5 distinct cluster strings here). The Hamming distance (defined in Section 3) is used to compare the cluster strings. The proposed algorithm is based on the observation that some data objects always belong to the same clusters irrespective of the initial cluster centers; it attempts to capture those data objects, which are represented by the most frequent strings. The infrequent cluster strings can be considered as outliers or boundary cases, and their exclusion does not affect the computation of the initial cluster centers. In the best case, when K′ ≤ n^0.5, the time complexity effect of the log-linear hierarchical clustering will be minimal. This process generates K m-dimensional modes that are to be used as the initial cluster centers for the K-modes clustering algorithm. For merging cluster strings (in Section 5), we use single-linkage hierarchical clustering; however, other options such as average-linkage, complete-linkage, etc. can also be used.
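A minimal sketch of this merging step is given below: single-linkage agglomeration of the retained cluster strings under the Hamming distance, stopped when K groups remain. It is written in plain Python without any clustering library and the routine names are ours; the paper itself relies on the hierarchical clusterer of the WEKA framework (Hall et al., 2009).

def string_distance(a, b):
    # Hamming distance between two cluster strings (position-wise label mismatches).
    return sum(1 for u, v in zip(a.split('-'), b.split('-')) if u != v)

def merge_strings(strings, k):
    # Single-linkage agglomerative merging of distinct cluster strings into K groups.
    clusters = [[s] for s in strings]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(string_distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]     # merge the closest pair of groups
        del clusters[j]
    return clusters

For instance, merge_strings(['1-1-3-2', '2-2-1-1', '2-2-2-1'], 2) groups 2-2-1-1 with 2-2-2-1 (distance 1) and leaves 1-1-3-2 on its own, reproducing the grouping used in the example of Section 4.4.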
Continuing with the example shown in Section 4.4, we start with the n^0.5 most frequent strings as the lowest level of the tree for the bottom-up hierarchical clustering. Similar strings are merged up to the level where the number of clusters is equal to the desired number of clusters, K. The data objects belonging to the strings in a cluster are used to compute the initial cluster center. In the example shown in the previous section, there are three strings, 1-1-3-2, 2-2-1-1 and 2-2-2-1, that are to be used for computing the initial cluster centers, and the number of designated clusters is 2. The similar strings 2-2-1-1 and 2-2-2-1 are merged, resulting in two clusters. All the data objects corresponding to the strings within a cluster are used to compute the cluster centers.

4.4.2. Choice of attributes

The proposed algorithm starts with the assumption that there exist prominent attributes in the data that can help in obtaining distinguishable cluster structures, which can either be used as is or be merged to obtain initial cluster centers. In the absence of any prominent attributes (or if all attributes are prominent), i.e., the Vanilla approach, all the attributes are selected to find the initial cluster centers. Since attributes other than prominent attributes contain more than K attribute values, a possible repercussion is an increased number of distinct cluster strings due to the availability of more cluster allotment labels. This implies an overall reduction in the individual counts of the distinct cluster strings, and many small clusters may be generated. In our formulation, the hierarchical clusterer imposes a limit of n^0.5 on the top cluster strings to be merged; therefore some relevant clusters could lie outside the bound during merging. This may lead to some loss of information while computing the initial cluster centers. The best case occurs when the number of distinct cluster strings is less than or equal to n^0.5.

4.4.3. Evaluating time complexity

The proposed algorithm to compute initial cluster centers has three parts:

1. Compute Vanilla/Prominent/Significant attributes.
2. Compute initial cluster centers.
3. If needed, merge clusters.

The time complexity of the computation of Vanilla/Prominent attributes is O(nm), whereas that of computing significant attributes is O(nm^2 T^2) (Ahmad & Dey, 2007a), where T is the average number of distinct attribute values per attribute and T ≪ n. The computation of the initial cluster centers (Algorithm 3) needs the basic K-modes algorithm to run p times (in the worst case m times). As the K-modes algorithm is linear with respect to the size of the dataset (Huang, 1997), the worst-case time complexity is O(rKm^2 n), where r is the number of iterations needed for convergence and r ≪ n. For merging the distinct cluster strings into K clusters, computeInitialModes(Attributes A) uses hierarchical clustering. The worst-case complexity of hierarchical clustering is O(n^2 log n); however, the proposed approach chooses only the n^0.5 most frequent distinct cluster strings (see Section 4.4.1), therefore the worst-case complexity for merging cluster strings becomes O(n log n). Combining all the parts, we get two worst-case time complexities:

• Using All/Prominent attributes: O(nm + rKm^2 n + n log n)
• Using Significant attributes: O(nm^2 T^2 + rKm^2 n + n log n)

It is to be noted that the worst-case time complexity using significant attributes is higher than that using all/prominent attributes due to the additional computation time spent in finding the significance of the attributes. However, the worst-case time complexity of both methods is log-linear in the number of data objects. With the prominent attributes approach, the proposed method is advantageous for datasets where n ≫ rKm^2. With the significant attributes approach, the method may be useful for datasets where n ≫ m^2 T^2 + rKm^2.

5. Experimental analysis

5.1. Datasets

To evaluate the performance of the proposed initialization method, we use several pure categorical datasets from the UCI Machine Learning Repository (Bache & Lichman, 2013).
A short description of each dataset is given below.

Soybean Small. This dataset consists of 47 cases of soybean disease, each characterized by 35 multi-valued categorical variables. These cases are drawn from four populations, each representing one of the following soybean diseases: D1 - Diaporthe stem canker, D2 - Charcoal rot, D3 - Rhizoctonia root rot and D4 - Phytophthora rot. Ideally, a clustering algorithm should partition these cases into four groups (clusters) corresponding to the diseases. The clustering results on the Soybean Small data are shown in Table 5.

Breast Cancer Data. This dataset has 699 instances with 9 attributes. Each data object is labeled as benign (458 instances, or 65.5%) or malignant (241 instances, or 34.5%). There are 9 instances with a missing (i.e., unavailable) value in attributes 6 and 9. The clustering results for the Breast Cancer data are shown in Table 6.

Zoo Data. It has 101 instances described by 16 attributes and distributed into 7 categories. The first attribute contains a unique animal name for each instance and is removed because it is non-informative. All other attributes are Boolean, except for the attribute corresponding to the number of legs, which takes values in the set {0, 2, 4, 5, 6, 8}. The clustering results for the Zoo data are shown in Table 7.

Lung Cancer Data. This dataset contains 32 instances described by 56 attributes distributed over 3 classes, with missing values in attributes 5 and 39. The clustering results for the Lung Cancer data are shown in Table 8.

Mushroom Data. The Mushroom dataset consists of 8124 data objects described by 22 categorical attributes distributed over 2 classes. The two classes are edible (4208 objects) and poisonous (3916 objects). It has missing values in attribute 11. The clustering results for the Mushroom data are shown in Table 9.

Congressional Vote Data. This dataset includes the votes of each of the U.S. House of Representatives Congressmen on 16 key votes. Each of the votes can be a yes, a no or an unknown disposition. The data has 2 classes, with 267 Democrat and 168 Republican instances. The clustering results for the Vote data are shown in Table 10.

Dermatology Data. This dataset contains six types of skin diseases for 366 patients, evaluated using 34 clinical attributes, 33 of which are categorical and one numerical. The categorical attribute values signify degrees, indicating whether a feature is absent, present in the largest possible amount, or takes a relative intermediate value. In our experiments, we discretized the numerical attribute (representing the age of the patient) into 10 categories. The clustering results for the Dermatology data are presented in Table 11.

We used the WEKA framework (Hall et al., 2009) for the data pre-processing and for implementing the proposed algorithm. The Java source code is publicly available at http://www.cs.uwaterloo.ca/~s255khan/code/kmodes-init.zip.

5.2. Comparison and performance evaluation metric

To evaluate the quality of the clustering results and their fair comparison, we used the performance metrics used by Wu et al. (2007), which are derived from information retrieval.
Assuming that a dataset contains K classes, for any given clustering method, let e_i be the number of data objects that are correctly assigned to class C_i, let b_i be the number of data objects that are incorrectly assigned to class C_i, and let c_i be the number of data objects that are incorrectly rejected from class C_i. Then precision, recall and accuracy are defined as follows:

  PR = \frac{1}{K} \sum_{i=1}^{K} \frac{e_i}{e_i + b_i}    (5)

  RE = \frac{1}{K} \sum_{i=1}^{K} \frac{e_i}{e_i + c_i}    (6)

  AC = \frac{1}{n} \sum_{i=1}^{K} e_i    (7)

Jain and Dubes (1988) noted that the results of partitional clustering algorithms improve when the initial cluster centers are close to the actual cluster centers. To measure the closeness between the initial cluster centers computed by the proposed method and the actual modes of the clusters in the data, we define a match metric,

  matchMetric = \frac{1}{K \cdot m} \sum_{i=1}^{K} \sum_{j=1}^{m} \big(1 - \delta(initial_{ij}, actual_{ij})\big)    (8)

where initial_ij is the jth value of the initial mode for the ith cluster, actual_ij is the corresponding jth value of the actual mode for the ith cluster, and \delta is defined as in Section 3. The matchMetric gives the degree of closeness between the initial and actual modes, with a value of 0 meaning no match and 1 meaning an exact match between them.
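For reference, the metrics of Eqs. (5)–(8) can be computed from a confusion matrix as in the Python sketch below. It assumes (as in Tables 5–11) that clusters have already been aligned with the classes along the diagonal and that no cluster or class is empty; the function names are illustrative.

def clustering_metrics(confusion):
    # Eqs. (5)-(7): rows of `confusion` are clusters, columns are classes; entry [i][j] is
    # the number of objects of class j assigned to cluster i, and cluster i represents class i.
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    e = [confusion[i][i] for i in range(k)]                                # correctly assigned
    b = [sum(confusion[i]) - e[i] for i in range(k)]                       # incorrectly assigned to class i
    c = [sum(confusion[r][i] for r in range(k)) - e[i] for i in range(k)]  # incorrectly rejected from class i
    pr = sum(e[i] / (e[i] + b[i]) for i in range(k)) / k
    re = sum(e[i] / (e[i] + c[i]) for i in range(k)) / k
    ac = sum(e) / n
    return ac, pr, re

def match_metric(initial, actual):
    # Eq. (8): fraction of mode entries on which the initial and actual cluster centers agree.
    k, m = len(initial), len(initial[0])
    agree = sum(1 for i in range(k) for j in range(m) if initial[i][j] == actual[i][j])
    return agree / (k * m)

As a check, clustering_metrics([[453, 56], [5, 185]]) on the Breast Cancer confusion matrix of Table 6 returns approximately AC = 0.913, PR = 0.932 and RE = 0.878, matching the values reported for the proposed method.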
5.3. Effect of number of attributes

In Section 4.4.2, we discussed that choosing all the attributes can lead to the generation of a large number of cluster strings, especially if the attributes have many attribute values. To test this intuition, we performed a comparative analysis of the effect of the number of selected attributes on the number of distinct cluster strings (generated in step 6 of Algorithm 3). In Table 4, m is the total number of attributes in the data, p is the number of prominent attributes, s is the number of significant attributes (s = |S| and s = p), CS_A, CS_P and CS_S are the numbers of distinct cluster strings obtained using Vanilla, Prominent and Significant attributes, and n^0.5 is the limit on the number of top cluster strings to be merged using hierarchical clustering.

Table 4. Effect of choosing different numbers of attributes.

  Dataset         Vanilla (m / CS_A)   Prominent (p / CS_P)   Significant (s / CS_S)   n^0.5
  Soybean         35 / 25              20 / 21                20 / 23                  7
  Mushroom        22 / 683             5 / 16                 5 / 44                   91
  Dermatology     34 / 357             33 / 352               33 / 357                 20
  Lung-Cancer     56 / 32              54 / 32                54 / 32                  6
  Zoo             16 / 7               16 / 7                 16 / 7                   11
  Vote            16 / 126             16 / 126               16 / 126                 21
  Breast-Cancer   9 / 355              9 / 355                9 / 355                  27

The table shows that the Vanilla approach (all attributes) leads to a larger number of cluster strings, whereas with the proposed approach (either prominent or significant attributes) they are relatively fewer. This can be seen for the Soybean Small, Mushroom and Dermatology data. For the Lung Cancer data p ≈ m, therefore the numbers of cluster strings are equivalent. For the Soybean Small and Mushroom datasets, for the same number of Prominent and Significant attributes, the corresponding numbers of cluster strings are different (CS_P ≠ CS_S). This is due to the fact that the sets of p prominent and significant attributes are different, because the two methods select attributes using different approaches. While the choice of prominent attributes (when they are fewer than m) should reduce the overall number of cluster strings, as it selects attributes with fewer attribute values, the attributes selected by the significance method may include attributes with more attribute values, which results in more cluster strings. For the Zoo, Vote and Breast Cancer datasets, all the attributes were prominent, therefore p and m are the same and hence CS_P = CS_A. It is to be noted that the numbers of distinct cluster strings obtained using the proposed approach for the Zoo and Mushroom datasets are within the bound of the n^0.5 limit.

5.4. Clustering results

In this section, we present the K-modes clustering results that use the initial cluster centers computed with the proposed method. We conducted five sets of experiments; their details are as follows:

• Experiment 1: We compared the clustering results obtained by using prominent attributes to find initial cluster centers with the method of random selection of initial cluster centers and the methods described by Cao et al. (2009) and Wu et al. (2007). As mentioned in Section 2, there are some computational inaccuracies in the work of Bai et al. (2012), therefore we exclude their method from comparison with our work. For random initialization, we randomly group the data objects into K clusters and compute their modes to be used as initial cluster centers.
• Experiment 2: We compared the clustering results obtained by using prominent and significant attributes to find initial cluster centers.
• Experiment 3: For some datasets, the Vanilla attributes are different from the prominent attributes; for those cases we compared their clustering results.
• Experiment 4: For all three approaches to compute initial cluster centers, i.e., Vanilla, Prominent and Significant, we computed the matchMetric to measure the quality of the initial cluster centers in terms of their closeness to the actual modes or cluster centers of the data.
• Experiment 5: We performed a scalability test by increasing the number of data objects to over 100,000 and recording the time spent in computing initial cluster centers. We also compared the order of the time complexity of the proposed method with two other initialization methods.

Experiment 1. Tables 5–11 show the clustering results, with confusion matrices representing the cluster structures obtained by seeding the K-modes algorithm with the initial cluster centers computed using the proposed method. It can be seen that the proposed initialization method outperforms random cluster initialization when used as a seed for the K-modes clustering algorithm on categorical data in accuracy, precision and recall. The random initialization method gives non-repeatable results, whereas the proposed method gives fixed clustering results. Therefore, repeatable and better cluster structures can be obtained by using the proposed method. In comparison to the initialization methods of Cao et al. and Wu et al., we evaluate our results in terms of:

• Accuracy – The proposed method outperforms or equals the other methods in 4 cases and performs worse in one case.
• Precision – The proposed method performs better than or equal to the other methods in 2 cases and performs worse in 3 cases.
• Recall – The proposed method outperforms or equals the other methods in 4 cases and performs worse in 1 case.

The results for the Congressional Vote and Dermatology data are not available from the papers of Cao et al. and Wu et al., therefore we compared the clustering accuracy of the proposed method against the random initialization method.
Table 5. Clustering results for Soybean Small data.

  (a) Confusion matrix (rows: clusters, columns: classes)
  Cluster   D1   D2   D3   D4
  D1        10   0    0    0
  D2        0    10   0    0
  D3        0    0    10   2
  D4        0    0    0    15

  (b) Performance comparison
        Random   Wu       Cao      Proposed
  AC    0.8644   1        1        0.9574
  PR    0.8999   1        1        0.9583
  RE    0.8342   1        1        0.9705

Table 6. Clustering results for Breast Cancer data.

  (a) Confusion matrix (rows: clusters, columns: classes)
  Cluster     Benign   Malignant
  Benign      453      56
  Malignant   5        185

  (b) Performance comparison
        Random   Wu       Cao      Proposed
  AC    0.8364   0.9113   0.9113   0.9127
  PR    0.8699   0.9292   0.9292   0.9318
  RE    0.7743   0.8773   0.8773   0.8783

Table 7. Clustering results for Zoo data.

  (a) Confusion matrix (rows: clusters, columns: classes)
  Cluster   a    b    c    d    e    f    g
  a         39   0    0    0    0    0    0
  b         0    19   0    0    0    0    0
  c         0    1    4    0    4    0    0
  d         2    0    1    13   0    0    0
  e         0    0    0    0    0    0    1
  f         0    0    0    0    0    8    2
  g         0    0    0    0    0    0    7

  (b) Performance comparison
        Random   Wu       Cao      Proposed
  AC    0.8356   0.8812   0.8812   0.8911
  PR    0.8072   0.8702   0.8702   0.7224
  RE    0.6012   0.6714   0.6714   0.7716

Table 8. Clustering results for Lung Cancer data.

  (a) Confusion matrix (rows: clusters, columns: classes)
  Cluster   a   b   c
  a         8   7   0
  b         1   6   8
  c         0   0   2

  (b) Performance comparison
        Random   Wu       Cao      Proposed
  AC    0.5210   0.5000   0.5000   0.5000
  PR    0.5766   0.5584   0.5584   0.6444
  RE    0.5123   0.5014   0.5014   0.5168

Table 9. Clustering results for Mushroom data.

  (a) Confusion matrix (rows: clusters, columns: classes)
  Cluster     Poisonous   Edible
  Poisonous   3052        98
  Edible      864         4110

  (b) Performance comparison
        Random   Wu       Cao      Proposed
  AC    0.7231   0.8754   0.8754   0.8815
  PR    0.7614   0.9019   0.9019   0.8975
  RE    0.7174   0.8709   0.8709   0.8780

Table 10. Clustering results for Congressional Vote data.

  (a) Confusion matrix (rows: clusters, columns: classes)
  Cluster      Republican   Democrat
  Republican   158          55
  Democrat     10           212

  (b) Performance comparison
        Random   Proposed
  AC    0.4972   0.8506
  PR    0.5030   0.8484
  RE    0.5031   0.8672
Experiment2. Table 12 shows that for the Zoo, Vote and Breast Cancer datasets the clustering results using Prominent and Significant attributes are the same. This is because all the attributes of these datasets are considered for computing the initial cluster centers. For the other datasets, we observe that using Prominent attributes is a better choice than using Significant attributes. Although we choose the same number of Prominent and Significant attributes (especially when not all the attributes are prominent), their clustering results vary because the two attribute spaces may contain different sets of attributes. The reason is that, by definition (see Section 4), Prominent and Significant attributes use different criteria to choose the relevant attributes for computing initial cluster centers. Moreover, generating the ranking of Significant attributes is costlier in terms of time complexity than computing Prominent attributes (see Section 4.4.3 for details).

Experiment3. As per Algorithm 1, for the Zoo, Vote and Breast Cancer data all the attributes are prominent. For the rest of the datasets this is not the case, and the number of prominent attributes is smaller than the total number of attributes. We performed an experiment to analyze the scenario in which there are fewer prominent attributes and its impact on the overall clustering results (an illustrative sketch of the attribute-selection step follows Table 14). Table 13 shows that for all the datasets except Soybean Small, choosing fewer prominent attributes than the total number of attributes improves the clustering performance. Choosing all attributes in comparison to fewer prominent attributes generates more cluster strings (see Table 4). If these cluster strings number more than n^0.5, then many relevant cluster strings may not be chosen, which, if included, could have contributed to the computation of the initial cluster centers.

Table 11
Clustering results for Dermatology data.
(a) Confusion matrix
Cluster                    Class
                           Seboreic dermatitis   Psoriasis   Lichen planus   Cronic dermatitis   Pityriasis rosea   Pityriasis rubra pilaris
Seboreic dermatitis        53                    7           5               2                   36                 0
Psoriasis                  0                     96          0               0                   0                  0
Lichen planus              0                     0           66              0                   0                  0
Cronic dermatitis          2                     1           0               39                  0                  0
Pityriasis rosea           6                     8           1               11                  13                 4
Pityriasis rubra pilaris   0                     0           0               0                   0                  16
(b) Performance comparison
      Random   Proposed
AC    0.2523   0.7732
PR    0.2697   0.7909
RE    0.2954   0.7570

Table 12
Comparison of clustering results using Prominent and Significant attributes.
Dataset         Prominent                     Significant
                AC       PR       RE          AC        PR       RE
Soybean         0.9574   0.9583   0.9705      0.6809    0.7549   0.7176
Mushroom        0.8815   0.8975   0.8780      0.5086    0.7303   0.5256
Dermatology     0.7732   0.7909   0.7570      0.6502    0.5601   0.5512
Lung-Cancer     0.5000   0.6444   0.5168      0.46875   0.5079   0.4838
Zoo             0.8911   0.7224   0.7716      0.8911    0.7224   0.7716
Vote            0.8506   0.8484   0.8672      0.8506    0.8484   0.8672
Breast-Cancer   0.9127   0.9318   0.8783      0.9127    0.9318   0.8783

Table 13
Comparison of clustering results using Vanilla and Prominent attributes.
Dataset         Vanilla                       Prominent
                AC       PR       RE          AC        PR       RE
Soybean         0.9787   0.9772   0.9853      0.9574    0.9583   0.9705
Mushroom        0.6745   0.7970   0.6627      0.8816    0.8976   0.8780
Dermatology     0.4180   0.3889   0.3420      0.7732    0.7909   0.7570
Lung-Cancer     0.5000   0.6317   0.5017      0.5000    0.6444   0.5168

Table 14
Comparison of matchMetric and its effect on K-modes algorithm convergence.
(a) p < m
Dataset         Vanilla               Prominent             Significant
                matchMetric   #Itr    matchMetric   #Itr    matchMetric   #Itr
Soybean         0.7357        2       0.9643        2       0.8428        2
Mushroom        0.6136        2       0.8863        2       0.5681        1
Dermatology     0.6176        5       0.6813        6       0.6274        6
Lung-Cancer     0.6726        6       0.7261        5       0.7321        8
(b) p = m
Dataset         Vanilla
                matchMetric   #Itr
Zoo             0.8661        2
Vote            0.7500        2
Breast-Cancer   0.8333        3
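As a reading aid for Experiments 2 and 3, the sketch below illustrates one plausible form of the Prominent-attribute selection step, consistent with Table 4 and with the observation that the Breast Cancer data has no prominent attribute and therefore falls back to all attributes. It assumes that an attribute is treated as prominent when it takes more than one and at most K distinct values; this criterion, the fallback and all names are our interpretation for illustration only, and Algorithm 1 in Section 4 remains the authoritative definition.

    def select_prominent_attributes(data, k):
        """Illustrative Prominent-attribute selection (our reading of Algorithm 1):
        an attribute is prominent if it takes more than one and at most K distinct
        values; if no attribute qualifies, all attributes are used (Vanilla case).
        `data` is a list of equal-length tuples of categorical values."""
        n_attrs = len(data[0])
        prominent = []
        for j in range(n_attrs):
            distinct = {obj[j] for obj in data}
            if 1 < len(distinct) <= k:           # assumed prominence criterion
                prominent.append(j)
        # Fallback observed for the Breast Cancer data: no attribute qualifies,
        # so all attributes are considered for computing initial cluster centers.
        return prominent if prominent else list(range(n_attrs))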
Experiment4. For all the datasets, using the three methods of computing initial cluster centers, we computed the matchMetric (see Eq. (8)), which measures the degree of closeness between the initial and the actual modes. We also studied the impact of the quality of the initial cluster centers on the convergence of the K-modes algorithm (in terms of the number of iterations, #Itr). Table 14(a) shows the case when the prominent attributes are fewer than the total number of attributes. The initial cluster centers selected using prominent/significant attributes are always closer to the actual modes of the datasets in terms of the matchMetric, and therefore the K-modes algorithm converges in very few iterations with good cluster structures (see the discussion of the clustering results in Experiment1). Similar results were obtained when all the attributes are chosen as prominent attributes and used to compute the initial cluster centers (see Table 14(b)). The high values of the matchMetric show that the initial cluster centers are close to the actual cluster centers, and the K-modes clustering algorithm seeded with these initial cluster centers converges fast with good clustering performance. The reason the initial cluster centers are close to the actual cluster centers is that the proposed method finds dense localized clusters, merges them if needed and discards the insignificant clusters.
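Eq. (8) is not reproduced in this section. For illustration, the matchMetric can be read as the average fraction of attribute positions at which a computed initial cluster center agrees with its best-matching actual mode; the sketch below follows that reading with a simple greedy matching of centers to modes. The matching strategy and all names are our assumptions and may differ in detail from Eq. (8).

    def match_metric(initial_centers, actual_modes):
        """Illustrative matchMetric: average per-attribute agreement between each
        initial cluster center and its best-matching actual mode (greedy matching,
        assuming equally many centers and modes)."""
        def agreement(c, m):
            return sum(a == b for a, b in zip(c, m)) / len(c)
        used, total = set(), 0.0
        for c in initial_centers:
            # match this center to the most similar actual mode not used so far
            j, score = max(
                ((j, agreement(c, m)) for j, m in enumerate(actual_modes) if j not in used),
                key=lambda t: t[1],
            )
            used.add(j)
            total += score
        return total / len(initial_centers)

A value of 1 would mean that the initial centers coincide with the actual modes; the values close to 1 in Table 14 are what drive the fast convergence discussed above.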
Experiment5. Time scalability of the proposed algorithm. We performed an experiment to test the scalability of the proposed method for computing initial cluster centers on large datasets. We used the Mushroom dataset (see Section 5.1), which contains 8124 data objects described by 22 categorical attributes and 2 clusters. We made copies of this dataset in multiples of 2, 4, 6, 8, 10, 12 and 14 such that the data size varies from 8124 to 113,736. We executed the proposed algorithm for computing initial cluster centers on each of these copies separately. We ran the experiment on an HP TouchSmart tm2 machine with an Intel Pentium U4100 1.3 GHz processor, 2048 KB of L2 cache and 4 GB of RAM. Fig. 1 shows the plot of the different data sizes against the corresponding time consumed in computing the initial cluster centers. It can be observed that the time cost of the proposed method grows almost linearly with the increase in the number of data objects. The experimental results suggest that the proposed cluster center initialization method scales linearly and can be used for large datasets.

[Fig. 1. Time consumption in computing initial cluster centers for variable data size (x-axis: number of data objects; y-axis: time in seconds).]

Table 15 compares the time complexities of the proposed cluster initialization algorithm with the two competing initialization methods of Cao et al. (2009) and Wu et al. (2007). In the proposed algorithm (with prominent attributes), if rKm^2 is larger than log n (which is more likely to be true for a high-dimensional dataset with a large number of clusters), the complexity is determined by the second term, rKm^2n, which is linear in the number of data objects, similar to Cao's method and better than Wu's method (with respect to the number of data objects). This linear time complexity behavior is also observed in our scalability experiments.

Table 15
Comparison of time complexities.

Initialization method   Order of complexity
Cao et al.              O(nmK^2)
Wu et al.               O(cn), where c can be between 2 and n^0.5
Proposed method         O(nm + rKm^2n + n log n), for all/prominent attributes

Multiple Attributes Clustering Challenges. In Section 4, we mentioned some of the challenges in employing multiple clustering approaches (as defined by Müller et al. (2010)). The proposed method uses the multiple clustering approach to find initial cluster centers for a partitional clustering algorithm. The proposed approach successfully generated and detected the multiple clustering views from the data, and it can either merge the distinct distinguishable clusters into the relevant number of clusters by a modified hierarchical clustering approach or use them unaltered, whichever the case may be (as discussed in Algorithm 3); the construction of cluster strings underlying these views is sketched below. In the worst case, the proposed algorithm will generate a number of clustering views equal to the total number of attributes in the data. This is a significant improvement over other approaches, such as that of Khan and Kant (2007), which can run an arbitrary number of times for evidence accumulation. The proposed method is flexible and has been tested on various categorical datasets; however, a known limitation is the required advance knowledge of the number of natural clusters in the data.
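To make the multiple-clustering view concrete, the sketch below illustrates the notion of cluster strings used throughout this section: the data are partitioned once per selected attribute, every object is then described by the string of labels it receives across these partitions, and the distinct strings are the candidate dense regions whose count is compared with K (Algorithm 3 handles the greater-than, equal-to and less-than cases). Partitioning an attribute simply by its own values is a simplification of the actual per-attribute clustering of Section 4, and all names are ours.

    from collections import defaultdict

    def cluster_strings(data, attributes):
        """Illustrative construction of cluster strings: partition the data once per
        selected attribute (here simply by the attribute's values) and describe each
        object by the tuple of labels it receives. Returns the distinct strings with
        their member objects; their count corresponds to CSA/CSP/CSS in Table 4."""
        labelings = []
        for j in attributes:
            values = sorted({obj[j] for obj in data}, key=str)
            label_of = {v: i for i, v in enumerate(values)}
            labelings.append([label_of[obj[j]] for obj in data])
        strings = defaultdict(list)
        for idx, obj in enumerate(data):
            key = tuple(lab[idx] for lab in labelings)   # this object's cluster string
            strings[key].append(obj)
        return strings

When the number of distinct strings exceeds K, a modified hierarchical clustering merges similar strings into K groups whose modes become the initial cluster centers; when it equals K, the strings are used directly, as described in Algorithm 3.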
6. Conclusions

The K-modes clustering algorithm is employed to partition categorical data into a pre-defined number of K clusters; however, the clustering results intrinsically depend on the choice of the random initial cluster centers, which can cause non-repeatable results and produce improper cluster structures. In this paper, we propose an algorithm to compute the initial cluster centers for categorical data by performing multiple clusterings of the data based on the attribute values present in different attributes. The present algorithm is based on the experimental observation that similar data objects form the core of the clusters and are not affected by the selection of initial cluster centers, and that individual attributes also provide useful information in generating cluster structures. The proposed algorithm is composed of two parts: relevant attribute selection and computation of the initial cluster centers. For choosing the relevant attributes from the data, we presented two competing methods. The first method chooses Prominent attributes on the basis of the attribute values present in an attribute, and the second method computes a ranking of Significant attributes by an unsupervised learning method. Based on the selected attributes, the proposed algorithm partitions the data multiple times to generate multiple clustering views of the data. The multiplicity of clustering views is captured in the form of cluster strings, which produce the distinct distinguishable clusters in the data, whose number may be greater than, equal to or less than the desired number of clusters (K). If it is greater than K, then a modified hierarchical clustering is used to merge similar cluster strings into K clusters; if it is equal to K, then the data objects corresponding to the cluster strings can be directly used as initial cluster centers. A remote possibility arises when the number of cluster strings is less than K; in this case, it is assumed that the current value of K is not a true representative of the desired number of clusters. In our experiments we did not encounter such a situation, largely because it can happen only in the rare case in which the attribute values of all the different attributes cluster the data in the same way. These initial cluster centers, when used as seeds for the K-modes clustering algorithm, improve the accuracy of the traditional K-modes clustering algorithm that uses random cluster centers as its starting point. Since the proposed method provides a definitive choice of initial cluster centers (zero standard deviation), consistent and repeatable clustering results can be obtained. We also show that the initial cluster centers computed by using Prominent attributes perform better than those obtained with the Significant attribute selection approach and have the advantage of lower computational complexity. The initial cluster centers computed by the proposed approach are found to be very similar to the actual cluster centers of the data, which leads to faster convergence of the K-modes clustering algorithm and better clustering results. The performance of the proposed method is better than random initialization and better than or equal to that of the other two compared methods on all datasets except in one case. The biggest advantage of the proposed method is its worst-case log-linear time complexity and its fixed choice of initial cluster centers from dense localized regions, whereas the other two methods lack one of these properties.

When the number of desired clusters is not available in advance, we would like to extend the proposed multi-clustering approach for categorical data to find the natural number of clusters present in the data, in addition to computing the initial cluster centers for such cases. The present algorithm to compute Prominent attributes sometimes selects all the attributes in the data; however, our experiments indicate that considering fewer of the most relevant attributes is a better choice than choosing all attributes. We would like to investigate such cases further in future work. We would also like to extend the Significant attributes approach by ranking the attributes according to their significance in the final consensus building instead of taking a fixed number of attributes. In other words, while computing the similarity of cluster strings in the merging algorithm, more weight will be given to the clustering results computed using the more significant attributes.

References

Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. In L. M. Haas & A. Tiwary (Eds.), SIGMOD conference (pp. 94–105). ACM Press.
Ahmad, A., & Dey, L. (2007a). A k-mean clustering algorithm for mixed numeric and categorical data. Data & Knowledge Engineering, 63.
Ahmad, A., & Dey, L. (2007b). A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recognition Letters, 28, 110–118.
Ahmad, A., & Dey, L. (2011). A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets. Pattern Recognition Letters, 32, 1062–1069.
Anderberg, M. R. (1973). Cluster analysis for applications. New York: Academic Press.
Bai, L., Liang, J., Dang, C., & Cao, F. (2012). A cluster centers initialization method for clustering categorical data. Expert Systems with Applications, 39, 8022–8029.
Boley, D., Gini, M., Gross, R., Han, E.-H., Karypis, G., Kumar, V., et al. (1999). Partitioning-based clustering for web document categorization. Decision Support Systems, 27, 329–341.
Bradley, P. S., & Fayyad, U. M. (1998). Refining initial points for k-means clustering. In J. W. Shavlik (Ed.), ICML (pp. 91–99). Morgan Kaufmann.
Cao, F., Liang, J., & Bai, L. (2009). A new initialization method for categorical data clustering. Expert Systems with Applications, 36, 10223–10228.
Caruana, R., Elhawary, M. F., Nguyen, N., & Smith, C. (2006). Meta clustering. In ICDM (pp. 107–118). IEEE Computer Society.
Davidson, I., & Qi, Z. (2008). Finding alternative clusterings using constraints. In ICDM (pp. 773–778). IEEE Computer Society.
Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml.
Fred, A. L. N., & Jain, A. K. (2002). Data clustering using evidence accumulation. In ICPR (4) (pp. 276–280).
Gowda, K. C., & Diday, E. (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition, 24, 567–578.
Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th international conference on data engineering, 23–26 March 1999, Sydney, Australia (pp. 512–521). IEEE Computer Society.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1).
He, Z. (2006). Farthest-point heuristic based initialization methods for k-modes clustering. CoRR, abs/cs/0610043.
He, Z., Xu, X., & Deng, S. (2005). A cluster ensemble method for clustering categorical data. Information Fusion, 6, 143–151.
Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. In Research issues on data mining and knowledge discovery.
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2, 283–304.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Upper Saddle River, NJ, USA: Prentice-Hall.
Ji, J., Pang, W., Zhou, C., Han, X., & Wang, Z. (2012). A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data. Knowledge-Based Systems, 30, 129–135.
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. John Wiley.
Khan, S. S., & Ahmad, A. (2004). Cluster center initialization algorithm for k-means clustering. Pattern Recognition Letters, 25, 1293–1302.
Khan, S. S., & Ahmad, A. (2003). Computing initial points using density based multiscale data condensation for clustering categorical data. In Proceedings of the 2nd international conference on applied artificial intelligence.
Khan, S. S., & Ahmad, A. (2012). Cluster center initialization for categorical data using multiple attribute clustering. In E. Müller, T. Seidl, S. Venkatasubramanian, & A. Zimek (Eds.), Workshop proceedings of the 3rd MultiClust workshop: Discovering, summarizing and using multiple clusterings, USA (pp. 3–10).
Khan, S. S., & Kant, S. (2007). Computation of initial modes for k-modes clustering algorithm using evidence accumulation. In Proceedings of the 20th international joint conference on artificial intelligence (IJCAI) (pp. 2784–2789).
Matas, J., & Kittler, J. (1995). Spatial and feature space clustering: Applications in image analysis. In CAIP (pp. 162–173).
Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Density-based multiscale data condensation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 734–747.
Müller, E., Günnemann, S., Färber, I., & Seidl, T. (2010). Discovering multiple clustering solutions: Grouping objects in different views of the data. In G. I. Webb, B. Liu, C. Zhang, D. Gunopulos, & X. Wu (Eds.), ICDM (pp. 1220). IEEE Computer Society.
Petrakis, E. G. M., & Faloutsos, C. (1997). Similarity searching in medical image databases. IEEE Transactions on Knowledge and Data Engineering, 9, 435–447.
Ralambondrainy, H. (1995). A conceptual version of the k-means algorithm. Pattern Recognition Letters, 16, 1147–1157.
Sun, Y., Zhu, Q., & Chen, Z. (2002). An iterative initial-points refinement algorithm for categorical data clustering. Pattern Recognition Letters, 23, 875–884.
Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14, 249–260.
Wu, S., Jiang, Q., & Huang, J. Z. (2007). A new initialization method for clustering categorical data. In Proceedings of the 11th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD'07) (pp. 972–980). Berlin, Heidelberg: Springer-Verlag.