Jurnal Ilmu Komputer dan Informasi (Journal of Computer Science and Information), 7/2 (2014), 61-66
DOI: http://dx.doi.org/10.21609/jiki.v7i2.258

DIVERSITY-BASED ATTRIBUTE WEIGHTING FOR K-MODES CLUSTERING

M. Misbachul Huda, Dian Rahma Latifa Hayun, and Annisaa Sri I.

Informatics Engineering, Information Technology Department, Institut Teknologi Sepuluh Nopember, Surabaya, East Java, Indonesia

E-mail: annisaaindrawanti@gmail.com, misbachul@mhs.if.its.ac.id

Abstract

Categorical data is a kind of data that is widely used in computation in computer science. To obtain information from categorical data, a clustering algorithm is needed, and many clustering algorithms have been proposed by researchers. One clustering algorithm for categorical data is k-modes. K-modes uses a simple matching approach, which relies on similarity values: two identical objects have similarity value 1, and 0 otherwise. In reality, each attribute takes several distinct values, and each value occurs with a different frequency. The similarity values 0 and 1 alone are not enough to represent the real semantic distance between a data object and a cluster. Thus, in this paper, we generalize the k-modes algorithm for categorical data by adding a weight and a diversity value for each attribute value to optimize categorical data clustering.

Keywords: categorical data, diversity, k-modes, attribute weighting.

1. Introduction

Computer science is always related to data. All computational processes need data, not only small data but also big data. There are two data types: numeric data and categorical data. Arithmetic operations can be applied to numeric data but not to categorical data. Categorical data is used in many systems and applications, for example in intrusion detection systems, population data, and customer information in online shopping. An example of categorical data in an intrusion detection system is the IP address: the IP address of each record is different and cannot be compared arithmetically. The same holds for population data and customer information. Population data contains categorical attributes such as gender, blood type, and home address, and customer information in online shopping contains categorical attributes such as the phone number and the bank used. Categorical data clustering considers the similarity or dissimilarity between data.
Similarity or dissimilarity can be measured by the distance between two data objects: the shorter the distance, the more similar the objects. Simple matching approaches can be used to calculate this distance; an example of a simple matching approach for categorical data is k-modes [7]. In an intrusion detection system, new intrusions appear that have not been seen before, so a clustering algorithm is needed to cluster and detect them. Clustering is likewise used on population data and customer data to obtain the needed information.

In [7], k-means and k-modes are combined to cluster mixed numeric and categorical data. K-means is a clustering algorithm for numeric data, i.e., for data that can be compared arithmetically. Since categorical data cannot be compared arithmetically, k-means cannot cluster it; the k-modes algorithm is used instead. K-modes uses a simple matching approach to measure dissimilarity during clustering, replacing means with modes. In k-means, the cluster centroid is updated using the means formula; in k-modes, the modes are updated during clustering according to the frequency of occurrence of the data, so as to minimize the cost function.

In [3], the k-modes algorithm is argued to have some deficiencies. In k-modes, two identical objects have similarity value 1, and 0 otherwise. According to [3], the simple matching values 0 and 1 are not good enough to represent the real semantic distance between a data object and a cluster. Thus, the authors of [3] proposed weights between 0 and 1 that represent the degree of similarity. The work in [8] focused on optimizing the k-modes algorithm for clustering categorical data with a new dissimilarity measure based on the rough set membership. According to that research, the shortcoming of the simple matching approach is that it yields weak intra-cluster similarity. To solve this issue, the frequency-based dissimilarity measure between two objects in [8] takes the distribution of attribute values over the whole data set into account. The study in [1] explains that low-frequency data and high-frequency data can sometimes have the same overlap similarity value. To address this, the approach in [1] uses the frequency distributions of the different attribute values to define the similarity between two categorical attribute values.

A dissimilarity measure that takes the frequency distribution of the different attribute values into account can also be applied to optimize categorical data dissimilarity [1]. For example, consider two data sequences, AAABB and ABCDD. Using the k-modes algorithm [8], both dissimilarity values are equal to 1. However, the diversity level of the first and the second sequence is different, because the frequency of each letter is different. Thus, this paper focuses on k-modes clustering with diversity-based attribute weighting, and its contribution is a diversity-based distance value that optimizes categorical data dissimilarity.
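To make the motivating example concrete, the following minimal Python sketch (our own illustration, not part of any of the cited algorithms) contrasts the per-position simple matching count of the two sequences with their diversity, i.e., the number of distinct values each contains:

```python
from collections import Counter

def simple_matching(x, y):
    # per-position simple matching: each mismatch contributes 1, each match 0
    return sum(1 for a, b in zip(x, y) if a != b)

def diversity(values):
    # number of distinct attribute values in a sequence
    return len(set(values))

d1, d2 = "AAABB", "ABCDD"
print(simple_matching(d1, d2))       # 4: the mismatch count ignores frequencies
print(diversity(d1), diversity(d2))  # 2 vs 4 distinct values
print(Counter(d1), Counter(d2))      # the letter frequencies differ as well
```

Simple matching sees only whether positions match, while the two sequences clearly differ in how diverse their values are; the measures reviewed and proposed below are built around this difference.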
2. Methods

Categorical Data

Categorical variables represent types of data which may be divided into groups. Analysis of categorical data generally involves the use of data tables. A two-way table presents categorical data by counting the number of observations that fall into each group for two variables, one divided into rows and the other divided into columns.

There is no intrinsic ordering to the categories of categorical data. For example, gender is a categorical variable having two categories (male and female) with no intrinsic ordering. Hair color is also a categorical variable having a number of categories (blonde, brown, brunette, red, etc.), and again there is no agreed way to order these from highest to lowest. Consequently, computing the similarity and dissimilarity between categorical data instances is not straightforward, owing to the fact that there is no explicit notion of ordering between categorical values. To tackle this, several data-driven similarity measures have been proposed for categorical data. In this section, we describe the characteristics of categorical data, but first we define the notation used in the later explanation.

Consider a categorical data set D containing N objects, defined over a set of l attributes, where A_i denotes the ith attribute, and let n_i be the number of values in attribute A_i. The following notation is used:

f_i(x) = the frequency of value x in attribute A_i in data set D.

p_i(x) = the probability that attribute A_i takes the value x in data set D, defined as equation (1):

p_i(x) = f_i(x) / N      (1)

p_i^2(x) = another probability measure of attribute A_i taking the value x in data set D, defined as equation (2):

p_i^2(x) = f_i(x)(f_i(x) − 1) / (N(N − 1))      (2)

Size of the data, N. Most measures are typically invariant to the size of the data.

Number of attributes, l. The experiments in [1] showed that the number of attributes does affect the performance of the outlier detection algorithm.

Number of values taken by each attribute, n_i. A data set may contain attributes that take many values and attributes that take only a few. A similarity measure might give more importance to one attribute while ignoring another.

Distribution of f_i(x). This refers to the distribution of the frequencies of the values taken by an attribute in the given data set. A similarity measure may give more priority to frequently occurring attribute values, or the other way around.

Similarity and Dissimilarity Measures for Categorical Data in K-Modes

In the classical theory of clustering in k-modes, either an element belongs to a cluster or it does not; the corresponding membership function is the characteristic function of the cluster, which takes values 1 and 0. Suppose that x, y ∈ D; then the simple matching dissimilarity measure in k-modes is defined by equation (3):

Dis(x, y) = 1 if x ≠ y, 0 otherwise      (3)

With this concept, the computed distance between two objects often results in clusters with weak intra-similarity and disregards the similarity embedded in the categorical values [1].
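Equations (1)-(3) translate directly into code. The following minimal Python sketch (our own illustration; the function names are ours) computes the two frequency-based probabilities for a single attribute column, together with the simple matching dissimilarity:

```python
from collections import Counter

def attribute_probabilities(column):
    """p_i(x) = f_i(x)/N, eq. (1), and
    p_i^2(x) = f_i(x)(f_i(x)-1)/(N(N-1)), eq. (2), for one attribute column."""
    N = len(column)
    f = Counter(column)                                      # f_i(x)
    p = {x: f[x] / N for x in f}                             # eq. (1)
    p2 = {x: f[x] * (f[x] - 1) / (N * (N - 1)) for x in f}   # eq. (2)
    return p, p2

def dis(x, y):
    """Simple matching dissimilarity, eq. (3)."""
    return 1 if x != y else 0

p, p2 = attribute_probabilities(["A", "A", "A", "B", "B"])
print(p)    # {'A': 0.6, 'B': 0.4}
print(p2)   # {'A': 0.3, 'B': 0.1}
print(dis("A", "B"), dis("A", "A"))  # 1 0
```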
To obtain clusters with strong intra-similarity, a new dissimilarity measure between the mode of a cluster and an object was introduced for the k-modes algorithm, based on the rough membership, which takes values between 0 and 1. Taking the frequency of mode components in the current cluster into account, Ng et al. [2] introduced a valuable dissimilarity measure for the k-modes clustering algorithm. Let P ⊆ A and a ∈ P; then Ng's dissimilarity measure Dis_P(z_l, x_i) between a categorical object x_i and the mode z_l of a cluster, with respect to P, is defined by equation (4):

Dis_P(z_l, x_i) = Σ_{a∈P} Dis_a(z_l, x_i)      (4)

where

Dis_a(z_l, x_i) = 1 if f(z_l, a) ≠ f(x_i, a), and 1 − m_a otherwise,

m_a = |{x_i | f(x_i, a) = f(z_l, a), x_i ∈ c_l}| / |c_l|,

and |c_l| is the number of objects in the lth cluster. For the k-modes algorithm with Ng's dissimilarity measure [2], the simple matching dissimilarity measure is still used in the first iteration. Therefore, Cao et al. [8] proposed a new dissimilarity measure by using Sim_a(x, y), defined in equation (5):

Sim_a(x, y) = (f(x, a) ≡ f(y, a)) / Σ_{z∈D} (f(x, a) ≡ f(z, a))      (5)

From equation (5), [8] introduced the new dissimilarity measure defined by equation (6):

NDis_P(z_l, x_i) = Σ_{a∈P} NDis_a(z_l, x_i)      (6)

where

NDis_a(z_l, x_i) = 1 − Sim_a(z_l, x_i) × m_a

As opposed to Ng's dissimilarity measure, the similarity Sim_a(z_l, x_i) between the object x_i and the cluster mode z_l is included in the proposed measure NDis_a(z_l, x_i). The authors of [3] introduced a weighted k-modes clustering algorithm, considering that the similarity value between two objects is not always 1 but can be a value between 0 and 1. A pair of objects (x_i, x_j) is considered more similar than a second pair (x_s, x_t) if and only if x_i and x_j exhibit a less common attribute value match in the population. In other words, the similarity among objects can be decided by the uncommonness of their attribute value matches. Based on that, [3] defined the "More Similar Attribute Set" of an attribute value a_j^(r) as equation (7):

M(a_j^(r)) = { a_j^(t) | f(a_j^(t) | D) ≤ f(a_j^(r) | D) }      (7)

where f(a_j^(t) | D) is the frequency count of attribute value a_j^(t) in the data set D. This is the set of attribute values with frequencies of occurrence lower than or equal to that of a_j^(r). Note that a value pair is more similar if it has a lower frequency of occurrence. The weighting function in [3] is defined as equation (8):

ω(a_j^(r)) = 1 − Σ_{a_j^(t) ∈ M(a_j^(r))} f(a_j^(t) | D)(f(a_j^(t) | D) − 1) / (n(n − 1))      (8)

where n is the number of objects in the data set D. When this function is used, less frequent values make larger contributions to the similarity value.
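To illustrate how the weighting of equations (7)-(8) behaves, consider a single attribute column. The following sketch is our own transcription of the equations, not code from [3]:

```python
from collections import Counter

def weight(value, column):
    """Attribute value weight of eqs. (7)-(8): one minus the summed pair
    probabilities of all values occurring no more often than `value`."""
    n = len(column)
    f = Counter(column)
    # M(a_j^(r)), eq. (7): values with frequency <= that of `value`
    more_similar = [v for v in f if f[v] <= f[value]]
    return 1 - sum(f[v] * (f[v] - 1) for v in more_similar) / (n * (n - 1))

column = ["A", "A", "A", "B"]
print(weight("A", column))  # 0.5: frequent value, lower weight
print(weight("B", column))  # 1.0: rare value, full weight
```

Consistent with the remark above, the rare value B receives a higher weight than the frequent value A.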
Diversity Index and Variable Uniqueness

Although the algorithms above can effectively improve the accuracy of the clustering results of the k-modes algorithm, neither the k-modes algorithm nor its modified versions can detect the diversity of the data in a given attribute of a data set. This is easy to see in the simple matching similarity measure, because the value is 1 if two values are identical and 0 otherwise. In wk-modes, the basic concept is essentially an individual Simpson diversity index (explained in the next part), and even with this information the algorithm cannot detect the diversity of the data in the data set. Consider Table 1 as a data sample for computing the dissimilarity between two objects. Let the number of attributes in each given data set be only 1, let the number of data sets be 7, and let the number of objects in each data set be 5. The table also lists the diversity value (Div) of each data set as its variable uniqueness.

TABLE 1
DATA SAMPLE

Dataset/object   x1   x2   x3   x4   x5   Div
D1               A    A    A    A    A    1
D2               A    A    A    A    B    2
D3               A    A    A    B    B    2
D4               A    A    B    B    C    3
D5               A    A    B    C    D    4
D6               A    B    C    C    C    3
D7               A    B    C    D    E    5

The value of Div shows the number of distinct values in a data set. We assume that using this Div value will increase the dissimilarity between clusters according to the diversity of the attribute values. From Table 1, simple matching gives a dissimilarity of 1 whenever two values differ, yet the data sets clearly differ in diversity. For AAAAA, the diversity value is 1 because there is just a single distinct value, whereas for ABCDE the diversity value is 5 because there are 5 different values. Based on this diversity, we propose a new dissimilarity measure for the distance between two distinct objects, and we use the diversity value to decrease the inter-cluster similarity.

The diversity index is originally a concept in biology: a mathematical measure of species diversity in a community. Diversity indices provide more information about community composition than simple species richness (i.e., the number of species present), because they also take the relative abundances of the different species into account [4]. Diversity indices thus provide important information about the rarity and commonness of species in a community, and the ability to quantify diversity in this way is an important tool for biologists trying to understand community structure.

One commonly used diversity index is the Simpson Diversity Index (SDI), whose basic concept is a frequency-based measurement of dissimilarity. In ecology, it is often used to quantify the biodiversity of a habitat, taking into account both the number of species present and the abundance of each species. Simpson's index D measures the probability that two individuals randomly selected from a sample belong to the same species (or some category other than species). The formula is defined by equation (9):

D = Σ n(n − 1) / (N(N − 1))      (9)

where n is the total number of organisms of a particular species and N is the total number of organisms of all species. Simpson's index gives more weight to the more abundant species in a sample; the addition of rare species causes only small changes in the value of D [5]. In clustering theory, the quantity n can be defined as the number of objects x_i in a data set D, and the quantity N as the number of all objects in data set D.
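Equation (9) can be transcribed directly. The sketch below (our own illustration) applies it to the rows of Table 1 and reproduces the Div column alongside the index:

```python
from collections import Counter

def simpson_index(values):
    """Simpson's index, eq. (9): the probability that two objects drawn
    at random from the sample share the same value."""
    N = len(values)
    counts = Counter(values)
    return sum(n * (n - 1) for n in counts.values()) / (N * (N - 1))

for row in ["AAAAA", "AAAAB", "AAABB", "AABBC", "AABCD", "ABCCC", "ABCDE"]:
    div = len(set(row))  # the Div (variable uniqueness) value of Table 1
    print(row, div, round(simpson_index(row), 2))
# AAAAA -> Div 1, index 1.0; ABCDE -> Div 5, index 0.0
```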
Diversity-Based Attribute Weighting for K-Modes Clustering

Taking the intra- and inter-cluster information into account, we introduce the new dissimilarity measure in equation (10):

Dis_P(x_i, y_j) = SDI_i × Div × (n_i / N) × SDI_j × Div × (n_j / N)      (10)

where SDI_i is the Simpson diversity index in data set D for object x_i, SDI_j is the Simpson diversity index in data set D for object y_j, Div is the number of categories in data set D, n_i is the frequency of x_i in a cluster, n_j is the frequency of y_j in a cluster, and N is the number of data in data set D. This formula is derived from the basic probability of an object x_i, defined in equation (11):

p(x_i) = N_{x_i} / N      (11)

where p(x_i) is the probability of x_i, N_{x_i} is the frequency of x_i, and N is the number of data in data set D. To measure the dissimilarity between two objects x_i and y_j, we modify the above formula as defined in equation (12):

Dis_P(x_i, y_j) = (N_{x_i} / N) × (N_{y_j} / N)      (12)

where Dis_P(x_i, y_j) is the dissimilarity of x_i and y_j in attribute P, N_{x_i} is the frequency of object x_i in data set D, N_{y_j} is the frequency of object y_j in data set D, and N is the number of data in data set D.

To increase the accuracy of the clustering process, we add the Simpson diversity index defined previously to the formula above. Our newly proposed dissimilarity measure is defined as equation (13):

Dis_P(x_i, y_j) = SDI_i × Div × (n_i / N) × SDI_j × Div × (n_j / N)      (13)

where SDI_i is the Simpson diversity index in data set D for object x_i, SDI_j is the Simpson diversity index in data set D for object y_j, Div is the number of categories in data set D, and N is the number of data in data set D.
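The proposed measure is a direct product of the quantities named in the where-clause of equation (13). A minimal sketch (ours; the arguments mirror the where-clause, and how each is obtained follows the definitions above):

```python
def diversity_dissimilarity(sdi_i, sdi_j, div, n_i, n_j, N):
    """Proposed measure, eq. (13).
    sdi_i, sdi_j: Simpson diversity indices for objects x_i and y_j in D
    div: number of categories (distinct values) in data set D
    n_i, n_j: frequencies of x_i and y_j in a cluster
    N: number of data objects in D."""
    return (sdi_i * div * n_i / N) * (sdi_j * div * n_j / N)
```

In this form, each object contributes a Simpson-weighted, diversity-scaled relative frequency, so objects drawn from more diverse value distributions are pushed further apart.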
3. Results and Analysis

In this section, we discuss the experimental results on the scalability and the clustering efficiency of the algorithms. In the first part of this section, we explain the experiment environment and the evaluation index. In the second part, we present the scalability results of the original k-modes and our proposed method. In the last part, we show the clustering efficiency results of four different modified k-modes algorithms, including our proposed method.

Experiment Environment and Evaluation Index

The experiments were run on a Core i3 (1.4 GHz) computer with 4 GB RAM and Windows 8.1 x64. The algorithms were implemented in Java. To evaluate the accuracy of the algorithms, we use equation (14):

acc = (Σ_{i=1}^{k} a_i) / n      (14)

where k is the number of known categories, a_i is the number of objects that lie in the right cluster C_i (1 ≤ i ≤ k), and n is the total number of objects.
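A minimal sketch of this accuracy computation (our own illustration; the cluster counts below are hypothetical):

```python
def clustering_accuracy(correct_per_cluster, n):
    """Accuracy of eq. (14): the objects correctly placed in each of the
    k clusters, summed and divided by the total number of objects n."""
    return sum(correct_per_cluster) / n

# hypothetical run: 3 clusters with 40, 35, and 20 correctly placed
# objects out of n = 100
print(clustering_accuracy([40, 35, 20], 100))  # 0.95
```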
Scalability Evaluation

To compare the scalability of the original k-modes and our proposed method, we use synthetic data with the number of objects varied between 10 and 5000. The synthetic data have 23 different attributes. To reduce the random effect of the k-modes algorithm, each scenario was run 100 times; every result shown in Table 2 is the average of the 100 runs. From these experiments, we obtain the scalability graph shown in Figure 1.

TABLE 2
SCALABILITY EXPERIMENTAL RESULT

Number of objects   Diversity-based time (ms)   Original time (ms)
10                  4                           3
50                  10                          10
100                 18                          18
500                 135                         124
1000                737                         585
5000                38010                       18984

Figure 1. Scalability graph of the original k-modes and the diversity-based k-modes.

The runtime results show that the diversity-based k-modes needs more time than the original k-modes to process a given data set. This is caused by the added step of computing the diversity of the data set.

Clustering Efficiency Evaluation

In this section, to compare the clustering efficiency, we use four different modified k-modes algorithms, including our proposed method: simple-matching k-modes, Zengyou's k-modes, wk-modes, and our newly proposed diversity-based k-modes. We use three different data sets from the UCI Machine Learning Repository [6]; Table 3 shows their characteristics.

TABLE 3
DATA SET CHARACTERISTICS

Dataset    Class number   Number of data   Number of attributes
Voting     2              435              16
Mushroom   2              8124             23
Soybean    4              47               36

In the experiments, missing values are ignored. To reduce the random effect in k-modes, every experiment was conducted 100 times. Table 4 shows the clustering efficiency of the four algorithms; every value is the average of the 100 runs.

TABLE 4
CLUSTERING EFFICIENCY RESULT

                    Voting   Mushroom   Soybean   Average
Original k-modes    0.8592   0.7381     0.8177    0.8050
wk-modes            0.8651   0.7905     0.8970    0.8509
Zengyou k-modes     0.8734   0.7644     0.8600    0.8326
Diversity k-modes   0.8861   0.7849     0.8870    0.8527

4. Conclusion

The k-modes algorithm is widely used for clustering categorical data, and dissimilarity and similarity measures play a crucial role in this area. In this paper, the limitation of the former algorithms in their use of between-cluster information is addressed by using the Simpson diversity index to extend the intra-similarity index. The experimental results show that our proposed algorithm gives better results on average.

References

[1] S. Boriah, V. Chandola, and V. Kumar, "Similarity Measures for Categorical Data: A Comparative Evaluation," Department of Computer Science and Engineering, University of Minnesota, 2008.
[2] M.K. Ng, M.J. Li, Z.X. Huang, and Z.Y. He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 503-507, 2007.
[3] Z. He, X. Xu, and S. Deng, "Attribute Value Weighting in k-Modes Clustering," Elsevier, 2011.
[4] M. Beals, L. Gross, and S. Harrell, http://www.tiem.utk.edu/~gross/bioed/bealsmodules/shannonDI.html, 2000, retrieved June 1, 2014.
[5] Offwell Woodland & Wildlife Trust, http://www.countrysideinfo.co.uk/simpsons.htm, 2000, retrieved June 1, 2014.
[6] UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html, 2009, retrieved June 1, 2014.
[7] Z.J. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304, 1998.
[8] F. Cao, J. Liang, D. Li, L. Bai, and C. Dang, "A Dissimilarity Measure for the k-Modes Clustering Algorithm," Elsevier, 2012.