UIUCDCS-R-74-655

CLUSTERING BY CLIQUE GENERATION

by

Chih-Meng Cheng

June, 1974

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS 61801

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, June, 1974

Digitized by the Internet Archive in 2013
http://archive.org/details/clusteringbycliq655chen

ACKNOWLEDGEMENT

I wish to thank my advisor, Professor D. S. Watanabe, for his constant guidance, patience, and encouragement in the preparation of this thesis. Financial support from the Department of Computer Science of the University of Illinois is also gratefully acknowledged.

TABLE OF CONTENTS

1. INTRODUCTION
2. BRON-KERBOSCH ALGORITHM
   2.1 Analysis
   2.2 Bron-Kerbosch Algorithm
   2.3 Moon-Moser Graphs
3. IMPLEMENTATION
   3.1 Numerical Results
LIST OF REFERENCES
APPENDIX

1. INTRODUCTION

Given a set of objects each described by a vector of characteristics, a clustering technique groups those objects with similar characteristics together into subsets called clusters.
The similarity criterion uses an appropriate distance function measuring the distance between objects, which varies with the interpretation of the characteristic vector for the set.

Clustering techniques are useful in many areas. For example, they can be used in medicine to identify new diseases and to refine existing disease categories, in biology to develop taxonomies for plants and animals, and in archaeology to classify artifacts with respect to period and style.

There is no general method which always yields useful clusters for an arbitrary set of objects. Usually different techniques are tried, and often relevant clusters can be obtained through comparison of the results. Two of the most popular and effective techniques are the single-link method [5] and clique generation. In both methods, the set of objects is interpreted as an undirected graph. For a given distance function, we can define a threshold θ [...]

Each node in T is a complete subgraph, and each edge from a node α in level ℓ to a node β in level ℓ + 1 is labeled with the node of G added to α to form β. A clique is generated by traversing a path or sequence of edges which terminates in a clique. Obviously one way to generate all the cliques is to visit every node and traverse all the paths in T. This approach is time-consuming and wasteful because most of the paths lead to cliques already generated. Most early algorithms perform an ordered traversal of V, but this is still wasteful because subsets of already generated cliques are repeatedly constructed. Bron and Kerbosch used a cleverer approach and were able to eliminate certain paths by applying the ideas formalized in the following lemmas.

        1  2  3  4  5  6
   1    0  1  1  0  0  0
   2    1  0  1  1  0  0
   3    1  1  0  0  1  1
   4    0  1  0  0  0  0
   5    0  0  1  0  0  1
   6    0  0  1  0  1  0

Figure 2a. Adjacency Matrix A

Figure 2b. Graph G

   level 0   ∅
   level 1   {1}  {2}  {3}  {4}  {5}  {6}
   level 2   {1,2}  {1,3}  {2,3}  {2,4}  {3,5}  {3,6}  {5,6}
   level 3   {1,2,3}  {3,5,6}

Figure 2c. Clique Generation Graph T

Lemma 1.
Suppose the paths from node α in T beginning with the edge labeled with node a of G have been explored so that all cliques containing α ∪ {a} have been generated. Then only those paths from α beginning with edges labeled with nodes of G not adjacent to a need be explored.

Proof. Let C be any clique generated by exploring a path from α beginning with an edge labeled with a node adjacent to a. Then it must either contain or not contain nodes not adjacent to a. Suppose it contains such a node, say b. Then clearly it can be generated by exploring a path from α beginning with the edge labeled with node b, which is not adjacent to a. Suppose it contains no such nodes. It obviously contains α, and it must contain a since all its other nodes by assumption are adjacent to a. Therefore it has already been generated. Q.E.D.

Lemma 2. Suppose the paths from node α in T beginning with the edge labeled with node a of G have been explored so that all cliques containing α ∪ {a} have been generated. Then at any node β of T which properly contains α, those paths beginning with an edge labeled with a can be ignored.

Proof. Suppose a path from β beginning with an edge labeled with a is explored. It must clearly terminate in a clique containing α ∪ {a}. But by assumption, all the cliques containing α ∪ {a} have already been generated, and hence the path can be ignored. Q.E.D.

The Bron-Kerbosch algorithm is recursive; upon arriving at a node in level ℓ, the algorithm calls itself to explore the levels higher than ℓ. Lemma 1 is applied at each level. Upon first arriving at node α in T at level ℓ, the algorithm selects a node of G called FIXP that is adjacent to the most nodes adjacent to the partially constructed clique α, moves to the node α ∪ {FIXP}, and calls itself to construct all the cliques containing α ∪ {FIXP}. This choice of FIXP eliminates the maximum number of paths from α.
Upon returning to node α, the algorithm chooses a node of G called SEL that is adjacent to the nodes in α but not adjacent to FIXP, moves to the node α ∪ {SEL}, calls itself to construct all the cliques containing α ∪ {SEL}, and repeats this procedure for all such nodes. This process is illustrated for the graph G of Figure 2b in Figure 3.

   [ 1] Select 3 as FIXP at level 0, and move to {3}.
   [ 2] Select 1 as FIXP at level 1, and move to {1,3}.
   [ 3] Select 2 as FIXP at level 2, move to clique {1,2,3}, and back up to {3}.
   [ 4] Ignore 2 because 2 is adjacent to 1, FIXP at level 1.
   [ 5] Select 5 as SEL at level 1, and move to {3,5}.
   [ 6] Select 6 as FIXP at level 2, move to clique {3,5,6}, and back up to {3}.
   [ 7] Select 6 as SEL at level 1, and move to {3,6}.
   [ 8] Ignore 5 because 5 was selected at {3}, and back up to ∅.
   [ 9] Ignore 1, 2, 5, and 6 because 1, 2, 5, and 6 are adjacent to 3, FIXP at level 0.
   [10] Select 4 as SEL at level 0, and move to {4}.
   [11] Select 2 as FIXP at level 1, move to clique {2,4}, back up to ∅, and stop.

Figure 3. Application of the Bron-Kerbosch Algorithm

Lemma 1 cannot be used to eliminate all redundant edges. Note, for example, that the edge labeled 6 from node {3} in Figure 3 is traversed although it leads to the clique {3,5,6}, previously generated. However, lemma 2 can be used to eliminate the edge labeled 5 from node {3,6}, and the clique is not regenerated.

Lemma 1 is used only once at each level for the node FIXP. Conceivably it could be applied repeatedly for every node SEL at each level. However, if it is used to eliminate edges labeled with nodes adjacent to SEL, some previously ignored edges may have to be traversed. For example, if lemma 1 is applied at node {3} in Figure 3 for node SEL, when SEL is 5, the edge labeled 6 can be ignored, but only if the previously ignored edge labeled 2 is traversed.
Hence, the lemma should be used selectively so that the number of new edges to be traversed is less than the number of edges to be eliminated.

The elimination of all redundant edges requires additional tests. Thus if lemma 1 is applied at node {3} for node 5, the edge labeled 2 must be traversed to generate the clique {2,3,6}, if this clique exists. However, if a test reveals that nodes 2 and 6 are not adjacent, this clique cannot exist, and the edges labeled 2 and 6 can both be ignored.

These modifications were incorporated into the Bron-Kerbosch algorithm, but the performance of the algorithm was not improved because the time required to perform the additional tests was comparable to that required to traverse the redundant edges. Therefore it seems unlikely that the Bron-Kerbosch algorithm can be improved significantly.

2.2 Bron-Kerbosch Algorithm

The following formulation of the Bron-Kerbosch algorithm is very similar to Mulligan's formulation.

   ALGORITHM_BRON-KERBOSCH: PROCEDURE;
      DECLARE S the set of all data nodes,
              NIL the empty set,
              C a global integer variable,
              COMPSUB a global set of nodes;
              /* COMPSUB is a complete subgraph containing C nodes */

   STEP_1: /* Initially COMPSUB is empty, none of the nodes have been
              explored, and all nodes are candidates which can be added
              to COMPSUB. Hence call EXTEND with arguments NIL and S. */
      C = 0;
      COMPSUB = NIL;
      CALL EXTEND(NIL,S);

   EXTEND: PROCEDURE(EXPL,CAND) RECURSIVE;
      DECLARE EXPL a local set of nodes,
              /* EXPL = {a ∈ S | a adjacent to all the nodes ∈ COMPSUB,
                 and all cliques containing COMPSUB ∪ {a} have been
                 generated}. Nodes in EXPL are not added to COMPSUB
                 because this would lead to cliques previously
                 generated. */
              CAND a local set of nodes,
              /* CAND = {a ∈ S | a adjacent to all nodes ∈ COMPSUB,
                 a ∉ EXPL}, the set of nodes called candidates that can
                 be added to COMPSUB to form new complete subgraphs.
              */
              NEXPL a local set of nodes,
              /* NEXPL = {a ∈ EXPL | a adjacent to SEL}, the new set of
                 explored nodes constructed in STEP_3 of EXTEND for the
                 next recursive call to EXTEND. */
              NCAND a local set of nodes,
              /* NCAND = {a ∈ CAND | a adjacent to SEL, a ≠ SEL}, the
                 new set of candidates constructed in STEP_3 of EXTEND
                 for the next recursive call to EXTEND. */
              FIXP a local variable representing one node,
              /* FIXP is the first node ∈ EXPL ∪ CAND adjacent to the
                 most nodes ∈ CAND. */
              SEL a local variable representing one node;
              /* SEL is a node ∈ CAND selected to be added to COMPSUB. */

   STEP_2: /* Choose FIXP and SEL. */
      FIXP = first node ∈ EXPL ∪ CAND adjacent to the most nodes ∈ CAND;
      IF FIXP ∈ EXPL
         THEN SEL = first node ∈ CAND not adjacent to FIXP;
         ELSE SEL = FIXP;

   STEP_3: /* Add SEL to COMPSUB, increment C, and construct NEXPL and
              NCAND. Note that the number of candidates decreases for
              each call to EXTEND; hence EXTEND always returns. */
      NEXPL = {a ∈ EXPL | a adjacent to SEL};
      NCAND = {a ∈ CAND | a adjacent to SEL, a ≠ SEL};
      COMPSUB = COMPSUB ∪ {SEL};
      CAND = CAND - {SEL};
      EXPL = EXPL ∪ {SEL};
      C = C + 1;

   STEP_4: /* If NCAND and NEXPL are empty, a clique has been generated. */
      IF (NEXPL = NIL) & (NCAND = NIL)
         THEN print the clique of C nodes contained in COMPSUB;

   STEP_5: /* If NCAND is not empty, COMPSUB can be extended further. */
      IF NCAND ¬= NIL THEN CALL EXTEND(NEXPL,NCAND);

   STEP_6: /* Either NEXPL is not empty and NCAND is empty, which implies
              that a previously generated clique is being constructed,
              or a new clique has been printed, or a successful return
              from EXTEND has occurred. Hence back up by removing SEL
              from COMPSUB. If possible, select a new SEL and attempt to
              generate more cliques. Otherwise return. */
      C = C - 1;
      COMPSUB = COMPSUB - {SEL};
      IF there are nodes ∈ CAND not adjacent to FIXP
         THEN select the first such node as SEL and go to STEP_3;
         ELSE RETURN;
   END EXTEND;
   END ALGORITHM_BRON-KERBOSCH;

2.3 Moon-Moser Graphs

Bron and Kerbosch tested their algorithm on the Moon-Moser graphs [3], which contain more cliques per node than any other graphs. These graphs have 3k nodes grouped into k triplets, and each node is adjacent to every other node except the two nodes in the same triplet. The graph with 3k nodes contains $3^k$ cliques. They found that their algorithm constructed all the cliques in the Moon-Moser graph with 3k nodes in time of $O(3.14^k)$.

We can gain some insight into this result by counting the number of comparisons required to generate the cliques. For a Moon-Moser graph with 3k nodes grouped into the triplets {1,2,3}, {4,5,6}, ..., {3k-2, 3k-1, 3k}, the number of comparisons, $C_k$, is as follows.

   Operation                        Comparisons
   Find FIXP                        3k(3k-1)
   Construct lists EXPL,CAND        3k-1
   Call EXTEND                      $C_{k-1}$
   Find next SEL                    1
   Construct lists NEXPL,NCAND      3k-1
   Call EXTEND                      $C_{k-1}$
   Find next SEL                    1
   Construct lists NEXPL,NCAND      3k-1
   Call EXTEND                      $C_{k-1}$

This is the best case because the choices for SEL are 2 and 3, and finding the next SEL requires only one comparison. Summing these counts yields

   $C_k = 3C_{k-1} + 9k^2 + 6k - 1$.

This linear difference equation is easily solved giving

   $C_k = 3^k \sum_{i=1}^{k} (9i^2 + 6i - 1)\,3^{-i}$.

The worst case occurs when the choices for SEL are 3k-1 and 3k. In this case

   $C_k = 3^k \sum_{i=1}^{k} (9i^2 + 12i - 7)\,3^{-i}$.

As $k \to \infty$, both sums converge (to 17.5 and 19 respectively, whose base-3 logarithms are 2.6053 and 2.6801), yielding

   $C_k^{\mathrm{best}} \sim 3^{k+2.6053}$,
   $C_k^{\mathrm{worst}} \sim 3^{k+2.6801}$.

If the number of comparisons is an adequate measure of the time required by the algorithm, then the Bron-Kerbosch algorithm operates at the theoretical limit of $O(3^k)$.

3. IMPLEMENTATION

In most areas where clustering techniques are applied, large amounts of data are generally processed. Clique generation is often used as an important first step in classifying the data. To make clique generation practical for large data sets, it is essential to develop an efficient implementation of the fastest available algorithm, the Bron-Kerbosch algorithm.
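As a point of reference for the implementations discussed in this section, the procedure EXTEND of Section 2.2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the PL/I code of Mulligan or of the Appendix: Python sets stand in for the lists EXPL and CAND, the names bron_kerbosch, extend, moon_moser, and FIGURE_2B are introduced here for illustration, and "first node" is read as the lowest-numbered node.

```python
from typing import Dict, List, Set

def bron_kerbosch(adj: Dict[int, Set[int]]) -> List[List[int]]:
    """Return all cliques (maximal complete subgraphs) of an undirected graph.

    adj maps each node to the set of its neighbours (no self-loops).
    """
    cliques: List[List[int]] = []
    compsub: List[int] = []          # partially constructed clique

    def extend(expl: Set[int], cand: Set[int]) -> None:
        if not cand:
            return
        expl, cand = set(expl), set(cand)   # local copies, as in EXTEND
        # STEP_2: FIXP is the first node of EXPL U CAND adjacent to the
        # most candidates (max keeps the first of several tied nodes).
        fixp = max(sorted(expl | cand), key=lambda v: len(adj[v] & cand))
        # Lemma 1: only candidates not adjacent to FIXP are selected;
        # FIXP itself goes first when it is a candidate.
        todo = sorted(cand - adj[fixp])
        if fixp in cand:
            todo.remove(fixp)
            todo.insert(0, fixp)
        for sel in todo:
            # STEP_3: build the new lists and add SEL to COMPSUB.
            nexpl = expl & adj[sel]
            ncand = cand & adj[sel]
            compsub.append(sel)
            if not nexpl and not ncand:      # STEP_4: a clique
                cliques.append(sorted(compsub))
            elif ncand:                      # STEP_5: extend further
                extend(nexpl, ncand)
            # STEP_6: back up; SEL moves from CAND to EXPL (Lemma 2).
            compsub.pop()
            cand.remove(sel)
            expl.add(sel)

    extend(set(), set(adj))
    return cliques

def moon_moser(k: int) -> Dict[int, Set[int]]:
    """Moon-Moser graph on 3k nodes: adjacent unless in the same triplet."""
    nodes = range(1, 3 * k + 1)
    return {v: {w for w in nodes if (w - 1) // 3 != (v - 1) // 3}
            for v in nodes}

# Graph G of Figure 2b, read off the edges listed in Figure 2c.
FIGURE_2B = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 5, 6},
             4: {2}, 5: {3, 6}, 6: {3, 5}}

if __name__ == "__main__":
    print(bron_kerbosch(FIGURE_2B))
    print(len(bron_kerbosch(moon_moser(3))))
```

Applied to FIGURE_2B, the sketch generates the cliques {1,2,3}, {3,5,6}, and {2,4} in the order of Figure 3, and for the Moon-Moser graph with 9 nodes it generates 27 cliques.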
Bron and Kerbosch implemented their algorithm in Algol, while Mulligan implemented it in PL/I. Since their implementations are identical, we will restrict our attention to Mulligan's.

Mulligan's implementation is fairly fast. However, since enormous amounts of time are generally required to process large graphs, even a modest improvement in performance is of practical significance. In his implementation, EXPL and CAND are concatenated into a single vector of integers with a pointer indicating the boundary between the two lists. A selected candidate is transferred from the candidate list to the explored list by exchanging it with the first node in the candidate list and incrementing the pointer by one. This data structure has certain advantages. Additions to and deletions from the lists are simple, and the determination of the list contents is trivial. However, it complicates the execution of the principal operations of finding FIXP and constructing the lists NEXPL and NCAND. These operations must be performed serially node by node in loops. They could be speeded up if the lists were sorted, but this would require a more elaborate list structure, and additions to and deletions from the lists would be more complicated.

We observed that the principal operations can be written in terms of set intersections as follows:

   FIXP  = first node i ∈ EXPL ∪ CAND for which CAND ∩ {nodes ∈ S adjacent to i} has the maximum number of nodes,
   NEXPL = EXPL ∩ {nodes ∈ S adjacent to SEL},
   NCAND = CAND ∩ {nodes ∈ S adjacent to SEL}.

If the lists are represented by bit strings, then intersections of the lists can be computed rapidly using boolean operations which perform blocks of comparisons in parallel. Hence we chose to represent the lists of candidates and explored nodes and the rows of the adjacency matrix for an m-node graph by m-bit strings. This new data structure speeds up the principal operations, but it also creates new problems.
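The effect of the bit-string representation can be illustrated with Python integers used as m-bit strings. The sketch below is an illustration under stated assumptions, not the PL/I and Assembler code of the Appendix: node i is mapped to bit i-1 counted from the low-order end (PL/I numbers bit positions from the left), and the helper names mask, count_ones, one_positions, and choose_fixp are hypothetical.

```python
from typing import Dict, Iterable, List

def mask(nodes: Iterable[int]) -> int:
    """Bit string with a 1 bit for every node in `nodes` (node i -> bit i-1)."""
    bits = 0
    for v in nodes:
        bits |= 1 << (v - 1)
    return bits

def count_ones(bits: int) -> int:
    """Number of 1 bits in a bit string."""
    return bin(bits).count("1")

def one_positions(bits: int) -> List[int]:
    """Node numbers of the 1 bits, in ascending order."""
    nodes = []
    while bits:
        low = bits & -bits              # isolate the lowest 1 bit
        nodes.append(low.bit_length())  # its 1-based position
        bits ^= low                     # clear it
    return nodes

def choose_fixp(expl: int, cand: int, connected: Dict[int, int]) -> int:
    """First node of EXPL U CAND adjacent to the most candidates."""
    best, best_count = 0, -1
    for j in one_positions(expl | cand):
        ct = count_ones(connected[j] & cand)   # one AND per node
        if ct > best_count:
            best, best_count = j, ct
    return best

# Rows of the adjacency matrix of Figure 2a as bit strings.
CONNECTED = {1: mask([2, 3]), 2: mask([1, 3, 4]), 3: mask([1, 2, 5, 6]),
             4: mask([2]), 5: mask([3, 6]), 6: mask([3, 5])}

if __name__ == "__main__":
    cand, expl = mask(range(1, 7)), 0          # initial call: all candidates
    sel = choose_fixp(expl, cand, CONNECTED)   # FIXP is a candidate, so SEL = FIXP
    ncand = cand & CONNECTED[sel]              # NCAND = CAND ∩ neighbours of SEL
    nexpl = expl & CONNECTED[sel]              # NEXPL = EXPL ∩ neighbours of SEL
    print(sel, one_positions(ncand), one_positions(nexpl))
```

Each intersection is a single AND over the whole string, whereas counting the 1 bits and listing their positions both require a pass over the string.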
The determination of the names and the number of nodes in a list, formerly trivial, now is fairly difficult. It would defeat the purpose of the new data structure to perform these operations bit by bit in a high-level language. Hence we chose to implement these operations in a low-level language. An efficient subroutine can be written to count the one bits in a bit string using the IBM/360 logical instruction "translate and test", which maps a byte into a table. Unfortunately, there is no IBM/360 instruction which extracts the locations of the one bits in a string. However, a subroutine which rapidly extracts the one bit locations can be written using a fast register to register add instruction.

We implemented the basic algorithm in PL/I. A listing of the PL/I procedure EXTEND and the Assembler subroutines is presented in the Appendix.

3.1 Numerical Results

The new implementation was compared to Mulligan's on several graphs. Unfortunately, accurate timing results could not be obtained because of the local multiprogramming environment. Some typical results are shown in Tables 1 and 2.

Table 1 presents results for the Moon-Moser graphs with 3k nodes. Since the time estimates are contaminated with random errors, least squares approximations of the form ak + b were fitted to the logarithms of the times. These approximations indicate that the time required to generate $3^k$ cliques is proportional to $3.00^k$ for Mulligan's implementation and to $2.99^k$ for the new implementation. Given the timing errors, we can conclude that the actual execution time is probably proportional to $3^k$.

Table 2 presents results obtained using data from a color-shape preference test for preschool children. Each child's performance is described by a characteristic vector of 72 bits. Two performances were judged to be similar if θ or more bits in the corresponding vectors matched. We analyzed an 80 node graph summarizing the data for 80 children.
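The construction of such a graph from the characteristic vectors can be sketched as follows. The sketch assumes that two bits "match" when they are equal in the same position, which the text does not spell out, and the function names matches and threshold_graph are hypothetical. The vectors in VECTORS are made-up 8-bit values for illustration only, not the preschool data.

```python
from typing import Dict, List, Set

def matches(x: int, y: int, nbits: int) -> int:
    """Number of positions in which two nbits-bit vectors agree
    (assumed meaning of "matched")."""
    return nbits - bin(x ^ y).count("1")

def threshold_graph(vectors: List[int], nbits: int,
                    theta: int) -> Dict[int, Set[int]]:
    """Join objects i and j when at least theta bits of their vectors match."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if matches(vectors[i], vectors[j], nbits) >= theta:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# Hypothetical 8-bit characteristic vectors for three objects.
VECTORS = [0b11110000, 0b11100000, 0b00001111]

if __name__ == "__main__":
    for theta in (6, 1):
        print(theta, threshold_graph(VECTORS, 8, theta))
```

With these made-up vectors, a threshold of 6 joins only objects 0 and 1, while a threshold of 1 also joins objects 1 and 2.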
As the threshold θ decreases, the number of edges in the graph grows, and the number of cliques increases rapidly.

In both examples, the new implementation is superior to Mulligan's. Although the improvement in performance is relatively modest, it is significant in view of the high cost of analyzing large data sets.

    k        Time (Seconds)
           Mulligan    Present
    5        1.43        .84
    6        5.02       2.48
    7       14.13       7.43
    8       42.44      22.94
    9      129.82      66.34

Table 1. Moon-Moser Graphs with 3k Nodes

   Threshold   Number of Cliques      Time (Seconds)
                                    Mulligan    Present
      33              165             8.78       1.42
      31              315            14.39       2.87
      29              730            31.25       6.10
      27             2348            95.10      20.57
      25             7505           346.78      66.78

Table 2. Color-Shape Preference Test Graph

LIST OF REFERENCES

1. Augustson, J. G., and Minker, J., "An analysis of some graph theoretical cluster techniques," J. ACM 17, 571-588, 1970.
2. Bron, C., and Kerbosch, J., "Finding all cliques of an undirected graph," Comm. ACM 16, 575-577, 1973.
3. Moon, J. W., and Moser, L., "On cliques in graphs," Israel J. Math. 3, 23-28, 1965.
4. Mulligan, G. D., Algorithms for finding cliques of a graph, Technical Report 41, Department of Computer Science, University of Toronto, 1972.
5. Sibson, R., "Single-link cluster method," Comp. J. 16, 30-32, 1973.

APPENDIX

PL/I and ASSEMBLER PROGRAMS

 EXTEND: PROCEDURE(CAND,EXPL,N) RECURSIVE;
 /* RECURSIVE PROCEDURE GENERATING ALL POSSIBLE CLIQUES EXTENDED FROM
    THE PARTIAL SOLUTION IN "COMPSUB" USING "CAND"
    INITIALLY, CAND CONTAINS 1 BITS FOR ALL NODES PRESENT
               EXPL IS THE NIL OR ZERO BIT STRING
    GLOBALLY DEFINED VARIABLES ARE:
    PRNTFLG   - BIT 1 => CLIQUES ARE TO BE PRINTED
                BIT 0 => CLIQUES ARE NOT PRINTED JUST COUNTED IN NUMOUT
    NUMOUT    - COUNTER OF CLIQUES WHEN PRNTFLG IS BIT 0
    CONNECTED - N DIMENSIONAL VECTOR, EACH ELEMENT IS OF N BITS.
                ( ADJACENCY MATRIX )
                CONNECTED(J) - BIT STRING REPRESENTING NODES ADJACENT TO J
    VERTEX    - N DIMENSIONAL VECTOR LIKE CONNECTED
                VERTEX(J) IS OF N BITS WITH A 1 BIT IN THE JTH POSITION ONLY */

 /* ASSEMBLER SUBROUTINES */
 /* COUNT(NB,STR,CT) IS A SUBROUTINE THAT COUNTS UP THE NUMBER OF 1 BITS
    IN A BIT STRING:
    CT = # OF 1 BITS IN BIT STRING STR, WHERE STR IS OF LENGTH NB BYTES */
 DCL COUNT OPTIONS(ASM) ENTRY(FIXED BIN(15,0), BIT(*), FIXED BIN(15,0)),
 /* XTRACT(NB,STR,LIST,M) IS A SUBROUTINE THAT EXTRACTS THE POSITIONS OF
    1 BITS IN A BIT STRING
    LIST = LIST OF POSITIONS OF 1 BITS IN THE BIT STRING STR,
    WHERE STR HAS LENGTH NB BYTES, AND M = NUMBER OF ELEMENTS IN LIST */
     XTRACT OPTIONS(ASM) ENTRY(FIXED BIN(15,0), BIT(*),
                               (*) FIXED BIN(15,0), FIXED BIN(15,0)),
     CAND BIT(*),               /* CANDIDATES PASSED */
     EXPL BIT(*),               /* EXPLORED NODES PASSED */
     (NCAN,NEXP) BIT(N),        /* NEW CAND, NEW EXPL */
     NB FIXED BIN(15,0),        /* NO. OF BYTES IN BIT STRINGS */
     ASEL BIT(N),               /* LIST OF FUTURE SELECT NODES */
     FIXP FIXED BIN(15,0),      /* MOST CONNECTED NODE W.R.T. THE CANDIDATES */
     RLIST(N) FIXED BIN(15,0),  /* RETURN LIST FROM ASM ROUTINES */
     ZERO BIT(N),               /* ZERO BIT STRING */
 /* CT, CNT ARE COUNTERS OF 1 BITS
    SEL IS THE SELECTED NODE
    IFL IS A FLAG
    OTHERS ARE INDEXING VARIABLES */
 [...]
       CNT THEN                 /* FIND MOST CONNECTED NODE */
          DO;
             CNT=CT;
             FIXP=J;
             ASEL = ¬CONNECTED(J) & CAND;
          END;
    END;
 SKP1: IF ZERO=CAND THEN GO TO SKP2;
    CALL XTRACT(NB,CAND,RLIST,M);   /* EXTRACT NODES FROM CAND LIST */
    DO I=1 TO M;                    /* SEARCH THRU CAND LIST */
       J=RLIST(I);
       CALL COUNT(NB,CONNECTED(J) & CAND,CT);
       IF CT>CNT THEN
          DO;
             CNT=CT;
             FIXP=J;
             IFL=1;
             ASEL = ¬CONNECTED(J) & CAND;
          END;
    END;
 SKP2: CALL XTRACT(NB,ASEL,RLIST,M);
    IS=1;
    IF IFL=0 THEN                   /* IFL=0 => FIXP IS IN EXPL LIST */
       DO;
          /* SELECTED NODE MUST BE A CAND.
             INSTEAD OF FIXP, CHOOSE A CAND.
             NOT CONNECTED TO FIXP */
          SEL=RLIST(IS);
          IS=IS+1;
       END;
    ELSE SEL=FIXP;                  /* ELSE FIXP IS A CAND., CHOOSE FIXP */
    DO WHILE (IS <= M+1);           /* WHILE THERE ARE STILL SEL'S */
       /* CONSTRUCT NEW CAND, EXPL */
       NEXP = EXPL & CONNECTED [...] A CLIQUE IS FOUND */
       IF (NCAN | NEXP) = ZERO THEN
          IF PRNTFLG THEN
             PUT EDIT( (COMPSUB(I) DO I = 1 TO C) (SKIP,