UIUCDCS-R-80-1026                                        UILU-ENG 80 1718

KNOWLEDGE ACQUISITION THROUGH CONCEPTUAL CLUSTERING:
A Theoretical Framework and an Algorithm for Partitioning Data
into Conjunctive Concepts

Ryszard S. Michalski

May 1980

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

Supported in part by the National Science Foundation under grant no. NSF MCS 79-06614.

    Then he took the seven loaves and the fish, and when he had given
    thanks, he broke them and gave them to the disciples, and they in
    turn to the people.
                                                          Matthew 15:36

1. INTRODUCTION

Clustering is the intelligent partitioning of a collection of entities. Specifically, it is the process of dividing entities (objects, observations, measurements, data, etc.) into categories that are meaningful or useful for some purpose. It is one of the fundamental operations people use to simplify descriptions of their environment and, by that, to improve the efficiency of their decision making.
Appropriate clustering reveals the underlying structure of the given set of objects, and hence clustering can be viewed as a form of knowledge acquisition. Clustering problems pervade many fields, particularly experimental sciences such as biology, chemistry, geology, and medicine. Intelligent partitioning of objects can also be an important capability of autonomous or semi-autonomous robots designed for the exploration of special environments (e.g., the bottom of an ocean or the surface of a planet). Consequently, understanding the nature of clustering is not only of scientific interest, but also of significant practical importance.

A conventional view of clustering is that it is a process of partitioning objects into groups such that the degree of similarity (or "natural association") is high among objects of the same group, and low among objects of different groups. The notion of the degree of similarity is therefore fundamental to this viewpoint. A great variety of similarity measures have been developed and used in various clustering techniques. Frequently the reciprocal of a distance measure is used as a similarity function. The distance measure used for such purposes, however, does not have to satisfy all the postulates of a distance function (specifically, the triangle inequality). A comprehensive review of various distance and similarity measures is provided in Diday and Simon [1] and Anderberg [2]. Backer [3] describes a fuzzy similarity measure based on the theory of fuzzy sets.

To determine the similarity of objects, a measure of similarity is applied to symbolic descriptions of the objects (data points). Such descriptions are typically vectors whose components represent scores on selected qualitative or quantitative variables used to describe the objects. The underlying assumption is that if the similarity function has a high value for the given descriptions, then the objects represented by the descriptions are similar.
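To make the conventional view concrete, here is a minimal Python sketch of a context-free similarity function of the kind described above; the Euclidean base distance and the particular reciprocal form are illustrative assumptions, not a construction from the report:

```python
import math

def distance(a, b):
    """A conventional, context-free distance: depends only on the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Similarity taken as a reciprocal of the distance."""
    return 1.0 / (1.0 + distance(a, b))   # the 1 + d form avoids division by zero

# Closer points receive a higher similarity score.
a, b, c = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
print(similarity(a, b) > similarity(a, c))  # True
```

Any monotonically decreasing transform of the distance would serve equally well as a similarity function here.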
The similarity relationship between any two objects in the population to be clustered is thus reduced to a single number: the value of the similarity function applied to the symbolic descriptions of the objects.

Conventional measures of distance are "context-free," i.e., the distance between any two data points A and B is a function of these points only, and does not depend on the relationship of these points to other data points:

    Similarity(A,B) = f(A,B)                                            (1)

For example, for any conventional distance measure, the distance between points A and B is the same as between B and C (Fig. 1).

    Fig. 1. An illustration of the context-free distance. [figure omitted]

Recently some authors have introduced "context-sensitive" measures of similarity:

    Similarity(A,B) = f(A,B,E)                                          (2)

where the similarity between A and B depends not only on A and B, but also on the relationship of A and B to other data points, represented in (2) by E. For example, Gowda and Krishna [4] defined the so-called "mutual neighborhood" distance measure: if point A is the n-th closest point to B, and B is the m-th closest point to A, then the mutual neighborhood distance between A and B is n + m. These authors have demonstrated that a method using such a distance measure can solve some clustering problems which methods based on the "context-free" distance cannot.

Both of the above approaches cluster data points solely on the basis of knowledge of the individual data points. Therefore such methods are fundamentally unable to capture a "Gestalt property" of objects, i.e., a property which is characteristic of certain configurations of points considered as a whole, rather than as a collection of independent points. In order to detect such properties, the system must know not only the data points, but also certain "concepts." To illustrate this point, let us consider the problem of clustering the data points in Fig. 2. A person considering the problem in Fig.
2 would typically describe it as "a circle on top of a rectangle."

    Fig. 2. An illustration of conceptual clustering. [figure omitted]

Thus, the points A and B, although very close, are placed in separate clusters. Here, the human solution involves partitioning the data points into groups not on the basis of pairwise distances between points, but on the basis of "concept membership": points are placed in the same cluster if together they represent the same concept. In our example, the concepts are a circle and a rectangle.

The approach to clustering which places objects into groups representing a priori defined conceptual entities is called "conceptual clustering." A link between conceptual clustering and distance-based clustering methods can be established by stating that in conceptual clustering the similarity between data points is a function of these points, the context E, and a set of predefined concepts C:

    Similarity(A,B) = f(A,B,E,C)                                        (3)

The approach was introduced by Michalski [5]. It evolved from earlier work by the author and his collaborators on the problem of generating "uniclass covers." Such covers are disjunctive descriptions of a class of objects learned from only positive examples of the class. Stepp [6] describes a computer program and various experimental results on determining uniclass covers. His work is concerned with what can be called "free"* conceptual clustering.

The idea that similarity measures of type (1) or (2) (the "concept-free" measures) may be inadequate for some clustering problems is not new. In the past, several authors noticed this problem and proposed various solutions. For example, Watanabe [7,8] proposed the concept of "cohesion," which utilizes an entropy measure, to measure the "degree of clusterness" of points.

*In "free" clustering the number of clusters is not predefined, as opposed to "constraint" clustering, where the number of clusters is assumed a priori.
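By contrast with the context-free measure, the "mutual neighborhood" distance of Gowda and Krishna, described earlier, can be sketched as follows. This is a simplified illustration, not the authors' code; the helper `rank` and the one-dimensional example are our assumptions:

```python
def mutual_neighborhood_distance(a, b, points, dist):
    """If a is the n-th nearest neighbor of b, and b is the m-th nearest
    neighbor of a, their mutual neighborhood distance is n + m."""
    def rank(p, q):
        # 1-based rank of q among the other points, ordered by distance from p
        others = sorted((x for x in points if x != p), key=lambda x: dist(p, x))
        return others.index(q) + 1
    return rank(a, b) + rank(b, a)

points = [0.0, 1.0, 1.2, 5.0]
d = lambda p, q: abs(p - q)
# 1.2 is the 2nd neighbor of 0.0, and 0.0 is the 2nd neighbor of 1.2: MND = 4
print(mutual_neighborhood_distance(0.0, 1.2, points, d))  # 4
```

Note how the result depends on the other points in the set: adding or removing points changes the neighbor ranks, and hence the distance, which is exactly the context-sensitivity of form (2).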
Using this concept he was able to resolve the "three girls in the dormitory" paradox, which cannot be solved by "concept-free" methods. Other measures of "cohesiveness" of objects have been proposed on the basis of graph-theoretic considerations, e.g., Matula [9], Auguston and Minker [10], Zahn [11], and Cheng [12].

This paper presents a theoretical basis and an algorithm for conceptual clustering in which the conceptual entities are conjunctive statements in the variable-valued logic calculus VL1 [13] (a typed many-valued logic extension of propositional calculus). These statements, called VL1 complexes, are logical products of relational statements involving discrete variables with an arbitrary number of values (Definitions 2 and 3 in the next section). Complexes have a simple linguistic interpretation and are able to express concisely a large class of relationships among discrete variables. The algorithm combines the methodology of optimization of variable-valued logic expressions [14] with the dynamic clustering method [1]. Its theoretical foundation is a special property of complexes formulated as the Sufficiency Principle (section 3).

2. COMPLEXES AS CONCEPTUAL ENTITIES FOR CLUSTERING: BASIC DEFINITIONS

Let x_1, x_2, ..., x_n denote discrete variables which are selected to describe objects in the population to be clustered. For each variable a value set, or domain, is defined, which contains all possible values this variable can take for any object in the population. We shall assume that the value sets of the variables x_i, i = 1, 2, ..., n, are finite, and therefore can be represented as:

    D_i = {0, 1, ..., h_i},   i = 1, 2, ..., n                          (4)

In general, the value sets may differ not only in size, but also in the structure relating their elements (reflecting the scale of measurement). In this paper we restrict ourselves to the case of nominal or linear variables (i.e., variables with unordered or linearly ordered domains, respectively).
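The finite domains above determine a finite event space (defined formally below). A small sketch in Python; the particular domain sizes are illustrative only:

```python
from itertools import product

# Three variables with domains D_i = {0, 1, ..., h_i}:
h = [2, 1, 3]                          # h_1 = 2, h_2 = 1, h_3 = 3
domains = [range(hi + 1) for hi in h]  # each domain has d_i = h_i + 1 values

# The space of all value combinations is the Cartesian product of the
# domains; its size is d = d_1 * d_2 * ... * d_n.
event_space = list(product(*domains))
d = 1
for dom in domains:
    d *= len(dom)

print(d, len(event_space))  # 24 24
```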
A sequence of values of the variables x_1, x_2, ..., x_n is called an event:

    e = (r_1, r_2, ..., r_n)                                            (5)

where r_i ∈ D_i, i = 1, 2, ..., n. The set of all possible events, ℰ, is called the event space:

    ℰ = {e_i},   i = 1, 2, ..., d                                       (6)

where d = d_1·d_2·...·d_n (the size of the event space) and d_i = h_i + 1.

Definition 1. Given two events e_1, e_2 in ℰ, the syntactic distance δ(e_1,e_2) between e_1 and e_2 is defined as the number of variables which have different values in e_1 and e_2.

Definition 2. A relational expression

    [x_i # R_i]                                                         (7)

where R_i, called the reference set, is one or more elements from the domain D_i, and # stands for one of the relational operators =, ≠, ≥, ≤, is called a VL1 selector* or, briefly, a selector.

*VL1 stands for the variable-valued logic system VL1 [13], which uses such selectors.

Here are a few examples of selectors, in which the variables and their values are represented by linguistic terms:

    [height = tall]
    [color = blue, red]      (read: color is blue or red)
    [length ≥ 2]
    [size ≠ medium]
    [weight = 2..5]

The operator .. in the last selector denotes the range of values from 2 to 5, inclusive. It is used when the domain of the variable is a linearly ordered set. A selector [x_i # R_i] is said to be satisfied by an event e = (x_1, x_2, ..., x_n) if the value of x_i in e is in relation # with some element of R_i.

Definition 3. A logical product of selectors is called a VL1 term:

    ∧_{i∈I} [x_i # R_i]                                                 (8)

where I ⊆ {1, 2, ..., n} and R_i ⊆ D_i. The set of events which satisfy a VL1 term is called a VL1 complex or, briefly, a complex. Thus a VL1 term is a formal representation of a complex. Since these two notions are in one-to-one correspondence, we will use them interchangeably, unless this leads to confusion. Therefore, if set-theoretic notation is applied to a term, it means that the operation is applied to the corresponding complex (i.e., the set of events satisfying the term). A complex (VL1
term) α is said to cover an event e if the values of the variables in e satisfy the relational statements (selectors) in the complex (term). For example, the event e = (2,7,0,1,5,4,6) satisfies the complex [x_1 = 2,3][x_3 ≤ 3][x_5 = 3..8].

Let E be a set of events in ℰ which are the data points to be clustered. The events in E are called data events (or observed events), and the events in ℰ \ E (i.e., events in ℰ which are not data events) are called empty events (or unobserved events). Let α be a complex which covers some data events and some empty events.

Definition 4. The number of empty events covered by α is called the sparseness of α and is denoted s(α).

Let p(α) denote the number of data events covered by α, and t(α) the total number of events covered by α. We then have t(α) = p(α) + s(α). The total number of events satisfying the complex α = ∧_{i∈I} [x_i # R_i] is:

    t(α) = ∏_{i∈I} c(R_i) · ∏_{i∉I} d_i                                 (9)

where I ⊆ {1, 2, ..., n}, c(R_i) is the cardinality of R_i, and d_i is the cardinality of the value set of variable x_i.

Definition 5. The degree of generality g(α) of a complex α is defined as:

    g(α) = log (t(α)/p(α)) = log (1 + s(α)/p(α))                        (10)

The value t(α)/p(α) specifies how many events are in the complex per data event. Thus, the degree of generality g(α) specifies the uncertainty of the location of the data points in the complex: the greater the degree of generality of a complex, the greater the uncertainty. If g(α) = 0, then all the events in the complex are data events. We can see from (10) that for a fixed p(α) the degree of generality is a monotonically growing function of the sparseness.

Let L be a set of complexes (or events), and let R_i be the set of all the distinct values which variable x_i takes in these complexes (or events).

Definition 6. The operation which transforms L into the complex ∧_{i=1}^{n} [x_i = R_i] is called reference union, or refunion.
The resulting complex is called the minimal covering complex, or mc-complex, for L, and is denoted RU(L). If any R_i = D_i, then the corresponding selector is removed from the complex. The refunion is thus a transformation which transforms a set of complexes (or events) into the minimal covering complex.

Theorem 1. The mc-complex of an event set has the minimum sparseness among all complexes covering this set.

Proof: Let α be the mc-complex for an event set E:

    α = RU(E) = ∧_{i=1}^{n} [x_i = R_i]                                 (11)

where R_i ⊆ D_i (the domain of x_i). Suppose that β = ∧_{i=1}^{n} [x_i = P_i] is a complex which covers E and has a smaller sparseness than α. If this is true, then there must exist a P_i such that P_i ⊂ R_i. But R_i, according to Definition 6, contains all values that x_i takes in the events of E. Therefore, if P_i ⊂ R_i, then the complex β could not possibly cover all events in E, which is a contradiction. ∎

Let E_α be the set of data events covered by a complex α.

Definition 7. The set E_α is called the core of α, and the complex α* = RU(E_α) is called the trimmed α. From Theorem 1 we have α* ⊆ α.

Theorem 2. If E_1 and E_2 are two disjoint event sets, then:

    s(RU(E_1)) + s(RU(E_2)) ≤ s(RU(E_1 ∪ E_2))                          (12)

Proof: According to Theorem 1, RU(E_1) and RU(E_2) have the smallest possible sparseness among all complexes covering E_1 and E_2, respectively. Since E_1 and E_2 are disjoint, (12) must hold. ∎

The property expressed by Theorem 2 has an analogy in statistical clustering, where with an increasing number of clusters the "fit" between each cluster and the probability distribution fitted to the cluster also increases.

Theorem 3. Let α_1 and α_2 be two intersecting complexes whose union covers an event set E. Let E_1 (E_2) denote the set of events in α_1 (α_2) which are covered only by this complex (the relative core of the complex). Let α_1' and α_2' be any two disjoint complexes covering the same event set E.
If RU(E_1) and RU(E_2) are disjoint complexes, then:

    s(RU(E_1)) + s(RU(E_2)) ≤ s(α_1') + s(α_2')                         (13)

Proof: The theorem is an immediate consequence of Theorem 2 and the premise that α_1' and α_2' are disjoint complexes. ∎

We will next introduce two basic concepts for the conceptual clustering algorithm presented in section 6: the star of an event against an event set, and a cover of an event set against another event set.

Let F be a proper subset of the event space ℰ, and e an event outside of F, i.e., e ∉ F.

Definition 8. The star G(e|F) of e against F is the set of all maximal under inclusion complexes covering the event e and not covering any event in F. (A complex α is maximal under inclusion with respect to a property P if there does not exist a complex α' with property P such that α ⊂ α'.)

Let E_1 and E_2 be two disjoint event sets, E_1 ∩ E_2 = ∅.

Definition 9. A cover COV(E_1|E_2) of E_1 against E_2 is any set of complexes {α_j}, j ∈ J, such that for each event e ∈ E_1 there is a complex α_j, j ∈ J, covering it, and none of the complexes α_j covers any event in E_2. Thus we have:

    E_1 ⊆ ∪_{j∈J} α_j ⊆ ℰ \ E_2                                         (14)

A cover in which all complexes are pairwise disjoint sets is called a disjoint cover. If the set E_2 is empty, then the cover COV(E_1|E_2) = COV(E_1|∅) is simply denoted COV(E_1).

Definition 10. The sparseness (the degree of generality) of a cover is defined as the sum of the sparsenesses (the degrees of generality) of the complexes in the cover.

3. SUFFICIENCY OF COMPLEXES AS CLUSTER REPRESENTATIONS

First, we observe the following property of complexes:

Theorem 4. For any given event space ℰ and integer k ≤ d_1·d_2·...·d_n (where d_i is the cardinality of the value set of variable x_i), there exist k pairwise disjoint complexes α_1, α_2, ..., α_k which completely fill up the space ℰ, i.e.
    ∪_{j=1}^{k} α_j = ℰ                                                 (15)

Proof: The theorem is equivalent to saying that any event space can be partitioned into an arbitrary number of complexes (not larger, of course, than the cardinality of ℰ). To see this, take any subset I of the variables such that the arithmetic product of the corresponding d_i's is greater than or equal to k. Let R_j, j = 1, 2, ..., denote all possible sequences of values of the variables x_i, i ∈ I. Construct the complexes:

    α_j = ∧_{i∈I} [x_i = r_ij],   j = 1, 2, ...                         (16)

where r_ij, i ∈ I, denotes the value of variable x_i in the sequence R_j. Obviously, the complexes α_j are pairwise disjoint and fill up the space ℰ. If their number k' is greater than k, then k' − k complexes are joined with the remaining ones into single complexes, according to the formula:

    β[x_i = a] ∨ β[x_i = b] = β[x_i = a,b]                              (17)

where β denotes a conjunction of selectors involving variables other than x_i. This is always possible because, for any x_i, i ∈ I, there are d_i complexes α_j which differ only in the value of x_i. ∎

From the viewpoint of clustering, a more interesting question is whether for any given event set E in the space ℰ there always exists an arbitrary number k ≤ c(E) of pairwise disjoint complexes such that they not only fill up the space ℰ, but also partition the set E into k non-empty subsets. A positive answer to this question would imply that any given event set can be partitioned into an a priori assumed number of subsets, each covered by a simple complex disjoint from the other complexes. The answer is indeed positive. In fact, an even stronger property holds, as stated by the following theorem.

Theorem 5. (The Sufficiency Principle) For an event space ℰ and any data event set E = {e_1, e_2, ..., e_k}, E ⊆ ℰ,
there exists at least one set of k pairwise disjoint complexes α_1, α_2, ..., α_k such that each complex contains one data event:

    e_j ∈ α_j,   j = 1, 2, ..., k                                       (18)

and the union of the complexes fills up the space ℰ:

    ∪_{j=1}^{k} α_j = ℰ                                                 (19)

Proof: The basic idea of the proof is to show that for any E = {e_1, e_2, ..., e_k}, E ⊆ ℰ, it is always possible to construct a tree in which the nodes are assigned variables x_i, i ∈ {1, 2, ..., n}, the branches of a node x_i are assigned elements of a partition of D_i (the value set of x_i), and the leaves represent complexes α_j, such that each complex covers a single event e_j, and the union of the complexes fills up the space ℰ.

Suppose e_j = (x_1j, x_2j, ..., x_nj), j = 1, 2, ..., k, where x_ij ∈ D_i. Take any variable, say x_p, which has different values for the events in E. Suppose these values are a_1, a_2, ..., a_z. Partition the value set D_p of x_p into the subsets {a_1}, {a_2}, ..., {a_{z-1}}, A_z, where a_z ∈ A_z and A_z = D_p \ {a_1, a_2, ..., a_{z-1}}. It is obvious that the complexes [x_p = a_1], [x_p = a_2], ..., [x_p = A_z] partition both the event set E and the event space ℰ into z non-empty subsets. Suppose these complexes partition E into E_a1, E_a2, ..., E_az, and ℰ into ℰ_a1, ℰ_a2, ..., ℰ_az, where E_ai ⊆ ℰ_ai.

Variable x_p is assigned to the root of a tree. The branches from the root are assigned the values a_1, a_2, ..., A_z. The leaves of this tree correspond to the complexes [x_p = a_1], [x_p = a_2], ..., [x_p = A_z], covering the event sets E_a1, E_a2, ..., E_az, respectively (Fig. 3).

    Fig. 3. Constructing a tree for the proof of the Sufficiency Principle. [figure omitted]

For every one of the above event sets which has more than one element, repeat the above process with the following modification. Suppose E_a1 has more than one element and x_r takes the values b_1, b_2, ..., b_y for the events in E_a1.
Assign x_r to the root of a new tree, and attach this tree to the leaf corresponding to E_a1 (i.e., to the leaf marked [x_p = a_1] in Fig. 3). Assign the branches emanating from this root the values b_1, b_2, ..., B_y, where B_y = D_r \ {b_1, b_2, ..., b_{y-1}}. It is obvious that the complexes:

    [x_p = a_1][x_r = b_1], [x_p = a_1][x_r = b_2], ..., [x_p = a_1][x_r = B_y]     (20)

partition both the set E_a1 and the set ℰ_a1 into y disjoint subsets. This process is continued until the leaves of the obtained tree correspond to complexes, each of which covers only one event from E. Because every step of this process partitions the events in E and in ℰ simultaneously, the union of the obtained complexes covers E and fills up the whole space ℰ. Thus, these complexes constitute the desired set. ∎

This theorem asserts that the space of all complexes is sufficient as a space of cluster representations, because any event set can be clustered into an arbitrary number of complexes. The theorem is used as the theoretical basis for the clustering algorithm described in section 6. As the above proof indicates, there will usually be many covers which constitute a k-partition of a given event set. Therefore, a question arises as to which cover to select as the most desirable. In order to answer this question, a criterion of the quality of a cover is needed.

4. A CRITERION FOR EVALUATING THE QUALITY OF A CLUSTERING

Let E be the set of data points, and COV(E) a disjoint cover of E. Such a cover implies a partition of E into clusters, each cluster being the event set contained in one complex. The sparseness (or the degree of generality) of the cover could be used to define a criterion of quality of a partition. However, if E is partitioned into individual events, then obviously the sparseness (as well as the degree of generality) will be zero. Consequently, this kind of criterion can be used only if the number of clusters is assumed a priori, i.e., for a constrained clustering problem. In this case the problem is to find a disjoint cover of E with k complexes whose sparseness (or degree of generality) is minimum.

In the case of a free clustering problem (i.e., when the number of clusters is not assumed a priori), a criterion of quality of a partition has to involve, in addition to the sparseness (or the degree of generality), some "cost" function dependent on the number of clusters, e.g., a measure of the complexity of a cover. In this paper we are concerned only with the constrained clustering problem. Although it may seem otherwise, this is not a serious limitation, because practically interesting solutions of clustering problems should not produce more than just a few clusters (when the number of clusters is large, humans prefer to organize them into a hierarchy). Consequently, to obtain a general solution, a constrained clustering algorithm should be repeated for several different values of k, and the best obtained partition selected as the general solution.

The sparseness (or the degree of generality) may not be sufficient as the sole criterion for selecting a cover. One may seek a cover which exhibits properties other than minimum sparseness. In order to use several criteria simultaneously for selecting a cover, we adopt the lexicographic cost functional defined in [14]. A lexicographic evaluation functional (LEF) is defined as a pair of two lists:

    Λ = <a-list, T-list>                                                (21)

where

    a-list = (a_1, a_2, ..., a_ℓ) is a list of attributes used to evaluate a cover,
    T-list = (τ_1, τ_2, ..., τ_ℓ) is a list of "tolerances" assigned to the
             attributes a_i, respectively, 0 ≤ τ_i ≤ 1.
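As an illustration of how such a pair of lists acts as a comparison procedure, the following sketch compares two attribute vectors lexicographically, treating differences within an (absolute) tolerance as ties. The function name, and the assumption that smaller attribute values are better and that the relative tolerances have already been converted to absolute ones, are ours, not the report's:

```python
def lef_prefers(u, v, tolerances):
    """Compare attribute vectors u and v lexicographically, treating a
    difference within the corresponding tolerance as a tie.  Returns True
    if u is strictly preferred to v (smaller values assumed better)."""
    for a, b, t in zip(u, v, tolerances):
        if abs(a - b) <= t:
            continue            # within tolerance: move on to the next attribute
        return a < b            # first significant difference decides
    return False                # the two vectors are equivalent under the tolerances

# Sparseness 3.0 vs 3.4 ties within tolerance 0.5, so the second attribute
# (e.g. intersection) decides the comparison:
print(lef_prefers((3.0, 1.0), (3.4, 2.0), (0.5, 0.0)))  # True
```

With all tolerances zero, this reduces to the ordinary lexicographic order on the attribute vectors.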
, for a constrained clustering problem. In this case the problem is to find a disjoint cover of E with k complexes, whose sparseness (or the degree of generality) is minimum. In the case of a free clustering problem (i.e., when the number of clusters is not assumed a priori), a criterion of quality of partitioning has to - 18 - involve, in addition to sparseness (or the degree of generality), also some "cost" function dependent on the number of clusters, e.g., a measure of complexity of a cover. In this paper we are concerned only with the constraint clustering problem. Although it may seem otherwise, this is not a serious limitation because interesting practical solutions of clustering problems should not produce more than just a few clusters (this is so, because when the number of clusters is large, humans prefer to organize them into an hierarchy). Consequently, to obtain a general solution, a constraint clustering algorithm should be repeated for several different k, and the best obtained partition selected as the general solution. The sparseness (or the degree of generality) may not be sufficient as the sole criterion for selecting a cover. One may seek a cover which exhibits other properties than minimum sparseness. In order to use several criteria for selecting a cover simultaneously, we adopt the lexicographic cost functional defined in [14] . A lexicographic evaluation functional ( LEF ) is defined as a pair of two lists: A = (21) where a-list = (a ,a_, . . . ,a p ) , is a list of attributes used to evaluate a cover T-list = (t ,t , . . . ,t ) , is a list of "tolerances" assigned to the attributes a . , respectively , _< T _< 1. - 19 - Let V , j = 1,2,... denote all possible disjoint covers of the event set E. Let V denote one of the covers, and let a . (V ) denote the value of attribute a. for cover V . 
Cover V is said to be optimal (minimal) under the functional Λ if for every j:

    Λ(V) ≤• Λ(V_j)                                                      (22)

where

    Λ(V)   = (a_1(V), a_2(V), ..., a_ℓ(V))
    Λ(V_j) = (a_1(V_j), a_2(V_j), ..., a_ℓ(V_j)),   j = 1, 2, ...,

and ≤• is a relation, called the lexicographic order with tolerances, which holds if:

    a_1(V_j) − a_1(V) > T_1
    or
    |a_1(V_j) − a_1(V)| ≤ T_1  and  a_2(V_j) − a_2(V) > T_2
    or                                                                  (23)
    ...
    or
    |a_i(V_j) − a_i(V)| ≤ T_i, i = 1, 2, ..., ℓ−1,  and  a_ℓ(V_j) − a_ℓ(V) ≥ 0

where

    T_i = τ_i · (a_i,max − a_i,min),   i = 1, 2, ..., ℓ−1
    a_i,max = max_j {a_i(V_j)},   a_i,min = min_j {a_i(V_j)}

Note that if T-list = (0, 0, ..., 0), then ≤• denotes the lexicographic order in the usual sense. In this case Λ can be specified simply as Λ = <a-list>.

To specify the functional Λ, one selects a set of attributes, puts them in the desired order in the a-list, and sets the values of the tolerances in the T-list. The relation ≤• partitions all covers into equivalence classes and orders the classes linearly, with the first class containing one or more optimal covers, and the subsequent classes containing consecutively less optimal covers. Below are a few criteria which may be used to assemble an a-list:

• Sparseness (or generality g) of a cover. Minimizing the sparseness will produce complexes which "fit" as closely as possible the clusters of data events. This criterion is an analog of the criterion of minimizing intra-cluster distances in conventional distance-based clustering.

• Intersection, defined as the average degree of intersection (DI) between any two complexes in the cover. The DI between two complexes is the total number of selectors which remain in both complexes after removing every pair of disjoint selectors (selectors whose reference sets do not intersect). For example, the degree of intersection between the complexes [x_2 = 2,3][x_4 = 3,5,7][x_5 = 2..5] and [x_1 = 3][x_2 = 1][x_4 = 5..12][x_5 = 1] is 3. The introduction of DI as a criterion for clustering comes from the observation that people tend to prefer partitions of objects in which the clusters differ not in just one, but in many characteristics. This criterion is an analog of the criterion of maximizing inter-cluster distances in distance-based clustering.
The introduction of DI as a criterion for clustering comes from the observation, that people tend to prefer partitions of objects, in which clusters differ not in just one, but in many characteristics. This criterion is an analog to the criterion of maximizing cluster inter- distances in distance-based clustering. - 21 - • Imbalance, defined as k' 1/k T |l/k-c(E) - c(E n o )| i-1 (24) where c(E) is the size of the event set, and c(E r\ a . ) is the number of data events covered by complex a (the cardinality of the core of a.). The imbalance measures the variability of cluster sizes. • Dimensionality , defined as the total number of different variables involved in the complexes of the cover. The dimensionality tells us how many variables are used to describe clusters, and, thus, how many variables have to be measured to classify objects into these clusters. _5 . PROCEDURES STAR and NIP Before describing an algorithm for conceptual clustering (next section) we shall first describe two important procedures used in this algorithm: STAR and NID. Procedure STAR generates the star (def. 8) of a data event against a set of other data events, and procedure NID transforms a non-disjoint cover, whenever possible, into a disjoint cover with the same number of complexes. Procedure STAR : This procedure is based on the algorithm described in [14] . Let e be an event and a a complex. The operation e |— a (read: e o ™t- r Q i extended in a) is defined: - 22 - e f— a o a, if e e a o (25) , otherwise Let event e. = (r.,r_,...,r ) and e + e . The operation e — I e. L 1 2 n io r ol (read: e extende d against e, ) is defined: o 1 % "I e l " (e o I- [x i ' r l iel ]) (26) Let G (e|E) denote the union of complexes from the star G(e|E). It can be shown that: G U (e|E) = r\ e eE (e -H . ) (27) To obtain the star G(e|E) from G (e|E), the right-hand side of (27) must be converted to the union of maximal (under inclusion) complexes. 
Such a union is obtained when the set-theoretical multiplication is done with the application of absorption laws. Procedure NIP : (A transformation of a _non-disjoint cover _into a disjoint cover) Let {a ,a , . . . ,a } be a set of not necessarily disjoint complexes, which is a cover of a data event set F. - 23 - 1. Let c(a ), i = 1,2,..., A, denote the cardinality of a (the total number of events covered). Determine the (arithmetic) j|um of cardinalities: I sc - I c(o ) (28) i-1 and the cardinality of the (set-theoretic) sum of complexes: cs « c( U o ) (29) i-1 2. If sc = cs then STOP: L is already a disjoint cover. 3. For i = 1,2, ••.,£, determine the relative core, CORE , of complex a ,i.e., the set containing data events covered by complex a and only by this complex. Let RESIDUE denote the set of remaining events, I i.e., RESIDUE - F \ U CORE . i=l 4. For each CORE determine its mc-complex (def. 6): a° = RU(CORE ), i = 1,2,...,* (30) 5. If any two complexes a intersect, then STOP. The disjoint cover cannot be obtained. (This is a direct consequence of Theorem 1) 6. Select an event from RESIDUE and call it e. Delete e from RESIDUE. 7. For each pair (e,a ), i = 1,2,...,£, determine the covering complex: a* = RU({e> n a°) (31) 8. Delete every a which intersects with any a , j + i. If all a are deleted then STOP: a disjoint cover cannot be obtained. - 24 - 9. Select the best complex, Best-a, among complexes a , according to the LEF: where Aspars res Asel <(Aspars, -res, -Asel),(T ,t ,t )> 1 - the difference between the sparseness of a and a - the number of events in RESIDUE covered by a - the difference between the number of selectors in a and T 1 ,T 2 ,T 3 tolerances are set to by default. The sign '-' in front of res and Asel indicates that the algorithm will maximize these criteria (by minimizing the negative value). 10. Suppose Best-a was created by joining e with a . Assign to a a new value Best-a. 11. If RESIDUE = <(>, then END, otherwise go to 6. 
The output from this procedure is either a disjoint cover {a ,a , . . . ,a } of set F, or an indication that such cover cannot be obtained from the initial cover {a . ,a„, . . . ,a f }. 6. AN ALGORITHM FOR CONJUNCTIVE CONCEPTUAL CLUSTERING 6.1. An Overview Based on the ideas described in previous sections, we have developed an algorithm for conjunctive conceptual clustering, called PAF.* Given a *Polish-American-French - 25 - set, E, of events from an arbitrary event space, and an integer k, PAF partitions E into k clusters, each of which has a conjunctive description in the form of a VL complex. The obtained partition is optimal or suboptimal with regard to a lexicographic evaluation functional, assembled by a user from the criteria listed in the previous section. The general structure of the algorithm is based on the multicriteria dynamic clustering method developed by Diday and his collaborators (Diday and Simon [1], Hanani [15]). Underlying notions of the dynamic clustering method are two functions: g - the representation function , which, given k clusters of a partition of E (a k-partition) produces a set of k cluster representations, called kernels . There may be diferent kinds of kernels, e.g., the center of gravity of a cluster, a few selected points from a cluster, a probability distribution best fitting the cluster, a linear manifold of minimal inertia, etc. f - the allocation function , which, given a set of kernels, partitions E into k clusters, "best fitting" these kernels. The method works iteratively, starting with a set of k initial, randomly chosen kernels (of a given kind). A single iteration consists of an application of function f to given kernels, and then of function g to the obtained partition. An iteration ends with a new set of kernels. The process continues until the chosen criterion of quality of a partition, W, ceases to improve. (Criterion W measures the "fit" between a partition and kernels.) 
It has been proven [1] that this method always converges to a local optimum.

The measure W can be a single criterion, or a sequence of criteria. In the multicriteria case, an appropriate type of kernel is used for each criterion (Hanani [15]).

The algorithm PAF applies a multicriteria dynamic clustering method in which the basic and final cluster representation is a VL_1 complex. Intermediate representations include the geometrical center of a cluster (using the syntactic distance; def. 1) and the "most outstanding" event (the one most distant from the center) in a cluster. The use of the latter representation is an application of an "adversity principle." This principle states that if the most outstanding event truly belongs to the given cluster, then, when it serves as the cluster representation, the "fit" between it and the other events in the same cluster should still be better than the "fit" between it and the events of any other cluster.

In the algorithm PAF, the measure of "fit" between a data event and a kernel (a VL_1 complex) is a binary measure, defined by a predicate specifying whether the event satisfies the complex or not.

A complex is a form which can describe a very large number of configurations of events. For n variables, each taking d distinct values, there are N = (2^d - 1)^n different complexes. For example, if n = 10 and d = 7, then N is approximately 10^21. Such a large size of the "concept space" makes conjunctive clustering computationally an extremely complex problem. To obtain a feasible practical solution, it is necessary to apply a combination of carefully designed heuristic search methods. In PAF, one of the methods used is the well-known "best first" search technique developed in artificial intelligence [16].

6.2 Description of PAF

A flow diagram of the algorithm PAF is shown in Fig. 4.

1. In the first step (block 1), a set of k data events E_0 = {e_1, e_2, ..., e_k}, called seeds, is selected from the event set E.
Seeds can be selected arbitrarily, or they can be chosen as events which are syntactically most distant (def. 1) from each other. In the latter case the algorithm will generally converge faster. For selecting such events the program ESEL [17] can be used.

2. For each seed e_i, i = 1, 2, ..., k, a star is generated against the remaining seeds (using the procedure STAR described in sec. 5):

   G_i = G(e_i | E_0 \ {e_i}), i = 1, 2, ..., k

3. From each star a complex is selected, such that the resulting set of k complexes:

   (i) is a disjoint cover of E,
   (ii) is an optimal or suboptimal cover among all possible such covers, according to an assumed criterion LEF (constructed by a user from the criteria listed in sec. 4: sparseness or generality, intersection, imbalance and dimensionality).

This is the most difficult and computationally costly step of the algorithm. It can be performed in a number of different ways. We will distinguish between three different procedures: P (parallel), PS (parallel-sequential) and S (sequential). These procedures are described in the next section.

Figure 4. A flow diagram of the algorithm PAF. (Given: a set of data events, the desired number of clusters, and the evaluation functional. The loop: using procedure STAR, determine the star of each seed against the remaining seeds; select from each star one complex, so that the obtained collection P of k complexes is the "best" disjoint cover of E (with the help of the NID procedure); if the termination criterion applied to P is satisfied, stop; otherwise, if the iteration is odd, choose k new seed events which are central in the complexes of P, and if it is even, choose k new seed events which are extreme in the complexes of P.)

4. A termination criterion of the algorithm is applied to the obtained cover.
The termination criterion is a pair of parameters (b, p), where b (the base) is the standard number of iterations the algorithm always performs, and p (the probe) is the number of iterations beyond b which the algorithm performs after each iteration which produces an improved cover.

5. A new set of seeds is determined. If the iteration is odd, then the new seeds are the data events in the centers of the complexes in the cover (according to the syntactic distance). If the iteration is even, then the new seeds are data events maximally distant from the centers (according to the "adversity principle").

7. PROCEDURES P, PS AND S

All three procedures use bounded stars, that is, stars whose size is limited by a special parameter MAXSTAR. The reason is that the size of stars may be very large when the number of variables n is high. As can be seen from the procedure STAR, the upper bound on the number of complexes in a star grows exponentially with k (the number of clusters). The size of any star is therefore controlled by not allowing it to have more than MAXSTAR complexes. Whenever a star exceeds this number, its complexes are ordered by ascending sparseness, and only the first MAXSTAR complexes are retained.

It is also assumed that all complexes in stars are trimmed (i.e., the refunion operation is applied to the core of each complex, and the resulting mc-complex replaces the original complex in the star; see def. 7).

To simplify the description of the procedures, we will assume that the criterion of clustering optimality is minimizing the sparseness of the disjoint cover (representing a partition). The procedures can be extended to the multicriteria case by using a criterion LEF (which imposes a linear order on equivalence classes of sets of complexes). In such a multicriteria case, however, sparseness should be used as the primary criterion in order to retain the properties of the described procedures.
Procedure P

This procedure is applicable for relatively small MAXSTAR and k. It is particularly useful for execution on a parallel processor.

Let star G_i = G(e_i | E_0 \ {e_i}), i = 1, 2, ..., k, be the set {α^i_0, α^i_1, ..., α^i_{g_i}}, and assume that its complexes α^i_j, j = 0, 1, ..., g_i, are ordered by ascending sparseness. The position of a complex in the star so ordered (indicated by the subscript, which counts from 0) is called the rank of the complex (thus, e.g., complex α^i_2 has rank 2).

Taking one complex from each star G_i, i = 1, 2, ..., k, at a time, generate all possible sequences:

   P_0 = (α^1_0, α^2_0, ..., α^k_0)
   P_1 = (α^1_0, α^2_0, ..., α^{k-1}_0, α^k_1)
   ...                                                             (32)
   P_r = (α^1_{g_1}, α^2_{g_2}, ..., α^k_{g_k})

where r = (g_1 + 1)(g_2 + 1)···(g_k + 1) - 1, so that there are (g_1 + 1)(g_2 + 1)···(g_k + 1) sequences in total.

The sum of the ranks of the complexes in such a sequence is called its pathrank. Assume that the sequences P_j, j = 0, 1, ..., r, are now arranged in ascending order of their pathrank, with sequences of equal pathrank ordered arbitrarily. As before, P_0 has pathrank 0 (because all complexes in P_0 have rank 0); P_1, P_2, ..., P_k denote sequences with pathrank 1, and P_r denotes the sequence with pathrank g_1 + g_2 + ... + g_k.

Considering the sequences P_j in ascending order of their pathrank, the following operations are performed on each sequence:

(i) P_j is tested for whether it is a cover of E. This can be done by consecutively removing from E the data events covered by each complex in P_j. If at the end E becomes the empty set, P_j is a cover. If P_j is not a cover, it is removed from further consideration.

(ii) P_j is tested for whether it is a disjoint cover. If it is, its sparseness is calculated. If it is not, a lower bound (l.b.) on the sparseness of a possible disjoint cover is calculated (without actually determining the disjoint cover). The l.b. is computed by determining the relative core of each complex (i.e., the data events covered only by the given complex and not by any other complex), and then computing the sparseness of the mc-complex of each core. The l.b.
is the sum of the sparsenesses so obtained (this computation is based on Theorem 3). {The purpose of using the l.b. is to avoid, whenever possible, the computationally costly procedure NID.}

(iii) If the computed sparseness (or l.b.) is not a new minimum (i.e., is not smaller than the sparseness of the best cover obtained so far), then the cover is removed from further consideration. Otherwise, if it is a disjoint cover, it is retained as the best cover; and if it is a non-disjoint cover, it is transformed by NID, if possible, into a disjoint cover (note that some operations of the NID procedure were already performed in (ii)). If the sparseness of the obtained disjoint cover still represents a new minimum, the cover is retained as the best so far. If the sparseness is not a new minimum, or NID fails to produce a disjoint cover, the cover is removed from further consideration.

The disjoint cover retained at the end of the above search through the sequences P_j is the output of the procedure. It is a minimum-sparseness cover which can be assembled from the complexes in the given stars. The existence of at least one disjoint cover is assured by the sufficiency principle.

An advantage of the above ordering of the sequences P_j is that the best cover will most likely be close to the beginning of the list. Therefore, if the number of sequences is very large, the search can stop before reaching the end, with a low risk of losing the optimal solution.

Procedure PS

In procedure P, all sequences P_j were generated first, and then searched linearly in order to determine the best cover. In this procedure, the search for the best cover is done during the process of generating the sequences, using the "best first" search strategy (Winston [16]). Specifically, the search is based on the algorithm A* (Nilsson [18]).
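The pathrank ordering that procedure P imposes on the sequences can be sketched as below. The sketch materializes and sorts all index tuples, which is only feasible for small stars; the star names and the plain-list representation of stars (already sorted by ascending sparseness) are illustrative assumptions.

```python
# Sketch of procedure P's enumeration order: one complex is taken from each
# star, and the sequences are visited in ascending order of pathrank (the
# sum of the within-star ranks).
from itertools import product

def sequences_by_pathrank(stars):
    index_tuples = product(*(range(len(s)) for s in stars))
    for ranks in sorted(index_tuples, key=sum):       # ascending pathrank
        yield sum(ranks), tuple(s[r] for s, r in zip(stars, ranks))

stars = [['a0', 'a1'], ['b0', 'b1', 'b2']]            # ranks 0..g_i per star
for pathrank, seq in sequences_by_pathrank(stars):
    print(pathrank, seq)
# pathranks come out as 0, 1, 1, 2, 2, 3 over the (1+1)(2+1) = 6 sequences
```

For large products this eager enumeration forfeits the early-stopping advantage of the ordering; a lazy, heap-based enumeration of the same order would be used instead.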
At each step, the complex which most likely leads to the optimal cover (according to an evaluation function) is added to the partial cover (a partial sequence after application of NID). This process avoids testing the (usually many) sequences P_j for which it can be predicted that they will not produce an optimal cover. The procedure PS is especially applicable when the stars G_i are large.

Fig. 5 illustrates the search process. Branches emanating from a node at level i represent the complexes of the star at the next level. A path from the root to a node at level i represents a partial disjoint cover with i complexes. When i = k, the path represents a complete disjoint cover (corresponding to some sequence P_j to which NID was applied).

In the first step, the sequence P_0 = (α^1_0, α^2_0, ..., α^k_0) is generated. (It is the sequence of complexes of the smallest sparseness.) The relative core of each complex is determined, and then the mc-complex is constructed for each core. Let s_1, s_2, ..., s_k denote the sparsenesses of the obtained mc-complexes. On the basis of Theorem 3, the sum s_1 + s_2 + ... + s_k specifies a lower bound on the sparseness of the best disjoint cover which can be built from the complexes of the given stars.

In the next step, node (1) (Fig. 5) is expanded, i.e., α^1_0 is paired with every complex in the next star, the procedure NID is applied to each pair, and the sparseness is calculated for the obtained disjoint pair. If NID fails, the path is abandoned. The obtained pair is a partial cover with i = 2 complexes.

Fig. 5. A search tree for an optimal cover. (The tree has levels 1 through k; the order of expanding nodes is shown by numbers in circles, and the value of the evaluation function at each node is given in parentheses.)

Nodes corresponding to the generated partial covers (including the remaining complexes in G_1) are assigned a value of the evaluation function:

   f = h + g                                                      (33)

where

   h - the sparseness of the obtained partial disjoint cover,
   g - the sum s_{i+1} + s_{i+2} + ... + s_k, where i is the number of complexes in the partial cover.

{g represents a l.b.
on the sparseness of the remaining complexes still to be determined, i.e., the complexes needed to complete the cover under construction}.

According to the best-first strategy, the node to be expanded at each step is the one associated with the lowest value of the evaluation function. It has been proven that such a strategy will produce the optimal cover [18]. The order of expanding the nodes in the tree in Fig. 5 is shown by the numbers in circles. The value of the evaluation function associated with each node is given in parentheses.

Procedure S

This procedure is like procedure PS, with the exception that the stars are not generated beforehand. When expanding a node in the search tree, rather than taking complexes from already determined stars, the appropriate star is generated each time. This requires a multiple repetition of the star generation process, but saves the memory needed for storing all the stars (which may be large sets).

8. A NOTE ON IMPLEMENTATION AND AN EXAMPLE

The algorithm has been implemented by R. Stepp in PASCAL for the CYBER 175. The details of the implementation are in [19]. For illustration we will briefly describe two examples which were used in testing experiments with the program.

Figure 6a is a diagrammatic representation [20] of an event space spanned over variables x_1, x_2, x_3, x_4, with domain sizes 2, 5, 4, 2, respectively. Each cell represents one event. Cells marked by 1 represent data events; the remaining cells represent empty events. Fig. 6a also shows the cover obtained in the first iteration of the algorithm. The remaining figures show the results of the consecutive iterations. Cells representing the seed events in each iteration are marked by +.

The partition evaluation criterion was the LEF:

   <(sparseness, imbalance, dimensionality), (0, 0, 0)>

According to this criterion, the best partition is the one shown in Fig. 6c.
The partition is specified by the complexes:

   α°_1 = [x_1 = 0][x_2 = 1][x_4 = 0]
   α°_2 = [x_1 = 0][x_2 = 2][x_3 = 1..3]
   α°_3 = [x_1 = 1][x_2 = 1..3]

Another experiment with the program involved clustering 47 cases of soybean diseases. The cases represented four different diseases, as determined by plant pathologists (the program was not, of course, given this information). Each case was represented by an event of 35 many-valued variables. With k = 4, the program partitioned all cases into four categories. These four categories turned out to be precisely the categories corresponding to the individual diseases. The complexes defining the categories involved known characteristic symptoms of the corresponding diseases.

Fig. 6. Consecutive iterations of PAF on the example event space. (Iteration 1: sparseness = 18, imbalance = 1.6, dimensionality = 3. Iteration 2: sparseness = 20, imbalance = 3.6, dimensionality = 3. Iteration 3, the optimal solution: sparseness = 12, imbalance = 7.3, dimensionality = 4. Iteration 4: sparseness = 16, imbalance = 3.6, dimensionality = 3.)

9. CONCLUSION

The paper presented a theoretical foundation and an algorithm for conceptual clustering, in which entities are assembled into classes described by single conjunctive concepts (VL_1 complexes). Thus, the proposed approach produces clusters together with their descriptions. The descriptions are conjunctive statements involving relations on the variables characterizing the entities, and have a simple linguistic interpretation. The presented algorithm has been implemented and tested on various examples. The results indicate that the method provides a valuable alternative to conventional clustering methods, and has a potential for application in a variety of clustering problems.

ACKNOWLEDGEMENTS

A major part of the research reported in this paper was done when the author worked as a Visiting Professor at the University of Paris - IX (Dauphine) and at the research institute IRIA in France.
The partial support provided by these institutions, and by the National Science Foundation under grant MCS 79-06614, is gratefully acknowledged. Numerous conversations with Professor E. Diday and his collaborators at IRIA were very useful in shaping the ideas and the method described here. Some of the results related to the algorithm PAF were developed in collaboration with Bob Stepp. His suggestions and insightful criticisms contributed significantly to the final version of the paper. Thanks go also to June Wingler for her help in typing the paper and her admirable struggles with our not always reliable editing system.

REFERENCES

[1] Diday, E. and Simon, J. C., Clustering analysis, chapter in Communication and Cybernetics 10, ed. K. S. Fu, Springer-Verlag, Berlin, Heidelberg, New York, 1976.

[2] Anderberg, M. R., Cluster Analysis for Applications, Academic Press, 1973.

[3] Backer, E., Cluster analysis formalized as a process of fuzzy identification based on fuzzy relations, Delft University of Technology, Department of Electrical Engineering, Report IT-78-15, October 1978.

[4] Gowda, K. C. and Krishna, G., Disaggregative clustering using the concept of mutual nearest neighborhood, IEEE Trans. on Systems, Man and Cybernetics, Vol. SMC-8, No. 12 (December 1978), pp. 888-894.

[5] Michalski, R. S., Studies in inductive inference and plausible reasoning (a proposal to NSF, November 1978), to appear as a report of the Department of Computer Science, University of Illinois, Urbana, Illinois.

[6] Stepp, R., Learning without negative examples via variable-valued logic characterizations: the uniclass inductive program AQ7UNI, Department of Computer Science, Report 982, University of Illinois, Urbana, Illinois, July 1979.

[7] Watanabe, S., Pattern recognition as an inductive process, in Methodologies of Pattern Recognition, ed. S. Watanabe, Academic Press, 1968.

[8] Watanabe, S.,
Knowing and Guessing: A Quantitative Study of Inference and Information, Wiley, New York, 1969.

[9] Matula, D. W., Cluster analysis via graph-theoretic techniques, Proc. of the Louisiana Conference on Combinatorics, Graph Theory and Computing, eds. R. C. Mullin, K. B. Reid and D. P. Roselle, Louisiana State University, Baton Rouge, March 1-5, 1970.

[10] Augustson, J. G. and Minker, J., An analysis of some graph theoretical cluster techniques, Journal of the ACM, Vol. 17, No. 4 (October 1970), pp. 571-588.

[11] Zahn, C. T., Graph-theoretical methods for detecting and describing Gestalt clusters, IEEE Trans. on Computers, Vol. C-20, No. 1 (January 1971), pp. 68-86.

[12] Cheng, C.-M., Clustering by clique generation, Department of Computer Science, Report 655, University of Illinois, Urbana, Illinois, June 1974.

[13] Michalski, R. S., VARIABLE-VALUED LOGIC: System VL_1, Proceedings of the 1974 Intern. Symp. on Multiple-Valued Logic, West Virginia University, Morgantown, West Virginia, May 29-31, 1974.

[14] Michalski, R. S., Synthesis of optimal and quasi-optimal variable-valued logic formulas, Proceedings of the 1975 Intern. Symp. on Multiple-Valued Logic, Bloomington, Indiana, May 13-16, 1975.

[15] Hanani, U., Multicriteria dynamic clustering, Reports of IRIA, 1979.

[16] Winston, P. H., Artificial Intelligence, Addison-Wesley, 1977.

[17] Michalski, R. S. and Larson, J. B., Selection of most representative training examples and incremental generation of VL_1 hypotheses: the underlying methodology and the description of programs ESEL and AQ11, Report No. 867, Department of Computer Science, University of Illinois, Urbana, Illinois, 1978.

[18] Nilsson, N. J., Principles of Artificial Intelligence, Tioga Publishing Company, 1980.
[19] Stepp, R., A Description and User's Guide for CLUSTER/PAF - a Program for Conjunctive Conceptual Clustering, to appear as a report of the Department of Computer Science, University of Illinois, Urbana, Illinois, 1980.

[20] Michalski, R. S., A Planar Geometrical Model for Representing Multidimensional Discrete Spaces and Multiple-Valued Logic Functions, Report No. 897, Department of Computer Science, University of Illinois, Urbana, Illinois, 1978.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-80-1026
Title and Subtitle: KNOWLEDGE ACQUISITION THROUGH CONCEPTUAL CLUSTERING: A Theoretical Framework and an Algorithm for Partitioning Data into Conjunctive Concepts
Report Date: May 1980
Author: Ryszard S. Michalski
Performing Organization: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801
Contract/Grant No.: NSF MCS 79-06614
Sponsoring Organization: National Science Foundation, Washington, DC

Abstract: The conventional methods of cluster analysis partition given entities into clusters of "similar" entities, using a similarity function which takes into consideration only the information about the entities themselves. Therefore, clusters obtained this way do not usually have any simple conceptual interpretation. The paper presents an approach to clustering (called conceptual clustering), in which entities are assembled into a single cluster not because of their pairwise similarity, but because together they represent a certain concept. In the presented theory and algorithm, the concepts characterizing clusters are single conjunctive statements involving relations on variables characterizing the entities. Thus, the algorithm not only clusters entities, but also provides descriptions of the obtained clusters. The algorithm is iterative and its general structure is based on the dynamic clustering method. In one of the testing examples the implemented algorithm was able to re-discover the correct classification of an unordered collection of cases of four different soybean diseases, and find a description of each disease which was compatible with a characterization of this disease by plant pathologists.

Key Words: Cluster analysis; Data analysis; Learning without teacher; Knowledge acquisition; Numerical taxonomy; Pattern recognition; Inductive inference; Classification theory.