UNIVERSITY OF 
 
The person charging this material is re- 
 sponsible for its return to the library from 
 which it was withdrawn on or before the 
 Latest Date stamped below. 
 
 Theft, mutilation, and underlining of books are reasons 
 for disciplinary action and may result in dismissal from 
 the University. 
 To renew call Telephone Center, 333-8400 
 
 UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN 
 
 MAR221W 
 JAN 2 mi 
 
 DEC 15 1917 
 
 SEP 8 
 
 983 
 
 L161— O-1096 
 

 sep 1 2 ma 
 

 fY)oJH>i. 
 
 C«tf-2- 
 
 UIUCDCS-R-80-1026 
 
 UILU-ENG 80 1718 
 
 KNOWLEDGE ACQUISITION THROUGH CONCEPTUAL 
 CLUSTERING: A Theoretical Framework and an 
 Algorithm for Partitioning Data into Conjunctive 
 
 Concepts 
 
 by 
 
 May 1980 
 
 Ryszard S. Michalski 
 
 HHE LIBRARY Of M 
 
 AUG i q -fc 0U 
 
 UNIVERSITY OF ILLINOIS 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/knowledgeacquisi1026mich 
 
UIUCDCS-R-80-1026 
 
 KNOWLEDGE ACQUISITION THROUGH CONCEPTUAL 
 CLUSTERING: A Theoretical Framework and an 
 Algorithm for Partitioning Data into Conjunctive 
 
 Concepts 
 
 Ryszard S. Michalski 
 
 May 1980 
 
 Department of Computer Science 
 
 University of Illinois at Urbana-Champaign 
 
 Urbana, Illinois 61801 
 
 Supported in part by the National Science Foundation under grant no, 
 NSF MCS 79-06614. 
 
KNOWLEDGE ACQUISITION THROUGH CONCEPTUAL 
 
 CLUSTERING : A Theoretical Framework and an 
 
 Algorithm for Partitioning Data into Conjunctive 
 
 Concepts 
 
 Ryszard S. Michalski 
 
 University of Illinois 
 Urbana, 111. 61801 
 
 Then he took the seven loaves 
 and the fish, and when he had given 
 thanks, he broke them and gave them 
 to the disciples, and they in turn 
 to the people. 
 
 Mathew 15:36 
 
 U INTRODUCTION 
 
 Clustering is the intelligent partitioning of a collection of 
 entities. Specifically, it is the process of dividing entities (objects, 
 observations, measurements, data, etc.) into categories that are meaningful 
 or useful for some purpose. It is one of the fundamental operations 
 people use to simplify descriptions of their environment, and by that, to 
 improve the efficiency of their decision making. Appropriate clustering 
 reveals the underlying structure of the given set of objects, and hence 
 clustering can be viewed as a form of knowledge acquisition. 
 
 Clustering problems pervade many fields, particularly experimental 
 sciences such as biology, chemistry, geology, medicine, etc. Intelligent 
 partitioning of objects can also be an important capability of autonomous 
 
- 2 - 
 
 or semi-autonomous robots designed for exploration of special environments 
 (e.g., the bottom of an ocean or the surface of a planet). Consequently, 
 understanding the nature of clustering is not only of scientific interest, 
 but also of significant practical importance. 
 
 A conventional view of clustering is that it is a process of 
 partitioning objects into groups such that the degree of similarity (or 
 "natural association") is high among objects of the same group, and low 
 among the objects of different groups. The notion of the degree of 
 similarity is therefore fundamental to this viewpoint. A great variety of 
 different similarity measures have been developed and used in various 
 clustering techniques. Frequently a reciprocal of a distance measure is 
 used as a similarity function. The distance measure for such purposes, 
 however, does not have to satisfy all the postulates of a distance 
 function (specifically, the triangle inequality). A comprehensive review 
 of various distance and similarity measures is provided in Diday and Simon 
 [1] and Anderberg [2]. Backer [3] describes a fuzzy similarity measure 
 based on the theory of fuzzy sets. 
 
 To determine the similarity of objects, a measure of similarity is 
 applied to symbolic descriptions of objects (data points). Such 
 descriptions are typically vectors, whose components represent scores on 
 selected qualitative or quantitative variables used to describe objects. 
 The underlying assumption is that if the similarity function has high value 
 for the given descriptions, then the objects represented by the 
 descriptions are similar. The similarity relationship between any two 
 objects in the population to be clustered is thus reduced to a single 
 number — the value of the similarity function applied to symbolic 
 
- 3 - 
 
 descriptions of objects. 
 
 Conventional measures of distance are "context-free," i.e., the 
 distance between any two data points A and B is a function of these points 
 only, and does not depend on the relationship of these points to other data 
 points: 
 
 Similarity(A,B) = f(A,B) (1) 
 
 For example, for any conventional distance measure, the distance 
 between points A and B is the same as between B and C (Fig.l). 
 
 • _ • 
 
 • • • • 
 
 •• • 
 
 
 An illustration of the context-free distance 
 Fig.l. 
 
 Recently some authors have been introducing "context-sensitive" 
 measures of similarity: 
 
 Similarity(A,B) = f(A,B,E) (2) 
 
 where the similarity between A and B depends not only on A and B, but also 
 on the relationship of A and B to other data points, represented in (2) by 
 E. 
 
 For example, Gowda and Krishna [4] defined the so-called "mutual 
 neighborhood" distance measure. If point A is the nth closest point to B 
 and B is the mth closest point to A, then the mutual neighborhood distance 
 
- 4 - 
 
 between A and B is n-hn. These authors have demonstrated that a method using 
 such a distance measure can solve some clustering problems which methods 
 based on the "context-free" distance cannot. 
 
 Both previous clustering approaches cluster data points only on the 
 basis of knowledge of the individual data points. Therefore such methods 
 are fundamentally unable to capture the "Gestalt property" of objects, 
 i.e., a property which is characteristic to certain configurations of 
 points considered as a whole, and not as a collection of independent 
 points. In order to detect such properties, the system must know not 
 only the data points, but also certain "concepts". To illustrate this 
 point, let us consider a problem of clustering data points in Fig. 2. 
 
 A person considering the problem in Fig. 2 would typically describe it 
 as "a circle on top of a rectangle." 
 
 • # • 
 
 
 An illustration of conceptual clustering 
 Fig. 2. 
 
 Thus, the points A and B, although being very close, are placed in separate 
 clusters. Here, human solution involves partitioning the data points Into 
 
- 5 - 
 
 groups not on the basis of pairwise distance between points, but on the 
 basis of "concept membership." That means that the points are placed in the 
 same cluster if together they represent the same concept. In our example, 
 the concepts are a circle and a rectangle. 
 
 The approach to clustering which clusters objects into groups 
 representing a priori defined conceptual entities is called "conceptual 
 clustering." A link between conceptual clustering and distance-based 
 clustering methods can be established by stating that in conceptual 
 clustering the similarity between the data points is a function of these 
 points, context E, and a set of predefined concepts C: 
 
 Similarity(A,B) - f(A,B,E,C) (3) 
 
 The approach has been introduced by Michalski [5]. It evolved from 
 earlier work by the author and his collaborators on the problem of 
 generating "uniclass covers." Such covers are disjunctive descriptions of 
 a class of objects learned from only positive examples of the class. Stepp 
 [6] describes a computer program and various experimental results on 
 determining uniclass covers. His work is concerned with what can be called 
 'free'* conceptual clustering. 
 
 The idea that the similarity measures of the type (1) or (2) (the 
 "concept-free" measures) may be inadequate for some clustering problems is 
 not new. In the past, several authors noticed this problem and proposed 
 various solutions. For example, Watanabe [7,8] proposed the concept of 
 "cohesion" to measure the "degree of clusterness" of points, which 
 
 *In "free" clustering the number of clusters is not predefined, as opposed 
 to "constraint" clustering where the number of clusters is assumed a 
 priori. 
 
- 6 - 
 
 utilizes the entropy measure. Using this concept he was able to resolve 
 the "three girls in the dormitory" paradox, which cannot be solved by 
 "concept-free" methods . Other measures of "cohesiveness" of objects were 
 proposed on the basis of graph-theoretic considerations, e.g., Matula [9], 
 Auguston and Minker [10], Zahn [11], Cheng [12]. 
 
 This paper presents a theoretical basis and an algorithm for 
 conceptual clustering, where conceptual entities are conjunctive statements 
 in variable-valued logic calculus VL [13] (which is a typed many-valued 
 logic extension of propositional calculus). These statements, called VL 
 complexes, are logical products of relational statements involving discrete 
 variables of an arbitrary number of values (definition 2 and 3 in the next 
 chapter). Complexes have a simple linguistic interpretation and are able 
 to express consisely a large class of relationships among discrete 
 variables. The algorithm combines the methodology of optimization of 
 variable-valued logic expressions [14] with the dynamic clustering method 
 [1]. Its theoretical foundation is a special property of complexes 
 fomulated as the Sufficiency Principle (section 3). 
 
 2. COMPLEXES AS CONCEPTUAL ENTITIES FOR CLUSTERING : BASIC DEFINITIONS 
 
 Let x,, x„, ..., x denote discrete variables which are selected to 
 12 n 
 
 describe objects in the population to be clustered. For each variable a 
 value set or domain is defined, which contains all possible values this 
 variable can take for any object in the population. We shall assume that 
 the value sets of variables x., i=l,2,...,n are finite, and therefore can 
 be represented as: 
 
- 7 - 
 
 D ± = {0,1,...,^}, i = 1,2, ...,n (4) 
 
 In general, the value sets may differ not only with respect to their size, 
 
 but also with respect to the structure relating their elements (reflecting 
 
 the scale of measurement). In this paper we will restrict ourselves only to 
 
 the case of nominal or linear variables (i.e., variables with unordered or 
 
 linearly ordered domains, respectively). A sequence of values of variables 
 
 x, . x„, ..., x , is called an event: 
 1 2 n 
 
 e = (r 1§ r 2 , ...., r n ) (5) 
 
 where r e D , i - l,2,...,n. 
 
 The set of all possible events, E, is called the event space : 
 
 1 = {e i } f=l (6) 
 
 where d = d «d •...•d (the size of the event set) and d. = H. + 1. 
 1 2 n i i 
 
 Definition 1. Given two events e. , e_ in £ , the syntactic distance , 
 6(e ,e„) between e. and e_ is defined as the number of variables which have 
 different values in e. and e_. 
 
 Definition 2. A relational expression 
 
 [x ± # R ± ] (7) 
 
 where R , called the reference set , is one or more elements from the domain 
 D and # stands for one of the relational operators = ^ _> j<, is called a 
 VL . selector * or, briefly, a s elector . 
 
 *VL stands for variable-valued logic system VL [13] which uses such 
 selectors. 
 
- 8 - 
 
 Here are a few examples of a selector, in which variables and their 
 values are represented by linguistic terms: 
 [height «= tall] 
 
 [color = blue, red] (read: color is blue or red) 
 [length > 2] 
 [size ^ medium] 
 [weight = 2.. 5] 
 
 The operator . . in the last selector denotes the range of values from 
 
 2 to 5, inclusively. It is used when the domain of the variable is a 
 
 linearly ordered set. A selector [x # R ] is said to be satisfied by an 
 
 event e = (x. ,x„,...,x ), if the value of x. in e, is in relation # with 
 i 2. n i 
 
 any element of R . 
 
 Definition _3. A logical product of selectors is called a VL term : 
 
 A [x # R ] 
 iel x 1 (8) 
 
 where I c_ {l,2,...,n>, and R _c D . A set of events which satisfy a VL. 
 term is called a VL. complex or, briefly, a complex . 
 
 Thus a VL. term is a formal representation of a complex. Since these 
 two notions have a one to one correspondence, we will use them 
 interchangeably, unless it leads to a confussion. Therefore, if a set- 
 theoretic notation is applied to a term, it means that the operation is 
 applied to the corresponding complex (i.e., a set of events satisfying the 
 term). A complex ( VL. term ) a is said to cover an event e, if the values 
 of variables in e satisfy the relational statements (selectors) in the 
 
- 9 - 
 
 complex (term) . 
 
 For example, event e = (2,7,0,1,5,4,6) satisfies the complex 
 [x 1 = 2,3] [x 3 _< 3] [x 5 - 3. .8] . 
 
 Let E be a set of events in E, which are data points to be clustered. 
 The events in E are called data events (or observed events ) and events in 
 £ \ E (i.e., events in £ which are not data events) are called empty events 
 (or unobserved events ) . 
 
 Let a be a complex which covers some data events and some empty 
 events. 
 
 Definition 4^ The number of empty events covered by a is called the 
 sparseness of a and denoted by s(a). 
 
 Let p(a) denote the number of data events covered by a, and t(ot) 
 
 denote the total number of events covered by a. We have then 
 
 t(a) = p(a) + s(a). The total number of events satisfying the complex 
 
 a = A [x # R ] is: 
 iel 
 
 t(a) = II c(R.) • n d 
 
 iel 1 itfl 1 (9) 
 
 where 
 
 I _c {1,2,... ,n) 
 
 c(R ) - the cardinality of R 
 
 d - the cardinality of the value set of variable x . . 
 
- 10 
 
 Definition _5. The degree of generality g(a) of complex a is defined: 
 
 / \ i t(a) , ,. s(a). ,. - % 
 
 * (ct) " 1o8 F(^) = log (1 + 7fc) ) (10) 
 
 The value — : — r specifies how many events are in the complex per one data 
 event. Thus, the degree of generality g(a) specifies the uncertainty of the 
 location of the data points in the complex. The greater the degree of 
 generality of a complex, the greater is the uncertainty. If g ■ 0, then 
 all the events in the complex are data events. We can see from (10) that 
 for a fixed p(ot) the degree of generality is a monotonically growing 
 function of sparseness. 
 
 Let L be a set of complexes (or events) , and R be the set of all the 
 distinct values which variable x takes in these complexes (or events) . 
 
 Definition §_. The operation which transforms L into the complex 
 
 n 
 
 A [x = R ] is called reference union or refunion . The resulting 
 i=l 
 complex is called the minimal covering complex or mc-complex for L and 
 
 denoted RU(L) (refunion). 
 
 If any R = D . , then the corresponding selector is removed from the 
 complex. The refunion is thus a transformation which transforms a set of 
 complexes (or events) into the minimal covering complex. 
 
 Theorem 1_. The mc-complex of an event set has the minimum sparseness among 
 all complexes covering this set. 
 
 Proof : Let a be the mc-complex for an event set E: 
 
- 11 - 
 
 a = RU(E) = A [x =R ] 
 
 i=l (11) 
 
 where R _c D (the domain of x . ) . Suppose that p = A [x =P ] is a 
 
 1=1 
 complex which covers E and has a smaller sparseness than a. If this is 
 
 true, then there must exist P such that P. c R . . But R . , according to the 
 
 definition 6, contains all values that x takes in events in E. Therefore, 
 
 if P c R . , then complex a could not possibly cover all events in E, which 
 
 is a contradiction. • 
 
 Let E be data events which are covered by a complex a . 
 
 Definition 7. The set E is called the core of a, and the complex a* = RU(E) 
 is called the trimmed a. 
 
 From Theorem 1 we have a* c^ a. 
 
 Theorem 2. If E. and E~ are two disjoint event sets then: 
 
 s(RU(E 1 )) + s(RU(E 2 )) < s(RU(E : U Y.^ ) (12) 
 
 Proof : According to Theorem 1, RU(E.) and RU(E~) have the smallest possible 
 sparseness among all complexes covering E. and E„, respectively. Since E. 
 and E. are disjoint, then (12) must hold. • 
 
 The property expressed by Theorem 2 has an analogy in statistical 
 
 clustering, where with the increasing number of clusters the 'fit' between 
 
 each cluster and the probability distribution "fitted" to the cluster also 
 increases. 
 
- 12 - 
 
 Theorem 3. Let a and a„ be two intersecting complexes, whose union covers 
 an event set E. Let E. (E_) denote the set of events in a (a ) which are 
 covered only by this complex (the relative core of the complex). Let a' 
 and a." be any two disjoint complexes covering the same event set E. If 
 RU(E ) and RU(E~) are disjoint complexes, then: 
 
 sCRUCEj)) + s(RU(E 2 )) < sCop + s(op (13) 
 
 Proof : The theorem is an immediate consequence of Theorem 2 and the premise 
 that a' and a' are disjoint complexes. • 
 
 We will next introduce two basic concepts for the conceptual 
 clustering algorithm presented in section 6. They are the star of an event 
 against an event set and a cover of an event set against another event set. 
 
 Let F be a proper subset of the event space Z, and e an event outside 
 of F, i.e. , e a, F. 
 
 Definition 8. The star G(e|F) of e against F is the set of all maximal 
 under inclusion complexes covering the event e and not covering any event 
 in F. (A complex a is maximal under inclusion with respect to property P, 
 if there does not exist a complex a* with property P, such that a c a*.) 
 
 Let E and E be two disjoint event sets, E n E_ = <J>. 
 
 Definition 9. A cover COV(E |E ) of E against E is any set of complexes, 
 { a .). rT » such that for each event e e E- there is a complex a., j e J, 
 covering it, and none of the complexes a. cover any event in E . Thus we 
 have: 
 
- 13 - 
 
 E c. U a c Z \ E_ 
 
 jeJ 3 (14) 
 
 A cover in which all complexes are pairwise disjoint sets is called a 
 disjoint cover . If set E- is empty, then the cover COV(E. |E ? ) = COV(E. |0) 
 is simply denoted as COV(E ) . 
 
 Definition 10 . The sparseness (the degree of generality ) of a cover is 
 defined as the sum of the sparsenesses (the degrees of generality) of 
 complexes in the cover. 
 
 _3. SUFFICIENCY OF COMPLEXES AS CLUSTER REPRESENTATIONS 
 
 First, we will observe the following property of complexes: 
 
 Theorem 4. For any given event space E and integer k < d •d ,,, d (where d. 
 — I Z n I 
 
 is the cardinality of the value set of variable x ), there exist k pairwise 
 disjoint complexes a , a_, ..., a, which completely fill up the space E, 
 i.e. , 
 
 j=l J (15) 
 
 Proof : The theorem is equivalent to saying that any event space can be 
 partitioned into an arbitrary number of complexes (but, of course, not 
 larger than the cardinality of Z) . To see this, take any subset of 
 variables such that the arithmetic product of corresponding d -s is greater 
 than or equal to k. Let R., j=l,2,... denote all possible sequences of 
 values of variables x , i e I. 
 
- 14 - 
 
 Construct complexes: 
 
 "j " A [X i " r ij ] 
 J iel x 1J (16) 
 
 where r , i e I, j ■ 1,2,..., denotes a value of variable x in the 
 sequence R . Obviously, the complexes a are pairwise disjoint and fill up 
 the space E. If k' > k, then k' - k complexes are joined with the 
 remaining ones into single complexes, according to the formula: 
 
 3tx i = a] v 0[x ± « b] = B[x ± -a,b] (17) 
 
 where 3 denotes a conjunction of selectors involving variables other than 
 x . This is always possible, because for any x , i e I, there are d. 
 complexes a , which differ only in the value of x . • 
 
 From the view of clustering, a more interesting question is whether 
 for any given event set E in the space E, there always exist an arbitrary 
 number k _< c(E) of pairwise disjoint complexes, such that they not only 
 fill up the space £, but also partition the set E into k non-empty subsets. 
 A positive answer to this question would imply that any given event set can 
 be partitioned into an a priori assumed number of subsets, each covered by 
 a simple complex, disjoint from other complexes. The answer is indeed 
 positive. In fact, even a stronger property holds, as stated by the 
 following theorem. 
 
 Theorem _5 . (The Sufficiency Principle ) 
 
 For an event space E and any data event set E = ^ e i» e o» •••» e i,^» 
 E _c T. there exists at least one set of k pairwise disjoint complexes 
 
 a l> a 2» •••» a y ii such that each complex contains one data event: 
 
- 15 - 
 
 e, e a , j = 1,2, ...,k (18) 
 
 and the union of complexes fills up the space E: 
 
 k 
 
 U o = E 
 j-1 J (19) 
 
 Proof: 
 
 The basic idea of the proof is to show that for any 
 E = {e.,e-,..i,e, }, E _c E, it is always possible to construct a tree, in 
 which nodes are assigned the variables x , i e l,2,...,n, branches of node 
 x are assigned elements of a partition of D (the value set of x ), and 
 the leaves represent complexes a , such that each complex covers a single 
 event e., and the union of complexes fills up the space E. 
 
 Suppose, e = (Xj ,x 2 , . . . ,x ), j - 1,2, ...,k, and x £ D^ 
 
 Take any variable, say x , which has different values for events in E. 
 
 Suppose these values are a., a_, ..., a Partition the value set, D of 
 
 1 2 z. p 
 
 x , into subsets {a.}, {a }, . . . , {a .}, A , where a e A and A is a set 
 p 12 z-lz z z z 
 
 D \ {a. ,a„,a_, . . . ,a .}. It is obvious that complexes 
 
 [x = a. ] , [x = a„], ...., [x ■ A ], partition both, the event set E and 
 p 1 p 2 p z r 
 
 the event space E into z non-empty subsets. Suppose these complexes 
 
 partition E into E , E , ..., E and E into E , E , ..., E , where 
 a. a_ A a. a_ A 
 
 12 z 12 z 
 
 E c E . 
 
 a i" a i 
 
 Variable x is assigned to the root of a tree. Branches from the root 
 
 are assigned values a., a_, .«., A . Leaves of this tree correspond to 
 
 J. z. z 
 
 complexes [x = a, ] , [x ■ a»] , ..., [x = A ], covering event sets 
 p 1 p 2 ' p z ° 
 
- 16 - 
 
 E al' E a2' * 
 
 , E , respectively (Fig. 3) 
 
 CXp sq,3 
 
 CX p -a,3[Xr-By] 
 
 Constructing a tree for the proof of the sufficiency principle 
 
 Fig. 3. 
 For every one of the above event sets which has more than one element 
 
 repeat the above process with the following modification. Suppose E has 
 
 3. J. 
 
 more than one element and x takes values b, , b_, .... B for events in 
 
 r 1* 2 y 
 
 E .. Assign x to the root of a new tree, and attach the tree to the leaf 
 al ° r 
 
 corresponding to E (i.e., to the leaf marked by [x = a. ] in Fig 3). 
 
 d 1 P X 
 
 Assign the branches emanating from this root values b. , b„, ..., B , where 
 
 B = D \ {b ,b_,...,b .}. It is obvious that complexes: 
 yy s 2 y-1 
 
 [x = &1 ] [x r = b x ] , [x = a x ] [x r - b 2 ] , . . . , [x = a x ] [x r = B y ] (20) 
 partition both, the set E and the set Z . into y disjoint subsets. 
 
 This process is continued until leaves of the obtained tree correspond 
 to complexes, each of which covering only one event from E. Because every 
 step of this process partitions simultaneously events in E and in £, the 
 
- 17 - 
 
 union of the obtained complexes covers E and fills up the whole space E. 
 Thus, these complexes constitute the desired set <a ,a , . . . ,a, >. • 
 
 This above theorem asserts that the space of all complexes is 
 sufficient to be a space of cluster representations, because any event set 
 can be clustered into an arbitrary number of complexes. The theorem is 
 used as the theoretical basis for the clustering algorithm described in 
 section 6. 
 
 As the above proof indicates, there usually will be many covers which 
 constitute a k-partition of any given event set. Therefore, a question 
 arises as to which cover to select as the most desirable. In order to 
 answer this question, a criterion of the quality of a cover is needed. 
 
 4- . A CRITERION FOR EVALUATING QUALITY OF CLUSTERING 
 
 Let E be the set of data points, and COV(E) a disjoint cover of E. 
 Such a cover implies a partition of E into clusters, each cluster being the 
 event set contained in one complex. The sparseness (or the degree of 
 generality) of the cover could be used for defining a criterion of quality 
 of a partition. However, if E is partitioned into individual events, 
 then, obviously, the sparseness ( as well as the degree of generality) 
 will be zero. Consequently, this kind of criterion can be used only if the 
 number of clusters is assumed a priori, i.e. , for a constrained clustering 
 problem. In this case the problem is to find a disjoint cover of E with k 
 complexes, whose sparseness (or the degree of generality) is minimum. In 
 the case of a free clustering problem (i.e., when the number of clusters is 
 not assumed a priori), a criterion of quality of partitioning has to 
 
- 18 - 
 
 involve, in addition to sparseness (or the degree of generality), also some 
 "cost" function dependent on the number of clusters, e.g., a measure of 
 complexity of a cover. In this paper we are concerned only with the 
 constraint clustering problem. Although it may seem otherwise, this is not 
 a serious limitation because interesting practical solutions of 
 clustering problems should not produce more than just a few clusters (this 
 is so, because when the number of clusters is large, humans prefer to 
 organize them into an hierarchy). Consequently, to obtain a general 
 solution, a constraint clustering algorithm should be repeated for 
 several different k, and the best obtained partition selected as the 
 general solution. 
 
 The sparseness (or the degree of generality) may not be sufficient as 
 the sole criterion for selecting a cover. One may seek a cover which 
 exhibits other properties than minimum sparseness. In order to use several 
 criteria for selecting a cover simultaneously, we adopt the lexicographic 
 cost functional defined in [14] . 
 
 A lexicographic evaluation functional ( LEF ) is defined as a pair of 
 two lists: 
 
 A = <a-list,T-list> (21) 
 
 where a-list = (a ,a_, . . . ,a p ) , is a list of attributes used 
 to evaluate a cover 
 T-list = (t ,t , . . . ,t ) , is a list of "tolerances" assigned to 
 the attributes a . , respectively , _< T _< 1. 
 
- 19 - 
 
 Let V , j = 1,2,... denote all possible disjoint covers of the event 
 set E. Let V denote one of the covers, and let a . (V ) denote the value of 
 attribute a. for cover V . Cover V is said to be optimal (minimal ) under 
 functional A if for every j: 
 
 A(V) <• A(V ) (22) 
 
 where 
 
 A(V) = (a^V), a 2 (V),...,a £ (V)) 
 
 A(V ) = (a x (V ), a 2 (V ) a £ (V )), j =1,2,..., 
 
 x 
 and <• is a relation, called the lexicographic order with tolerances , which 
 
 holds if: 
 
 a l ( V " a l (V) > T l 
 or la^V ) - a^V)! < Tj and a 2 (V ) - a 2 (V) > ? 2 
 
 or 
 
 (23) 
 
 where 
 
 or and a £ (V ) - a £ (V) > 
 
 T i = V (a imax- a imin^ i = L2,. ..,*-! 
 
 a imax =^x{a ± (V )>, 
 
 a imin = m j in{a i (V j ) > 
 
 Note that if t = (0,0,..., 0) then <• denotes the lexicographic order 
 in the usual sense. In this case, A can be specified just as A = <a-list>. 
 
- 20 - 
 
 To specify the a functional A one selects a set of attributes, puts 
 thera in the desirable order in the a-list, and sets the values for 
 tolerances in the T-list. 
 
 Relation <• partitions all covers into equivalence classes and orders 
 the classes linearly, with the first class containing one or more optimal 
 covers, and the next classes containing consecutively less optimal covers. 
 
 Below are a few criteria which may be used to assemble an a-list: 
 
 • Sparseness (or generality g) of a cover. Minimizing sparseness will 
 produce complexes which "fit" as closely as possible to clusters of data 
 events. This criterion is an analog to the criterion of minimizing intra- 
 distances in the conventional distance-based clustering. 
 
 • Intersection , defined as the average degree of intersection (DI) between 
 
 any two complexes in the cover. The DI between two complexes is the total 
 
 number of selectors which remain in both complexes after removing every 
 
 pair of disjoint selectors (selectors whose reference sets do not 
 
 intersect) . For example, the degree of intersection between complexes 
 
 [x 2 =2,3] [x 4 =3,5,7] [x 5 =2..5] 
 and 
 
 \ f 
 
 [ Xl =3] [x 2 =l] [x 4 =5. .12] [x 5 =l] 
 is 3. 
 
 The introduction of DI as a criterion for clustering comes from the 
 observation, that people tend to prefer partitions of objects, in which 
 clusters differ not in just one, but in many characteristics. This 
 criterion is an analog to the criterion of maximizing cluster inter- 
 distances in distance-based clustering. 
 
- 21 - 
 
 • Imbalance, defined as 
 
 k' 
 1/k T |l/k-c(E) - c(E n o )| 
 
 i-1 (24) 
 
 where c(E) is the size of the event set, and c(E r\ a . ) is the number of 
 data events covered by complex a (the cardinality of the core of a.). The 
 imbalance measures the variability of cluster sizes. 
 
 • Dimensionality , defined as the total number of different variables 
 involved in the complexes of the cover. The dimensionality tells us how 
 many variables are used to describe clusters, and, thus, how many variables 
 have to be measured to classify objects into these clusters. 
 
 _5 . PROCEDURES STAR and NIP 
 
 Before describing an algorithm for conceptual clustering (next 
 section) we shall first describe two important procedures used in this 
 algorithm: STAR and NID. Procedure STAR generates the star (def. 8) of a 
 data event against a set of other data events, and procedure NID transforms 
 a non-disjoint cover, whenever possible, into a disjoint cover with the 
 same number of complexes. 
 
 Procedure STAR : 
 
 This procedure is based on the algorithm described in [14] . 
 
 Let e be an event and a a complex. The operation e |— a (read: e 
 o ™t- r Q i 
 
 extended in a) is defined: 
 
- 22 - 
 
 e f— a 
 o 
 
 a, if e e a 
 o 
 
 (25) 
 
 <J> , otherwise 
 
 Let event e. = (r.,r_,...,r ) and e + e . The operation e — I e. 
 L 1 2 n io r ol 
 
 (read: e extende d against e, ) is defined: 
 o 1 
 
 % "I e l " (e o I- [x i ' r l 
 iel 
 
 ]) 
 
 (26) 
 
 Let G (e|E) denote the union of complexes from the star G(e|E). It 
 can be shown that: 
 
 G U (e|E) 
 
 = r\ 
 
 e eE 
 
 (e -H . ) 
 
 (27) 
 
 To obtain the star G(e|E) from G (e|E), the right-hand side of (27) 
 must be converted to the union of maximal (under inclusion) complexes. 
 Such a union is obtained when the set-theoretical multiplication is done 
 with the application of absorption laws. 
 
 Procedure NIP : 
 
 (A transformation of a _non-disjoint cover _into a disjoint cover) 
 
 Let {a ,a , . . . ,a } be a set of not necessarily disjoint complexes, 
 which is a cover of a data event set F. 
 
- 23 - 
 
 1. Let c(a ), i = 1,2,..., A, denote the cardinality of a (the total 
 
 number of events covered). Determine the (arithmetic) j|um of 
 
 cardinalities: 
 
 I 
 sc - I c(o ) (28) 
 
 i-1 
 
 and the cardinality of the (set-theoretic) sum of complexes: 
 
 cs « c( U o ) (29) 
 
 i-1 
 
 2. If sc = cs then STOP: L is already a disjoint cover. 
 
 3. For i = 1,2, ••.,£, determine the relative core, CORE , of complex 
 
 a ,i.e., the set containing data events covered by complex a and only 
 
 by this complex. Let RESIDUE denote the set of remaining events, 
 
 I 
 i.e., RESIDUE - F \ U CORE . 
 
 i=l 
 
 4. For each CORE determine its mc-complex (def. 6): 
 
 a° = RU(CORE ), i = 1,2,...,* (30) 
 
 5. If any two complexes a intersect, then STOP. The disjoint cover 
 cannot be obtained. (This is a direct consequence of Theorem 1) 
 
 6. Select an event from RESIDUE and call it e. Delete e from RESIDUE. 
 
 7. For each pair (e,a ), i = 1,2,...,£, determine the covering complex: 
 
 a* = RU({e> n a°) (31) 
 
 8. Delete every a which intersects with any a , j + i. If all a are 
 deleted then STOP: a disjoint cover cannot be obtained. 
 
- 24 - 
 
 9. Select the best complex, Best-a, among complexes a , according to the 
 LEF: 
 
 where 
 Aspars 
 
 res 
 
 Asel 
 
 <(Aspars, -res, -Asel),(T ,t ,t )> 
 
 1 
 
 - the difference between the sparseness of a and a 
 
 - the number of events in RESIDUE covered by a 
 
 - the difference between the number of selectors in a and 
 
 T 1 ,T 2 ,T 3 
 
 tolerances are set to by default. 
 
 The sign '-' in front of res and Asel indicates that the algorithm 
 will maximize these criteria (by minimizing the negative value). 
 
 10. Suppose Best-a was created by joining e with a . Assign to a a new 
 value Best-a. 
 
 11. If RESIDUE = <(>, then END, otherwise go to 6. 
 
 The output from this procedure is either a disjoint cover 
 {a ,a , . . . ,a } of set F, or an indication that such cover cannot be 
 obtained from the initial cover {a . ,a„, . . . ,a f }. 
 
 6. AN ALGORITHM FOR CONJUNCTIVE CONCEPTUAL CLUSTERING 
 
 6.1. An Overview 
 
 Based on the ideas described in previous sections, we have developed 
 an algorithm for conjunctive conceptual clustering, called PAF.* Given a 
 
 *Polish-American-French 
 
- 25 - 
 
 set, E, of events from an arbitrary event space, and an integer k, PAF 
 partitions E into k clusters, each of which has a conjunctive description 
 in the form of a VL complex. The obtained partition is optimal or 
 suboptimal with regard to a lexicographic evaluation functional, assembled 
 by a user from the criteria listed in the previous section. 
 
 The general structure of the algorithm is based on the multicriteria 
 dynamic clustering method developed by Diday and his collaborators (Diday 
 and Simon [1], Hanani [15]). Underlying notions of the dynamic clustering 
 method are two functions: 
 
 g - the representation function , which, given k clusters of a partition 
 
 of E (a k-partition) produces a set of k cluster representations, 
 called kernels . There may be diferent kinds of kernels, e.g., the 
 center of gravity of a cluster, a few selected points from a cluster, a 
 probability distribution best fitting the cluster, a linear manifold of 
 minimal inertia, etc. 
 
 f - the allocation function , which, given a set of kernels, 
 
 partitions E into k clusters, "best fitting" these kernels. 
 
 The method works iteratively, starting with a set of k initial, 
 randomly chosen kernels (of a given kind). A single iteration consists of 
 an application of function f to given kernels, and then of function g to 
 the obtained partition. An iteration ends with a new set of kernels. The 
 process continues until the chosen criterion of quality of a partition, W, 
 ceases to improve. (Criterion W measures the "fit" between a partition and 
 kernels.) It has been proven [1], that this method always converges to a 
 local optimum. 
 
- 26 - 
 
 The measure W can be a single criterion, or a sequence of criteria. 
 In the nultlcrlterla case, for each criterion an appropriate type of 
 kernels Is used (Hananl [15]). 
 
 The algorithm PAF applies a multicriteria dynamic clustering method, 
 In which the basic and final cluster representation is a VL complex. 
 Intermediate representations include the geometrical center of a cluster 
 (using the syntactic distance; def.l) and the "most outstanding" event 
 (most distant from the center) in a cluster. 
 
 The use of the latter representation is an application of an 
 "adversity principle." This principle states that if the most outstanding 
 event truly belongs to the given cluster, then if it serves as the cluster 
 representation, then the "fit" between it and other events in the same 
 cluster should still be better than the "fit" between it and events of any 
 other cluster. 
 
 In the algorithm PAF, the measure of "fit" between a data event and a 
 kernel (a VL. complex) is a binary measure, defined by a predicate 
 specifying whether an event satisfies the complex or not. A. complex is a 
 form, which can describe a very large number of configurations of events. 
 For n variables, each taking d distinct values, there are N = (2 -1) 
 
 different complexes. For example, if n ■ 10 and d = 7, then N is 
 
 20 
 approximately 10 . Such a large size of the "concept space" makes 
 
 conjunctive clustering computationally an extremely complex problem. To 
 
 obtain a feasible practical solution, it is necessary to apply a 
 
 combination of carefully designed heuristic search methods. In PAF, one of 
 
 the methods used is a well known "best first" search technique developed 
 
- 27 - 
 
 in artificial intelligence [16] . 
 
 6.2 Description of PAF 
 
 A flow diagram of algorithm PAF is shown in Fig. 4. 
 
 1. In the first step (block 1), a set of k data events 
 E = {e. , e_, . . . ,e, >, called seeds , is selected from the event set E. 
 Seeds can be selected arbitrarily, or they can be chosen as events 
 which are most distant syntactically (def.l ) from each other. In the 
 latter case the algorithm will generally converge faster. For 
 selecting such events program ESEL [17] can be used. 
 
 2. For each seed e , i = l,2,...,k, a star is generated against the 
 remaining seeds (using procedure STAR described in sec. 5): 
 
 G ± = G( ei |E o \ < ei >), i = 1,2 k 
 
 3. From each star a complex is selected, such that the resulting set of k 
 complexes: 
 
 (i) Is a disjoint cover of E 
 
 (ii) Is an optimal or suboptimal cover among all possible such covers, 
 according to an assumed criterion LEF (constructed by a user from 
 criteria listed in sec. 4: sparseness or generality, intersection, 
 imbalance and dimensionality). This is the most difficult and 
 computationally costly step of the algorithm. It can be 
 performed in a number of different ways. We will distinguish 
 between three different procedures: P (parallel), PS (parallel- 
 sequential) and S (sequential). These procedures are described 
 
- 28 - 
 
 Give n : 
 
 - a set of data events 
 
 - the desired nr of clusters 
 
 - the evaluation functional 
 
 Using procedure STAR determine the star 
 of each seed against the remaining seeds 
 Select from eacli star one complex, SO 
 that the obtained collection, 1', of k 
 complexes will he the "best" disjoint 
 cover of E (with help of NID procedure) . 
 
 Vt7 
 
 Is the termination 
 criterion appl ied 
 to P satisfied? 
 
 Yes 
 
 W 
 
 Is iteration 
 odd or even 
 
 W 
 
 Choose k new seed 
 events which are 
 central in the 
 complexes in P 
 
 <^7 
 
 Choose k new Seed 
 events which are 
 extreme in the 
 complexes in P 
 
 A flow diagram of algoritm PAF 
 
 Figure 4. 
 
- 29 - 
 
 in the next section. 
 
 4. A termination criterion of the algorithm is applied to the obtained 
 cover. The termination criterion is a pair of parameters (b,p), where 
 b (the base) is a standard number of iterations the algorithm always 
 performs, and p (the probe) is the number of iterations beyond b, 
 which the algorithm performs, after each iteration which produces an 
 improved cover. 
 
 5. A new set of seeds is determined. If the iteration is odd, then the 
 new seeds are data events in the centers of complexes in the cover 
 (according to the syntactic distance) . If the iteration is even, then 
 the new seeds are data events maximally distant from the centers 
 
 (Recording to the "adversity principle"). 
 
 ]_. PROCEDURES P, SP AND S^ 
 
 All three procedures use bounded stars , that is stars whose size is 
 limited by special parameter MAXSTAR. The reason is that the size of stars 
 may be very large when the number of variables n is high. As can be seen 
 from procedure STAR, the upper bound on the number of complexes in a star 
 grows exponentially with k (the number of clusters); namely n . The size 
 of any star is controlled by not allowing it to have more than MAXSTAR 
 complexes. Whenever a star exceeds this number, complexes are ordered in 
 the order of ascending sparseness, and only first MAXSTAR complexes are 
 retained. It is also assumed that all complexes in stars are trimmed (i.e., 
 the refunion operation is applied to the core of each complex, and then 
 the resulting mc-complex is used to replace the original complex in the 
 
- 30 - 
 
 9tar; see def .7) . 
 
 To simplify the description of procedures we will assume that the 
 criterion of clustering optimality is minimizing the sparseness of the 
 disjoint cover (representing a partition). The procedures can be extended 
 for a multicriteria case by using a criterion LEF (which imposes a linear 
 order between equivalence clsses of sets of complexes). In such a 
 multicriteria case, however, sparseness should be used as the primary 
 criterion in order to retain the properties of the described procedures. 
 
 Procedure P_ 
 
 The procedure is applicable for relatively small MAXSTAR and k. It is 
 particularly useful for execution on a parallel processor. Let star 
 
 G = G(e |E \ {e }) be a set <a ,a , . . . ,a >, i = l,2,...,k. Assume that 
 
 i 
 complexes a , j=0,l,...,g , are ordered in ascending order on sparseness. 
 
 The position of a complex in the star so ordered (indicated by a subscript, 
 
 which counts from 0), is called the rank of the complex (thus, e.g., 
 
 complex a has rank 2) . 
 
 Taking one symbol a. from each star G . , i = 1,2,..., k, at a time, 
 
 generate all possible sequences: 
 
 / 1» 2 k, 
 P = (a a , ...,ot ) 
 
 o o o 
 
 , 1 2 k-1 k N 
 
 1 o o o I 
 
 (32) 
 
 P = {a ,a ,...,a > 
 
 r 8 X 8 2 8 k 
 
 where r = (8^1) (8 2 +l) . . . (g +i) 
 
- 31 - 
 
 The sum of the ranks of complexes in any such sequence is called the 
 pathrank . Assume that sequences P , j * 1,2,...,T are now arranged in 
 ascending order on their pathrank, with sequences of equal pathrank ordered 
 arbitrarily. As before, P has pathrank (because all complexes in P 
 have rank 0). P. ,P„, . . . ,P, , however, denote sequences with pathrank 1, and 
 P„ denotes a sequence with pathrank g,+g 9 +» • «+g, • 
 
 Considering sequences P in the ascending order on their pathrank, 
 the following operations are performed on each sequence: 
 
 (i) A P is tested whether it is a cover of E. This can be done by 
 
 consecutively removing from E data events covered by each complex in 
 
 P . If at the end E becomes the empty set, P is a cover. If a P is 
 not a cover, it is removed from further consideration. 
 
 (ii) A P is tested whether it is a disjoint cover. If it is, its 
 sparseness is calculated. If it is not, a lower bound (l.b.) on the 
 sparseness of a possible disjoint cover is calculated (without 
 actually determinig the disjoint cover). 
 
 The l.b. is computed by determining the relative core of each 
 complex (i.e., data events covered only by the given complex and not 
 by any other complexes) , and then computing the sparseness of the mc- 
 complex of the core. The l.b. is the sum of so obtained sparsenesses 
 (this computation is based on theorem 3). {The purpose of using the 
 l.b. is to avoid, whenever possible, the computationally costly 
 procedure NID.} 
 
- 32 - 
 
 (iii) If the computed sparseness (or l.b.) i8 not a new minimum (i.e., ia 
 not 8maller than the aparaene88 of the be8t cover obtained so far), 
 then the cover is removed from further consideration. Otherwise, if 
 it is a disjoint cover, it is retained as the best cover; and if it is 
 a non-disjoint cover, it is transformed by NID, if possible, into a 
 disjoint cover (note that some operations of the NID procedure were 
 already done in (ii)). If the sparseness of the obtained disjoint 
 cover still represents a new minimum, the cover is retained as the 
 best so far. If the sparseness is not a new minimum, or NID fails to 
 produce a disjoint cover, the cover is removed from further 
 consideration. 
 
 The disjoint cover retained at the end of the above search process 
 through sequences P is the output of the procedure. It is a minimum 
 sparseness cover which can be assembled from complexes in the given stars. 
 The existence of at least one disjoint cover is assured by the sufficiency 
 principle. An advantage of the above described ordering of sequences P 
 is that the best cover will most likely be close to the beginning of the 
 list. Therefore, if the number of sequences is very large, the search can 
 stop before reaching the end, with a low risk of loosing the optimal 
 solution. 
 
 Procedure PS 
 
 In procedure P, all sequences P were generated first, and then 
 linearly searched in order to determine the best cover. In this procedure, 
 the search for the best cover is done during the process of generating the 
 
- 33 - 
 
 sequences, using the "best first" search strategy (Winston [16]). 
 Specifically, the search is based on the algorithm A* (Nilsson [18]). At 
 step a complex is added to the partial cover (a partial sequence after 
 application of NID) which most likely leads the optimal cover (according to 
 an evaluation function). This process avoids testing (usually many) 
 sequences P , for which it is possible to predict that they will not 
 produce an optimal cover. The procedure PS is especially applicable when 
 stars G are large. 
 
 Fig. 5 illustrates the search process. Branches emanating from a node 
 at level i represent complexes in star G . A path from the root to a node 
 at level i represents a partial disjoint cover with i complexes. When i=k, 
 the path represents a complete disjoint cover (corresponding to some 
 sequence P to which NID was applied). 
 
 In the first step, sequence P ■ (a ,a , . . . ,a ) is generated. (It is 
 the sequence of complexes of the smallest sparseness). The relative core 
 of each complex is determined and then the mc-complex is constructed for 
 each core. Let s.. ,s_, . . . ,s, denote the sparsenesses of the obtained mc- 
 complexes. On the basis of theorem 3, the sum s. + s 9 + •••+ s. specifies 
 a lower bound on the sparseness of the best disjoint cover which can be 
 built from complexes of given stars. 
 
 In the next step, node (1) (fig. 5) is expanded, i.e., a is paired 
 with every complex in G , procedure NID is applied to each pair, and then 
 the sparseness is calculated for the obtained disjoint pair. If NID fails, 
 the path is abandoned. The obtained pair is a partial cover with i=2 
 complexes. Nodes corresponding to generated partial covers (including the 
 
- 34 - 
 
 •^ *- 
 
 (LEVEL 1) 
 
 (LEVEL2) 
 
 (LEVEL 3) 
 
 (35) (24] 1X^(19) (21) 
 
 OPTIMAL COVER 
 
 a o (LEVEL k) 
 
 (24) (27) (24) 
 
 A search tree for an optimal cover 
 Fig. 5. 
 
- 35 - 
 
 remaining complexes in G.) are assigned a value of the evaluation function: 
 
 f = h + g (33) 
 
 where 
 
 h - is the sparseness of the obtained partial disjoint cover 
 
 g - is the sum s . +. s . _ + ... + s, , where i is the number of complexes 
 
 in the partial cover. 
 
 {g represents a l.b. on the sparseness of the remaining complexes to 
 
 be determined, i.e., complexes that are needed to complete the cover 
 
 under construction} 
 
 According to the best first strategy, the node to be expanded at each 
 step is the one which is associated with the lowest value of the 
 evaluation function. It is proven that such strategy will produce the 
 optimal cover [18]. The order of expanding nodes in the tree in Fig. 5 is 
 shown by numbers in circles. The value of the evaluation function 
 associated with each node is given in parentheses. 
 
 Procedure S^ 
 
 This procedure is like procedure PS, with the exception that stars are 
 not generated beforehand. When expanding a node in the search tree, rather 
 than taking complexes from already determined stars, an appropriate star is 
 generated each time. This requires a multiple repetition of the star 
 generation process, but saves on the memory for storing all stars (which 
 may be large sets). 
 
- 36 - 
 
 8. A NOTE ON IMPLEMENTATION AND AN EXAMPLE 
 
 The algorithm has been Implemented by R. Stepp in PASCAL for Cyber 
 175. The details on the implementation are in [19]. For illustration we 
 will briefly describe two examples, which were used In testing experiments 
 with the program. 
 
 Figure 6a represents a diagrammatic representation [20] of an event 
 space, spanned over variables x.,x_,x-,x,, with domain sizes 2, 5, 4, 2, 
 respectively. Each cell represents one event. Cells marked by 1 represent 
 data events, remaining cells represent empty events. Fig. 6a also shows a 
 cover obtained from the first iteration of the algorithm. The remaining 
 figures show results from the consecutive iterations. Cells representing 
 seed events in each iteration are marked by + . The partition evaluation 
 criterion was a LEF: 
 
 <(sparseness, imbalance, dimensionality) (0, 0, )> 
 According to this criterion, the best partition is the one shown in 
 Fig. 6c. The partition is specified by comple es: 
 
 a° = [x x = 0] [x 2 = 1] [x 4 = 0] 
 
 a° = [x 1 = 0] [x 2 = 2] [x 3 - 1..3] 
 
 a° = [Xj = 1] [x 2 = 1..3] 
 
 Another experiment with the program involved clustering 47 cases of 
 soybean diseases. These cases represented four different diseases, as 
 determined by plant pathologists (the program was not, of course, given 
 this information) . Each case was represented by an event of 35 many-valued 
 variables. With k=4, the program partitioned all cases into four 
 categories. These four categories turned out to be precisely the 
 
- 37 - 
 
 Sparseness = 18 
 lmbalance= 1.6 
 Dimensionality 2 3 
 
 ITERATION 2 
 
 Sparseness = 20 
 Imbalance = 3.6 
 Dimensionality ■. 3 
 
 c. 
 
 ITERATION 3 
 (Optimal solution) 
 Sparseness = 1 2 
 
 Imbalance 3 7.3 
 Dimensionality 2 4 
 
 ITERATION 4 
 
 Sparseness = 16 
 Imbalance 5 3.6 
 Dimensionality 5 3 
 
 Fig. 6. 
 
- 38 - 
 
 categories corresponding to individual diseases. The complexes defining 
 the categories involved known characteristic symptoms of the corresponding 
 diseases. 
 
 9.- CONCLUSION 
 
 The paper presented a theoretical foundation and an algorithm for 
 conceptual clustering, in which entities are assembled into classes 
 described by single conjunctive concepts (VL. complexes). Thus, the 
 proposed approach produces clusters together with their descriptions. The 
 descriptions are conjunctive statements involving relations on variables 
 characterizing the entities, and have a simple linguistic interpretation. 
 
 The presented algorithm has been implemented and tested on various 
 examples. The results indicate that the method provides an valuable 
 alternative to the conventional clustering methods, and has a potential for 
 application in variety of clustering problems. 
 
 ACKNOWLEDGEMENTS 
 
 A major part of the research reported in this paper was done when the 
 author worked as a Visiting Professor at the University of Paris - IX 
 (Dauphine) and at the research institute IRIA in France. A partial support 
 provided by these institutions, and by the National Science Foundation 
 under grant MCS-79-06614 is gratefully acknowledged. 
 
 Numerous conversations with Professor E. Diday and his collaborators 
 at IRIA have been very useful in shaping up the ideas and the method 
 
- 39 - 
 
 described here. Some of the results related to the algorithm PAF have 
 been developed in collaboration with Bob Stepp. His suggestions and 
 insightful criticisms contributed significantly to the final version of 
 the paper. 
 
 Thanks go also to June Wingler for her help in typing the paper and 
 admirable struggles with our not always reliable editing system. 
 
 REFERENCES 
 
 [1] Diday, E. and Simon, J. C. Clustering analysis, Chapter in 
 Communication and Cybernetics 10, Ed. K. S. Fu, Springer-Verlag, 
 Berlin, Heidelberg, New York, 1976. 
 
 [2] Anderberg, Michael R. Cluster Analysis for Applications , Academic 
 Press, 1973. 
 
 [3] Becker, E. Cluster analysis formalized as a process of fuzzy 
 identification based on fuzzy relations, Delft University of 
 Technology, Department of Electrical Engineering, Report IT-78-15, 
 October 1978. 
 
 [4] Gowda, K. Chidananda and Krishna, G. Disaggregative clustering using 
 the concept of mutual nearest neighborhood, IEEE Trans, on Systems, 
 Man and Cybernetics, Vol. SMC-8, No. 12 (December 1978), pp. 888-894. 
 
 [5] Michalski, R. S. Studies in inductive inference and plausible 
 reasoning, (a proposal to NSF, November 1978) to appear as a report 
 of the Department of Computer Science, University of Illinois, 
 Urbana, Illinois). 
 
 [6] Stepp, R. Learning without negative examples via variable-valued 
 logic characterizations: the uniclass inductive program AQ7UN1, 
 Department of Computer Science, Report 982, University of Illinois, 
 Urbana, Illinois, July 1979. 
 
 [7] Watanabe, S. Pattern Recognition as an Inductive Process , 
 Methodologies of Pattern Recognition , Academic Press, Ed. S. 
 Watanabe, 1968. 
 
 [8] Watanabe, S. Knowing and Guessing ; a quantitative study of inference 
 and information , New York, Wiley , 1969. 
 
- 40 - 
 
 [9] Matula, D. W. Cluster analysis via graph-theoretic techniques, Proc. 
 of the Louisiana Conerence on Combinatorics, Graph Theory and 
 Computing, Eds. R. C. Mullin, K. B. Reid and D. P. Roselle, Louisiana 
 State University, Baton Rouge, March 1-5, 1970. 
 
 [10] Auguston, I. G. and Minker, J. An analysis of some graph theoretical 
 cluster techniques, Journal of the ACM, Vol. 17, No. 4 (October 
 1970), pp. 571-588. 
 
 [11] Zahn, C. T. Graph-theoretic methods for detecting and describing 
 Gestalt clusters, IEEE Trans, on Computers, Vol. C-20, No. 1 (January 
 1971), pp. 68-86. 
 
 [12] Cheng, Chih-Meng. Clustering by clique generation, Department of 
 Computer Science, Report 655, University of Illinois, Urbana, 
 Illinois, June 1974. 
 
 [13] Michalskl, R. S., VARIABLE-VALUED LOGIC: System VL , Proceedings of 
 the 1974 Intern. Symp. on Multiple-Valued Logic, West Virginia 
 University, Morgantown, West Virginia, May 29-31, 1974. 
 
 [14] Michalskl, R. S., "Synthesis of optimal and quasi-optimal variable- 
 valued logic formulas", Proceedings of the 1975 Intern. Symp. on 
 Multiple-Valued Logic, Bloomlngton, Indiana, May 13-16,1975 
 
 [15] Hanani, U. Multicritera dynamic clustering, Reports of IRIA, 1979. 
 
 [16] Winston, P. H., Artificial Intelligence , Addison-Wesley Publishing 
 Company, 1977. 
 
 [17] Michalskl, R.S. and Larson, J.B. Selection of most representative 
 training examples md incremental generation of VL. hypotheses: the 
 underlying methodology and the description of programs ESEL and AQ11, 
 Report No. 867, Department of Computer Science, University of 
 Illinois, Urbana, Illinois, 1978. 
 
 [18] Nilsson, Nils, T., Principles of Artificial Intelligence , Tioga 
 Publishing Company, 1980. 
 
 [19] Stepp, R., A Description and User's Guide for CLUSTER/PAF — a 
 Program for Conjunctive Conceptual Clustering, to appear as a report 
 of the Department of Computer Science, University of Illinois, 
 Urbana, Illinois, 1980. 
 
 [20] Michalskl, R. S. A Planar Geometrical Model for Representing 
 Multidimensional Discrete Spaces and Multiple-Valued Logic Functions, 
 Report No. 897, Department of Computer Science, University of 
 Illinois, Urbana, Illinois, 1978. 
 
BIBLIOGRAPHIC DATA 
 SHEET 
 
 1. Report No. 
 
 UIUCDCS-R-80-1026 
 
 3. Recipient's Accession No. 
 
 4. Title and Subtitle 
 
 KNOWLEDGE ACQUISITION THROUGH CONCEPTUAL CLUSTERING: A 
 Theoretical Framework and an Algorithm for Partitioning Data 
 into Conjunctive Concepts 
 
 5. Report Date 
 May 1980 
 
 6. 
 
 7. Author(s) 
 
 Ryszard S. Michalski 
 
 8. Performing Organization Rept. 
 No. 
 
 9. Performing Organization Name and Address 
 
 Department of Computer Science 
 
 University of Illinois at Urbana-Champaign 
 
 Urbana, IL 61801 
 
 10. Project/Task/Work Unit No. 
 
 11. Contract/Grant No. 
 
 NSF MCS 79-06614 
 
 12. Sponsoring Organization Name and Address 
 
 National Science Foundation 
 Washington, DC 
 
 13. Type of Report & Period 
 Covered 
 
 14. 
 
 15. Supplementary Notes 
 
 16. Abstracts The conventional methods of cluster analysis partition given entities into 
 clusters of "similar" entities, using a similarity function which takes into considera- 
 tion only the information about the entities themselves. Therefore, clusters obtained 
 this way do not usually have any simple conceptual interpretation. The paper presents 
 an approach to clustering (called conceptual clustering), in which entities are assemblec 
 into a single cluster not because of their pairwise similarity, but because together 
 they represent a certain concept. In the presented theory and algorithm, the concepts 
 characterizing clusters are single conjunctive statements involving relations on vari- 
 ables characterizing the entities. Thus, the algorithm not only clusters entities, but 
 also provides descriptions of the obtained clusters. The algorithm is iterative and its 
 general structure is based on the dynamic clustering method. In one of the testing 
 examples the implemented algorithm was able to re-discover the correct classification of 
 an unordered collection of cases of four different soybean diseases, and find a descrip- 
 tion of each disease which was compatible with a characterization of this disease by 
 
 17. Key Words and Document Analysis. 17a. Descriptors 
 
 Cluster analysis 
 Data analysis 
 Learning without Teacher 
 Knowledge Acquisition 
 Numerical Taxonomy 
 Pattern Recognition 
 Inductive Inference 
 Classification theory 
 
 17b. Identifiers/Open-Ended Terms 
 
 plant pathologists 
 
 17c. COSATI Field/Group 
 
 18. Availability Statement 
 
 19. Security Class (This 
 Report) 
 
 UNCLASSIFIED 
 
 cunty Class (Thi 
 
 20. Security Class (This 
 Page 
 
 UNCLASSIFIED 
 
 21. No. of Pages 
 
 A3 
 
 22. Price 
 
 FORM NTIS-38 (10-70) 
 
 USCOMM-DC 40329-P7 1 
 
JUH 3 W8\