UIUCDCS-R-80-1007                                          UILU-ENG 80-1710

LEARNING AND GENERALIZATION OF STRUCTURAL DESCRIPTIONS:
Evaluation Criteria and Comparative Review of Selected Methods

by

Thomas G. Dietterich and Ryszard S. Michalski

February 1980

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

    Thomas G. Dietterich                  Ryszard S. Michalski
    Computer Science Department           Department of Computer Science
    Stanford University                   University of Illinois
    Stanford, California 94305            Urbana, Illinois 61801

ABSTRACT

Some recent work in the area of learning structural descriptions from examples is reviewed in light of the need in many diverse disciplines for programs which can perform conceptual data analysis. Such programs describe complex data in terms of logical, functional, and causal relationships which cannot be discovered using traditional data analysis techniques. Various important aspects of the problem of learning structural descriptions are examined, and criteria for evaluating current work are presented. Methods published by Buchanan et al. [1-3,19], Hayes-Roth [5-8], and Vere [21-24] are analyzed according to these criteria and compared to a method developed by the authors. Finally, some goals are suggested for future research.

Key words: machine learning, inductive inference, knowledge acquisition, structural learning, computer inference

This research was supported in part by the National Science Foundation under grants MCS-76-22940 and MCS-79-06614. This report was submitted for publication in the Artificial Intelligence Journal.

1. INTRODUCTION

1.1 Motivation and Basic Concepts

There are many problem areas where large volumes of data are generated about a class of objects, the behavior of a system, a process, etc. Scientists in fields as diverse as agriculture, chemistry, and psychology are faced with the need to analyze such data in order to detect regularities and common patterns. Traditional tools for data analysis include various statistical techniques, curve-fitting techniques, numerical taxonomy, etc. These methods, however, are often not satisfactory because they impose an overly restrictive mathematical framework on the scope of possible solutions. For example, statistical methods describe the data in terms of probability distribution functions placed on random variables. As a result, the types of patterns which they can discover are limited to those which can be expressed by placing constraints upon the parameters of various probability distribution functions.
Because of the mathematical frameworks upon which they are based, traditional methods cannot detect conceptual patterns such as the logical, causal, or functional relationships that are typical of descriptions produced by humans. This is a well-known problem in AI: a system, in order to learn something, must first be able to express it. The solution requires introducing more powerful representations for hypotheses and developing corresponding techniques of data analysis and pattern discovery. Work done in AI and related areas on computer induction and learning structural descriptions from examples has laid the groundwork for research in this area. This is not accidental, because, as Michie [17] has pointed out, the development of systems which deal with problems in human conceptual terms is a fundamental characteristic of AI research.

In this paper, we examine some of the recent work in AI on the subject of learning and generalization of structural descriptions. In particular, we will review four recent methods of inductive generalization: Buchanan et al., Hayes-Roth, Vere, and our own work (earlier well-known work by Winston was recently reviewed by Knapman [10]). We also outline some goals for research in this area. Attention is given primarily to the simplest form of generalization, namely the maximally specific conjunctive statements which characterize a single set of input events (called, for short, conjunctive generalizations). The reason for this choice is that most work done in this area addresses this quite restricted subject. Many of the researchers whose work we review in this paper have done work on other aspects of machine learning, including generalization using negative examples (Vere, Michalski) and developing discriminant descriptions of several classes of objects (Michalski). Due to space limitations, we have been unable to include these topics in this paper. Instead, these contributions are mentioned in the sections concerning extensions.

We begin the analysis by first discussing several important aspects of the problem of learning conceptual descriptions:

- types of descriptions: characteristic versus discriminant
- forms of descriptions
- types of generalization processes involved in generalizing descriptions (rules of generalization)
- constructive versus non-constructive induction
- general versus problem-oriented methods of induction.

1.2 Types of Descriptions

We distinguish between characteristic and discriminant descriptions [15]. A characteristic description is a description of a single set of objects (examples, events) which is intended to discriminate that set of objects from all other possible objects. For example, a characteristic description of the set of all tables would discriminate any table from all things which are non-tables. Psychologists consider this problem under the name of concept formation (e.g., Hunt [9]). Since it is impossible to examine all other possible objects, a characteristic description is usually developed by specifying all characteristics which are true for all known objects of the class (positive examples). Alternatively, in some problems there are available so-called "near misses" which can be used to more precisely circumscribe the given class.

A discriminant description is a description of a single class of objects in the context of a fixed set of other classes of objects.
It states only those properties of objects in the class under consideration which are necessary to distinguish them from the objects in the other classes. A characteristic description can be viewed as a discriminant description in which the given class is discriminated against infinitely many alternative classes.

In this paper we restrict ourselves to the problem of determining characteristic descriptions. The problem of determining discriminant descriptions has been studied by Michalski and his collaborators [12-16].

1.3 Forms of Descriptions

Descriptions, either characteristic or discriminant, may take several forms. In this paper we concentrate on generalizations in conjunctive form. Other forms include disjunctions, exceptions, production rules of various types, hierarchical and multilevel descriptions, semantic nets, and frames.

1.4 Generalization Rules

The process of inducing a general description from examples can be viewed as a process of applying certain generalization rules to the initial descriptions to transform them into more general output descriptions. This viewpoint permits one to characterize various methods of induction by specifying the rules of generalization which they use. Below is a brief review of various generalization rules based on the paper [16].

i) Dropping Condition Rule. If a description is viewed as a conjunction of conditions which must be satisfied, then one way to generalize it is to drop one or more of these conditions. For example:

    red(x) & big(x)  |<  red(x)

(this reads: "the description 'xs which are red and big' can be generalized to the description 'xs which are red'"; |< denotes the generalization operator).

ii) Turning Constants to Variables Rule. If we have two or more descriptions, each of which refers to a specific object (in a set to be characterized), we can generalize these by creating one description which contains a variable in place of the specific object:

    tall(Fred) man(Fred)  |
                          |<  tall(x) man(x)
    tall(Jim)  man(Jim)   |

assuming that the value set of x is {Fred, Jim, ...}. 'x' can be interpreted as representing 'a person from the group under consideration.'

These first two rules of generalization are the rules most commonly used in the literature on computer induction. Both rules can, however, be viewed as special cases of the following rule.

iii) Generalizing by Internal Disjunction Rule. A description can be generalized by extending the set of values that a descriptor (i.e., variable, function, or predicate) is permitted to take on in order that the description is satisfied. This process involves an operation called the internal disjunction. For example:

    shape(x, square)    |
                        |<  shape(x, (square or triangle or rectangle))
    shape(x, triangle)  |

where statements on the left of |< describe some single objects in a class, and the statement on the right is a plausible generalization. Using the notation of the variable-valued logic system VL21 [16], this rule can be expressed somewhat more compactly:

    [shape(x)=square]    |
                         |<  [shape(x)=square, triangle, rectangle]
    [shape(x)=triangle]  |

The ',' in the expression on the right of the |< denotes the internal disjunction. Although it may seem at first glance that the internal disjunction is just a notational abbreviation, this operation appears to be one of the fundamental operations people use in generalizing descriptions.
In general, this rule can be expressed:

    W [L = R1]  |<  W [L = R2]

where W is some condition, and R1 and R2 are sets of values linked by internal disjunction, with R1 a subset of R2. There are two other important special cases of this rule: first, when the descriptor involved takes on values which are linearly ordered (a linear descriptor), and second, when the descriptor takes on values which represent concepts at various levels of generality (a structured descriptor). In the case of a linear descriptor we have:

iv) Closing Interval Rule. For example, suppose two objects of the same class have all the same characteristics except that they have different sizes, a and b. Then it is plausible to hypothesize that all objects which share these characteristics but which have sizes between a and b are also in this class.

    W [size(x1) = a]  |
                      |<  W [size(x) = a..b]
    W [size(x2) = b]  |

In the case of structured descriptors we have:

v) Climbing Generalization Tree Rule. Suppose the value set of the shape descriptor is the tree of concepts:

    plane geometric figure
        polygon
            triangle
            rectangle
        oval figure
            ellipse
            circle

With this tree structure, values such as triangle and rectangle can be generalized by climbing the generalization tree:

    [shape(x) = rectangle]  |
                            |<  [shape(x) = polygon]
    [shape(x) = triangle]   |

1.5 Constructive Induction

Most methods of induction produce descriptions which involve the same descriptors which were present in the initial data. These methods operate by selecting descriptors from the input data and putting them into a form which is an appropriate generalization. Such methods perform non-constructive induction. A method performs constructive induction if it includes mechanisms which can generate new descriptors not present in the input data. These new descriptors are generated by applying rules of constructive induction. Such rules may be written as procedures or as production rules and may be based on general knowledge or on problem-oriented knowledge (for examples of constructive generalization rules see [16]).

Constructive induction rules can interpret the input data in terms of knowledge about the problem domain. Frequently, the solution to a problem depends upon finding the proper description for the problem, as in the mutilated checkerboard problem. An inductive program should contain facilities for constructive induction, including a library of general constructive induction rules. The user should be able to suggest new rules for the program to examine. In order to activate those rules which would be most useful, the program must be able to efficiently search the space of possible constructive induction rules.

Programs which perform constructive induction are more likely to find useful and interesting patterns in complex data, since they have the ability to examine the data using many different representations.
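To make the generalization rules of section 1.4 concrete, here is a minimal sketch in Python. It is not from the original report: the dictionary encoding of events and the function signatures are our own illustrative assumptions. An event is encoded as a mapping from descriptors to value sets, so that a set with several values directly represents an internal disjunction.

    # A minimal sketch of the generalization rules of section 1.4.
    # An event is a dict mapping a descriptor name to a set of values;
    # a set with several values represents an internal disjunction.

    def drop_condition(description, descriptor):
        """Rule (i): generalize by dropping one condition."""
        return {d: v for d, v in description.items() if d != descriptor}

    def internal_disjunction(desc1, desc2):
        """Rule (iii): generalize two descriptions over their shared
        descriptors by taking the union of their value sets."""
        return {d: desc1[d] | desc2[d] for d in desc1 if d in desc2}

    def close_interval(desc1, desc2, linear):
        """Rule (iv): for linear descriptors, extend two values to the
        whole interval between them; others fall back to rule (iii)."""
        out = {}
        for d in desc1.keys() & desc2.keys():
            values = desc1[d] | desc2[d]
            if d in linear:
                lo, hi = min(values), max(values)
                values = set(range(lo, hi + 1))
            out[d] = values
        return out

    def climb_tree(desc, descriptor, parent):
        """Rule (v): replace each value by its parent concept in a
        generalization tree, given as a child -> parent dict."""
        return {**desc, descriptor: {parent[v] for v in desc[descriptor]}}

    e1 = {"shape": {"square"},   "size": {2}}
    e2 = {"shape": {"triangle"}, "size": {5}}
    print(drop_condition(e1, "size"))            # shape = square
    print(internal_disjunction(e1, e2))          # shape in {square, triangle}
    print(close_interval(e1, e2, linear={"size"}))   # size in 2..5
    print(climb_tree(e1, "shape", {"square": "polygon"}))  # shape = polygon

The turning-constants-to-variables rule is omitted above because, in this encoding, it reduces to renaming; section 2.3.2 below makes the same observation about INDUCE, where it is obtained as a special case of internal disjunction.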
1.6 General versus Problem-oriented Methods

It is a common view that general methods of induction, although mathematically elegant and theoretically applicable to many problems, are in practice very inefficient and rarely lead to any interesting solutions. This opinion seems to have led certain workers to abandon (at least temporarily) work on general methods and to concentrate on some specific problem (e.g., Buchanan et al. [1,2,3] or Lenat [11]). This approach often leads to interesting and practical solutions. On the other hand, it is often difficult to extract general principles of induction from such problem-specific work. It is also difficult to apply such special-purpose programs to new areas.

An attractive possibility for resolving this dilemma is to develop methods which incorporate various general principles of induction (including constructive induction) together with mechanisms for using exchangeable packages of problem-specific knowledge. In this way a general method of induction, provided with an appropriate package of knowledge, could be both easily applicable to different problems and also efficient and practically useful. This idea underlies the development of the INDUCE programs [13,15,16].

2. COMPARATIVE REVIEW OF SELECTED METHODS

2.1 Evaluation Criteria

We evaluate the selected methods of induction in terms of several criteria considered especially important in view of the remarks in section 1.

i) Adequacy of the representation language. The language used to represent input data and output generalizations determines to a large extent the quality and usefulness of the output descriptions. Although it is difficult to assess the adequacy of a representation language out of the context of some specific problem, recent work in AI has shown that languages which treat all phenomena uniformly must sacrifice descriptive precision. For example, researchers who are attempting to build natural-language systems prefer richer knowledge representations such as frames and semantic nets (with their tremendous variety of syntactic forms) to more uniform and less structured representations such as attribute-value lists and PLANNER-style databases. In our own work on inductive learning, we have chosen to use the representation language VL21 (see below), which has a wider variety of syntactic forms than our earlier language VL1. Although languages with many syntactic forms do provide greater descriptive precision, they also make the induction process more complex. In order to control this complexity, a compromise must be sought between uniformity and richness of forms. In the evaluation of each method, a review of the operators and syntactic forms of each description language is provided.

ii) Rules of generalization implemented. The generalization rules implemented in each algorithm are listed.

iii) Computational efficiency. The exact analysis of the computational efficiency of these algorithms is very difficult, due both to the inherent complexity of the algorithms and to the lack of precise formulations of the algorithms in available publications. However, it seems useful to have some data comparing the efficiency of these algorithms, even if that data is approximate and based on hand-simulations. To get some indication of the efficiency, we measure the total number of description generations or comparisons required by each method to perform a test example (see Fig. 1). We also measure the ratio of the number of output conjunctive generalizations to the total number of generalizations examined on this example. Since these numbers are derived from only one example, it is not appropriate to draw strong conclusions from them concerning the general performance of the algorithms. Our conclusions are based primarily on the general behavior of the algorithms.

iv) Flexibility and extensibility.
Mere conjunctive characteristic generalizations are not particularly useful for conceptual data analysis because of their limited format and their lack of formal mechanisms for handling errors in the input data. It is important in evaluating these algorithms to consider the ease with which each method could be extended to:

a) discover descriptions with forms other than conjunctive generalizations (see section 1.3),
b) include mechanisms which facilitate the detection of errors in the input data,
c) provide a general facility for incorporating domain-specific knowledge into the induction process as an exchangeable package (ideally, the domain-specific knowledge should be isolated from the general-purpose inductive process), and
d) perform constructive induction.

It is difficult to assess the flexibility and extensibility of the algorithms presented here. We base our evaluation on the general approaches of the methods and on extensions which have already been made to them.

In the following sections, we describe each method by presenting the description language used, sketching the underlying algorithm, and evaluating the method in terms of the above criteria. Each method will be illustrated using the test example shown in Fig. 1.

[Figure 1: the test example, three input events, each consisting of several objects of varying shape, size, and texture; the drawing is not reproduced here.]

2.2 Data-driven Methods: Hayes-Roth and Vere

Methods can be divided into bottom-up (data-driven), top-down (model-driven), and mixed methods. Bottom-up methods generalize the input events pairwise until the final conjunctive generalization is computed:

    E1 = G1;  G1 + E2 -> G2;  G2 + E3 -> G3;  G3 + E4 -> G4;  ...

G2 is the set of conjunctive generalizations of E1 and E2. Gi is the set of conjunctive generalizations obtained by taking each element of Gi-1 and generalizing it with Ei. We consider here only the methods described by Hayes-Roth and Vere. Other bottom-up methods include the candidate elimination approach described by Mitchell [18] and the Uniclass method described by Stepp [20].

2.2.1 Hayes-Roth: Program SPROUTER [5-8]

Hayes-Roth uses the term maximal abstraction or interference match for the maximally specific conjunctive generalization. He uses parameterized structural representations (PSRs) to represent both the input events and their generalizations. For example, consider the two events described in Fig. 2:

[Figure 2: two events. E1: a small circle on top of a small square. E2: a small circle on top of a large square, with another small circle inside the square.]

The PSRs for these could be:

    E1: {{circle:a}{square:b}{small:a}{small:b}{ontop:a, under:b}}

    E2: {{circle:c}{square:d}{circle:e}
        {small:c}{large:d}{small:e}
        {ontop:c, under:d}{inside:e, outside:d}}

The expressions such as {small:a} are case frames made up of case labels (small, circle, etc.) and parameters (a, b, c, d). The PSR can be interpreted as a conjunction of predicates of the form small(a), where the parameters are existentially quantified variables which are assumed to be distinct.

The interference match attempts to find the longest one-to-one match of parameters and case frames (i.e., the longest common subexpression). This is accomplished in two steps. First, the case relations in E1 and E2 are matched in all possible ways to obtain the set M. Two case relations match if all of their case labels match.
Each element of M is a case relation and a list of parameter correspondences which permit that case relation to match in both events:

    M = {{circle:((a/c)(a/e))}{square:((b/d))}
        {small:((a/c)(b/c)(a/e)(b/e))}
        {ontop,under:((a/c b/d))}}

The second step involves selecting a subset of the parameter correspondences in M such that all parameters can be bound consistently. This is conducted by a breadth-first search of the space of possible bindings, with pruning of unpromising nodes. The search can be visualized as a node-building process.

[The original report shows one such (pruned) search as a tree whose nodes are drawn from the correspondences in M, e.g., {ontop,under}: a/c b/d.]

The nodes are numbered in order of generation. One at a time, a node is examined and joined with all other consistent nodes which have already been examined. The nodes 5, 8, and 9 are conjunctive generalizations. Node 9 binds a to c (to give 1) and b to d (to give 2) to produce the conjunction:

    {{circle:1}{square:2}{small:1}{ontop:1, under:2}}

The node-building process is guided by computing a utility value for each candidate node to be built. The nodes are pruned by setting an upper limit on the total number of possible nodes and pruning nodes of low utility when that limit is reached.

Evaluation:

i) Representational adequacy. The algorithm discovers the following conjunctive generalizations of the example in Fig. 1:

1. {{ontop:1, under:2}{medium:1}{clear:1}}
   There is a medium, clear object on top of something.

2. {{ontop:1, under:2}{medium:1}{large:2}{clear:2}}
   There is a medium object on top of a large, clear object.

3. {{medium:1}{clear:1}{large:3}{clear:3}{shaded:2}}
   There is a medium-sized clear object, a large-sized clear object, and a shaded object.

PSRs provide two symbolic forms: parameters and case labels. The case labels can express ordinary predicates and relations easily. Symmetric relations may be expressed by using the same label twice, as in {same!size:a, same!size:b}. The only operator is the conjunction. The language has no disjunction or internal disjunction. As a result, the fact that the top element in Fig. 1 is always either a square or a diamond cannot be discovered.

ii) Rules of generalization. The method uses the dropping condition and turning constants to variables rules.

iii) Computational efficiency. On our test example, the algorithm requires 22 comparisons and generates 20 candidate conjunctive generalizations, of which 6 are retained. This gives a figure of 6/20 or 30% for computational efficiency. Four separate interference matches are required, since the first match of E1 and E2 produces three possible conjunctive generalizations.

iv) Flexibility and extensibility. Hayes-Roth has indicated (personal communication) that this method has been extended to produce disjunctive generalizations and to detect errors in data. Hayes-Roth has applied this method to various problems in the design of the speech understanding system Hearsay II. However, no facility has been developed for incorporating domain-specific knowledge into the generalization process. Also, no facility for constructive induction has been incorporated, although Hayes-Roth has developed a technique for converting a PSR to a lower-level, finer-grained, uniform PSR. This transformation permits the program to develop descriptions which involve a many-to-one binding of parameters.
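To make the interference match concrete, the following Python fragment is a rough reconstruction of the two steps just described. It is our own illustration, not SPROUTER itself: in particular, it substitutes an exhaustive search over subsets of matched case frames for SPROUTER's pruned, utility-guided node-building.

    # A rough sketch of the interference match (our reconstruction, not
    # SPROUTER's code). A PSR is a list of case frames; a case frame is
    # a (labels, params) pair, e.g. (("ontop", "under"), ("a", "b")).

    from itertools import combinations, product

    def case_matches(e1, e2):
        """Step 1: pair case frames whose label tuples are equal,
        recording the parameter correspondences each pairing induces."""
        return [(l1, tuple(zip(p1, p2)))
                for (l1, p1), (l2, p2) in product(e1, e2) if l1 == l2]

    def consistent(correspondences):
        """Step 2 test: bindings must be one-to-one in both directions."""
        fwd, bwd = {}, {}
        for x, y in correspondences:
            if fwd.setdefault(x, y) != y or bwd.setdefault(y, x) != x:
                return False
        return True

    def interference_match(e1, e2):
        """Return a longest consistent subset of matched case frames,
        i.e., a maximally specific conjunctive generalization."""
        m = case_matches(e1, e2)
        for k in range(len(m), 0, -1):
            for subset in combinations(m, k):
                pairs = [c for _, corr in subset for c in corr]
                if consistent(pairs):
                    return subset
        return ()

    E1 = [(("circle",), ("a",)), (("square",), ("b",)),
          (("small",), ("a",)), (("small",), ("b",)),
          (("ontop", "under"), ("a", "b"))]
    E2 = [(("circle",), ("c",)), (("square",), ("d",)),
          (("circle",), ("e",)), (("small",), ("c",)),
          (("large",), ("d",)), (("small",), ("e",)),
          (("ontop", "under"), ("c", "d")),
          (("inside", "outside"), ("e", "d"))]
    print(interference_match(E1, E2))
    # circle, square, small, and ontop/under all match under a->c, b->d,
    # giving {{circle:1}{square:2}{small:1}{ontop:1, under:2}}.

On the two events of Fig. 2 this returns the four matched frames that bind a to c and b to d, corresponding to node 9 above.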
2.2.2 Vere: Program Thoth [21-24]

Vere uses the term maximal conjunctive generalization or maximal unifying generalization to denote the maximally specific conjunctive generalization. Each event is represented as a conjunction of literals. A literal is a parenthesized list of constants called terms. For example, the objects in Fig. 2 would be described:

    E1: (circle a)(square b)(small a)(small b)(ontop a b)

    E2: (circle c)(square d)(circle e)(small c)(large d)(small e)
        (ontop c d)(inside e d)

Although these resemble Hayes-Roth's PSRs, they are quite different. There are no distinguished symbols. All terms are treated uniformly.

The algorithm operates in four steps. First, the literals in each of the two events to be generalized are matched in all possible ways to generate the set of matching pairs MP. Two literals match if they contain the same number of constants and they share a common term in the same position. For the example of Fig. 2,

    MP = { ((circle a),(circle c)), ((circle a),(circle e)),
           ((square b),(square d)), ((small a),(small c)),
           ((small a),(small e)),   ((small b),(small c)),
           ((small b),(small e)),   ((ontop a b),(ontop c d)) }

The second step involves selecting all possible subsets of MP such that no single literal of one event is paired with more than one literal in the other event. Each of these subsets eventually forms a new generalization of the original events.

In the third step, each subset of matching pairs selected in step 2 is extended by adding to the subset additional pairs of literals which did not previously match. A new pair p is added to a subset S of MP if each literal in p is related to some other pair q in S by a common constant in a common position. For example, if S contained the pair ((square b),(square d)), then we could add to S the pair ((ontop a b),(inside e d)), because the third element of (ontop a b) is the second element of (square b), and the third element of (inside e d) is the second element of (square d) (Vere calls this a 3-2 relationship). We continue adding new pairs until no more can be added.

In step 4, the resulting set of pairs is converted into a new conjunction of literals by merging each pair to form a single literal. Constants which do not match are turned into new constants which may be viewed as variables. For example, ((circle a),(circle c)) would be converted to (circle 1).
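Steps 1 and 4 are straightforward to state in code. The sketch below is our own illustrative reconstruction, not Thoth itself; as noted in the evaluation that follows, Vere's publications do not fully specify the search strategy for step 2, so only the matching and merging steps are shown.

    # Illustrative sketch of steps 1 and 4 of Vere's method (our own
    # reconstruction, not Thoth). A literal is a tuple of terms, e.g.
    # ("ontop", "a", "b"); an event is a list of literals.

    def matching_pairs(e1, e2):
        """Step 1: pair literals of equal length which share a common
        term in a common position."""
        return [(l1, l2) for l1 in e1 for l2 in e2
                if len(l1) == len(l2)
                and any(t1 == t2 for t1, t2 in zip(l1, l2))]

    def merge(pairs):
        """Step 4: merge each pair into one literal, replacing each
        distinct pair of unequal terms by a new constant (variable)."""
        new_names, merged = {}, []
        for l1, l2 in pairs:
            literal = []
            for t1, t2 in zip(l1, l2):
                if t1 == t2:
                    literal.append(t1)
                else:
                    key = (t1, t2)
                    new_names.setdefault(key, str(len(new_names) + 1))
                    literal.append(new_names[key])
            merged.append(tuple(literal))
        return merged

    E1 = [("circle", "a"), ("square", "b"), ("small", "a"),
          ("small", "b"), ("ontop", "a", "b")]
    E2 = [("circle", "c"), ("square", "d"), ("circle", "e"),
          ("small", "c"), ("large", "d"), ("small", "e"),
          ("ontop", "c", "d"), ("inside", "e", "d")]

    MP = matching_pairs(E1, E2)   # yields exactly the eight pairs above
    print(merge([(("circle", "a"), ("circle", "c")),
                 (("square", "b"), ("square", "d")),
                 (("ontop", "a", "b"), ("ontop", "c", "d"))]))
    # -> [('circle', '1'), ('square', '2'), ('ontop', '1', '2')]

Note that ((ontop a b),(inside e d)) is correctly absent from MP, since those literals share no term in a common position; it can only enter a generalization through the step 3 extension described above.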
Evaluation:

i) Representational adequacy. When applied to the test example (Fig. 1), this algorithm produces many generalizations. A few of the significant ones are listed here:

1. (ontop 1 2)(medium 1)(large 2)(clear 2)(clear 3)(shaded 4)(5 4)
   There is a medium object on top of a large clear object. Another object is clear. There is a shaded object. (Note also the vacuous relationship 5, derived from unifying circle and triangle.)

2. (ontop 1 2)(clear 1)(medium 1)(9 1)(5 3 4)(shaded 3)(7 3)(6 3)(clear 4)(large 4)(8 4)
   There is a medium, clear object on top of some other object, and there are two objects related in some way (5) such that one is shaded and the other is large and clear. (Note the vacuous relationships 6, 7, 8, and 9.)

3. (ontop 1 2)(medium 1)(clear 2)(large 2)(5 2)(shaded 3)(7 3)(clear 4)(6 4)
   There is a medium object on top of a large clear object. There is a shaded object and there is a clear object. (Note the vacuous relationships 5, 6, and 7.)

The representation is very general. By convention, the first symbol of a literal can be interpreted as a predicate symbol. The algorithm, however, treats all constants uniformly. This creates difficulties. For instance, the algorithm generates vacuous literals in certain situations. Literals can be formed by pairing (red x) with (big y) to produce meaningless generalizations. One advantage of this relaxation of semantic constraints is that the program can discover conjunctive generalizations involving a many-to-one binding of variables. The language contains only a conjunction operator. No disjunction or internal disjunction is included.

ii) Rules of generalization. The algorithm implements the dropping condition rule and the turning constants to variables rule.

iii) Computational efficiency. From the published articles [21-24] it is not clear how to perform step 2. The space of possibilities is very large, and an exhaustive search could not possibly give the computation times which Vere has published. It would be interesting to find out what heuristics are being used to guide the search.

iv) Flexibility and extensibility. Vere has published algorithms which discover descriptions with disjunctions [23] and exceptions [24]. He has also developed techniques to generalize relational production rules [22,23]. The method has been demonstrated using the traditional AI toy problems of IQ analogy tests and blocks-world sequences. A facility for using background information to assist the induction process has also been developed. It uses a spreading activation technique to extract relevant relations from a knowledge base and add them to the input examples prior to generalizing them. Since the method has been extended to discover disjunctions and exceptions, it would be expected that the method could also operate in noisy environments.

2.3 Model-driven Methods: Buchanan et al. and Michalski

Model-driven methods search a set of possible generalizations in an attempt to find a few "best" hypotheses which satisfy certain requirements. The two methods discussed here search for a small number of conjunctions which together cover all of the input events. The search proceeds by choosing as the initial working hypothesis some starting point in the partially ordered set of all possible descriptions. If the working hypotheses satisfy certain termination criteria, then the search halts. Otherwise, the current hypotheses are modified by slightly generalizing or specializing them. These new hypotheses are then checked to see if they satisfy the termination criteria. The process of modifying and checking continues until the criteria are met. Top-down techniques typically have better noise immunity and can easily be extended to discover disjunctions. The principal disadvantage of these techniques is that the working hypotheses must repeatedly be checked to determine whether they subsume all of the input events.

2.3.1 Buchanan et al.: Program Meta-DENDRAL [1-3,19]

The algorithm which we describe here is taken from the RULEGEN program (part of the Meta-DENDRAL system). Meta-DENDRAL was designed to discover cleavage rules to explain mass spectrometry data. The descriptive language is based on the ball-and-stick model of chemical molecules. Each input event is a bond environment which describes some portion of a molecule. The environment is represented by a graph of the atoms in the molecule, with four descriptors attached to each atom, and forms the left-hand side of a cleavage rule.
The right-hand side of the rule predicts a cleavage based on the existence in a molecule of the left-hand side of the rule (breakbond(**) indicates that the ** bond is predicted to be broken). A typical cleavage rule (with atoms w, x, y, and z) is:

    LEFT-HAND SIDE (BOND ENVIRONMENT):

      Molecule graph:   w ** x -- y -- z

      Atom descriptors:

        atom   type       nhs   nbrs   dots
        w      carbon     3     1      0
        x      carbon     2     2      0
        y      nitrogen   1     2      0
        z      carbon     2     2      0

    RIGHT-HAND SIDE (CLEAVAGE PREDICTION):

      => breakbond(**)

The algorithm chooses as its starting point the most general bond environment (x ** y), with no properties specified for either atom. During the search, this description is grown by successively specializing a property of one of the atoms in the graph or by adding a new atom to the graph. After each specialization, the new graph is checked to see if it is "better" than the parent graph from which it was derived. A daughter graph is better than its parent if it still covers at least half of the input events (it is general enough) and still focuses on only one cleavage process (it is specific enough). The cleavage rules built by this algorithm are further improved by the program RULEMOD.

Evaluation:

i) Representational adequacy. The representation was adequate for the specific task of developing cleavage rules. It was not intended to be a general representation for objects outside of the chemical world. The descriptions can be viewed as conjunctions. Individual rules developed by the program can be considered to be linked by disjunction.

ii) Rules of generalization. The dropping condition and turning constants to variables rules are used "in reverse" during the specialization process. RULEGEN does not seem to have the ability to handle an internal disjunction, but RULEMOD apparently does. For example, it can indicate that the type of an atom is "anything except hydrogen". In similar work on nuclear magnetic resonance (NMR), Mitchell presents an example in which the value of nhs is listed as "greater than or equal to one" (which indicates an internal disjunction).

iii) Computational efficiency. Because this is a problem-specific algorithm, we cannot supply comparison figures here for how this algorithm would work on our test example. The current program is considered to be relatively inefficient [2].

iv) Flexibility and extensibility. Meta-DENDRAL has been extended to handle NMR spectra. The program works well in an errorful environment. It uses domain-specific knowledge extensively. However, there is no strict separation between a general-purpose induction component and a special-purpose knowledge component. It is not clear whether the methods developed for Meta-DENDRAL could be easily applied to any non-chemical domain. The program does not perform constructive induction in any general way. However, the INTSUM program does perform sophisticated transformations on the input spectra in order to develop the bond-environment descriptions.
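The general-to-specific search performed by RULEGEN can be sketched in the abstract. The following Python fragment is our own paraphrase for illustration only: it replaces molecule graphs with flat property dictionaries, invents the names refine and covers, and omits the "focuses on one cleavage process" half of the acceptance test.

    # Abstract sketch of a RULEGEN-style general-to-specific search (our
    # own paraphrase, not Meta-DENDRAL code). A "bond environment" is
    # reduced to a dict of atom properties; a daughter specializes one
    # still-unspecified property of its parent.

    PROPERTIES = {"type.x": ["carbon", "nitrogen"], "nhs.x": [1, 2, 3]}

    def refine(graph):
        """Yield daughters, each fixing one still-unspecified property."""
        for prop, values in PROPERTIES.items():
            if prop not in graph:
                for v in values:
                    yield {**graph, prop: v}

    def covers(graph, event):
        return all(event.get(p) == v for p, v in graph.items())

    def rulegen(events):
        """Grow the most general environment; keep a daughter only if it
        still covers at least half of the input events (general enough).
        The 'one cleavage process' test is omitted in this sketch."""
        half = len(events) / 2
        rules, frontier = [], [{}]
        while frontier:
            graph = frontier.pop()
            for daughter in refine(graph):
                if sum(covers(daughter, e) for e in events) >= half:
                    rules.append(daughter)
                    frontier.append(daughter)
        return rules

    events = [{"type.x": "carbon", "nhs.x": 2},
              {"type.x": "carbon", "nhs.x": 3},
              {"type.x": "nitrogen", "nhs.x": 2}]
    print(rulegen(events))
    # -> [{'type.x': 'carbon'}, {'nhs.x': 2}]

The "covers at least half" threshold is what gives the method its noise immunity: a specialization driven by a single erroneous event will usually fail the coverage test and be discarded.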
2.3.2 Michalski and Dietterich: Program INDUCE 1.2

The algorithm described here is one of three algorithms designed by Michalski and his collaborators. The others are a data-driven method described by Stepp [20] and a mixed method described by Larson and Michalski [12,13]. The language used to describe the input events is VL21, an extension of first-order predicate logic (FOPL) [16]. Each event is represented as a conjunction of selectors. A selector typically contains a function or predicate descriptor (with variables as arguments) and a list of values that the descriptor may assume. The selector [size(x1)=small,medium] asserts that the size of x1 may take the values small or medium. The events in Fig. 2 are represented as:

    E1: [size(x1)=small][size(x2)=small]
        [shape(x1)=circle][shape(x2)=square]
        [ontop(x1,x2)]

    E2: [size(x1)=small][size(x2)=large][size(x3)=small]
        [shape(x1)=circle][shape(x2)=square][shape(x3)=circle]
        [ontop(x1,x2)][inside(x3,x2)]

In this method, descriptors are divided into two classes: attribute descriptors and structure-specifying descriptors. Attribute descriptors describe attributes such as size or shape or distance which are applicable to all variables (representing, e.g., object parts). Structure-specifying descriptors include all other descriptors. They typically represent relationships among variables, such as ontop or inside. Each input conjunction is broken into two conjuncts: one built of selectors containing only attribute descriptors (the attribute conjunct) and one built of selectors containing only structure-specifying descriptors (the structure conjunct).

The algorithm is based on the observation that the structure-specifying descriptors are responsible for the computational complexity of generalizing structural descriptions. If we could determine conjunctions of structure-specifying selectors which were relevant for describing a particular class of objects, then the generalization of the attribute conjuncts could be handled quickly by an appropriate covering algorithm. The algorithm seeks to determine such a set of structure conjuncts which appear likely to be part of a maximally specific conjunctive generalization of all of the input events. It does this by finding conjunctions which are maximally specific generalizations of the input structure conjuncts considered alone. Such conjunctive generalizations of the structure conjuncts must be contained in some maximally specific generalizations of the entire set of input events. However, there may be maximally specific conjunctive generalizations of the input events which contain few if any structure-specifying selectors. The algorithm also finds these generalizations by considering structure conjuncts which are less than maximally specific.

The algorithm operates in two phases. The first phase is the structure-determining phase. A random sample of the input structure conjuncts is taken. This sample becomes the initial set of generalizations G. In each step, G is first pruned to a fixed size by removing unpromising generalizations. Then G is checked to see if any of its generalizations covers all of the structure conjuncts. If any do, they are removed from G and placed in the set C of candidate conjunctive generalizations. Lastly, a new G is formed by taking each element of G and generalizing it in all possible ways by dropping single selectors. When the set of candidates C reaches a prespecified size, the search stops.

The second phase is the attribute-determining phase. In this phase, the problem is converted to a multiple-valued logic covering problem using the VL1 propositional calculus [14,15]. Each candidate cover A in C is matched against all input events and the relevant variables are identified. For each match, the appropriate attribute conjuncts are extracted and used to form a VL1 event.
For example, if A = [ontop(p1,p2)] and

    E1 = [ontop(p1,p2)][ontop(p2,p3)]
         [size(p1)=1][size(p2)=3][size(p3)=5]
         [color(p1)=red][color(p2)=green][color(p3)=blue]

then we get two VL1 events: v1 = (1, 3, red, green) and v2 = (3, 5, green, blue). These are vectors of attributes which correspond here to the descriptors (size(p1), size(p2), color(p1), color(p2)) for p1 and p2 in A. All input events are converted into VL1 events in this manner. In general, more than one VL1 event is created from each input event. The set of VL1 events can be covered using a covering algorithm. A cover could be obtained by forming the union of the values taken on by each VL1 attribute. Such an approach usually leads to overgeneralization, since only one VL1 event derived from each input event need be covered. We use a beam-search technique to select a subset of the VL1 events to be covered.

This two-phase algorithm provides two computational advantages. First, the time required to compare expressions in the structure-determining phase is reduced, because the structure conjuncts are usually much smaller than the full input conjuncts. Second, the manipulation of VL1 formulas is very easy, since they may be represented as bit strings and manipulated using fast bit-parallel operations. The chief disadvantage of this algorithm is that it is difficult to decide when to terminate the structure-determining phase.
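The conversion from structural events to VL1 events is easy to state in code. The sketch below is our own illustration: the data layout and function names are assumptions, and INDUCE itself represents the resulting VL1 events as bit strings for fast covering.

    # Sketch of the attribute-determining conversion (our illustration,
    # not INDUCE source). A structural event holds a set of structure
    # literals, e.g. ("ontop", "p1", "p2"), plus attribute values per
    # variable, e.g. attrs["p1"]["size"] == 1.

    from itertools import permutations

    def vl1_events(cover, event_structs, event_attrs, attributes):
        """For each way the structure conjunct `cover` matches the event,
        emit one VL1 vector of the matched variables' attribute values."""
        variables = sorted({v for lit in cover for v in lit[1:]})
        event_vars = sorted(event_attrs)
        vectors = []
        for assignment in permutations(event_vars, len(variables)):
            binding = dict(zip(variables, assignment))
            bound = {(lit[0], *(binding[v] for v in lit[1:]))
                     for lit in cover}
            if bound <= event_structs:          # the cover matches here
                vectors.append(tuple(event_attrs[binding[v]][a]
                                     for v in variables
                                     for a in attributes))
        return vectors

    structs = {("ontop", "q1", "q2"), ("ontop", "q2", "q3")}
    attrs = {"q1": {"size": 1, "color": "red"},
             "q2": {"size": 3, "color": "green"},
             "q3": {"size": 5, "color": "blue"}}
    cover = [("ontop", "p1", "p2")]
    print(vl1_events(cover, structs, attrs, ["size", "color"]))
    # -> [(1, 'red', 3, 'green'), (3, 'green', 5, 'blue')]

This reproduces the two VL1 events of the example above (with the vector components grouped per variable rather than per descriptor, an arbitrary layout choice in this sketch).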
Evaluation:

i) Representational adequacy. The algorithm discovers, among others, the following generalizations of the events in Fig. 1:

1. [ontop(p1,p2)][size(p1)=medium][shape(p1)=circle,square,rectangle]
   [size(p2)=large][shape(p2)=box,rectangle,ellipse][texture(p2)=clear]
   There is a medium-sized circle, square, or rectangle on top of a large, clear box, rectangle, or ellipse.

2. [ontop(p1,p2)][size(p1)=medium][shape(p1)=polygon][texture(p1)=clear]
   [size(p2)=medium,large][shape(p2)=rectangle,circle]
   There is a clear, medium-sized polygon on top of a medium or large circle or rectangle.

3. [ontop(p1,p2)][size(p1)=medium][shape(p1)=polygon]
   [size(p2)=medium,large][shape(p2)=rectangle,ellipse,circle]
   There is a medium-sized polygon on top of a large or medium rectangle, ellipse, or circle.

4. [size(p1)=small,medium][shape(p1)=circle,rectangle][texture(p1)=shaded]
   There is a shaded object which is either medium or small in size and has a circular or rectangular shape.

This algorithm implements the conjunction, disjunction, and internal disjunction operators. It provides a fairly non-uniform set of representational facilities. Descriptors, variables, and values are all distinguished. Descriptors are further analyzed into structure-specifying descriptors and attribute descriptors. The current method provides for descriptors which have unordered, linearly ordered, and tree-ordered value sets. This variety of possible representations permits a better "fit" between the description language and any specific problem.

ii) Rules of generalization. The algorithm uses all rules mentioned in section 1.4 and also a few constructive induction rules (see below). All constants are coded as variables. The effect of the turning constants to variables rule is achieved as a special case of the generalization by internal disjunction rule.

iii) Computational efficiency. The algorithm requires 28 comparisons and builds 13 rules during the search to develop the descriptions listed above. Four rules are retained, so this gives an efficiency ratio of 4/13 or 30%.

iv) Flexibility and extensibility. The algorithm can easily discover disjunctions by altering the termination criteria for the structure-determining phase to accept structure conjuncts which do not necessarily cover all of the input events. The same general two-phase approach can also be applied to problems of determining discriminant generalizations. Larson and Michalski have done work on determining discriminant classification rules [12,13,14].

The algorithm has good noise immunity. Noise events can be discovered because the algorithm tends to place them in separate terms of a disjunction. Domain-specific knowledge can be incorporated into the program by defining the domains of descriptors, specifying the structures of these domains, specifying certain simple production rules, and by providing constructive induction rules. These forms of knowledge representation are not always convenient, however. Further work should provide other facilities for knowledge representation.

A few simple constructive induction rules have been incorporated into the current implementation as a preprocessor. Other constructive induction rules can be specified by the user. Using the built-in constructive induction rules, the program produces the following conjunctive generalization of the input events in Fig. 1:

    [# p's with texture clear = 2][top-most(p1)][ontop(p1,p2)]
    [size(p1)=medium][shape(p1)=polygon][texture(p1)=clear]
    [size(p2)=medium,large][shape(p2)=circle,rectangle]

There are exactly two clear objects in each event. The top-most object is a medium-sized, clear polygon, and it is on top of a large or medium-sized circle or rectangle. We hope to expand this constructive induction facility in the future.

2.4 Summary

The comparison of the various methods is summarized in Fig. 3. The table shows the distinct advantages and disadvantages of top-down methods as opposed to bottom-up methods. Bottom-up methods tend to be faster, but noise immunity and flexibility suffer as a consequence. Top-down methods have good noise immunity and are easily modified to discover disjunctive and other forms of generalization. They do tend to be computationally more expensive. By separating the structure-determining phase from the attribute-determining phase in our method, a considerable speed-up has been achieved.

    Criterion                 Hayes-Roth       Vere              Buchanan et al.     Michalski
    ------------------------------------------------------------------------------------------------
    Intended application      general          general           discovering mass    general
                                                                 spectrometry rules
    Language                  Parameterized    quantifier-free   chemical model      variable-valued
                              Structural       FOPL                                  logic system VL21
                              Representation
    Syntactic concepts        case frames,     literals,         molecule graph,     selectors, descriptors,
                              parameters,      constants         attributes          dummy variables,
                              case labels                                            constants in value sets
    Operators                 and              and               and, or             and, or, internal or

    Generalization rules:
      dropping condition      yes              yes               yes                 yes
      constants to variables  yes              yes               yes                 yes
      internal disjunction    no               no                yes                 yes
      climbing tree           no               no                no                  yes
      closing intervals       no               no                no                  yes

    Efficiency (test example):
      comparisons             22               complete          not applicable      28
                                               algorithm
                                               not known
      conjunctions generated  20               not known         not applicable      13
      ratio output/total      6/20 = 30%       not known         not applicable      4/13 = 30%

    Extensibility:
      applications            speech analysis  blocks world,     mass spectrometry,  soybean disease
                                               analogy problems  NMR                 diagnosis
      disjunctive forms?      no               yes               yes                 yes
      noise immunity          low              probably good     excellent           very good
      domain knowledge?       no               yes               yes, built into     yes
                                                                 the program
      constructive induction? none             no                no                  limited facility

    Figure 3. Summary comparison of the four methods.

3. CONCLUSION

One of the problems of current research on induction is that each research group is using a different formal language and terminology. This makes the exchange of information difficult. This paper was intended to help readers to get a better understanding of the state of the art in this area. Some important problems to be addressed in future research include:

i) the development of adequate formal languages and knowledge representations for hypothesis formulation and modification;

ii) extension of the scope of operators and forms which an inductive program can efficiently use during hypothesis formulation;

iii) the development of general mechanisms of induction which can be guided by problem-specific packets of knowledge; and

iv) incorporation in the program of extensive facilities for constructive induction and multi-level schemes of description. In particular, an inductive program should be able to assign names to various subdescriptions and use these names in the formulation of hypotheses (i.e., generate hierarchical forms).

Finally, an important principle which should guide future research is what we call the principle of comprehensibility. This principle states that the descriptions which an AI program uses and the concepts which it generates should be easily comprehensible by people. In the context of work on induction, the comprehensibility principle requires that the descriptions be short and use operators which can be easily interpreted in natural language. Furthermore, systems should be designed to provide flexible interactive facilities. This approach has been adopted in our work because we expect that the most significant applications of AI inductive programs will be as interactive tools for conceptual data analysis and computer-aided acquisition of rules for knowledge-based expert systems.

4. ACKNOWLEDGMENTS

The authors gratefully acknowledge the partial support of NSF under grants MCS-76-22940 and MCS-79-06614.

5. REFERENCES

[1] Buchanan, B. G., E. A. Feigenbaum, J. Lederberg, "A Heuristic Programming Study of Theory Formation in Science," in Proc. IJCAI-2, 1971, pp. 40-48.

[2] Buchanan, B. G., D. H. Smith, W. C. White, R. J. Gritter, E. A. Feigenbaum, J. Lederberg, C. Djerassi, J. Am. Chem. Soc. 98 (1976), p. 6168.

[3] Buchanan, B. G., E. A. Feigenbaum, "Dendral and Meta-Dendral: Their Applications Dimension," Artif. Intell. 11 (1978), pp. 5-24.

[4] Dietterich, T., "User's Guide for INDUCE 1.1," internal report, Dept. of Comp. Sci., Univ. of Illinois, Urbana, 1978.

[5] Hayes-Roth, F., "Collected Papers on the Learning and Recognition of Structured Patterns," Dept. of Comp. Sci., Carnegie-Mellon Univ., January 1975.

[6] Hayes-Roth, F., "Patterns of Induction and Associated Knowledge Acquisition Algorithms," Dept. of Comp. Sci., Carnegie-Mellon Univ., May 1976.

[7] Hayes-Roth, F., J. McDermott, "Knowledge Acquisition from Structural Descriptions," in Proc. IJCAI-5, 1977, pp. 356-362.

[8] Hayes-Roth, F., J. McDermott, "An Interference Matching Technique for Inducing Abstractions," CACM 21:5, 1978, pp. 401-410.

[9] Hunt, E. B., Experiments in Induction, Academic Press, 1966.
[10] Knapman, J., "A Critical Review of Winston's Learning Structural Descriptions from Examples," AISB Quarterly, Issue 31, September 1978, pp. 319-320.

[11] Lenat, D., "AM: An Artificial Intelligence Approach to Discovery in Mathematics as Heuristic Search," Rept. STAN-CS-76-570, Comp. Sci. Dept., Stanford Univ., July 1976.

[12] Larson, J., R. S. Michalski, "Inductive Inference of VL Decision Rules," SIGART Newsletter, June 1977, pp. 38-44.

[13] Larson, J., "Inductive Inference in the Variable Valued Predicate Logic System VL21: Methodology and Computer Implementation," Rept. No. 869, Dept. of Comp. Sci., Univ. of Illinois, Urbana, May 1977.

[14] Michalski, R. S., "Variable-valued Logic and Its Application to Pattern Recognition and Machine Learning," in Computer Science and Multiple-Valued Logic, D. C. Rine (ed.), North-Holland, 1977, pp. 506-534.

[15] Michalski, R. S., "Toward Computer-aided Induction: A Brief Review of Currently Implemented AQVAL Programs," in Proc. IJCAI-5, 1977.

[16] Michalski, R. S., "Pattern Recognition as Knowledge-Guided Induction," Rept. 927, Dept. of Comp. Sci., Univ. of Illinois, Urbana, 1978 (an updated version to appear in IEEE Trans. on Pattern Analysis and Machine Intelligence, 1980).

[17] Michie, D., "New Face of AI," Experimental Programming Reports No. 33, Machine Intelligence Research Unit, Univ. of Edinburgh, 1977.

[18] Mitchell, T. M., "Version Spaces: A Candidate Elimination Approach to Rule Learning," in Proc. IJCAI-5, MIT, 1977.

[19] Schwenzer, G. M., T. M. Mitchell, "Computer-assisted Structure Elucidation Using Automatically Acquired Carbon-13 NMR Rules," in ACS Symposium Series No. 54, Computer-assisted Structure Elucidation, D. H. Smith (ed.), 1977.

[20] Stepp, R., "Learning without Negative Examples via Variable-Valued Logic Characterizations: The Uniclass Inductive Program AQ7UNI," Rept. No. 982, Dept. of Comp. Sci., Univ. of Illinois, Urbana, 1979.

[21] Vere, S. A., "Induction of Concepts in the Predicate Calculus," in Proc. IJCAI-4, 1975.

[22] Vere, S. A., "Induction of Relational Productions in the Presence of Background Information," in Proc. IJCAI-5, 1977.

[23] Vere, S. A., "Inductive Learning of Relational Productions," in Pattern-Directed Inference Systems, D. A. Waterman and F. Hayes-Roth (eds.), Academic Press, 1978.

[24] Vere, S. A., "Multilevel Counterfactuals for Generalizations of Relational Concepts and Productions," Dept. of Information Engineering, Univ. of Illinois at Chicago Circle, 1978.