UIUCDCS-R-74-671

LEARNING BY INDUCTIVE INFERENCE

by

R. S. Michalski

August, 1974

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

Invited paper for the NATO Advanced Study Institute Seminar on Computer Oriented Learning Processes, Aug. 26 - Sept. 7, 1974, Bonas, France. (Preprint for limited distribution.)

SUMMARY. This paper addresses learning processes which employ inductive inference. A system of variable-valued logic, called VL2, is briefly described and its application to implementing inductive learning processes is discussed. VL2 can be characterized as a 'multi-valued first order predicate logic'. An example is given in which a computer program learns the difference between two classes of objects.

INTRODUCTION

Learning processes can generally be viewed as processes of determining and representing relationships which exist among objects. These relationships are determined and represented within the system which learns ('STUDENT'), using a source of information about the objects ('TEACHER'). It has been observed (e.g., Bongard [1]) that the smaller the degree of STUDENT-oriented organization of the information which the TEACHER provides, the greater must be the complexity of the STUDENT. Consequently, learning processes can be classified according to the degree of organization of the information provided by the TEACHER.
Thus, we can distinguish, e.g., learning 'by being born' (innate capabilities) or by being designed (the greatest organization on the part of the TEACHER), learning by being programmed, learning from examples, learning from observation ('without teacher'), and learning by 'inspiration'. In this paper we are concerned with problems which belong to the area of 'learning from examples'.

Like physical processes, which are governed by a law of minimum energy, it seems (still only intuitively) that information processes, and thus also learning processes, may be governed by a corresponding law of 'minimum-complexity' (or 'maximum-simplicity'). In other words, information processes seem to have an overall tendency to achieve given information processing goals by the simplest means (which, in special cases, just means the minimum number of operations). Evidence of such a tendency in the area of human literary expression is Zipf's law [2]. It seems that all human information processing activities, in particular scientific activities, are oriented toward determining adequate and, at the same time, simple descriptions or explanations of the surrounding environment and phenomena. The ability to create the simplest descriptions, which use only the 'most significant' concepts and disregard the 'irrelevant details', is highly regarded and considered evidence of intelligence.

But how can we formally define such concepts as the 'simplest description'? How can we create machines which have the ability to determine such descriptions? As Banerji [3] pertinently observed, a concept that is simple for one person may not be simple for another. His explanation is that 'there is something in the human mind which, given constant exposure to a concept, however complicated, makes it simple'.
This explanation can be deepened by saying that a seemingly complex concept becomes simple if it is well understood, which, in turn, means that its relationship to well-known concepts has been clearly established. Therefore, in order to define a measure of simplicity of descriptions, two requirements have to be satisfied first:

(1) A language in which descriptions are expressed has to be assumed.

(2) A measure of 'semantic equivalence' of descriptions has to be established. This condition is necessary because, in determining the 'simplest description' of whatever we describe, we want to compare only descriptions which convey the same information (i.e., which are semantically equivalent).

Once (1) and (2) are satisfied, a measure of simplicity of descriptions can be easily formalized. It can be, e.g., a monotonically decreasing function of the length of a description (measured, e.g., by the number of certain assumed constructs of the language which occur in the description). If a 'simplicity function' over the individual constructs is given, then one can consider a weighted sum of constructs. If only a preference order of constructs is assumed, then one could use the lexicographic functional defined by Michalski.

In this paper we present some recent results from our work on the theory and computer implementation of systems which can learn the 'simplest descriptions' by executing an inductive inference process ('inductive learning').

LANGUAGE FOR EXPRESSING DESCRIPTIONS: SYSTEM VL2

The formal system which we are currently developing as a tool for expressing descriptions and implementing inductive learning is a variable-valued logic system VL2. This system is an extension of the system VL1 described by Michalski.
[5,6] The VL2 system gives a sound formal basis for developing an 'algebra of descriptions' which would enable one, for example, to build descriptions, to simplify them, to generalize them to various degrees, to compare descriptions of individual objects or classes of objects, to infer a description of a class of objects from examples of objects of this class, etc.

The full definition of the system VL2 is not yet available. For the purpose of this paper we will briefly and informally describe some* of the concepts of the system which are most relevant to our subject. To keep the description simple, we will relate it to the widely used first order predicate logic (FOPL).

(* In the full definition of VL2 there are more operations than those described here, and the concept of selector has a broader meaning.)

1. In FOPL, the atomic formulas (k-ary predicate symbols followed by k occurrences of variables, function forms and/or constants) are assumed to be binary valued (true or false). In VL2, these formulas (called atomic forms) are treated as functions which, as well as their arguments, range over independent domains. These domains are chosen as most appropriate for the interpretation of the atomic forms and their arguments, or for the problem at hand.

2. The atomic forms occur in a well-formed formula of VL2 (a VL2 formula) as parts of a broader concept, the selector, and are not, generally, VL2 formulas when standing alone (except for the case when a VL2 formula reduces to a FOPL formula).

3. VL2 formulas range over an output domain, denoted D, which is a linearly ordered set having a smallest and a largest element.

4.
The selector is defined as a selector statement, SS, enclosed in brackets:

$$[SS] \tag{1}$$

The selector statement is either a conditional statement:

$$L \;\#\; R \tag{2}$$

or a quantified statement:

$$Q(L \;\#\; R) \tag{3}$$

where

L - called the left part of the conditional statement, or the referee, is either a VL2 formula (see point 5) or a form which can be described as a quantifier-free FOPL formula over atomic forms. It will be assumed for the purpose of this paper that this FOPL formula is in disjunctive normal form, and that 'or' is denoted by ',', 'and' by '.', and negation by a bar over the predicate symbol. For example, the FOPL formula

$$P_1(x, f(y)) \land \lnot P_2(y) \lor P_3(x, y, c) \tag{4}$$

where $P_1(x, f(y))$, $P_2(y)$, $P_3(x, y, c)$ are atomic forms, $x$, $y$ are variables, $f(y)$ is a function of $y$, and $c$ is a constant, is written as

$$P_1(x, f(y)) \,.\, \bar{P}_2(y),\; P_3(x, y, c) \tag{5}$$

\# - denotes '$=$' or '$\neq$'.

R - called the right part of the conditional statement, or the reference, is a subset of the union of the domains of the atomic forms in L, or a VL2 formula.

Q - a sequence of existential, $\exists x_i$, and/or universal, $\forall x_i$, quantifier forms, where the $x_i$ are variables in atomic forms of L.

Examples of a selector:

$$[p(x, y) = 3] \tag{6}$$

$$[p_1(x, a) \,.\, p_2(y, z) = 2, 4] \tag{7}$$

$$[\exists x, \forall y\, (p_1(x, y, b) \lor p_2(y, c) = 0, 2)] \tag{8}$$

A selector in which SS is a conditional statement is called a conditional selector (e.g., (6) and (7)); otherwise it is called a quantified selector (e.g., (8)). A conditional selector $[L \;\#\; R]$ in which the referee L is a single atomic form $P_i$ and the reference R is a subset of its domain is called a simple selector. A simple selector $[P_i = R]$ ($[P_i \neq R]$) is said to be satisfied iff the value of the atomic form $P_i$ is (is not) an element of R.
If $P$, $P_1$ and $P_2$ are atomic forms, then:

$[\bar{P} = R]$ is satisfied iff $[P \neq R]$ is satisfied.

$[\bar{P} \neq R]$ is satisfied iff $[P = R]$ is satisfied.

$[P_1 . P_2 \;\#\; R]$ is satisfied iff $[P_1 \;\#\; R]$ and $[P_2 \;\#\; R]$ are satisfied.

$[P_1, P_2 \;\#\; R]$ is satisfied iff $[P_1 \;\#\; R]$ or $[P_2 \;\#\; R]$ is satisfied.

$[(\exists x)(P \;\#\; R)]$ is satisfied iff, for given values of all free variables in $P$ (i.e., variables other than $x$), there exists a value of $x$ which satisfies the selector $[P \;\#\; R]$.

$[(\forall x)(P \;\#\; R)]$ is satisfied iff, for given values of the free variables, the selector $[P \;\#\; R]$ is satisfied for all values of $x$.

5. A VL2 formula is defined by the following rules:

(i) An element of the output domain D or a selector standing alone is a VL2 formula.

(ii) If $V$, $V_1$ and $V_2$ are VL2 formulas, then so are:

$\lnot(V)$, called the inverse of $V$;

$V_1 \land V_2$ (written also $V_1 V_2$), called the conjunction of $V_1$ and $V_2$;

$V_1 \lor V_2$, called the disjunction of $V_1$ and $V_2$.

A VL2 formula in the form of a disjunction of terms, where a term is a conjunction of selectors and an element of D, is called a disjunctive simple VL2 formula and denoted DVL2. A VL2 formula which includes only conditional selectors is called a conditional or quantifier-free formula. In what follows we will discuss only conditional VL2 formulas.

6. Each VL2 formula $V$ is assigned a value $v(V) \in D$ depending on the values of the atomic forms in it:

(i) The value of an element of D standing alone is this element itself.

(ii) The value of a selector is the largest element of D if the selector is satisfied, and the smallest element of D otherwise.

(iii) If the value of $V$ is the k-th smallest element of D, then the value of the inverse $\lnot(V)$ is the k-th largest element of D. $V_1 \land V_2$ is assigned the smaller of the values of $V_1$ and $V_2$; $V_1 \lor V_2$ is assigned the larger of the values of $V_1$ and $V_2$.

For illustration, below is an example of a VL2 formula and its interpretation:

$$4\,[p_1(x_1, x_2) \,.\, p_2(x_2, x_3) \neq \text{medium}][p_3 = \text{true}] \;\lor\; 3\,[p_3 = \text{unknown}] \;\lor\; 1\,[p_4(x_2, x_4) = \text{yellow}, \text{red}] \tag{9}$$

Suppose that the domains of the atomic forms $p_1(x_1, x_2)$, $p_2(x_2, x_3)$, $p_3$, $p_4(x_2, x_4)$ are, respectively, $D_1 = D_2 = \{\text{small}, \text{medium}, \text{large}\}$, $D_3 = \{\text{unknown}, \text{false}, \text{true}\}$ and $D_4 = \{\text{white}, \text{yellow}, \text{blue}, \text{red}, \text{black}\}$, and that the output domain of the formula (9) is $D = \{0, 1, 2, 3, 4\}$, ordered as indicated by the numbers. The formula (9) is assigned (has) value 4 iff the atomic forms $p_1(x_1, x_2)$ and $p_2(x_2, x_3)$, for the given values of $x_1$, $x_2$ and $x_3$, both take a value not equal to 'medium', and $p_3$ takes the value 'true'. If this condition is not satisfied and $p_3$ takes the value 'unknown', then (9) has value 3. If neither of the above conditions holds and $p_4(x_2, x_4)$, for the given values of $x_2$, $x_4$, takes the value 'yellow' or 'red', then (9) has value 1. If none of the above conditions holds, (9) has value 0.

BASIC CONCEPTS UNDERLYING INDUCTIVE INFERENCE BY MEANS OF VL2

The subject of inductive inference by means of the VL2 system is very broad. Because of space limitations, we will only delineate some of its major concepts.

Suppose that the domains of all atomic forms in a VL2 formula are $D_1, D_2, \ldots, D_n$. The set of all possible sequences of values of the atomic forms, that is, the set $D_1 \times D_2 \times \cdots \times D_n$, is called the event space of the formula, and its elements are called events. The event space of a formula $V$ is denoted by $E(V)$. If the output domain of $V$ is the set D, then $V$ expresses a function

$$f : E(V) \to D \tag{10}$$

The atomic forms in a VL2 formula denote functions of a similar type; namely, an atomic form $p_i(x_1, x_2)$ denotes a function

$$p_i : D_{i_1} \times D_{i_2} \to D_i \tag{11}$$

where $D_i$ is the domain of $p_i(x_1, x_2)$, and $D_{i_1}$ and $D_{i_2}$ are the domains of $x_1$ and $x_2$, respectively. The atomic forms, however, do not express the functions (11); they only denote their names and arguments.
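The evaluation rules of point 6 can be traced on formula (9) with a small program. The following is a minimal sketch and not part of the original paper: it assumes Python, represents an event as a mapping from atomic-form names to values, and all helper names (`selector_value`, `term_value`, `formula_value`, `FORMULA_9`) are hypothetical.

```python
# Sketch of the VL2 evaluation rules (point 6) applied to formula (9).
# All names are illustrative assumptions; an event maps each atomic
# form to the value it takes for the given variable values.

D = [0, 1, 2, 3, 4]          # output domain of formula (9), in increasing order
LOW, HIGH = D[0], D[-1]      # smallest and largest elements of D


def selector_value(satisfied):
    """Rule (ii): a satisfied selector takes the largest element of D, else the smallest."""
    return HIGH if satisfied else LOW


def term_value(constant, selectors, event):
    """A conjunction takes the smallest value among its constant and its selectors."""
    return min([constant] + [selector_value(sel(event)) for sel in selectors])


def formula_value(terms, event):
    """A disjunction takes the largest value among its terms."""
    return max(term_value(constant, selectors, event) for constant, selectors in terms)


# Formula (9) as (constant, [selector predicates]) pairs.
FORMULA_9 = [
    # 4 [p1.p2 != medium][p3 = true]
    (4, [lambda e: e["p1"] != "medium" and e["p2"] != "medium",
         lambda e: e["p3"] == "true"]),
    # 3 [p3 = unknown]
    (3, [lambda e: e["p3"] == "unknown"]),
    # 1 [p4 = yellow, red]
    (1, [lambda e: e["p4"] in {"yellow", "red"}]),
]

event = {"p1": "small", "p2": "large", "p3": "unknown", "p4": "red"}
print(formula_value(FORMULA_9, event))  # prints 3
```

For this event the first term fails on its second selector, so the formula takes the larger of the remaining two term values, in agreement with the interpretation of (9) given above.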
For further considerations we will make the simplifying assumption that these functions are fixed and can be computed for any given values of their input variables.

Let $V_1$ and $V_2$ be two VL2 formulas having comparable sets of atomic forms* (i.e., one set includes or is equal to the other). Let $E$ be a subset of the event space $\mathcal{E}$ specified by the domains of the larger of the two sets of atomic forms. Formulas $V_1$ and $V_2$ are called semantically E-equivalent, which we write

$$V_1 \stackrel{E}{\equiv} V_2 \tag{12}$$

iff for every $e \in E$

$$v(V_1) = v(V_2) \tag{13}$$

If $E = \mathcal{E}$, then $V_1$ and $V_2$ are called semantically equivalent and we write $V_1 \equiv V_2$.

(* Atomic forms are here considered equal if they represent functions which differ only in that some of their arguments are substituted by a value from the domains of the arguments.)

A rule which transforms one formula into another, semantically equivalent formula is called an equivalence-preserving transformation rule. Examples of such rules are given below (read '$\equiv$' as 'the formula on the left side may be replaced by the formula on the right side'). Assume that $V$ is an arbitrary VL2 formula; $P_i$, $P_1$, $P_2$ are atomic forms; $R_1, R_2 \subseteq D_i$ (the domain of $P_i$); and $R \subseteq D_1 = D_2$ (the domains of $P_1$ and $P_2$).

$$V[P_i = R_1] \lor V[P_i = R_2] \equiv V[P_i = R_1 \cup R_2] \tag{14}$$

$$V[P_i \neq R_1] \lor V[P_i \neq R_2] \equiv V[P_i \neq R_1 \cap R_2] \tag{15}$$

If $R_1 \cup R_2 = D_i$ and $R_1 \cap R_2 = \emptyset$ (the empty set), then (14) and (15) reduce to (16) and (17):

$$V[P_i = R_1] \lor V[P_i = R_2] \equiv V \tag{16}$$

$$V[P_i \neq R_1] \lor V[P_i \neq R_2] \equiv V \tag{17}$$

$$V[P_1 = R] \lor V[P_2 = R] \equiv V[P_1, P_2 = R] \tag{18}$$

$$V[P_1 = R][P_2 = R] \equiv V[P_1 . P_2 = R] \tag{19}$$

$$V[P_1 = R][P_2 \neq R] \equiv V[P_1 . \bar{P}_2 = R] \tag{20}$$

Suppose now that the output domain of a DVL2 formula $V$ is a set D whose smallest element is $*$. Suppose further that all elements of D except $*$ denote certain 'specified decisions' about events, and that the element $*$ denotes an 'unspecified decision'. Let $E^+$ and $E^*$ denote the subsets of $E(V)$ for which $V$ takes specified and unspecified decisions, respectively.
Events of $E^+$ are those which satisfy at least one term in $V$ (i.e., satisfy all selectors in the term), while $E^*$ contains the remaining events, i.e., $E^* = E(V) \setminus E^+$. We will call the set $E^+$ the set of recognizable events of $V$ and $E^*$ the set of non-recognizable events of $V$. Elements of $E^*$ will be called *-events.

Let $V_1$ be a VL2 formula and $E_1^+$ its set of recognizable events. A rule which transforms the formula $V_1$ into a new formula $V_2$ (whose set of recognizable events is $E_2^+$) is called a deductive inference rule (DR) if

$$V_1 \stackrel{E_2^+}{\equiv} V_2 \quad \text{and} \quad E_2^+ \subseteq E_1^+ \tag{21}$$

and is called an inductive inference rule (IR) if

$$V_1 \stackrel{E_1^+}{\equiv} V_2 \quad \text{and} \quad E_2^+ \supset E_1^+ \tag{22}$$

According to (22), a rule is an IR iff $V_2$ makes the same specified decisions as $V_1$ for events of $E_1^+$ but also makes specified decisions for some events outside $E_1^+$. The question arises of how these 'other' events should be selected and what decisions should be made about them. To answer this question, a criterion governing an inductive rule is needed. We accept a criterion which can be characterized as a 'criterion of simplicity'. That is, we design a 'simplicity functional' for VL2 formulas (which can be modified according to the application) and employ inductive rules which maximize the assumed functional. An important inductive rule of this type is the one which assigns to *-events decisions that permit one to apply rules (14)-(20) to a given formula whenever this could lead to a simplification of the formula according to the accepted measure of simplicity (which, at the same time, means a generalization of the formula).

An inductive program called AQVAL/1, which operates on such principles, has been developed at the University of Illinois and has already been applied experimentally to selected learning and recognition problems from the areas of medicine [5] and plant pathology (the current version of AQVAL/1 implements a subset of VL2 called VL1).
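The difference between an equivalence-preserving step and an inductive step can be illustrated on a tiny event space. The sketch below is only an illustration under stated assumptions and is not the algorithm of AQVAL/1: the atomic forms, their domains, and the helper names (`merge_terms`, `recognizable_events`) are hypothetical, terms contain only simple selectors of the form [form = R], and all terms carry the same decision.

```python
from itertools import product

# Hypothetical atomic forms and domains, chosen only for illustration.
DOMAINS = {
    "length": {"short", "long"},
    "texture": {"plain", "dotted", "striped"},
}


def merge_terms(t1, t2, domains):
    """Apply rule (14) to two terms that differ in exactly one reference.

    A term maps an atomic form to its reference set R, read as [form = R].
    If the merged reference covers the whole domain, the selector is
    always satisfied and is dropped (rule (16)). Returns None if the
    rule does not apply.
    """
    if t1.keys() != t2.keys():
        return None
    differing = [form for form in t1 if t1[form] != t2[form]]
    if len(differing) != 1:
        return None
    form = differing[0]
    merged = dict(t1)
    merged[form] = t1[form] | t2[form]
    if merged[form] == domains[form]:
        del merged[form]
    return merged


def recognizable_events(terms, domains):
    """E+: the events of the event space that satisfy at least one term."""
    forms = sorted(domains)
    found = set()
    for values in product(*(sorted(domains[f]) for f in forms)):
        event = dict(zip(forms, values))
        if any(all(event[f] in ref for f, ref in term.items()) for term in terms):
            found.add(values)
    return found


t1 = {"length": {"short"}, "texture": {"plain"}}
t2 = {"length": {"short"}, "texture": {"dotted"}}

# Equivalence-preserving step: rule (14) leaves E+ unchanged.
merged = merge_terms(t1, t2, DOMAINS)
same = recognizable_events([merged], DOMAINS) == recognizable_events([t1, t2], DOMAINS)

# Inductive step: give the *-event (short, striped) the same decision,
# which lets rule (16) remove the texture selector altogether.
t3 = {"length": {"short"}, "texture": {"striped"}}
generalized = merge_terms(merged, t3, DOMAINS)
grew = recognizable_events([generalized], DOMAINS) > recognizable_events([t1, t2], DOMAINS)

print(merged, same, generalized, grew)
```

The second step satisfies condition (22): the generalized formula agrees with the original on the original recognizable events and additionally recognizes an event that was previously a *-event, while the formula itself becomes shorter.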
It should be mentioned that the problems of inductive learning by means of variable-valued logic have a strong relationship to the problems of grammatical inference [7].

DESCRIBING OBJECTS IN TERMS OF VL2

In the application of VL2 to describing objects, atomic forms are used to represent certain functions called descriptors. Descriptors are the functions which a learning system uses to describe objects. Let $p_i$ denote a descriptor:

$$p_i : \underset{j \in J}{\times} D_{i_j} \to D_i \tag{23}$$

where

$\times$ denotes the cartesian product,
$J = \{1, 2, \ldots, k\}$,
$D_{i_j}$ are the input domains of the descriptor,
$D_i$ is the output domain of the descriptor.

Special cases of a descriptor:

1. $J = \{1\}$, i.e., $p_i$ is a unary function. If $D_{i_1}$ denotes a set of objects and $D_i$ a set of values of a specific characteristic of the objects, then $p_i$ is called a feature.

2. $J = \{1, 2, \ldots, k\}$, $k = 2, 3, \ldots$, $D_{i_1} = D_{i_2} = \cdots = D_{i_k}$, $D_i = \{\text{true}, \text{false}\}$. If $D_{i_1}$ denotes a set of objects, then $p_i$ can be interpreted as a k-ary relation among these objects. If $p_i(o_{i_1}, o_{i_2}, \ldots, o_{i_k}) = \text{true}$, then we say that the relation among $o_{i_1}, o_{i_2}, \ldots, o_{i_k}$ holds; otherwise it does not hold. If $D_i$ is not a binary-valued set but has a finite number of values, then we will say that $p_i$ is a multi-valued k-ary relation.

As we can see, a descriptor has a very broad meaning.

Example

Suppose that $D_{i_1} = D_{i_2}$ denote a set of parts of a certain physical object. To express the fact that, e.g., a relation 'above' holds between certain parts of the object, we can use a function:

$$\text{ABOVE} : D_{i_1} \times D_{i_2} \to \{\text{true}, \text{false}\} \tag{24}$$

If the relation 'above' holds between $O_1$ and $O_2$ we write $[\text{ABOVE}(O_1, O_2) = \text{true}]$ or, since the output domain is just binary, simply $\text{ABOVE}(O_1, O_2)$. Suppose, however, that we want to distinguish between 3 possibilities: not above, little above, much above.
In this case we assume that

$$D_i = \{\text{not}, \text{little}, \text{much}\} \tag{25}$$

To express the fact that $O_1$ is much above $O_2$, we use the selector

$$[\text{ABOVE}(O_1, O_2) = \text{much}] \tag{26}$$

If in describing a class of objects we observe that the part $O_1$ is either much above or not above the part $O_2$, we would write:

$$[\text{ABOVE}(O_1, O_2) = \text{not}, \text{much}] \tag{27}$$

In describing individual objects we can distinguish the following classes of descriptors:

1. Global, 0-level, descriptors. These are features which characterize objects as a whole (e.g., color, size, texture, length, etc.).

2. Local 1-level descriptors, which characterize basic (1-level) parts and k-ary, k = 2, 3, ..., relationships among them.

3. Local k-level, k = 2, 3, ..., descriptors, which characterize k-level parts and relationships among parts of the (k-1)-level parts.

AN EXAMPLE OF LEARNING THE SIMPLEST DESCRIPTION OF THE DIFFERENCE BETWEEN TWO CLASSES OF OBJECTS

Suppose we want to develop a machine which, given examples of objects from certain classes, could learn the simplest (according to some defined criteria) description of the object classes or of the differences between the classes. Let us assume that the machine already has certain elementary abilities built in, such as the ability to recognize a triangle or a rectangle, to measure their size and orientation, and to determine various relationships between the recognized objects, e.g., the relations 'on top of', 'in between', etc. The problem of implementing abilities of this type is quite difficult by itself, although computer programs have already been developed which can, to a limited degree, measure descriptors of the kind described above (see, e.g., Winston). It is important to observe, however, that the number of such elementary descriptors which may potentially be needed is not very large, and therefore each of them could be implemented by a specially designed software or hardware device.
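The three-valued ABOVE descriptor and the selectors (26) and (27) can be sketched in a few lines. This is only an illustrative sketch, not a procedure from the paper: the part heights, the threshold separating 'little' from 'much', and the names `heights`, `above` and `selector_satisfied` are all assumptions made for the example.

```python
# Hypothetical measurements: vertical position of each part, in arbitrary units.
heights = {"O1": 9.0, "O2": 2.0, "O3": 3.0}

ABOVE_DOMAIN = ("not", "little", "much")   # output domain (25)


def above(a, b, much_threshold=5.0):
    """Three-valued descriptor ABOVE(a, b); the threshold is an assumption."""
    gap = heights[a] - heights[b]
    if gap <= 0:
        return "not"
    return "much" if gap >= much_threshold else "little"


def selector_satisfied(value, reference):
    """A simple selector [P = R] is satisfied iff the value of P is an element of R."""
    return value in reference


# Selector (26): [ABOVE(O1, O2) = much]
print(selector_satisfied(above("O1", "O2"), {"much"}))

# Selector (27): [ABOVE(O1, O2) = not, much], tried on two pairs of parts
print(selector_satisfied(above("O1", "O2"), {"not", "much"}))
print(selector_satisfied(above("O3", "O2"), {"not", "much"}))
```

The point of the sketch is the separation of concerns described in the text: the descriptor is a fixed, computable function, while the selector only tests whether its value falls in a reference set.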
On the other hand, the number of potential combinations of these descriptors which may occur in descriptions of real objects is prohibitively large. Therefore, an important problem, which we address here, is how to implement very efficient inference and learning processes which create goal-oriented descriptions of objects or object classes, assuming that these elementary descriptors are available. This type of problem is illustrated by the following example.

Fig. 1 presents two classes of 'TABLES'. The objective is to implement a learning process which would produce the simplest description, with regard to an assumed simplicity functional, of the difference between these two classes of TABLES. Suppose that the following descriptors and their domains are used to describe the TABLES:

1. Global descriptors:
   length, {short, long}
   #parts, {3, 4}

2. a) Features of individual parts $P_i$, i = 1, 2, 3, 4 (top rectangle, left triangle, right triangle, bar):
   part-type($P_i$), {0, □, ▽, △, =}
   part-length($P_i$), {0, short, long}
   part-texture($P_i$), {0, and four pictorial texture symbols}
   (0 means 'not relevant', i.e., the part does not exist)

   b) Binary relations among parts:
   on-top($P_i$, $P_j$), {above-middle, above-left, above-right}

   c) Ternary relations among parts:
   in-between($P_i$, $P_j$, $P_k$), {low, high} (part $P_i$ is between $P_j$ and $P_k$)

Using these descriptors, the machine describes each object in terms of the VL2 system as a conjunction of selectors. For example, object 1 in class 1 would be described as:

$$[\text{length} = \text{short}][\#\text{parts} = 4][\text{part-type}(P_1) = \square][\text{part-type}(P_2) = \triangledown][\text{part-type}(P_3) = \triangle][\text{part-type}(P_4) = {=}][\text{part-length}(P_1) = \text{short}][\text{on-top}(P_1, P_2) = \text{above-right}][\text{in-between}(P_4, P_2, P_3) = \text{high}] \tag{28}$$

Suppose that $T_{i1}$, $T_{i2}$, $T_{i3}$ and $T_{i4}$ denote the descriptions of objects 1, 2, 3, 4 in class i, i = 1, 2, respectively. A description of class 1 (which is the 'least general') could then be:

$$\text{CLASS1}\,(T_{11} \lor T_{12} \lor T_{13} \lor T_{14}) \tag{29}$$

and of class 2:

$$\text{CLASS2}\,(T_{21} \lor T_{22} \lor T_{23} \lor T_{24}) \tag{30}$$

where $\{*, \text{CLASS1}, \text{CLASS2}\}$ is the output domain of the formulas. Events which do not satisfy any of these formulas are *-events.

Suppose now that as a simplicity criterion we accept the requirement that a formula have the minimum number of terms and, with secondary priority, the minimum number of selectors. A way to attain the simplest description, in the above sense, of the difference between the two classes is to maximally simplify and generalize the formulas (29) and (30) under the restriction that the resulting formulas have an empty intersection. (The 'empty intersection' means that there are no events which satisfy both formulas.) This is done by assigning to *-events decisions which lead to the maximal simplification and generalization of formulas (29) and (30) by means of rules (14)-(20), without violating the above-mentioned restriction. Such an inductive process can be executed very efficiently by the previously mentioned computer program AQVAL/1. The simplest formulas, according to our criterion, obtained from AQVAL/1 for the two classes were:

$$\text{CLASS1}\,[\text{length} = \text{short}][\text{part-texture}(P_4) = t_a, t_b] \tag{31}$$

$$\text{CLASS2}\,[\text{length} = \text{long}] \lor [\text{part-texture}(P_4) = 0, t_c] \tag{32}$$

where $t_a$, $t_b$ and $t_c$ stand for three distinct pictorial texture symbols of Fig. 1. (The execution time was less than 3 sec. on the IBM 360/75; AQVAL/1 is written in PL/1.) These formulas state that TABLES of class 1 are 'short' and the texture of the bar is $t_a$ or $t_b$, and that TABLES of class 2 are either long, or the texture of the bar is $t_c$, or there is no bar. This description of the classes seems to agree well with what a human might accept as the 'most simple' difference between the two classes.

ACKNOWLEDGMENT

The author gratefully acknowledges the financial support he obtained from the Department of Computer Science of the University of Illinois at Urbana-Champaign for conducting the research reported in this paper. It is also his pleasant duty to express thanks to Mr. A. B. Baskin for fruitful discussions, criticism and proofreading of the paper.
REFERENCES

1. Bongard, M. M., Problema uznavaniya, Izd. Nauka, Moscow, 1967. (English trans.: Pattern Recognition, Spartan Books, New York, 1970.)

2. Cherry, C., On Human Communication, The M.I.T. Press, Cambridge, Mass., 1966.

3. Banerji, R. B., Simplicity of concepts, training and the real world, Artificial and Human Thinking, edit. A. Elithorn, D. Jones, Jossey-Bass Inc., Publishers, 1973.

4. Michalski, R. S., A Variable-Valued Logic System as Applied to Picture Description and Recognition, GRAPHIC LANGUAGES, edit. F. Nake, A. Rosenfeld, North-Holland Publishing Company (Proceedings of the IFIP Working Conference on Graphic Languages, Vancouver, Canada, May 1972).

5. Michalski, R. S., AQVAL/1 - Computer Implementation of a Variable-Valued Logic System and the Application to Pattern Recognition, Proceedings of the First International Joint Conference on Pattern Recognition, Washington, D.C., October 30-November 1, 1973.

6. Michalski, R. S., VARIABLE-VALUED LOGIC: System VL1, Proceedings of the International Symposium on Multiple-Valued Logic, West Virginia University, Morgantown, West Virginia, May 29-31, 1974.

7. Baskin, A. B., A comparative discussion of variable-valued logic and grammatical inference, Report No. G(o, of the Department of Computer Science, University of Illinois, Urbana, July l