Microsoft Word - esa1999.doc Page 1 An Adaptive Knowledge Acquisition System using Generic Genetic Programming Man Leung Wong Department of Computing and Decision Sciences Lingnan University Tuen Mun Hong Kong mlwong@ln.edu.hk Page 2 An Adaptive Knowledge Acquisition System using Generic Genetic Programming Abstract The knowledge acquisition bottleneck greatly obstructs the development of knowledge-based systems. One popular approach to knowledge acquisition uses inductive concept learning to derive knowledge from examples stored in databases. However, existing learning systems cannot improve themselves automatically. This paper describes an adaptive knowledge acquisition system that can learn first-order logical relations and improve itself automatically. The system is composed of an external interface, a biases base, a knowledge base of background knowledge, an example database, an empirical ILP learner, a meta- level learner, and a learning controller. In this system, the empirical ILP learner performs top-down search in the hypothesis space defined by the concept description language, the language bias, and the background knowledge. The search is directed by search biases which can be induced and refined by the meta-level learner based on generic genetic programming. It has been demonstrated that the adaptive knowledge acquisition system performs better than FOIL on inducing logical relations from perfect or noisy training examples. The result implies that the search bias evolved by evolutionary learning is better than that of FOIL which is designed by a top researcher in the field. Consequently, Generic genetic programming is a promising technique for implementing a meta-level learning system. The result is very encourage as it suggests that the process of natural selection and evolution can successfully evolve a high performance learning system. Area: Evolutionary Computation, Genetic Programming, Knowledge Acquisition 1. Introduction The knowledge acquisition bottleneck greatly obstructs the development of knowledge-based systems. One popular approach to knowledge acquisition uses inductive concept learning to derive knowledge from examples stored in databases. The knowledge acquired can be expressed in different knowledge representations such as first-order logical relations, decision trees, decision lists, and production rules. Existing learning systems such as CART (Breiman et al. 1984), C4.5 (Quinlan Page 3 1992), ASSISTANT (Cestnik et al. 1987), AQ15 (Michalski et al 1986), and CN2 (Clark and Niblett 1989) use attribute-value language for representing the training examples and the induced knowledge and allow a finite number of objects in the universe of discourse. This representation limits them to learn only propositional descriptions in which concepts are described in terms of values of a fixed number of attributes. Dzeroski and Lavrac show that Inductive Logic Programming (ILP) can be used to induce knowledge represented as first-order logical relations (Dzeroski and Lavrac 1993, Dzeroski 1996). ILP is more powerful than traditional inductive learning methods because it uses an expressive first-order logic framework and facilitates the application of background knowledge. In this formalism, domain knowledge represented in the form of relations can be used in the induced relational descriptions of concepts. Moreover, ILP has a strong theoretical foundation from logic programming and computational learning theory. The task of inducing first-order logical relations can be formulated as a search problem (Mitchell 1982) in a hypotheses space of logical relations. Various approaches (Quinlan 1990; 1996, Muggleton and Feng 1990) differ mainly in the search strategy and the heuristics used to guide the search. The search space is extremely large, so strong heuristics are required to manage the problem. Most systems are based on a greedy search strategy. They generate a sequence of logical relations from general to specific (or from specific to general) until a consistent relation is found. Each relation in the sequence is obtained by specializing (or generalizing) the previous one. For example, FOIL (Quinlan 1990, 1996) applies a hill climbing search strategy guided by an information-gain heuristic to search relations from general to specific. But these strategies and heuristics are not always applicable because these systems may become trapped in local maxima. In order to overcome this problem, non greedy strategies should be adopted. Moreover, existing ILP systems cannot improve themselves automatically. In this paper, we describe an adaptive knowledge acquisition system that induces first-order logical relations and improves itself during learning. We formulate Page 4 the definitions of inductive concept learning and adaptive knowledge acquisition in the next section. The system is based on a generic genetic programming approach that is presented in Section 3. A generic top-down first-order learning algorithm is described in the next section. The fifth section contains a description of a meta-level learner that induces search bias. The experimentation and some evaluations of the system are reported in Section six. Finally, the conclusion is presented in the last section. 2. INDUCTIVE CONCEPT LEARNING AND ADAPTIVE KNOWLEDGE ACQUISITION The goal of machine learning is to develop techniques and tools for building intelligent learning machines. Machine learning paradigms include inductive, deductive, genetic-based, and connectionist learning. Multi-strategy learning integrates several learning paradigms. This section focuses on supervised inductive concept learning. If U is a universal set of observations, a concept C is formalized as a subset of observations in U. Inductive concept learning finds descriptions for various target concepts from positive and negative training instances of these concepts. In machine learning, formal languages for describing observations and concepts are called object and concept description languages respectively. Typically, object description languages are attribute-value pair descriptions and first-order languages of Horn clauses. Concepts can be described extensionally or intensionally. A concept is described extensionally by listing the descriptions of all of its instances (observations). Thus extensional concepts are represented in the object description language. On the other hand, intensional concepts are expressed in a separate concept description language that permits compact and concise concept descriptions. Typical concept description languages are decision trees, decision lists, production rules, and first-order logic. Inductive concept learning can be viewed as searching the space of hypothesis descriptions. A bias is a mechanism employed by a learning system to constrain the search for target hypotheses. A search bias determines how to conduct the search in Page 5 the hypothesis space while a language bias determines the size and structure of the hypothesis space. A strong search bias, such as the hill-climbing search strategy, employs existing knowledge about the size and structure of the hypothesis space to exploit promising solutions of the space, thus it can find the target concept quickly. But it may trap the system in a local maximum. A weak search bias, such as depth-first and breath-first search, explores the space completely; the learner is guaranteed to find the target concept that can be represented by the concept description language. Nevertheless, a weak bias is very inefficient. In other words, the search bias introduces the efficiency/completeness tradeoff into a learning system. A strong language bias defines a less expressive description language such as the propositional logic. The hypothesis space created by the bias is comparatively smaller and the learning can be performed more efficiently. But the learner may fail to find the target concept which is not contained in the small hypothesis space. A weak bias defines a larger space and thus the target concept is more likely to be expressible in the space. The disadvantage is that the learner is less efficient. The language bias introduces the efficiency/expressiveness tradeoff into a learning system. Background knowledge B is declarative prior knowledge that can be used by either the search bias to direct the search more efficiently, or the language bias to express the hypothesis space in a more natural and concise way. Background knowledge plays an important role in relational concept learning. Relational concept learning induces a new relation for the target concept (i.e., the target predicate) from training examples and known relations from the background knowledge. The training examples, the hypothesis space, and the background knowledge are represented in first order Horn clause languages (Muggleton 1992). Tradeoffs between expressiveness and efficiency are introduced by some additional restrictions on the three languages representing the examples, the hypothesis space, and the background knowledge. The background knowledge B provides definitions of known predicates qi which can be used in the definition of the target predicate p. It also provides Page 6 additional information to ease the learning. The information includes argument types, symmetry of predicates in pairs of arguments, input/output modes, rule models, predicate sets, parametrized languages, integrity constraints, determinations and any knowledge that can modify the operation of the search and language biases (Lavrac and Dzeroski 1994). An adaptive knowledge acquisition system is a relational concept learning system that can improve itself on the learning capability. It maintains various sets of background knowledge and biases. It improves itself by modifying its biases and background knowledge. Since a hypothesis space for learning is defined through the concept description language, the language bias, and the background knowledge, the size and structure of the hypothesis space can be modified by changing the language bias and the background knowledge. The search strategy and heuristics are changed if the system's search biases are modified. Here, we formulate the task of an adaptive knowledge acquisition system in Table 1. Given: -A set E of positive E+ and negative E- training examples of the target predicate p. Training examples are represented as ground atoms -A concept description language L -A set of learning biases BIASES -A set of various background knowledge BKs Find: -A modified set of learning biases BIASES' -A modified set of background knowledge BKs' -A concept definition H for the target predicate p expressible in L such that H is complete and consistent with respect to (w.r.t.) the training examples E and a background knowledge B' in BKs' H is complete if every positive example e+ in E+ is covered by H w.r.t. the background knowledge B. i.e. B ∪ H |= e+ H is consistent if no negative example e- in E- is covered by H w.r.t. the background knowledge B. i.e. B ∪ H |≠ e- Table 1: The definition of adaptive knowledge acquisition Page 7 External Interface Learning Controller BKbase Biases Base Example database Meta-Level Learner Empirical ILP Learner Data flow Control flow Figure 1: The logical organization of an adaptive knowledge acquisition system The logical organization of our system is depicted in Figure 1. Its components are introduced as follows: (1) External interface: It provides a user-friendly interface between the system and users. It accepts training examples, a set BKs of background knowledge, and a set BIASES of biases and transfers them through the learning controller to the example database, BKbase, and biases base respectively. The interface also provides commands for users to query about the results of an adaptive learning task and to directly control the operations of the learning controller. (2) Biases base: It is a knowledge base that stores all learning biases. Biases can be retrieved, added, deleted, and modified through the interface of this knowledge base. (3) BKbase: It stores various background learning knowledge that can be used in inductive learning. Background knowledge can be retrieved, added, deleted, and modified through the interface of BKbase. Since each entity of it is in fact a complex structure representing background knowledge, BKbase is implemented using object-oriented techniques. Page 8 (4) Examples database: It stores the training examples. (5) Empirical ILP learner: It induces first-order logical relations from the training examples, given a concept description language, a specific background knowledge, a search bias, and a language bias. A search of the hypothesis space can be performed bottom-up or top-down. Bottom-up techniques start from the training examples and search the space by employing various generalization operators. Top-down techniques start from the most general concept descriptions, and search the space by using various specialization operators. Top-down techniques are better suited for learning from imperfect examples because a large number of data are available in every specialization step and the system can employ various statistical techniques to decide how to perform the specialization. Moreover, top-down search can easily be guided by the search bias. In Section 4, a generic top-down first-order learning algorithm is described. (6) Meta-level learner: It learns search biases, language biases, and background knowledge. Search and language biases can be represented declaratively or procedurally. In this paper, we apply GGP to implement the meta-level learner that induces procedural biases. The description of GGP is presented in Section 3. (7) Learning controller: It is a knowledge-based system that controls the empirical ILP learner and the meta-level learner. The knowledge used by the learning controller can be updated by the meta-level learner. 3. Generic Genetic Programming (GGP) Generic Genetic Programming (GGP) is a novel approach that combines genetic programming (Koza 1992; 1994, Kinnear 1994) and inductive logic programming (Quinlan 1990; 1996, Muggleton 1992). Using GGP, programs in various programming languages can be evolved. The approach is also powerful enough to handle context-sensitive information and domain-dependent knowledge which can be used to accelerate the learning speed and/or improve the quality of the programs. Page 9 GGP can induce programs in various programming languages. This is achieved by accepting or choosing grammars of different languages to produce programs in these languages. Most modern programming languages are specified in the notation of BNF (Backus-Naur form) which is a kind of context-free grammars (CFGs). However, GGP is based on logic grammars because CFGs (Hopcroft and Ullman 1979, Lewis and Rapadimitrion 1981) are not expressive enough to represent context-sensitive information for some languages and domain-dependent knowledge of the target program being induced. This section first introduces the formalism of logic grammars followed by the descriptions of GGP. 3.1. Introduction to logic grammars Logic grammars are the generalizations of CFGs. Their expressivenesses are much more powerful than those of CFGs, but equally amenable to efficient execution. In this paper, logic grammars are described in a notation similar to that of definite clause grammars (Pereira and Warren 1980, Pereira and Shieber 1987, Sterling and Shapiro 1986). The logic grammar for some simple S-expressions in Table 2 will be used throughout this section. 1: start -> [(*], exp(W), exp(W), exp(W), [)]. 2: start -> {member(?x,[W, Z])}, [(*], exp-1(?x), exp-1(?x), exp-1(?x), [)]. 3: start -> {member(?x,[W, Z])}, [(+], exp-1(?x), exp-1(?x), exp-1(?x), [)]. 4: exp(?x) -> [(/ ?x 1.5)]. 5: exp-1(?x) -> {random(1,2,?y)}, [(/ ?x ?y)]. 6: exp-1(?x) -> {random(3,4,?y)}, [(- ?x ?y)]. 7: exp-1(W) -> [(+ (- W 11) 12)]. Table 2: A logic grammar A logic grammar differs from a CFG in that the logic grammar symbols, whether terminal or non-terminal, may include arguments. The arguments can be any term in the grammar. A term is either a logic variable, a function or a constant. A variable is represented by a question mark ? followed by a string of letters and/or digits. A function is a grammar symbol followed by a bracketed n-tuple of terms and Page 10 a constant is simply a 0-arity function. Arguments can be used in a logic grammar to enforce context-dependency. Thus, the permissible forms for a constituent may depend on the context in which that constituent occurs in the program. Another application of arguments is to construct tree structures in the course of parsing, such tree structures can provide a representation of the semantics of the program. The terminal symbols enclosed in square brackets correspond to the set of words of the language specified. For example, the terminal [(- ?x ?y)] creates the constituent (- 1.0 2.0) of a program if ?x and ?y are instantiated respectively to 1.0 and 2.0. Non-terminal symbols are similar to literals in Prolog, exp-1(?x) in Table 2 is an example of non-terminal symbols. Commas denote concatenation and each grammar rule ends with a full stop. The right-hand side of a grammar rule may contain logic goals and grammar symbols. The goals are pure logical predicates for which logical definitions have been given. They specify the conditions that must be satisfied before the rule can be applied. For example, the goal member(?x, [W, Z]) in Table 2 instantiates the variable ?x to either W or Z if ?x has not been instantiated, otherwise it checks whether the value of ?x is either W or Z. If the variable ?y has not been bound, the goal random(1, 2, ?y) instantiates ?y to a random floating point number between 1 and 2. Otherwise, the goal checks whether the value of ?y is between 1 and 2. Domain-dependent knowledge can be represented in logic goals. For example, consider the following grammar rule: a-useful-program -> first-component(?X), {is-useful(?X, ?Y)}, second-component(?Y). This rule states that a useful program is composed of two components. The first component is generated from the non-terminal first-component(?X). The logic variable ?X is used to store semantic information about the first component produced. The logic goal then determines whether the first component is useful according to the semantic information stored in ?X. Domain-dependent knowledge Page 11 about which program fragments are useful is represented in the logical definition of this predicate. If the first component is useful, the logic goal is- useful(?X, ?Y) is satisfied and some semantic information is stored into the logic variable ?Y. This information will be used in the non-terminal second- component(?Y) to guide the search for a good program fragment as the second component of a useful program. The special non-terminal start corresponds to a program of the language. In Table 2, some grammar symbols are shown in bold-face to identify the constituents that cannot be manipulated by genetic operators. For example, the last terminal symbol [)] of the second rule is revealed in bold-face because every S-expression must be ended with a ')'. The number before each rule is a label for later discussions. It is not part of the grammar. 3.2. Representations of programs One of the fundamental contributions of GGP is in the representations of programs in different programming languages appropriately so that initial population can be generated easily and the genetic operators such as reproduction, mutation, and crossover can be performed effectively. A program can be represented as a derivation tree that shows how the program has been derived from the logic grammar. GGP applies deduction to randomly generate programs and their derivation trees in the language declared by the given grammar. These programs form the initial population. For example, the program (* (/ W 1.5) (/ W 1.5) (/ W 1.5)) can be generated by GGP given the logic grammar in Table 2. It is derived from the following sequence of derivations: start => [(*] exp(W) exp(W) exp(W) [)] => [(*] [(/ W 1.5)] exp(W) exp(W) [)] => [(*] [(/ W 1.5)] [(/ W 1.5)] exp(W) [)] => [(*] [(/ W 1.5)] [(/ W 1.5)] [(/ W 1.5)] [)] => [(* (/ W 1.5) (/ W 1.5) (/ W 1.5))] This sequence of derivations can be represented as the derivation tree depicted in Figure 2. Page 12 In literature, the terms derivation trees and parse trees are usually used interchangeably. However, we will use the term derivation trees to refer to the tree structures in our framework and the term parse trees to refer to those in GP. The bindings of logic variables are enclosed in a pair of braces. The sub-trees enclosed in a dashed rectangular are frozen. In other words, they are generated by bold-faced grammar symbols and they cannot be modified by genetic operators. An advantage of logic grammars is that they specify what is a legal program without any explicit reference to the process of program generation and parsing. Furthermore, a logic grammar can be translated into an efficient logic program that can generate and parse the programs in the language declared by the logic grammar (Pereira and Warren 1980, Pereira and Shieber 1987, Abramson and Dahl 1989). In other words, the process of program generation and parsing can be achieved by performing deduction using the translated logic program. Consequently, the program generation and analysis mechanisms of GGP can be implemented using a deduction mechanism based on the logic programs translated from the grammars. [(*] exp(W) exp(W) exp(W) [)] start [(/ ?x 1.5)] {?x/W} [(/ ?x 1.5)] {?x/W} [(/ ?x 1.5)] {?x/W} Figure 2: A derivation tree of the S-expression in Lisp (* (/ W 1.5) (/ W 1.5) (/ W 1.5)) This method of translating a logic grammar into a logic program is common in the field of natural language processing (Pereira and Warren 1980, Pereira and Shieber 1987, Abramson and Dahl 1989). The original idea of this approach is to Page 13 rephrase the special purpose formalism of CFGs into a general purpose first-order predicate logic (Kowalski 1979, Colmerauer 1978, Pereira and Warren 1980). This approach is further refined and generalized to Definite Clause Grammars (DCGs) which can handle the properties of context-dependency of natural languages effectively. Since DCGs, a kind of logic grammars, can be translated into efficient logic programs automatically, parsers and generators for the corresponding natural languages can be obtained easily. In other words, researchers in the field of natural language processing only declare the grammar for a particular natural language, and the translation process will produce the corresponding parser and generator for them. Moreover, for some cases, the same logic program can be used as both a parser and generator at the same time. Alternatively, initial programs can be induced by other learning systems such as FOIL (Quinlan 1990; 1996) or given by the user. GGP analyzes each program and creates the corresponding derivation tree. 3.3. The evolution process of GGP In GGP, populations of programs are genetically bred using the Darwinian principle of survival and reproduction of the fittest along with genetic operations appropriate for creating programs. GGP starts with an initial population of programs generated randomly, induced by other learning systems, or provided by the user. Logic grammars provide declarative descriptions of the valid programs that can appear in the initial population. A fitness function must be defined by the user to evaluate the fitness values of the programs. Typically, each program is run over a set of fitness cases and the fitness function estimates its fitness by performing some statistical operations (e.g. average) to the values returned by this program. The initial programs in generation 0 are normally incorrect and have poor performances. However, some programs in the population will be fitter than others. Fitness of each program in the generation is estimated and the following process is iterated over many generations until the termination criterion is satisfied. The reproduction, sexual crossover, and asexual mutation are used to create new Page 14 generation of programs from the current one. The reproduction involves selecting a program from the current generation and allowing it to survive by copying it into the next generation. Either fitness proportionate or tournament selection can be used. The crossover is used to create offspring programs from two parental programs selected. Mutation creates a modified offspring program from a parental program selected. Unlike crossover, the offspring program is usually similar to the parent program. Logic grammars are used to constraint the offspring programs that can be produced by these genetic operations. This algorithm will produce populations of programs which tend to exhibit increasing average of fitness. Finally, GGP returns the best program found in any generation of a run as the result. A high level algorithm of GGP is presented in Table 3. 1. Generate an initial population of programs. 2. Execute each program in the current population and assign it a fitness value according to the fitness function 3. If the termination criterion is satisfied, terminate the algorithm. The best program found in the run of the algorithm is designated as the result. 4. Create a new population of programs from the current population by applying the reproduction, crossover, and mutation operations. These operations are applied to programs selected by fitness proportionate or tournament selections. 5. Rename the new population to the current population. 6. Proceed to the next generation by branching back to the step 2. Table 3: The high level algorithm of GGP 4. A generic top-down first-order learning algorithm This section presents a generic top-down first-order learning algorithm based on FOIL (Quinlan 1990, 1996). The algorithm is depicted in Table 4. The algorithm consists of three steps. In the pre-processing step, missing argument values in training examples are handled by assigning default or random values to them. A training example will be removed if it has too many missing values. If there are no or Page 15 inadequate negative examples in the training set, they can be generated. Different ways of creating negative examples have been proposed (Lavrac and Dzeroski 1994). Input: E: Training examples L: The concept description language BIASsearch: The search bias BIASlang: The language bias B: Background knowledge T: The target concept Output: A relation P which contains a set of clauses. Each clause C ∈ L. Function LEARNING(E, L, BIASsearch, BIASlang, B, T) (1) Pre-processing of the training examples E and producing a modified set of examples E': E' := Preprocessing(E). (2) Let Ecurrent := E'; Let P := {}; Repeat -Let C := T ←; -Find a specialization C' of C. This step constructs a clause C' from C by calling Clause-Construct(C, Ecurrent, B, L, BIASsearch, BIASlang); -If a specialization can be found -Add C' to P to produce a new relation P'. i.e. P' := P ∪ {C'}; -Remove all positive examples covered by P' from Ecurrent to get an updated training set E' E'current := Ecurrent - { positive examples in Ecurrent covered by P' w.r.t. the background knowledge B}; -Let Ecurrent := E'current; -Let P := P' Else -Set the flag No-More-Improvement to true; Until The Covering termination criterion is satisfied. i.e. covering-termination(P, No-More-Improvement, Ecurrent, B) returns true; (3) Post-processing the relation P and producing P'. i.e. P' := Post-processing(P); Return(P'); Table 4: A generic top-down first-order learning algorithm Page 16 The second step performs the construction of a logical relation. This step employs four local variables: Ecurrent (Current training examples set), E'current (Updated training examples set), P (Current relation) and P' (Modified relation). The main component of this step is the covering loop which implements Michalski's covering algorithm (Michalski et al. 1986a). The covering loop construct a relation by iteratively executing the following sub-steps: (a) Construct a clause that covers some positive examples in Ecurrent. (b) Append the clause to the current relation P and generate a modified relation P'. (c) Remove all positive examples from Ecurrent which are covered by P' with respect to the background knowledge B. The covering loop terminates if the terminating conditions are satisfied. A typical condition is that either all positive examples are covered or no more improvement can be achieved by searching for a new clause. The final step attempts to improve the accuracy of the relation induced when classifying unseen examples and to simplify the relation. The covering loop calls the 'Clause-Construct' function which is the core of the generic algorithm. A hill-climbing 'Clause-Construct' algorithm is presented in Table 5. The function constructs a clause Cn = T ← l1, l2, ..., ln starting from the most general clause C0 = T ← with an empty body. A sequence of clauses C0, C1, C2, C3, ...., Cn are generated by a number of specialization steps. At each step, the current clause Ci = T← l1,l2, ..., li is refined by appending a specific literal lj to its body. A literal lj is constructed from the background knowledge B restricted by the concept description language L and language bias BIASlang. The language may limit lj to be function-free while BIASlang may prevent new variable to be introduced in lj. The aim of the procedure is to find a clause which covers most positive examples while excludes all or most negative examples. In a hill-climbing search, the procedure keeps the current Page 17 best clause and refines it using the estimated best specialization at each step, until the stopping condition is satisfied The 'Clause-Construct' function calls the 'Find-Extension' function to find the extension Ei of the current training examples given the partially developed clause Ci = T(X1, X2, ..., Xn) ← l1, l2, ..., li and the background knowledge B. Each training example is a n-tuple where xi, 1≤i≤n, are some constants. To find the extension, the function initializes a clause C0 = T(X1, X2, ..., Xn), then the literal l1 is added to the body of C0 to produce a new clause C1. The literal l1 is either of the form Xj = Xk, Xj ≠ Xk, pm(Y1, Y2, ...,Ysm ) or not pm(Y1, Y2, ...,Ysm ). If the literal contains k new variables, the arity of each tuple in the generated training set E1 increases to (n + k). E1 can be found by performing a natural join of Ecurrent with the relation corresponding to literal l1. The process is repeated for literals l2, l3, ..., li until the extension Ei is found. The most important component of the hill-climbing 'Clause-Construct' algorithm is the 'scoring' function that estimates the performance of each literal. An accurate estimation directs the search towards the global maxima while a misleading one traps the system into local-maxima. By providing different 'scoring' functions to the generic learning algorithm, various learning algorithms can be generated. The performances of a good and a bad learners can be significant different as shown in Sub-section 5.3. Page 18 Input: C: An initial clause C = T← Ecurrent: The current training examples B: Background knowledge L: The concept description language BIASsearch: The search bias BIASlang: The language bias Output: A clause that covers some positive examples in Ecurrent while excludes all or most negatives examples in Ecurrent Function Clause-Construct(C, Ecurrent, B, L, BIASsearch, BIASlang) There is a scoring function stored in BIASsearch, save this function to scoring; Repeat -Set BEST to a bad literal such as X = X where X is a variable appearing in the head of the clause; -Set Best-score to 0; -Find the extension Ei of Ecurrent using the clause C w.r.t. B. i.e. Ei := Find-Extension(C, Ecurrent, B); -Let ni + be the number of positive tuples in Ei; -Let ni − be the number of negative tuples in Ei; -Current-information := − log 2( ni + / (ni + + ni − )); -For all literal l from B that satisfy the constraints imposed by the language L and bias BIASlang -Set C ' = C ∪ {l}; -Find the extension Ei+1 of Ecurrent using the clause C ' i.e. Ei+1 := Find-Extension(C ', Ecurrent, B); -Let ni+1 + be the number of positive tuples in Ei+1; -Let ni+1 − be the number of negative tuples in Ei+1; -Let the number of positive tuples in Ei that have been represented by one or more tuples in Ei+1 be ni ++ ; -Find the score of the literal l by using the scoring function i.e. literal-score := scoring (ni ++ , ni+1 + , ni+1 − , Current-information); -If literal-score > Best-score then -BEST := l; -Best-score := literal-score; -If BEST == X=X then -No-More-Improvement := true; Else -Append BEST to the body of C; Until Clause-Termination(C, No-More-Improvement, Ecurrent, B) is true; Post-processing the clause C to find an improvement i.e. C' := Find-Improvement(C); If Acceptable(C') -Return(C'); Else -Return(No-Specialization-Can-Be-Found); Table 5: A hill-climbing 'Clause-Construct' algorithm Page 19 5. Inducing procedural search biases In this section, GGP is used in the meta-level learner to induce procedural search biases (i.e. the 'scoring' function). In order to employ GGP, a logic grammar must be defined (Table 6). In the grammar, the terminal symbols n-pos-i-plus-1, n-neg-i-plus-1, and n-pos-i represent respectively ni +1 + , ni +1 − and ni + + . With reference to the algorithms in Tables 4 and 5, assume that Ei is the extension of current training examples Ecurrent by current clause Ci, ni + and ni − are respectively the number of positive and negative tuples in Ei. Ei can be extended by using the literal l to Ei+1. ni+1 + and ni+1 − are respectively the number of positive and negative tuples in Ei+1. ni ++ is the number of positive tuples in Ei that have been represented by one or more tuples in Ei+1. The terminal symbol current-information is defined as − log 2( ni + / (ni + + ni − )) . start -> function. s-exp -> term. s-exp -> function. function -> [(], op1, s-exp, [)]. function -> [(], op2, s-exp, s-exp, [)]. op1 -> [ protected-log ]. op2 -> [ + ]. op2 -> [ - ]. op2 -> [ * ]. op2 -> [ % ]. op2 -> [ info ]. term -> [ n-pos-i-plus-1 ]. term -> [ n-neg-i-plus-1 ]. term -> [ n-pos-i ]. term -> [ current-information ]. term -> { random(-10, 10, ?a) }, [ ?a ]. Table 6: A logic grammar for learning procedural search bias The terminal symbols +, -, and * represent functions that perform ordinary addition, subtraction, and multiplication respectively. The symbol % represents Page 20 function that normally returns the quotient. However, if division by zero is attempted, the function returns 1.0. The symbol protected-log is a function that calculates the logarithm of the input argument x if x is larger than zero, otherwise it returns 1.0. The symbol info represents the basic function that calculates − log 2( X / ( X + Y )) given X and Y as inputs. The logic goal random(-10, 10, ?a) generates a random floating point number between -10 and 10 and instantiates ?a to the random number generated 5.1. The evolution process The evolution process of the adaptive knowledge acquisition system is depicted in Figure 3. Firstly, the Biases base is initialized with a population of different 'scoring' functions generated randomly using the logic grammar depicted in Table 6. To estimate the fitness of a specific 'scoring' function, it is combined with the generic top-down learner to produce a specific learner. The performance of this specific learner is then evaluated by using a fitness function. This measure is assigned as the fitness of the specific 'scoring' function. GGP employs selection, crossover, and mutation to generate potentially better functions. The modified functions are stored in the Biases base and the whole evolution process iterates until the best function is found or no computational resource is available Page 21 Biases Base GGP Empirical ILP Learner BKbase Example database Modified Biases Initial Biases Bias Performance of bias Figure 3: The evolution process of the adaptive knowledge acquisition system 5.2. The experimentation setup In this paper, learning curves are used to estimate the performances of various learning systems. The example space is divided randomly into disjoint training and testing sets. The learner is trained on progressively larger portions of the training set and the performance of the induced logical relation is estimated on the disjoint testing set. This process of dividing, training, and testing is repeated for 20 trials and the results are averaged to generate a learning curve. As a running example, we use a traditional problem discussed in the literature (Muggleton and Feng 1990). In the problem of learning the list predicate member, the data consist of all lists of lengths 0 to 3 defined over three constants. The background knowledge B contains definitions of list construction predicates: null which holds for an empty list and component which decomposes a list into its head and tail. The example space contains 75 positive and 45 negative examples. The training sets contain 20 to 52 examples, one-half of each training set is positive examples. The testing set consists of 45 positive and 15 negative examples. Page 22 5.3. Fitness calculation Adjusted and normalized fitness values are used as in Koza (1992). They are calculated from the raw fitness which is estimated by the fitness function. Various fitness functions have been tried and two of them are described here. The impact of fitness function on the generality of the evolved function is also demonstrated. The problem domain of learning the member predicate is used here. For the first fitness function, a random set of 24 positive and 21 negative examples is used. A specific 'scoring' function is combined with the generic top-down learner to produce a specific learner called Adapted-ILP hereafter. Adapted-ILP induces first-order logical relations using the random example set. The quality of the induced logical relations is evaluated by counting the total number of misclassified examples from the same training set. This measure is used as the raw fitness of the specific 'scoring' function. Using this fitness function, only poor 'scoring' functions have been evolved. The learning curve of a poor learner is depicted in Figure 4. For the second fitness function, the raw fitness is developed in several steps. At the beginning of each generation, four instances of the learning task are created randomly from the member domain. Each learning task has a training and a disjoint testing data. The training set contains 20 positive and 20 negative examples. For each learning task, a specific Adapted-ILP induces logical relations from the training set and the relations are evaluated by counting the number of misclassified examples from the testing set. The performance of the Adapted-ILP is the sum of numbers of misclassified examples for all learning tasks. This measure is then used as the raw fitness of the corresponding 'scoring' function. This fitness function can force the evolution of good 'scoring' functions. The learning curve of a good learner is shown in Figure 4. Page 23 20 24 28 32 36 40 44 48 52 Training size 0.7 0.75 0.8 0.85 0.9 0.95 1 A cc ur ac y Poor ILP learner Good ILP learner Figure 4: The learning curves of good and poor learner 6. Experimentation and evaluations This section compares the performance of our system with that of FOIL (Quinlan 1990; 1996). Standard learning tasks in the literature are used in these experiments (Quinlan 1990, Muggleton and Feng 1990). 6.1. The member predicate The learning curves for this problem are depicted in Figure 5. It is interested to find that our system has higher accuracy than FOIL. The difference is significant at 5% level of significance when the training size is less than 36. 20 24 28 32 36 40 44 48 52 Training size 0.75 0.8 0.85 0.9 0.95 1 A cc ur ac y Foil The Adaptive ILP system Figure 5: Learning curves for the member problem Page 24 6.2. The member predicate in a noisy environment Difference amount of noise is introduced into the training examples in order to study the performances of both systems in learning relations in noisy environment. To introduce n% of noise into the examples, n% positive examples are labeled as negative ones while n% negative examples are labeled as positive ones. In this experiment, the percentages of introduced noise are 10% (0.1) and 40% (0.4). Their learning curves are summarized in Figure 6. Our system performs better than FOIL at all noise level. 20 24 28 32 36 40 44 48 52 Training size 0.5 0.6 0.7 0.8 0.9 1 A cc ur ac y Foil (0.1) The Adaptive ILP system (0.1) Foil (0.4) The Adaptive ILP system (0.4) Figure 6: Learning curves for the member problem in a noisy environment 6.3. The multiply predicate In the problem of learning the arithmetic predicate multiply (Muggletion and Feng 1990), the data contain integers in the range from zero to ten. The background knowledge is composed of definitions for arithmetic predicates plus, decrement, zero, and one. The example space has 73 positive and 1258 negative examples respectively. The training sets consist of 400 to 500 examples, one-tenth of each training set is positive and the remainder is negative. The learning curves for multiply are presented in Figure 7. Our system performs better than FOIL when the size of training set is less than 460. The difference is significant at 5% level of significance. Page 25 400 420 440 460 480 500 Training size 0.6 0.7 0.8 0.9 1 A cc ur ac y Foil The Adaptive ILP system Figure 7: Learning curves for the multiply problem 6.4. The uncle predicate Another traditional testbed for relational learners is the domain of family relationships (Quinlan 1990). In this experiment, the uncle predicate is induced and the background predicates are parent, sibling, married, male, and female. The learning curves are presented in Figure 8. 50 70 90 110 130 150 Training size 0.75 0.8 0.85 0.9 0.95 1 1.05 A cc ur ac y Foil The Adaptive ILP system Figure 8: Learning curves for the uncle problem Page 26 7. Conclusion In this paper, we formulate an adaptive knowledge acquisition system which is composed of an external interface, a biases base, a knowledge base of background knowledge, an example database, an empirical ILP learner, a meta-level learner, and a learning controller. An implementation of the adaptive knowledge acquisition system has been developed. In the implementation, the empirical ILP learner performs top- down search in the hypothesis space defined by the concept description language, the language bias, and the background knowledge. The search is directed by search biases which can be induced and refined by a meta-level learner implemented by using generic genetic programming. Generic Genetic Programming (GGP) is a novel, general, and powerful approach of evolving search biases. It is also powerful enough to use logic grammars to represent context-sensitive information and domain-dependent knowledge. The idea of using formal grammars to direct search for knowledge in the hypothesis space or to reduce the size of the space has also been independently studied by other researcher recently (Cohen 1992, Gruau 1996, Whigham 1995; 1996). It has been demonstrated that the induced bias is better than that of FOIL on many standard learning tasks. From these experiments, it can be concluded that the adaptive knowledge acquisition system has superior learning ability compared to FOIL. Since they are different in their search biases only, the result implies that the search bias induced by GGP is better than that of FOIL for the learning problems. This result is surprising because the search biases of the adaptive knowledge acquisition system are initialized by a random process. These biases are normally poor, but the process of natural selection and evolution can successfully evolve a good bias. It is important to mention that the search bias is rather general because it has reasonable performance on many traditional learning problems using the same bias acquired automatically. This paper illustrates that GGP is a plausible approach for implementing a meta-level learning system. For future work, in order to find a Page 27 general, efficient, and effective bias, a large number of learning tasks of different kinds, such as the member, append, quick sort, ackermann, uncle, and grandfather problems, with various characteristics should be used. This adaptive learning approach, though computationally intensive, is rather exciting, as it opens up many opportunities for creating or improving learning algorithms. Page 28 Reference Abramson, H. and Dahl, V. (1989). Logic Grammars. Berlin: Springer-Verlag. Bremen, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Belmont: Wadsworth. Cestnik, B., Kononenko, J. and Bratko, I. (1987). ASSISTANT 86: A knowledge elicitation tool for sophisticated users. In I. Bratko and N. Lavrac (Ed.), Progress in Machine Learning, 31-45. Wilmslow: Sigma Press. Clark, P. and Niblett, T. (1989). The CN2 induction algorithm. Machine Learning; 3, 261-283. Cohen, W. (1992). Compiling Prior Knowledge into an Explicit Bias. In Proceedings of the Ninth International Workshop on Machine Learning, 102-110. CA: Morgan Kaufmann. Colmerauer, A. (1978). Metamorphosis Grammars. In L. Bolc (Ed.), Natural Language Communication with Computers. Berlin: Springer-Verlag. Dzeroski, S. (1996). Inductive Logic Programming and Knowledge Discovery in Databases. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery in Data Mining, 117-152. Menlo Park, CA: AAAI Press. Dzeroski, S. and Lavrac, N. (1993). Inductive Learning in Deductive Databases. IEEE Transactions on Knowledge and Data Engineering, 5, 939-949. Gruau, F. (1996). On Using Syntactic Constraints with Genetic Programming. In P. J. Angeline and K. E. Kinnear, Jr. (Eds.) Advances in Genetic Programming 2, 402-417. MA: MIT Press. Hopcroft, J. E. and Ullman, J. D. (1979). Introduction to automata theory, languages, and computation. MA: Addison-Wesley. Koza, J. R. (1994). Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge, MA: MIT Press. Koza, J. R. (1992). Genetic Programming: on the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press. Kinnear, K. E. Jr., editor (1994). Advances in Genetic Programming. Cambridge, MA: MIT Press. Kowalski, R. A. (1979). Logic For Problem Solving. Amsterdam: North-Holland. Page 29 Lavrac, N. and Dzeroski, S. (1994). Inductive Logic Programming: Techniques and Applications. London: Ellis Horword. Lewis, H. R. and Rapadimitrion, C. H. (1981). Elements of the theory of computation. NJ: Prentice Hall. Michalski, R. S., Mozetic, I., Hong, J. and Lavrac, N. (1986). The multi-purpose incremental learning system AQ15 and its testing application on tree medical domains. In Proceedings of the National Conference on Artificial Intelligence, 1041- 1045. San Mateo, CA: Morgan Kaufmann. Mitchell, T. M. (1982). Generalization as Search, Artificial Intelligence, 18, 203-226. Muggletion, S., editor (1992). Inductive Logic Programming. London: Academic Press. Muggleton, S. and Buntine, W. (1988). Machine invention of first-order predicates by inverting resolution. In Proceedings of the Fifth International Conference on Machine Learning, 339-352. CA: Morgan Kaufmann Muggleton, S. and Feng, C. (1990), Efficient induction of logic programs, In Proceedings of the First Conference on Algorithmic Learning Theory, 1-14. Pereira, F. C. N. and Warren, D. H. D. (1980). Definite Clause Grammars for Language Analysis - A Survey of the Formalism and a Comparison with Augmented Transition Networks Artificial Intelligence, 13, 231-278. Pereira, F. C. N. and Shieber, S. M. (1987). Prolog and Natural-Language Analysis. CA: CSLI. Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5, 239-266. Sterling, L. and Shapiro, E. (1986). The Art of Prolog. Cambridge, MA: MIT Press. Whigham, P. A. (1996). Search Bias, Language Bias and Genetic Programming. In Proceedings of the First Genetic Programming Conference, 230-237. Cambridge, MA: MIT Press. Whigham, P. A. (1995). Inductive Bias and Genetic Programming. In Proceedings of the First International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, 461-466. UK: IEE.