This paper studies a situation in which correct knowledge is harmful to a problem solver even given unlimited computational resources. A knowledge base is defined to be sociopathic if all the tuples in the knowledge base are individually judged to be correct and a subset of the knowledge base gives better performance than the original knowledge base independent of the amount of computational resources that are available. Almost all knowledge bases that contain probabilistic rules are shown to be sociopathic, and so this problem is very widespread. Sociopathicity has important consequences for rule induction methods and rule set debugging methods. Sociopathic knowledge bases cannot be properly debugged using the widespread practice of incremental modification and deletion of rules responsible for wrong conclusions, a la Teiresias; this approach fails to converge to an optimal solution. The problem of optimally debugging sociopathic knowledge bases is modeled as a bipartite graph minimization problem and shown to be NP-hard. Our heuristic solution approach is called the Sociopathic Reduction Algorithm, and experimental results verify its efficacy.

Department of Computer Science
College of Engineering

Report No. UIUCDCS-R-89-1538
UILU-ENG-89-1757

Sociopathic Knowledge Bases: Correct Knowledge Can be Harmful Even Given Unlimited Computation

David C. Wilkins and Yong Ma

Knowledge-Based Systems Group
Department of Computer Science
University of Illinois
405 North Mathews Ave
Urbana, IL 61801

August 1989

Submitted for Publication: Artificial Intelligence Journal

Contents

1 Introduction
2 Inexact Reasoning and Rule Interactions
3 Debugging Rule Sets and Rule Interactions
   3.1 Types of rule interactions
   3.2 Traditional methods of debugging a rule set
4 Minimizing Sociopathic Interactions
   4.1 Bipartite graph minimization formulation
5 Sociopathic Reduction Algorithm
   5.1 The Sociopathic Reduction Algorithm
   5.2 Example of sociopathic reduction
   5.3 Experience with the Sociopathic Reduction Algorithm
6 Related Work
7 Summary and Conclusion
8 Acknowledgements
Appendix 1: Calculating G
References

1 Introduction

Reasoning under uncertainty has been widely investigated in artificial intelligence.
Probabilistic approaches are of particular relevance to rule-based expert systems, where one is interested in modeling the heuristic and evidential reasoning of experts. Methods developed to represent and draw inferences under uncertainty include the certainty factors used in Mycin (Buchanan and Shortliffe, 1984), fuzzy set theory (Zadeh, 1979), and the belief functions of Dempster-Shafer theory (Shafer, 1976; Gordon and Shortliffe, 1985). In many expert system frameworks, such as Emycin, Expert, MRS, S.1, and Kee, the rule structure permits a conclusion to be drawn with varying degrees of certainty or belief. This paper addresses a concern common to all these methods and systems.

In refining and debugging a probabilistic rule set, there are three major causes of errors: missing rules, wrong rules, and deleterious interactions between good rules. The purpose of this paper is to explicate a type of deleterious interaction and to show that it (a) is indigenous to rule sets for reasoning under uncertainty, (b) is of a fundamentally different nature from missing and wrong rules, (c) cannot be handled by traditional methods for correcting wrong and missing rules, and (d) can be handled by the method described in this paper.

In section 2, we describe the type of deleterious rule interactions that we have encountered in connection with automatic induction of rule sets, and explain why most rule modification methods fail to grasp the nature of the problem. In section 3, we discuss approaches to debugging and refining rule sets and explain why traditional rule set debugging methods are inadequate for handling global interactions. In section 4, we formulate the problem of reducing deleterious interactions as a bipartite graph minimization problem and show that it is NP-hard. In section 5, we present a heuristic method called the Sociopathic Reduction Algorithm. Finally, our experiences in using the Sociopathic Reduction Algorithm are described.
A brief description of terminology will be helpful to the reader. Assume there exists a collection of training instances, each represented as a set of feature-value pairs of evidence and a set of hypotheses. Rules are in Horn clause form: conclude(H, CF) :- E, where E is a conjunction of evidence, H is a hypothesis, and CF is a certainty factor or its equivalent. A rule that correctly confirms a hypothesis generates true positive evidence; one that correctly disconfirms a hypothesis generates true negative evidence. A rule that incorrectly confirms a hypothesis generates false positive evidence; one that incorrectly disconfirms a hypothesis generates false negative evidence. False positive and false negative evidence can lead to misdiagnoses of training instances.

2 Inexact Reasoning and Rule Interactions

When operating as an evidence-gathering system (Buchanan and Shortliffe, 1984), an expert system accumulates evidence for and against competing hypotheses. Each rule whose preconditions match the gathered data contributes either positively or negatively toward one or more hypotheses. Unavoidably, the preconditions of probabilistic rules succeed on instances where the rule will be contributing false positive or false negative evidence for conclusions. For example, consider the following rule:

    conclude(klebsiella, 0.77) :-                                   (R1)
        finding(surgery, yes), finding(gram_neg_infection, yes)

The frequency with which R1 generates false positive evidence has a major influence on its CF of 0.77, where $-1 \le CF \le 1$.
Indeed, given a representative set of training instances, such as a library of medical cases, the certainty factor of a rule can be given a probabilistic interpretation [1] as a function $G(x_1, x_2, x_3)$, where $x_1$ is the fraction of the positive instances of a hypothesis where the rule premise succeeds, thus contributing true positive or false negative evidence; $x_2$ is the fraction of the negative instances of a hypothesis where the rule premise succeeds, thus contributing false positive or true negative evidence; and $x_3$ is the ratio of positive instances of a hypothesis to all instances in the training set. For R1 in our domain, $G(.43, .10, .22) = 0.77$ by the formulas in Appendix 1, because statistics on 104 training instances yield the following values:

$$
\begin{aligned}
x_1 &: E \text{ true among positive instances} = 10/23 \\
x_2 &: E \text{ true among negative instances} = 8/81 \\
x_3 &: H \text{ true among all instances} = 23/104
\end{aligned}
\tag{1}
$$

[1] See Appendix 1 for a description of the function G. The calculations of G give a purely statistical interpretation to CFs, and hence do not incorporate orthogonal utility measures as was done in MYCIN (Buchanan and Shortliffe, 1984).

Hence, R1 generates false positive evidence on eight instances, some of which may lead to false negative diagnoses. But whether they do or not depends on the other rules in the system; hence our emphasis on taking a global perspective.

The usual method of dealing with situations such as this is to make the rule fail less often by specializing its premise (Michalski et al., 1983). For example, surgery could be specialized to neurosurgery, and we could replace R1 with:

    conclude(klebsiella, 0.92) :-                                   (R2)
        finding(neurosurgery, yes), finding(gram_neg_infection, yes)

On our case library of training instances for the R2 rule, $G(.26, .02, .22) = 0.92$, so R2 makes erroneous inferences in two instances instead of eight.
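The statistical reading of a certainty factor can be checked numerically. The following is a minimal Python sketch, not the authors' code: the function name `certainty_factor` is hypothetical, the confirming branch follows the worked calculation in section 5.2 (posterior $x_4$, then $CF = (x_4 - x_3) / (x_4 (1 - x_3))$), and the disconfirming branch is an assumed symmetric form, since Appendix 1 is not reproduced in this part of the report. It assumes $0 < x_3 < 1$ and a premise that succeeds on at least one instance.

```python
def certainty_factor(x1: float, x2: float, x3: float) -> float:
    """Statistical CF of a rule E -> H from training frequencies.

    x1: fraction of positive instances on which the premise E succeeds
    x2: fraction of negative instances on which the premise E succeeds
    x3: fraction of all instances that are positive for H
    """
    # Posterior probability of H given E, by Bayes' rule.
    x4 = (x1 * x3) / (x1 * x3 + x2 * (1 - x3))
    if x4 >= x3:
        # Confirming rule: form used in the worked example of section 5.2.
        return (x4 - x3) / (x4 * (1 - x3))
    # Disconfirming rule: ASSUMED mirror image of the confirming branch.
    return (x4 - x3) / (x3 * (1 - x4))


print(round(certainty_factor(0.43, 0.10, 0.22), 2))  # 0.77 (rule R1)
print(round(certainty_factor(0.26, 0.02, 0.22), 2))  # 0.92 (rule R2)
```

With the rounded frequencies quoted in the text, the sketch reproduces the strengths of R1 and R2, which suggests the confirming branch matches the intended G at least on these inputs.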
Nevertheless, modifying R1 to be R2 on the grounds that R1 contributes to a misdiagnosis is not always appropriate; we offer three objections to this frequent practice. First, both rules are inexact rules that offer advice in the face of limited information, and their relative accuracy and correctness is explicitly represented by their respective CFs. We expect them to fail, hence failure should not necessarily lead to their modification. Second, all probabilistic rules reflect a trade-off between generality and specificity. An overly general rule provides too little discriminatory power, and an overly specific rule contributes too infrequently to problem solving. A policy on proper grain size is explicitly or implicitly built into rule induction programs; this policy should be followed as much as possible. Specialization usually produces a rule that violates such a policy. Third, if the underlying problem for an incorrect diagnosis is rule interactions, a more specialized rule, such as the specialization of R1 to R2, can be viewed as creating a potentially more dangerous rule. Although it makes an incorrect inference in only two instead of eight instances, these two instances will now be harder to counteract when they contribute to misdiagnoses, because R2 is stronger. Note that a rule with a large CF is more likely to have its erroneous conclusions lead to misdiagnoses. This perspective motivates the prevention of misdiagnoses in ways other than the use of rule specialization or generalization.

Besides rule modification, another common method of nullifying the incorrect inference of a rule in an evidence-gathering system is to introduce counteracting rules. In our example, these would be rules with a negative CF that conclude Klebsiella on the false positive training instances that lead to misdiagnoses. But since these new rules are probabilistic, they will introduce false negatives on some other training instances, and these may lead to misdiagnoses.
We could add yet more counteracting rules with a positive CF to nullify any problems caused by the original counteracting rules, but these additional rules introduce false positives on yet other training instances, and these may lead to other misdiagnoses. Also, a counteracting rule is often of lower quality than the rules in the original rule set; if it were otherwise, the induction program would have included the counteracting rule in the original rule set. Clearly, adding counteracting rules is not necessarily the best way of dealing with misdiagnoses made by probabilistic rules.

3 Debugging Rule Sets and Rule Interactions

Assume we are given a set of probabilistic rules that were either automatically induced from a set of training cases or created manually by an expert and knowledge engineer. In refining and debugging this probabilistic rule set, there are three major causes of errors: missing rules, wrong rules, and unexpected interactions among good rules. We first describe types of rule interactions, and then show how the traditional approach to debugging is inadequate.

3.1 Types of rule interactions

In a rule-based system, there are many types of rule interactions. Rules interact by chaining together, by using the same evidence for different conclusions, and by drawing the same conclusions from different collections of evidence. Thus one of the lessons learned from research on MYCIN was that complete modularity of rules is not possible to achieve when rules are written manually (Buchanan and Shortliffe, 1984). An expert uses other rules in a set of closely interacting rules in order to define a new rule, in particular to set a CF value relative to the CFs of interacting rules.

Automatic rule induction systems encounter the same problems. Moreover, automatic systems lack the understanding of the strong semantic relationships among concepts that would allow judgments about the relative strengths of evidential support.
Instead, induction systems use biases to guide the rule search (Michalski et al., 1983). The rule sets that are later analyzed for sociopathicity in this paper were generated by the induction subsystem of ODYSSEUS. The inductive biases used in this system are rule generality, whereby a rule must cover a certain percentage of instances; rule specificity, whereby a rule must be above a minimum discrimination threshold; rule colinearity, whereby rules must not be too similar in classification of the instances in the training set; and rule simplicity, whereby a maximum bound is placed on the number of conjunctions and disjunctions (Wilkins, 1987).

3.2 Traditional methods of debugging a rule set

The standard approach to debugging a rule set consists of iteratively performing the following steps:

• Step 1. Run the system on cases until a false diagnosis is made.
• Step 2. Track down the error and correct it, using one of five methods pioneered by Teiresias (Davis, 1982) and used by knowledge engineers generally:
  - Method 1: Make the preconditions of the offending rules more specific or sometimes more general. [2]
  - Method 2: Make the conclusions of offending rules more general or sometimes more specific.
  - Method 3: Delete offending rules.
  - Method 4: Add new rules that counteract the effects of offending rules.
  - Method 5: Modify the strengths or CFs of offending rules.

[2] Ways of generalizing and specializing rules are nicely described in (Michalski et al., 1983). They include dropping conditions, changing constants to variables, generalizing by internal disjunction, tree climbing, interval closing, exception introduction, etc.

This approach may be sufficient for correcting wrong and missing rules. However, it is flawed from a theoretical point of view with respect to its sufficiency for correcting problems resulting from the global behavior of rules over a set of cases. It possesses two serious methodological problems.
First, using all five of these methods is not necessarily appropriate for dealing with global deleterious interactions. In section 2 we explained why in some situations modifying the offending rule or adding counteracting rules leads to problems and misses the point of having probabilistic rules; this eliminates methods 1, 2 and 4. If rules are being induced from a representative set of training cases, modifying the strength of a rule is illegal, since the strength has a probabilistic interpretation, being derived from frequency information in the training instances; this eliminates method 5. Only method 3 is left to cope with deleterious interactions.

The second methodological problem is that the traditional method picks an arbitrary case to run in its search for misdiagnoses. Such a procedure will often not converge to a good rule set, even if modifications are restricted to rule deletion. The example in section 5.2 illustrates this situation.

Our perspective on this topic evolved in the course of experiments in induction and refinement of knowledge bases. Using "better" induction biases did not always produce rule sets with better performance, and this prompted us to investigate the possibility of global probabilistic interactions. Our original approach to debugging was similar to the Teiresias approach. Often, correcting a problem led to other cases being misdiagnosed, and in fact this type of automated incremental debugging seldom converged to an acceptable set of rules. It might have if we had engaged in the common practice of "tweaking" the CF strengths of rules.
However, this was not permissible, since our CF values were derived from a representative set of training cases and have a precise probabilistic interpretation.

4 Minimizing Sociopathic Interactions

Assume there exists a large set of training instances, and that a rule set for solving these instances has been induced that is fairly complete and contains rules that are individually judged to be good. By good, we mean that they individually meet some predefined quality standards, such as the biases described in section 3.1. Further, assume that the rule set misdiagnoses some of the instances in the training set. Given such an initial rule set, the problem is to find a rule set that meets some optimality criterion, such as minimizing the number of misdiagnoses without violating the goodness constraints on individual rules. Now modifications to rules, except for rule deletion, generally break the predefined goodness constraints. And adding other rules is not desirable, for if they satisfied the goodness constraints they would have been in the original rule set produced by the induction program. Hence, if we are to find a solution that meets the described constraints, the solution must be a subset of the original rule set. [3] More formally:

Definition 1 (Sociopathic Knowledge Base) A knowledge base is sociopathic if and only if (1) all the tuples in the knowledge base are individually judged to be good; and (2) a subset of the knowledge base gives better performance than the original knowledge base independent of the amount of available computational resources.

By the definition of a sociopathic knowledge base, the best rule set is viewed as the element of the power set of rules in the initial rule set that yields a global minimum weighted error. A straightforward approach is to examine and compare all subsets of the rule set. However, the power set (a set of n rules has $2^n$ subsets) is almost always too large to work with, especially when the initial set has deliberately been generously generated.
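To make the exhaustive approach concrete, here is a small Python sketch under stated assumptions. The three rules, their strengths, and the four training instances are hypothetical toy data, the names `misdiagnoses` and `best_subset` are illustrative, and evidence is combined by plain summation against a threshold of 0, as in the example of section 5.2. The loop visits every subset, so it is only usable for very small rule sets.

```python
from itertools import combinations

# Hypothetical toy data: rule strengths and, for each training instance,
# its class (+1 or -1) and the rules whose premises it satisfies.
CF = {"r1": 0.5, "r2": -0.6, "r3": 0.3}
CASES = [(+1, {"r1", "r2"}), (-1, {"r2"}), (+1, {"r1", "r3"}), (+1, {"r3"})]


def misdiagnoses(subset, cf_t=0.0):
    """Count instances classified wrongly when only `subset` is active."""
    errors = 0
    for label, fired in CASES:
        evidence = sum(CF[r] for r in fired if r in subset)  # F = summation
        predicted = +1 if evidence > cf_t else -1
        errors += predicted != label
    return errors


def best_subset(rules, min_size=0):
    """Exhaustively search all subsets with at least `min_size` rules."""
    best = None
    for k in range(len(rules), min_size - 1, -1):  # larger subsets first
        for subset in combinations(sorted(rules), k):
            err = misdiagnoses(set(subset))
            if best is None or err < best[0]:
                best = (err, set(subset))
    return best


print(misdiagnoses(set(CF)))  # 1: the full rule set misdiagnoses one instance
print(best_subset(CF))        # error count 0, reached by keeping r1 and r3
```

In this toy case the full rule set makes one error while a proper subset makes none, which is the situation Definition 1 describes. Because strict improvement is required before a candidate replaces the current best, ties are resolved in favor of the larger subset.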
The selection process can be modeled as a bipartite graph minimization problem as follows.

4.1 Bipartite graph minimization formulation

A bipartite graph $G = (V, E)$ is a graph whose vertices $V$ can be partitioned into two sets $V_1$ and $V_2$ so that every edge in $E$ joins a vertex in $V_1$ to a vertex in $V_2$. For each hypothesis in the set of training instances, define a directed bipartite graph $G = (V, E)$, with its vertices $V$ partitioned into two sets $I$ and $R$, as shown in Figure 1. Elements of $R$ represent rules, and the evidential strength of $R_j$ is denoted by $CF_j$. Each vertex in $I$ represents a training instance; for positive instances $M_i$ is 1, and for negative instances $M_i$ is -1. Arcs $[R_j, I_i]$ connect a rule in $R$ with the training instances in $I$ for which its preconditions are satisfied; the weight of arc $[R_j, I_i]$ is $CF_j$. The weighted arcs terminating in a vertex in $I$ are combined using an evidence combination function $F$, which is defined by the user. The combined evidence classifies an instance as a positive instance if the combined evidence is above a user-specified threshold $CF_t$. In the example in section 5.2, $CF_t$ is 0, while for Mycin, $CF_t$ is 0.2.

[3] If we discover that this solution is inadequate for our needs, then introducing rules that violate the induction biases is justifiable.

[Figure 1: instance set nodes $I_1 (M_1), I_2 (M_2), \ldots, I_m (M_m)$ on the left; rule set nodes $R_1 (CF_1), R_2 (CF_2), \ldots, R_n (CF_n)$ on the right.]

Figure 1: Bipartite Graph Formulation. The left hand nodes, $I_1, \ldots, I_m$, represent a case library of m training instances, where $M_i$ indicates whether an instance is a positive or negative example of a hypothesis. The right hand nodes, $R_1, \ldots, R_n$, represent a knowledge base of probabilistic rules, where $CF_j$ is the strength of the rule. The links show which training instances $I_1, \ldots, I_m$ satisfy the preconditions of rule $R_j$.

More formally, assume that $I_1, \ldots, I_m$ is the training set of instances and $R_1, \ldots, R_n$ are the rules of an initial rule set.
Then we want to minimize

$$ z = \sum_{i=1}^{m} e_i $$

subject to the constraints

$$ e_i = 0 \quad \text{if } F(a_{i1} r_1, \ldots, a_{in} r_n) > CF_t \text{ for } M_i = 1 $$

[...] ($F$ must be 1), then at least one $g(a_{ij} r_j)$ is 1. By the definition of $g(a_{ij} r_j)$ above, either $A_j$ appears in $C_i$ and $r_j = 1$, or $\bar{A}_j$ appears in $C_i$ and $r_j = 0$. In either case, according to the output transformation, the corresponding clause $C_i$ is satisfied (true).

Only if part: Assume that $C_i$ is satisfied by the truth assignment in the final rule set. Then there must exist some atom $A_j$ such that either $A_j$ is in $C_i$ and is assigned to be true, or $\bar{A}_j$ is in $C_i$ and $A_j$ is assigned to be false. In either case, $g(a_{ij} r_j) = 1$, by the output transformation and the definition of the function. Therefore, $F(a_{i1} r_1, \ldots, a_{in} r_n) = 1$ and $e_i = 0$. To summarize, $g(a_{ij} r_j)$ being 1 corresponds intuitively to the positive contribution made by $A_j$ to $C_i$.

Finally, we show that SAT is satisfiable iff the BGMP so constructed has a minimum objective value of 0. If BGMP has a solution with $z = 0$, then $e_i = 0$ for all $i$, because $b_i = 1$. Therefore each $C_i$ is satisfied and thus SAT is satisfiable. Conversely, if SAT is satisfiable, then each $C_i$ can be satisfied by some truth assignment of atoms. Clearly, the final rule set of the BGMP formulation (of SAT) can be easily constructed with $z = 0$, according to that assignment. □

Corollary 1 Given a positive real number $B$, the problem of determining if there exists a rule set whose global weighted error $z$ is less than or equal to $B$ in the bipartite graph formulation for heuristic rule set optimization is NP-complete.

Proof: To show that this decision problem is in NP, we notice that it is easy to construct a polynomial algorithm for checking whether or not the (weighted) number of misdiagnoses by any given subset of $R$ is less than or equal to $B$. It is NP-hard by an argument similar to that in the proof of the above theorem.
□

5 Sociopathic Reduction Algorithm

In this section, a heuristic method called the Sociopathic Reduction Algorithm is described, and an example is provided based on the graph shown in Table 1.

5.1 The Sociopathic Reduction Algorithm

The following heuristic hill-climbing search method, the Sociopathic Reduction Algorithm, is one that we have developed and used in our experiments:

• Step 1. Assign values to penalty constants. Let $p_1$ be the penalty assigned to a poison rule. A poison rule is a strong rule giving erroneous evidence for a case that cannot be counteracted by the combined weight of all the rules in the rule base that give correct evidence. Let $p_2$ be the penalty for contributing false positive evidence to a misdiagnosed case, $p_3$ be the penalty for contributing false negative evidence to a misdiagnosed case, $p_4$ be the penalty for contributing false positive evidence to a correctly diagnosed case, $p_5$ be the penalty for contributing false negative evidence to a correctly diagnosed case, and $p_6$ be the penalty for using weak rules. Let $h$ be the maximum number of rules that are removed at each iteration. Let $R_{min}$ be the minimum size of the solution rule set.
• Step 2. Optional step for very large rule sets: given an initial rule set, create a new rule set containing the n strongest rules for each case.
• Step 3. Find all misdiagnosed cases for the rule set. If none exists, stop. Otherwise, collect and rank the rules that contribute evidence toward these erroneous diagnoses. The rank of rule $R_j$ is $\sum_{i=1}^{6} p_i n_{ij}$, where:
  - $n_{1j}$ = 1 if $R_j$ is a poison rule or its deletion leads to the creation of another poison rule, and 0 otherwise;
  - $n_{2j}$ = the number of misdiagnoses for which $R_j$ gives false positive evidence;
  - $n_{3j}$ = the number of misdiagnoses for which $R_j$ gives false negative evidence;
  - $n_{4j}$ = the number of correct diagnoses for which $R_j$ gives false positive evidence;
  - $n_{5j}$ = the number of correct diagnoses for which $R_j$ gives false negative evidence;
  - $n_{6j}$ = the absolute value of the CF of $R_j$.
• Step 4. Eliminate the $h$ highest ranking rules.
• Step 5. If the number of misdiagnoses is decreased, go to step 3.
• Step 6. Else, if the number of misdiagnoses begins to increase and $h \ne 1$, then
  - Undo the last deletion, i.e., take back the most recently removed $h$ rules. [4]
  - $h \leftarrow h - 1$. [5]
  - Go to step 3.
• Step 7. Otherwise, i.e., if the number of misdiagnoses is increased and $h = 1$, then undo the last rule deletion; output the final rule set and stop.

Each iteration of the algorithm produces a new rule set, and each rule set must be rerun on all training instances to locate the new set of misdiagnosed instances. If this is particularly difficult to do, the $h$ parameter in step 4 can be increased, but there is the potential risk of converging to a suboptimal solution. For each misdiagnosed instance, the automated reasoning system that uses the rule set must be able to explain which rules contributed to a misdiagnosis. Hence, we require a system with good explanation capabilities.

The nature of an optimal rule set differs between domains. Penalty constants, $p_i$, are the means by which the user can define an optimal policy. For instance, via $p_2$ and $p_3$, the user can favor false positive over false negative misdiagnoses, or vice versa. For medical expert systems, a false negative is often more damaging than a false positive, as false positives generated by a medical program can often be caught by a physician upon further testing. False negatives, however, may be sent home, never to be seen again. In our experiments, the value of the six penalty constants was $p_i = 10^{6-i}$.
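The steps above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation, and it rests on several labeled assumptions: evidence is combined by summation with threshold 0; the poison-rule term $n_{1j}$ is left at 0; a deletion that fails to reduce the number of misdiagnoses is treated like one that increases it; and $R_{min}$ is enforced by never deleting below that size. The marks in `CASES` are one hypothetical assignment of rules to instances chosen to be consistent with the rule strengths, coverage counts, and outcomes described in the example of section 5.2; they are not a transcription of Table 1. Exact fractions (1/3 and 3/4 for the rounded strengths .33 and .75) avoid rounding at the threshold.

```python
from fractions import Fraction as Fr

CF = {"R1": Fr(1, 3), "R2": Fr(3, 4), "R3": Fr(1, 3),
      "R4": Fr(-1, 3), "R5": Fr(-3, 4), "R6": Fr(-1, 3)}
# instance -> (class, rules whose premises it satisfies); HYPOTHETICAL marks
CASES = {
    "I0": (+1, {"R3"}),
    "I1": (+1, {"R1", "R2"}),
    "I2": (+1, {"R1", "R2", "R3", "R6"}),
    "I3": (+1, {"R1", "R2", "R4", "R6"}),
    "I4": (+1, {"R2", "R3", "R4", "R5"}),
    "I5": (-1, {"R1", "R2", "R5"}),
    "I6": (-1, {"R1", "R3", "R4", "R6"}),
    "I7": (-1, {"R4", "R5"}),
    "I8": (-1, {"R3", "R4", "R5", "R6"}),
    "I9": (-1, {"R5", "R6"}),
}
PENALTY = [10 ** (6 - i) for i in range(1, 7)]  # p_1 .. p_6


def misdiagnosed(kept, cf_t=0):
    """Instances whose summed evidence falls on the wrong side of cf_t."""
    wrong = set()
    for inst, (label, fired) in CASES.items():
        evidence = sum(CF[r] for r in fired & kept)
        if (+1 if evidence > cf_t else -1) != label:
            wrong.add(inst)
    return wrong


def erroneous(rule, label):
    """A rule gives erroneous evidence when its sign disagrees with the class."""
    return (CF[rule] > 0) != (label > 0)


def rank(rule, wrong):
    """Ranking quantity sum_i p_i * n_ij (n_1j fixed at 0 in this sketch)."""
    n = [0, 0, 0, 0, 0, abs(CF[rule])]
    for inst, (label, fired) in CASES.items():
        if rule in fired and erroneous(rule, label):
            false_positive = CF[rule] > 0
            if inst in wrong:
                n[1 if false_positive else 2] += 1  # n_2j or n_3j
            else:
                n[3 if false_positive else 4] += 1  # n_4j or n_5j
    return sum(p * x for p, x in zip(PENALTY, n))


def sociopathic_reduction(h=1, r_min=4):
    kept, deleted = set(CF), []
    wrong = misdiagnosed(kept)
    while wrong and len(kept) > r_min:
        blamed = [r for r in sorted(kept)
                  if any(r in CASES[i][1] and erroneous(r, CASES[i][0])
                         for i in wrong)]
        blamed.sort(key=lambda r: rank(r, wrong), reverse=True)
        batch = blamed[:min(h, len(kept) - r_min)]
        if not batch:
            break
        trial_wrong = misdiagnosed(kept - set(batch))
        if len(trial_wrong) < len(wrong):   # Step 5: keep the deletion
            kept, wrong = kept - set(batch), trial_wrong
            deleted += batch
        elif h > 1:                         # Step 6: undo, shrink the step
            h -= 1
        else:                               # Step 7: undo and stop
            break
    return kept, wrong, deleted


print(sorted(misdiagnosed(set(CF))))           # ['I4', 'I5']
kept, wrong, deleted = sociopathic_reduction()
print(deleted, sorted(kept), wrong)            # ['R1', 'R4'] ['R2', 'R3', 'R5', 'R6'] set()
print(sorted(misdiagnosed(set(CF) - {"R5"})))  # ['I5']
```

On this data the sketch deletes R1 and then R4 and ends with no misdiagnoses, whereas deleting R5 first leaves I5 misdiagnosed. Using a step size of 1 keeps the trace easy to follow; larger values of `h` exercise the undo branch.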
The $h$ constant determines how many rules are removed on each iteration, and its value is about 5. $R_{min}$ is the minimum size of the solution rule set, usually about 90% of the original set; its usefulness was described in section 4.1.

[4] It is this step that makes it a hill-climbing algorithm.

[5] Since $h$ is usually small, say about 5, the next incremental step of 1 is the simplest, although a more complicated schema of step decrements can be implemented for a relatively large $h$.

Rules (CF): R1 (+.33)*, R2 (+.75), R3 (+.33), R4 (-.33)*, R5 (-.75)c, R6 (-.33)

Instance (class): marks
I0 (+): x
I1 (+): x x
I2 (+): x x x x
I3 (+): x x x x
I4 (+)*: x x x x
I5 (-)*: x x x
I6 (-): x x x x
I7 (-): x x
I8 (-): x x x x
I9 (-): x x

Table 1: An example for the Sociopathic Reduction Algorithm. There are ten training instances that are classified as positive (+) or negative (-) instances of the hypothesis. There are six rules shown with their CF strength. The marks indicate the instances to which the rules apply, i.e., when an instance satisfies the premise clauses of a rule.

5.2 Example of sociopathic reduction

In this example, which is illustrated in Table 1, there are ten training instances $I_0, \ldots, I_9$, classified as positive or negative instances of the hypothesis. There are six rules $R_1, \ldots, R_6$ shown with their CF strength. The marks (x) indicate the instances to which the rules apply, i.e., when an instance satisfies the premise clauses of a rule. To simplify the example, define the combined evidence for an instance as the sum of the evidence contributed by all applicable rules, and let $CF_t = 0$. Rules with a CF of one sign that are connected to an instance of the other sign contribute erroneous evidence. Two cases in the example are misdiagnosed: $I_4$ and $I_5$. The objective is to find a subset of the rule set that minimizes the number of misdiagnoses.

Before the details are examined, the following points concerning examples should be made.
First, it can be shown that it is impossible to have an example using rules with out degree less than 5 that makes all the points to be made from this example, if there are equal numbers of positive and negative training instances. The argument is trivial for rules with out degree of 1 and 2. For a rule with out degree of 3, assume that it has a positive CF value and is to be deleted. Then it must misdiagnose some negative instance to become a rule to be blamed. And, in order to have a positive CF, it must provide (positive) evidence for two positive instances, provided that the number of positive instances is equal to that of negative instances. Therefore, the number of correct diagnoses for which it gives false positive evidence must be zero, since the only negative instance that it connects to is the misdiagnosed one. Then its ranking vector is $(n_{1j}, n_{2j}, n_{3j}, n_{4j}, n_{5j}, n_{6j}) = (0, 1, 0, 0, 0, CF)$, which results in the smallest ranking quantity that a blamed rule with positive CF can have. Thus, the algorithm is not guaranteed to choose it for deletion. The argument for rules with out degree of 4 is similar to the above, or the CF values are zero if the rules connect to two positive instances and two negative ones. It may be possible to devise a heuristic algorithm that gives better computational performance from this observation.

The second point to make is that the CF values attached to the rules are the real values that are calculated based on the formula given in the appendix. Take $R_1$ (+.33) for example:

$$
\begin{aligned}
x_1 &: E \text{ true among positive instances} = 3/5 \\
x_2 &: E \text{ true among negative instances} = 2/5 \\
x_3 &: H \text{ true among all instances} = 5/10
\end{aligned}
\tag{9}
$$

Then,

$$ x_4 = \frac{x_1 x_3}{x_1 x_3 + x_2 (1 - x_3)} = 0.60 \tag{10} $$

Since $x_4 > x_3$,

$$ CF = \frac{x_4 - x_3}{x_4 (1 - x_3)} = 0.33 \tag{11} $$

We now proceed to examine the example. Assume that the final rule set must have at least four rules, hence $R_{min} = 4$.
Let $p_i = 10^{6-i}$, for $0 < i < 5$, thus choosing rules in the highest category, and using lower categories to break ties. On the first iteration, two misdiagnosed instances are found, $I_4$ and $I_5$, and four rules contribute erroneous evidence toward these misdiagnoses: $R_1$, $R_2$, $R_4$, and $R_5$. Their ranking vectors are shown in Table 2. Clearly, $R_1$ has the highest ranking quantity $\sum_{i=1}^{6} p_i n_{ij}$, thus it is chosen for deletion.

Rule   n_1j  n_2j  n_3j  n_4j  n_5j  n_6j
R1     0     1     0     1     0     0.33
R2     0     1     0     0     0     0.75
R4     0     0     1     0     1     0.33
R5     0     0     1     0     0     0.75

Table 2: The ranking vectors of blamed rules.

On the second iteration, one misdiagnosis is found, $I_4$, and two rules contribute erroneous evidence, $R_4$ and $R_5$. Rules are ranked and $R_4$ is deleted. This reduces the number of misdiagnoses to zero and the algorithm successfully terminates.

The same example can be used to illustrate the problem of the traditional method of rule set debugging, where the order in which cases are checked for misdiagnoses influences which rules are deleted. Consider a Teiresias style program that looks at training instances and discovers that $I_4$ is misdiagnosed. There are two rules that contribute erroneous evidence to this misdiagnosis, rules $R_4$ and $R_5$. It wisely notices that deleting $R_4$ causes $I_6$ to become misdiagnosed, hence increasing the number of misdiagnoses; so it chooses to delete $R_5$. However, no matter which rule it now deletes, there will always be at least one misdiagnosed case. To its credit, it reduced the number of misdiagnoses from two to one; however, it fails to converge to a rule set that minimizes the number of misdiagnoses.

5.3 Experience with the Sociopathic Reduction Algorithm

Some preliminary experiments with the Sociopathic Reduction Algorithm have been completed, using the Mycin case library, which is a collection of 112 solved cases that were obtained from records at the Stanford Medical Hospital.
The rule set of about 370 rules was the one obtained after (1) correcting an incorrect domain theory, and (2) using apprenticeship learning to extend an incomplete domain theory (Wilkins and Tan, 1989). The Sociopathic Reduction Algorithm removed 21 rules from the knowledge base after 8 iterations. Table 3 shows that an improvement of about 10% over the tested knowledge base is obtained.

                      Number    Before Reduction    After Reduction
Disease               of Cases  TP    FN    FP      TP    FN    FP
Bacterial Meningitis  16        14    2     13      12    4     4
Brain Abscess         7         1     6     0       1     6     0
Cluster Headache      10        8     2     0       8     2     0
Fungal Meningitis     8         3     5     0       4     4     0
Migraine              10        6     4     0       7     3     0
Myco-TB Meningitis    4         4     0     1       4     0     3
Primary Brain Tumor   16        3     13    0       10    6     1
Subarach Hemorrhage   21        16    5     3       16    5     4
Tension Headache      9         8     1     3       8     1     1
Viral Meningitis      11        10    1     12      10    1     6
None                  0         0     0     7       0     0     12
Totals                112       73    39    39      80    32    32

Table 3: The Sociopathic Reduction Algorithm, when applied to this knowledge base, improves the performance by about 10%.

Although our work is largely theoretical, a single experiment is by no means sufficient. Thus, our ongoing experiments involve two kinds of tests. First, we divide the cases into a training set and a validation set (70% and 30% of the cases, respectively), so that it can be shown that the performance improvement carries over to the validation set. To be more accurate, we would like to randomly split the cases five times and then average the improvements. Second, we would like to apply the method just described to the various knowledge bases available, for example, a knowledge base after correction of wrong rules only, a knowledge base after application of case-based learning, and so on.

6 Related Work

The original contribution of this paper is to show that correct knowledge can be harmful independent of problem-solving efficiency and that this problem is widespread.
Another contribution is to show that the problem of harmful knowledge can be minimized and problem-solving performance improved by a particular form of knowledge base reduction, and that finding the optimal reduction is NP-hard.

The theme of correct knowledge being harmful has been studied by a number of other investigators. Minton has investigated how the learning of correct search control knowledge can slow down a problem solver; his solution approach is to quantify the potential utility of a new piece of control knowledge and only add those pieces with a high utility (Minton and Carbonell, 1987). Markovitch and Scott have shown that any deductively learned knowledge affects the cost of searching a problem space; their solution approach is to use filter functions that can determine whether a piece of past knowledge that has been deductively learned should be used on a current problem (Markovitch and Scott, 1989). Still another approach is to modify learned search control knowledge to increase problem-solving speed (Prieditis and Mostow, 1987).

The theme of improving problem-solving accuracy via knowledge base reduction has been studied in conjunction with eliminating or reducing wrong knowledge. For example, the genetic algorithm used in conjunction with a classifier system eliminates as much as half of a knowledge base; it eliminates rules that have not contributed to past problem-solving successes (Holland, 1986). Another approach is to perform a global analysis of a knowledge base and eliminate those rules that are redundant or inconsistent (Ginsberg et al., 1988).

Learning systems that perform induction from noisy training instances have also addressed the problem of wrong knowledge. The RULEMOD program of META-DENDRAL selects a subset of rules that have wide applicability, thereby reducing the number of wrong rules (Buchanan and Mitchell, 1978).
RULEMOD also selects rules that jointly form a good global cover and hence shares our concern for finding rules that work well together. The TRUNC program of AQ15 deletes those disjunctions of non-probabilistic induced rules that cover the fewest cases (Michalski et al., 1986a; Michalski et al., 1986b). The reduced knowledge bases produced by RULEMOD and TRUNC give equal or superior performance.

7 Summary and Conclusion

Traditional methods of debugging a probabilistic rule set are suited to handling missing or wrong rules, but not to handling deleterious interactions between good rules. This paper describes the underlying reason for this phenomenon. We formulated the problem of minimizing deleterious rule interactions as a bipartite graph minimization problem and proved that it is NP-hard. A heuristic method was described for solving the graph problem, called the Sociopathic Reduction Algorithm. In our experiments, the Sociopathic Reduction Algorithm gave good results.

We believe that the rule set refinement method described in this paper, or its equivalent, is an important component of any learning system for automatic creation of probabilistic rule sets for automated reasoning systems. All such learning systems will confront the problem of deleterious interactions among good rules, and the problem will require a global solution method, such as we have described here. Our future research in this area aims to create a theory of sociopathicity that subsumes all AI techniques for reasoning under uncertainty, including certainty factors, Bayesian methods, probability methods, Dempster-Shafer theory, fuzzy reasoning, belief networks, and non-monotonic reasoning. For our progress to date, see (Ma and Wilkins, 1990a; Ma and Wilkins, 1990b; Ma and Wilkins, 1990c).
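
The ranking step of the Sociopathic Reduction Algorithm, as used in the example of Section 5.2, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the assumption that ranking vectors have six components indexed 1 through 6, and the two example vectors are all hypothetical and are not taken from Table 2.

```python
from typing import Dict, List

# Assumption: each ranking vector has six components (n_1j, ..., n_6j).
NUM_CATEGORIES = 6


def category_weights(num_categories: int = NUM_CATEGORIES) -> List[float]:
    """Weights p_i = 10^(6 - i) for i = 1..num_categories (index range assumed)."""
    return [10.0 ** (6 - i) for i in range(1, num_categories + 1)]


def ranking_quantity(vector: List[float], weights: List[float]) -> float:
    """Weighted sum of one rule's ranking vector: sum over i of p_i * n_ij."""
    return sum(p * n for p, n in zip(weights, vector))


def choose_rule_to_delete(vectors: Dict[str, List[float]]) -> str:
    """Return the blamed rule with the largest ranking quantity."""
    weights = category_weights()
    return max(vectors, key=lambda rule: ranking_quantity(vectors[rule], weights))


# Hypothetical ranking vectors, invented for illustration only.
blamed = {
    "Ra": [1, 0, 1, 0, 0, 0.33],
    "Rb": [0, 1, 0, 0, 0, 0.75],
}
print(choose_rule_to_delete(blamed))  # prints Ra
```

Because the weights are powers of ten, the weighted sum orders rules by the highest category first and uses lower categories only to break ties, as long as every component stays below 10. That bound is an observation about this sketch, not a claim made in the experiments.
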
8 Acknowledgements

We thank Marianne Winslett for suggesting the bipartite graph formulation and for detailed comments, and Bruce Buchanan for earlier major collaboration on this work (Wilkins and Buchanan, 1986). We also express our gratitude for the helpful discussions and critiques provided by Bill Clancey, Ramsey Haddad, David Heckerman, Eric Horvitz, Curt Langlotz, Peter Rathmann and Devika Subramanian.

This work was supported in part by NSF grant MCS-83-12148, ONR grant N00014-88K-0124, and an Arnold O. Beckman research award to the first author. We are grateful for the computer time provided by the Intelligent Systems Lab of Xerox PARC and SUMEX-AIM at Stanford University.

Appendix 1: Calculating G

Consider rules of the form conclude(H, CF) :- E. Then CF = $G = G(x_1, x_2, x_3)$ = empirical predictive power of rule R, where:

• $x_1 = P(E^+ \mid H^+)$ = fraction of the positive instances (true positives plus false negatives) in which R correctly succeeds
• $x_2 = P(E^+ \mid H^-)$ = fraction of the negative instances (false positives plus true negatives) in which R incorrectly succeeds
• $x_3 = P(H^+)$ = fraction of all instances that are positive instances

Given $x_1, x_2, x_3$, let

$$x_4 = P(H^+ \mid E^+) = \frac{x_1 x_3}{x_1 x_3 + x_2 (1 - x_3)}.$$

If $x_4 > x_3$ then $G = \frac{x_4 - x_3}{x_4 (1 - x_3)}$, else $G = \frac{x_4 - x_3}{x_3 (1 - x_4)}$. This probabilistic interpretation reflects the modifications to the certainty factor model proposed by Heckerman (1986).

References

Buchanan, B. G. and Mitchell, T. M. (1978). Model-directed learning of production rules. In Waterman, D. A. and Hayes-Roth, F., editors, Pattern-Directed Inference Systems, pages 297-312. New York: Academic Press.
Buchanan, B. G. and Shortliffe, E. H. (1984). Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Reading, Mass.: Addison-Wesley.
Davis, R. (1982). Application of meta level knowledge in the construction, maintenance and use of large knowledge bases. In Davis, R. and Lenat, D.
B., editors, Knowledge-Based Systems in Artificial Intelligence, pages 229-490. New York: McGraw-Hill.
Ginsberg, A., Weiss, S. M., and Politakis, P. (1988). Automatic knowledge base refinement for classification systems. Artificial Intelligence, 35(2):197-226.
Gordon, J. and Shortliffe, E. H. (1985). A method for managing evidential reasoning in a hierarchical hypothesis space. Artificial Intelligence, 26(3):323-358.
Heckerman, D. (1986). Probabilistic interpretations for Mycin's certainty factors. In Kanal, L. and Lemmer, J., editors, Uncertainty in Artificial Intelligence, pages 167-196. New York: North Holland.
Holland, J. H. (1986). Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In Michalski, R. S., Carbonell, J. G., and Mitchell, T. M., editors, Machine Learning, Volume II, chapter 20, pages 593-624. Los Altos: Morgan Kaufmann.
Ma, Y. and Wilkins, D. C. (1990a). An analysis of Bayesian evidential reasoning. Working Paper KBS-90-001, Department of Computer Science, University of Illinois.
Ma, Y. and Wilkins, D. C. (1990b). Computation of rule probability assignments for Dempster-Shafer theory and the sociopathicity of the theory. Working Paper KBS-90-002, Department of Computer Science, University of Illinois.
Ma, Y. and Wilkins, D. C. (1990c). Sociopathicity properties of evidential reasoning systems. Working Paper KBS-90-016, Department of Computer Science, University of Illinois.
Markovitch, S. and Scott, P. D. (1989). Utilization filtering: a method for reducing the inherent harmfulness of deductively learned knowledge. In Proceedings of the 1989 IJCAI, pages 738-743, Detroit, MI.
Michalski, R. S., Carbonell, J. G., and Mitchell, T. M., editors (1983). Machine Learning: An Artificial Intelligence Approach. Palo Alto: Tioga Press.
Michalski, R. S., Mozetic, I., and Hong, J. (1986a). The AQ15 inductive learning system: An overview and experiments.
Technical Report ISG 86-20, UIUCDCS-R-86-1260, Department of Computer Science, University of Illinois.
Michalski, R. S., Mozetic, I., Hong, J., and Lavrac, N. (1986b). The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. In Proceedings of the 1986 National Conference on Artificial Intelligence, pages 1041-1045, Philadelphia, PA.
Minton, S. and Carbonell, J. G. (1987). Strategies for learning search control rules: An explanation-based approach. In McDermott, J., editor, Proceedings of the 1987 IJCAI, pages 228-235, Milan.
Prieditis, A. E. and Mostow, J. (1987). PROLEARN: towards a Prolog interpreter that learns. In Proceedings of the 1987 National Conference on Artificial Intelligence, pages 494-498.
Shafer, G. A. (1976). Mathematical Theory of Evidence. Princeton: Princeton University Press.
Wilkins, D. C. (1987). Apprenticeship Learning Techniques For Knowledge Based Systems. PhD thesis, University of Michigan. Also, Knowledge Systems Lab Report KSL-88-14, Dept. of Computer Science, Stanford University, 1988, 153pp.
Wilkins, D. C. and Buchanan, B. G. (1986). On debugging rule sets when reasoning under uncertainty. In Proceedings of the 1986 National Conference on Artificial Intelligence, pages 448-454, Philadelphia, PA.
Wilkins, D. C. and Tan, K. (1989).