PII: S0950-7051(99)00005-2 Applications of rule-base coverage measures to expert system evaluation V. Barrp Department of Computer Science, Hofstra University, Hempstead, NY 11550, USA Received 24 September 1998; received in revised form 5 January 1999; accepted 5 January 1999 Abstract Often a rule-based system is tested by checking its performance on a number of test cases with known solutions, modifying the system until it gives the correct results for all or a sufficiently high proportion of the test cases. This method cannot guarantee that the rule-base has been adequately or completely covered during the testing process. We introduce an approach to testing of rule-based systems, which uses coverage measures to guide and evaluate the testing process. In addition, the coverage measures can be used to assist rule-base pruning and identification of class dependencies, and serve as the foundation for a set of test data selection heuristics. We also introduce a complexity metric for rule-bases. q 1999 Elsevier Science B.V. All rights reserved. Keywords: Rule-base system; Casual-associational network; Logical path graph 1. Introduction Evaluation of a knowledge-based system is a multi- faceted problem, with numerous approaches and techniques. The results generated by the system must be evaluated, along with its features, the usability of the system, how easily it can be enhanced, and whether or not it has a posi- tive impact on the people who are using the system in place of an approach which is not computer based. The system’s performance must also be evaluated in light of its intended use [1]. If the expert system is meant to function as an intelligent assistant then it must satisfy the criterion of being a useful adjunct to the human problem solver. If the system is expected to emulate the reasoning of a human expert then a more rigorous evaluation of the system is needed. During the last 20 years there has been considerable development and use of knowledge-based systems for medi- cal decision support. In this period there has been heavy emphasis on functional analysis, addressing two primary questions: • Does the system give the results we expect on test cases? • Does the system improve the effectiveness of those who use it? The emphasis on functional analysis can lead to seemingly strong statistical statements about the correctness of a system, demonstrating that it gives the correct result, or the same result as a human expert, in a high percentage of test cases. However, functional testing does not guarantee that all parts of the system are actually tested. If a section of the rule-base is not exercised during the functional test then there is no information about that section of the system and whether it is correct or contains errors. Further, many performance problems for rule-bases result from unforeseen rule interactions [2]. A test suite of known cases may never trigger these interactions, though they should be identified and corrected before a system is put into actual use. The method we present enhances functional analysis of rule-based classification systems with a rule-base coverage assessment, overcoming limitations of common methods for rule-based expert systems evaluation. The underlying premise of this work is that an ideal testing method is one that guarantees that all possible reasoning paths through a rule-base have been exercised. As with procedural software, this is often an unreasonable and/or unattainable goal, possi- bly due to a lack of test data, to un-executable program paths, or to the size of the rule-base. Further, even if each possible path is exercised, we cannot realistically do so with each distinct set of test values that could cause its traversal. A reasonable goal is for the rule-base testing process to exercise every inference chain or provide information about the failure of the testing process to do so. Usually verification and validation (V & V) of rule-based systems involves a static structural analysis (verification) method to detect internal inconsistencies, followed by a dynamic, functional, validation in which system behavior on a set of test cases is compared with expected results. The Knowledge-Based Systems 12 (1999) 27–35 0950-7051/99/$ - see front matter q 1999 Elsevier Science B.V. All rights reserved. PII: S 0 9 5 0 - 7 0 5 1 ( 9 9 ) 0 0 0 0 5 - 2 pE-mail address: vbarr@magic.hofstr.edu (V. Barr) weakness of a strictly functional approach to validation is that the test data available may not adequately cover the rule-base, and, at best, limited information about coverage will be obtained. System performance statistics are usually presented as if they apply to the entire rule-base, rather than just to the tested sections. This can lead to false estimates of system performance in actual use. The system performance indicated by the comparison of actual and expected results is relevant only for the tested sections, while performance in the untested sections cannot be predicted. We must also consider completeness of the test set and age of the rule-base by the test data. Completeness of the test set refers to the degree to which the data represents all types of cases, which could be presented to the system under intended conditions of use. Coverage of the rule-base refers to how extensively possible combinations of inference rela- tions are exercised during test data evaluation. In the trivial case, with a correct rule-base and a complete test suite, the test data would completely cover the rule-base, all actual results would agree with expected results, and we could predict completely correct performance of the rule-base in actual use. In the more usual situation we may have errors and incompleteness in the rule-base, as well as inadequacies in the test data. If we only judge the system based on a comparison of actual and expected results, the rule-base could perform well on the test data, but actually contain errors which are not identified due to incompleteness of the test data. This could lead to a false prediction of correct performance on all cases, when in fact we cannot make any accurate prediction about performance of the rule-base in those areas for which there is an absence of test data. Our testing approach, as outlined in Fig. 1, allows clear identification of incompleteness in the test data and poten- tial errors in the rule-base through identification of sections of the rule-base that have not been exercised during func- tional test. This can indicate weaknesses in the test set and/ or sections of the rule-base that may not be necessary. An incomplete test set can be supplemented with additional cases chosen from the available population, guided by a series of heuristics and the coverage analysis information. Alternatively, if there is no test data which covers certain parts of the system, it is possible that those sections should be pruned from the rule-base or modified. Our approach carries out structural analysis of the rule- base using five rule-base coverage measures (RBCMs) which identify sections not exercised by the test data. This makes it possible to improve completeness of the test suite, thereby increasing the kinds of cases on which the rule-base will be tested and improving coverage of the rule-base. In addition to the coverage analysis, we employ a rule- base representation which facilitates application of the coverage measures; a set of heuristics for re-sampling the population of available test cases, based on coverage infor- mation, as shown in Fig. 2; strategies for rule-base pruning and identification of class-dependencies; a rule-base complexity metric. In another study [3] the utility of the aforementioned is illustrated extensively using rule-bases which were prototypes for the AI/RHEUM system [4] and the TRUBAC (Testing with RUle-BAse Coverage), a tool which implements the coverage analysis method. 2. Related work This work builds on both coverage-based testing methods for procedural software (see [5] for a review of methods and [6,7] for a data-flow approach to testing) and earlier work on rule-base analysis. Early approaches for rule-base analysis carried out only verification or validation. A number of systems, such as the ONCOCIN rule checker program (RCP) [8], CHECK [9,10], ESC (expert system checker) [11], and KB-Reducer [12,13] carry out only verification. Beyond their limitation to verification, these systems have additional weaknesses. RCP is limited to identification of static problems at the rule level, and cannot identify problems that result along longer reasoning chains. The CHECK system has better complexity than the RCP, but can be used only for systems developed using LES, the Lockheed expert systems development environment. ESC is very efficient if there are no conflicts or redundancies in V. Barr / Knowledge-Based Systems 12 (1999) 27–3528 Fig. 1. Rule-base evaluation with coverage analysis. Fig. 2. Evaluation with coverage analysis and data re-sampling. the rule-base, but can require exponential time if such problems exist. KB-Reducer is a verification tool, which operates on an implied network of rules. It has the advantage of checking a rule-base for inconsistency and redundancy over inference chains, and not just pairs of rules. A number of dynamic analysis tools have also been developed, such as TEIRESIAS [14], and SEEK2 [15]. TEIRESIAS aids in debugging and knowledge acquisition by allowing the alternation, deletion or addition of rules in order to fix errors in the rule-base that led to incorrect conclusions. However, this process requires that the system tester is sufficiently expert in the problem domain to identify errors in the reasoning the system used to reach a conclu- sion. SEEK2 is an automated rule-base refinement tool which tries out various rule refinements based on the system’s performance on known cases. However, the qual- ity of the refinements produced will be determined by the breadth of the test cases used. SEEK2 does not judge how well the test cases cover the range and domain of the system being evaluated. 2.1. Causal-associational network Graph-based methods for rule-base analysis involve the creation of a graph representation of the rules. An early example of such a representation (though not specifically used for V & V) is the causal-associational network (CASNET) for glaucoma diagnosis [16]. While our approach is similar to the approach used in CASNET, there are a number of differences, which stem from the ways the two methods use the graph structure. In CASNET the network serves as a direct representation of the knowl- edge, with interior nodes representing intermediate stages of disease progression. In our approach the graph is a direct representation of the knowledge base and intermediate nodes represent intermediate hypotheses in a logical sense, but may have no particular meaning relative to the problem domain unless the system designer built that into the rules. 2.2. KB-Reducer As mentioned earlier, KB-Reducer [12,13] uses the implied network of rules to do the rule-base analysis. The process of knowledge base reduction involves calculation of all possible logically independent and minimal sets of inputs under which the knowledge base will conclude each asser- tion (each class). In order for the reduction process to work, the rules of the knowledge base must form an acyclic network under the depends-on relation, defined in Ref. [12]. If the network is acyclic, then KB-Reducer proceeds to label each hypothesis H with the set of environments that lead to the assertion of H, where an environment is itself a set of findings. KB-Reducer carries out the labeling process on the rules in an order such that no rule is processed before any rules on which it depends. As each rule is processed, KB-Reducer updates the partial label for the hypotheses the rule asserts, and checks for redundancy and contradiction. 2.3. Completeness Verifier COVER (COmpleteness VERifier) [17–19] is another approach which combines both a functional and structural analysis of the rule-base. COVER carries out seven verifica- tion checks: redundancy, conflict, subsumption, unsatisfi- able conditions, dead-end rules, circularity and missing rules. The rules must either be written in or converted to a language based on first-order logic, and COVER must be given the set of final hypotheses (classes), as well as infor- mation about any semantic constraints. A primary difference between COVER and the work described here is in the granularity of the graph constructed. In COVER the nodes of the graph represent rules, and the edges represent dependencies or relations between rules, while in our representation each rule is itself represented by a small sub-graph, which allows us to carry out a more detailed analysis of which inference relations have and have not been exercised by the test data. 2.4. Pr/T nets The graph representation we are using is closer in some respects to the Pr/T net representation of a rule-base than it is to any of the other graph-based methods. In the majority of the graph-based methods a rule from the rule-base is equivalent to a node in the graph, while in our graph nodes correspond to individual findings or hypotheses, not to entire rules. Similarly, in the Pr/T net representation [20] the level of details is the findings and hypotheses, not the rules. However, only static analysis of the rule base is carried out using the Pr/T net representation. 2.5. Path Hunter and Path Tracer The VV & T approach based on the execution path model, incorporated in Path Hunter and Path Tracer [21,22], shares the fundamental premise upon which our work is based: functional validation may show that the system performs well on the test cases, but there may still be problems in portions of the rule base that were never exercised during testing. The goal of Path Hunter/Path Tracer is the selection of a set of test cases that exercise the structural components of the rule-base as exhaustively as possible. This involves firing all rules, and also firing every “causal sequence” of rules. The model used to identify all possible dynamic causal rule firing sequences is the rule execution path (equivalent to a sub-DAG in our representa- tion). Path Hunter uses structural path analysis to detect poten- tial interactions between rules in a rule-based and to identify problems within the rule-base, essentially using a path enumeration step. The complexity is controlled by precom- puting the logical completion for each subproblem and by V. Barr / Knowledge-Based Systems 12 (1999) 27–35 29 the use of equivalence classes of rules, formed by collecting redundant rules together into one class, which reduces the number of paths that Path Hunter must generate. (In our approach the step of explicit identification of redundant and ambiguous rules is unnecessary, as they will be identi- fied through the graph construction process. Therefore, while there may be redundant rules in the rule-base, there will be no duplication of those rules within our graph repre- sentation.) Path Tracer is a tool for structural rule-base testing, using the paths generated by Path Hunter, in conjunction with traces of dynamic rule firings, to determine how extensively the possible execution paths are covered by the test data. While there are a number of similarities in the premises which underlay our approach and Path Hunter/Path Tracer, there are also significant differences between the two approaches. First, the graph we construct models the rule- base at the level of findings and hypotheses, rather than at the rule level as is carried out in the Path Hunter representa- tion. While this may make the graph somewhat larger, it allows us to carry out both V & V with one representation. This is in contrast to COVER, which requires two represen- tations, the first-order logic translation of the rules and the dependency graph, in order to carry out verification alone. Second, the methods by which rule-base coverage are determined or measured are quite different. Path Tracer does its assessment of path coverage based on the number of causal dependencies observed in the trace file after the test data is run, using a number of strategies to map concrete paths observed at run time to the abstract paths generated by Path Hunter. In our approach, as the effect of concrete firings is indicated directly in the graph representation of the rule-base, we can determine rule-base coverage directly from the graph after the test data is run. We determine the extent of rule-base coverage by considering whether the state of the graph after the test data is run satisfies the four rule-base coverage measures. 2.6. Logical path graph model The logical path graph (LPG) model [23,24] is based on program control flow analysis. It attempts to apply cyclo- matic complexity and basis path testing to the rule-base environment to find a graphical representation of rule- bases which could then be used to determine rule-base complexity and determine a set of paths through the rule- base that, when executed, would adequately test the rules and their interactions. The logical path graph is a directed graph in which the nodes represent individual rules and the edges are deter- mined by logical paths through the rule base. The goal of Kiper’s work [23,24] is to use the LPG to determine a set of paths through the rule-base such that traversal of those paths during system testing represents an adequate test of the rules and their interactions. However, there are certain problems that arise in the use of logical paths. If there are multiple edges entering a node in the graph, they can be interpreted either as an AND or an OR relation. In order to avoid the possibility of OR edges the node must be replicated, in effect creating in the LPG the kind of redundancy which we usually try to remove from rule-bases. This results in the possibility of a single rule being represented by multiple nodes. If both AND and OR edges are to be allowed then the person evaluating the LPG must know what type each edge is. In addition, because of the possibility of multiple nodes representing a single rule, each node has to be labeled not just with the rule number but also with the condition set, the set of all conditions asserted by nodes on the path leading to the node. There are significant differences between the LPG and our approach. The graph structure we propose is based on findings and hypotheses and directly models the logical relations within rule antecedents, whereas the structure used for the LPG is built at the rule level. This difference in the graph construction allows us to avoid putting identi- fying information on edges or replicate rule representations, as is the case in the LPG. 3. Testing with rule-base coverage measures The first step in rule-base testing with coverage measures is to build a graph representation of the rule-base. Our method uses a directed acyclic graph (DAG) representation. We assume a generic propositional rule-base language [3] into which other rule-base languages can be translated. During construction of the DAG, pairwise redundant rules, pairwise simple contradictory rules and potential contradictions (ambiguities) are identified. After DAG construction is complete, static analysis (verification) of the rule-base reports dangling conditions (an antecedent component that is not defined as a finding and is not found as the consequent of another rule), useless conclu- sions, and cycles in the rule-base. At this point the rule- based could be modified to eliminate or correct any static problems. The static analysis phase is followed by dynamic analysis of the rule-base using test cases. As test cases are processed, one or more of several rule-base coverage measures (RBCMs) can be reviewed in order to determine the quality of the test data supplied thus far. Additional information about the rule-base and its testing can also be used by the system tester to guide the selection of future test data. The tester would start by providing sufficient test data to satisfy the simplest functional measure (conclude each class of the system) and proceed to the more difficult structural measures. Finally, if the user is not able to provide sufficient data to attain the desired degree of rule-base coverage (according to the selected criterion), the user can use the DAG representation to synthesize data, which can then be reviewed by an expert to determine if the data represents a valid case in the problem domain. V. Barr / Knowledge-Based Systems 12 (1999) 27–3530 This testing approach, described more fully later, has been implemented in the TRUBAC tool (Testing with RUle-BAse Coverage) [3,25]. 3.1. Rule-base representation In order to evaluate the rule-base coverage, the rule-base must be in a form which allows identification of sections which have and have not been covered by the test data. Our representation is based on the AND/OR graph implicit in the rule base [26]. The DAG has a source node, corresponding to working memory, and a sink node, corresponding to success in reaching one of the classes (diagnoses or goals) of the system. Interior nodes are sub-class nodes (SUBs), representing intermediate hypotheses, and operator nodes, representing the allowable operators AND, OR, and NOFM1 nodes. These operator nodes represent the fact that the conjunction and/or disjunction of multiple components of an antecedent must be true in order for the conclusion of a rule to be entered into working memory. There are edges from the source to each finding and from each class to the sink. The antecedent of each rule is represented by a subgraph, which connects findings and sub-class nodes to operators as indicated by the antecedent. Each antecedent- consequent connection represented by a rule is also repre- sented by an edge from the subgraph for the antecedent to the node for the consequent: For example, the representation of the two rules If P1 and P2 then R1 If R1 and R2 then Q is shown in Fig. 3. P1, P2 and R2 could be findings or sub- class nodes, while R1 and Q are either class or sub-class nodes. The complete graph of a rule-base is constructed by linking together the individual structures for successive rules. Using this framework we can easily represent rule- bases, including those in which the certainty factors are hard coded within the consequent definitions (rather than computed during the inference process) such as the rules in Fig. 4. 3.2. Rule-based coverage measures Using the DAG structure described earlier, we define execution paths in a rule-based system. Each reasoning chain through the rule-base corresponds to a sub-DAG, which includes all nodes and edges in the DAG correspond- ing to the rules fired during a particular chain of inferences. This include: source node; sink node; all nodes correspond- ing to the findings involved; the node for the concluded class; all nodes corresponding to antecedent components, operators, and rule consequents; and all edges involved in antecedent –consequent links formed by the rule connec- tions used in the reasoning chain. An individual rule firing involves all edges and nodes in the graph that corresponds to that rule. An execution path is, therefore, the sub-DAG that corresponds to all the rules fired along a particular reasoning chain executed due to a specific set of findings. Ideally, in testing a rule-base, we would like to provide sufficient test data to cause every possible execution path of the DAG to be traversed. This corresponds to firing all rules in every combination possible. As this is usually not reasonable in the testing environment, we propose the following hierarchy of rule-base coverage measures (RBCMs) in order to guide the selection of test data and give an objective measure of how well a test suite has covered the rule-base. Each-class :Satisfied if the test data causes traversal of one execution path to each class of the system. This is equivalent to providing, for each class, one test case that concludes that class. This is a very minimal coverage measure, which should be satisfied by all expert systems developers, regardless of their overall testing strategy. Each-hypoth: Satisfied if the test data causes traversal of execution paths such that each sub-class is reached, as well as each class. This shows that each sub-class is actu- ally reachable from the source and appears to be a rele- vant part of the system. The set of test data which satisfies this coverage measure is a superset of that which satisfies Each-class. Each-class-every-sub: There may be many execution paths that connect each sub-class to each class (and no execution paths for some sub-class to class combina- tions). This coverage measure is satisfied if, for each sub-class to class combination connected by some execu- tion path, at least one execution path, which includes the combination is executed. This coverage measure is stron- ger than Each-hypoth. Each-class-every-finding: There may be many execution paths that connect each finding to each class (and no execution paths for some finding-class combinations). This coverage measure is satisfied if, for each finding- class combination connected by some execution path, at least one execution path which includes the combination V. Barr / Knowledge-Based Systems 12 (1999) 27–35 31 Fig. 3. DAG representation of two related rules. Fig. 4. Rules with hard coded certainty factors. 1 NOFM nodes represent the construction “if N of the following M things are true, then…”, which is a feature of EXPERT [27] rule-bases. is executed. This coverage measure is stronger than Each- class but is incomparable to Each-hypoth as Each-hypoth can be satisfied without complete traversal of any execu- tion paths for some finding-class combinations. All-edges: This RBCM is satisfied if the data causes traversal of a set of execution paths such that every infer- ence relationship (every edge in the graph) is utilized along some inference path. While this will not guarantee that all rules will be used in every combination possible, it will guarantee that every rule is used in all possible ways along some execution path. In a rule-base with no NOFM nodes, the data that satisfies this RBCM will be a superset of the data necessary to satisfy Each-class-every-finding and Each-class-every-sub. In typical usage the tester will run a number of test cases and then query TRUBAC to determine whether each RBCM has been satisfied. If a coverage measure is not satisfied then TRUBAC will show what relationships remain to be covered in order to satisfy the RBCM. We assume the existence of an oracle that determines if the result given by the rule-base for a test case is correct 2 or not. In this work we do not consider the issue of specifica- tions and how we determine if an answer is correct or not. If an answer is wrong, then the execution path that led to its conclusion is suspect and must be studied for errors. If the answer is right then the path is only of interest to the extent that it helps satisfy the coverage measure(s) chosen by the tester. In a very complex rule-base, or one for which there is very little test data, the RBCMs help the tester determine ways in which the data is deficient in testing portions of the rule-base. The coverage measures provide the user with information that can lead to the acquisition or development of additional test cases, or will indicate errors in the rule- base. This approach can also address a difficulty often faced by those who test expert systems, which is a paucity of test data. An additional feature of this approach to rule-base testing can help the user gain insight into the correctness of the system even when little test data is available. In addition to evaluating the five coverage measures, the DAG framework can be used to generate test data, which would lead to traversal of any execution paths not exercised by the test data. This synthesized data can then be shown to one or more experts in order to determine if each synthe- sized test case, which is a reflection of the logic embodied within a section of rule-base (corresponding to an execution path), makes sense in the context of the problem domain which the rule-base was designed to handle. If the expert does not agree that the data represents a plausible case in the problem domain then the section of the rule-base repre- sented by that section of the graph must be reviewed for errors. 4. Applications of coverage analysis In addition to providing information about the testing process itself, the coverage analysis can be used to enhance testing and facilitate other kinds of rule-base analysis, as described below. 4.1. Heuristics for test data selection There may be a large redundant pool of available test data, from which a subset of cases must be selected for the test suite. Running all available cases can be infeasible due to the length of time it may require. A random selection of test cases may give statistical confirmation that the system works properly for the tested situations, but a poor selection of cases will lead to an incomplete test set which then leads to incomplete coverage of the rule-base. If a test set which we know is incomplete leads to complete cover- age of the rule-base then the rule-base is incomplete and is not capable of handling precisely those types of cases that are absent from the test suite. If an incomplete test set leads to incomplete coverage, we can use the coverage informa- tion as the foundation for a set of heuristics for test data selection from the available population, in order to construct a test set that will maximize rule-base coverage. To date we have restricted our work to classification systems in which each goal of the system is a class and we consider each intermediate hypothesis to be a sub- class. Assuming that we have a degree of meta-knowledge about the make-up of individual test cases (e.g. what facts or intermediate hypotheses are involved in each case), a set of heuristics for data selection is: 1. For each class, select a test case which concludes only that class. 2. For each class not yet tested, select a test case which will conclude it (and additional classes). At this point Each- class will be satisfied. 3. Select test cases which will conclude unused sub-classes. This satisfies Each-hypoth. 4. Select test cases which will cover sub-class to class rela- tions, and direct finding to class relations for findings which do not lead to an intermediate sub-class. This satisfies Each-class-every-sub. 5. Select test cases which will cover finding to class rela- tions for findings which represent alternative ways to conclude sub-classes. That is, while a sub-class to class relation may have been covered, there may be multiple ways to conclude the sub-class. This satisfies Each-class- every-finding. In [3,29] we show that the use of coverage information, V. Barr / Knowledge-Based Systems 12 (1999) 27–3532 2 For a given test case the experts may not agree on one answer, in which case the system will be considered correct if its answer is in the set of answers agreed on by the experts as possible for that case. There are a number of issues that are raised by the desire to have a “gold standard” [28] against which to measure the expert system’s answers. For example, if the system’s answer agrees with that of the expert, should we consider the system to be correct, even if we subsequently learn that both were wrong. along with meta-knowledge about the pool of available cases, can successfully be used to select the test cases in a way that contributes to a useful test of the rule-base. This results in greater assurance that the system has been tested and works correctly for the situations that we expect to encounter most often, as well as clearly identifying those parts of the system that still require more examination, test- ing, or refinement. 4.2. Class dependence Another aspect of rule-base analysis or evaluation is the identification of class dependencies, in which the rules that lead to the conclusion of one class are also highly involved in the conclusion of another class. If two classes, C1 and C2, are dependent, and we have a large number of test cases that will be classified as C1, it may be that some of those cases will also be classified as C2, although with differing certainty factors. Because of this, testing for one class may help achieve coverage for the other. Further, if C1 and C2 are dependent classes and we change rules for C1 then we should rerun test cases, which conclude C1, and rerun test cases, which conclude C2. This will verify that changing rules for C1 did not inadvertently affect the system’s ability to properly classify C2 as well. To determine if two classes have rules in common we look at sharing or overlap among the sets of sub-classes that can lead to a class. A high degree of overlap among sub-classes implies overlap among the rules and a degree of dependency between those classes. This information can be obtained immediately from data collected while the DAG representation is built. For each class a list of all the sub-classes which can help to conclude the class is formed, and then these lists are compared for pairs of classes. Table 1 shows the overlap figures for a small prototype of the AI/RHEUM system for rheumatology diagnosis [4]. These figures indicate that 57% of the sub- classes, which lead to RA (rheumatoid arthritis) lead to MCTD (mixed connective tissue disease), while 36% of the sub-classes, which can lead to MCTD also lead to RA. There is asymmetry in these figure’s results because the absolute number of sub-classes which can lead to the classes is different. These figures indicate that modifications made to the rules for RA would possibly also affect the perfor- mance of the system on cases that should be classified as MCTD, while there would be a lesser affect on the system’s classification of cases as RA if the rules for MCTD were modified. These results are consistent with those found in [30], which uses Monte-Carlo simulation-based techniques to carry out rule-base evaluation. However, in our approach we obtain this information as a by-product of DAG construction, without the overhead of rule-base execution. 4.3. Rule-base pruning Once a rule-base has been constructed, it is possible that not all the rules are necessary for the rule-base to perform correctly. For example, during incremental development some early rules may be supplanted by rules added later. If the system can be pruned by removing rules or compo- nents within rules, and the performance on test cases is not affected, then it is possible that there were unnecessary rules or that the test cases are not adequate to evaluate the entire rule-base [30]. Further, we expect that a smaller rule-base will run more efficiently. Coverage information can focus the pruning steps on sections of the rule-base which have not been executed by the test data. If a section of the rule-base is never executed during test suite execution then either there are unnecessary rules, which can be pruned, or the test suite is not sufficiently rich. In the latter case, additional test cases are necessary to cover the un-executed section of the rule- base. In general the developer and/or the expert will decide whether the proper approach is to prune or to add test cases. If the test suite is truly representative of cases that will be found in the application environment, and the rule-base performs correctly on the test set, then it is likely that the uncovered portion of the rule-base is in fact unnecessary and can be pruned. If the test set is complete and the rule-base performs incorrectly then the uncovered sections of the rule- base are candidates for rule refinement. The coverage measures provide information about why the rules were never used, based on findings and sub-classes which appear in rule antecedents but are not present in any test cases. Removal of these antecedent components from the rules may generalize them sufficiently that they will correctly handle some of the test cases and will be covered by an existing portion of the test suite. In experiments run on the RHEUM rule-base, the first phase of pruning was based on TRUBAC’s static analysis results. We were able to prune 5 rules out of 76, as well as eliminate 85 findings and 3 classes. This significantly reduced the overall size of the graph over which coverage was evaluated from 333 nodes to 241 nodes [3]. Two addi- tional pruning iterations, based on the coverage data and the assumption that the test set was complete, eliminated 20 rules, 6 findings, and 15 components of rule antecedents. This resulted in an overall 34% reduction in the number of rules and a 40% reduction in the graph size. V. Barr / Knowledge-Based Systems 12 (1999) 27–35 33 Table 1 Overlap of sub-classes for class pairs PM PSS SLE MCTD RA PM 0 0 22.22 22.22 PSS 0 16.67 16.67 16.67 SLE 0 9.09 0 0 MCTD 18.18 9.09 0 36.36 RA 28.57 14.29 0 57.14 4.4. A metric for rule-based systems The coverage measures can be very useful for the testing process. However, it would also be useful to have a way to predict or measure the complexity of the system and of the testing process before testing is begun. We would also like to be able to compare the complexity of different rule-bases, particularly if there are multiple rule-bases that handle problems within a common domain. This raises the issue of whether there is a reasonable analog to the control flow graph and, more importantly, a complexity metric for rule- bases which would reflect the relationship between rule- base structure and system complexity. The graph representation proposed here serves as a suita- ble foundation for such a complexity metric for rule-based systems. The graph imposes no particular execution order on the rules, and it represents all logical relations that are inherent within the rule-base. However, graph-based metrics such as McCabe’s cyclomatic complexity metric cannot adequately determine the number of execution paths in a rule-base. The actual number of execution paths is based on the logical relationships in the rule-base, using the following mechanism: • For each finding, we assume there is only one path to it. • For each OR or SUB node, consider the parent nodes. The number of paths to the OR or SUB node is the sum of the paths to the parent nodes. • For each AND node, compute the product of the number of paths that lead to each parent of the AND. • Class nodes are treated like OR or SUB nodes. The total number of paths through the rule-base is computed by adding up the number of paths to each class node. In Fig. 5 there are a total of five paths to G based on three paths to the SUB node and two paths to the OR node. This metric can be computed fairly easily by visiting the nodes in topological order and saving in each node the number of paths to that node. This execution path metric can serve a number of purposes in rule-base development and analysis. The total number of execution paths represents the maximum number of test cases needed for complete coverage of the rule-base according to the strongest rule-base coverage measure (All- edges). However, usually the actual number of data sets needed will be less than the number of execution paths, as often, particularly in diagnosis systems, one test set may cover a number of execution paths to different diagnoses. 5. Conclusions and future work This work shows that there are numerous uses of rule- base coverage data in the testing process. Rule-base perfor- mance evaluation can be misleading unless care is taken to identify problems with both the test data and the rule-base. Both the test data and the rule-base can be improved by using information about the extent to which the test data has covered the rule-base under test. This work can be extended in a number of directions. Quantitative performance prediction can be computed based on performance of the system on test cases, a measure of how well the test data covers the rule-base, and a measure of the degree to which the test set is representative of the population for which the system is intended. A second area of extension is for systems which have dynamic computa- tion of certainty factors, which requires modification of the rule-base coverage measures [3] as well as changes to the implementation and the data selection heuristics. It would also be useful to extend the approach to systems that are not acyclic and prepositional. This would greatly increase the practicality of this method for testing rule-bases in a variety of application areas. Another area which should be studied in the future is that of the relationship between rule-base complexity and the difficulty of carrying out the testing process. While intui- tively it may seem that a more complex rule-base should undergo more complex and stringent testing, it may in fact be the impact of an incorrect result that should determine the quality of the testing. For example, in a medical diagnosis system if an incorrect result could lead to not providing treatment to an ill patient then all steps possible should be taken to ensure that the system works correctly, no matter how complex or simple the rule-base is. Finally, it may be possible to extend this approach to analyze the Bayesian belief networks. The probabilistic relationships between nodes of the network, with informa- tion flow which is bi-directional along the arcs, and nodes which may be dependent in some contexts and independent in others, significantly complicates this task. References [1] P. Jackson, Introduction to Expert Systems, 2, Addison-Wesley, Reading, MA, 1990. [2] R. O’Keefe, D.E. O’Leary, Expert system verification and validation: a survey and tutorial, Artificial Intelligence Review 7 (1993) 3–42. [3] V. Barr, Applications of rule-base coverage measures to expert system evaluation, PhD thesis, Rutgers University, 1996. [4] L.C. Kingsland, The evaluation of medical expert systems: experi- ences with the AI/RHEUM knowledge-based consultant system in rheumatology, Proceedings of the Ninth Annual Symposium on Computer Applications in Medical Care, Washington DC, 1985 pp. 292–295. [5] W.R. Adrion, M.A. Branstad, J.C. Cherniavsky, Validation, verifica- tion, and testing of computer software, ACM Computing Surveys 14 (2) (1982) 159–192. V. Barr / Knowledge-Based Systems 12 (1999) 27–3534 Fig. 5. Graph of rule-base with OR and SUB. [6] P. Frankl, E. Weyuker, A data flow testing tool, Proceedings of IEEE Softfair II, San Francisco, December 1985. [7] S. Rapps, E. Weyuker, Selecting software test data using data flow information, IEEE Transactions on Software Engineering 11 (4) (1985) 367–375. [8] M. Suwa, S.C. Scott, E.H. Shortliffe, An approach to verifying completeness and consistency in rule-based expert system, AI Maga- zine 3 (4) (1982) 16–21. [9] T.A. Nguyen, W.A. Perkins, T.J. Laffey, D. Pecora, Checking an expert systems knowledge base for consistency and completeness, Proceedings of the Ninth IJCAI, Menlo Park, CA, 1985, pp. 374–378. [10] T.A. Nguyen, W.A. Perkins, T.J. Laffey, D. Pecora, Knowledge base verification, AI Magazine 8 (2) (1987) 69–75. [11] B.J. Cragun, H.J. Steudel, A decision-table-based processor for checking completeness and consistency in rule-based expert systems, International Journal of Man-Machine Studies 26 (1987) 633–648. [12] A Ginsberg, A new approach to checking knowledge bases for incon- sistency and redundancy, Proceedings of the Third Annual Expert Systems in Government Conference, Washington DC, 1987, pp. 102–111. [13] A. Ginsburg, Automatic Refinement of Expert System Knowledge Bases, Pitman, London, 1988. [14] R. Davis, Interactive transfer of expertise, in: B.G. Buchanan, E.H. Shortliffe (Eds.), Rule-Based Expert Systems, Addison-Wesley, Reading, MA, 1984, pp. 171. [15] A. Ginsberg, S. Weiss, P. Politakis, SEEK2: a generalized approach to automatic knowledge base refinement, Proceedings of IJCAI-85, 1985. [16] S.M. Weiss, C.A. Kulikowski, S. Amarel, A. Safir, A model-based method for computer-aided medical decision-making, Artificial Intel- ligence 11 (1978) 145–172. [17] A.D. Preece, Verification of rule-based expert systems in wide domains, Research and Development in Expert Systems VI, Proceed- ings of Expert Systems ’89, British Computer Society Specialist Group on Expert Systems, London, 1989, pp. 66–77. [18] A.D. Preece, R. Shinghal, Foundation and application of knowledge base verification, International Journal of Intelligent Systems 9 (1994) 683–701. [19] P.D. Grogono, A.D. Preece, R. Shinghal, C.Y. Suen, A review of expert systems evaluation techniques, Workshop on Validation and Verification of Knowledge-Based Systems, 11th National Conference on Artificial Intelligence, Washington DC, 1993 pp. 120–125. [20] D. Zhang, D. Nguyen, A technique for knowledge base verification, IEEE International Workshop on Tools for Artificial Intelligence, 1989, pp. 399–406. [21] C. Grossner, A.D. Preece, P.G. Chander, T, Radhakrishnan, C.Y. Suen, Exploring the structure of rule based systems, Proceedings of the 11th National Conference on Artificial Intelligence, Washington DC, 1993, pp. 704–709. [22] A.D. Preece, C. Grossner, P.G. Chander, T. Radhakrishnan, Structural validation of expert systems using a formal model, Workshop on Validation and Verification of Knowledge-Based Systems, 11th National Conference on Artificial Intelligence, Washington DC, 1993, pp. 19–26,. [23] U. Gupta, J. Kiper, B. Ly, A. Preece, Developing criteria for compar- ing specific V and V tools (from First Winter Workshop on Verifica- tion and Validation of Knowledge-Based Systems), AAAI-92 Workshop on Verification and Validation of Knowledge-Based Systems, San Jose, CA, 1992. [24] J. Kiper, Structural testing of rule-based expert systems, ACM Trans- action on Software Engineering and Methodology 1 (2) (1992) 168– 187. [25] V. Barr, TRUBAC: a tool for testing expert systems with rule-base coverage measures, Proceedings of the 13th Annual Pacific Northwest Software Quality Conference, Portland, OR, 1995. [26] P. Meseguer, Structural and performance metrics for rule-based expert systems, Proceedings of the European Workshop on the Veri- fication and Validation of Knowledge Based Systems, pp. 165–178, Cambridge, England, 1991. [27] S.M. Weiss, K.B. Kern, C.A. Kulikowski, M. Uschold, A guide to the EXPERT consultation system. Technical Report CBM-TR-94, Department of Computer Science, Laboratory for Computer Science Research, Rutgers University, 1987. [28] B.G. Buchanan, E.H. Shortliffe, Rule-Based Expert Systems The problem of evaluation, Addison-Wesley, Reading, MA, 1985. [29] V. Barr, Rule-base coverage analysis applied to test case selection, Annals of Software Engineering, 1997. [30] N. Indurkhya, Monte-carlo simulation-based evaluation and refine- ment of rule-based systems, Technical Report DCS-TR-277, Depart- ment of Computer Science, Rutgers University, 1991. V. Barr / Knowledge-Based Systems 12 (1999) 27–35 35