UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS

Report No. 474

HEURISTIC ALGORITHMS FOR CONSTRUCTING NEAR-OPTIMAL DECISION TREES*

by

Joan Manning Alster

August 17, 1971

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

*This work was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, August 1971, and was supported in part by the National Science Foundation and the Department of Computer Science.

ACKNOWLEDGMENT

I would like to express my appreciation to Professor Jurg Nievergelt for the guidance he gave me with this thesis. Also, I would like to thank Professor J. N. Snyder, the University of Illinois Department of Computer Science, and the National Science Foundation for providing the computer time necessary for my research.

TABLE OF CONTENTS

ACKNOWLEDGMENT ....................................... iii
1. INTRODUCTION ...................................... 1
2. HEURISTIC ALGORITHMS .............................. 10
   2.1. Algorithm 1 (Constant Cases) ................. 10
   2.2. Algorithm 2 (Weight Function) ................ 12
   2.3. Algorithm 3 (Weight Function) ................ 13
3. AN IMPROVED HEURISTIC ALGORITHM ................... 17
   3.1. Algorithm 4 (Dash-Count) ..................... 24
LIST OF REFERENCES ................................... 29

1. INTRODUCTION

The logical structures of certain types of problems may be represented by decision trees. A decision tree is a binary tree whose internal nodes represent points in time at which decisions must be made to take either the left or the right branch of those nodes. The root of a decision tree represents the status of a problem before any decisions have been made, and the leaves of the tree represent all possible outcomes of the problem which could result from the combinations of decisions made at the internal nodes.

There has been a great deal of study done on related tree problems. Two major areas of study have been optimal search trees and Huffman's tree constructions for minimum-redundancy codes. Decision trees are, in fact, a generalization of these other types of trees, and many optimality problems which arise in the area of decision trees cannot be solved using the algorithms derived for handling these other, related tree problems.

The problem of decision trees arises in the study of decision tables and in the conversion of limited-entry decision tables to decision trees for the purpose of computer programming. Algorithms have been found for converting a decision table to a computer program which uses a minimum amount of storage. The storage requirement for the program is minimized by using the given decision table to construct a corresponding decision tree which has a minimum number of nodes and constructing the program from this optimal decision tree. There are also algorithms for converting a decision table to a computer program which executes in a minimum amount of time.
The execution time is minimized by assigning a probability, or frequency of occurrence, to each possible outcome (column) in the table and constructing the corresponding decision tree so that the most likely outcomes are resolved at relatively low-level nodes of the tree and less likely outcomes are resolved at the higher-level nodes. The decision tree for minimizing execution time will probably contain more than the minimum number of nodes, because frequently occurring outcomes will be resolved in as few decision steps (nodes) as possible even if this necessitates additional decision steps for resolving outcomes which seldom occur. [For discussion of the two algorithms mentioned here, see Pollack, "Conversion of Limited Entry Decision Tables to Computer Programs," Communications of the ACM, Vol. 8, No. 11, November 1965, pp. 677-682.]

The decision table application of decision trees is not entirely general. Below is an example of a decision table:

            C_1   C_2   ...   C_n
     p_1     N     Y    ...    Y
     p_2     Y     -    ...    N
     ...    ...   ...   ...   ...
     p_m     -     N    ...    Y

                 Figure 1

A particular case, C_k, is determined by whether each of certain predicates (conditions) p_1, p_2, ..., p_m holds (the entry in the table is Y), does not hold (the entry is N), or does not apply (the entry is -). Each -, or "don't-care," entry under a particular case C_k in the table is a substitute for listing two configurations (assignments of Y or N to each p_i) of the p_1, p_2, ..., p_m in that case: one with a Y where the dash occurs, the other with an N where the dash occurs. Thus, cases containing one or more don't-care entries actually include several configurations of p_1, ..., p_m. However, the set of combinations of configurations of p_1, ..., p_m which can be represented by a single column containing don't-care entries is only a subset of all possible combinations of configurations which could be included in a case. There are certain decision tree applications for which it is necessary to be able to assign any combination of configurations of the p_1, ..., p_m to each case (each configuration will belong to only one C_k). Construction of optimal decision trees for such applications cannot be accomplished by using the previously mentioned algorithms devised for converting decision tables to decision trees.

One such general application of decision trees arises in the problem of trying to optimize the efficiency of branching in a computer program. Consider the following simple example, which is illustrated in Figure 2: We are working with the x and y coordinates of points in the Euclidean plane. Assume x and y are non-zero. We wish to branch to different parts of the program depending upon whether we have case 1 (the point is in the first quadrant), case 2 (the point is in the second quadrant), or case 3 (the point is in the third or fourth quadrant).

[Figure 2. The x-y plane: Case 1 is the first quadrant, Case 2 the second quadrant, and Case 3 the lower half-plane (third and fourth quadrants).]

The two possible ways to program this branching process are shown in the two decision trees below.

[Figure 3a. Decision tree that tests x > 0 first and then tests y > 0 on each branch (three test nodes). Figure 3b. Decision tree that tests y > 0 first, resolving case 3 immediately, and tests x > 0 only when y > 0 (two test nodes).]

Since the number of tests required to resolve any case in Figure 3b is always less than or equal to the number of tests required to resolve a case in Figure 3a, Figure 3b represents the preferable programming logic. For this example, there were only two possible decision trees, so it was convenient to examine both trees and choose the better of the two. However, the relatively complex logic of most problems programmed for the computer makes such a trial-and-error analysis highly impractical.
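In program form, the two trees correspond to two orderings of the same tests. The sketch below is a minimal Python illustration (the function and case names are ours, not the report's); following the discussion above, Figure 3a is taken to test x > 0 first and Figure 3b to test y > 0 first.

```python
def classify_figure_3a(x, y):
    """Test ordering of Figure 3a: x > 0 is tested first, then y > 0."""
    if x > 0:
        if y > 0:
            return "case 1"   # first quadrant
        return "case 3"       # fourth quadrant (y < 0)
    if y > 0:
        return "case 2"       # second quadrant
    return "case 3"           # third quadrant (y < 0)

def classify_figure_3b(x, y):
    """Test ordering of Figure 3b: y > 0 is tested first, then x > 0."""
    if y > 0:
        if x > 0:
            return "case 1"   # first quadrant
        return "case 2"       # second quadrant
    return "case 3"           # third or fourth quadrant; resolved by one test
```

Under the first ordering every point requires two tests, while the second resolves case 3 after a single test; for problems with more predicates, enumerating and comparing all such orderings by hand quickly becomes infeasible.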
Therefore, we would like to devise a systematic method for analyzing a programming problem and creating a decision tree which will optimize its branching processes. The remainder of this paper will be concerned with this problem.

A programming branching problem can be represented by a table from which a decision tree can be constructed. For example, the problem represented by Figure 2 can be represented by the following table:

                   C_1    C_2      C_3
     p_1 (x > 0)    1      0      0    1
     p_2 (y > 0)    1      1      0    0

              0 = false, 1 = true

                 Figure 4

There are two predicates (conditions), p_1 and p_2, each of which may be either true or false. Therefore, there are 2^2 = 4 possible combinations for these two predicates. If p_1 and p_2 are both true, we have case C_1; if p_1 is false and p_2 is true, we have case C_2; and if p_2 is false (and p_1 either true or false), we have case C_3.

If we make use of the "don't-care" symbol in constructing our table, then the table in Figure 4 could be represented as the decision table in Figure 5. (More complicated problems generally will not be representable in decision table form.) However, for the present, we shall choose not to use the "don't-care" representation; therefore, if we have n predicates, our table will always contain 2^n columns. This means that if certain combinations of ones and zeroes do not concern us in a particular problem, they still must be entered in the table. However, all these combinations may be grouped into a single case C_k, the "else" case.

            C_1   C_2   C_3
     p_1     1     0     -
     p_2     1     1     0

                 Figure 5

Our present study is based on two assumptions: (1) the cost of making a decision remains constant over all nodes of the decision tree, i.e., the truth value of a particular p_i may be determined with equal ease for all p_i, and (2) each case occurs with equal probability. When these assumptions hold, the best branching logic for a computer program will be that for which the corresponding decision tree has a minimal number of nodes. We shall refer to the tree with the minimal number of nodes as the optimal tree. There may be more than one optimal tree for a given problem.

The search for a systematic method for finding the optimal tree has not yet yielded an algorithm which is guaranteed to produce the optimal tree on the first try. However, I have found several heuristic algorithms which, when applied, greatly reduce the effort that is required to produce the optimal tree by trial-and-error methods. These heuristic algorithms will be discussed in Chapters 2 and 3.

Let us first consider the obvious algorithm of finding an optimal decision tree by exhaustive search. Consider the problem illustrated by the following table:

[Figure 6. A table with n = 4 predicates p_1, ..., p_4 and 2^4 = 16 columns of ones and zeroes, partitioned into three cases C_1, C_2, and C_3.]

Here n = 4, so we have 2^4 = 16 columns in the table. From this table, we shall construct a tree which will have the following structure:

[Figure 7. A binary tree skeleton whose internal nodes are to be filled with some p_i and whose leaves are resolved cases.]

The maximum number of nodes the tree can contain is 15 (= 2^n - 1, where n is the number of predicates), but it may contain fewer than 15 nodes if parts of some cases (some columns from the table) can be resolved without testing all of the p_i. If we conduct an exhaustive trial-and-error search for the optimal tree, how many trees must we check? Realizing that each root-to-leaf path can contain a particular p_i at most once, we see that we have 4 choices for the level-zero node, 3 choices for each of the 2 level-one nodes, 2 choices for each of the 4 level-two nodes, and 1 choice for each of the 8 level-three nodes.
Thus, the maximum number of trees that would have to be inspected is 4·3^2·2^4·1^8 = 576 trees. Of course, if some of the trees have fewer than 15 nodes, there will be fewer trees to investigate, but the number will remain quite large--too large, in fact, to make trial-and-error investigation feasible even when n is as small as 4. When the exhaustive trial-and-error search was programmed and run on the computer, it was found that for this example, 484 trees had to be inspected to discover that there exist two optimal trees with six nodes each.

In general, when there are n predicates, the upper bound for the number of trees which must be inspected to ensure obtaining the optimal tree is given by:

   Maximum number of trees to inspect = n^(2^0) · (n-1)^(2^1) · (n-2)^(2^2) · ... · (n-k)^(2^k) · ... · 1^(2^(n-1))

Obviously, an exhaustive trial-and-error search for the optimal decision tree requires too much work to be feasible.

2. HEURISTIC ALGORITHMS

2.1. Algorithm 1 (Constant Cases)

Step 1: To determine which p_i to select when constructing the decision tree, look at the table and choose the p_i for which the most cases are constant (either all zeroes or all ones), if such a p_i exists. For example, in Figure 4, for p_1, cases C_1 and C_2 are constant, but for p_2, all three cases are constant. Therefore, p_2 should be chosen first. Indeed, as shown by Figures 3a and 3b, p_2 is the preferable choice.

Step 2: If no such p_i exists, proceed as though by the exhaustive search method and apply this "constant case" algorithm to the resulting subtables whenever possible. Whenever this algorithm yields a "best" choice for a particular node, no other p_i's need to be tried for that node (unless, of course, in proceeding through the exhaustive search we change a p_i which lies on the path between the root and our particular node). Therefore, we eliminate looking at some of the trees we might otherwise have considered when searching exhaustively for the optimal tree. Note that when two or more p_i's have the same number of constant cases, each of these p_i's should be tried; they might not all prove to be equally good choices.

To illustrate Step 2, refer to Figure 6. There are no constant cases for any of the p_i's. Therefore, proceed as by the exhaustive search method. Suppose p_1 is chosen to be the root of the tree. Construct two subtables:

[Figure 8. The two subtables obtained by splitting the table of Figure 6 on p_1: one containing the columns with p_1 = 0, the other the columns with p_1 = 1, each listing the entries of p_2, p_3, and p_4 under cases C_1, C_2, and C_3.]

To obtain the left descendant of the root (p_1 = 0), note that both p_2 and p_3 have two constant cases, so both of these predicates must be tried, but p_4 need not be tried. To obtain the right descendant of the root (p_1 = 1), note that p_2 has three constant cases, more than either p_3 or p_4, so choose p_2; p_3 and p_4 never need to be considered. The tree now looks like this:

[Figure 9. A partial tree with p_1 at the root, one of the tied predicates p_2 and p_3 at its left descendant (with a note that the other must also be tried there), and p_2 at its right descendant.]

Now construct two more subtables for each bottom node and continue constructing the tree until all the leaves of the tree are resolved cases. After constructing all possible trees with p_1 at the root (eliminating the consideration of some, of course, by using the algorithm), go back and do the same for roots of p_2, p_3, and p_4. (Since the algorithm did not, in this example, yield any information about which p_i would be a best choice for the root of the decision tree, all roots must be tried.) The optimal decision tree is the best tree found by the above-described search.
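The constant-case count that drives Step 1 is simple to mechanize. The sketch below is a minimal illustration in Python, not the program used in the report; the table is assumed to be a mapping from case names to lists of columns, with each column a tuple of 0/1 entries indexed by predicate.

```python
def constant_case_count(table, i):
    """Number of cases whose entries for predicate i are all 0 or all 1."""
    count = 0
    for columns in table.values():
        entries = {col[i] for col in columns}
        if len(entries) == 1:          # constant within this case
            count += 1
    return count

def best_predicates(table, n):
    """Predicate indices with the most constant cases; ties are all kept,
    since Step 1 requires trying every tied predicate."""
    counts = [constant_case_count(table, i) for i in range(n)]
    best = max(counts)
    return [i for i, c in enumerate(counts) if c == best], best

# Figure 4: p_1 is (x > 0), p_2 is (y > 0); case C_3 holds whenever p_2 is false.
figure_4 = {
    "C1": [(1, 1)],
    "C2": [(0, 1)],
    "C3": [(0, 0), (1, 0)],
}
print(best_predicates(figure_4, 2))    # ([1], 3): p_2 is constant in all three cases
```

Applied to Figure 4 this selects p_2, and applied to Figure 6 it reports zero constant cases for every predicate, which is what forces the fall-back to Step 2.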
When Algorithm 1 was programmed and run on the computer, it was found that for the above example, 36 trees were inspected to find the two optimal trees with six nodes each. This is in contrast to the 484 trees examined in the exhaustive search.

2.2. Algorithm 2 (Weight Function)

Step 1: For each predicate, p_i, calculate a weight given by

   WT_i = Σ (over all cases) Σ (k = 1 to n-1) k·P_k

where (a) P_k is the number of groups of 2^k zeroes or ones there are within a case, (b) no zeroes or ones are counted more than once, and (c) longer strings are counted over shorter strings (i.e., four zeroes would be counted as one group of 2^2 zeroes, not two groups of 2^1 zeroes). For example, if, for a given p_i, one case contains six zeroes and three ones, the weight for that case would be given by:

   wt = 2·1 + 1·2 = 4

(the six zeroes form one group of 2^2 zeroes plus one group of 2^1 zeroes, and the three ones form one group of 2^1 ones with one 1 left over, giving one group of 2^2 and two groups of 2^1). To find WT_i for p_i, sum the weights over all cases.

Step 2: Choose the predicate for which the weight is a maximum. If more than one predicate has the maximum weight, all those with the maximum weight must be tried to ensure that the next node of the decision tree will be filled with the "best" predicate.

We shall evaluate the weights for all the predicates in the example in Figure 6.

   p_1:  WT_1 = (2·1 + 1·1) + (1·1) + (2·1 + 1·1) = 7
   p_2:  WT_2 = (2·1 + 1·1) + (1·1) + (2·1 + 1·1) = 7
   p_3:  WT_3 = (2·1 + 1·1) + (1·1) + (2·1 + 1·1) = 7
   p_4:  WT_4 = (2·1 + 1·1) + (1·1) + (1·2) = 6

The algorithm happens not to be decisive for choosing the root of the decision tree, though it does show that p_4 would be the worst choice for the root (so p_4 need not be considered as a possible root). When Algorithm 2 was run on the computer, 22 trees were inspected to find the two optimal trees with six nodes each. This is somewhat better than the 36 trees inspected when using Algorithm 1.

2.3. Algorithm 3 (Weight Function)

Step 1: For each predicate, p_i, calculate a weight given by

   WT_i = Σ (over all cases) Σ (k = 1 to n-1) k^2·P_k

where the notation is the same as that described for Algorithm 2.

Step 2: Same as Step 2 for Algorithm 2.

Again, we shall evaluate the weights for the predicates in Figure 6.

   p_1:  (2^2·1 + 1^2·1) + (1^2·1) + (2^2·1 + 1^2·1) = 11
   p_2:  (2^2·1 + 1^2·1) + (1^2·1) + (2^2·1 + 1^2·1) = 11
   p_3:  (2^2·1 + 1^2·1) + (1^2·1) + (2^2·1 + 1^2·1) = 11
   p_4:  (2^2·1 + 1^2·1) + (1^2·1) + (1^2·2) = 8

Algorithm 3 has the effect of weighing the larger groups of ones and zeroes more heavily than does Algorithm 2. As with the earlier algorithms, Algorithm 3 is indecisive for selecting a root in the example shown here. However, when this algorithm was run on a computer, it was found that only eight trees needed to be investigated to find the two optimal trees with six nodes each. This is a considerable improvement over the 22 trees which had to be investigated when Algorithm 2 was used.

The three algorithms plus the exhaustive search were programmed for the computer and run for eight different problem situations, each with n = 4 (hand testing is fairly easy for n ≤ 3). The eight trials were not random problem situations, but rather were carefully selected to represent as wide a range of different types of situations as possible. The number of cases varied from two to five. The results are summarized below:

[Figure 10. Summary, for the eight trial problems, of the number of trees inspected by the exhaustive search and by Algorithms 1, 2, and 3.]
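A minimal sketch of the weight computation shared by Algorithms 2 and 3 is given below, using the same illustrative table representation as the earlier sketch. It interprets the grouping rule as a greedy binary decomposition of the number of zeroes and the number of ones within each case, which is how the six-zeroes/three-ones example above decomposes; the function names are assumptions, not code from the report.

```python
def group_sizes(count):
    """Greedy decomposition of `count` into powers of two (its binary expansion),
    returning the exponents k >= 1; single leftover entries (k = 0) are ignored."""
    return [k for k in range(count.bit_length()) if k >= 1 and (count >> k) & 1]

def weight(table, i, exponent=1):
    """WT_i of Algorithm 2 (exponent=1) or Algorithm 3 (exponent=2).
    `table` maps each case to a list of 0/1 columns, as in the earlier sketch."""
    total = 0
    for columns in table.values():
        zeros = sum(1 for col in columns if col[i] == 0)
        ones = len(columns) - zeros
        for k in group_sizes(zeros) + group_sizes(ones):
            total += k ** exponent       # each group of 2^k contributes k (or k^2)
    return total
```

With exponent = 1 the six-zeroes/three-ones case contributes 2·1 + 1·2 = 4, and with exponent = 2 it would contribute 2^2·1 + 1^2·2 = 6, which reflects the way Algorithm 3 weighs larger groups more heavily.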
3. AN IMPROVED HEURISTIC ALGORITHM

After extensive experimentation with Algorithms 1, 2, and 3, I became convinced that the single most important consideration when selecting a "best" predicate is the number of occurrences of zeroes or ones in groups of powers of two within cases. This suggests considering decision tables which contain "don't-care" (-) entries, since the dashes are helpful in indicating where powers of two occur. For example, the case C_k in Figure 11a could be represented as in Figure 11b, with the dash (-) signifying that p_1 is an undesirable choice as far as case C_k is concerned.

          C_k                   C_k
   p_1    0  1             p_1   -
   p_2    1  1             p_2   1
   p_3    1  1             p_3   1
   p_4    0  0             p_4   0

   Figure 11a               Figure 11b

The Quine-McCluskey minimization procedure is used to construct the table with "don't-care" entries. For n ≤ 4, a Karnaugh map is also useful. Let us use the Quine-McCluskey procedure to convert the table in Figure 6 to a table containing "don't-care" entries. Each case must be handled separately, so we begin with C_1. First, we find the prime implicants as shown below:

   Decimal representation    Complete sequences of     Derived "don't-care"
   of binary sequence        p_i's in case C_1         sequences

   (2)                       0010 (v)                  (2,3):       001-  (v)
   (4)                       0100 (v)                  (2,6):       0-10
   (3)                       0011 (v)                  (2,10):      -010  (v)
   (5)                       0101 (v)                  (4,5):       010-
   (6)                       0110 (v)                  (4,6):       01-0
   (10)                      1010 (v)                  (3,11):      -011  (v)
   (11)                      1011 (v)                  (10,11):     101-  (v)

                                                       (2,3,10,11): -01-

   Prime implicants (non-checked sequences):  -01-   0-10   010-   01-0

                              Figure 12

The following comments are made in regard to Figure 12:

(1) The columns in case C_1 in Figure 6 are represented horizontally as complete sequences of p_i's.

(2) To facilitate finding derived sequences, the complete sequences (and thus the derived sequences also) are listed in order of the number of 1's they contain.

(3) In order for two sequences to be combined, they must be identical in all but the one position in which one sequence contains a 1 and the other a 0.

(4) When two sequences are combined, they are checked (v) in the table. The prime implicants are all those sequences which remain unchecked after no more sequences can be combined to form further derived sequences.

(5) Every sequence must be compared with all sequences below it which contain the same number of 1's or one more 1 than the given sequence (even if some of these sequences are already checked), and all possible derived sequences must be written.

After the prime implicants have been obtained, a McCluskey chart is constructed. For our example, the following chart is obtained:

   Derived don't-care   Derived from which     Sequences belonging to case C_1
   sequences            complete sequences     2    3    4    5    6    10   11

   -01-                 (2,3,10,11)            X    X                   X    X
   0-10                 (2,6)                  X                   X
   010-                 (4,5)                            X    X
   01-0                 (4,6)                            X         X

   Essential prime implicants:      -01-   010-
   Non-essential prime implicants:  0-10   01-0   (must have one of these to cover the "6" column)

                              Figure 13

In the new table containing "don't-care" entries, C_1 is represented as follows:

           C_1
   p_1   -   0   [0]       [0]
   p_2   0   1   [-]  or   [1]
   p_3   1   0   [1]       [-]
   p_4   -   -   [0]       [0]

                 Figure 14

The table entries for C_2 and C_3 can also be derived using the Quine-McCluskey procedure to yield the following table:

[Figure 15. The complete "don't-care" table for the problem of Figure 6: C_1 is represented by the columns of Figure 14, and C_2 and C_3 by the don't-care columns obtained from the same procedure.]

Figure 15 could also have been derived, in whole or in part, from the following Karnaugh map, which can be obtained directly from Figure 6. Of course, use of a Karnaugh map is only practical for n ≤ 4. Entries in the map indicate to which case the corresponding sequence of 1's and 0's belongs.

[Karnaugh map for the table of Figure 6, with each cell labeled by the case to which the corresponding sequence of 1's and 0's belongs.]
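The combining step that produces the derived sequences of Figure 12 is entirely mechanical. The following is a minimal Python sketch of a generic Quine-McCluskey prime-implicant pass, written for illustration only and not taken from the report; sequences are strings over '0', '1', and '-', read as p_1 p_2 p_3 p_4.

```python
def combine(seq_a, seq_b):
    """Merge two sequences that differ in exactly one non-dash position;
    return the merged sequence with a dash there, or None if they do not combine."""
    if len(seq_a) != len(seq_b):
        return None
    diff = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    if len(diff) != 1 or '-' in (seq_a[diff[0]], seq_b[diff[0]]):
        return None
    i = diff[0]
    return seq_a[:i] + '-' + seq_a[i + 1:]

def prime_implicants(sequences):
    """Repeatedly combine sequences; those never combined are the prime implicants."""
    current, primes = set(sequences), set()
    while current:
        combined, used = set(), set()
        for a in current:
            for b in current:
                merged = combine(a, b)
                if merged is not None:
                    combined.add(merged)
                    used.update((a, b))
        primes |= (current - used)     # unchecked sequences are prime implicants
        current = combined
    return primes

# The seven complete sequences of case C_1 in Figure 6:
c1 = ["0010", "0011", "0100", "0101", "0110", "1010", "1011"]
print(sorted(prime_implicants(c1)))
# ['-01-', '0-10', '01-0', '010-'], the prime implicants listed in Figure 12
```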
(2) When there is a choice of which non-essential prime implicant (or combination of non-essential prime implicants) to select, list all choices in the table, each in brackets, and join the brackets by the word "or". Call such a group of bracketed entries joined by the word "or" an or-group (each bracketed entry will be called a component of the or-group), and define the or-number of an or-group to be the number of components contained in the or-group. There may be more than one or-group in a case. For example, a case could have the following structure: [ ] or [ ]   [ ] or [ ] or [ ]. The first or-group has an or-number of 2, the second an or-number of 3.

(3) Define the case-count for a case to be 2^r, where

   r = (number of dashes occurring in non-bracketed entries in the case)
       + Σ (over all or-groups in the case) (maximum number of dashes in any component of the or-group).

(4) Define the dash-count for a particular p_i to be the sum of the case-counts corresponding to each non-bracketed dash in the row, plus the sum of (1/2)·(corresponding case-count) for each bracketed dash in the row.

The preceding comments and definitions lead to the statement of Algorithm 4 for constructing an optimal decision tree. The notions of case-count and dash-count resemble those described by Pollack in his article, "Conversion of Limited Entry Decision Tables to Computer Programs" (cited earlier in this paper).

3.1. Algorithm 4 (Dash-Count)

Step 1: Using a Karnaugh map or the Quine-McCluskey procedure plus the bracketing rules described above, construct a table containing "don't-care" entries.

Step 2: Compute the case-count for each case.

Step 3: Compute the dash-count for each p_i.

Step 4: Select the p_i for which the dash-count is a minimum. If more than one p_i has the minimum dash-count, select any of the p_i's with a minimum dash-count.

The case-counts and dash-counts for Figure 15 are shown below:

   Case-counts:   C_1: 2^(3+1) = 2^4 = 16     C_2: 2^1 = 2     C_3: 2^3 = 8

   Dash-counts:   p_1: 16
                  p_2: (1/2)(16) = 8
                  p_3: (1/2)(16) + 8 = 16
                  p_4: 2(16) + 2 + 2(8) = 50

                              Figure 19

Since p_2 has the minimum dash-count, choose p_2 to be the root (p_2 is indeed the root of both optimal trees). Use the original table (not the one with the "don't-care" entries) to create two subtables and apply Algorithm 4 to each of the subtables.

One very significant advantage of Algorithm 4 over the other algorithms is that cases which are of no concern in a particular problem can be designated as such (as a "d" in a Karnaugh map or Quine-McCluskey chart) rather than having to be combined into a single "else" case. For example, suppose we have a problem in which n = 4 and in which we assign only 13 predicate sequences to cases C_1 through C_4. We have no concern for what happens to the three sequences 0010, 0011, and 0111. In order to apply our earlier algorithms, we would first have to combine these three "else" sequences into a single case, C_5. However, for Algorithm 4, this "else" case is not necessary. Suppose the original table is:

[Figure 20. A table with n = 4 in which 13 of the 16 predicate sequences are assigned to cases C_1 through C_4; the sequences 0010, 0011, and 0111 are assigned to no case.]

The following Karnaugh map can be constructed from this table:

[Figure 21. Karnaugh map of the table in Figure 20, with each cell labeled by its case and the three unassigned sequences entered as "d" (don't-care) cells.]

The "d" entries provide a far more accurate representation of the problem than would a fifth, "else" case; and as the Karnaugh map indicates, a far better optimal tree will result when the "d" entries are used.
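Before looking at Figure 22, the case-count and dash-count computations of Figure 19 can be mechanized directly from definitions (2) through (4). The sketch below is an illustrative Python rendering (the data layout and names are assumptions, not from the report): a plain column is a string over '0', '1', '-' indexed by predicate, and an or-group is a list of such strings.

```python
def case_count(case):
    """Case-count (definition (3)): 2**r, where r counts the dashes in the
    non-bracketed columns plus, for each or-group, the maximum number of
    dashes in any one of its components."""
    r = 0
    for col in case:
        if isinstance(col, list):            # or-group (bracketed alternatives)
            r += max(c.count('-') for c in col)
        else:                                # plain (non-bracketed) column
            r += col.count('-')
    return 2 ** r

def dash_count(table, i):
    """Dash-count of predicate i (definition (4)): the full case-count for each
    non-bracketed dash in row i, plus half the case-count for each bracketed dash."""
    total = 0.0
    for case in table.values():
        cc = case_count(case)
        for col in case:
            if isinstance(col, list):
                total += sum(0.5 * cc for c in col if c[i] == '-')
            elif col[i] == '-':
                total += cc
    return total

# Case C_1 of Figure 15, written as in Figure 14 (each column read as p_1 p_2 p_3 p_4):
c1 = ["-01-", "010-", ["0-10", "01-0"]]
print(case_count(c1))                 # 16, i.e. 2**(3+1), as in Figure 19
print(dash_count({"C_1": c1}, 0))     # 16.0: C_1's contribution to p_1's dash-count
print(dash_count({"C_1": c1}, 1))     # 8.0: (1/2)(16), C_1's contribution to p_2's
```

Summing these contributions over all three cases of Figure 15 reproduces the dash-counts of Figure 19.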
Figure 22 shows an optimal decision tree obtained using an "else" case and one obtained using "d" entries.

[Figure 22. Optimal decision tree using C_5 as the "else" case (8 nodes); optimal decision tree using the "d" entries (5 nodes).]

Algorithm 4 has yielded an optimal decision tree in many examples on which it was tested. However, Toshio Yasui (Ph.D. student, University of Illinois) has found the following counterexample to Algorithm 4. Given the decision table in Figure 23, Algorithm 4 indicates that the optimal tree would have p_4 for its root. In fact, however, all the optimal trees have either p_1, p_2, or p_3 for their roots, with a root of p_4 yielding no optimal trees.

[Figure 23. A decision table with four predicates p_1, ..., p_4 and nine cases C_1 through C_9, containing "don't-care" entries.]

Therefore, although Algorithm 4 provides an efficient, systematic method for finding a very "good" decision tree, it will, in some situations, fail to yield the optimal decision tree. Algorithm 4 is judged, however, to be generally more reliable than the algorithms presented earlier in this paper.

LIST OF REFERENCES

Pollack, Solomon L., "Conversion of Limited Entry Decision Tables to Computer Programs," Communications of the ACM, Vol. 8, No. 11, November 1965, pp. 677-682.

Yasui, Toshio, "Some Combinatorial Aspects of Decision Table Optimization Problems," Ph.D. Dissertation (in progress), University of Illinois at Urbana-Champaign, Urbana, Illinois.