S-R-77-904

A Method for Constructing Compressed Parsers for a Parser Generating System

by Richard Marion Schell, Jr.
November, 1977

UILU-ENG 77 1756

A METHOD FOR CONSTRUCTING COMPRESSED PARSERS FOR A PARSER GENERATING SYSTEM

BY
RICHARD MARION SCHELL, JR.
A.B., University of Illinois, 1972

THESIS
Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1977

Urbana, Illinois

ACKNOWLEDGEMENTS

I would like to express my gratitude to Professor M.D. Mickunas for his guidance and patience as well as for his advice and helpful suggestions. I also owe my thanks to Alfred Weaver for his encouragement and for his part in getting me involved in this research. Finally, my thanks go to my wife, Barbara, for her constant faith and unwavering patience.

TABLE OF CONTENTS

1. INTRODUCTION
1.1. SURVEY OF METHODOLOGY
1.2. OVERVIEW OF THE THESIS
2. GRAMMARS AND ACCEPTORS
2.1. DEFINITIONS
2.2. NORMAL FORM GRAMMARS
2.3. BOUNDED CONTEXT ACCEPTORS
2.4. EXTENDED BOUNDED CONTEXT ACCEPTORS
2.5. CONTEXT COMPUTATION AND CANONICAL COVERS
3. TABLE REDUCTION STRATEGIES
3.1. REPRESENTING W*
3.2. MOTIVATION FOR IMPROVING THE TABLES
3.3. TABLE TRANSFORMATIONS
3.4. CONFLICT GRAPHS
3.5. FINDING A BEST COMPRESSION SET
3.5.1. ANALYSIS
3.5.2. A BACKTRACK ALGORITHM
3.6. TABLE ORDERING
3.7. OBSERVATIONS
4. THE PARSER GENERATING SYSTEM
4.1. PGS INPUT
4.2. THE STRUCTURE OF THE PGS
4.3. TRANSFORMING THE INPUT GRAMMAR
4.4. CONTEXT COMPUTATION
4.4.1. ALGORITHMS
4.4.2. REPRESENTING CONTEXT SETS
4.5. CONFLICT GRAPH GENERATION
4.6. STATE TABLE GENERATION
4.7. ARRANGING
4.8. ENCODING THE TABLES
5. RESULTS AND CONCLUSIONS
5.1. RESULTS
5.2. CONCLUSIONS
5.3. EXTENSIONS AND REFINEMENTS
REFERENCES
APPENDIX A
APPENDIX B

CHAPTER 1
INTRODUCTION

The subject of this thesis is a method for automatic parser generation. In particular, it explores a technique for producing compact and efficient parsers. The principal objective of generating parsers automatically is to take advantage of powerful parsing methods and at the same time to spare the compiler writer the burden of writing the parser. We will demonstrate techniques for generating parsers which satisfy this objective and which can be produced for a broad class of programming languages, and we will describe a parser generating system (PGS) that uses these techniques effectively.

The automatic generation of recognizers and parsers has been an area of continued interest since the late 1950s, when the prototypic "modern" programming languages were developed. Concurrent with the development of programming languages has been the development of theory and technology for recognizing and parsing these languages; this technology has resulted in the development of algorithms for constructing parsers mechanically. There have been several major issues faced in developing these methods. Among them are:

1. Allowing a broad class of languages to be specified easily. Some methods preclude desirable syntactic constructs from the set of languages they are capable of parsing; others require substantial effort on the part of the compiler writer to arrange his input grammar so that it is acceptable to the system.

2. Generating a usable parser.
Parsers that are generated by mechanical systems must be both space and time efficient in order to meet the constraints of practical machine environments.

3. Generating a useful parser. Parsers must be integrated into the compiling process. To fit, they must provide capabilities for code generation and other semantic analysis, and for detecting and repairing, or recovering from, programmer syntax errors.

Several methods have been developed for constructing parsers. To establish a perspective, we present some of the most representative methods.

1.1 SURVEY OF METHODOLOGY

The methods presented here are all deterministic bottom-up parsing methods. For a thorough survey, see [Aho 72] and [Feldman 68]. Mechanical generation of top-down parsers is rarely done, and top-down methods are inherently less powerful; consequently, LL parsing, though widely used, is not considered here.

Among the oldest methods for mechanical parser generation are the precedence parsing methods [Floyd 63], [Wirth 66]. Such methods are conceptually simple, and the generating algorithms are both simple and efficient. However, most precedence parsing techniques have undesirable characteristics. Not all programming languages are precedence parsable, and those that are often require substantial rewriting of the grammar before it is acceptable. Furthermore, the parsers produced are often large. The technique of Ichbiah and Morse [Ichbiah 70], which is based on weak precedence and uses Floyd-Evans productions [Floyd 61], [Evans 64], is relatively space efficient but is still less general than other techniques available. Generally, precedence parsing is no longer a competitive method for generating parsers automatically. An extension to precedence parsing is mixed strategy precedence parsing; a parser generating system based on this method is in common use [McKeeman 71]. Again, the resulting parsers are large.
And while it is theoretically possible to find grammars for languages in a broad class, in practice it is not always feasible to find one acceptable to both the system and the compiler writer.

Currently, the most favored method in use is LR parsing [Knuth 65]. LR methods are powerful; the general class of LR parsers is capable of satisfying the requirements discussed previously. This is true also of the commonly used LR methods, LR(1), SLR(1) and LALR(1). Although LR parsing had obvious promise from the beginning, there were several disadvantages. Developing construction algorithms from the discussions in the early literature was difficult, and when LR parsers were developed, the space required to store the parse tables was prohibitively large. Several independent researchers eliminated the latter problem, developing algorithms that produce parsers of a practical size [Pager 70], [DeRemer 71], [Lalonde 71]. Further research has demonstrated that LR construction algorithms can be made to produce very efficient parsers [Joliat 73], [Anderson 74]. Parser generating systems based on these techniques are widely available.

The method used by the PGS described here is a bounded context method. Bounded context parsing predates LR parsing [Floyd 63], [Eickel 63], but did not receive the same attention. Requirements on the input grammar tended to be very restrictive, and the space requirements were also severe. A system for generating bounded context parsers was developed at Purdue University [Mickunas 74] which solved these problems; the system accepts reasonable input grammars and produces compact parse tables. The research and development reported here is a product of a modification to and extension of this system. The parsers produced are competitive with those produced by LR-based systems over a wide range of applications, and the method is particularly advantageous in cases where space is at a premium.
1.2 OVERVIEW OF THE THESIS

This thesis presents a description of the parser generating system. It also provides a detailed discussion of the techniques and algorithms used to minimize the size of the parsers it produces. Results and methods produced during the course of the thesis research are substantiated and proved. The remaining chapters are devoted to these tasks.

Chapter 2 provides background material. It introduces notation and terminology and presents a formal model which is an abstraction of the parsing method used. Chapter 3 describes a process for constructing parse tables in a space-efficient manner; methods and algorithms are introduced. Chapter 4 provides an overview of the PGS and shows how the methodology described in Chapter 3 is integrated into the system; the chapter outlines the structure of the PGS and the implementation of the algorithms actually used. Chapter 5 presents results and summarizes the thesis presentation. It also suggests refinements and extensions. There are two appendices; the first consists of four sections and illustrates the process of generating parse tables from an input grammar. The second is a full set of parse tables for an actual programming language.

CHAPTER 2
GRAMMARS AND ACCEPTORS

This chapter develops the mechanics of the parsing method used. In particular, it discusses bounded context grammars and a class of automata which accept languages defined by a subset of these grammars. It will be convenient to use terminology from formal language theory and parsing theory. We present some basic notation and terminology. The format follows Gray and Harrison [Gray 72], and Aho and Ullman [Aho 72].

2.1 DEFINITIONS

Definition.
A (context-free) grammar (CFG) is a four-tuple G = (V,T,P,S) where
V is a finite nonempty set of symbols, the vocabulary,
T, a subset of V, is a finite set of symbols, the terminal vocabulary,
N = V-T is the nonterminal vocabulary,
S is a distinguished element of N, the goal symbol,
P is a finite subset of N X V*, the productions.
We will denote an element (A,x) of P as A -> x. For any rule A -> x, A is called the left-part and x the right-part of the rule. We will find it convenient to use binary relations on symbols.

Definition. Let p be a binary relation on a set Y, p a subset of Y X Y. Define p^0 to be the identity relation and p^(i+1) = p^i p. Define the reflexive transitive closure p* = U p^i, and the (non-reflexive) transitive closure p+ = p*p. We will use pu to denote {v in Y | v p u}.

Definition. Let G = (V,T,P,S) be a CFG. For u, v in V* define u => v if there exist x, w in V*, y in T* and A in N for which u = xAy, v = xwy and A -> w is in P. By this definition, we restrict derivations to rightmost, or canonical, derivations.

Definition. The set of (canonical) sentential forms for a CFG G is denoted by CSF(G) and is defined as {x in V* | S =>* x}. The sentences of the grammar are terminal strings in CSF(G). The language generated by G is the set of sentences of G.

Definition. If x_i => x_j, then we say that x_i directly derives x_j. If x_0 => x_1 => ... => x_n, then we say that x_0 derives x_n; the sequence is called a (canonical) derivation of x_n from x_0.

Definition. A CFG G is said to be unambiguous iff every sentence in the language of G has exactly one canonical derivation (from the goal symbol). A CFG which is not unambiguous is said to be ambiguous.

Definition. A string u in V* is a phrase if there is a derivation S =>* wAv => wuv. If wAv => wuv, then u is a simple phrase. The leftmost simple phrase of a sentential form is called the handle.

Definition. A CFG G = (V,T,P,S) is said to be
1. e-free if P is a subset of N X V+,
2.
reduced if for each A in V-{S} there are x, y in V* such that xAy is in CSF(G), and if no nonterminal is useless. A nonterminal is useless if it does not derive some terminal string.

For the remainder of the discussion, we insist that all grammars be reduced, e-free, and unambiguous. The following conventions for naming symbols and strings hold throughout the discussion:
A,B,C,D,... are nonterminal symbols
a,b,c,d,... are terminal symbols
U,V,...,Z are either terminal or nonterminal symbols
u,v,...,z are elements of V*

2.2 NORMAL FORM GRAMMARS

The parsing method used is called bounded right context. Informally, a grammar is (m,n) bounded (right) context if, given a sentential form in a (leftmost) parse, the handle can be determined by examining at most m symbols to the left and n to the right of a possible handle. The grammars used internally by the PGS are (1,1) bounded right context. It is proved in [Aho 72] that for every LR(k) language there is a (1,1)BRC grammar; further, given an LR(k) grammar, there is a mechanical transformation to a bounded context grammar of the specific form given by the definition below [Graham 74], [Mickunas 76].

Definition 2.1 A normal form grammar, G, is a 5-tuple G = (N,Q,T,S,P) where
N is a finite set of symbols called the pushdown or stack vocabulary,
Q is a finite, non-empty set of symbols disjoint from N called the state vocabulary,
T is a finite set of symbols called the input or terminal vocabulary,
V = N u Q u T is called the vocabulary, and V-T is called the nonterminal vocabulary,
S is a distinguished element of Q called the goal or sentence symbol,
P is a finite subset of (N u Q) X (N X Q u Q X T u Q u T) called the production rules of G.
The production rules of G are of the following four forms:
1. p = E -> AB, called stack reducing rules,
2. p = E -> Ba, called input erasing rules,
3. p = E -> B, called renaming rules, and
4.
p = E -> a, called initial state rules.

The grammars as used here are constrained so that the right parts of all rules except renaming rules must be distinct; clearly, a grammar for which this condition does not hold can be transformed into one for which it does by coalescing productions and adding new nonterminals and renaming rules. This requirement can be relaxed, but it is convenient to assume it holds in most cases.

Given a normal form grammar, a pair (A,a) in N X T is a context for a production p in P iff there is a derivation S =>+ xABay for which the last production is p. The stack symbol A in a context (A,a) is called the left context; the input symbol a is called the right context.

(1,1) bounded context grammars in this form have several important properties.
1. It is possible to determine the handle of a sentential form given a single context - the stack top and the next input symbol.
2. It is possible to distinguish between stack reducing (input erasing) rules on the basis of left (right) context alone.
3. Contexts can be constructed using a simple algorithm.
4. The parser and parse table structure are both simple.
The price paid for these properties is that the input grammar must be transformed into the specified internal form.

2.3 BOUNDED CONTEXT ACCEPTORS

The natural method of implementing a parser for a class of context free grammars is to construct a pushdown automaton model which is realized as a table-driven parser. In this section, two separate automata are presented, both of which will parse using the grammars described in the previous section; the first is a simple model which corresponds naturally to the normal form grammars; the second is an extended model which properly includes the first and which serves as a good abstraction of the parsing method described in this thesis. The first machine will be referred to as a bounded context acceptor, abbreviated BCA.
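The four normal-form rule shapes can be sketched concretely. The following fragment is illustrative only (the PGS itself is not written in Python); the encoding of a rule as a (left-part, right-part) pair and the function name are assumptions made for the sketch:

```python
# Classify a normal-form production by the shape of its right part,
# given the disjoint stack (N), state (Q), and terminal (T) vocabularies.
def classify(rule, N, Q, T):
    """Return which of the four normal-form production shapes a rule has."""
    E, right = rule
    if len(right) == 2:
        X, Y = right
        if X in N and Y in Q:
            return "stack reducing"      # E -> AB
        if X in Q and Y in T:
            return "input erasing"       # E -> Ba
    elif len(right) == 1:
        (X,) = right
        if X in Q:
            return "renaming"            # E -> B
        if X in T:
            return "initial state"       # E -> a
    raise ValueError("not a normal-form rule")

# Toy vocabularies, purely for illustration.
N, Q, T = {"A"}, {"B", "E"}, {"a"}
assert classify(("E", ("A", "B")), N, Q, T) == "stack reducing"
assert classify(("E", ("B", "a")), N, Q, T) == "input erasing"
assert classify(("E", ("B",)), N, Q, T) == "renaming"
assert classify(("E", ("a",)), N, Q, T) == "initial state"
```

The point of the sketch is that the rule form, and therefore the acceptor action it induces, is recoverable from the right part alone, which is what makes the simple table construction of the next section possible.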
As used in this thesis, a BCA is a deterministic pushdown automaton represented by a six-tuple.

Definition 2.2 A bounded context acceptor is a 6-tuple A = (N,Q,T,M,S_0,S) where
N is the finite pushdown alphabet,
Q is the finite, non-empty set of states,
T is a finite set, the input alphabet,
M is a partial function from N X Q X T into (N u Q) X Z, where Z is the set of actions shift, reduce, and rename; M is called the mapping of A,
S_0 is a distinguished element of Q called the initial state,
S is a distinguished element of Q called the accept state.

A possible configuration for the automaton A is a triple (B,A,a) in N X Q X T; the configuration is valid if the mapping is defined for it. When M is defined it maps a configuration into a state-or-stack/action pair: M(B,A,a) = (C,t), where if t is a shift, the next input symbol is consumed; if t is a reduce, the topmost stack symbol is removed; and if t is a rename, no erasure occurs. If C is a state, the new state of the machine is C; if C is a stack symbol, then after the action t is performed, C is pushed onto the pushdown stack and the new state of the machine is the initial state.

An instantaneous description (ID) of the automaton A is a triple (o,q,w) in N* X Q X T*; o represents the contents of the stack, q the state, and w the remaining (unscanned) input. The automaton A induces a relation |- on the set of ID's as follows: (o,q,w) |- (o',q',w') if o = uA, w = av, and one of the following holds:
1. o' = u, w' = av and M(A,q,a) = (q',reduce)
2. o' = uA', w' = av and M(A,q,a) = (A',reduce)
3. o' = uA, w' = v and M(A,q,a) = (q',shift)
4. o' = uAA', w' = v and M(A,q,a) = (A',shift)
5. o' = uA, w' = av and M(A,q,a) = (q',rename)
6. o' = uAA', w' = av and M(A,q,a) = (A',rename)
In cases 2, 4 and 6 the new state q' is the initial state S_0. The transitive closure |-* is well-defined. In order that the machine accept a string w, it is necessary that (e,S_0,w) |-* (e,S,e). (It is assumed that the machine is initially in state S_0 with an empty stack and ready to consume the first input symbol.
If the input is well-formed, then the machine will stop in the accept state with an empty stack, having consumed all the input.)

The correspondence between the BCA and the normal form grammar whose language it accepts is obvious: given G = (N,Q,T,S,P), the corresponding BCA is A = (N, Q u {S_0}, T, M, S_0, S) where M is constructed from P as follows:
1. If A_i -> A_j a_k is in P, then for each of its left contexts B_l define M(B_l, A_j, a_k) = (A_i, shift).
2. If A_i -> B_j A_k is in P, then for each of its right contexts a_l define M(B_j, A_k, a_l) = (A_i, reduce).
3. If A_i -> A_j is in P, then for each of its contexts (B_k, a_l) define M(B_k, A_j, a_l) = (A_i, rename).
4. If A_i -> a_j is in P, then for each of its left contexts B_k define M(B_k, S_0, a_j) = (A_i, shift).
The correspondence between an instantaneous description of the BCA and a canonical sentential form is straightforward.

For a normal form grammar, define the relations
A lambda B if A -> Bu is in P,
A rho B if A -> uB is in P,
X alpha Y if A -> uXYv is in P.
The derived relations first, follow, and precede are defined by:
first = (lambda^-1)*
last = rho*
follow = first alpha^-1 last
precede = alpha first
The set of initial elements of a binary relation b corresponding to the second element x will be denoted b(x). The sets so induced by the first, follow and precede relations will form the basis of the canonical covering. In using the relations just defined, the following observations are useful:

1. X first A iff A =>* Xu. This follows immediately.

2. A precede B iff S =>* xABy. If A precede B, then there is a C such that B first C and a production p = D -> uACv. Since D is not useless, S =>* xABy. Conversely, if S =>* xABy then A precede B can be shown by induction on the number of productions in the derivation. Clearly, if S => xABy in one step, then A precede B. Suppose S =>* xABy for a derivation of length k and for shorter derivations the hypothesis holds; obviously there is a final production in the derivation. There are three cases:
1. p = C -> w, B not in w.
In this case, S =>* xABuCz in fewer steps, so by hypothesis, A precede B.
2. p = C -> uABv; this case follows immediately.
3. p = C -> Bu; in this case, S =>* xACw, so A precede C. Since B first C, by the definition of precede and the transitivity of first, it follows that A precede B.

3. a follow A iff S =>* xAay. If a follow A then there are B and Y such that a first Y, B alpha Y, and B =>* uA. Since B alpha Y, there is a production C -> wBYx in P, and since a first Y, it follows that Y =>* av. Since the grammar is reduced, vx =>* z where z is in T*. Thus, C =>* wBaz and C =>* wuAaz for some string z. As C is not useless, the only if part follows. Conversely, if S =>* xAay then a follow A, again by induction on the number of productions. If S => xAay in one step, then A alpha a and a follow A. If S =>* xAay in k steps and for every shorter derivation the hypothesis holds, then there is a final production. There are four cases:
1. p = B -> vA; in this case, S =>* wBay and so by hypothesis a follow B. Consequently, since B rho A, by the definition of follow and the transitivity of last, a follow A.
2. p = B -> vAau; here A alpha a, so a follow A immediately.
3. p = B -> au; in this case, S =>* xABy, so A precede B and a first B. It follows immediately that a follow A.
4. p = B -> w, where neither A nor a is in w. Since the derivation is rightmost, S =>* xAauBz in fewer productions, so by hypothesis a follow A.

There are three distinct classes of context sets these relations generate:
1. If p = E -> AB, then precede(E) X first(B) is a context set for every production used in any derivation A =>* C, C in Q.
2. If p = E -> Ba or p = E -> a, then precede(E) X {a} is a context set for the production p and for every production used in any derivation B =>* C, C in Q.
3. If p = E -> AB then {A} X follow(E) is a context set for production p and for every production in any derivation B =>* C, C in Q.

Contention: if (A,a) is a context for a production then it is contained in a set of one of the above types. This can be shown by cases.

Case 1. Production p is of the type C -> Ba.
If (A,a) is a context for p, then S =>* uACv => uABav. Therefore A precede C, and the set of class 2 covers this context. (Similarly for C -> a.)

Case 2. Production p is of the type E -> AB. If (A,a) is a context for p then S =>* uEav => uABav. Consequently a follow E. A class 3 set will cover (A,a).

Case 3. Production p is of the type E -> B. If (A,a) is a context then there is an F such that S =>* uAFav =>* uAEav => uABav. There are two subcases to consider. In the first, F is in Q. Clearly, (A,a) is a context for some production D -> Fa or D -> AF, so it is covered by a class 2 or class 3 set. In the second, F is in N. In this case, there is a production in the derivation of the form G -> FD, so that S =>* uAGw => uAFDw =>* uAEav => uABav. This implies that A precede G and a first D, so (A,a) is covered by a class 1 set.

Using these facts, the covering for the canonical map W* is generated as follows:
1. If p = E -> Ba then W*(B, precede(E), {a}) = {(E, shift)}.
2. If p = E -> AB then W*(B, {A}, follow(E)) = {(E, reduce)}.
3. If p = E -> B then: if E is in Q, then whenever W*(E,X,Y) is defined, define W*(B,X,Y) = {(E, rename)}; and if E is in N then W*(B, precede(D), first(F)) = {(E, rename)} for all D, F such that D -> EF is in P.
4. For all other elements of Q X powerset(N) X powerset(T), W* is undefined.

The covering induced in this manner is suboptimal, but is still very good. The following observations justify that claim. First, observe that there is exactly one context set for each stack reducing or input erasing rule. Second, observe that if E -> B is a renaming rule, and there are two pairs of context sets U X V and X X Y such that W*(B,U,V) = W*(B,X,Y) = {(E, rename)}, then if U X V and X X Y are not disjoint, there is a stack symbol F such that F =>* E, with productions
A_11 -> F A_12
A_21 -> F A_22
and U = precede(A_11), V = first(A_12), X = precede(A_21), Y = first(A_22). This is not difficult to establish. Suppose either context set is of class 2 or class 3.
Then there are two distinct productions for some state symbol D, where D =>* E, for which U X V and X X Y are the respective context sets. But since they have non-empty intersection, there is a context (A,a) for which two separate productions can be applied. This violates (1,1) BRC.

CHAPTER 3
TABLE REDUCTION STRATEGIES

To make the extended BCA model derived in the previous chapter a useful tool, it is necessary to produce a table-driven parser and parse tables which allow the parser to simulate an EBCA for a given grammar. It is desirable to keep the space required by the parser as small as possible without sacrificing a great deal of speed efficiency. This chapter outlines techniques for producing reduced tables, parse tables produced from the canonical mapping by applying space reducing transformations.

As mentioned in the previous chapter, the acceptor mapping is implemented using a table lookup scheme. Such a scheme is obviously preferable to a matrix encoding of the original BCA mapping, which requires one entry for every possible configuration, including the invalid ones; it would require |Q|*|N|*|T| entries to provide this representation. A table lookup method eliminates the need for the entries corresponding to erroneous configurations, at the expense of giving up the direct access of the matrix method. In the last chapter, evidence was presented to show that the mapping W* is a good one. In this chapter, transformations are presented that produce efficient parse tables from the map W*. Algorithms which perform the transformations optimally will be given.

3.1 REPRESENTING W*

To implement W* in an economic manner, separate tables are maintained for each state of the EBCA; there are |Q|+1 separate tables to generate. These separate tables will be referred to as state tables. In this chapter, details of implementation will be ignored in general and the tables will be represented abstractly.
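The per-state, sequentially searched representation can be sketched as follows. This is a minimal illustration in Python, not the PGS's actual encoding; the dictionary layout, the symbol names, and the function name are assumptions made for the sketch:

```python
# Each state maps to an ordered list of entries ((U, V), (E, action)),
# where U is a set of stack symbols and V a set of input symbols.
def lookup(state_tables, A, B, a):
    """Scan state B's table in order; the first entry whose context-set
    pair (U, V) contains (A, a) determines the action. Exhausting the
    table means the configuration (A, B, a) is erroneous."""
    for (U, V), (E, t) in state_tables[B]:
        if A in U and a in V:
            return (E, t)
    return None  # error detected

# A hypothetical two-entry state table.
tables = {
    "B": [
        (({"A3"}, {"a1", "a2"}), ("C2", "reduce")),
        (({"A1", "A2"}, {"a1"}), ("C1", "shift")),
    ],
}

print(lookup(tables, "A1", "B", "a1"))  # valid: ('C1', 'shift')
print(lookup(tables, "A3", "B", "a2"))  # valid: ('C2', 'reduce')
print(lookup(tables, "A1", "B", "a2"))  # erroneous: None
```

The sketch makes the ordering discipline concrete: a valid configuration is resolved by the first matching entry, and an erroneous one falls off the end of the table.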
An entry in the state table for a given state, B, corresponding to the mapping W*(B,U,V) = (E,t) will be written as W*_B(U,V) = (E,t); set brackets are eliminated. A state table is represented as an ordered sequence of such entries:
W*_B(U_1,V_1) = (E_1,t_1)
...
W*_B(U_n,V_n) = (E_n,t_n)
Table lookup is performed as follows: given that the machine is in a configuration (A,B,a), the state table for state B is inspected one entry at a time starting with the first. There are two possibilities. If the configuration is valid, then there will be a smallest integer r such that U_r X V_r contains (A,a), in which case the action implied by (E_r,t_r) is performed. On the other hand, if the configuration is erroneous, all of the table entries will be exhausted without any action having been performed, in which case an error is detected. The order of the entries in the table is irrelevant; it is only necessary that the entire state table be exhausted. Because of the deterministic nature of the EBCA, although there may be two entries for which a given configuration could cause an action to be applied, both actions applied would be identical. Later, this ordering property will be found to be advantageous.

3.2 MOTIVATION FOR IMPROVING THE TABLES

The representation as described in the previous section requires |dom W*| entries. If a fixed space requirement, or cost, is assumed for each entry, then the only improvement available is to find a better mapping, a possibility already rejected. However, there are good reasons to assume that the cost for each entry is not fixed. It is assumed that there are two possible sources of variable cost: the space required for the set pair, and the space required to encode the entry other than that required by the sets. It is further assumed that these are the only factors contributing to the cost of a state table other than the number of entries. First, consider the set pair space requirement.
It is reasonable to assume that each set will be encoded as a bit vector. Although these vectors might be of variable length, it is improbable that such an encoding would provide much space economy, because of the overhead required to keep the vector lengths. Therefore, fixed-length vectors are assumed, thus requiring that the largest possible set be accommodated. Hence, for each set of stack symbols, |N| bits are required, and for each set of input symbols, |T| bits are required. If this space is maintained for each table entry, the space required for the sets alone would be prohibitively high. A more promising scheme is to maintain an auxiliary table of the distinct sets that are referenced and to use an index into this table in place of the actual set in the entries themselves. If this approach is adopted, the number of distinct sets is a source of variable cost, which should be made as small as possible. It can be assumed that singleton sets can be more economically encoded by using the index as the index of the symbol in the set of symbols.

The space required to encode an entry can vary because of the manner in which the inspection of an entry is performed. Given a configuration (A,B,a), the pair (A,a) is tested for membership in a set U X V by performing two tests: one for A in U and one for a in V. Thus, entries W*_B(U,V) = (E,t) for which U=N or V=T should require less space in the encoded table than other entries do; one (or more) of the membership tests can be eliminated. For the rest of this chapter, let us stipulate that the cost overhead for each membership check is one unit; then the overhead required by an entry for a pair (U,V) is:
0 if U=N and V=T,
1 if U=N or V=T, but not both,
2 if neither condition applies.
Acting on these assumptions, the cost of a set of parse tables can be reduced by minimizing the number of distinct sets referenced, by minimizing the number of expensive entries, and by reducing the number of entries.
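The stipulated overhead can be stated as a small function. This is a sketch of the cost rule just given, with illustrative names; the unit costs are exactly the ones stipulated above:

```python
# Overhead units for a table entry with context-set pair (U, V):
# each membership check costs one unit, and a check is eliminated
# when U is all of N or V is all of T.
def entry_overhead(U, V, N, T):
    cost = 2
    if U == N:
        cost -= 1  # no stack-symbol membership test needed
    if V == T:
        cost -= 1  # no input-symbol membership test needed
    return cost

# Toy vocabularies, purely for illustration.
N = {"A1", "A2", "A3"}
T = {"a1", "a2"}
assert entry_overhead(N, T, N, T) == 0      # both tests eliminated
assert entry_overhead(N, {"a1"}, N, T) == 1 # only the input test remains
assert entry_overhead({"A1"}, {"a1"}, N, T) == 2
```

Summing this quantity over all entries, together with the entry count and the number of distinct sets referenced, gives the cost measure that the transformations of the next section try to reduce.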
Such reductions will be done on a state-by-state basis because the task of global optimization is assumed to be intractable. The reductions are applied to tables generated from the canonical mapping to produce better tables; the latter may not correspond directly to any deterministic EBCA. The context sets for the reduced table entries may map into more than one acceptor action. Subsequently we will show this is allowable. The reduction schema are discussed in the remainder of this chapter.

3.3 TABLE TRANSFORMATIONS

The table reduction scheme consists of the application of two transformations to the W* tables. The first transformation is designed to reduce the number of expensive entries; the second is designed to merge entries, thereby diminishing the number of entries. It is assumed that by applying the first transformation, the number of distinct sets is also decreased. Consequently, no explicit transformations to decrease the collection of distinct sets are developed here. The transformed tables will be denoted by W'*.

Transformation 1. Given a state table in arbitrary order,
W*_B(U_1,V_1) = (E_1,t_1)
...
W*_B(U_n,V_n) = (E_n,t_n)
the entry for (U_i,V_i) can be transformed to W'*_B(U'_i,V'_i) = (E_i,t_i), where U_i is a subset of U'_i and V_i is a subset of V'_i, whenever the action for any valid configuration (A,B,a) with (A,a) in (U'_i - U_i) X (V'_i - V_i) will be applied for some entry r with r < i.

To see that the transformation preserves error detection, suppose (A_{r-1}, A_r, a_s) is not a valid configuration. Then if some action is applied for the i-th entry, one of the following cases exists: the action reduces the stack, in which case A_{r-1} is in U'_i; the action erases the input, in which case a_s is in V'_i; or, no symbol is consumed.

Consider the first case. Let (A_1...A_{r-1}, A_r, a_s...a_n) be the ID of the machine when the action is applied. In order to reduce the stack to empty, the machine must eventually enter some state in which A_{r-1} is removed. Therefore, the following sequence of ID's is obtained:
(A_1...A_{r-1}, A_r, a_s...a_n) |-* (A_1...A_{r-1}, A'_r, a_t...a_n)
|- (A_1...A_{r-2}, A''_r, a_t...a_n) or
|- (A_1...A_{r-2}A''_r, S_0, a_t...a_n).
In either case, if the stack is reduced, A'_r =>* A_r u, where u = a_s...a_{t-1}, and A''_r -> A_{r-1}A'_r. (From the correspondence between ID's and derivations.) Therefore, A_{r-1} precede A_r, so that (A_{r-1}, a_s) is a valid context and the configuration (A_{r-1}, A_r, a_s) is also valid.

Consider the second case. To erase a_s, the following sequence of ID's must occur:
(A_1...A_{r-1}, A_r, a_s...a_n) |-* (A_1...A_{q-1}, A_q, a_s...a_n)
|- (A_1...A_{q-1}, A'_q, a_{s+1}...a_n) or
|- (A_1...A_{q-1}A'_q, S_0, a_{s+1}...a_n),
where A_1...A_{q-1} is obtained from A_1...A_{r-1} by reducing the stack. There are two subcases to consider: either A_q = S_0 or not. In the second case, it is clear that A_q =>* wA_r. Furthermore, since the input is erased in state A_q, there is a production A'_q -> A_q a_s, so that a_s follow A_q and thus a_s follow A_r. Consequently, (A_{r-1}, a_s) is a valid context for a production A'_r -> A_{r-1}A_r. So (A_{r-1}, A_r, a_s) is a valid configuration. In the other case, consider the sequence of ID's which leads to a state in which A_{q-1} is removed from the stack. If
(A_1...A_{q-1}, A'_q, a_{s+1}...a_n) |-+ (A_1...A_{q-1}, A''_q, a_t...a_n)
|- (A_1...A_{q-2}, A'''_q, a_t...a_n) or
|- (A_1...A_{q-2}A'''_q, S_0, a_t...a_n)
and there is no shorter sequence which reduces A_{q-1}, then A'''_q -> A_{q-1}A''_q and A''_q =>* A'_q u. It is clear that A'_q -> a_s; therefore, a_s first A''_q, implying that a_s follow A_{q-1}. Hence, since A_{q-1} =>* wA_r, a_s follow A_r. As before, this leads to the conclusion that (A_{r-1}, A_r, a_s) is a valid configuration. In the first two cases, we ensure that the stack is non-empty by adding an additional symbol to mark the bottom of the stack and an extra state which removes this symbol and accepts if and only if the input has been consumed.
In the third case, it is easy to see that eventually the machine reaches a configuration in which one of the first two cases applies. In any case, the parser may erase an arbitrary number of input symbols, and it may erase an arbitrary number of pushdown symbols, but eventually it will arrive in a state in which it can detect the error.

As an illustration of a transformation of the first type, consider a state B in which all the rules are input erasing rules:

    C_1 -> Ba_1
    ...
    C_n -> Ba_n

with corresponding left context sets U_1,...,U_n. The initial W*_B table is

    W*_B(U_1,{a_1}) = (C_1, shift)
    ...
    W*_B(U_n,{a_n}) = (C_n, shift).

For any i, if (A_i,B,a_i) is a configuration, the only possible entry which could be applied is the one for W*_B(U_i,{a_i}). Applying transformation 1, the resulting table is

    W*_B(N,{a_1}) = (C_1, shift)
    ...
    W*_B(N,{a_n}) = (C_n, shift).

Similarly, if there are only stack reducing rules in a state, then the corresponding situation exists. In states in which both types of rules exist, or in which renaming rules exist, applying transformation 1 is non-trivial. Consider the following simple case in which state B has one stack reducing rule and one input erasing rule: C_1 -> Ba_1 with left context set {A_1,A_2}, and C_2 -> A_3B with right context set {a_1,a_2}. If the W*_B table is ordered as written, i.e.,

    W*_B({A_1,A_2},{a_1}) = (C_1, shift)
    W*_B({A_3},{a_1,a_2}) = (C_2, reduce),

no transformations can be made. On the other hand, if the order is reversed, then the table can be transformed to

    W*_B({A_3},T) = (C_2, reduce)
    W*_B(N,{a_1}) = (C_1, shift).

Clearly, the transformations which can be applied are dependent upon the order of the table entries. To simplify the criteria for applying transformation 1, the notion of conflict is introduced. Two entries W*_B(U_i,V_i) = (E_i,t_i) and W*_B(U_j,V_j) = (E_j,t_j) for which (E_i,t_i) ≠ (E_j,t_j) are left conflicting if U_i ∩ U_j ≠ ∅; they are right conflicting if V_i ∩ V_j ≠ ∅. It is easy to see that two entries cannot be both left and right conflicting; otherwise the machine with mapping W* would not be deterministic.

Entries which correspond to shift or rename actions are candidates for left compression; it is possible to transform them to left compressed entries. Likewise, entries which correspond to reduce or rename actions are candidates for right compression. Given these definitions, the criteria for applying transformation 1 can be restated: an entry which is a candidate for left, respectively right, compression can be transformed into a left, respectively right, compressed entry exactly when the state table is ordered so that every right, respectively left, conflicting entry precedes the candidate in the ordering.

A set of entries for which there is an ordering of the table that allows all members of the set to be either left or right compressed is a compression set. Note that in general, the largest compression set is a proper subset of the set of state table entries. Consider the following set of table entries in state B:

    W*_B({A_1},{a_1}) = (C_1, shift)
    W*_B({A_1},{a_2}) = (C_2, shift)
    W*_B({A_2},{a_1}) = (C_3, reduce)
    W*_B({A_2},{a_2}) = (C_4, reduce)

It is easy to see that there is no ordering which satisfies the criteria given. In the following sections, we present schema for finding compression sets, table orderings and transformations; in the next section a convenient framework for developing these schema is presented, using tools borrowed from graph theory.

3.4 CONFLICT GRAPHS

To assist in understanding this section, we present some terminology and notation from graph theory. In general, the notation used is from Harary [Harary 69]. A graph G=(X,E) consists of a finite nonempty set, X, of p vertices together with a (possibly empty) set, E, of q distinct unordered pairs of distinct vertices from X. Each pair in E is called an edge.
An edge e={u,v} can be written as e=uv or as e=vu. If uv is an edge, then u and v are incident with the edge uv, and u and v are adjacent. A vertex v is an isolated vertex if it is not incident with any edges (adjacent to any vertices). A subgraph of a graph G is a graph G' all of whose vertices and edges are in G. If G' is a subgraph of G, we write G' ⊆ G. For a subset S of the vertices of G, the subgraph induced by S, written <S>, is the maximal subgraph (in the sense that no edge uv in G can be added to the subgraph) of G with vertex set S. A spanning subgraph of G, G', is a subgraph whose vertex set is the same as that of G. The complement of a graph G is the graph on the same vertex set in which uv is an edge if and only if it is not an edge of G. A complete graph is a graph in which every two distinct vertices are adjacent. The complete graph on p vertices is denoted K_p. An independent set, S, is a set of vertices no two of which are adjacent. Hence, if S is an independent set of G, <S> is a complete subgraph of the complement of G. The union of two graphs G_1 and G_2, denoted G_1 ∪ G_2, is the graph whose vertex set is the union of the vertex sets of G_1 and G_2 and whose edge set is the union of the edge sets of G_1 and G_2. A bipartite graph G is a graph whose vertex set X can be partitioned into two subsets X_1 and X_2 so that if uv is an edge of G, then u is in X_1 and v is in X_2 (or vice-versa). If every vertex in X_1 is adjacent to every vertex in X_2, then G is a complete bipartite graph. The complete bipartite graph with |X_1| = m and |X_2| = n is denoted K_{m,n}.

Obviously, a graph is a representation of a symmetric binary relation on a set of objects; here, the binary relation is the conflict relation developed in the previous section. To represent this relation, two graphs, called conflict graphs in this thesis, are defined.

Definition 3.1

A conflict graph pair for a set of state table entries is a pair of graphs

    G_R = (X,E_R), the right conflict graph,
    G_L = (X,E_L), the left conflict graph,

such that X is in 1-1 correspondence with the set of table entries and, for u, v in X,

    uv is in E_L iff the entries corresponding to u and v are left conflicting;
    uv is in E_R iff the entries corresponding to u and v are right conflicting.

Whenever it is not ambiguous to do so, we will use the terms vertex and entry interchangeably. The relations adj_L and adj_R are defined such that u adj_L v if uv is in E_L and u adj_R v if uv is in E_R.

As an example, the graph pair shown in figure 3.1 corresponds to the last set of table entries in the previous section. It is clear that G_L is a spanning subgraph of the complement of G_R and vice versa, inasmuch as two entries cannot be both left and right conflicting.

Figure 3.1

Graph pairs will be useful in visualizing conflicts, formulating algorithms and proving theorems. The first such theorem yields a simple test for detecting compressible sets given the conflict graph pair. The theorem requires the following simple definition: a free vertex in a subset S is a vertex which either corresponds to a left compression candidate and is an isolated vertex in <S>_R or corresponds to a right compression candidate and is an isolated vertex in <S>_L.

Theorem 3.1

A subset Y of X, the vertex set for conflict graphs G_L and G_R, is a compression set if and only if for every subset, S, of Y there is a free vertex in S.

Proof: If. Since Y is a subset of itself, there is a free vertex, y_1, in Y. Let Y_1 ⊇ Y_2 ⊇ ... ⊇ Y_n be a chain of subsets of Y such that Y_1 = Y, Y_2 = Y - {y_1} and, in general, Y_{i+1} is constructed from Y_i by removing one of the free vertices in Y_i, say y_i. Corresponding to the sequence of Y_i, then, is a sequence of vertices, the y_i. Consider an arbitrary y_i. Without loss of generality, we can assume it to be a left compression candidate and isolated on <Y_i>_R. Since it is isolated on <Y_i>_R, there is no y_j, j > i, which is right conflicting with it; otherwise y_iy_j would be an edge of <Y_i>_R.
From this, it is easy to see that the sequence y_n, y_{n-1}, ..., y_1 corresponds to an ordering of the state table which allows simultaneous compression of the entries in Y.

Only if. Let U be a subset of Y containing no free vertex. For any ordering of the vertices of Y, say y_1, y_2, ..., y_n, there is a smallest j for which y_j is in U. Suppose y_j corresponds to a left compression candidate; then there is an edge y_jy_k in <U>_R. By the assumption that j is smallest, k > j. Thus there is a conflicting vertex not preceding y_j. A similar result holds if y_j corresponds to a right compression candidate. Since the ordering was arbitrary, it follows that there is no ordering for which every entry right conflicting with a left compression candidate precedes it and every entry left conflicting with a right compression candidate precedes it.

A set which does not contain a free vertex will be called an irreducible set. The theorem just given provides the following algorithm, which tests a set Y for being a compression set.

Algorithm 3.1

Input: A pair of conflict graphs, G_L=(X,E_L) and G_R=(X,E_R), together with the set of left compression candidates, L, and the set of right compression candidates, R.

Output: An irreducible subset of X.

Method: L and R are searched until a free vertex is found; the vertex is removed and the algorithm repeats. If no free vertex is found, the algorithm halts, returning the remaining subset.

(1) Set Y = X.
repeat
  (2) find a vertex, y, in L ∩ Y such that adj_R y ∩ Y = ∅
  (3) if (2) fails, find a vertex, y, in R ∩ Y such that adj_L y ∩ Y = ∅
  (4) if either (2) or (3) succeeds, remove y from Y; else return Y and halt

The algorithm presented can be improved upon; a better algorithm is presented in the next chapter.

3.5 FINDING A BEST COMPRESSION SET

3.5.1 ANALYSIS

Although it is easy to test for the property of compressibility, it is not easy to construct an efficient method for finding a largest possible compression set, as will be shown.
First, it is necessary to define what we mean by efficient. It is customary to say that an algorithm is efficient only when the time required to solve a problem is bounded by a polynomial function of the size of the problem. Therefore, a method which examines every subset of the input set and tests for compressibility is inefficient; it requires time exponential in the size of the input set.

Unfortunately, the problem of finding the largest compression set belongs to a class of problems called non-deterministic polynomial-time-complete (NP-complete) problems. An NP-complete problem is one for which there is a nondeterministic polynomial-time algorithm and which is as hard as any other problem for which there is such an algorithm. That is, if there is a deterministic polynomial time algorithm for the problem, then there is one for every NP problem. For a discussion of NP-complete problems, see [Karp 72], [Aho 74].

In order to show that finding a largest compression set is NP-complete, it is necessary to show that there are conflict graphs corresponding to any arbitrary pair of graphs G_1 and G_2 such that G_1 is a subgraph of the complement of G_2 and G_2 is a subgraph of the complement of G_1.

Lemma

If G_1 and G_2 are arbitrary graphs such that each is a subgraph of the complement of the other, then there is a corresponding pair of conflict graphs G_L and G_R such that G_1 = G_L and G_2 = G_R. (The roles of G_L and G_R are reversible.)

Proof

It is easily seen that G_1 and G_2 must have the same vertex set X. Let the vertex set, X, of G_1 and G_2 be x_1, x_2, ..., x_n. Construct a set of table entries

    W*_B(U_1,V_1) = (E_1, rename)
    ...
    W*_B(U_n,V_n) = (E_n, rename)

such that if x_ix_j is an edge of G_1, then U_i ∩ U_j ≠ ∅, and if x_ix_j is an edge of G_2, then V_i ∩ V_j ≠ ∅. This situation is trivially constructible: let A_11, ..., A_nn be elements of N and a_11, ..., a_nn be elements of T. U_i and U_j contain A_ij exactly when x_ix_j is an edge of G_1, and V_i and V_j contain a_ij exactly when x_ix_j is an edge of G_2.

Suppose the largest independent set of G is of size k.
Assuming all vertices represent candidates for both left and right compression (renaming actions), the size of the largest compression set is exactly p+k. Certainly, there is a set which is this large. Take a largest independent set of G, S. Then by theorem 3.1, S ∪ X' is a compression set, since every vertex of S is free on any subset of S ∪ X' containing it, and every vertex of X' is free on any subset of X'. On the other hand, if Y is a compression set, it obviously cannot contain more than p vertices from X'. Further, it cannot contain more than k vertices from X. For suppose it did. Then X ∩ Y must contain at least k+1 vertices. Consequently there must be two vertices in X ∩ Y, say u and v, such that uv is an edge of G. In that case, {u,v} ∪ X' is an irreducible set.

That the problem is in NP is easy to see: an enumeration algorithm which constructs and tests each subset of the vertex set requires polynomial time if it is run on a nondeterministic machine.

It should be noted that it is an unsettled question whether there are polynomial-time algorithms for NP-complete problems. It is conjectured that this is not the case, based on the fact that no such algorithms have been found for a number of independent problems. Therefore, while it is marginally possible that there is an efficient algorithm to find a largest compression set, it does not seem likely.

3.5.2 A BACKTRACK ALGORITHM

The following algorithm determines the largest compression set for a given state table by partial enumeration. The foundation for this algorithm is the simple fact that if U and U' are subsets of a vertex set X of a conflict graph pair such that U' ⊆ U, U induces an irreducible configuration, and U' is a compressible set, then there is a vertex in U' which is not isolated in one of <U>_L or <U>_R but is isolated in the corresponding subgraph induced by U'. For example, consider vertex 3 in figure 3.2.

Figure 3.2

Algorithm 3.2

Input: A pair of conflict graphs, G_L=(X,E_L) and G_R=(X,E_R), together with the set of left compression candidates, L, and the set of right compression candidates, R.

Output: The largest compression set for the graph pair.

Method: C is a function which returns the largest compression set. It accepts as input parameters U and D, where U is a subset of vertices which contains the remainder of the compression set, and D is a set of vertices which can be discarded from U when an irreducible configuration is discovered. The result returned by C must contain D or be the empty set. C(U,D) is given by:

(1) Find an irreducible subset of U, V (see algorithm 3.1).
(2) if V = ∅, return U
(3) let x be a vertex in the smallest set contained in
      {S | S ⊆ D and (S = adj_R v ∩ V for some v in V ∩ L, or S = adj_L v ∩ V for some v in V ∩ R)}
    (if there is no such set, return ∅)
(4) recur, letting A = C(V-{x}, D-{x})
(5) if A ≠ V-{x}, then recur, letting B = C(V, D-{x}); else return A ∪ (U-V)
(6) return the larger of A ∪ (U-V) and B ∪ (U-V)

3.6 TABLE ORDERING

Finding a largest compression set only solves part of the problem; it is desirable to find a good ordering of the table entries. It is important especially to collapse the final, unconditional entries in the table. It is also important to order the tables to take advantage of the merging transformation. There are several possible strategies for ordering the compressible set to obtain good merging to be considered.

The first strategy is a simpleminded one. After finding a compressible set Y, algorithm 3.1 is applied to the set. The order in which the algorithm removes free vertices is used to generate the table ordering. The algorithm is clearly suboptimal; the arbitrary selection of the next vertex to remove can eliminate possible unconditional entries or prevent other merging.
For example, consider the following state table:

    W*_B({A_1},{a_1}) = (E_1, rename)
    W*_B({A_2},{a_2}) = (E_1, rename)
    W*_B({A_3},{a_2}) = (E_2, reduce)
    W*_B({A_1},{a_3}) = (E_3, shift)

The conflict graphs corresponding to this table are shown in figure 3.3.

Figure 3.3

Notice that all four vertices are free. The ordering (1,2,3,4) is equally as likely as any other. This ordering results in the following transformed table:

    W*_B(N,{a_1}) = (E_1, rename)
    W*_B({A_2},T) = (E_1, rename)
    W*_B({A_3},T) = (E_2, reduce)
    W*_B(N,{a_3}) = (E_3, shift)

If an optimal strategy is used, the table can be transformed to

    W*_B({A_3},T) = (E_2, reduce)
    W*_B(N,{a_3}) = (E_3, shift)
    W*_B(N,T) = (E_1, rename)

Obviously, the strategy can be improved by employing heuristics in determining which vertex is to be removed. A polynomial-time suboptimal algorithm will result, but one which can avoid the obvious bad orderings such as the one just given. Such a method is actually used in the implementation.

If it is sufficient to find an ordering which exploits only the collapsing of unconditional entries, then there is an optimal method for doing this which requires polynomial time. As illustrated in the last example, the renaming rule entries all corresponded to free vertices in the entire vertex set. The decision to choose vertices 3 and 4 allowed the two renaming rule entries to be output last and consequently to be collapsed into one unconditional entry. It is clearly desirable to collapse a large number of entries in general. The following lemma shows that there is a simple method for determining the optimal ordering to do this, assuming that the renaming rule to which the unconditional entry corresponds has been determined.

Lemma

Let Y be a compression set. Suppose there is an ordering of entries such that a set of vertices, S, corresponds to a group of unconditional entries.
Then if U is a subset of Y containing an element not in S, there is a free vertex in U that is not in S.

Proof

Suppose not. Then there is a subset U such that some x in S is the only free vertex. There is some ordering of the table entries corresponding to vertices in Y that allows a compression transformation. Let S_L be the set of left compressed entries and S_R the set of right compressed entries in the resulting transformation. Now, since discarding entries cannot create additional conflicts, it is clear that there is an ordering of the entries in U such that vertices in S_L ∩ U are left compressible and those in S_R ∩ U are right compressible. Let that ordering on U be y_1, y_2, ..., y_m. Then y_1 = x, because x is the only free vertex in U. Also, m > 1, since |U| > 1 by the hypothesis. Let y = y_2. Without loss of generality, assume adj_L x ∩ U = ∅; if it were not, then removing y from U - {x} would not create a free vertex. Therefore, y is in S_R. But since x is in S, all right and left conflicting entries precede it, so that S_L ∩ adj_R x ...

If A -> w is a rule which is not SLR(1), then the grammar is modified so that derivation trees of one form are transformed to another. This process is known as look-ahead reduction. Look-ahead reduction consists of two transformations. The first is right-context extraction: if A -> w is an offending rule and B p* A, then the grammar is said to be right-context extracted iff C -> uBy in P implies that either y is the empty string or y = av for some terminal, a; right-context extraction is the conversion of a grammar to a right-context extracted form. The second transformation is premature scanning: if A -> w is an offending rule, B p* A and C -> uBav is in P, then Ba is combined to [Ba], a non-terminal; the string Ba is replaced by [Ba] in all right parts in which it appears. Also, rules C -> wX, where B p* C, are replaced by [Ca] -> wXa if X is a terminal symbol and by [Ca] -> w[Xa] if X is a nonterminal.
If a grammar is not SLR(1), then the look-ahead reduction transformations are applied to yield a new grammar. The process is repeated until the transformed grammar is finally SLR(1). If the original grammar is LR(k), then at most k applications of the transformation are required. The SLR(1) grammar is subsequently transformed to (1,1)BRC using 'state splitting' and left stratification; for details, see [Graham 71]. The (1,1)BRC grammars are then transformed to the internal normal form by repeated left factorization. Given productions P = {A_i -> uB_iv}, left factorization replaces the set by P' = {A_i -> [u]B_iv} and P'' = {[u] -> u}; [u] is added to the nonterminals. Application is repeated as long as there is a factor of length greater than 2. Eventually, all productions are of the proper form. However, there may be situations such that A -> BC and A -> DB (or A -> Ba). This is called a stack-state conflict; such conflicts can be removed by further application of left factorization, replacing A -> BC by A -> B'C and adding B' -> B.

4.4 CONTEXT COMPUTATION

4.4.1 ALGORITHMS

The second phase of the system involves the computation of the canonical mapping, W*. For the sake of convenience, the context sets are attached to the productions of the grammar. The productions for each state are kept in lists, so that at the end of the phase, the mapping is available for each state. The computation and collection of context sets is performed in three subphases: the first subphase generates the binary relations (and inverses) given in section 2.5; the second generates the first, follow and precede sets from these relations; and the last associates the context sets with the appropriate productions. As we showed in 2.5, to compute the necessary context sets, it is sufficient to compute the first, follow, and precede sets for each nonterminal.
This is done in two steps: in the first, the relations λ and ρ and their inverses are computed; in the second, the transitive closures of λ and ρ are computed and the relations composed to construct the sets. The first step is accomplished in one pass through the production rules. For each production, the partially computed sets FIRST, FOLLOW, and PRECEDE and the binary relations are computed as follows:

1. For p_i = A_i -> A_{i1}A_{i2}, A_{i1} is added to λA_i, and A_{i1} is added to PRECEDE(A_{i2}).
2. For p_i = A_i -> A_{i1}a_{i1}, A_{i1} is added to λA_i, and a_{i1} is added to FOLLOW(A_{i1}).
3. For p_i = A_i -> A_{i1}, A_{i1} is added to λA_i and to ρA_i.
4. For p_i = A_i -> a_{i1}, a_{i1} is added to FIRST(A_i).

In this step, we also construct a directed graph, the renaming graph, representing the renaming rule structure of the grammar. The vertices of the graph are the nonterminals of the grammar. A labeled edge from A_i to A_{i1} is added for each renaming rule p_i = A_i -> A_{i1}; the label applied is the index of the production.

To complete the generation of the sets FIRST, FOLLOW, and PRECEDE, transitive closure is required. Closure is computed by the following O(n³) algorithm, where n is the number of nonterminals. (Actually, the algorithm is O(n²m), where m is the number of computer words required to store n bits.)

Input: A binary relation, β, represented as an array of bit vectors.
Output: The transitive closure of β, β*.

(1) Set β*x_i to {x_i} ∪ βx_i, for all x_i.
(2) for i from 1 to n do
      for all x_j such that x_i is in β*x_j, include β*x_i in β*x_j.

It is easy to verify that this algorithm is correct. After the closures are computed, FIRST, FOLLOW, and PRECEDE are completed as follows:

1. FIRST(B) is included in FIRST(A) for all A λ+ B.
2. PRECEDE(A) is included in PRECEDE(B) for all A λ+ B.
3. FIRST(A) is included in FOLLOW(B) for all A, B such that there is a production C -> BA.
4. FOLLOW(A) is included in FOLLOW(B) for all A ρ+ B.

The context sets are attached to productions in a manner prescribed by the procedure given in 2.5.
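The closure algorithm can be sketched in Python, using integers as the bit vectors so that the word-parallel inclusion of step (2) becomes a single integer "or"; the list-of-ints representation is assumed for illustration.

```python
def transitive_closure(beta, n):
    """beta: list of n ints; bit j of beta[i] set iff x_i beta x_j.
    Returns the reflexive-transitive closure, as in steps (1) and (2)."""
    star = [beta[i] | (1 << i) for i in range(n)]   # step (1): add {x_i}
    for i in range(n):                              # step (2)
        for j in range(n):
            if star[j] >> i & 1:                    # x_i in beta*(x_j)
                star[j] |= star[i]                  # include beta*(x_i)
    return star

# A chain x_0 beta x_1 beta x_2:
closure = transitive_closure([0b010, 0b100, 0b000], 3)
```

This is Warshall's algorithm on bit-vector rows: n² row scans, each costing the m words of one row, giving the O(n²m) bound quoted above.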
For each input erasing and stack reducing production, p, the appropriate context sets are determined and attached to the production; further, they are added to every production, p', which labels an edge in the renaming graph that is in any path from a nonterminal in the right part of p. Traversing the renaming graph is a simple process; there can be no cycles in the graph, nor can there be multiple paths between vertices. Therefore, each nonterminal can be encountered at most once in any traversal of the renaming graph. Thus, distributing context sets is an O(mn) algorithm, where m is the number of input erasing and stack reducing rules and n the number of nonterminals.

4.4.2 REPRESENTING CONTEXT SETS

The context information for each production is recorded as an index into a table of context pairs. The pairs map into two tables of bit vectors, one table each for the left and right context sets. Each bit vector in a table is unique; when a new context set is generated during context computation, a hash addressing scheme is used to locate the corresponding bit vector, and if it is not present, it is added to that table. The tables are initialized so that all the singleton subsets of N are included in the left context set table and all the singleton subsets of T are included in the right context set table. Entries with the same hash address are chained together. Figure 4.2 illustrates the layout of the context tables.

Figure 4.2

4.5 CONFLICT GRAPH GENERATION

The third phase generates conflict graphs for each state. The pair of graphs and the sets of left and right compression candidates are generated for all productions in the list for the specific state. The graphs are represented by the vertex set and the two adjacency relations; the adjacency relations are computed from the definition given in section 3.4.
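The adjacency computation of this phase can be sketched as follows. Entries are written (U, V, E, t) with U and V as Python sets, a representation assumed here for illustration; entries with equal (E, t) never conflict, and determinism guarantees that no pair is both left and right conflicting.

```python
def build_conflict_graphs(entries):
    """Return adjacency maps adjL, adjR over entry indices, following the
    conflict definitions of section 3.4."""
    n = len(entries)
    adjL = {i: set() for i in range(n)}
    adjR = {i: set() for i in range(n)}
    for i in range(n):
        Ui, Vi, Ei, ti = entries[i]
        for j in range(i + 1, n):
            Uj, Vj, Ej, tj = entries[j]
            if (Ei, ti) == (Ej, tj):
                continue                      # identical actions never conflict
            if Ui & Uj:                       # shared left context
                adjL[i].add(j); adjL[j].add(i)
            if Vi & Vj:                       # shared right context
                adjR[i].add(j); adjR[j].add(i)
    return adjL, adjR

# The renaming example of section 3.6:
entries = [({"A1"}, {"a1"}, "E1", "rename"),
           ({"A2"}, {"a2"}, "E1", "rename"),
           ({"A3"}, {"a2"}, "E2", "reduce"),
           ({"A1"}, {"a3"}, "E3", "shift")]
adjL, adjR = build_conflict_graphs(entries)
```

For that table the only edges are a left conflict between entries 0 and 3 and a right conflict between entries 1 and 2, so all four vertices are free, as the text observes.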
Each vertex is mapped into a triple consisting of a production rule index and indices into the context set tables.

4.6 STATE TABLE GENERATION

Phase four creates state tables from the graphs created in phase three. The tables are generated in three subphases. The first divides the table entries into two partitions: full entries and compressed entries. The second subphase orders the entries, and the third performs the compression and merging transformations and generates an intermediate representation of the state table. In the implementation, the second and third subphases are carried out concurrently; for ease of presentation, these subphases will be treated as separate sequential steps.

The process used to partition the full and compressed entries is very simple. The following structure is used.

Algorithm 4.5.1

Input: A set of vertices, X, and a pair of adjacency relations, adj_L and adj_R.
Output: FULL, a set of full entries, and COMPRESS, a set of compressible entries.

(1) Let Y = X.
repeat
  (2) Remove vertices from Y until there is a free vertex in the subset; add the vertices removed to FULL.
  (3) Generate S, a set with no free vertices, by removing free vertices from Y until it is not possible to continue; add the vertices removed to COMPRESS.
  (4) Let Y be the set of remaining vertices in S.
until Y = ∅.

(4) If |adj_L y ∩ S| < |adj_R x ∩ S|, then set S = S - adj_L y; otherwise, set S = S - adj_R x.

When this algorithm is used as step 3 of algorithm 4.5.1, the result is the same as would be obtained from the backtracking algorithm 3.2 if it halted after finding the first compression set. Since algorithm 4.5.3 is O(n), algorithm 4.5.1 is O(n² + ne).

Given that the partitions have been computed, it is necessary to order both sets. We will address only the problem of ordering the compressed entries here.
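Before turning to ordering, the partitioning step can be sketched in Python. The adjacency-map representation is assumed for illustration, and the choice of which vertex to demote to FULL when no free vertex exists is made greedily here, where the thesis uses algorithm 4.5.3 to make that choice.

```python
def partition_entries(X, adjL, adjR, L, R):
    """Split vertices into FULL (kept as full entries) and COMPRESS
    (a compression set), in the spirit of algorithm 4.5.1."""
    def free(v, Y):
        # The free-vertex test of theorem 3.1.
        return (v in L and not (adjR[v] & Y)) or (v in R and not (adjL[v] & Y))

    Y, full, compress = set(X), set(), set()
    while Y:
        v = next((v for v in sorted(Y) if free(v, Y)), None)
        if v is None:
            v = min(Y)          # greedy stand-in for algorithm 4.5.3
            full.add(v)
        else:
            compress.add(v)     # peeled free vertices form the compression set
        Y.discard(v)
    return full, compress

# The renaming example of section 3.6 (all four vertices free): everything
# is compressible and FULL stays empty.
adjL = {0: {3}, 1: set(), 2: set(), 3: {0}}
adjR = {0: set(), 1: {2}, 2: {1}, 3: set()}
full, compress = partition_entries({0, 1, 2, 3}, adjL, adjR,
                                   L={0, 1, 3}, R={0, 1, 2})
```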
We showed in section 3.5 that there is a polynomial algorithm to find an ordering which provides maximum exploitation of the unconditional entries; further, we gave a bound for the algorithm. In the current implementation a faster, albeit suboptimal, ordering algorithm is used. Time is saved by not repeating the ordering process for each renaming rule in the grammar; instead, the best choice is "predicted" and the algorithm defers outputting the corresponding entries when it can. At the same time, the algorithm attempts to improve table size by setting up mergeable sequences of entries. We present a slightly simplified version of the algorithm in use; additional detail does not add to the discussion. To simplify notation, let R(v) be the largest set containing vertex v and consisting of vertices that correspond to entries for the same rule.

Algorithm 4.5.4

Input: A compressible set, Y.
Output: An ordering of Y and a compression scheme s, a function from Y to the set {left, right, both, either}.

(1) Let L = the left compression candidate set and R = the right compression candidate set.
repeat
  (2) If there exists x in (L-R) ∩ Y such that adj_R x ∩ Y = ∅, then output x, set s(x) = left, and remove x from Y.
  (3) else if there exists x in (R-L) ∩ Y such that adj_L x ∩ Y = ∅, then output x, set s(x) = right, and remove x from Y.
  (4) else let U_L = {x in L ∩ R | adj_R x ∩ Y = ∅} and U_R = {x in L ∩ R | adj_L x ∩ Y = ∅}; pick y in U_L ∪ U_R such that |R(y) ∩ Y| is smallest. If y is in U_L (respectively U_R), output R(y) ∩ U_L (R(y) ∩ U_R), and for all x output set s(x) = left (respectively right) and remove x from U_L, U_R and Y.
until Y = ∅.

4.7 ARRANGING

The right parts of the initial state rules are single terminals a_i, where the a_i are all distinct. If the input symbols which are the right parts of the initial state rules are mapped one-to-one onto an interval 1..n, it is trivial to arrange the initial state table so that the right context for the i-th entry is the i-th input symbol. This allows direct access: if a symbol is in the proper range, the action for the corresponding entry is performed; if the symbol is out of range, an error is detected.
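The direct-access scheme for the initial state can be sketched in a few lines, assuming the input symbols that are right parts of initial state rules are encoded 1..n and the actions are stored in that order (the action strings here are invented placeholders).

```python
def initial_state_action(actions, symbol):
    """Direct access: the i-th entry's right context is the i-th input
    symbol, so the symbol indexes the action table directly."""
    if 1 <= symbol <= len(actions):
        return actions[symbol - 1]   # in range: perform the i-th action
    return "error"                   # out of range: error detected

acts = ["shift C1", "shift C2", "shift C3"]
```

No search through the table entries is needed, which matters because the parser returns to the initial state so often.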
By using direct access, the time spent accessing entries in the initial state is reduced dramatically. Because the parser is frequently in the initial state, and because the remaining states generally have few entries, this transformation allows the parser to be time efficient as well as compact.

States are arranged to take advantage of the properties of unconditional entries. Consider an unconditional entry for which the state/stack symbol is a state and for which there is no semantic action. Since the only action performed by the parser for that entry is a transition from one state to another, it is possible to eliminate the entry by concatenating the states. The second state is treated as an "entry point" of the first. To determine an ordering which allows the exploitation of this property, the following procedure is used. We construct an inverse forest, the nodes of which represent states. For each state, B_1, in which the unconditional entry specifies a transition to a state B_2 without a semantic action, define an edge from B_1 to B_2. At each node there are clearly m >= 0 incoming edges and n <= 1 outgoing edges. For every node such that m > 1, remove m-1 edges arbitrarily. The result of this process is a collection of disjoint chains; each chain specifies the ordering of a group of states. By arbitrarily ordering the chains and preserving the ordering within them, the state tables are arranged to take advantage of the special unconditional entries.

The final arranging scheme we consider is rearranging the stack and input vocabularies. This is done to reduce the space required to store the auxiliary tables of bit vectors. We have assumed that it is necessary to accommodate the largest possible set in establishing the length of the (fixed-length) bit vectors. However, if no set contains the last k elements of this largest set, it is possible to truncate the bit vectors and to specify the range of symbols in the sets.
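The inverse-forest chain construction described above can be sketched as follows. An edge B1 -> B2 records that B1's unconditional entry is a bare transition to B2; extra incoming edges at a node are discarded, and the surviving edges are read off as disjoint chains. The state names are invented, and cycles are assumed absent, as in the text.

```python
def state_chains(states, trans):
    """trans: dict src -> dst (each state has at most one outgoing edge).
    Returns the disjoint chains after dropping surplus incoming edges."""
    kept, targets = {}, set()
    for src in sorted(trans):
        dst = trans[src]
        if dst not in targets:       # keep one incoming edge per target
            kept[src] = dst
            targets.add(dst)
    chains = []
    for s in sorted(states):
        if s not in targets:         # chain heads have no kept incoming edge
            chain = [s]
            while chain[-1] in kept:
                chain.append(kept[chain[-1]])
            chains.append(chain)
    return chains

chains = state_chains({"B1", "B2", "B3", "B4"},
                      {"B1": "B3", "B2": "B3", "B4": "B1"})
```

Here both B1 and B2 transition to B3; one of the two incoming edges is dropped, leaving the chains [B2] and [B4, B1, B3], and each chain is then laid out contiguously in the output tables.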
The symbols in the sets can be arranged so that any symbol not appearing in any set used is assigned a numeric encoding larger than a specified range. That range is given by |∪X_i|, where the X_i are the left (or right) context sets. In arranging the input symbols, the encoding of the symbols used as right parts of initial state rules must be preserved. The remaining symbols can be rearranged.

4.8 ENCODING THE TABLES

The final phase of the PGS encodes the state tables, putting them into a form usable on the parser's host machine. The tables can be output as binary files, macro definitions, initialization code, or another convenient representation; without loss of generality, we will assume the output to be in the form of macro definitions. Input to the encoding phase consists of three distinct groups of data. State tables are input as an array of six-tuples consisting of a context pair, a state/stack symbol, an acceptor action, a semantic action, and a compression mode. A separate table contains the locations of the state tables in the first array and the lengths of the state tables. The third data group is the pair of context set tables.

Output consists of a sequence of pseudo-instructions, which can be thought of as machine instructions for the parsing machine. There are four primitive actions performed by the machine:

1. Verify the current input symbol and apply an action.
2. Verify the current top of stack symbol and apply an action.
3. Apply an action unconditionally.
4. Indicate an error has occurred and enter error mode.

For the first two types of pseudo-instruction, the action is applied only if the symbol is matched. If the symbol is not matched, the next pseudo-instruction is fetched and interpreted. The pseudo-instruction format is shown in figure 4.3.

    operation | verification index | reduction flag | transition flag | symbol/address

Figure 4.3

The operation field indicates one of the four instruction types already described.
The verification index is an index into a table of bit vectors or the index of a symbol. (In order to avoid storing singleton sets, the index values 1...m, where m=|N| or |T|, are used to represent the sets {1}...{m}, and the table entries 1...k are represented by index values m+1...m+k.) The reduction flag indicates whether the symbol verified should be erased. The transition flag specifies how the symbol/address field should be used: as the address of a pseudo-instruction or as a symbol to add to the stack. For the third type of pseudo-instruction, the verification index and reduction flag are unused; for the error pseudo-instruction, only the operation and address fields are used, the latter to indicate the state in which the error occurred. In the examples used in the appendix, the operation codes are given by INPUT, STACK, ANY, and ERROR. The reduce flag values are represented symbolically as RED (erase/reduce the symbol) and NORED (do not). The transition flag values are ADD (add to the stack) and GOTO (next instruction to perform is at the address specified). To encode a full entry, we use three instructions. The first checks one context and conditionally "branches" to a substate in which the second context is checked. The third instruction effects a return to the main state to execute further instructions. For example, the state given at the end of section 3.3 is encoded in figure 4.4; semantic actions are not shown. Addresses are given in symbolic form. In a three instruction sequence, if the stack is reduced or the input shifted, that action must be performed in the second instruction. Further, the semantic action is performed before any symbol is consumed.
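The verification-index convention above (singletons encoded directly, stored tables offset by m) might be decoded as in this sketch; the function names are our invention.

```python
# Decode a verification index: values 1..m denote the singleton sets
# {1}..{m} (no table entry is stored for them), while values
# m+1..m+k select stored bit-vector table entries 1..k.

def verification_set(index, m, tables):
    """Return the set of symbol codes the index denotes."""
    if 1 <= index <= m:
        return {index}              # singleton: encoded in the index
    return tables[index - m - 1]    # stored table entry

def matches(symbol, index, m, tables):
    """Would this pseudo-instruction's verification succeed?"""
    return symbol in verification_set(index, m, tables)
```

The saving is that the common case, a single-symbol check, costs nothing beyond the index field itself.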
S1:  INPUT,1,NORED,GOTO,S11
S12: INPUT,1,RED,GOTO,S2
     INPUT,2,RED,GOTO,S3
     STACK,1,RED,GOTO,S2
     ERROR,1
S11: STACK,2,RED,GOTO,S5
     ANY,GOTO,S12

Figure 4.4

CHAPTER 5

RESULTS AND CONCLUSIONS

5.1 RESULTS

The principal claim of this thesis has been that the PGS is a flexible system that produces compact parsers. The results presented here support that claim. Parsers produced by the PGS are compared with those obtained using well-known, efficient systems. Parse table sizes are given for three grammars: a simple expression grammar (see Appendix A), and grammars for XPL and Pascal. Results for XPL are of particular interest, as they provide a basis for direct comparison with the LR parsers produced by Anderson, Eve, and Horning and by Joliat. Their methods are the most efficient reported in the literature. The technique of Anderson, et al. produces a list structured parse table, while Joliat produces a sparse-matrix representation. Indirect comparison is also available; Anderson, et al. give results for Algol W, a language comparable in size and complexity to Pascal. As an additional benchmark, results from a previous version of the PGS are shown. The first table provides a complete synopsis of the results obtained. The number of table entries as shown indicates the minimum number of entries produced using all transformations and assuming no semantic actions are required (thus allowing the elimination of some unconditional entries). The space required by the tables is given for three different situations. The first figure given is a measure of the space required when the smallest possible number of bits is used to encode the pseudo-instructions, and assumes semantic actions are not required. The second figure given is based on a fixed, 30-bit pseudo-instruction length. The third figure assumes that no unconditional entries can be eliminated. The size of the tables produced for a compiler is likely to fall between the second and third figures.
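The fixed 30-bit figure used in the results assumes each pseudo-instruction is packed into a single word. A possible packing is sketched below; the field widths (2-bit operation, 10-bit verification index, two 1-bit flags, 16-bit symbol/address) are our assumption, chosen only because they sum to 30 bits, and are not taken from the thesis.

```python
# Pack one pseudo-instruction into a fixed-width word and unpack it
# again.  Field widths are illustrative: 2 + 10 + 1 + 1 + 16 = 30.

OP_BITS, VIX_BITS, ADDR_BITS = 2, 10, 16

def pack(op, vix, red, add, addr):
    word = op                                   # operation code
    word = (word << VIX_BITS) | vix             # verification index
    word = (word << 1) | red                    # reduction flag
    word = (word << 1) | add                    # transition flag
    word = (word << ADDR_BITS) | addr           # symbol/address
    return word

def unpack(word):
    addr = word & (1 << ADDR_BITS) - 1; word >>= ADDR_BITS
    add  = word & 1;                    word >>= 1
    red  = word & 1;                    word >>= 1
    vix  = word & (1 << VIX_BITS) - 1;  word >>= VIX_BITS
    return word, vix, red, add, addr
```

The "smallest possible number of bits" figure corresponds to shrinking each of these fields to the minimum width the particular grammar requires.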
          Entries   Bits    Bits    Bits    Stack   Input
          (**)      (**)            full    sets    sets

Pascal    349       7712    11370   12300   13      31
XPL       163       3296     5010    5820    3      27
AEG (*)    13        130      390     450    1       2

Table 5.1

(*) AEG is a simple expression grammar
(**) Unconditional entries removed by concatenating states.

Table 5.2 shows the comparative figures for the methods of Anderson, et al., Joliat, and the previous PGS version. Table sizes are shown in bytes. Additional space is required for interpreters; however, this space is roughly the same in all cases. The table sizes reported for the new PGS also include all space required for context sets.

          PGS (old)   PGS (new)   Anderson   Joliat
XPL       1023        627         1128       1966*
PASCAL    2480        1421
ALGOL W                           2434
AEG        200         49          232

Table 5.2

(*) Joliat indicates that by eliminating error entries and doing full matrix reduction, a table size of 1000 can be obtained.

The limited data indicates that the compression techniques used in the PGS can potentially produce parse tables 25-35% smaller than those produced by very efficient LR methods.

5.2 CONCLUSIONS

This thesis has described a powerful parser generating system that is capable of producing efficient parsers for a wide class of grammars, namely the class of LR grammars. The contribution made by the thesis research is the development of techniques to achieve very good table compression. The techniques given in chapter 4 are efficient in that they do not require an exorbitant amount of computing resources in exchange for producing compact parse tables. The resulting system is capable of generating parsers for a large group of languages and making them available on small machines, i.e., mini- and microcomputers.

5.3 EXTENSIONS AND REFINEMENTS

There are several possible refinements to the system which have not yet been investigated. Additional heuristics for the compression algorithms might be developed, although the payoff does not promise to be extensive. Other areas to be investigated are error recovery and parser speedup.
The compression of the parse tables sacrifices some effectiveness in recovering and/or repairing syntactic errors. A compromise technique might be developed which performs full context checks for critical production rules. Determining such situations can be tied in with the table generation algorithms presented in this thesis. To do this requires a thorough analysis of applicable error recovery strategies. Another refinement involves improving the time efficiency of the parser by doing 'partial chain elimination', bypassing renaming rules which do not contribute to the semantic phase. For certain contexts, a long chain of renaming rules might be involved which have no associated semantic action (or perhaps only one). By detecting these contexts, it is possible to establish a direct transition and avoid superfluous parsing decisions. The final area for which some investigation is suggested is the grammar transformation phase. Any optimization of the output grammar which can be made in this phase potentially carries over into the remaining phases; similarly for error recovery and parser speedup.

REFERENCES

[Aho 72] Aho, A.V. and Ullman, J.D. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Englewood Cliffs, NJ, 1972.

[Aho 74] Aho, A.V., Hopcroft, J.E., and Ullman, J.D. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Mass., 1974.

[Anderson 73] Anderson, T., Eve, J., and Horning, J.J. Efficient LR(1) parsers. Acta Informatica 2(1973), 12-39.

[DeRemer 71] DeRemer, F.L. Simple LR(k) grammars. Comm. ACM 14(1971), 453-460.

[Eickel 63] Eickel, J., Paul, M., Bauer, F.L., and Samelson, K. A syntax controlled generator of formal language processors. Comm. ACM 6(1963), 451-455.

[Evans 64] Evans, A. Jr. An ALGOL 60 compiler. Ann. Rev. Automatic Programming 4(1964), 87-124.

[Feldman 68] Feldman, J.A. and Gries, D. Translator writing systems. Comm. ACM 11(1968), 77-113.

[Floyd 61] Floyd, R.W. A descriptive language for symbol manipulation. J. ACM 8(1961), 579-584.

[Floyd 63] Floyd, R.W. Syntactic analysis and operator precedence. J. ACM 10(1963), 316-333.

[Floyd 64] Floyd, R.W. Bounded-context syntactic analysis. Comm. ACM 7(1964), 62-67.

[Graham 74] Graham, S.L. On bounded right context languages and grammars. SIAM J. Comput. 3(1974), 224-254.

[Gray 72] Gray, J.N. and Harrison, M.A. On the covering and reduction problems for context-free grammars. J. ACM 19(1972), 675-698.

[Harary 69] Harary, F. Graph Theory. Addison-Wesley, Reading, Mass., 1969.

[Ichbiah 70] Ichbiah, J. and Morse, S. A technique for generating almost optimal Floyd-Evans productions of precedence grammars. Comm. ACM 13(1970), 501-508.

[Joliat 73] Joliat, M.L. On the reduced matrix representation of LR(k) parser tables. University of Toronto, Computer Systems Research Group Tech. Rep. CSRG-28, 1973.

[Karp 72] Karp, R.M. Reducibility among combinatorial problems. In: Complexity of Computer Computations, R.E. Miller and J.W. Thatcher (eds.), Plenum Press, New York, NY (1972), 85-104.

[Lalonde 71] Lalonde, W.R. An efficient LALR parser generator. University of Toronto Computer Systems Research Group Tech. Rep. CSRG-2, 1971.

[McKeeman 71] McKeeman, W.M., Horning, J.J., and Wortman, D.B. A Compiler Generator. Prentice-Hall, Englewood Cliffs, NJ, 1971.

[Mickunas 73] Mickunas, M.D. Techniques for compressing bounded right context acceptors. Doctoral Diss., Purdue U., West Lafayette, Ind., May 1973.

[Mickunas2 73] Mickunas, M.D. and Schneider, V.B. A parser-generating system for constructing compressed compilers. Comm. ACM 16(1973), 669-676.

[Mickunas 76] Mickunas, M.D., Lancaster, R.L., and Schneider, V.B. Transforming LR(k) grammars to LR(1), SLR(1) and (1,1) bounded right context grammars. J. ACM 23(1976), 511-533.

[Pager 70] Pager, D. A solution to an open problem by Knuth. Information and Control 17(1970), 462-473.

[Wirth 66] Wirth, N. and Weber, H.
EULER: a generalization of ALGOL 60 and its formal description. Comm. ACM 9(1966), 13-25, 89-99.

APPENDIX A

AN ARITHMETIC EXPRESSION GRAMMAR

1 GRAMMARS

Note: nonterminal symbols in both grammars are given as (subscripted) upper case letters; all other symbols, with the exception of the metasymbol '->', are terminal symbols.

BNF

E -> E+T
E -> T
T -> T*F
T -> F
F -> i
F -> (E)

THE NORMAL FORM GRAMMAR

S  -> J1 $
J1 -> J2 E
E  -> X1 T
E  -> T
T  -> X2 F
T  -> F
F  -> P )
F  -> i
P  -> X3 E
X1 -> E +
X2 -> T *
X3 -> (
J2 -> $

2 CONTEXT COMPUTATION

Shown are the follow and precede sets which are needed; there are no class 1 sets (see 2.5), so first sets are not given.

FOLLOW
E   { $, +, ) }
T   { $, *, +, ) }
F   same as above
P   { ) }
J1  { $ }

PRECEDE
E   { X3, J2 }
T   { X1, X3, J2 }
F   { X1, X2, X3, J2 }
P   same as F
X1  same as E
X2  same as T
X3  same as F

NORMAL FORM WITH ATTACHED CONTEXTS

S  -> J1 $
J1 -> J2 E    ( {J2} x {$} )
E  -> X1 T    ( {X1} x {$,+,)} )
E  -> T       ( {J2} x {$}, {X3} x {)}, {X3,J2} x {+} )
T  -> X2 F    ( {X2} x {$,*,+,)} )
T  -> F       ( {X1} x {$,+,)}, {J2} x {$}, {X3} x {)}, {X3,J2} x {+}, {X1,X3,J2} x {*} )
F  -> P )     ( {X1,X2,X3,J2} x {)} )
F  -> i       ( {X1,X2,X3,J2} x {i} )
P  -> X3 E    ( {X3} x {)} )
X1 -> E +     ( {X3,J2} x {+} )
X2 -> T *     ( {X1,X3,J2} x {*} )
X3 -> (       ( {X1,X2,X3,J2} x {(} )
J2 -> $       ( { } x {$} )

3 CONFLICT GRAPHS

Conflict graphs are given for nontrivial cases, i.e., states E, T, and F. The vertices will be labeled by a pair consisting of the number of the production rule and the order of the context set in the collection of context sets for that production. For example, the entry w*(F,{J2},{$}) = {(T, rename)} has a corresponding vertex numbered [6,2].
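The follow and precede sets of section 2 can be recomputed with a short fixed-point iteration. The Python below is an illustrative check (not part of the PGS) over the normal form grammar: FOLLOW collects the terminals that may follow a nonterminal, and PRECEDE collects the vocabulary symbols that may appear immediately to its left.

```python
# Fixed-point computation of FIRST, FOLLOW, and PRECEDE for the
# normal form grammar of this appendix (no rule derives the empty
# string, so FIRST needs only the first symbol of each right part).

RULES = [("S", ["J1", "$"]), ("J1", ["J2", "E"]),
         ("E", ["X1", "T"]), ("E", ["T"]),
         ("T", ["X2", "F"]), ("T", ["F"]),
         ("F", ["P", ")"]), ("F", ["i"]),
         ("P", ["X3", "E"]), ("X1", ["E", "+"]),
         ("X2", ["T", "*"]), ("X3", ["("]), ("J2", ["$"])]
NONTERMS = {lhs for lhs, _ in RULES}

def first_sets():
    first = {a: set() for a in NONTERMS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            f = first[rhs[0]] if rhs[0] in NONTERMS else {rhs[0]}
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
    return first

def follow_precede():
    first = first_sets()
    follow = {a: set() for a in NONTERMS}
    precede = {a: set() for a in NONTERMS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMS:
                    continue
                if i + 1 < len(rhs):          # symbol after sym in the rule
                    nxt_sym = rhs[i + 1]
                    nxt = first[nxt_sym] if nxt_sym in NONTERMS else {nxt_sym}
                else:                         # sym ends the right part
                    nxt = follow[lhs]
                prv = {rhs[i - 1]} if i > 0 else precede[lhs]
                if not nxt <= follow[sym]:
                    follow[sym] |= nxt
                    changed = True
                if not prv <= precede[sym]:
                    precede[sym] |= prv
                    changed = True
    return follow, precede
```

Running the iteration reproduces the sets listed above, e.g. FOLLOW(E) = {$, +, )} and PRECEDE(F) = {X1, X2, X3, J2}.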
GRAPHS FOR STATE E

G_L: [2,1]  [10,1]  [9,1]
G_R: [2,1]  [10,1]  [9,1]

GRAPHS FOR STATE T

G_L: [4,1]  [4,2]  [4,3]  [3,1]  [11,1]
G_R: [4,1]  [4,2]  [4,3]  [3,1]  [11,1]

GRAPHS FOR STATE F

G_L: [5,1]  [6,1]  [6,2]  [6,3]  [6,4]  [6,5]
G_R: [5,1]

[The edges of the conflict graphs appear only as drawings in the original; only the vertex labels are recoverable here.]

4 OUTPUT

The sequence shown is a symbolic representation of the parse tables produced for the simple arithmetic expression grammar. The minimum number of entries is shown.

S0: INPUT,$,RED,ADD,J2
    INPUT,i,RED,GOTO,F
    INPUT,(,RED,ADD,X3
P:  INPUT,),RED,GOTO,F
    ERROR,1
F:  STACK,X2,RED,GOTO,T
T:  INPUT,*,RED,ADD,X2
    STACK,X1,RED,GOTO,E
E:  INPUT,+,RED,ADD,X1
    STACK,J2,RED,GOTO,J1
    ERROR,4
J1: INPUT,$,RED,GOTO,ACCEPT
    ERROR,5

APPENDIX B

PARSE TABLES FOR XPL

S0: INPUT,ID,RED,GOTO,S13
    INPUT,CALL,RED,ADD,N15
    INPUT,IF,RED,ADD,N14
    INPUT,CHARACTER,RED,GOTO,S34
    INPUT,COMPLOP,RED,ADD,N16
    INPUT,NUMBER,RED,GOTO,S5
    INPUT,INITIAL,RED,GOTO,S47
    INPUT,GOTO,RED,ADD,N1
    INPUT,STRING,RED,GOTO,S5
    INPUT,DO,RED,GOTO,S9
    INPUT,BIT,RED,GOTO,S48
    INPUT,LPAREN,RED,GOTO,S49
    INPUT,FIXED,RED,GOTO,S34
    INPUT,ADDOP,RED,ADD,N17
    INPUT,WHILE,RED,ADD,N24
    INPUT,DECLARE,RED,ADD,N18
    INPUT,LEFT,RED,ADD,N19
    INPUT,TO,RED,ADD,N20
    INPUT,SEMI,RED,GOTO,S4
    INPUT,LABEL,RED,GOTO,S31
    INPUT,RETURN,RED,GOTO,S26
    INPUT,CASE,RED,ADD,N21
S5: STACK,N4,RED,GOTO,S52
S23: STACK,N38,RED,GOTO,S31
S31: INPUT,MULTOP,RED,ADD,N38
    STACK,N17,RED,GOTO,S2
    STACK,N31,RED,GOTO,S2
S2: INPUT,ADDOP,RED,ADD,N31
    STACK,N37,NOR,GOTO,S45
S30: INPUT,CONCAT,RED,ADD,N37
    STACK,N11,RED,GOTO,S20
    INPUT,RELOP,RED,ADD,N11
S20: STACK,N16,RED,GOTO,S21
S21: STACK,N36,RED,GOTO,S19
S19: INPUT,ANDOP,RED,ADD,N36
    STACK,N35,NOR,GOTO,S46
S18: INPUT,OROP,RED,ADD,N35
S12: STACK,N5,RED,GOTO,S61
    STACK,N29,RED,GOTO,S59
    STACK,N39,RED,GOTO,S53
    STACK,N14,RED,GOTO,S50
    STACK,N12,RED,GOTO,S37
    STACK,
N34,RED,GOTO,S16
    STACK,N24,RED,GOTO,S8
    STACK,N21,RED,GOTO,S8
    ERROR,12
S3: STACK,N6,RED,GOTO,S3
S4: STACK,N27,RED,GOTO,S4
    INPUT,ELSE,RED,GOTO,S32
S29: STACK,N30,RED,GOTO,S28
    STACK,N3,RED,GOTO,S14
    STACK,N33,RED,GOTO,S14
S28: INPUT,RIGHT,RED,GOTO,S25
    INPUT,END,RED,GOTO,S11
    ANY,GOTO,S30
S33: INPUT,INITIAL,NOR,ADD,N13
S6: STACK,N18,RED,GOTO,S7
    STACK,N32,RED,GOTO,S7
    ERROR,6
S7: INPUT,COMMA,RED,ADD,N32
    INPUT,SEMI,RED,GOTO,S4
    ERROR,7
S8: STACK,N22,RED,GOTO,S60
    ERROR,8
S9: INPUT,SEMI,RED,ADD,N2
S22: STACK,N28,RED,GOTO,S62
    ERROR,22
S11: INPUT,ID,NOR,ADD,N25
S10: STACK,N8,RED,GOTO,S58
    STACK,N2,RED,GOTO,S55
    ERROR,10
S13: STACK,N7,RED,GOTO,S36
    STACK,N25,RED,GOTO,S10
    INPUT,LPAREN,RED,GOTO,S51
    INPUT,LITERALLY,RED,GOTO,S40
    INPUT,COLON,RED,GOTO,S17
    INPUT,INPUT01,NOR,ADD,N26
S35: STACK,N1,RED,GOTO,S39
    STACK,N15,RED,GOTO,S38
    STACK,STACK01,NOR,GOTO,S27
    ANY,GOTO,S23
S14: STACK,N27,RED,GOTO,S14
    ANY,GOTO,S29
S15: STACK,N13,RED,GOTO,S33
    ERROR,15
S54: INPUT,BY,NOR,GOTO,S56
S16: STACK,N23,RED,GOTO,S8
    ERROR,16
S17: INPUT,PROCEDURE,RED,GOTO,S24
    ANY,ADD,N27
S24: INPUT,SEMI,RED,ADD,N8
    ANY,ADD,N28
S25: STACK,N19,RED,GOTO,S1
    ERROR,25
S26: INPUT,SEMI,RED,GOTO,S4
    ANY,ADD,N29
S27: INPUT,EQUAL,RED,ADD,N34
    INPUT,COMMA,RED,ADD,N6
    ERROR,27
S32: STACK,N3,RED,ADD,N33
    ERROR,32
S34: STACK,N10,RED,GOTO,S33
    STACK,N26,RED,GOTO,S33
    ERROR,34
S36: INPUT,RPAREN,RED,GOTO,S22
    INPUT,COMMA,RED,ADD,N7
    ERROR,36
S37: INPUT,RPAREN,RED,GOTO,S35
    INPUT,COMMA,RED,ADD,N12
    ERROR,37
S38: INPUT,SEMI,RED,GOTO,S4
    ERROR,38
S39: INPUT,SEMI,RED,GOTO,S4
    ERROR,39
S40: INPUT,STRING,GOTO,S6
    ERROR,40
S41: INPUT,NUMBER,RED,GOTO,S42
    ERROR,41
S42: INPUT,RPAREN,RED,GOTO,S34
    ERROR,42
S43: INPUT,NUMBER,RED,GOTO,S63
    ERROR,43
S52: INPUT,RPAREN,RED,GOTO,
S15
S44: INPUT,COMMA,RED,ADD,N4
    ERROR,44
S45: STACK,N37,RED,GOTO,S30
    ERROR,45
S46: STACK,N35,RED,GOTO,S18
    ERROR,47
S48: INPUT,LPAREN,RED,GOTO,S41
    ERROR,48
S49: STACK,N28,NOR,ADD,N7
    ANY,ADD,N5
S50: INPUT,THEN,RED,ADD,N3
    ERROR,50
S51: STACK,N52,NOR,GOTO,S43
    ANY,ADD,N12
S53: INPUT,SEMI,NOR,GOTO,S57
    ANY,ADD,N23
S55: INPUT,SEMI,RED,GOTO,S4
    ERROR,55
S56: INPUT,BY,RED,ADD,N34
    ERROR,56
S57: INPUT,SEMI,RED,GOTO,S3
    ERROR,57
S58: INPUT,SEMI,RED,GOTO,S4
    ERROR,58
S59: INPUT,SEMI,RED,GOTO,S4
    ERROR,59
S60: INPUT,SEMI,RED,ADD,N2
    ERROR,60
S61: INPUT,RPAREN,RED,GOTO,S23
    ERROR,61
S62: INPUT,SEMI,RED,ADD,N8
    ERROR,62
S63: INPUT,RPAREN,RED,ADD,N12
    ERROR,63

The sets used are:

STACK01 = { N2, N3, N6, N8, N19, N22, N27, N30, N33 }
STACK02 = { N18, N32 }
INPUT01 = { CHARACTER, BIT, FIXED, LABEL }