S-R-77-904

A Method for Constructing Compressed Parsers for a Parser Generating System

by Richard Marion Schell, Jr.
November, 1977

UILU-ENG 77 1756

A METHOD FOR CONSTRUCTING COMPRESSED PARSERS FOR A PARSER GENERATING SYSTEM

BY
RICHARD MARION SCHELL, JR.
A.B., University of Illinois, 1972

THESIS
Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1977

Urbana, Illinois

ACKNOWLEDGEMENTS

I would like to express my gratitude to Professor M.D. Mickunas for his guidance and patience as well as for his advice and helpful suggestions. I also owe my thanks to Alfred Weaver for his encouragement and for his part in getting me involved in this research. Finally, my thanks go to my wife, Barbara, for her constant faith and unwavering patience.

TABLE OF CONTENTS

1. INTRODUCTION
1.1. SURVEY OF METHODOLOGY
1.2. OVERVIEW OF THE THESIS
2. GRAMMARS AND ACCEPTORS
2.1. DEFINITIONS
2.2. NORMAL FORM GRAMMARS
2.3. BOUNDED CONTEXT ACCEPTORS
2.4. EXTENDED BOUNDED CONTEXT ACCEPTORS
2.5. CONTEXT COMPUTATION AND CANONICAL COVERS
3. TABLE REDUCTION STRATEGIES
3.1. REPRESENTING W*
3.2. MOTIVATION FOR IMPROVING THE TABLES
3.3. TABLE TRANSFORMATIONS
3.4. CONFLICT GRAPHS
3.5. FINDING A BEST COMPRESSION SET
3.5.1. ANALYSIS
3.5.2. A BACKTRACK ALGORITHM
3.6. TABLE ORDERING
3.7. OBSERVATIONS
4. THE PARSER GENERATING SYSTEM
4.1. PGS INPUT
4.2. THE STRUCTURE OF THE PGS
4.3. TRANSFORMING THE INPUT GRAMMAR
4.4. CONTEXT COMPUTATION
4.4.1. ALGORITHMS
4.4.2. REPRESENTING CONTEXT SETS
4.5. CONFLICT GRAPH GENERATION
4.6. STATE TABLE GENERATION
4.7. ARRANGING
4.8. ENCODING THE TABLES
5. RESULTS AND CONCLUSIONS
5.1. RESULTS
5.2. CONCLUSIONS
5.3. EXTENSIONS AND REFINEMENTS
REFERENCES
APPENDIX A
APPENDIX B

CHAPTER 1
INTRODUCTION

The subject of this thesis is a method for automatic parser generation. In particular, it explores a technique for producing compact and efficient parsers. The principal objective of generating parsers automatically is to take advantage of powerful parsing methods and at the same time to spare the compiler writer the burden of writing the parser. We will demonstrate techniques for generating parsers which satisfy this objective and which can be produced for a broad class of programming languages, and we will describe a parser generating system (PGS) that uses these techniques effectively.

The automatic generation of recognizers and parsers has been an area of continued interest since the late 1950s, when the prototypic "modern" programming languages were developed. Concurrent with the development of programming languages has been the development of theory and technology for recognizing and parsing these languages; this technology has resulted in the development of algorithms for constructing parsers mechanically. There have been several major issues faced in developing these methods. Among them are:

1. Allowing a broad class of languages to be specified easily. Some methods preclude desirable syntactic constructs from the set of languages they are capable of parsing; others require substantial effort on the part of the compiler writer to arrange his input grammar so that it is acceptable to the system.

2. Generating a usable parser.
Parsers that are generated by mechanical systems must be both space and time efficient in order to meet the constraints of practical machine environments.

3. Generating a useful parser. Parsers must be integrated into the compiling process. To fit, they must provide capabilities for code generation and other semantic analysis, and for detecting and repairing, or recovering from, programmer syntax errors.

Several methods have been developed for constructing parsers. To establish a perspective, we present some of the most representative methods.

1.1 SURVEY OF METHODOLOGY

The methods presented here are all deterministic bottom-up parsing methods. For a thorough survey, see [Aho 72] and [Feldman 68]. Mechanical generation of top-down parsers is rarely done, and top-down methods are inherently less powerful; consequently, LL parsing, though widely used, is not considered here.

Among the oldest methods for mechanical parser generation are the precedence parsing methods [Floyd 63], [Wirth 66]. Such methods are conceptually simple, and the generating algorithms are both simple and efficient. However, most precedence parsing techniques have undesirable characteristics. Not all programming languages are precedence parsable, and those that are often require substantial rewriting of the grammar before it is acceptable. Furthermore, the parsers produced are often large. The technique of Ichbiah and Morse [Ichbiah 70], which is based on weak precedence and uses Floyd-Evans productions [Floyd 61], [Evans 64], is relatively space efficient but is still less general than other techniques available. Generally, precedence parsing is no longer a competitive method for generating parsers automatically. An extension to precedence parsing is mixed strategy precedence parsing; a parser generating system based on this method is in common use [McKeeman 71]. Again, the resulting parsers are large.
And while it is theoretically possible to find grammars for languages in a broad class, in practice it is not always feasible to find one acceptable to both the system and the compiler writer.

Currently, the most favored method in use is LR parsing [Knuth 65]. LR methods are powerful; the general class of LR parsers is capable of satisfying the requirements discussed previously. This is true also of the commonly used LR methods, LR(1), SLR(1) and LALR(1). Although LR parsing had obvious promise from the beginning, there were several disadvantages. Developing construction algorithms from the discussions in the early literature was difficult, and when LR parsers were developed, the space required to store the parse tables was prohibitively large. Several independent researchers eliminated the latter problem, developing algorithms that produce parsers of a practical size [Pager 70], [DeRemer 71], [Lalonde 71]. Further research has demonstrated that LR construction algorithms can be made to produce very efficient parsers [Joliat 73], [Anderson 74]. Parser generating systems based on these techniques are widely available.

The method used by the PGS described here is a bounded context method. Bounded context parsing predates LR parsing [Floyd 63], [Eickel 63], but did not receive the same attention. Requirements on the input grammar tended to be very restrictive, and the space requirements were also severe. A system for generating bounded context parsers was developed at Purdue University [Mickunas 74] which solved these problems; the system accepts reasonable input grammars and produces compact parse tables. The research and development reported here is a product of a modification to and extension of this system. The parsers produced are competitive with those produced by LR-based systems over a wide range of applications, and the method is particularly advantageous in cases where space is at a premium.
1.2 OVERVIEW OF THE THESIS

This thesis presents a description of the parser generating system. It also provides a detailed discussion of the techniques and algorithms used to minimize the size of the parsers it produces. Results and methods produced during the course of the thesis research are substantiated and proved. The remaining chapters are devoted to these tasks.

Chapter 2 provides background material. It introduces notation and terminology and presents a formal model which is an abstraction of the parsing method used. Chapter 3 describes a process for constructing parse tables in a space-efficient manner; methods and algorithms are introduced. Chapter 4 provides an overview of the PGS and shows how the methodology described in Chapter 3 is integrated into the system; the chapter outlines the structure of the PGS and the implementation of the algorithms actually used. Chapter 5 presents results and summarizes the thesis presentation. It also suggests refinements and extensions. There are two appendices; the first consists of four sections and illustrates the process of generating parse tables from an input grammar. The second is a full set of parse tables for an actual programming language.

CHAPTER 2
GRAMMARS AND ACCEPTORS

This chapter develops the mechanics of the parsing method used. In particular, it discusses bounded context grammars and a class of automata which accept languages defined by a subset of these grammars. It will be convenient to use terminology from formal language theory and parsing theory. We present some basic notation and terminology. The format follows Gray and Harrison [Gray 72], and Aho and Ullman [Aho 72].

2.1 DEFINITIONS

Definition.
A (context-free) grammar (CFG) is a four-tuple G = (V,T,P,S) where
V is a finite nonempty set of symbols, the vocabulary,
T, a subset of V, is a finite set of symbols, the terminal vocabulary,
N = V-T is the nonterminal vocabulary,
S is a distinguished element of N, the goal symbol,
P is a finite subset of N X V*, the productions.
We will denote an element (A,x) of P as A -> x. For any rule A -> x, A is called the left-part and x the right-part of the rule. We will find it convenient to use binary relations on symbols.

Definition. Let p be a binary relation on a set Y, p a subset of Y X Y. Define p^0 to be the identity relation and p^(i+1) = p^i p. Define the reflexive transitive closure p* = U p^i, and the (non-reflexive) transitive closure p+ = p*p. We will use pu to denote {v in Y | v p u}.

Definition. Let G = (V,T,P,S) be a CFG. For u, v in V* define u => v if there exist x, w in V*, y in T* and A in N for which u = xAy, v = xwy and A -> w is in P. By this definition, we restrict derivations to rightmost, or canonical, derivations.

Definition. The set of (canonical) sentential forms for a CFG G is denoted by CSF(G) and is defined as {x in V* | S =>* x}. The sentences of the grammar are terminal strings in CSF(G). The language generated by G is the set of sentences of G.

Definition. If x_i => x_j, then we say that x_i directly derives x_j. If x_0 => x_1 => ... => x_n, then we say that x_0 derives x_n; the sequence is called a (canonical) derivation of x_n from x_0.

Definition. A CFG G is said to be unambiguous iff every sentence in the language of G has exactly one canonical derivation (from the goal symbol). A CFG which is not unambiguous is said to be ambiguous.

Definition. A string u in V* is a phrase if there is a derivation S =>* wAv => wuv. If wAv => wuv, then u is a simple phrase. The leftmost simple phrase of a sentential form is called the handle.

Definition. A CFG G = (V,T,P,S) is said to be
1. e-free if P is a subset of N X V+,
2.
reduced if for each A in V-{S} there are x, y in V* such that xAy is in CSF(G), and if no nonterminal is useless. A nonterminal is useless if it does not derive some terminal string.

For the remainder of the discussion, we insist that all grammars be reduced, e-free, and unambiguous. The following conventions for naming symbols and strings hold throughout the discussion:
A,B,C,D,... are nonterminal symbols
a,b,c,d,... are terminal symbols
U,V,...,Z are either terminal or nonterminal symbols
u,v,...,z are elements of V*

2.2 NORMAL FORM GRAMMARS

The parsing method used is called bounded right context. Informally, a grammar is (m,n) bounded (right) context if, given a sentential form in a (leftmost) parse, the handle can be determined by examining at most m symbols to the left and n to the right of a possible handle. The grammars used internally by the PGS are (1,1) bounded right context. It is proved in [Aho 72] that for every LR(k) language there is a (1,1)BRC grammar; further, given an LR(k) grammar, there is a mechanical transformation to a bounded context grammar of the specific form given by the definition below [Graham 74], [Mickunas 76].

Definition 2.1 A normal form grammar, G, is a 5-tuple G = (N,Q,T,S,P) where
N is a finite set of symbols called the pushdown or stack vocabulary,
Q is a finite, non-empty set of symbols disjoint from N called the state vocabulary,
T is a finite set of symbols called the input or terminal vocabulary,
V = N u Q u T is called the vocabulary, and V-T is called the nonterminal vocabulary,
S is a distinguished element of Q called the goal or sentence symbol,
P is a finite subset of (N u Q) X (N X Q u Q X T u Q u T) called the production rules of G.
The production rules of G are of the following four forms:
1. p = E -> AB, called stack reducing rules,
2. p = E -> Ba, called input erasing rules,
3. p = E -> B, called renaming rules, and
4.
p = E -> a, called initial state rules.

The grammars as used here are constrained so that the right parts of all rules except renaming rules must be distinct; clearly, a grammar for which this condition does not hold can be transformed into one for which it does by coalescing productions and adding new nonterminals and renaming rules. This requirement can be relaxed, but it is convenient to assume it holds in most cases.

Given a normal form grammar, a pair (A,a) in N X T is a context for a production p in P iff there is a derivation S =>+ xABay for which the last production is p. The stack symbol A in a context (A,a) is called the left context; the input symbol a is called the right context.

(1,1) bounded context grammars in this form have several important properties.
1. It is possible to determine the handle of a sentential form given a single context - the stack top and the next input symbol.
2. It is possible to distinguish between stack reducing (input erasing) rules on the basis of left (right) context alone.
3. Contexts can be constructed using a simple algorithm.
4. The parser and parse table structure are both simple.
The price paid for these properties is that the input grammar must be transformed into the specified internal form.

2.3 BOUNDED CONTEXT ACCEPTORS

The natural method of implementing a parser for a class of context free grammars is to construct a pushdown automaton model which is realized as a table-driven parser. In this section, two separate automata are presented, both of which will parse using the grammars described in the previous section; the first is a simple model which corresponds naturally to the normal form grammars; the second is an extended model which properly includes the first and which serves as a good abstraction of the parsing method described in this thesis. The first machine will be referred to as a bounded context acceptor, abbreviated BCA.
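The four normal-form rule shapes can be sketched concretely. The following fragment is illustrative only (the PGS itself is not written in Python); the encoding of a rule as a (left-part, right-part) pair and the function name are assumptions made for the sketch:

```python
# Classify a normal-form production by the shape of its right part,
# given the disjoint stack (N), state (Q), and terminal (T) vocabularies.
def classify(rule, N, Q, T):
    """Return which of the four normal-form production shapes a rule has."""
    E, right = rule
    if len(right) == 2:
        X, Y = right
        if X in N and Y in Q:
            return "stack reducing"      # E -> AB
        if X in Q and Y in T:
            return "input erasing"       # E -> Ba
    elif len(right) == 1:
        (X,) = right
        if X in Q:
            return "renaming"            # E -> B
        if X in T:
            return "initial state"       # E -> a
    raise ValueError("not a normal-form rule")

# Toy vocabularies, purely for illustration.
N, Q, T = {"A"}, {"B", "E"}, {"a"}
assert classify(("E", ("A", "B")), N, Q, T) == "stack reducing"
assert classify(("E", ("B", "a")), N, Q, T) == "input erasing"
assert classify(("E", ("B",)), N, Q, T) == "renaming"
assert classify(("E", ("a",)), N, Q, T) == "initial state"
```

The point of the sketch is that the rule form, and therefore the acceptor action it induces, is recoverable from the right part alone, which is what makes the simple table construction of the next section possible.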
As used in this thesis, a BCA is a deterministic pushdown automaton represented by a six-tuple.

Definition 2.2 A bounded context acceptor is a 6-tuple A = (N,Q,T,M,S_0,S) where
N is the finite pushdown alphabet,
Q is the finite, non-empty set of states,
T is a finite set, the input alphabet,
M is a partial function from N X Q X T into (N u Q) X Z, where Z is the set of actions shift, reduce, and rename; M is called the mapping of A,
S_0 is a distinguished element of Q called the initial state,
S is a distinguished element of Q called the accept state.

A possible configuration for the automaton A is a triple (B,A,a) in N X Q X T; the configuration is valid if the mapping is defined for it. When M is defined it maps a configuration into a state-or-stack/action pair: M(B,A,a) = (C,t), where if t is a shift, the next input symbol is consumed; if t is a reduce, the topmost stack symbol is removed; and if t is a rename, no erasure occurs. If C is a state, the new state of the machine is C; if C is a stack symbol, then after the action t is performed, C is pushed onto the pushdown stack and the new state of the machine is the initial state.

An instantaneous description (ID) of the automaton A is a triple (o,q,w) in N* X Q X T*; o represents the contents of the stack, q the state, and w the remaining (unscanned) input. The automaton A induces a relation |- on the set of ID's as follows: (o,q,w) |- (o',q',w') if o = uA, w = av, and one of the following holds:
1. o' = u, w' = av and M(A,q,a) = (q',reduce)
2. o' = uA', w' = av and M(A,q,a) = (A',reduce)
3. o' = uA, w' = v and M(A,q,a) = (q',shift)
4. o' = uAA', w' = v and M(A,q,a) = (A',shift)
5. o' = uA, w' = av and M(A,q,a) = (q',rename)
6. o' = uAA', w' = av and M(A,q,a) = (A',rename)
In cases 2, 4 and 6 the new state q' is the initial state S_0. The transitive closure |-* is well-defined. In order that the machine accept a string w, it is necessary that (e,S_0,w) |-* (e,S,e). (It is assumed that the machine is initially in state S_0 with an empty stack and ready to consume the first input symbol.
If the input is well-formed, then the machine will stop in the accept state with an empty stack, having consumed all the input.)

The correspondence between the BCA and the normal form grammar whose language it accepts is obvious: given G = (N,Q,T,S,P), the corresponding BCA is A = (N, Q u {S_0}, T, M, S_0, S) where M is constructed from P as follows:
1. If A_i -> A_j a_k is in P, then for each of its left contexts B_l define M(B_l, A_j, a_k) = (A_i, shift).
2. If A_i -> B_j A_k is in P, then for each of its right contexts a_l define M(B_j, A_k, a_l) = (A_i, reduce).
3. If A_i -> A_j is in P, then for each of its contexts (B_k, a_l) define M(B_k, A_j, a_l) = (A_i, rename).
4. If A_i -> a_j is in P, then for each of its left contexts B_k define M(B_k, S_0, a_j) = (A_i, shift).
The correspondence between an instantaneous description of the BCA and a canonical sentential form is straightforward.

For a normal form grammar, define the relations
A lambda B if A -> Bu is in P,
A rho B if A -> uB is in P,
X alpha Y if A -> uXYv is in P.
The derived relations first, follow, and precede are defined by:
first = (lambda^-1)*
last = rho*
follow = first alpha^-1 last
precede = alpha first
The set of initial elements of a binary relation b corresponding to the second element x will be denoted b(x). The sets so induced by the first, follow and precede relations will form the basis of the canonical covering. In using the relations just defined, the following observations are useful:

1. X first A iff A =>* Xu. This follows immediately.

2. A precede B iff S =>* xABy. If A precede B, then there is a C such that B first C and a production p = D -> uACv. Since D is not useless, S =>* xABy. Conversely, if S =>* xABy then A precede B can be shown by induction on the number of productions in the derivation. Clearly, if S => xABy in one step, then A precede B. Suppose S =>* xABy for a derivation of length k and for shorter derivations the hypothesis holds; obviously there is a final production in the derivation. There are three cases:
1. p = C -> w, B not in w.
In this case, S =>* xABuCz in fewer steps, so by hypothesis, A precede B.
2. p = C -> uABv; this case follows immediately.
3. p = C -> Bu; in this case, S =>* xACw, so A precede C. Since B first C, by the definition of precede and the transitivity of first, it follows that A precede B.

3. a follow A iff S =>* xAay. If a follow A then there are B and Y such that a first Y, B alpha Y, and B =>* uA. Since B alpha Y, there is a production C -> wBYx in P, and since a first Y, it follows that Y =>* av. Since the grammar is reduced, vx =>* z where z is in T*. Thus, C =>* wBaz and C =>* wuAaz for some string z. As C is not useless, the only if part follows. Conversely, if S =>* xAay then a follow A, again by induction on the number of productions. If S => xAay in one step, then A alpha a and a follow A. If S =>* xAay in k steps and for every shorter derivation the hypothesis holds, then there is a final production. There are four cases:
1. p = B -> vA; in this case, S =>* wBay and so by hypothesis a follow B. Consequently, since B rho A, by the definition of follow and the transitivity of last, a follow A.
2. p = B -> vAau; here A alpha a, so a follow A immediately.
3. p = B -> au; in this case, S =>* xABy, so A precede B and a first B. It follows immediately that a follow A.
4. p = B -> w, where neither A nor a is in w. Since the derivation is rightmost, S =>* xAauBz in fewer productions, so by hypothesis a follow A.

There are three distinct classes of context sets these relations generate:
1. If p = E -> AB, then precede(E) X first(B) is a context set for every production used in any derivation A =>* C, C in Q.
2. If p = E -> Ba or p = E -> a, then precede(E) X {a} is a context set for the production p and for every production used in any derivation B =>* C, C in Q.
3. If p = E -> AB then {A} X follow(E) is a context set for production p and for every production in any derivation B =>* C, C in Q.

Contention: if (A,a) is a context for a production then it is contained in a set of one of the above types. This can be shown by cases.

Case 1. Production p is of the type C -> Ba.
If (A,a) is a context for p, then S =>* uACv => uABav. Therefore A precede C, and the set of class 2 covers this context. (Similarly for C -> a.)

Case 2. Production p is of the type E -> AB. If (A,a) is a context for p then S =>* uEav => uABav. Consequently a follow E. A class 3 set will cover (A,a).

Case 3. Production p is of the type E -> B. If (A,a) is a context then there is an F such that S =>* uAFav =>* uAEav => uABav. There are two subcases to consider. In the first, F is in Q. Clearly, (A,a) is a context for some production D -> Fa or D -> AF, so it is covered by a class 2 or class 3 set. In the second, F is in N. In this case, there is a production in the derivation of the form G -> FD, so that S =>* uAGw => uAFDw =>* uAEav => uABav. This implies that A precede G and a first D, so (A,a) is covered by a class 1 set.

Using these facts, the covering for the canonical map W* is generated as follows:
1. If p = E -> Ba then W*(B, precede(E), {a}) = {(E, shift)}.
2. If p = E -> AB then W*(B, {A}, follow(E)) = {(E, reduce)}.
3. If p = E -> B then: if E is in Q, then whenever W*(E,X,Y) is defined, define W*(B,X,Y) = {(E, rename)}; and if E is in N then W*(B, precede(D), first(F)) = {(E, rename)} for all D, F such that D -> EF is in P.
4. For all other elements of Q X powerset(N) X powerset(T), W* is undefined.

The covering induced in this manner is suboptimal, but is still very good. The following observations justify that claim. First, observe that there is exactly one context set for each stack reducing or input erasing rule. Second, observe that if E -> B is a renaming rule, and there are two pairs of context sets U X V and X X Y such that W*(B,U,V) = W*(B,X,Y) = {(E, rename)}, then if U X V and X X Y are not disjoint, there is a stack symbol F such that F =>* E, with productions
A_11 -> F A_12
A_21 -> F A_22
and U = precede(A_11), V = first(A_12), X = precede(A_21), Y = first(A_22). This is not difficult to establish. Suppose either context set is of class 2 or class 3.
Then there are two distinct productions for some state symbol D, where D =>* E, for which U X V and X X Y are the respective context sets. But since they have non-empty intersection, there is a context (A,a) for which two separate productions can be applied. This violates (1,1) BRC.

CHAPTER 3
TABLE REDUCTION STRATEGIES

To make the extended BCA model derived in the previous chapter a useful tool, it is necessary to produce a table-driven parser and parse tables which allow the parser to simulate an EBCA for a given grammar. It is desirable to keep the space required by the parser as small as possible without sacrificing a great deal of speed efficiency. This chapter outlines techniques for producing reduced tables, parse tables produced from the canonical mapping by applying space reducing transformations.

As mentioned in the previous chapter, the acceptor mapping is implemented using a table lookup scheme. Such a scheme is obviously preferable to a matrix encoding of the original BCA mapping, which requires one entry for every possible configuration, including the invalid ones; it would require |Q|*|N|*|T| entries to provide this representation. A table lookup method eliminates the need for the entries corresponding to erroneous configurations, at the expense of giving up the direct access of the matrix method. In the last chapter, evidence was presented to show that the mapping W* is a good one. In this chapter, transformations are presented that produce efficient parse tables from the map W*. Algorithms which perform the transformations optimally will be given.

3.1 REPRESENTING W*

To implement W* in an economic manner, separate tables are maintained for each state of the EBCA; there are |Q|+1 separate tables to generate. These separate tables will be referred to as state tables. In this chapter, details of implementation will be ignored in general and the tables will be represented abstractly.
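The per-state, sequentially searched representation can be sketched as follows. This is a minimal illustration in Python, not the PGS's actual encoding; the dictionary layout, the symbol names, and the function name are assumptions made for the sketch:

```python
# Each state maps to an ordered list of entries ((U, V), (E, action)),
# where U is a set of stack symbols and V a set of input symbols.
def lookup(state_tables, A, B, a):
    """Scan state B's table in order; the first entry whose context-set
    pair (U, V) contains (A, a) determines the action. Exhausting the
    table means the configuration (A, B, a) is erroneous."""
    for (U, V), (E, t) in state_tables[B]:
        if A in U and a in V:
            return (E, t)
    return None  # error detected

# A hypothetical two-entry state table.
tables = {
    "B": [
        (({"A3"}, {"a1", "a2"}), ("C2", "reduce")),
        (({"A1", "A2"}, {"a1"}), ("C1", "shift")),
    ],
}

print(lookup(tables, "A1", "B", "a1"))  # valid: ('C1', 'shift')
print(lookup(tables, "A3", "B", "a2"))  # valid: ('C2', 'reduce')
print(lookup(tables, "A1", "B", "a2"))  # erroneous: None
```

The sketch makes the ordering discipline concrete: a valid configuration is resolved by the first matching entry, and an erroneous one falls off the end of the table.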
An entry in the state table for a given state, B, corresponding to the mapping W*(B,U,V) = (E,t) will be written as W*_B(U,V) = (E,t); set brackets are eliminated. A state table is represented as an ordered sequence of such entries:
W*_B(U_1,V_1) = (E_1,t_1)
...
W*_B(U_n,V_n) = (E_n,t_n)
Table lookup is performed as follows: given that the machine is in a configuration (A,B,a), the state table for state B is inspected one entry at a time starting with the first. There are two possibilities. If the configuration is valid, then there will be a smallest integer r such that U_r X V_r contains (A,a), in which case the action implied by (E_r,t_r) is performed. On the other hand, if the configuration is erroneous, all of the table entries will be exhausted without any action having been performed, in which case an error is detected. The order of the entries in the table is irrelevant; it is only necessary that the entire state table be exhausted. Because of the deterministic nature of the EBCA, although there may be two entries for which a given configuration could cause an action to be applied, both actions applied would be identical. Later, this ordering property will be found to be advantageous.

3.2 MOTIVATION FOR IMPROVING THE TABLES

The representation as described in the previous section requires |dom W*| entries. If a fixed space requirement, or cost, is assumed for each entry, then the only improvement available is to find a better mapping, a possibility already rejected. However, there are good reasons to assume that the cost for each entry is not fixed. It is assumed that there are two possible sources of variable cost: the space required for the set pair, and the space required to encode the entry other than that required by the sets. It is further assumed that these are the only factors contributing to the cost of a state table other than the number of entries. First, consider the set pair space requirement.
It is reasonable to assume that each set will be encoded as a bit vector. Although these vectors might be of variable length, it is improbable that such an encoding would provide much space economy, because of the overhead required to keep the vector lengths. Therefore, fixed-length vectors are assumed, thus requiring that the largest possible set be accommodated. Hence, for each set of stack symbols, |N| bits are required, and for each set of input symbols, |T| bits are required. If this space is maintained for each table entry, the space required for the sets alone would be prohibitively high. A more promising scheme is to maintain an auxiliary table of the distinct sets that are referenced and to use an index into this table in place of the actual set in the entries themselves. If this approach is adopted, the number of distinct sets is a source of variable cost, which should be made as small as possible. It can be assumed that singleton sets can be more economically encoded by using the index as the index of the symbol in the set of symbols.

The space required to encode an entry can vary because of the manner in which the inspection of an entry is performed. Given a configuration (A,B,a), the pair (A,a) is tested for membership in a set U X V by performing two tests: one for A in U and one for a in V. Thus, entries W*_B(U,V) = (E,t) for which U=N or V=T should require less space in the encoded table than other entries do; one (or more) of the membership tests can be eliminated. For the rest of this chapter, let us stipulate that the cost overhead for each membership check is one unit; then the overhead required by an entry for a pair (U,V) is:
0 if U=N and V=T,
1 if U=N or V=T, but not both,
2 if neither condition applies.
Acting on these assumptions, the cost of a set of parse tables can be reduced by minimizing the number of distinct sets referenced, by minimizing the number of expensive entries, and by reducing the number of entries.
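The stipulated overhead can be stated as a small function. This is a sketch of the cost rule just given, with illustrative names; the unit costs are exactly the ones stipulated above:

```python
# Overhead units for a table entry with context-set pair (U, V):
# each membership check costs one unit, and a check is eliminated
# when U is all of N or V is all of T.
def entry_overhead(U, V, N, T):
    cost = 2
    if U == N:
        cost -= 1  # no stack-symbol membership test needed
    if V == T:
        cost -= 1  # no input-symbol membership test needed
    return cost

# Toy vocabularies, purely for illustration.
N = {"A1", "A2", "A3"}
T = {"a1", "a2"}
assert entry_overhead(N, T, N, T) == 0      # both tests eliminated
assert entry_overhead(N, {"a1"}, N, T) == 1 # only the input test remains
assert entry_overhead({"A1"}, {"a1"}, N, T) == 2
```

Summing this quantity over all entries, together with the entry count and the number of distinct sets referenced, gives the cost measure that the transformations of the next section try to reduce.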
Such reductions will be done on a state-by-state basis because the task of global optimization is assumed to be intractable. The reductions are applied to tables generated from the canonical mapping to produce better tables; the latter may not correspond directly to any deterministic EBCA. The context sets for the reduced table entries may map into more than one acceptor action. Subsequently we will show this is allowable. The reduction schema are discussed in the remainder of this chapter.

3.3 TABLE TRANSFORMATIONS

The table reduction scheme consists of the application of two transformations to the W* tables. The first transformation is designed to reduce the number of expensive entries; the second is designed to merge entries, thereby diminishing the number of entries. It is assumed that by applying the first transformation, the number of distinct sets is also decreased. Consequently, no explicit transformations to decrease the collection of distinct sets are developed here. The transformed tables will be denoted by W'*.

Transformation 1. Given a state table in arbitrary order,
W*_B(U_1,V_1) = (E_1,t_1)
...
W*_B(U_n,V_n) = (E_n,t_n)
the entry for (U_i,V_i) can be transformed to W'*_B(U'_i,V'_i) = (E_i,t_i), where U_i is a subset of U'_i and V_i is a subset of V'_i, whenever the action for any valid configuration (A,B,a) with (A,a) in (U'_i - U_i) X (V'_i - V_i) will be applied for some entry r with r < i.

To see that the transformation preserves error detection, suppose (A_{r-1}, A_r, a_s) is not a valid configuration. Then if some action is applied for the i-th entry, one of the following cases exists: the action reduces the stack, in which case A_{r-1} is in U'_i; the action erases the input, in which case a_s is in V'_i; or, no symbol is consumed.

Consider the first case. Let (A_1...A_{r-1}, A_r, a_s...a_n) be the ID of the machine when the action is applied. In order to reduce the stack to empty, the machine must eventually enter some state in which A_{r-1} is removed. Therefore, the following sequence of ID's is obtained:
(A_1...A_{r-1}, A_r, a_s...a_n) |-* (A_1...A_{r-1}, A'_r, a_t...a_n)
|- (A_1...A_{r-2}, A''_r, a_t...a_n) or
|- (A_1...A_{r-2}A''_r, S_0, a_t...a_n).
In either case, if the stack is reduced, A'_r =>* A_r u, where u = a_s...a_{t-1}, and A''_r -> A_{r-1}A'_r. (From the correspondence between ID's and derivations.) Therefore, A_{r-1} precede A_r, so that (A_{r-1}, a_s) is a valid context and the configuration (A_{r-1}, A_r, a_s) is also valid.

Consider the second case. To erase a_s, the following sequence of ID's must occur:
(A_1...A_{r-1}, A_r, a_s...a_n) |-* (A_1...A_{q-1}, A_q, a_s...a_n)
|- (A_1...A_{q-1}, A'_q, a_{s+1}...a_n) or
|- (A_1...A_{q-1}A'_q, S_0, a_{s+1}...a_n),
where A_1...A_{q-1} is obtained from A_1...A_{r-1} by reducing the stack. There are two subcases to consider: either A_q = S_0 or not. In the second case, it is clear that A_q =>* wA_r. Furthermore, since the input is erased in state A_q, there is a production A'_q -> A_q a_s, so that a_s follow A_q and thus a_s follow A_r. Consequently, (A_{r-1}, a_s) is a valid context for a production A'_r -> A_{r-1}A_r. So (A_{r-1}, A_r, a_s) is a valid configuration. In the other case, consider the sequence of ID's which leads to a state in which A_{q-1} is removed from the stack. If
(A_1...A_{q-1}, A'_q, a_{s+1}...a_n) |-+ (A_1...A_{q-1}, A''_q, a_t...a_n)
|- (A_1...A_{q-2}, A'''_q, a_t...a_n) or
|- (A_1...A_{q-2}A'''_q, S_0, a_t...a_n)
and there is no shorter sequence which reduces A_{q-1}, then A'''_q -> A_{q-1}A''_q and A''_q =>* A'_q u. It is clear that A'_q -> a_s; therefore, a_s first A''_q, implying that a_s follow A_{q-1}. Hence, since A_{q-1} =>* wA_r, a_s follow A_r. As before, this leads to the conclusion that (A_{r-1}, A_r, a_s) is a valid configuration. In the first two cases, we ensure that the stack is non-empty by adding an additional symbol to mark the bottom of the stack and an extra state which removes this symbol and accepts if and only if the input has been consumed.
In the third case, it is easy to see that eventually the machine reaches a configuration in which one of the first two cases applies. In any case, the parser may erase an arbitrary number of input symbols, and it may erase an arbitrary number of pushdown symbols, but eventually it will arrive in a state in which it can detect the error.

As an illustration of a transformation of the first type, consider a state B in which all the rules are input erasing rules:

    C_1 -> Ba_1
    ...
    C_n -> Ba_n

with corresponding left context sets U_1,...,U_n. The initial W*_B table is

    W*_B(U_1,{a_1}) = (C_1, shift)
    ...
    W*_B(U_n,{a_n}) = (C_n, shift).

For any i, if (A_i,B,a_i) is a configuration, the only possible entry which could be applied is the one for W*_B(U_i,{a_i}). Applying transformation 1, the resulting table is

    W*_B(N,{a_1}) = (C_1, shift)
    ...
    W*_B(N,{a_n}) = (C_n, shift).

Similarly, if there are only stack reducing rules in a state, then the corresponding situation exists. In states in which both types of rules exist, or in which renaming rules exist, applying transformation 1 is non-trivial. Consider the following simple case in which state B has one stack reducing rule and one input erasing rule: C_1 -> Ba_1 with left context set {A_1,A_2}, and C_2 -> A_3B with right context set {a_1,a_2}. If the W*_B table is ordered as written, i.e.,

    W*_B({A_1,A_2},{a_1}) = (C_1, shift)
    W*_B({A_3},{a_1,a_2}) = (C_2, reduce),

no transformations can be made. On the other hand, if the order is reversed, then the table can be transformed to

    W*_B({A_3},T) = (C_2, reduce)
    W*_B(N,{a_1}) = (C_1, shift).

Clearly, the transformations which can be applied are dependent upon the order of the table entries. To simplify the criteria for applying transformation 1, the notion of conflict is introduced. Two entries W*_B(U_i,V_i) = (E_i,t_i) and W*_B(U_j,V_j) = (E_j,t_j) for which (E_i,t_i) ≠ (E_j,t_j) are left conflicting if U_i ∩ U_j ≠ ∅; they are right conflicting if V_i ∩ V_j ≠ ∅. It is easy to see that two entries cannot be both left and right conflicting; otherwise the machine with mapping W* would not be deterministic.

Entries which correspond to shift or rename actions are candidates for left compression; it is possible to transform them to left compressed entries. Likewise, entries which correspond to reduce or rename actions are candidates for right compression. Given these definitions, the criteria for applying transformation 1 can be restated: an entry which is a candidate for left, respectively right, compression can be transformed into a left, respectively right, compressed entry exactly when the state table is ordered so that every right, respectively left, conflicting entry precedes the candidate in the ordering.

A set of entries for which there is an ordering of the table that allows all members of the set to be either left or right compressed is a compression set. Note that in general, the largest compression set is a proper subset of the set of state table entries. Consider the following set of table entries in state B:

    W*_B({A_1},{a_1}) = (C_1, shift)
    W*_B({A_1},{a_2}) = (C_2, shift)
    W*_B({A_2},{a_1}) = (C_3, reduce)
    W*_B({A_2},{a_2}) = (C_4, reduce)

It is easy to see that there is no ordering which satisfies the criteria given. In the following sections, we present schema for finding compression sets, table orderings and transformations; in the next section a convenient framework for developing these schema is presented, using tools borrowed from graph theory.

3.4 CONFLICT GRAPHS

To assist in understanding this section, we present some terminology and notation from graph theory. In general, the notation used is from Harary [Harary 69]. A graph G=(X,E) consists of a finite nonempty set, X, of p vertices together with a (possibly empty) set, E, of q distinct unordered pairs of distinct vertices from X. Each pair in E is called an edge.
An edge e={u,v} can be written as e=uv or as e=vu. If uv is an edge, then u and v are incident with the edge uv, and u and v are adjacent. A vertex v is an isolated vertex if it is not incident with any edges (adjacent to any vertices). A subgraph of a graph G is a graph G' all of whose vertices and edges are in G. If G' is a subgraph of G, we write G' ⊆ G. For a subset S of the vertices of G, the subgraph induced by S, written <S>, is the maximal subgraph (in the sense that no edge uv in G can be added to the subgraph) of G with vertex set S. A spanning subgraph of G, G', is a subgraph whose vertex set is the same as that of G. The complement of a graph G is the graph on the same vertex set in which uv is an edge if and only if it is not an edge of G. A complete graph is a graph in which every two distinct vertices are adjacent. The complete graph on p vertices is denoted K_p. An independent set, S, is a set of vertices no two of which are adjacent. Hence, if S is an independent set of G, <S> is a complete subgraph of the complement of G. The union of two graphs G_1 and G_2, denoted G_1 ∪ G_2, is the graph whose vertex set is the union of the vertex sets of G_1 and G_2 and whose edge set is the union of the edge sets of G_1 and G_2. A bipartite graph G is a graph whose vertex set X can be partitioned into two subsets X_1 and X_2 so that if uv is an edge of G, then u is in X_1 and v is in X_2 (or vice-versa). If every vertex in X_1 is adjacent to every vertex in X_2, then G is a complete bipartite graph. The complete bipartite graph with |X_1| = m and |X_2| = n is denoted K_{m,n}.

Obviously, a graph is a representation of a symmetric binary relation on a set of objects; here, the binary relation is the conflict relation developed in the previous section. To represent this relation, two graphs, called conflict graphs in this thesis, are defined.

Definition 3.1

A conflict graph pair for a set of state table entries is a pair of graphs

    G_R = (X,E_R), the right conflict graph,
    G_L = (X,E_L), the left conflict graph,

such that X is in 1-1 correspondence with the set of table entries and, for u, v in X,

    uv is in E_L iff the entries corresponding to u and v are left conflicting;
    uv is in E_R iff the entries corresponding to u and v are right conflicting.

Whenever it is not ambiguous to do so, we will use the terms vertex and entry interchangeably. The relations adj_L and adj_R are defined such that u adj_L v if uv is in E_L and u adj_R v if uv is in E_R.

As an example, the graph pair shown in figure 3.1 corresponds to the last set of table entries in the previous section. It is clear that G_L is a spanning subgraph of the complement of G_R and vice versa, inasmuch as two entries cannot be both left and right conflicting.

Figure 3.1

Graph pairs will be useful in visualizing conflicts, formulating algorithms and proving theorems. The first such theorem yields a simple test for detecting compressible sets given the conflict graph pair. The theorem requires the following simple definition: a free vertex in a subset S is a vertex which either corresponds to a left compression candidate and is an isolated vertex in <S>_R or corresponds to a right compression candidate and is an isolated vertex in <S>_L.

Theorem 3.1

A subset Y of X, the vertex set for conflict graphs G_L and G_R, is a compression set if and only if for every subset, S, of Y there is a free vertex in S.

Proof: If. Since Y is a subset of itself, there is a free vertex, y_1, in Y. Let Y_1 ⊇ Y_2 ⊇ ... ⊇ Y_n be a chain of subsets of Y such that Y_1 = Y, Y_2 = Y - {y_1} and, in general, Y_{i+1} is constructed from Y_i by removing one of the free vertices in Y_i, say y_i. Corresponding to the sequence of Y_i, then, is a sequence of vertices, the y_i. Consider an arbitrary y_i. Without loss of generality, we can assume it to be a left compression candidate and isolated on <Y_i>_R. Since it is isolated on <Y_i>_R, there is no y_j, j > i, which is right conflicting with it; otherwise y_iy_j would be an edge of <Y_i>_R.
From this, it is easy to see that the sequence y_n, y_{n-1}, ..., y_1 corresponds to an ordering of the state table which allows simultaneous compression of the entries in Y.

Only if. Let U be a subset of Y containing no free vertex. For any ordering of the vertices of Y, say y_1, y_2, ..., y_n, there is a smallest j for which y_j is in U. Suppose y_j corresponds to a left compression candidate; then there is an edge y_jy_k in <U>_R. By the assumption that j is smallest, k > j. Thus there is a conflicting vertex not preceding y_j. A similar result holds if y_j corresponds to a right compression candidate. Since the ordering was arbitrary, it follows that there is no ordering for which every entry right conflicting with a left compression candidate precedes it and every entry left conflicting with a right compression candidate precedes it.

A set which does not contain a free vertex will be called an irreducible set. The theorem just given provides the following algorithm, which tests a set Y for being a compression set.

Algorithm 3.1

Input: A pair of conflict graphs, G_L=(X,E_L) and G_R=(X,E_R), together with the set of left compression candidates, L, and the set of right compression candidates, R.

Output: An irreducible subset of X.

Method: L and R are searched until a free vertex is found; the vertex is removed and the algorithm repeats. If no free vertex is found, the algorithm halts, returning the remaining subset.

(1) Set Y = X.
repeat
  (2) find a vertex, y, in L ∩ Y such that adj_R y ∩ Y = ∅
  (3) if (2) fails, find a vertex, y, in R ∩ Y such that adj_L y ∩ Y = ∅
  (4) if either (2) or (3) succeeds, remove y from Y; else return Y and halt

The algorithm presented can be improved upon; a better algorithm is presented in the next chapter.

3.5 FINDING A BEST COMPRESSION SET

3.5.1 ANALYSIS

Although it is easy to test for the property of compressibility, it is not easy to construct an efficient method for finding a largest possible compression set, as will be shown.
First, it is necessary to define what we mean by efficient. It is customary to say that an algorithm is efficient only when the time required to solve a problem is bounded by a polynomial function of the size of the problem. Therefore, a method which examines every subset of the input set and tests for compressibility is inefficient; it requires time exponential in the size of the input set.

Unfortunately, the problem of finding the largest compression set belongs to a class of problems called non-deterministic polynomial-time-complete (NP-complete) problems. An NP-complete problem is one for which there is a nondeterministic polynomial-time algorithm and which is as hard as any other problem for which there is such an algorithm. That is, if there is a deterministic polynomial time algorithm for the problem, then there is one for every NP problem. For a discussion of NP-complete problems, see [Karp 72], [Aho 74].

In order to show that finding a largest compression set is NP-complete, it is necessary to show that there are conflict graphs corresponding to any arbitrary pair of graphs G_1 and G_2 such that G_1 is a subgraph of the complement of G_2 and G_2 is a subgraph of the complement of G_1.

Lemma

If G_1 and G_2 are arbitrary graphs such that each is a subgraph of the complement of the other, then there is a corresponding pair of conflict graphs G_L and G_R such that G_1 = G_L and G_2 = G_R. (The roles of G_L and G_R are reversible.)

Proof

It is easily seen that G_1 and G_2 must have the same vertex set X. Let the vertex set, X, of G_1 and G_2 be x_1, x_2, ..., x_n. Construct a set of table entries

    W*_B(U_1,V_1) = (E_1, rename)
    ...
    W*_B(U_n,V_n) = (E_n, rename)

such that if x_ix_j is an edge of G_1, then U_i ∩ U_j ≠ ∅, and if x_ix_j is an edge of G_2, then V_i ∩ V_j ≠ ∅. This situation is trivially constructible: let A_11, ..., A_nn be elements of N and a_11, ..., a_nn be elements of T. U_i and U_j contain A_ij exactly when x_ix_j is an edge of G_1, and V_i and V_j contain a_ij exactly when x_ix_j is an edge of G_2.

Suppose the largest independent set of G is of size k.
Assuming all vertices represent candidates for both left and right compression (renaming actions), the size of the largest compression set is exactly p+k. Certainly, there is a set which is this large. Take a largest independent set of G, S. Then by theorem 3.1, S ∪ X' is a compression set, since every vertex of S is free on any subset of S ∪ X' containing it, and every vertex of X' is free on any subset of X'. On the other hand, if Y is a compression set, it obviously cannot contain more than p vertices from X'. Further, it cannot contain more than k vertices from X. For suppose it did. Then X ∩ Y must contain at least k+1 vertices. Consequently there must be two vertices in X ∩ Y, say u and v, such that uv is an edge of G. In that case, {u,v} ∪ X' is an irreducible set.

That the problem is in NP is easy to see: an enumeration algorithm which constructs and tests each subset of the vertex set requires polynomial time if it is run on a nondeterministic machine.

It should be noted that it is an unsettled question whether there are polynomial-time algorithms for NP-complete problems. It is conjectured that this is not the case, based on the fact that no such algorithms have been found for a number of independent problems. Therefore, while it is marginally possible that there is an efficient algorithm to find a largest compression set, it does not seem likely.

3.5.2 A BACKTRACK ALGORITHM

The following algorithm determines the largest compression set for a given state table by partial enumeration. The foundation for this algorithm is the simple fact that if U and U' are subsets of a vertex set X of a conflict graph pair such that U' ⊆ U, U induces an irreducible configuration, and U' is a compressible set, then there is a vertex in U' which is not isolated in one of <U>_L or <U>_R but is isolated in the corresponding subgraph induced by U'. For example, consider vertex 3 in figure 3.2.

Figure 3.2

Algorithm 3.2

Input: A pair of conflict graphs, G_L=(X,E_L) and G_R=(X,E_R), together with the set of left compression candidates, L, and the set of right compression candidates, R.

Output: The largest compression set for the graph pair.

Method: C is a function which returns the largest compression set. It accepts as input parameters U and D, where U is a subset of vertices which contains the remainder of the compression set, and D is a set of vertices which can be discarded from U when an irreducible configuration is discovered. The result returned by C must contain D or be the empty set. C(U,D) is given by:

(1) Find an irreducible subset of U, V (see algorithm 3.1).
(2) if V = ∅, return U
(3) let x be a vertex in the smallest set contained in
      {S | S ⊆ D and (S = adj_R v ∩ V for some v in V ∩ L, or S = adj_L v ∩ V for some v in V ∩ R)}
    (if there is no such set, return ∅)
(4) recur, letting A = C(V-{x}, D-{x})
(5) if A ≠ V-{x}, then recur, letting B = C(V, D-{x}); else return A ∪ (U-V)
(6) return the larger of A ∪ (U-V) and B ∪ (U-V)

3.6 TABLE ORDERING

Finding a largest compression set only solves part of the problem; it is desirable to find a good ordering of the table entries. It is important especially to collapse the final, unconditional entries in the table. It is also important to order the tables to take advantage of the merging transformation. There are several possible strategies for ordering the compressible set to obtain good merging to be considered.

The first strategy is a simpleminded one. After finding a compressible set Y, algorithm 3.1 is applied to the set. The order in which the algorithm removes free vertices is used to generate the table ordering. The algorithm is clearly suboptimal; the arbitrary selection of the next vertex to remove can eliminate possible unconditional entries or prevent other merging.
For example, consider the following state table:

    W*_B({A_1},{a_1}) = (E_1, rename)
    W*_B({A_2},{a_2}) = (E_1, rename)
    W*_B({A_3},{a_2}) = (E_2, reduce)
    W*_B({A_1},{a_3}) = (E_3, shift)

The conflict graphs corresponding to this table are shown in figure 3.3.

Figure 3.3

Notice that all four vertices are free. The ordering (1,2,3,4) is equally as likely as any other. This ordering results in the following transformed table:

    W*_B(N,{a_1}) = (E_1, rename)
    W*_B({A_2},T) = (E_1, rename)
    W*_B({A_3},T) = (E_2, reduce)
    W*_B(N,{a_3}) = (E_3, shift)

If an optimal strategy is used, the table can be transformed to

    W*_B({A_3},T) = (E_2, reduce)
    W*_B(N,{a_3}) = (E_3, shift)
    W*_B(N,T) = (E_1, rename)

Obviously, the strategy can be improved by employing heuristics in determining which vertex is to be removed. A polynomial-time suboptimal algorithm will result, but one which can avoid the obvious bad orderings such as the one just given. Such a method is actually used in the implementation.

If it is sufficient to find an ordering which exploits only the collapsing of unconditional entries, then there is an optimal method for doing this which requires polynomial time. As illustrated in the last example, the renaming rule entries all corresponded to free vertices in the entire vertex set. The decision to choose vertices 3 and 4 allowed the two renaming rule entries to be output last and consequently to be collapsed into one unconditional entry. It is clearly desirable to collapse a large number of entries in general. The following lemma shows that there is a simple method for determining the optimal ordering to do this, assuming that the renaming rule to which the unconditional entry corresponds has been determined.

Lemma

Let Y be a compression set. Suppose there is an ordering of entries such that a set of vertices, S, corresponds to a group of unconditional entries.
Then if U is a subset of Y containing an element not in S, there is a free vertex in U that is not in S.

Proof

Suppose not. Then there is a subset U such that some x in S is the only free vertex. There is some ordering of the table entries corresponding to vertices in Y that allows a compression transformation. Let S_L be the set of left compressed entries and S_R the set of right compressed entries in the resulting transformation. Now, since discarding entries cannot create additional conflicts, it is clear that there is an ordering of the entries in U such that vertices in S_L ∩ U are left compressible and those in S_R ∩ U are right compressible. Let that ordering on U be y_1, y_2, ..., y_m. Then y_1 = x, because x is the only free vertex in U. Also, m > 1, since |U| > 1 by the hypothesis. Let y = y_2. Without loss of generality, assume adj_L x ∩ U = ∅; if it were not, then removing y from U - {x} would not create a free vertex. Therefore, y is in S_R. But since x is in S, all right and left conflicting entries precede it, so that S_L ∩ adj_R x ...

If A -> w is a rule which is not SLR(1), then the grammar is modified so that derivation trees of one form are transformed to another. This process is known as look-ahead reduction. Look-ahead reduction consists of two transformations. The first is right-context extraction: if A -> w is an offending rule and B p* A, then the grammar is said to be right-context extracted iff C -> uBy in P implies that either y is the empty string or y = av for some terminal, a; right-context extraction is the conversion of a grammar to a right-context extracted form. The second transformation is premature scanning: if A -> w is an offending rule, B p* A and C -> uBav is in P, then Ba is combined to [Ba], a non-terminal; the string Ba is replaced by [Ba] in all right parts in which it appears. Also, rules C -> wX, where B p* C, are replaced by [Ca] -> wXa if X is a terminal symbol and by [Ca] -> w[Xa] if X is a nonterminal.
If a grammar is not SLR(1), then the look-ahead reduction transformations are applied to yield a new grammar. The process is repeated until the transformed grammar is finally SLR(1). If the original grammar is LR(k), then at most k applications of the transformation are required. The SLR(1) grammar is subsequently transformed to (1,1)BRC using 'state splitting' and left stratification; for details, see [Graham 71]. The (1,1)BRC grammars are then transformed to the internal normal form by repeated left factorization. Given productions P = {A_i -> uB_iv}, left factorization replaces the set by P' = {A_i -> [u]B_iv} and P'' = {[u] -> u}; [u] is added to the nonterminals. Application is repeated as long as there is a factor of length greater than 2. Eventually, all productions are of the proper form. However, there may be situations such that A -> BC and A -> DB (or A -> Ba). This is called a stack-state conflict; such conflicts can be removed by further application of left factorization, replacing A -> BC by A -> B'C and adding B' -> B.

4.4 CONTEXT COMPUTATION

4.4.1 ALGORITHMS

The second phase of the system involves the computation of the canonical mapping, W*. For the sake of convenience, the context sets are attached to the productions of the grammar. The productions for each state are kept in lists, so that at the end of the phase, the mapping is available for each state. The computation and collection of context sets is performed in three subphases: the first subphase generates the binary relations (and inverses) given in section 2.5; the second generates the first, follow and precede sets from these relations; and the last associates the context sets with the appropriate productions. As we showed in 2.5, to compute the necessary context sets, it is sufficient to compute the first, follow, and precede sets for each nonterminal.
This is done in two steps: in the first, the relations λ and ρ and their inverses are computed; in the second, the transitive closures of λ and ρ are computed and the relations composed to construct the sets. The first step is accomplished in one pass through the production rules. For each production, the partially computed sets FIRST, FOLLOW, and PRECEDE and the binary relations are computed as follows:

1. For p_i = A_i -> A_{i1}A_{i2}, A_{i1} is added to λA_i, and A_{i1} is added to PRECEDE(A_{i2}).
2. For p_i = A_i -> A_{i1}a_{i1}, A_{i1} is added to λA_i, and a_{i1} is added to FOLLOW(A_{i1}).
3. For p_i = A_i -> A_{i1}, A_{i1} is added to λA_i and to ρA_i.
4. For p_i = A_i -> a_{i1}, a_{i1} is added to FIRST(A_i).

In this step, we also construct a directed graph, the renaming graph, representing the renaming rule structure of the grammar. The vertices of the graph are the nonterminals of the grammar. A labeled edge from A_i to A_{i1} is added for each renaming rule p_i = A_i -> A_{i1}; the label applied is the index of the production.

To complete the generation of the sets FIRST, FOLLOW, and PRECEDE, transitive closure is required. Closure is computed by the following O(n³) algorithm, where n is the number of nonterminals. (Actually, the algorithm is O(n²m), where m is the number of computer words required to store n bits.)

Input: A binary relation, β, represented as an array of bit vectors.
Output: The transitive closure of β, β*.

(1) Set β*x_i to {x_i} ∪ βx_i, for all x_i.
(2) for i from 1 to n do
      for all x_j such that x_i is in β*x_j, include β*x_i in β*x_j.

It is easy to verify that this algorithm is correct. After the closures are computed, FIRST, FOLLOW, and PRECEDE are completed as follows:

1. FIRST(B) is included in FIRST(A) for all A λ+ B.
2. PRECEDE(A) is included in PRECEDE(B) for all A λ+ B.
3. FIRST(A) is included in FOLLOW(B) for all A, B such that there is a production C -> BA.
4. FOLLOW(A) is included in FOLLOW(B) for all A ρ+ B.

The context sets are attached to productions in a manner prescribed by the procedure given in 2.5.
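The closure algorithm can be sketched in Python, using integers as the bit vectors so that the word-parallel inclusion of step (2) becomes a single integer "or"; the list-of-ints representation is assumed for illustration.

```python
def transitive_closure(beta, n):
    """beta: list of n ints; bit j of beta[i] set iff x_i beta x_j.
    Returns the reflexive-transitive closure, as in steps (1) and (2)."""
    star = [beta[i] | (1 << i) for i in range(n)]   # step (1): add {x_i}
    for i in range(n):                              # step (2)
        for j in range(n):
            if star[j] >> i & 1:                    # x_i in beta*(x_j)
                star[j] |= star[i]                  # include beta*(x_i)
    return star

# A chain x_0 beta x_1 beta x_2:
closure = transitive_closure([0b010, 0b100, 0b000], 3)
```

This is Warshall's algorithm on bit-vector rows: n² row scans, each costing the m words of one row, giving the O(n²m) bound quoted above.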
For each input erasing and stack reducing production, p, the appropriate context sets are determined and attached to the production; further, they are added to every production, p', which labels an edge in the renaming graph that is in any path from a nonterminal in the right part of p. Traversing the renaming graph is a simple process; there can be no cycles in the graph, nor can there be multiple paths between vertices. Therefore, each nonterminal can be encountered at most once in any traversal of the renaming graph. Thus, distributing context sets is an O(mn) algorithm, where m is the number of input erasing and stack reducing rules and n the number of nonterminals.

4.4.2 REPRESENTING CONTEXT SETS

The context information for each production is recorded as an index into a table of context pairs. The pairs map into two tables of bit vectors, one table each for the left and right context sets. Each bit vector in a table is unique; when a new context set is generated during context computation, a hash addressing scheme is used to locate the corresponding bit vector, and if it is not present, it is added to that table. The tables are initialized so that all the singleton subsets of N are included in the left context set table and all the singleton subsets of T are included in the right context set table. Entries with the same hash address are chained together. Figure 4.2 illustrates the layout of the context tables.

Figure 4.2

4.5 CONFLICT GRAPH GENERATION

The third phase generates conflict graphs for each state. The pair of graphs and the sets of left and right compression candidates are generated for all productions in the list for the specific state. The graphs are represented by the vertex set and the two adjacency relations; the adjacency relations are computed from the definition given in section 3.4.
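The adjacency computation of this phase can be sketched as follows. Entries are written (U, V, E, t) with U and V as Python sets, a representation assumed here for illustration; entries with equal (E, t) never conflict, and determinism guarantees that no pair is both left and right conflicting.

```python
def build_conflict_graphs(entries):
    """Return adjacency maps adjL, adjR over entry indices, following the
    conflict definitions of section 3.4."""
    n = len(entries)
    adjL = {i: set() for i in range(n)}
    adjR = {i: set() for i in range(n)}
    for i in range(n):
        Ui, Vi, Ei, ti = entries[i]
        for j in range(i + 1, n):
            Uj, Vj, Ej, tj = entries[j]
            if (Ei, ti) == (Ej, tj):
                continue                      # identical actions never conflict
            if Ui & Uj:                       # shared left context
                adjL[i].add(j); adjL[j].add(i)
            if Vi & Vj:                       # shared right context
                adjR[i].add(j); adjR[j].add(i)
    return adjL, adjR

# The renaming example of section 3.6:
entries = [({"A1"}, {"a1"}, "E1", "rename"),
           ({"A2"}, {"a2"}, "E1", "rename"),
           ({"A3"}, {"a2"}, "E2", "reduce"),
           ({"A1"}, {"a3"}, "E3", "shift")]
adjL, adjR = build_conflict_graphs(entries)
```

For that table the only edges are a left conflict between entries 0 and 3 and a right conflict between entries 1 and 2, so all four vertices are free, as the text observes.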
Each vertex is mapped into a triple consisting of a production rule index and indices into the context set tables.

4.6 STATE TABLE GENERATION

Phase four creates state tables from the graphs created in phase three. The tables are generated in three subphases. The first divides the table entries into two partitions: full entries and compressed entries. The second subphase orders the entries, and the third performs the compression and merging transformations and generates an intermediate representation of the state table. In the implementation, the second and third subphases are carried out concurrently; for ease of presentation, these subphases will be treated as separate sequential steps.

The process used to partition the full and compressed entries is very simple. The following structure is used.

Algorithm 4.5.1

Input: A set of vertices, X, and a pair of adjacency relations, adj_L and adj_R.
Output: FULL, a set of full entries, and COMPRESS, a set of compressible entries.

(1) Let Y = X.
repeat
  (2) Remove vertices from Y until there is a free vertex in the subset; add the vertices removed to FULL.
  (3) Generate S, a set with no free vertices, by removing free vertices from Y until it is not possible to continue; add the vertices removed to COMPRESS.
  (4) Let Y be the set of remaining vertices in S.
until Y = ∅.

(4) If |adj_L y ∩ S| < |adj_R x ∩ S|, then set S = S - adj_L y; otherwise, set S = S - adj_R x.

When this algorithm is used as step 3 of algorithm 4.5.1, the result is the same as would be obtained from the backtracking algorithm 3.2 if it halted after finding the first compression set. Since algorithm 4.5.3 is O(n), algorithm 4.5.1 is O(n² + ne).

Given that the partitions have been computed, it is necessary to order both sets. We will address only the problem of ordering the compressed entries here.
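Before turning to ordering, the partitioning step can be sketched in Python. The adjacency-map representation is assumed for illustration, and the choice of which vertex to demote to FULL when no free vertex exists is made greedily here, where the thesis uses algorithm 4.5.3 to make that choice.

```python
def partition_entries(X, adjL, adjR, L, R):
    """Split vertices into FULL (kept as full entries) and COMPRESS
    (a compression set), in the spirit of algorithm 4.5.1."""
    def free(v, Y):
        # The free-vertex test of theorem 3.1.
        return (v in L and not (adjR[v] & Y)) or (v in R and not (adjL[v] & Y))

    Y, full, compress = set(X), set(), set()
    while Y:
        v = next((v for v in sorted(Y) if free(v, Y)), None)
        if v is None:
            v = min(Y)          # greedy stand-in for algorithm 4.5.3
            full.add(v)
        else:
            compress.add(v)     # peeled free vertices form the compression set
        Y.discard(v)
    return full, compress

# The renaming example of section 3.6 (all four vertices free): everything
# is compressible and FULL stays empty.
adjL = {0: {3}, 1: set(), 2: set(), 3: {0}}
adjR = {0: set(), 1: {2}, 2: {1}, 3: set()}
full, compress = partition_entries({0, 1, 2, 3}, adjL, adjR,
                                   L={0, 1, 3}, R={0, 1, 2})
```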
We showed in section 3.5 that there is a polynomial algorithm to find an ordering which provides maximum exploitation of the unconditional entries; further, we gave a bound for the algorithm. In the current implementation a faster, albeit suboptimal, ordering algorithm is used. Time is saved by not repeating the ordering process for each renaming rule in the grammar; instead, the best choice is "predicted" and the algorithm defers outputting the corresponding entries when it can. At the same time, the algorithm attempts to improve table size by setting up mergeable sequences of entries. We present a slightly simplified version of the algorithm in use; additional detail does not add to the discussion. To simplify notation, let R(v) be the largest set containing vertex v and consisting of vertices that correspond to entries for the same rule.

Algorithm 4.5.4

Input: A compressible set, Y.
Output: An ordering of Y and a compression scheme s, a function from Y to the set {left, right, both, either}.

(1) Let L = the left compression candidate set and R = the right compression candidate set.
repeat
  (2) If there exists x in (L-R) ∩ Y such that adj_R x ∩ Y = ∅, then output x, set s(x) = left, and remove x from Y.
  (3) else if there exists x in (R-L) ∩ Y such that adj_L x ∩ Y = ∅, then output x, set s(x) = right, and remove x from Y.
  (4) else let U_L = {x in L ∩ R | adj_R x ∩ Y = ∅} and U_R = {x in L ∩ R | adj_L x ∩ Y = ∅}; pick y in U_L ∪ U_R such that |R(y) ∩ Y| is smallest. If y is in U_L (respectively U_R), output R(y) ∩ U_L (R(y) ∩ U_R), and for all x output set s(x) = left (respectively right) and remove x from U_L, U_R and Y.
until Y = ∅.

4.7 ARRANGING

The right parts of the initial state rules are single terminals a_i, where the a_i are all distinct. If the input symbols which are the right parts of the initial state rules are mapped one-to-one onto an interval 1..n, it is trivial to arrange the initial state table so that the right context for the i-th entry is the i-th input symbol. This allows direct access: if a symbol is in the proper range, the action for the corresponding entry is performed; if the symbol is out of range, an error is detected.
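The direct-access scheme for the initial state can be sketched in a few lines, assuming the input symbols that are right parts of initial state rules are encoded 1..n and the actions are stored in that order (the action strings here are invented placeholders).

```python
def initial_state_action(actions, symbol):
    """Direct access: the i-th entry's right context is the i-th input
    symbol, so the symbol indexes the action table directly."""
    if 1 <= symbol <= len(actions):
        return actions[symbol - 1]   # in range: perform the i-th action
    return "error"                   # out of range: error detected

acts = ["shift C1", "shift C2", "shift C3"]
```

No search through the table entries is needed, which matters because the parser returns to the initial state so often.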
By using direct access, the time spent accessing entries in the initial state is reduced dramatically. Because the parser is frequently in the initial state, and because the remaining states generally have few entries, this transformation allows the parser to be time efficient as well as compact.

States are arranged to take advantage of the properties of unconditional entries. Consider an unconditional entry for which the state/stack symbol is a state and for which there is no semantic action. Since the only action performed by the parser for that entry is a transition from one state to another, it is possible to eliminate the entry by concatenating the states. The second state is treated as an "entry point" of the first. To determine an ordering which allows the exploitation of this property, the following procedure is used. We construct an inverse forest, the nodes of which represent states. For each state, B_1, in which the unconditional entry specifies a transition to a state B_2 without a semantic action, define an edge from B_1 to B_2. At each node there are clearly m >= 0 incoming edges and n <= 1 outgoing edges. For every node such that m > 1, remove m-1 edges arbitrarily. The result of this process is a collection of disjoint chains; each chain specifies the ordering of a group of states. By arbitrarily ordering the chains and preserving the ordering within them, the state tables are arranged to take advantage of the special unconditional entries.

The final arranging scheme we consider is rearranging the stack and input vocabularies. This is done to reduce the space required to store the auxiliary tables of bit vectors. We have assumed that it is necessary to accommodate the largest possible set in establishing the length of the (fixed-length) bit vectors. However, if no set contains the last k elements of this largest set, it is possible to truncate the bit vectors and to specify the range of symbols in the sets.
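The inverse-forest chain construction described above can be sketched as follows. An edge B1 -> B2 records that B1's unconditional entry is a bare transition to B2; extra incoming edges at a node are discarded, and the surviving edges are read off as disjoint chains. The state names are invented, and cycles are assumed absent, as in the text.

```python
def state_chains(states, trans):
    """trans: dict src -> dst (each state has at most one outgoing edge).
    Returns the disjoint chains after dropping surplus incoming edges."""
    kept, targets = {}, set()
    for src in sorted(trans):
        dst = trans[src]
        if dst not in targets:       # keep one incoming edge per target
            kept[src] = dst
            targets.add(dst)
    chains = []
    for s in sorted(states):
        if s not in targets:         # chain heads have no kept incoming edge
            chain = [s]
            while chain[-1] in kept:
                chain.append(kept[chain[-1]])
            chains.append(chain)
    return chains

chains = state_chains({"B1", "B2", "B3", "B4"},
                      {"B1": "B3", "B2": "B3", "B4": "B1"})
```

Here both B1 and B2 transition to B3; one of the two incoming edges is dropped, leaving the chains [B2] and [B4, B1, B3], and each chain is then laid out contiguously in the output tables.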
The symbols in the sets can be arranged so that any symbol not appearing in any set used is assigned a numeric encoding larger than a specified range. That range is given by |∪X_i|, where the X_i are the left (or right) context sets. In arranging the input symbols, the encoding of the symbols used as right parts of initial state rules must be preserved. The remaining symbols can be rearranged.

4.8 ENCODING THE TABLES

The final phase of the PGS encodes the state tables, putting them into a form usable on the parser's host machine. The tables can be output as binary files, macro definitions, initialization code, or another convenient representation; without loss of generality, we will assume the output to be in the form of macro definitions. Input to the encoding phase consists of three distinct groups of data. State tables are input as an array of six-tuples consisting of a context pair, a state/stack symbol, an acceptor action, a semantic action, and a compression mode. A separate table contains the locations of the state tables in the first array and the lengths of the state tables. The third data group is the pair of context set tables.

Output consists of a sequence of pseudo-instructions, which can be thought of as machine instructions for the parsing machine. There are four primitive actions performed by the machine:

1. Verify the current input symbol and apply an action.
2. Verify the current top of stack symbol and apply an action.
3. Apply an action unconditionally.
4. Indicate an error has occurred and enter error mode.

For the first two types of pseudo-instruction, the action is applied only if the symbol is matched. If the symbol is not matched, the next pseudo-instruction is fetched and interpreted. The pseudo-instruction format is shown in figure 4.3.

    operation | verification index | reduction flag | transition flag | symbol/address

Figure 4.3

The operation field indicates one of the four instruction types already described.
The verification index is an index into a table of bit vectors or the index of a symbol. (In order to avoid storing singleton sets, the index values 1...m, where m=|N| or |T|, are used to represent the sets {1}...{m}, and the table entries 1...k are represented by index values m+1...m+k.) The reduction flag indicates whether the symbol verified should be erased. The transition flag specifies how the symbol/address field should be used: as the address of a pseudo-instruction or as a symbol to add to the stack. For the third type of pseudo-instruction, the verification index and reduction flag are unused; for the error pseudo-instruction, only the operation and address fields are used, the latter to indicate the state in which the error occurred. In the examples used in the appendix, the operation codes are given by INPUT, STACK, ANY, and ERROR. The reduce flag values are represented symbolically as RED (erase/reduce the symbol) and NORED (do not). The transition flag values are ADD (add to the stack) and GOTO (next instruction to perform is at the address specified). To encode a full entry, we use three instructions. The first checks one context and conditionally "branches" to a substate in which the second context is checked. The third instruction effects a return to the main state to execute further instructions. For example, the state given at the end of section 3.3 is encoded in figure 4.4; semantic actions are not shown. Addresses are given in symbolic form. In a three instruction sequence, if the stack is reduced or the input shifted, that action must be performed in the second instruction. Further, the semantic action is performed before any symbol is consumed.
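The verification-index convention above (singletons encoded directly, stored tables offset by m) might be decoded as in this sketch; the function names are our invention.

```python
# Decode a verification index: values 1..m denote the singleton sets
# {1}..{m} (no table entry is stored for them), while values
# m+1..m+k select stored bit-vector table entries 1..k.

def verification_set(index, m, tables):
    """Return the set of symbol codes the index denotes."""
    if 1 <= index <= m:
        return {index}              # singleton: encoded in the index
    return tables[index - m - 1]    # stored table entry

def matches(symbol, index, m, tables):
    """Would this pseudo-instruction's verification succeed?"""
    return symbol in verification_set(index, m, tables)
```

The saving is that the common case, a single-symbol check, costs nothing beyond the index field itself.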
S1:  INPUT,1,NORED,GOTO,S11
S12: INPUT,1,RED,GOTO,S2
     INPUT,2,RED,GOTO,S3
     STACK,1,RED,GOTO,S2
     ERROR,1
S11: STACK,2,RED,GOTO,S5
     ANY,GOTO,S12

Figure 4.4

CHAPTER 5

RESULTS AND CONCLUSIONS

5.1 RESULTS

The principal claim of this thesis has been that the PGS is a flexible system that produces compact parsers. The results presented here support that claim. Parsers produced by the PGS are compared with those obtained using well-known, efficient systems. Parse table sizes are given for three grammars: a simple expression grammar (see Appendix A), and grammars for XPL and Pascal. Results for XPL are of particular interest, as they provide a basis for direct comparison with the LR parsers produced by Anderson, Eve, and Horning and by Joliat. Their methods are the most efficient reported in the literature. The technique of Anderson, et al. produces a list structured parse table, while Joliat produces a sparse-matrix representation. Indirect comparison is also available; Anderson, et al. give results for Algol W, a language comparable in size and complexity to Pascal. As an additional benchmark, results from a previous version of the PGS are shown. The first table provides a complete synopsis of the results obtained. The number of table entries as shown indicates the minimum number of entries produced using all transformations and assuming no semantic actions are required (thus allowing the elimination of some unconditional entries). The space required by the tables is given for three different situations. The first figure given is a measure of the space required when the smallest possible number of bits is used to encode the pseudo-instructions, and assumes semantic actions are not required. The second figure given is based on a fixed, 30-bit pseudo-instruction length. The third figure assumes that no unconditional entries can be eliminated. The size of the tables produced for a compiler is likely to fall between the second and third figures.
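The fixed 30-bit figure used in the results assumes each pseudo-instruction is packed into a single word. A possible packing is sketched below; the field widths (2-bit operation, 10-bit verification index, two 1-bit flags, 16-bit symbol/address) are our assumption, chosen only because they sum to 30 bits, and are not taken from the thesis.

```python
# Pack one pseudo-instruction into a fixed-width word and unpack it
# again.  Field widths are illustrative: 2 + 10 + 1 + 1 + 16 = 30.

OP_BITS, VIX_BITS, ADDR_BITS = 2, 10, 16

def pack(op, vix, red, add, addr):
    word = op                                   # operation code
    word = (word << VIX_BITS) | vix             # verification index
    word = (word << 1) | red                    # reduction flag
    word = (word << 1) | add                    # transition flag
    word = (word << ADDR_BITS) | addr           # symbol/address
    return word

def unpack(word):
    addr = word & (1 << ADDR_BITS) - 1; word >>= ADDR_BITS
    add  = word & 1;                    word >>= 1
    red  = word & 1;                    word >>= 1
    vix  = word & (1 << VIX_BITS) - 1;  word >>= VIX_BITS
    return word, vix, red, add, addr
```

The "smallest possible number of bits" figure corresponds to shrinking each of these fields to the minimum width the particular grammar requires.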
          Entries   Bits    Bits    Bits    Stack   Input
          (**)      (**)            full    sets    sets

Pascal    349       7712    11370   12300   13      31
XPL       163       3296     5010    5820    3      27
AEG (*)    13        130      390     450    1       2

Table 5.1

(*) AEG is a simple expression grammar
(**) Unconditional entries removed by concatenating states.

Table 5.2 shows the comparative figures for the methods of Anderson, et al., Joliat, and the previous PGS version. Table sizes are shown in bytes. Additional space is required for interpreters; however, this space is roughly the same in all cases. The table sizes reported for the new PGS also include all space required for context sets.

          PGS (old)   PGS (new)   Anderson   Joliat
XPL       1023        627         1128       1966*
PASCAL    2480        1421
ALGOL W                           2434
AEG        200         49          232

Table 5.2

(*) Joliat indicates that by eliminating error entries and doing full matrix reduction, a table size of 1000 can be obtained.

The limited data indicates that the compression techniques used in the PGS can potentially produce parse tables 25-35% smaller than those produced by very efficient LR methods.

5.2 CONCLUSIONS

This thesis has described a powerful parser generating system that is capable of producing efficient parsers for a wide class of grammars, namely the class of LR grammars. The contribution made by the thesis research is the development of techniques to achieve very good table compression. The techniques given in chapter 4 are efficient in that they do not require an exorbitant amount of computing resources in exchange for producing compact parse tables. The resulting system is capable of generating parsers for a large group of languages and making them available on small machines, i.e., mini- and microcomputers.

5.3 EXTENSIONS AND REFINEMENTS

There are several possible refinements to the system which have not yet been investigated. Additional heuristics for the compression algorithms might be developed, although the payoff does not promise to be extensive. Other areas to be investigated are error recovery and parser speedup.
The compression of the parse tables sacrifices some effectiveness in recovering and/or repairing syntactic errors. A compromise technique might be developed which performs full context checks for critical production rules. Determining such situations can be tied in with the table generation algorithms presented in this thesis. To do this requires a thorough analysis of applicable error recovery strategies. Another refinement involves improving the time efficiency of the parser by doing 'partial chain elimination', bypassing renaming rules which do not contribute to the semantic phase. For certain contexts, a long chain of renaming rules might be involved which have no associated semantic action (or perhaps only one). By detecting these contexts, it is possible to establish a direct transition and avoid superfluous parsing decisions. The final area for which some investigation is suggested is the grammar transformation phase. Any optimization of the output grammar which can be made in this phase potentially carries over into the remaining phases; similarly for error recovery and parser speedup.

REFERENCES

[Aho 72] Aho, A.V. and Ullman, J.D. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Englewood Cliffs, NJ, 1972.

[Aho 74] Aho, A.V., Hopcroft, J.E., and Ullman, J.D. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Mass., 1974.

[Anderson 73] Anderson, T., Eve, J., and Horning, J.J. Efficient LR(1) parsers. Acta Informatica 2(1973), 12-39.

[DeRemer 71] DeRemer, F.L. Simple LR(k) grammars. Comm. ACM 14(1971), 453-460.

[Eickel 63] Eickel, J., Paul, M., Bauer, F.L., and Samelson, K. A syntax controlled generator of formal language processors. Comm. ACM 6(1963), 451-455.

[Evans 64] Evans, A. Jr. An ALGOL 60 compiler. Ann. Rev. Automatic Programming 4(1964), 87-124.

[Feldman 68] Feldman, J.A. and Gries, D. Translator writing systems. Comm. ACM 11(1968), 77-113.

[Floyd 61] Floyd, R.W. A descriptive language for symbol manipulation. J. ACM 8(1961), 579-584.

[Floyd 63] Floyd, R.W. Syntactic analysis and operator precedence. J. ACM 10(1963), 316-333.

[Floyd 64] Floyd, R.W. Bounded-context syntactic analysis. Comm. ACM 7(1964), 62-67.

[Graham 74] Graham, S.L. On bounded right context languages and grammars. SIAM J. Comput. 3(1974), 224-254.

[Gray 72] Gray, J.N. and Harrison, M.A. On the covering and reduction problems for context-free grammars. J. ACM 19(1972), 675-698.

[Harary 69] Harary, F. Graph Theory. Addison-Wesley, Reading, Mass., 1969.

[Ichbiah 70] Ichbiah, J. and Morse, S. A technique for generating almost optimal Floyd-Evans productions of precedence grammars. Comm. ACM 13(1970), 501-508.

[Joliat 73] Joliat, M.L. On the reduced matrix representation of LR(k) parser tables. University of Toronto, Computer Systems Research Group Tech. Rep. CSRG-28, 1973.

[Karp 72] Karp, R.M. Reducibility among combinatorial problems. In: Complexity of Computer Computations, R.E. Miller and J.W. Thatcher (eds.), Plenum Press, New York, NY (1972), 85-104.

[Lalonde 71] Lalonde, W.R. An efficient LALR parser generator. University of Toronto Computer Systems Research Group Tech. Rep. CSRG-2, 1971.

[McKeeman 71] McKeeman, W.M., Horning, J.J., and Wortman, D.B. A Compiler Generator. Prentice-Hall, Englewood Cliffs, NJ, 1971.

[Mickunas 73] Mickunas, M.D. Techniques for compressing bounded right context acceptors. Doctoral Diss., Purdue U., West Lafayette, Ind., May 1973.

[Mickunas2 73] Mickunas, M.D. and Schneider, V.B. A parser-generating system for constructing compressed compilers. Comm. ACM 16(1973), 669-676.

[Mickunas 76] Mickunas, M.D., Lancaster, R.L., and Schneider, V.B. Transforming LR(k) grammars to LR(1), SLR(1) and (1,1) bounded right context grammars. J. ACM 23(1976), 511-533.

[Pager 70] Pager, D. A solution to an open problem by Knuth. Information and Control 17(1970), 462-473.

[Wirth 66] Wirth, N. and Weber, H.
EULER: a generalization of ALGOL 60 and its formal description. Comm. ACM 9(1966), 13-25, 89-99.

APPENDIX A

AN ARITHMETIC EXPRESSION GRAMMAR

1 GRAMMARS

Note: nonterminal symbols in both grammars are given as (subscripted) upper case letters; all other symbols, with the exception of the metasymbol '->', are terminal symbols.

BNF

E -> E+T
E -> T
T -> T*F
T -> F
F -> i
F -> (E)

THE NORMAL FORM GRAMMAR

S  -> J1 $
J1 -> J2 E
E  -> X1 T
E  -> T
T  -> X2 F
T  -> F
F  -> P )
F  -> i
P  -> X3 E
X1 -> E +
X2 -> T *
X3 -> (
J2 -> $

2 CONTEXT COMPUTATION

Shown are the follow and precede sets which are needed; there are no class 1 sets (see 2.5), so first sets are not given.

FOLLOW
E   { $, +, ) }
T   { $, *, +, ) }
F   same as above
P   { ) }
J1  { $ }

PRECEDE
E   { X3, J2 }
T   { X1, X3, J2 }
F   { X1, X2, X3, J2 }
P   same as F
X1  same as E
X2  same as T
X3  same as F

NORMAL FORM WITH ATTACHED CONTEXTS

S  -> J1 $
J1 -> J2 E    ( {J2} x {$} )
E  -> X1 T    ( {X1} x {$,+,)} )
E  -> T       ( {J2} x {$}, {X3} x {)}, {X3,J2} x {+} )
T  -> X2 F    ( {X2} x {$,*,+,)} )
T  -> F       ( {X1} x {$,+,)}, {J2} x {$}, {X3} x {)}, {X3,J2} x {+}, {X1,X3,J2} x {*} )
F  -> P )     ( {X1,X2,X3,J2} x {)} )
F  -> i       ( {X1,X2,X3,J2} x {i} )
P  -> X3 E    ( {X3} x {)} )
X1 -> E +     ( {X3,J2} x {+} )
X2 -> T *     ( {X1,X3,J2} x {*} )
X3 -> (       ( {X1,X2,X3,J2} x {(} )
J2 -> $       ( { } x {$} )

3 CONFLICT GRAPHS

Conflict graphs are given for nontrivial cases, i.e., states E, T, and F. The vertices will be labeled by a pair consisting of the number of the production rule and the order of the context set in the collection of context sets for that production. For example, the entry w*(F,{J2},{$}) = {(T, rename)} has a corresponding vertex numbered [6,2].
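The follow and precede sets of section 2 can be recomputed with a short fixed-point iteration. The Python below is an illustrative check (not part of the PGS) over the normal form grammar: FOLLOW collects the terminals that may follow a nonterminal, and PRECEDE collects the vocabulary symbols that may appear immediately to its left.

```python
# Fixed-point computation of FIRST, FOLLOW, and PRECEDE for the
# normal form grammar of this appendix (no rule derives the empty
# string, so FIRST needs only the first symbol of each right part).

RULES = [("S", ["J1", "$"]), ("J1", ["J2", "E"]),
         ("E", ["X1", "T"]), ("E", ["T"]),
         ("T", ["X2", "F"]), ("T", ["F"]),
         ("F", ["P", ")"]), ("F", ["i"]),
         ("P", ["X3", "E"]), ("X1", ["E", "+"]),
         ("X2", ["T", "*"]), ("X3", ["("]), ("J2", ["$"])]
NONTERMS = {lhs for lhs, _ in RULES}

def first_sets():
    first = {a: set() for a in NONTERMS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            f = first[rhs[0]] if rhs[0] in NONTERMS else {rhs[0]}
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
    return first

def follow_precede():
    first = first_sets()
    follow = {a: set() for a in NONTERMS}
    precede = {a: set() for a in NONTERMS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMS:
                    continue
                if i + 1 < len(rhs):          # symbol after sym in the rule
                    nxt_sym = rhs[i + 1]
                    nxt = first[nxt_sym] if nxt_sym in NONTERMS else {nxt_sym}
                else:                         # sym ends the right part
                    nxt = follow[lhs]
                prv = {rhs[i - 1]} if i > 0 else precede[lhs]
                if not nxt <= follow[sym]:
                    follow[sym] |= nxt
                    changed = True
                if not prv <= precede[sym]:
                    precede[sym] |= prv
                    changed = True
    return follow, precede
```

Running the iteration reproduces the sets listed above, e.g. FOLLOW(E) = {$, +, )} and PRECEDE(F) = {X1, X2, X3, J2}.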
GRAPHS FOR STATE E

G_L: [2,1]  [10,1]  [9,1]
G_R: [2,1]  [10,1]  [9,1]

GRAPHS FOR STATE T

G_L: [4,1]  [4,2]  [4,3]  [3,1]  [11,1]
G_R: [4,1]  [4,2]  [4,3]  [3,1]  [11,1]

GRAPHS FOR STATE F

G_L: [5,1]  [6,1]  [6,2]  [6,3]  [6,4]  [6,5]
G_R: [5,1]

[The edges of the conflict graphs appear only as drawings in the original; only the vertex labels are recoverable here.]

4 OUTPUT

The sequence shown is a symbolic representation of the parse tables produced for the simple arithmetic expression grammar. The minimum number of entries is shown.

S0: INPUT,$,RED,ADD,J2
    INPUT,i,RED,GOTO,F
    INPUT,(,RED,ADD,X3
P:  INPUT,),RED,GOTO,F
    ERROR,1
F:  STACK,X2,RED,GOTO,T
T:  INPUT,*,RED,ADD,X2
    STACK,X1,RED,GOTO,E
E:  INPUT,+,RED,ADD,X1
    STACK,J2,RED,GOTO,J1
    ERROR,4
J1: INPUT,$,RED,GOTO,ACCEPT
    ERROR,5

APPENDIX B

PARSE TABLES FOR XPL

S0: INPUT,ID,RED,GOTO,S13
    INPUT,CALL,RED,ADD,N15
    INPUT,IF,RED,ADD,N14
    INPUT,CHARACTER,RED,GOTO,S34
    INPUT,COMPLOP,RED,ADD,N16
    INPUT,NUMBER,RED,GOTO,S5
    INPUT,INITIAL,RED,GOTO,S47
    INPUT,GOTO,RED,ADD,N1
    INPUT,STRING,RED,GOTO,S5
    INPUT,DO,RED,GOTO,S9
    INPUT,BIT,RED,GOTO,S48
    INPUT,LPAREN,RED,GOTO,S49
    INPUT,FIXED,RED,GOTO,S34
    INPUT,ADDOP,RED,ADD,N17
    INPUT,WHILE,RED,ADD,N24
    INPUT,DECLARE,RED,ADD,N18
    INPUT,LEFT,RED,ADD,N19
    INPUT,TO,RED,ADD,N20
    INPUT,SEMI,RED,GOTO,S4
    INPUT,LABEL,RED,GOTO,S31
    INPUT,RETURN,RED,GOTO,S26
    INPUT,CASE,RED,ADD,N21
S5: STACK,N4,RED,GOTO,S52
S23: STACK,N38,RED,GOTO,S31
S31: INPUT,MULTOP,RED,ADD,N38
    STACK,N17,RED,GOTO,S2
    STACK,N31,RED,GOTO,S2
S2: INPUT,ADDOP,RED,ADD,N31
    STACK,N37,NOR,GOTO,S45
S30: INPUT,CONCAT,RED,ADD,N37
    STACK,N11,RED,GOTO,S20
    INPUT,RELOP,RED,ADD,N11
S20: STACK,N16,RED,GOTO,S21
S21: STACK,N36,RED,GOTO,S19
S19: INPUT,ANDOP,RED,ADD,N36
    STACK,N35,NOR,GOTO,S46
S18: INPUT,OROP,RED,ADD,N35
S12: STACK,N5,RED,GOTO,S61
    STACK,N29,RED,GOTO,S59
    STACK,N39,RED,GOTO,S53
    STACK,N14,RED,GOTO,S50
    STACK,N12,RED,GOTO,S37
    STACK,
N34,RED,GOTO,S16
    STACK,N24,RED,GOTO,S8
    STACK,N21,RED,GOTO,S8
    ERROR,12
S3: STACK,N6,RED,GOTO,S3
S4: STACK,N27,RED,GOTO,S4
    INPUT,ELSE,RED,GOTO,S32
S29: STACK,N30,RED,GOTO,S28
    STACK,N3,RED,GOTO,S14
    STACK,N33,RED,GOTO,S14
S28: INPUT,RIGHT,RED,GOTO,S25
    INPUT,END,RED,GOTO,S11
    ANY,GOTO,S30
S33: INPUT,INITIAL,NOR,ADD,N13
S6: STACK,N18,RED,GOTO,S7
    STACK,N32,RED,GOTO,S7
    ERROR,6
S7: INPUT,COMMA,RED,ADD,N32
    INPUT,SEMI,RED,GOTO,S4
    ERROR,7
S8: STACK,N22,RED,GOTO,S60
    ERROR,8
S9: INPUT,SEMI,RED,ADD,N2
S22: STACK,N28,RED,GOTO,S62
    ERROR,22
S11: INPUT,ID,NOR,ADD,N25
S10: STACK,N8,RED,GOTO,S58
    STACK,N2,RED,GOTO,S55
    ERROR,10
S13: STACK,N7,RED,GOTO,S36
    STACK,N25,RED,GOTO,S10
    INPUT,LPAREN,RED,GOTO,S51
    INPUT,LITERALLY,RED,GOTO,S40
    INPUT,COLON,RED,GOTO,S17
    INPUT,INPUT01,NOR,ADD,N26
S35: STACK,N1,RED,GOTO,S39
    STACK,N15,RED,GOTO,S38
    STACK,STACK01,NOR,GOTO,S27
    ANY,GOTO,S23
S14: STACK,N27,RED,GOTO,S14
    ANY,GOTO,S29
S15: STACK,N13,RED,GOTO,S33
    ERROR,15
S54: INPUT,BY,NOR,GOTO,S56
S16: STACK,N23,RED,GOTO,S8
    ERROR,16
S17: INPUT,PROCEDURE,RED,GOTO,S24
    ANY,ADD,N27
S24: INPUT,SEMI,RED,ADD,N8
    ANY,ADD,N28
S25: STACK,N19,RED,GOTO,S1
    ERROR,25
S26: INPUT,SEMI,RED,GOTO,S4
    ANY,ADD,N29
S27: INPUT,EQUAL,RED,ADD,N34
    INPUT,COMMA,RED,ADD,N6
    ERROR,27
S32: STACK,N3,RED,ADD,N33
    ERROR,32
S34: STACK,N10,RED,GOTO,S33
    STACK,N26,RED,GOTO,S33
    ERROR,34
S36: INPUT,RPAREN,RED,GOTO,S22
    INPUT,COMMA,RED,ADD,N7
    ERROR,36
S37: INPUT,RPAREN,RED,GOTO,S35
    INPUT,COMMA,RED,ADD,N12
    ERROR,37
S38: INPUT,SEMI,RED,GOTO,S4
    ERROR,38
S39: INPUT,SEMI,RED,GOTO,S4
    ERROR,39
S40: INPUT,STRING,GOTO,S6
    ERROR,40
S41: INPUT,NUMBER,RED,GOTO,S42
    ERROR,41
S42: INPUT,RPAREN,RED,GOTO,S34
    ERROR,42
S43: INPUT,NUMBER,RED,GOTO,S63
    ERROR,43
S52: INPUT,RPAREN,RED,GOTO,
S15
S44: INPUT,COMMA,RED,ADD,N4
    ERROR,44
S45: STACK,N37,RED,GOTO,S30
    ERROR,45
S46: STACK,N35,RED,GOTO,S18
    ERROR,47
S48: INPUT,LPAREN,RED,GOTO,S41
    ERROR,48
S49: STACK,N28,NOR,ADD,N7
    ANY,ADD,N5
S50: INPUT,THEN,RED,ADD,N3
    ERROR,50
S51: STACK,N52,NOR,GOTO,S43
    ANY,ADD,N12
S53: INPUT,SEMI,NOR,GOTO,S57
    ANY,ADD,N23
S55: INPUT,SEMI,RED,GOTO,S4
    ERROR,55
S56: INPUT,BY,RED,ADD,N34
    ERROR,56
S57: INPUT,SEMI,RED,GOTO,S3
    ERROR,57
S58: INPUT,SEMI,RED,GOTO,S4
    ERROR,58
S59: INPUT,SEMI,RED,GOTO,S4
    ERROR,59
S60: INPUT,SEMI,RED,ADD,N2
    ERROR,60
S61: INPUT,RPAREN,RED,GOTO,S23
    ERROR,61
S62: INPUT,SEMI,RED,ADD,N8
    ERROR,62
S63: INPUT,RPAREN,RED,ADD,N12
    ERROR,63

The sets used are:

STACK01 = { N2, N3, N6, N8, N19, N22, N27, N30, N33 }
STACK02 = { N18, N32 }
INPUT01 = { CHARACTER, BIT, FIXED, LABEL }