"L I B R.AR.Y OF THL UNIVERSITY Of ILLINOIS 510. 84 ha 271-278 cop. 2 -I •' • i he person charging this material is re- sponsible for its return to the library from which it was withdrawn on or before the Latest Date stamped below. Theft, mutilation, and underlining of books are reasons for disciplinary action and may result in dismissal from the University. UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN m 22 a* JAN 2 8 WI L161 — 0-10% Digitized by the Internet Archive in 2013 http://archive.org/details/ongenerationofpa276dere 5/C- H Report No. 276 /luuOh ON THE GENERATION OF PARSERS FOR BNF GRAMMARS: AN ALGORITHM by Franklin L. DeReraer ILLIAC IV Document No. 199 DEPARTMENT OF COMPUTER SCIENCE • UNIVERSITY OF ILLINOIS • URBANA, ILLINOIS ILLIAC IV Document No. 199 DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF ILLINOIS URBANA, ILLINOIS 6l801 Contract No. US AF 30 (602)klkk ON THE GENERATION OF PARSERS FOR BNF GRAMMARS: AN ALGORITHM by Franklin L. DeRemer Report No. 276 August 1, 1968 This work was supported in part "by the Department of Computer Science, University of Illinois, Urbana, Illinois, and in part by the Advanced Research Projects Agency as administered by the Rome Air Development Center, under Contract No. US AF 30(602)klkk. A CKNOWLEDGEMENT The author would like to thank Alan J. Beals for his help in evaluating and debugging the algorithm. Thanks are also due Dr. R. S. Northcote for suggested improvements in the paper itself. 11 - ABSTRACT This paper describes an algorithm which, for suitable grammars, maps the Backus Naur Form (BNF) definition of the grammar of a language into a parser for the sentences in that language. By design the algorithm generates a suitable parser for any bounded right context grammar . It happens that it also covers some LR(k) grammars which are not bounded right context. A modified version of Floyd's descriptive language for symbol manipulation is used to describe the parser. 
Several examples illustrate the application and generality of the algorithm.

Introduction

The algorithm described herein is in essence an extension, albeit a simplification, of the work of Earley (1), which in turn was based on that of Evans (2), Feldman (3), Floyd (4, 5), and Standish (7). For a large subset of grammars, the algorithm maps the Backus Naur Form (BNF) definition of the grammar of a language into a deterministic, left-to-right parser for the sentences in that language. It is shown below that the algorithm, by design, covers all bounded right context grammars and, as a by-product, some LR(k) grammars which are not bounded right context (see Knuth (6) for the definitions of these classes of grammars).

More precisely, the algorithm maps a set of BNF productions into a program: a "reductions analysis" program* consisting of modified Floyd productions (really reductions) referred to below as FPL (Floyd Production Language) statements.** The program consists of labeled, mutually exclusive groups of statements called sections. Each section has a specific task to perform. It is activated, by transfer of control to its first statement via the label, only at appropriate times. Upon each activation it either scans a new terminal symbol or makes a reduction (combined with an "unscan" in the case of a production with an empty right part) and then transfers control to the appropriate next section, or it transfers control to an ERROR routine if control falls out the bottom of the section.

The algorithm is based on Earley's intuitive notion that the top symbols on the stack, matched against the right parts of certain productions, should determine parsing decisions. It is an extension of his algorithm in that it provides for both finite look-ahead and finite look-back and in that it covers productions with empty right parts. It is a simplification of his algorithm in that it allows reductions only at the top of the stack, thereby reducing the number of mapping rules.
*It is assumed that the reader is familiar with reductions analysis programs and the associated stack, input string, and manipulations thereupon.

**This nomenclature is adopted to clarify the distinction between the BNF productions, which together define the grammar of a language, and the FPL statements which, when combined to form a program, describe a parser.

A word on notation is in order before proceeding. In this paper, non-terminal symbols are represented by Latin capitals, terminals by lower case Latin letters, and arbitrary strings by Greek letters [...]

(2) [A section labeled t(π,p).] This section consists of exactly one statement which compares the first p symbols of production π with the top p symbols of the stack. It is activated only when the match must occur for a well-formed string. Its purpose is to verify the top symbol and to take appropriate action to continue the parse.

(3) A section labeled Nt is required for each non-terminal N which appears in the right part of some production. The section is activated immediately after a reduction to N occurs at the top of the stack. The statements in this section indicate comparisons to the stack to determine which of the production(s) in whose right part N appears is applicable to the case at hand. A match determines the appropriate subsequent action.

(b) Descriptor Sets. In order to generate the appropriate set of statements for a given section, a descriptor set of pairs D = {(π,p), ...} is associated with each section label. This descriptor set is determined by investigating the productions and serves to indicate to which part of which production(s) the mapping rules described below are to be applied. The pair (π,p) points to the first p symbols of production π as the stack comparison symbols of the corresponding statement. The descriptor sets are determined as follows:

(1) D_Nh: Initially D_Nh is empty and the following recursive procedure is applied. The right part of each production π that defines N is examined.
If it is empty, then (π,0) is added to D_Nh; if it begins with a terminal, then (π,1) is added to D_Nh; otherwise it begins with a non-terminal and the procedure is applied to that non-terminal.

(2) D_t(π,p) contains exactly one pair, (π,p).

(3) D_Nt: The right part of each production π is examined. If the non-terminal N appears as the p-th symbol, then (π,p) is in D_Nt.

(c) The BNF to FPL Mapping Rules. Presented in Table I are four rules for mapping BNF productions into FPL statements. Together with the descriptor sets they represent a naive first try at generating a parser for the grammar. Implicitly, the rules assume there is no question about which production applies to the case at hand but only what action is to be taken by the parser next, given that a certain production is applicable.

Table I. The BNF to FPL mapping rules. (α represents the first p symbols of production π, σ is a symbol which matches any other symbol, and q = p + 1.)

         BNF production (π,p)       maps into FPL statement
    (1)  M ::= αN ...               α|  *          Nh
    (2)  M ::= αb ...               α|  *          t(π,q)
    (3)  M ::= α                    α|  →  M|      Mt
    (4)  M ::= ε                    σ|  →  M|σ     Mt

It is the purpose of the last two rules of the algorithm, (d) and (e) below, to extend it to cover a reasonable set of grammars by resolving confusion about which production(s) may apply to different cases within a given section.

The rules of Table I are explained intuitively as follows. If the first p symbols of the right part of production π are at the top of the stack and

(1) if the (p+1)st symbol is a non-terminal N, then the parser should scan (*) the next terminal and activate section Nh to begin to reduce a substring to N.

(2) if the (p+1)st symbol is a terminal b, then the parser should scan the next terminal and activate section t(π,q), where q = p + 1, to verify that that terminal is indeed b and to decide how to continue the parse.
(3) if the p-th symbol is last in the right part of the production, then the parser should make a reduction (→) to the symbol M defined by the production and activate section Mt to decide how to continue the parse.

(4) if p = 0 (and, therefore, the right part of the production is empty), then the parser should "unscan" the top symbol, push an M onto the top of the stack, and activate section Mt to decide how to continue the parse. (The symbol unscanned will always be a terminal since this statement will appear only in an Nh-type section, the activation of which is always immediately preceded by a scan (see rule (1)).)

(d) Combinations. In general, a reductions analysis program generated according to the above rules will contain sections in which some of the statements are not disjoint. That is, the conflicting statements will indicate stack comparisons (1) which are identical, or (2) the shorter of which are identical to the top few symbols of the longer ones. Thus, several statements may be applicable to a single stack and input string configuration, and the parser is in some sense non-deterministic. To render the parser deterministic it must be modified so it can either delay or determine the decisions concerning which of the several similar productions associated with the conflicting statements is applicable in various cases.

Decision delays are effected by pairwise statement combinations as follows. If a pair of statements in a given section are not disjoint and if each was generated according to either mapping rule (1) or (2), then replace them with a single statement: one whose stack comparison is the shorter of the two and which, upon a successful stack match, scans a new terminal and activates a new combination section which must be added to the program. The new section is that section whose descriptor set is the union of the two descriptor sets of the sections which the original statements would have activated.
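As an illustration of the combination step, the following Python sketch merges two non-disjoint scanning statements. It is an assumption-laden illustration rather than the report's procedure. A statement is encoded here as a pair (stack comparison, descriptor set of the section it activates), and the names overlaps, combine and label are hypothetical. The descriptor pairs (1,2) and (4,2) are those of the two sections that are combined in the final example of this report.

```python
def overlaps(c1, c2):
    """Two stack comparisons conflict when they are identical or when the
    shorter equals the top (rightmost) symbols of the longer."""
    return c1.endswith(c2) or c2.endswith(c1)


def combine(stmt1, stmt2):
    """Replace two conflicting scan statements by one.

    The new comparison is the shorter of the two; the section activated
    after the scan is described by the union of the two descriptor sets.
    """
    (c1, d1), (c2, d2) = stmt1, stmt2
    if not overlaps(c1, c2):
        raise ValueError("statements are disjoint; nothing to combine")
    shorter = c1 if len(c1) <= len(c2) else c2
    return shorter, d1 | d2


def label(dset):
    """Label for a combination of t-type sections, e.g. t(1,2; 4,2)."""
    return "t(" + "; ".join(f"{pi},{p}" for pi, p in sorted(dset)) + ")"


if __name__ == "__main__":
    # Two statements with the identical comparison H that would
    # activate sections t(1,2) and t(4,2) respectively.
    s1 = ("H", frozenset({(1, 2)}))
    s2 = ("H", frozenset({(4, 2)}))
    compare, dset = combine(s1, s2)
    print(compare + "|  *  " + label(dset))
```

Working on descriptor sets rather than on generated statements keeps the merge a plain set union; the new section can then be generated from the merged set and checked for disjointness in turn.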
Of course, the new section must be checked for disjointness, and the old sections, of which the new one is a combination, should be checked for usefulness, since the only reference in the entire program to one or both might have been deleted by removal of the two statements.

(e) Expansion by Contextual Analysis. The only decisions which cannot be delayed are those concerning reductions. This limitation is due to the requirement that reductions be made only at the top of the stack. Thus, conflicts with statements generated according to mapping rules (3) and (4) cannot be cured by combination. In this case the statements' comparison fields are expanded by contextual analysis to provide the parser with whatever finite look-ahead and look-back are necessary to make the decision at hand; i.e., for each of the conflicting statements the grammar is investigated and generation begun of the strings of symbols which, in the context of the production associated with the statement, may surround the original stack comparison substring α of the statement. Appropriate comparison of the composite strings associated with each of the original statements indicates the minimum context which must be checked to make the statements disjoint. In the worst case each statement must be replaced with several statements which differ from the original in that they indicate more symbols which must be matched in the stack and/or the input string.

Examples

Since the parser proceeds from left to right, always making reductions at the top of the stack on the basis of whatever finite look-ahead and look-back are necessary, the algorithm by definition covers all bounded right context grammars.
Further, due to the fact that the sections of the program themselves imply certain extra information about the stack configuration, in the same sense that a state of a finite state acceptor implies information about the string read, the algorithm also covers some LR(k) grammars which are not bounded right context. An example grammar in this class is

    S ::= aA | bB,    A ::= cA | d,    B ::= cB | d,

the sentences of which are a c^n d and b c^n d (n ≥ 0). It is not bounded right context since the clue as to whether to reduce d to A or B is an a or b arbitrarily far down the stack. The grammar is, however, LR(0) and can be parsed by the algorithmically generated parser of Figure 1. Note that a transfer of control to an ERROR routine is implicit at the bottom of each section in case no match occurs.

    START (Sh)    a|    *          Ah
                  b|    *          Bh
    Ah            c|    *          Ah
                  d|    →  A|      At
    Bh            c|    *          Bh
                  d|    →  B|      Bt
    At            cA|   →  A|      At
                  aA|   →  S|      St
    Bt            cB|   →  B|      Bt
                  bB|   →  S|      St
    St            SUCCESS EXIT

Figure 1. Algorithmically generated parser for a grammar which is LR(0) but not bounded right context.

As an example of a grammar requiring both look-ahead and look-back consider the following.

    π              p:  1  2  3
    1    S ::=         c  A  B
    2    S ::=         d  A  e
    3    A ::=         a  G
    4    B ::=         x  e
    5    G ::=         G  x
    6    G ::=         x

Confusion arises in the Gt section about when to terminate the gathering of x's into the non-terminal G. Generation of the context related to production five produces three possible strings:

    (1)  G|  →  G|x  →  G|xx
    (2)  G|  →  G|x  →  aG|x  →  daG|xe
    (3)  G|  →  G|x  →  aG|x  →  caG|xB  →  caG|xxe

There are two possible strings for production three:

    (1)  aG|  →  daG|e
    (2)  aG|  →  caG|B  →  caG|xe

Most of the confusion is between case (2) of production five and case (2) of production three. One possible solution is to construct the following Gt section:

    Gt    G|xx      *          t(5,2)
          daG|xe    *          t(5,2)
          aG|       →  A|      At

Note that advantage has been taken of the sequential nature of the program here.
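The effect of statement order in this Gt section can be checked mechanically. The Python sketch below is an illustration under assumptions: the tuple encoding (look-back, look-ahead, action, next section) and the names GT_SECTION and decide are hypothetical, and the stack and remaining input are modeled as plain strings with the top of the stack at the right.

```python
# Statements of the Gt section, in program order.  The look-back is
# matched against the top of the stack, the look-ahead against the
# beginning of the unscanned input.
GT_SECTION = [
    ("G",   "xx", "scan",     "t(5,2)"),
    ("daG", "xe", "scan",     "t(5,2)"),
    ("aG",  "",   "reduce A", "At"),
]


def decide(stack, rest, section=GT_SECTION):
    """Return (action, next_section) of the first statement that matches."""
    for look_back, look_ahead, action, nxt in section:
        if stack.endswith(look_back) and rest.startswith(look_ahead):
            return action, nxt
    return "error", "ERROR"   # control fell out the bottom of the section


if __name__ == "__main__":
    for stack, rest in [("daG", "xe"), ("caG", "xe"),
                        ("caG", "xxe"), ("daG", "e")]:
        print(stack + "|" + rest, decide(stack, rest))
```

With stack daG and input xe the second statement fires and the x is gathered into G, whereas with stack caG and the same input only the third statement matches and G is reduced, leaving xe to form B.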
Since the first two statements will catch all configurations to which production five is applicable, the statement associated with production three checks no extra context. That is, the restriction that the statements in a given section must be disjoint may be relaxed in special cases where advantage is taken of the order in which statements are executed; however, the contextual analysis must still be performed to ascertain the validity of such an optimization. Finally, note that had production five been G ::= xG the grammar would not have been bounded right context nor covered by this algorithm, although it would still be LR(2).

As a final, larger, and more practical example consider the grammar of Table II, which is Earley's example of a "simple algebraic language". The corresponding list of necessary sections and their descriptor sets is presented in Table III, and the parser is given in Figure 2. This grammar requires no special look-back, and it requires look-ahead of more than one symbol in only one case, section Dt. A single pair of statements was combined in section Ht, causing the combination of sections t(1,2) and t(4,2) to form a section labeled t(1,2; 4,2). Note that such combinations are probably most efficiently effected by operations on the descriptor sets before the sections are generated. Also note that maximum advantage was taken of ordering the statements.
However, for expositional purposes several optimizations were not made: (1) since the first p-1 symbols are matched immediately prior to its activation, a t(π,p) section need match only the p-th symbol with the top symbol of the stack, (2) since a reduction to N occurs immediately prior to the activation of an Nt section, it need not match the top symbol, and (3) several sections could have been "concatenated", as for example sections Dt, t(6,2), and t(6,3), which would form

    Dt    D|;r    ***        Vh
          bD|     →  H|      Ht

Finally, since sections Ph, Fh, and Th are identical, and are a subset of section Eh, all these could have been combined to save space; however, this is probably undesirable as it implies a loss of information useful for error recovery.

    PRODUCTION TABLE

     π               p:  1  2  3  4
     0    Z ::=          ⊢  B  ⊣
     1    B ::=          H  e
     2    H ::=          b
     3    H ::=          b  D
     4    H ::=          H  ;  S
     5    D ::=          r  V
     6    D ::=          D  ;  r  V
     7    V ::=          i
     8    V ::=          V  ,  i
     9    S ::=          i  ←  E
    10    E ::=          T
    11    E ::=          +  T
    12    E ::=          E  +  T
    13    T ::=          F
    14    T ::=          T  *  F
    15    F ::=          P
    16    F ::=          F  ↑  P
    17    P ::=          i
    18    P ::=          (  E  )

    NOTE: i is identifier, r is real, b is begin, e is end.

Table II. Production table for a simple algebraic language.

    NECESSARY SECTIONS    DESCRIPTOR SETS
    START (Zh)            (0,1)
    Bh                    (2,1)  (3,1)
    Dh                    (5,1)
    Vh                    (7,1)
    Th                    (17,1) (18,1)
    Eh                    (11,1) (17,1) (18,1)
    Sh                    (9,1)
    Fh                    (17,1) (18,1)
    Ph                    (17,1) (18,1)
    t(1,2)                (1,2)     combined to form t(1,2; 4,2)
    t(4,2)                (4,2)     combined to form t(1,2; 4,2)
    t(6,2)                (6,2)
    t(8,2)                (8,2)
    t(9,2)                (9,2)
    t(12,2)               (12,2)
    t(14,2)               (14,2)
    t(16,2)               (16,2)
    t(0,3)                (0,3)
    t(6,3)                (6,3)
    t(8,3)                (8,3)
    t(18,3)               (18,3)
    Ht                    (1,1)  (4,1)     (combination)
    Dt                    (6,1)  (3,2)
    Vt                    (8,1)  (5,2)  (6,4)
    Tt                    (10,1) (14,1) (11,2) (12,3)
    Et                    (12,1) (18,2) (9,3)
    Ft                    (13,1) (16,1) (14,3)
    Pt                    (15,1) (16,3)
    Bt                    (0,2)
    St                    (4,3)
    Zt

Table III. List of necessary sections and the corresponding descriptor sets for the grammar of Table II.
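The descriptor sets of Table III follow mechanically from the production table. The Python sketch below applies the recursive procedure for D_Nh and the position scan for D_Nt described in part (b) of the algorithm. It is an illustration only: the names GRAMMAR, head_set and tail_set are hypothetical, and the production list is a reading of Table II in which the terminal glyphs (⊢, ⊣, ←, ↑ and the comma) are assumptions. The glyphs do not affect the computed sets, since only the position and kind of each symbol matter.

```python
# (left part, right part), indexed by production number 0..18.
GRAMMAR = [
    ("Z", ["⊢", "B", "⊣"]), ("B", ["H", "e"]), ("H", ["b"]),
    ("H", ["b", "D"]), ("H", ["H", ";", "S"]), ("D", ["r", "V"]),
    ("D", ["D", ";", "r", "V"]), ("V", ["i"]), ("V", ["V", ",", "i"]),
    ("S", ["i", "←", "E"]), ("E", ["T"]), ("E", ["+", "T"]),
    ("E", ["E", "+", "T"]), ("T", ["F"]), ("T", ["T", "*", "F"]),
    ("F", ["P"]), ("F", ["F", "↑", "P"]), ("P", ["i"]),
    ("P", ["(", "E", ")"]),
]
NONTERMINALS = {left for left, _ in GRAMMAR}


def head_set(n, seen=None):
    """D_Nh: pairs (pi, p) reached from the productions that define n."""
    seen = set() if seen is None else seen
    if n in seen:                      # already examined: avoid looping
        return set()
    seen.add(n)
    pairs = set()
    for pi, (left, right) in enumerate(GRAMMAR):
        if left != n:
            continue
        if not right:                  # empty right part
            pairs.add((pi, 0))
        elif right[0] in NONTERMINALS: # recurse on a leading non-terminal
            pairs |= head_set(right[0], seen)
        else:                          # begins with a terminal
            pairs.add((pi, 1))
    return pairs


def tail_set(n):
    """D_Nt: every (pi, p) such that n is the p-th symbol of production pi."""
    return {(pi, p)
            for pi, (_, right) in enumerate(GRAMMAR)
            for p, sym in enumerate(right, start=1)
            if sym == n}


if __name__ == "__main__":
    for n in "ZBDVTESFP":
        print(n + "h", sorted(head_set(n)))
    for n in "HDVTEFPBS":
        print(n + "t", sorted(tail_set(n)))
```

The shared `seen` set is what terminates the recursion on left-recursive productions such as H ::= H ; S; a non-terminal that is met a second time contributes nothing new because its pairs were already collected on the first visit.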
    START (Zh)      ⊢|       *          Bh
    Bh              b|r      *          Dh
                    b|       →  H|      Ht
    Dh              r|       *          Vh
    Vh              i|       →  V|      Vt
    Th              i|       →  P|      Pt
                    (|       *          Eh
    Eh              +|       *          Th
                    i|       →  P|      Pt
                    (|       *          Eh
    Sh              i|       *          t(9,2)
    Fh              i|       →  P|      Pt
                    (|       *          Eh
    Ph              i|       →  P|      Pt
                    (|       *          Eh
    t(1,2; 4,2)     He|      →  B|      Bt
                    H;|      *          Sh
    t(6,2)          D;|      *          t(6,3)
    t(8,2)          V,|      *          t(8,3)
    t(9,2)          i←|      *          Eh
    t(12,2)         E+|      *          Th
    t(14,2)         T*|      *          Fh
    t(16,2)         F↑|      *          Ph
    t(0,3)          ⊢B⊣|     →  Z|      Zt
    t(6,3)          D;r|     *          Vh
    t(8,3)          V,i|     →  V|      Vt
    t(18,3)         (E)|     →  P|      Pt
    Ht              H|       *          t(1,2; 4,2)
    Dt              D|;r     *          t(6,2)
                    bD|      →  H|      Ht
    Vt              V|,      *          t(8,2)
                    D;rV|    →  D|      Dt
                    rV|      →  D|      Dt
    Tt              T|*      *          t(14,2)
                    E+T|     →  E|      Et
                    +T|      →  E|      Et
                    T|       →  E|      Et
    Et              E|+      *          t(12,2)
                    (E|      *          t(18,3)
                    i←E|     →  S|      St
    Ft              F|↑      *          t(16,2)
                    T*F|     →  T|      Tt
                    F|       →  T|      Tt
    Pt              F↑P|     →  F|      Ft
                    P|       →  F|      Ft
    Bt              ⊢B|      *          t(0,3)
    St              H;S|     →  H|      Ht
    Zt              SUCCESS EXIT

Figure 2. Algorithmically generated parser for a simple algebraic language.

REFERENCES

1. Earley, J. Generating a Recognizer for a BNF Grammar, Carnegie-Mellon Institute of Technology, June 1965, unpublished.

2. Evans, A. An ALGOL 60 Compiler, National ACM Conference, Denver, 1963.

3. Feldman, J. A Formal Semantics for Computer-Oriented Languages, Doctoral Thesis, Carnegie-Mellon Institute of Technology, 1964.

4. Floyd, R. A Descriptive Language for Symbol Manipulation, J. ACM 8, 4 (1961), 579-584.

5. Floyd, R. Bounded Context Syntactic Analysis, Comm. ACM 7, 2 (1964), 62-67.

6. Knuth, D. On the Translation of Languages from Left to Right, Information and Control 8 (1965), 607-639.

7. Standish, T. Generating Productions from a Restricted Class of BNF Grammars, Carnegie-Mellon Institute of Technology Computation Center, unpublished.

UNCLASSIFIED
Security Classification

DOCUMENT CONTROL DATA - R&D
(Security classification of title, body of abstract and indexing annotation must be entered when the overall report is classified.)

1. Originating activity (corporate author): Department of Computer Science, University of Illinois, Urbana, Illinois 61801
2a. Report security classification: UNCLASSIFIED
3. Report title: ON THE GENERATION OF PARSERS FOR BNF GRAMMARS: AN ALGORITHM
4. Descriptive notes (type of report and inclusive dates): Research Report
5. Author(s) (first name, middle initial, last name): Franklin L. DeRemer
7a. Total no. of pages: 19
7b. No. of refs: 7
8a. Contract or grant no.: 46-26-15-305
8b. Project no.: USAF 30(602)-4144
9a. Originator's report number(s): ILLIAC IV Document No. 199
9b. Other report no(s). (any other numbers that may be assigned this report): DCS Report No. 276
10. Distribution statement: Qualified requesters may obtain copies of this report from DCS.
11. Supplementary notes: None
12. Sponsoring military activity: Rome Air Development Center, Griffiss Air Force Base, Rome, New York 13440
14. Key words: Parser; Syntax analysis; Compiler; Compiler-Compiler

UNCLASSIFIED
Security Classification