Digitized by the Internet Archive in 2013
http://archive.org/details/bnflikelanguagef300trou

Report No. 300

A BNF LIKE LANGUAGE FOR THE DESCRIPTION OF SYNTAX DIRECTED COMPILERS*

by

Harold Robert George Trout

January 13, 1969

Department of Computer Science
University of Illinois
Urbana, Illinois 61801

* This work was supported in part by the Advanced Research Projects Agency as administered by the Rome Air Development Center under Contract No. USAF 30(602)-4144 and submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, February, 1969.

ACKNOWLEDGMENT

The author would like to express his deep appreciation of the efforts of Dr. R. S. Northcote, without whose encouragement and advice the system could never have come to fruition. Mrs. Sharon Hardman must also be thanked for sympathetic typing of a difficult manuscript.
Finally, the author would like to express his gratitude to the Department of Computer Science for the facilities and staff cooperation so willingly provided.

ABSTRACT

The paper describes an extended version of Backus Naur Form which can be translated in one pass to a parsing algorithm. The restrictions which must be placed on the BNF to achieve this end are minimal, and it is proved that they do not alter the generative capacity of the metalanguage. The recursive descent parsing algorithm produced operates at about 1000 cards per minute for typical languages on the B-5500.

TABLE OF CONTENTS

ACKNOWLEDGMENT
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
2. THE SYNTAX LANGUAGE
   2.1 Automatic Computation Based on BNF
   2.2 The Language TBNF
   2.3 The Recognition Algorithm
   2.4 Distinctive Characteristics of This Algorithm
   2.5 Semantic Linkage
3. ON EXECUTABLE CODE
   3.1 Discussion
   3.2 The Equivalent Algol Code
   3.3 Optimization
4. THEORETICAL RESULTS
   4.1 Definitions
   4.2 Discussion
   4.3 Theorem 1: A Lower Bound For the Power of a RD Machine
       4.3.1 Discussion of the Theorem
       4.3.2 The LBA
       4.3.3 Proof of the Theorem
       4.3.4 The Converse of Theorem 1
   4.4 Theorem 2: Ambiguity
   4.5 Some Results on Timing
5. CONCLUSION
APPENDIX
   A. RECURSIVE DESCENT, TOP TO BOTTOM ANALYSIS
   B. THE SYNTAX SPECIFICATION OF TBNF
   C. THE TBNF GRAMMAR FOR THE LANGUAGE DEMALGOL, AND THE CORRESPONDING ALGOL CODE PARSER
   D. B-5500 IMPLEMENTATION
LIST OF REFERENCES

LIST OF FIGURES

1. An interpretive system
2. A noninterpretive system
3. Direct translation

1. INTRODUCTION

Waning academic interest in the subject of syntax analysis per se attests to the thoroughness with which this particular vein of linguistic lore has been worked.
Notwithstanding the light shed and the limitations exposed by such analysis, writing programs for syntax analysis has remained an ad hoc affair, especially in the computer industry, where the effects of such advances should logically have been felt. In short, syntax analysis has come to and gone from the academic scene without much apparent impact on the industrial tempo. Those in the universities who have been involved in compiler writing have felt the need for neat and fast ways to write down syntax recognition algorithms and, in many cases, could have had their burden lightened by having some of the manufacturers' software packages available. Those in industry likewise have not utilized much of the work done in the universities.

The language described here is specifically an algorithmic language, but is clothed in terms which make it appear to be a syntax description language. As an algorithmic language, it could be added to the repertoire of multi-purpose languages, to be used like any other programming device when convenient.

The language described here (TBNF, T for translatable) is based on Backus Naur Form (BNF) but employs some of the devices used by Brooker and Morris [1] and by Kleene [2] in their respective systems. A minor variant of what follows is operational on a B-5500 and has been used in the syntax phase of two ILLIAC IV languages and several support languages at the University of Illinois, and at the Burroughs Corporation. In these examples the syntax of a well defined language of greater complexity than Algol 60 has been written in approximately one man week each, yielding parsers of moderate speed (900 to 1800 cards per minute on the B-5500).
As testimony to the flexibility of the system, one language (Tranquil [3] for ILLIAC IV) was converted from an earlier Floyd production scheme without a single change to the semantics routines.

The method of syntax analysis is a straightforward top to bottom recursive descent method based on this considerably expanded version of BNF. The machinery of TBNF includes mnemonic symbols for denoting a sequence of strings without resorting to the usual recursive method (i.e., a list of statements can be written list <statement> instead of introducing the nonterminal <statement list> and using a recursive production). Several other devices are employed to reduce the number of nonterminals necessary to describe a language and to avoid the semi-infinite verbosity associated with conventional BNF. Thus, for example, a three hundred production grammar in BNF became a forty production grammar in the extended language. One of the first benefits of such compactification was that programmers developed a very good feel for the parsing method and quickly learned to program efficiently in TBNF.

This paper demonstrates how TBNF can be translated, in one pass, to a parsing algorithm. The method does not detect ambiguity or any of the other traditional properties (which turn out to be of remarkably little relevance in this context) and, not surprisingly, translates the TBNF quite rapidly (i.e., on the B-5500 the elapsed time between submitting a grammar and producing a compiler is of the order of two minutes). A short turn around time is of considerable advantage when a new language is being developed.

In summary, it is fair to say that the language to be described has reduced the syntax analysis of compiling (as far as it can be divorced from the semantics) to a trivial matter, without imposing restrictions on the semantics.

2. THE SYNTAX LANGUAGE

2.1 Automatic Computation Based on BNF

One of the surprising things about BNF is that, despite its apparent simplicity and elegance, it is a very difficult language to handle automatically. The question of ambiguity is typical of the difficulties one encounters. For example, it is impossible to write an effective procedure to detect the ambiguities in an arbitrary context free grammar (the proof is given in Section 4.4).

This may not seem a serious handicap, for it seems possible to retrieve some usefulness from BNF by dropping the requirement of unambiguity. Even so, one is faced with unusual difficulties in implementing BNF in its full generality. It is manifestly obvious that there is a linear lower bound T(n) to the parsing time for a string of length n. Yet no general scheme, to the author's knowledge, has succeeded in realizing linear parsing time. On the other hand, at least three workers (Earley [5], Kasami [6], and Younger [7]) have established n^3 time bounds for context free languages. Earley further strengthens his result to n^2 in a large number of cases. Their proofs are constructive. One is tempted to conclude that an n^3 bound is the best one can expect in general.

It is not surprising, then, that most practical schemes place some restrictions on the BNF accepted (either LR(k) or LR(m,n) for some finite, usually small (e.g., 1 or 2) values of m, n). The method described in this paper restricts the BNF also. To distinguish this form, we will refer to it as TBNF (Translatable BNF).

There is one other characteristic of BNF which makes it inconvenient to use in a practical translator writing system (TWS). This characteristic is its excessive verbosity. It is quite unnecessary to define both a list nonterminal and its element nonterminal when it is clear that the former is obtained by compounding several of the latter. In the author's experience, students revert to a more compact notation when endeavoring to decipher complex BNF productions.
TBNF attempts to meet this objection by providing abbreviations for the common constructs.

2.2 The Language TBNF

The syntax of TBNF is similar to BNF except for the following use of special characters and words.

(i) Kleene star.
    <a>* = λ | <a> | <a><a> | <a><a><a> | ...
    that is, any number n of <a>'s concatenated together (n ≥ 0), with λ representing the null symbol.

(ii) Brooker and Morris' question mark.
    <a>? = <a> | λ
    to mean the optional presence of some symbol.

(iii) Bracket construct. Square brackets [ , ] are used to delimit groups of symbols; e.g.,
    <a> ::= <b> [ <c> | <d> | <e> ]
    is equivalent to
    <a> ::= <b> <γ>
    <γ> ::= <c> | <d> | <e>
    Naturally the brackets can be nested to any depth, as in the following compact expression for a boolean expression:
    <boolean expression> ::= list [ list <boolean primary> separator [ ∧ | and ] ] separator [ ∨ | or ]

(iv) list <a> = <a> <a>*

(v) list <a> separator <b> = <a> [ <b> <a> ]*

(vi) Any symbol whatsoever.

(vii) but
    This is normally used in conjunction with the any-symbol construct (vi) to express things like
    ::= comment [ but ;]*

(viii) not , ahead , back
    These are special purpose devices used occasionally for error recovery and for certain optimization tricks. Their meaning should be apparent (for a description see Section 2.3 on implementation).

The language has a few rules of formation (i.e., syntax) which are more than minimal, in the sense that certain artificial restrictions are placed on the collections of symbols which will be accepted in order to protect the programmer from minor lexical errors.

(ix) < , > , [ , ] , ; , | , ? , * , @ , # , list , separator , not , open , close , ahead , but may not be used as terminal symbols without being preceded by #; e.g., #< , #> ("<", ">" also allowed).

(x) Each production must be terminated by ";".

(xi) separator may not be used without list .

(xii) Only letters, digits, and spaces may appear between "<" and ">".

(xiii) The null symbol λ must always be represented by < >.
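The abbreviations (i), (ii), (iv) and (v) above can be read as small recognizers built from simpler ones. The following is a minimal Python sketch of that reading, not the translator described in this report: the function names (`term`, `seq`, `star`, `optional`, `list_of`, `list_sep`) and the position-returning convention are assumptions made for illustration, and no buffer rewriting is modeled.

```python
# Hypothetical helper names; each recognizer maps (tokens, pos) to the
# position after the match, or None when the construct is absent.

def term(symbol):
    """Recognize one terminal symbol."""
    def match(tokens, pos):
        if pos < len(tokens) and tokens[pos] == symbol:
            return pos + 1
        return None
    return match

def seq(*items):
    """Recognize the items one after another."""
    def match(tokens, pos):
        for item in items:
            pos = item(tokens, pos)
            if pos is None:
                return None
        return pos
    return match

def star(item):
    """<a>* : zero or more repetitions (never fails)."""
    def match(tokens, pos):
        while True:
            nxt = item(tokens, pos)
            if nxt is None or nxt == pos:   # stop on failure or no progress
                return pos
            pos = nxt
    return match

def optional(item):
    """<a>? : the item or the null symbol."""
    def match(tokens, pos):
        nxt = item(tokens, pos)
        return pos if nxt is None else nxt
    return match

def list_of(item):
    """list <a> = <a> <a>*"""
    return seq(item, star(item))

def list_sep(item, sep):
    """list <a> separator <b> = <a> [ <b> <a> ]*"""
    return seq(item, star(seq(sep, item)))
```

With `a = term("a")` and `b = term("b")`, `list_sep(a, b)` accepts all five symbols of `a b a b a` but stops after three symbols of `a b a b`, since a trailing separator with no item after it is not part of the list.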
One of the properties of BNF which TBNF conceals is the difference between left and right recursiveness; i.e.,

    <expression> ::= <expression> + <term> | <term>

being a left recursive production, implies that a term once found is to be absorbed into the <expression>. The right recursive production

    <expression> ::= <term> + <expression> | <term>

on the other hand, implies that the whole <expression> is to be strung out and finally reduced, term by term, from the right hand end of the string.

(xiv) Explicit instructions must be written in TBNF to distinguish the two possible cases of recursion:

    list <a> open     similar to right recursion; implies that the entire string of <a>'s be assembled before reduction.
    list <a> close    similar to left recursion; permits a reduction to be made after every <a> is found.

Kleene star (*) follows exactly the same convention. The default conditions are:

    (a) an action call immediately following implies open;
    (b) any other construct implies closed.

2.3 The Recognition Algorithm

A top to bottom recursive descent method has been chosen for the syntax analysis, for the following reasons:

(i) The algorithm presented here can recognize a large class of languages. It will be established in Chapter 4 that the method can accept at least all context free languages (i.e., languages which have a context free grammar).

(ii) The interpretation taken by the algorithm is immediately obvious from the syntax. There are many properties of grammars which are well formulated theoretically but which are exceedingly difficult to see in typical grammars. Being LR(k) or ambiguous, for example, are properties which are not in any sense of the word obvious in a grammar. On the other hand, experience has verified that programmers quickly develop a feeling for the meaning of what they write in TBNF and learn to program in it efficiently. It should be added that the preprocessing algorithm makes no attempt to detect ambiguity or any other elegant theoretical property of a source grammar.
This can be heralded as a feature, in view of the fact that several grammars submitted to the system are known to be ambiguous and have been consciously written that way to achieve a specific programming advantage.

(iii) The preprocessing task for a top to bottom parser is very simple to write. This property is of inestimable advantage when a programming language is still being formulated and a large number of syntaxes are tried on the way to the final result.

(iv) Top to bottom analysis facilitates the partitioning of the syntax of a language. For example, the code for translating declarations can be written almost independently of the code for translating arithmetic expressions. The two parts can then be joined together to form the whole language without parasitic side effects. This has "systems" significance which bottom up compilers seem to lack; viz., bottom up preprocessors typically require the entire grammar to be processed each time.

2.4 Distinctive Characteristics of This Algorithm

Certain seemingly trifling details bear some careful exploration because of the rather remarkable effect they have on the power of the algorithm (as discussed in Sections 4.2 - 4.4).

(i) Input terminal symbols are buffered and are accessed by an integer procedure FETCH. Once a nonterminal has been recognized, its constituents are removed from the buffer and are replaced by that nonterminal. Thus, if

    <a> ::= begin end

then the input configuration might be

    <1> begin end else

immediately before reduction and will be

    <1> <a> else

after the reduction. This means that the algorithm can easily be trapped into following wrong paths.* It is remarkable that this property of following wrong paths actually can be put to advantage and increases the class of languages which can be described by the system.

    * For example:
        <a> ::= <b> | <c> ;
        <b> ::= begin end ;
        <c> ::= begin end comma ;
    will be trapped because the algorithm will reduce the begin end pair to <b> without looking at the comma following.
(ii) The empty symbol (λ) is forced into the buffer; hence, for the productions

    <a> ::= <b> < > <c> ;
    <b> ::= b ;
    <c> ::= c ;

the system will make the following reductions on the input stream b c:

    1) b c
    2) <b> c
    3) <b> < > c
    4) <b> < > <c>
    5) <a>

This represents another way in which the system can be trapped. The empty symbol can be avoided altogether by using ? or * (as defined above). However, certain simplifications of semantics routines often result from putting a λ in a production.

2.5 Semantic Linkage

Traditionally, semantic routines have been subroutine calls which are placed at the end of productions. The reason for making such a restriction is that it is only at the end of a production that one really knows if the various components of that production have been found, and it is only at the end, therefore, that the semantics routine can reliably be called. However, since the programmer knows full well that the word go to cannot herald anything but a go to statement and that anything else is an error, it is reasonable to permit semantic calls at any point at all in the parsing scheme; for example, in

    <block> ::= begin <declaration>* list <statement> separator #; end ;

the most convenient place to insert the semantic action which opens block storage is immediately after the begin . Accordingly, semantic calls are permitted anywhere in a TBNF production except in a list construct. Thus, the above becomes

    <block> ::= begin @S3 <declaration>* list <statement> separator #; end ;

A semantic call is denoted by the marker @ which, in the implementation being described, must be followed by S or T and an integer; i.e.,

    <semantic call> ::= @ [ S | T ] <integer> ;

S for semantics: a pure semantic routine which does not interact with the syntax analysis.

T for test: a semantic routine which also performs a syntactic function (e.g., test for an appropriate declaration). In order for the syntax analysis to proceed on the present branch, a boolean variable "SEMANTICTEST" must be set to true .
For example, there corresponds a boolean procedure ATEST(R) which returns the value true if an <a> is present beginning at position R of the input buffer. Then the production

    <a> ::= <b> <c>

is translated into:

    boolean procedure ATEST(R) ;
    value R ; integer R ;
    begin
      integer N ;
      N := R ;
      ATEST := false ;
      if BTEST(N) then N := N + 1 else go to NXTALT ;
      if CTEST(N) then N := N + 1 else go to NXTALT ;
      DELETE(R, N-R) ;
      ATEST := true ;
      INPUTBUFFER[R] := AIDNO ;
    NXTALT:
    end ;

There are two auxiliary procedures required by the process. The first is the integer procedure FETCH(n), which accesses the n-th element of an input buffer. It calls the scanner when required to read a new symbol from the input stream.

The second procedure, DELETE(m,n), is concerned with moving data within the input buffer when reductions are performed on the input. Its function is to collapse n elements of the input buffer, beginning at position m, into one. Hence, after DELETE(m,4), the configuration

    A B C D E
    ↑
    m

of the input buffer becomes

    A E
    ↑
    m

Terminal symbols, of course, do not require a boolean procedure to represent them. For "+" we have

    if FETCH(N) = "+" then N := N + 1 else go to NXTALT ;

Given these primitive operations, it is easy to generalize the process to obtain Algol code which is equivalent to the more sophisticated TBNF instructions indicated below.
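The primitives just described can be sketched in Python to make the buffer rewriting concrete. This is an illustrative reading, not the B-5500 implementation: the class name `Recognizer`, the list-based buffer, the `"<eof>"` sentinel, and the assumption that <b> and <c> are defined by the single terminals b and c are all introduced here for the example. `delete(m, n)` follows the A B C D E example above, collapsing n elements starting at m into the single slot m.

```python
class Recognizer:
    """Hypothetical sketch of FETCH, DELETE and ATEST for <a> ::= <b> <c>."""

    def __init__(self, symbols):
        self.source = iter(symbols)   # stands in for the scanner
        self.buffer = []              # the input buffer (0-indexed here)

    def fetch(self, n):
        # Return the n-th buffer element, reading new symbols on demand.
        while len(self.buffer) <= n:
            self.buffer.append(next(self.source, "<eof>"))
        return self.buffer[n]

    def delete(self, m, n):
        # Collapse n elements starting at m into the one slot at m.
        del self.buffer[m + 1 : m + n]

    def reduce_terminal(self, r, terminal, name):
        # Assumed rule <name> ::= terminal ; rewrites slot r on success.
        if self.fetch(r) != terminal:
            return False
        self.delete(r, 1)             # one element collapses to itself
        self.buffer[r] = name
        return True

    def b_test(self, r):
        return self.reduce_terminal(r, "b", "<b>")

    def c_test(self, r):
        return self.reduce_terminal(r, "c", "<c>")

    def a_test(self, r):
        # Mirrors the Algol ATEST: failure plays the role of go to NXTALT.
        n = r
        if not self.b_test(n):
            return False
        n += 1
        if not self.c_test(n):
            return False
        n += 1
        self.delete(r, n - r)
        self.buffer[r] = "<a>"
        return True
```

On the input `b c x` the call `a_test(0)` succeeds and leaves `<a>` in slot 0. On the input `b x` it fails, but slot 0 already holds `<b>`: the sketch, like the algorithm it imitates, does not undo a reduction made on an abandoned path.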
(i) <a>* open =

    L: if ATEST(N) then
       begin N := N + 1 ; go to L end ;

(ii) <a>* closed =

    L: if ATEST(N) then
       begin DELETE(N-1, 1) ; go to L end ;

(iii) <a>? =

    if ATEST(N) then N := N + 1 ;

(iv) list <a> open =

    second := false ;
    L: if ATEST(N) then
       begin second := true ; N := N + 1 ; go to L end ;
    if not second then go to NXTALT ;

(v) list <a> separator <b> open =

    second := false ;
    L: if ATEST(N) then
       begin
         second := true ; N := N + 1 ;
         if BTEST(N) then begin N := N + 1 ; go to L end
       end
       else N := N - 1 ; †
    if not second then go to NXTALT ;

    † This final "N := N - 1" is required to ensure that the separator of a list genuinely punctuates a list:
        C A B A B A      is acceptable
        C A B A B A B    is truncated

(vi) ahead t =

    if FETCH(N) ≠ "t" then go to NXTALT ;

(vii) not t =

    if FETCH(N) = "t" then go to NXTALT ;

(viii) any symbol =

    N := N + 1 ;

(ix) but t =

    if FETCH(N-1) = "t" then go to NXTALT ;

(x) Bracket construct. In principle this can be handled as though the pseudo production within brackets were written out as a full production. Thus,

    [ <a> | <b> | <c> ]

will be translated to

    if DUMMY(N) then N := N + 1 else go to NXTALT ;

where DUMMY is the procedure which results from translating

    <dummy> ::= <a> | <b> | <c> ;

This establishes that the bracket construct does have some equivalent Algol code. The efficient translation of the bracketed construct will be the subject of a separate section.

(xi) The final constructs which are to be implemented are the various forms of the semantic action. Classically, these are the numbered routines:

    @S <integer>, which is a call on the semantic routine <integer>, becomes simply:

        SEMANTIC(<integer>, R, N) ;

    @T <integer> is the same as the above except that a global variable SEMANTICTEST must be set true ; i.e., in Algol,

        SEMANTIC(<integer>, R, N) ;
        if not SEMANTICTEST then go to NXTALT ;

These are very straightforward and do not differ from the usual table driven methods.
However, translation to Algol code permits the use of another form of the semantic routine which is very efficient for some applications. To implement the semantic routine "blockpointer := blockpointer + 1", for example, requires overhead to:

    (a) enter a procedure with at least one parameter passed;
    (b) branch on the semantic routine number (something like an Algol switch or Fortran computed go to);
    (c) branch to the end of the block;
    (d) exit the procedure.

Even with a machine like the Burroughs B-5500 the overhead is about three times as long as the kernel code, but it may be completely avoided by using Q routines, which copy the Algol code directly into the code being produced.

The syntax for these routines is:

    ::= "@" [ S | T | Q ] "[" ... "]" ;

(Note that @, [, ] must be enclosed in quotation marks, or be preceded by a #, because they are also elements of the metalanguage.) Thus,

    begin @Q[blockpointer := blockpointer + 1; ....]

becomes

    if FETCH(N) = beginword then N := N + 1 else go to NXTALT ;
    blockpointer := blockpointer + 1 ;
    . . .

3.3 Optimization

Like most optimizations, the ones described for this language are ad hoc, comprising the detection and clever coding of cases which appear frequently. The prime target for optimization is, of course, the bracketed construct. By using [ . . . ] a programmer is admitting to the machine that the dummy production implied by the brackets is either very simple or occurs in very few places in the syntax.

Note, however, that to be logically consistent it is imperative to preserve the equivalence of [ . . . ] and dummy productions. Thus,

    <expression> ::= list [ list <primary> separator [ × | / ] @S3 ] separator [ + | - @S4 ] ;

is functionally equivalent to

    <expression> ::= list <term> separator <adding operator> ;
    <adding operator> ::= + | - @S4 ;
    <term> ::= list <primary> separator <multiplying operator> @S3 ;
    <multiplying operator> ::= × | / ;

Semantic routine 3 assumes the input buffer configuration defined by its local context; i.e., the configuration defined within the square brackets of which it is a part.

The optimization game is, of course, to implement more efficient code while preserving this conceptual neatness. It is desirable to reduce the number of calls on the utility routines (FETCH, DELETE, etc.) as much as possible. All of the optimizations described below are directed at reducing the number of these calls.

(i) The alternatives on the right hand side of a production will, in some cases, occupy only one place of the input buffer during execution; e.g.:

    ::= * list | [ ] | < > | <c> @Q[;] ;

Each alternative on the right hand side occupies one place in the input buffer, so the DELETE procedure need never be called in the recognition of this production.

(ii) In some uses of the closed Kleene star, two calls on DELETE are implied; e.g., in

    <a> ::= [ <x> <y> ]* ;

DELETE is called once to contract <x> <y> to [ <x> <y> ], and again to close up the input buffer as required by the Kleene star construct. The code generated need only perform one deletion, and it does not seem difficult to automatically emit the following code to recognize <a>:

    L: if XTEST(N) then N := N + 1 else go to NXTALT ;
       if YTEST(N) then N := N + 1 else go to NXTALT ;
       DELETE(N-3, 3) ;
       go to L ;
    NXTALT: ....

(iii) In similar vein, it is possible to avoid one call on the procedure DELETE when the production in brackets occurs at the end of an alternative; e.g.,

    <if statement> ::= if <boolean expression> then <statement> [ else <statement> ]? ;

(iv) Only one call on the procedure FETCH need be made when each alternative of a production begins with a distinctive terminal symbol; e.g.,

    <statement> ::= if .... | go to .... | for .... | begin .... ;

Then it is possible to emit Algol code like

    temp := FETCH(N) ;
    if temp = ifword then .... else go to NXTALT ;
    NXTALT: if temp = gotoword then .... else go to NXTALT1 ;
    NXTALT1: if temp = forword then .... else go to NXTALT2 ;
    NXTALT2: if temp = beginword then .... ;

Notes:

(a) It is worth pointing out that for a small list the above sequential search is the fastest code to find the appropriate "branch" of the tree. It is clearly faster than a table lookup because it involves no subscripted references.

(b) All of the above optimizations can be implemented in a one pass compilation. Some of these are already present in the current B-5500 implementation. Other, more global optimizations clearly require more than one pass.

4. THEORETICAL RESULTS

4.1 Definitions

(i) Translatable Backus Naur Form (TBNF) is BNF extended in the way described in, and interpreted in the manner of, Chapter 2.

(ii) A recursive descent (RD) machine is an automaton which executes TBNF in the manner of Chapter 3. We also require that the stack of the machine be a linear function of the length of the input string; i.e., only a finite number of nonterminals can be stacked up.

4.2 Discussion

The author claims that the restrictions imposed on BNF do not, in fact, restrict its capacity to describe languages. It would be expected that TBNF, being equivalent to a subset of BNF, has a generative capacity somewhat less than BNF. However, because of the way in which the syntax is interpreted, the RD machine is actually as powerful as a linear bounded automaton (LBA). Thus, the class of languages describable in TBNF is not merely the class of context free languages, but the class of context sensitive languages. This rather remarkable result is a direct consequence of the algorithmic interpretation imposed on TBNF; the algorithm happens to be sufficiently general to be used as a LBA, albeit with a very devious instruction set.

4.3 Theorem 1: A Lower Bound For the Power of a RD Machine

A RD machine is as powerful as a LBA.

4.3.1 Discussion of the Theorem

The method of proof is constructive. Given a set of LBA instructions, it is possible to build a grammar which imitates those instructions.
Much use will be made of several specialized features of the RD machine (RD grammar). Particular note should be taken of the following peculiarities.

(i) <a> ::= < > effectively forces the nonterminal <a> into the input stream. Thus, it is possible to write on the input string. Likewise, <a> ::= <b> ; is a production which will cause the machine to overwrite <b> with <a>.

(ii) ahead , back , not , but do not cause any reductions to be made. They are, so to speak, read only productions; thus,

    : : = ahead ; ::= ;

is a method of writing a nonterminal only when a given symbol is present in the input stream: ahead merely checks for the symbol, then the second production forces a change in naming.

(iii) It has been pointed out before that the system will not unpick a reduction once it has been made. This means that an erroneous path taken in parsing can nevertheless make reductions and, hence, alter the input stream. Thus, a production whose right hand side ends in a special failing nonterminal may well be a device for writing on the input, where the special nonterminal is defined by

    ::= η ;

with η ∉ V_T, the set of terminal symbols. The special nonterminal causes the RD machine to abandon any attempt to reduce the current alternative.

4.3.2 The LBA

As a canonical example of a LBA, we take a machine M with a finite number of states Q_1, ..., Q_n and a read/write head which can advance, reverse, read, or write on an input tape. The behavior of the machine can be described by a set of rules of three types:

(i) R: (Q_i, I_l) → (Q_j, I_k)
    In state Q_i, read I_l, go to Q_j, and write I_k;

(ii) R: (Q_i, I_l) → (Q_j, +)
    In state Q_i, read I_l, go to Q_j, and advance the read head;

(iii) R: (Q_i, I_l) → (Q_j, -)
    In state Q_i, read I_l, go to Q_j, and reverse the read head.

The LBA is in its initial state when switched on, and the read head is poised above the first input character. It finishes by going into a final state.

For each state of the machine Q_i, introduce nonterminals , , , < and for each terminal symbol t_i introduce a nonterminal representative .

4.3.3 Proof of the Theorem

(i) First, the entire input stream must be entered into the input buffer and be punctuated with spaces which will be used later to record the state of the machine.

    : : = | ; : : = list [ | | .... | ] ^1 — •— 1 ' 2 ' ' m separator [ "^q-i^ ] open ;

(ii) For each rule R: (Q_i, t) → (Q_j, s) of the LBA write

    : : = ahead < q. > [ ahead [ ] ? J ] ;

The third line of the production restores the state if the input stream fails to have the required symbol. Note that productions will be introduced later to allow any state to be reduced to the required state.

(iii) For LBA instructions of the form R: (q_i, t) → (q_j, +) write

    ::= ahead [ ahead

The mechanism is similar to (ii).

(iv) For LBA instructions of the form R: (q_i, t) → (q_j, -) write

    ::= ahead x [ ahead I ] ;

(v) Each of the productions above is designed to execute one instruction of the machine. Now a production is defined which administers the above three types of productions and causes the machine to advance or reverse on the input stream.

    ::= | m ahead [ ] (i=l,...n) °> ahead 1 1 ahead | ahead ^ S(i=l,...,n) .... (E^) [[ ahead | ] ] | ahead ^^ ; ahead | ahead | ahead ; J . (El) . (E2) (i=l, ...,n) .... (E3) . (E5)

Notes on these rules:

    E1: These rules apply each of the instructions in the repertoire of the machine;
    E2: If the machine is to advance on the input string (advance marker present), then mark the input with the present state and apply the rules again;
    E3: Apply the rules again;
    E4: A reversal is produced by backing up the right recursive production, where a back up is marked by a reversal marker and the productions scan forward to retrieve the previous state of the machine;
    E5: The final acceptance state.

(vi) Lastly, the productions which allow state exchange:

    > ::= ahead "N ahead <( ^ > <( l^ > I \ x = +, -, ahead < (L* > J | < > ; for i=l, ...,n.

The marker also functions as a state, so

    ::= ahead ; Vi, * 1 i

Finally, the terminal representative:

    l = t . ; i = l> • • • > m «

The rules for transforming a LBA instruction set to an equivalent RD instruction set are thus established, which completes the proof.

4.3.4 The Converse of Theorem 1

By storing the possible parses on the input tape, it is trivially possible to imitate a RD machine on a LBA.

4.4 Theorem 2: Ambiguity

There exists no effective procedure for deciding whether or not any given context free grammar G is ambiguous [8].

Proof: Suppose that there exists an algorithm A which, when presented with any grammar G, can decide whether or not G is ambiguous. Then select arbitrary pairs of strings (f_i, g_i) (i = 1, ..., n) of elements from some set V = {a, b, c, ..., z} and define G as

    : : = | ::= f x g x | f 2 <*> g 2 I f g | i n B n ' r : : = a a b b | z z | /

where ĝ_i is the reverse of the string g_i and the remaining symbol is a center marker, written c below. Then ambiguity in this grammar corresponds to the existence of some string of elements from V which is a terminal derivative of both of the alternatives of the first production. Therefore, the algorithm A must be able to decide whether there exists a string S:

    S = f_{i_1} f_{i_2} ... f_{i_n} c ĝ_{i_n} ... ĝ_{i_2} ĝ_{i_1}
      = a_1 a_2 ... a_p c a_p ... a_2 a_1
      = h c ĥ

That is, whether there exists a string h:

    h = f_{i_1} f_{i_2} ... f_{i_n} = g_{i_1} g_{i_2} ... g_{i_n}

which is exactly Post's correspondence problem, which, in turn, is equivalent to the halting problem [4]. Hence, the algorithm A does not exist.

4.5 Some Results on Timing

It is difficult to obtain meaningful theoretical comparisons between various parsing methods. In practice, experts have discovered that no system has a clear cut advantage over all of its competitors [1]. Each system, in effect, takes advantage of a certain corner of the very large data base consisting of language specifications and the program in hand. Top down methods tend to rely on the intrinsic properties of the language concerned. Bottom up methods rely more on the program being translated.
There is no evidence that a system which combines the two approaches will necessarily run faster in practice than either of the two component parts.

Part of the reason for this is that for a large subset of precisely defined languages (e.g., programming languages) the two algorithms follow almost identical paths in some cases. Consider, for example, the following production:

    <statement> ::= if .... | for .... | begin list .... | go to | list [ <variable> := ] .... ;

Typical top to bottom and bottom to top techniques will both do a sequential search on the key words ( if , for , begin , etc.) to decide which branch of the syntax tree to take. The subsequent stacking operations are likely to be similar in both methods. For such a class of structures, it is largely irrelevant which method is being used.

There are cases, however, for which top down methods can be distinctly inferior. These are typically cases in which there exists a great deal of hierarchy or, in other terms, where there is a very definite precedence relation between a large number of operations. Consider, for example, a precedence structure with n operators P_1, ..., P_n with the precedence relations

    <E_n> ::= <E_{n-1}> [ P_n <E_{n-1}> ]* ;
    ...
    <E_1> ::= N [ P_1 N ]* ;

Using the top down method, a stacking operation will be required for each change in level of precedence. Thus, N P_i N P_{i+3} ... will involve the system in three stacking operations as it winds up and down the syntax tree. Assuming the operators are randomly distributed, we have the following result: in n of the n^2 possible juxtapositions of operators, there will be no stacking operations; and in 2(n-i) cases, there will be i stacking operations (i = 1, ..., n).

Forming the average number of stacking operations per operator we get:

    \frac{1}{n^2} \sum_{i=1}^{n-1} 2(n-i)\, i = \frac{n}{3} - \frac{1}{3n} \approx \frac{n}{3}

By contrast, a precedence method can get away with approximately 1/2 stack operation per operator:

    \frac{1}{n^2} \sum_{i=1}^{n-1} i = \frac{1}{2} - \frac{1}{2n}

Parentheses can be incorporated into either system with equal efficiency.
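The two averages above can be checked by brute-force enumeration. The following is a small Python sketch under stated assumptions: the top down cost of an adjacent operator pair is taken to be the difference of their precedence levels, as in the counting argument above, while the precedence-method cost model (one stack operation whenever the second operator has the higher level) is an assumption chosen here to match the closed form, not something spelled out in the text. The function names are hypothetical.

```python
from fractions import Fraction

def top_down_average(n):
    """Average stacking operations per operator pair for top down parsing:
    |i - j| operations when operators of levels i and j are adjacent."""
    total = sum(abs(i - j)
                for i in range(1, n + 1)
                for j in range(1, n + 1))
    return Fraction(total, n * n)

def precedence_average(n):
    """Assumed model for a precedence method: one stack operation
    whenever the second operator has the higher level."""
    pushes = sum(1
                 for i in range(1, n + 1)
                 for j in range(1, n + 1)
                 if j > i)
    return Fraction(pushes, n * n)

# Compare the enumerations with the closed forms for a range of n.
for n in range(1, 30):
    assert top_down_average(n) == Fraction(n, 3) - Fraction(1, 3 * n)
    assert precedence_average(n) == Fraction(1, 2) - Fraction(1, 2 * n)
```

Exact rational arithmetic is used so that the comparison with n/3 - 1/(3n) and 1/2 - 1/(2n) is free of rounding error.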
Note that precedence techniques assign precedence levels to "(" and ")". This is not done here, and the top down methods handle such problems by recursion, a task involving 1 stacking operation.

5. CONCLUSION

The system described above has proved to be a satisfactory tool for syntax directing the first pass of a compiler. It could be trivially modified to be used in later passes if that were desired. It also has application in certain translation processes (e.g., translating from a matrix language into PL/I). However, the extensive use of [ <any> but ... ]* constructs suggests that a scan operation should be included as a legitimate element of the syntax. This is one of several additions which have been examined for TBNF which, although not changing the recognizing ability, make for faster and neater code (in the spirit by which TBNF arose from BNF).

The decision to translate TBNF into Algol code is a significant departure from a table driven scheme. Exponents of table driven methods will undoubtedly see this as a retrograde step. However, the advantages to be gained by table driving are somewhat undermined by a comparatively clear algorithmic language. For example, there is no point in interpreting a table when it is just as easy to recompile from the source code. The obvious next step is to translate directly from TBNF to machine code, a step which could improve speed by another factor of 3.

APPENDIX A

RECURSIVE DESCENT, TOP TO BOTTOM ANALYSIS

Number the components of the right hand side of a production as follows.
         1   2   3
    ::=              |          ;
                         1   2

Then define a (recursive) procedure "presence" which maps the cartesian product of $V_N$ and I, where

    $V_N$ = set of all nonterminals in the grammar
    I = {natural numbers},

onto the set {true, false}; i.e.,

    presence: $V_N \times I \rightarrow$ {true, false}

If there is a production such as the one above, then the following relation will exist between certain pairs in $V_N \times I$; viz., from above,

    presence( , n) = presence( , n) ∧ presence( , n + 1) ∧ presence( , n + 2)
                   ∨ presence( , n) ∧ presence( , n + 1)

(Note that "=" is used here in the usual assignment sense and not in the mathematical sense; i.e., the left hand side is found by computing the right hand side in a left to right fashion.)

For a terminal symbol t define

    presence(t, n) = (FETCH(n) = t)

where FETCH(n) is the n-th symbol of the input string.

The recursive descent (RD) method proceeds by repeatedly rewriting a given presence function with its equivalent right hand side. The recursive process stops when a presence function finally degenerates into a terminal test.

In the method used in this paper, it should be observed that once a presence function has been completely evaluated (i.e., successfully recognized), part of the input stream is rewritten with the nonterminal which has been recognized. This has rather important consequences, as explained in Section 2.3.

APPENDIX B

THE SYNTAX SPECIFICATION OF TBNF

This section rigorously defines the language TBNF and contemporaneously defines the algorithm for accepting strings in this language. The semantic routines are also sketched.

::= list end ; ::= "<" * but ">" ; but ";" but "|" but "[" but "<" but "]" but "#" but ""*■ but ">" but #list but #separator but #ahead but #back but #but but #open but #close but #not but "*" but "@" but "?" "#" | """ [ but """ ]* """ ; ::= | "[" "]" | "[" | ; #list ] ? [ @S44 #open @S44 #close @S45 | @S45 ] < > #list #ahead #ahead #but #but #not [ "?" "*" [ @S44 #open @S44 #close < > < > ] | "<" #any ">" @S46 | @S46 ] "H

comment Note that the convention of open and closed Kleene stars is repeated here because the semantic actions are different;

::= "@" [ S | T ] | "@" [ S | T | Q ] "[" "]" "@" ; ::= [ but "]" but "[" | "[" "]" ]* ;

comment Note that the recursive method of matching [ with ] is not the most efficient way of doing it;

::= ::= ";" ; ::= list separator "|" close ; ::= list ; ::= [ but "|" but ";" ]* @S3 ;

APPENDIX C

THE TBNF GRAMMAR FOR THE LANGUAGE DEMALGOL, AND THE CORRESPONDING ALGOL CODE PARSER

As a further example of the syntax of a language written in TBNF, we take a very simple version of Algol 60 which is used by the proponents of Translator Writing Systems to compare various systems under development at the University of Illinois. Notice the error recovery built into the syntax and the use of semantic tests to determine the exact path to be taken under certain circumstances. It is possible to avoid the use of semantic tests if desired; however, this example is designed to adhere closely to operational conditions, in which this information is available and provides a neat way of parsing the source string.

The grammar presented below parses at 1200 cards per minute on the B-5500 (not including scanner time).

COMMENT THE FOLLOWING LANGUAGE IS USED AS A BENCH MARK BY TWS DESIGNERS AT THE UNIVERSITY OF ILLINOIS. NO ATTEMPT HAS BEEN MADE TO REDUCE THE NUMBER OF NONTERMINALS OR PRODUCTIONS UNDERSTOOD NOMENCLATURE, THE LANGUAGE REQUIRES THREE PRODUCTIONS FOR ITS DEFINITION I OFMALGO Ma i I I* BEGIN * LIST SEPARATOR fj END I tt« [ INTEGER •0(TyPEI«TNTFGFRTyPEI J/ BOOLEAN fQCTYPEl-BOOLEANTYPEl ]/ LABEL •0[TVPE»«LABELTYPF I]] [LIST [<*!» RS[ENTFR(Pl,TYpE)in SEPARATOR . / < Error > i j • »■ f I 1 * f t GO TO / GOTO 1

* = < > | | |

(ii) Brooker and Morris's question mark to mean the optional presence of some symbol: ?
= | < > (note that & is regarded as equivalent to ?)

(iii) List: is merely a different notation for: *

(iv) List separator = [ ]*
There are certain conventions which apply to list and * operations: these are discussed in the section on implementation.

(v) Brackets [ ]: delimit groups of symbols--(iv) above is an example. The production: is entirely equivalent to :: [ ] = = <3>
Naturally brackets can be nested to any depth--behold the following compact production for a Boolean expression:

    ::= list [ list separator [∧ | and] ] separator [∨ | or]

(vi) <any> = any symbol whatever

(vii) but t: this is normally used in conjunction with <any> to express things like

    ::= comment [ <any> but #; ]*

(viii) not t, ahead t: These are one symbol lookahead instructions which check for the absence or presence of the terminal t. Any number of not's may be used but only one ahead. Their use is largely in code optimization (of object code produced) and, to a more limited extent, to fudge EXEC calls. Thus:

    ::= not #; @S23 list - - - ;

It is important that action 23 not be called if an is not present, so the not #; prevents it being called in this case.

(ix) back: This is a one symbol lookback which is provided but appears never to have been used.

(x) Calls on semantic actions. These are of two types, both preceded by @ immediately followed by "S", "T", or "Q". Semantic actions are the subject of a separate section.

Syntax Rules

There are certain syntax rules of the BNF input which are provided mainly to avoid ambiguous productions, but also to protect the user to some extent (e.g., separator occurring by itself could be regarded as a terminal of the language defined--it seems more reasonable to prohibit its use in this way).

(i) <, >, [, ], ;, |, &, list, *, separator, not, open, close, but, ahead, any, @, # may not be used in the grammar without being preceded by #--hence #<, #>, #&, ##.
(ii) each production must be terminated by ;

(iii) separator may not be used without a preceding list

(iv) exits to semantic routines are designated by @. This will be discussed later, as will the words open and close, which also have special significance.

(v) only identifiers and spaces may appear between "<" and ">"

(vi) < > must always be used to represent the null symbol.

Alpha Procedure SCAN;

    SCANMODE =      normal operating mode: identifiers are recognized and entered in a table called BIGTAB; the address in BIGTAB is returned. Numbers are recognized and assembled and the address in BIGTAB is returned.
    SCANMODE = 1    same as except blanks or repeated blanks are also recognized separately as a single blank.
    SCANMODE = 2    returns each character except blanks, which are ignored.
    SCANMODE = 3    same as except multiple blanks are reduced to single blanks.
    SCANMODE = 4    same as except identifiers, etc. are not entered in BIGTAB.
    SCANMODE = 5    return every character.

There are a number of Boolean variables which also control the action of the scanner and, if set to true, cause the following:

    LETTERCHAR    no assembly of identifiers; i.e., each letter is returned separately.
    DIGITCHAR     each character of a number is returned separately with no formation of the number.
    STRINGCHAR    no assembly of strings.
    IDBLANKS      ignore blanks in an identifier.

Each of the above is normally set false.

    FRSTCOL       the first column of each card to be read, normally = 1.
    LASTCOL       the last column read on each card, normally = 72.

The scanner operates with reference to a table called CHARCLAS. Thus to set "-" as an alphabetic character:

    CHARCLAS["-"].ALPHABETIC := 1;

Similar assignments may be made for ALPHANUMERIC, etc.

References to output of the scanner are by way of the special nonterminals:

    <*I> meaning identifier
    <*N> meaning integer
    <*R> meaning real
    <*S> meaning string

Executive Calls

SEMANTICS

For historical reasons two separate types of semantic routines are accepted.
The first is referred to by number:

    SEMANTIC (N, FIRST, LAST);
    value N, FIRST, LAST;
    integer N, FIRST, LAST;

This will presumably be a procedure which switches on the parameter N, using FIRST to point to the first element of the production and LAST to the final element. It is called from the syntax by one of three entries:

    @ S    a call on the semantic routine number , i.e., SEMANTIC (, - - -)
    @ T    the same as @ S, except that it requires a Boolean variable SEMANTICTEST to be set true in order for syntactic analysis to proceed.

The second type of call is an inline semantic routine:

    ::= #@ [ S | T | Q ] #[ #] ;

When the syntax is translated into Burroughs ALGOL, the inline semantic routines are inserted in the code produced.

Within a semantic routine, the various parts of a production may be referred to in two ways:

(i) communication array COMM[1:100]: FIRST points to the beginning of the production in COMM, LAST points to the end of it.

(ii) PM3, PM2, PM1, P0, P1, P2, P3, ..., P9:

    define PM3 = COMM[FIRST - 4] #,
           P0 = COMM[FIRST - 1] #,
           P1 = COMM[FIRST] #,
    thus   P9 = COMM[FIRST + 8] #;

    ::= if then [else statement]?
    P1 P2 P3 P4 P0 P1 P5

Note that within [ ] the nomenclature changes--bear in mind the equivalence of [ ] and dummy productions.

Examples: ::= [ ::= < > ; will always be found.

Note this example:

    ::= begin list separator #; end | begin list list separator #; end | < > ;

Because ::= ---- < > it will always be found when sought. Thus the system will erroneously find a between the begin and declaration in the string: " begin system forces in real ";

There are two types of list, as has been mentioned before. They are the open and closed types. In an open list every member of the list is present in the communication array:

    + + +

In a closed list each succeeding + is deleted as formed. Open Kleene stars are similar and may take 0, 1, 2, 3, ... locations of COMM. Closed Kleene stars occupy zero positions always.
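The difference between the two kinds of list can be pictured with a small model of the communication array. The Python sketch below is an illustration under stated assumptions, not the B-5500 implementation: COMM is modelled as a plain Python list, a closed list is assumed to leave exactly one slot behind (which member survives is a guess made for this sketch), and the function names are hypothetical.

```python
def collect_list(members, separator, mode):
    """Return the COMM slots left behind by `list ... separator ...`.

    members   -- the recognized list members, in order
    separator -- the separator symbol between members
    mode      -- "open" keeps every member and separator;
                 "closed" deletes the earlier member and separator
                 as soon as the next member has been formed
                 (assumption: only the latest member remains)
    """
    if mode not in ("open", "closed"):
        raise ValueError("mode must be 'open' or 'closed'")
    comm = []
    for index, member in enumerate(members):
        if index > 0:
            if mode == "closed":
                comm.clear()            # earlier member is deleted as formed
            else:
                comm.append(separator)  # open list keeps the separator too
        comm.append(member)
    return comm


def collect_star(members, mode):
    """Return the COMM slots left behind by a Kleene star.

    An open star occupies 0, 1, 2, ... slots (one per repetition);
    a closed star always occupies zero slots.
    """
    if mode not in ("open", "closed"):
        raise ValueError("mode must be 'open' or 'closed'")
    return list(members) if mode == "open" else []


if __name__ == "__main__":
    print(collect_list(["a", "b", "c"], "+", "open"))
    print(collect_list(["a", "b", "c"], "+", "closed"))
    print(collect_star(["x", "y"], "open"))
    print(collect_star(["x", "y"], "closed"))
```

The practical consequence is that the number of COMM slots between FIRST and LAST, and hence the meaning of names such as P2 or P3 in a semantic routine, depends on which kind of list or star precedes them in the production.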
An open list is indicated by writing open after the list | list separator construction and closed by the word closed. The default options are: action call immediately following -> open; otherwise closed.

    * @S3       open
    *           closed
    * open      open
    * closed    closed

Note that in counting the number of elements in a production, ? will count as one only if it is in fact present:

    ?    3 or 4 elements.

Each procedure so defined has an identification number; e.g.,

    IDNOARRAYIDENTIFIER

as does each reserved word of the language. Every in the program is taken as reserved and may be referenced by WORD; e.g.,

    WORDBEGIN
    WORDEND

Interaction With Semantics

The system is at least as flexible as ALGOL because it contains ALGOL as a subset. It is possible to initiate the search for a particular nonterminal from within a semantic routine. Note this example:

    ::= (@S [seek parameters]);

where procedure "seek parameters" looks up the parameter in a given position and calls the procedures TESTARRAYIDENTIFIER(FIRST + 1) or TESTPROCEDUREIDENTIFIER(FIRST + 1), etc.

Error Recovery

The user is expected to insert his own error recovery in the syntax. Thus:

    declarations ::= integer [ list <*I> separator , | ] ;
    ::= [ <any> but #; ]* ;

Control Cards

The control cards for the ALGOL compiler (which compiles | SOURCE) and TWST/COMPILE should be on separate cards. Amongst others there are:

    $ LIST                 causes the syntax to be listed as processed.
    $ NOLIST               inhibits listing (listing is assumed).
    $ TRACE2 or $ DEBUG    must be turned on at generation and running of the compiler to cause the resulting compiler to list what it is seeking and when it finds particular nonterminals.
    $ TRACE3 or $ EXEC     as above; causes exits to semantics to be printed.
    $ TRACE9               lists scanner output.
    $ RESERVE              will set the reserve word option as the presumed setting in the compiler. It is already assumed--the alternative is the special word designation.
    $ SPECIAL              to set a given symbol as special word designator; e.g., $ SPECIAL . means every word of the language which is reserved must be marked by a period (full stop).

Note that there are no facilities at present for making only some words reserved, whereas the syntax can often be designed to circumvent the need for reserved words.

In order for TRACE2 or TRACE3 to operate they must be turned on at both compiler building and run time.

In this case the complication may be circumvented by reordering the production:

    ::= begin list list separator #; end | begin list separator #; end | < > ;

Many of the traps associated with < > can be avoided by using * or ?.

Assembly Time and Execution Time

The syntax can be processed at about 60 cards/minute for large grammars. Most of this time is spent in formatting and it is hoped to improve this figure. The resulting compilers should operate at a minimum of 1000 cards/minute parsing time. Cleverly written grammars can almost double the parsing speed.

LIST OF REFERENCES

[1] Rosen, S., "A Compiler-Building System Developed by Brooker and Morris", Comm. ACM 7 (1964), p. 403.

[2] Kleene, S. C., "Representation of Events in Nerve Nets and Finite Automata", in Automata Studies, Annals of Mathematics Studies, Number 34, Princeton University Press (1956).

[3] Budnik, P. P., Kuck, D. J., Muraoka, Y., Northcote, R. S., Wilhelmson, R. B., "The Tranquil Language and Its Compiler", Proc. SJCC (1969).

[4] Post, E. L., "Recursive Unsolvability of a Problem of Thue", Journal of Symbolic Logic 12 (1947), p. 1.

[5] Earley, J., "An Efficient Context-Free Parsing Algorithm", Ph.D. Dissertation, Dept. of Computer Science, Carnegie-Mellon University (August, 1968).

[6] Kasami, T., "An Efficient Recognition and Syntax-Analysis Algorithm for Context Free Languages", Report No. R-257, Coordinated Sciences Laboratory, University of Illinois, Urbana, Illinois (March, 1966).

[7] Younger, D. H., "Context Free Language Processing in Time $n^3$", IEEE 7th Annual Symposium on Switching and Automata Theory (1966).

[8] Hartmanis, J., "Context-Free Languages and Turing Machine Computations", Proceedings of Symposia in Applied Mathematics, Vol. 19, American Mathematical Society (1967).

DOCUMENT CONTROL DATA - R & D

    Security classification:          UNCLASSIFIED
    Originating activity:             Department of Computer Science, University of Illinois, Urbana, Illinois 61801
    Report title:                     A BNF LIKE LANGUAGE FOR THE DESCRIPTION OF SYNTAX DIRECTED COMPILERS
    Descriptive notes:                Research Report
    Author:                           Harold Robert George Trout
    Report date:                      January 13, 1969
    Total no. of pages:               73
    Contract or grant no.:            46-26-15-305
    Project no.:                      USAF 30(602)4144
    Originator's report number:       DCS Report No. 300
    Distribution statement:           Qualified requesters may obtain copies of this report from DCS.
    Supplementary notes:              NONE
    Sponsoring military activity:     Rome Air Development Center, Griffiss Air Force Base, Rome, New York 13440
    Key words:                        The Syntax Language; BNF (Backus Naur Form); The Language TBNF