Digitized by the Internet Archive in 2013
http://archive.org/details/genesisacompiler854hoef

UIUCDCS-R-77-854
UILU-ENG 77 1707

Genesis--A Compiler Generator Using Language Segmentation

March 1977

by Jay Philip Hoeflinger

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS

GENESIS--A COMPILER GENERATOR USING LANGUAGE SEGMENTATION

BY
JAY PHILIP HOEFLINGER
B.S., University of Illinois, 1974

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1977

Urbana, Illinois

ACKNOWLEDGMENT

I would like to express my gratitude to my thesis advisor, Professor Garry Kampen. He provided many stimulating ideas and much guidance during the preparation of this thesis.

I would also like to express my sincere appreciation to my wife, Donna, whose support, patience, and hard work have made the job of preparing this thesis much easier.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Computing Environment
   1.2 Motivation
   1.3 Background
   1.4 Organization of this Thesis
2. OVERVIEW
   2.1 Genesis Description
   2.2 Segmentation
   2.3 Example
3. SYSTEM OPERATION
   3.1 General
   3.2 Segment Generator
   3.3 Linker
4. THEORETICAL BASIS
   4.1 Choice of Parsing Technique
   4.2 SLR(1) Parsing and Table Construction
   4.3 Basis for SLR(1) Construction
   4.4 Formal SLR(1) Definitions
   4.5 The Combinability Criterion
5. ALGORITHM DESCRIPTION
   5.1 Parser
   5.2 Lexical Scanner
   5.3 Segment Generator
   5.4 Linker
   5.5 Implementation Description
6. FUTURE IMPROVEMENTS
7. CLOSING REMARKS
BIBLIOGRAPHY
APPENDIX A
APPENDIX B
APPENDIX C

LIST OF FIGURES

1. Genesis Overall Structure
2. One-Step Compiler Generation with Special Version of Genesis
3. Segmentation Parse Tree for "A=B"
4. Segmentation Parse Tree for "A(A(5))=B"
5. Example Segments
6. Genesis Recognizer
7. Combinability Criterion Applied to Example Segments
8. Potential Infinite Loop Situation for Segment Entrance
9. Parse Table Structure for Linked Segments
10. Finite State Machine States for "BE" and "BEGIN"
11. Compressed Parse Table Structure for Each Segment
12. Compressed Lexical Scanner Table Structure

1. INTRODUCTION

The Genesis (Generator System) compiler generator system allows the user to generate language subsets individually, then link together those which form a complete language to get a compiler for the language. The language subsets are called "segments" of the language. Segment-module libraries may be built which make the modules available to any language designer or compiler writer who may need them.

One segment-module is generated at a time by a program called the Segment Generator. The segment syntax is written in a syntactic meta-language similar to BNF and the segment semantics is written in PL/I. The segment-modules are linked together by a program called the Linker, which produces a compiler based on a description written in a compiler description language.

This thesis describes the theoretical aspects, the algorithms, and the implementation details behind Genesis.

1.1 Computing Environment

The Genesis compiler generator is written in PL/I and runs on an IBM 360/75 located on the campus of the University of Illinois at Urbana-Champaign, running under HASP-OS/MVT with 1M bytes of fast core and 2M bytes of slow core. The Segment Generator runs in 128K bytes plus an input-dependent area. The Linker runs in 108K bytes plus an input-dependent area. The compiler generated from the system runs in 48K bytes plus an area for its syntax tables and an area for its semantic code.

1.2 Motivation

Genesis is a tool which should promote the "structured design" of programming languages, just as "structured" programming languages have promoted the structured design of programs. With Genesis, a language can be divided into segments which have one entry and one exit, much as a program can be divided into subroutines which have one entry and one exit. This modularized language design technique decreases the complexity of the compiler-debugging task.

Since the segment-modules are independent of each other and are stored in a library, much like a subroutine library, they are available to anyone who may need them. Ideally, segments which are sufficiently general could be used in many compilers. Segments such as one with standard control structures, or one with expressions containing standard operators, could be useful in many languages. Each "standard" segment used by a compiler writer would make the task of writing and debugging the compiler easier. In addition, the use of "standard" segments would tend to standardize the structure of the compiler and therefore would make the compiler easier to understand for someone familiar with the standard components.

1.3 Background

How does Genesis compare with other compiler generators which have been developed? Where does Genesis fit in the wide spectrum of approaches to the construction of compiler generators? To answer these questions, three widely different approaches to compiler generation have been selected to compare and contrast with Genesis. The three are: XPL [McKeeman; 1970], APAREL [Balzer; 1969], and PGS [Mickunas; 1973]. XPL and APAREL are compiler-writing languages. PGS and Genesis rely on ordinary programming languages for the actual implementation of a compiler.

Of the three systems, APAREL is probably the least like Genesis. Both the syntax and semantics of a language are expressible within the framework of the APAREL language.
The language is an extension to PL/I which includes pattern statements resembling Backus-Naur Form for specifying the syntax of the language. While APAREL encourages complete mixing of syntax and semantics within a user-written compiler, Genesis provides nearly complete separation of them. The principle behind this separation in Genesis is that these two dissimilar, very complex, and often unrelated entities should be studied individually before they are studied together. APAREL achieves much greater flexibility for the compiler because syntax patterns can be invoked at any point in the compiler.

McKeeman et al. not only provide a compiler-writing language in XPL, but they provide a compiler skeleton as well. SKELETON, as it is called, consists of a structure of subroutines written in XPL which is filled in by the user to form a compiler. In addition, the language syntax may be written in BNF and submitted to the ANALYSER program, which constructs parse tables for SKELETON. XPL differs from Genesis in one major point: the compiler writer must know the structure and details of the XPL system thoroughly to be able to add his/her own code to that of SKELETON. Using that knowledge, the user can tailor the lexical analysis to his/her own special needs. One of the prime goals in the design of Genesis is that a user need not be bothered with the details of implementation.

The PGS system is the most like Genesis of the three systems. In PGS, the syntax is expressed in BNF with semantic tags, showing which semantic subroutine is to be executed at each point in the parse. Assembler, Algol, and Fortran subroutines may be invoked as semantic routines in PGS. In Genesis, "semantic numbers" are placed at various points within the syntax, and at those points the user-written PL/I semantic routine for the current segment is called and the semantic number is passed to it.

1.4 Organization of this Thesis

This thesis is organized in a top-down fashion.
The major ideas behind the system are presented in this Introduction. In Section 2, a general overview of the system plus a generalized system diagram are presented. Also, a simple example of segmentation is presented. This example will be referenced throughout the paper. In Sections 3 through 5, the system components will be described in detail, the theoretical basis for the system will be stated, and the system algorithms will be presented. Section 6 discusses possible improvements to the system.

2. OVERVIEW

2.1 Genesis Description

Genesis consists of two major programs, the Segment Generator and the Linker. The Segment Generator receives a segment written in the Genesis syntactic meta-language (syntax) and PL/I (semantics). It then generates a parse table module, which it puts in a parse table library, and uses a PL/I compiler to produce a semantic object module, which it puts in the semantic library. These two modules are logically associated because they are filed under the same name in each library.

The Linker reads a description of the compiler which it must build and finds the appropriate segments in both libraries. The Linker produces three modules: a compiler load module, a parse table module and a lexical scanner table module. Those three modules together form the compiler for the language. Figure 1 shows the Genesis overall structure.

A special-case version of the Genesis system exists, which generates an entire compiler in one step from one syntax/semantics description. This version strips away all segmentation overhead made unnecessary since only one "segment" exists. Figure 2 shows a block diagram of this program.

[Block diagram: a segment is input to the Segment Generator and a language X compiler description is input to the Linker; the resulting Language X compiler comprises the compiler load module (compiler semantics, parser, lexical scanner), the parse table module and the scanner table module.]

Figure 1. Genesis Overall Structure

[Block diagram: the Language X syntax/semantics description is input to the language generator; the resulting Language X compiler comprises the compiler load module (compiler semantics, parser, lexical scanner), the parse table module and the scanner table module, and takes as input a program written in Language X.]

Figure 2. One-Step Compiler Generation with Special Version of Genesis

2.2 Segmentation

2.2.1 Definitions

The word "segment" will appear many times in this thesis. It is important that the term be clearly understood. A language segment is the combined syntax-semantics description for a subset of a language.

To distinguish the input to the Segment Generator from its output, the syntactic meta-language and PL/I description will be called a segment, while the syntax tables and object code produced will be called a segment-module.

A language which is formed by the combination of two or more segments is termed a "segmented language." The recognition process for a segmented language, which involves both parsing within a segment and moving between segments, is called "segmentation parsing." The segmentation process has little to do with the actual parsing technique used within each segment. Segmentation merely provides a superstructure within which the parse is carried out. This would make it possible for each segment to be parsed via a different technique.

For the most part, this thesis will deal with the problems inherent in combining the syntax parts of several segments. Thus, for simplicity, "segment" will be frequently used in place of "syntax part of a segment" throughout the rest of this paper.

2.2.2 Segment References

Within one segment, a reference to another segment looks exactly like the reference to an ordinary nonterminal symbol. The Segment Generator can distinguish between the two since a segment reference is simply an undefined nonterminal within some segment. For example, consider the segments below.

[Diagram of two segments: Segment A, containing the nonterminals A and B, the terminal 'b' and a reference to C; and Segment C, which recognizes the terminal 'c'.]

Within Segment A, A and B are nonterminals, 'b' is a terminal symbol, and C is a segment reference.

Now, suppose the input sentence 'bc' were to be parsed according to the syntax structure defined by segments A and C. The 'b' of the input sentence would be found to be syntactically correct according to Segment A. To examine the rest of the input sentence, the segmentation superstructure causes an entrance into Segment C and then parsing continues within C. When 'c' has been recognized as syntactically correct within C, the segmentation superstructure leaves Segment C and returns to Segment A, indicating that C has been successfully recognized.

2.3 Example

The segmentation concept may be best illustrated at this point by an example. First, consider the grammar for a very simple type of assignment statement which is made up of three segments:

    Segment ASSIGN:   ASSIGN -> NAME '=' EXPR;

    Segment EXPR:     EXPR   -> NAME;
                             -> DIGITS;

    Segment NAME:     NAME   -> IDENTIFIER;

In this example and throughout this thesis, the special names IDENTIFIER, DIGITS, and LITERAL will denote special sets of lexemes. IDENTIFIER is the name of the set of lexemes which begin with a letter (A-Z) and continue with either letters or digits (0-9). DIGITS is the name of the set of lexemes which are strings of digits. LITERAL is the name of the set of lexemes which begin with a single quote and end with a single quote, with any character in between. An internal single quote is represented by two consecutive single quotes.

This grammar describes a language which allows sentences of two forms: IDENTIFIER = IDENTIFIER and IDENTIFIER = DIGITS. The syntax tables for each of these three segments would be generated separately, then linked together. The segmentation parse tree for the input sentence "A=B" is shown in Figure 3.

To explore the power provided by segmentation, suppose a wider variety of names is to be accepted, namely singly subscripted identifiers.
To accomplish this, the NAME segment would be rewritten, its syntax tables generated and then linked with the existing ASSIGN and EXPR segments. The new NAME segment is shown in Figure 4. The EXPR segment is mentioned in the new NAME segment and therefore anything which the existing EXPR segment accepts is immediately valid where EXPR appears in NAME. Figure 4 illustrates this with the input sentence "A(A(5))=B."

    GRAMMAR

    ASSIGN -> NAME '=' EXPR;
    EXPR   -> NAME;
           -> DIGITS;
    NAME   -> IDENTIFIER;

    [Segmentation parse tree diagram for "A=B"]

Figure 3. Segmentation Parse Tree for "A=B"

    GRAMMAR

    ASSIGN -> NAME '=' EXPR;
    EXPR   -> NAME;
           -> DIGITS;
    NAME   -> IDENTIFIER SUBSCR;
    SUBSCR -> ;
           -> '(' EXPR ')';

    [Segmentation parse tree diagram for "A(A(5))=B"]

Figure 4. Segmentation Parse Tree for "A(A(5))=B"

Next, EXPR can be expanded to accept operators. The new EXPR segment is shown in Figure 5. When it is generated and linked with the existing NAME and ASSIGN segments, expressions throughout the language are suddenly richer. The input sentence "A(B+3)=B+A(B+2)" is acceptable with this new set of segments.

    Segment ASSIGN:   ASSIGN -> NAME '=' EXPR;

    Segment EXPR:     EXPR   -> EXPR < '*' EXPR;
                             -> EXPR < '+' EXPR;
                             -> NAME;
                             -> DIGITS;

    Segment NAME:     NAME   -> IDENTIFIER SUBSCR;
                      SUBSCR -> ;
                             -> '(' EXPR ')';

Figure 5. Example Segments

While it is true that the grammar given in Figure 5 for the EXPR segment is ambiguous, it has been shown [Aho; 1975] that such a grammar plus some operator associativity and precedence information is a perfectly valid expression grammar. In fact, the Genesis system accepts an expression grammar including sufficient associativity and precedence information to remove ambiguity.

3. SYSTEM OPERATION

3.1 General

The most striking thing about the two major components of Genesis (the Segment Generator and the Linker) and the generated compiler is that they all have exactly the same structure.
This brings about one very important property of the system: new versions of either the Segment Generator or the Linker can be produced by the existing system, which would automatically generate the new program.

The structure which the major components of Genesis exhibit is shown in Figure 6. This structure consists of a standard Recognizer which accesses a parse table and a lexical scanner table and which calls user-written semantics routines.

The Recognizer is made up of a parser and a lexical scanner. The parser accesses the parse table and performs parse actions according to an SLR(1) parsing algorithm. In addition, it causes entry into and exit from the various segments in the language being parsed. The lexical scanner accesses the lexical scanner table and performs actions according to a finite state machine algorithm.

At various points in the course of parsing an input program, the parse table may indicate that a semantic action must be performed. When such a point occurs, the parser calls the appropriate user-written semantic routine, which then performs the action and returns.

[Block diagram of the Recognizer with its parse table, lexical scanner table and SEMANTICS routines. Shaded areas are standard components.]

Figure 6. Genesis Recognizer

3.2 Segment Generator

3.2.1 Syntactic Meta-language

The Segment Generator's input is a grammar for a single segment written in a syntactic meta-language. The major extensions beyond BNF in the meta-language concern ways of specifying certain Genesis options and constants, semantic information, and operator associativity and precedence information within expressions.

The full syntax of the Genesis syntactic meta-language appears in Appendix A. In this syntax notation, a blank left-hand side means the same left-hand side as that of the last production. In addition, several productions with identical left-hand sides may be written with one left-hand side and a series of right-hand sides separated by the alternation symbol (|).
Each production or production-group with a single left-hand side is terminated by a semicolon (;).

The semantic numbers are "associated" with the symbol appearing to the immediate right of the number. If there is no associated symbol (the semantic number appears at the far right-hand end of the production), then the number is associated with the reduction of the entire production. When the associated symbol has been recognized, the semantic package indicated by the semantic number is executed.

The special symbols which have to do with expression disambiguation are left- or right-pointing arrows ("<" or ">"). An operator is identified by the presence of a left-arrow to its left or a right-arrow to its right. A left-pointing arrow on the left side of an operator symbol denotes left-associativity. A right-pointing arrow on the right side of an operator symbol denotes right-associativity. If both a left-arrow and a right-arrow appear surrounding a symbol, then the symbol is treated as non-associative.

The precedence of an operator is determined by its position in the grammar compared with the positions of the other operators. The higher a production appears on the page in a listing of the grammar, the higher the precedence of its operator. If the operator is a nonterminal symbol, all terminal symbols which that nonterminal can produce are defined to be operators of equal precedence and associativity.

The expression grammar shown below contains six operators. The exponentiation ('**') operator has the highest precedence and is right-associative. The multiplication ('*') and division ('/') operators have equal precedence and are both left-associative. The addition ('+') and subtraction ('-') operators are of equal precedence, and of lower precedence than exponentiation, multiplication and division. The less-than operator ('<') has the lowest precedence and is non-associative.
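The way such precedence and associativity information can settle an ambiguous parse decision may be illustrated with a minimal Python sketch. It is not taken from the Genesis implementation: the table `LEVELS`, the helper `level_of` and the function `resolve` are hypothetical names, and the rule applied (compare precedence first, then fall back on associativity) is the conventional one for ambiguous expression grammars, assumed here rather than confirmed by the thesis.

```python
# Hypothetical operator table: levels are listed from highest to lowest
# precedence, in the order the productions would appear on the page.
LEVELS = [
    ({"**"}, "right"),
    ({"*", "/"}, "left"),
    ({"+", "-"}, "left"),
    ({"<"}, "none"),
]


def level_of(op):
    """Return the index of the level containing op (0 = highest precedence)."""
    for index, (ops, _assoc) in enumerate(LEVELS):
        if op in ops:
            return index
    raise ValueError("not an operator: " + op)


def resolve(left_op, next_op):
    """Choose an action when 'EXPR left_op EXPR' is on the stack
    and next_op is the next input symbol."""
    left, nxt = level_of(left_op), level_of(next_op)
    if left < nxt:        # operator on the stack binds tighter
        return "reduce"
    if left > nxt:        # incoming operator binds tighter
        return "shift"
    # Equal precedence: associativity decides.
    assoc = LEVELS[left][1]
    return {"left": "reduce", "right": "shift", "none": "error"}[assoc]
```

Under these assumptions, `resolve("*", "+")` chooses to reduce, `resolve("**", "**")` chooses to shift, and two consecutive '<' operators are rejected.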
Because '<' is non-associative, an expression like "A<B<C" is not allowed.

    EXPR  -> EXPR '**' > EXPR;
          -> EXPR < MULOP EXPR;
          -> EXPR < ADDOP EXPR;
          -> EXPR < '<' > EXPR;
          -> IDENTIFIER;
    MULOP -> '*' | '/';
    ADDOP -> '+' | '-';

3.2.2 Segment Generator Semantics

The Segment Generator performs four major tasks. First, and foremost, it builds the parse table by applying the Simple LR(1) construction algorithm to the segment's grammar. While that construction proceeds, the Generator performs its second major task, that of keeping track of any segment references in the grammar and the parse states in which they occur. The third task takes place after the construction algorithm has made one complete pass over the grammar: all ambiguities caused by an expression grammar are resolved. The fourth major task is that of compressing the parse table.

3.3 Linker

3.3.1 Linker Input Language

The Linker's compiler description language can completely describe a compiler. The language provides a means for setting various Linker constants, naming which segments are to be linked together, specifying a body of initialization code which will be executed in the compiler before compilation begins, and selecting from among various compiler options. The syntax of this compiler description language appears in Appendix B.

3.3.2 Linker Operation

The Linker performs six major tasks. First, it reads the segments for the compiler and stacks the parse tables together to form one parse table. Second, it builds a lexeme cross-reference table which keeps track of the way each user-defined lexeme is coded in every segment. Third, a similar cross-reference table is built for the coding of each segment name in every segment. Fourth, a lexical scanner table is generated for the compiler. Fifth, the customized initialization routine and one other customized routine are generated and compiled. Finally, the sixth task is the link-editing of the standard Recognizer with the customized routines and the semantic routines.
When all of the above has been completed by the Linker, the compiler load module, the parse table and the lexical scanner table are written to files and are ready for use.

4. THEORETICAL BASIS

4.1 Choice of Parsing Technique

Basic to the success of the Genesis system is the use of a theoretically sound parsing technique. The SLR(1) technique was chosen. SLR(1) parsing was first introduced by F. L. DeRemer in [DeRemer; 1969].

There are four reasons why the SLR(1) technique was chosen. First, SLR grammars form a large subset of the deterministic context free languages. Most programming languages in use today can be described by one of the SLR grammars. Second, I personally find that SLR grammars are more natural to write than other types of grammars. Third, SLR parsers report an error at the earliest possible time. This is not the case for some other parsing techniques. For instance, a precedence parser may examine an arbitrary number of symbols after an error has occurred before it reports the error. Finally, SLR parsers can be made competitive in size and speed with other parsing techniques through table transformations [Aho; 1973].

The SLR(1) technique has two big advantages over other LR techniques. The computation time required to generate its parse table is much less in general for SLR(1), and there are usually far fewer table entries with SLR(1) than with other LR techniques [Aho; 1973].

4.2 SLR(1) Parsing and Table Construction

The classic model machine for deterministic context free language recognition is the deterministic push down automaton (DPDA). This machine is sufficient for SLR(1) parsing, but is not quite sophisticated enough for efficient SLR(1) parsing. The machine chosen to model SLR(1) parsing efficiently for Genesis is called the Genesis DPDA.

The Genesis DPDA consists of three elements: an input tape (where the input program comes from); a finite state control (which controls the machine's actions); and a push down stack.
The machine's current condition is characterized by its "state" (which is actually the state of the finite control).

The Genesis DPDA can carry out six actions: SHIFT an input symbol onto the stack; REDUCE a meta-language production by taking its right-hand side off the stack and putting its left-hand side on the stack; ACCEPT the input string; report an ERROR; ENTER another Genesis DPDA; and EXIT the current Genesis DPDA.

Embedded in three of the six actions is the possible execution of semantic routines. After the input symbol has been SHIFTED onto the stack, a semantic routine can be executed. Within REDUCE, both after the right-hand side has been taken off the stack and after the left-hand side has been put on the stack, semantic routines can be executed. Within ACCEPT, after the right-hand side of production 1 has been removed from the stack, some semantic routine may be executed. The inclusion of these semantic actions makes the Genesis DPDA a translating machine instead of just a parsing machine. The ENTER and EXIT actions implement segmentation parsing.

The Genesis DPDA effectively models the parse tree for an input sentence. The SHIFT action gets input symbols onto the stack, then the REDUCE action has the effect of replacing them with a single nonterminal symbol (their father node in a parse tree). That nonterminal can then be one of the symbols replaced by another REDUCE, and the process continues until the sentence symbol alone is left on the stack, at which point the input sentence is ACCEPTED.

As long as the finite control causes the proper actions at the proper times, the Genesis DPDA can model any deterministic context-free language parse. The finite control built for Genesis is for SLR(1) grammars.

4.3 Basis for the SLR(1) Construction

The SLR(1) table construction is done by building a series of LR(0) items, each of which results in one parse action entry in the SLR(1) table.
The LR(0) items are built from an SLR(1) grammar by moving a cursor through the productions of the grammar to all the possible points which the parse of an input sentence might reach. Each LR(0) item has the form

$$[A \rightarrow \alpha \cdot \beta]$$

where $A$ is a nonterminal in the grammar, $\alpha$ and $\beta$ could be nonterminals, terminals, or $\Lambda$ (the empty symbol), and $A \rightarrow \alpha\beta$ is a production in the grammar. The cursor is represented by a period (.). An item represents a parse that has reached the point between $\alpha$ and $\beta$.

Each set of LR(0) items is called a state. For each state there is a collection of valid terminal symbols which will cause the parse to continue. These symbols are each associated with a parse action. The procedure for determining the parse actions is described in section 5.3.4.

4.4 Formal SLR(1) Definitions

A grammar is expressed as an ordered 4-tuple $G = (N, T, P, S)$. $N$ is the set of all nonterminals in the grammar. $T$ is the set of all terminals in the grammar. $P$ is the set of all productions in the grammar. $S$ is the Start symbol of the grammar, $S \in N$.

The symbol "$\Rightarrow$" means "produces." A grammar is used to derive the string on the right of "$\Rightarrow$" from the string on the left. "$\Rightarrow^{*}$" means "produces in zero or more steps." "$\Rightarrow^{*}_{\text{rightmost}}$" means "produces in zero or more steps using a right-most derivation."

The $\mathrm{EFF}_1$ (Epsilon-Free First) and $\mathrm{FOLLOW}_1$ sets will be used in the definition of SLR(1) grammars. The $\mathrm{FIRST}_1$ set will be used to define $\mathrm{EFF}_1$.

$$\mathrm{FIRST}_1(\alpha) = \{\, x \mid \alpha \Rightarrow^{*} x\beta \text{ and } \mathrm{length}(x) = 1 \,\}$$

$$\mathrm{EFF}_1(\alpha) = \begin{cases} \mathrm{FIRST}_1(\alpha) & \text{if } \alpha \text{ does not begin with a nonterminal; or} \\ \{\, w \mid w \text{ is in } \mathrm{FIRST}_1(\alpha) \text{ and there is a derivation } \alpha \Rightarrow^{*}_{\text{rightmost}} \beta \Rightarrow_{\text{rightmost}} wx \text{ where } \beta \neq Awx \text{ for any nonterminal } A \,\} & \end{cases}$$

$$\mathrm{FOLLOW}_1(\beta) = \{\, x \mid S \Rightarrow^{*} \alpha\beta\gamma \text{ and } x \text{ is in } \mathrm{FIRST}_1(\gamma) \,\}$$

SLR(1) grammars are defined as follows:

Let $G = (N, T, P, S)$ be a context free grammar (not necessarily LR(0)). Let $\mathcal{S}$ be the collection of sets of LR(0) items for $G$. Let $Q$ be any set of items in $\mathcal{S}$. Suppose that whenever $[A \rightarrow \alpha \cdot \beta]$ and $[B \rightarrow \gamma \cdot \delta]$ are two distinct items in $Q$, one of the following conditions is satisfied:

(1) Neither $\beta$ nor $\delta$ is $\Lambda$.
(2) $\beta \neq \Lambda$, $\delta = \Lambda$ and $\mathrm{FOLLOW}_1(B) \cap \mathrm{EFF}_1(\beta\,\mathrm{FOLLOW}_1(A)) = \emptyset$
(3) $\beta = \Lambda$, $\delta \neq \Lambda$ and $\mathrm{FOLLOW}_1(A) \cap \mathrm{EFF}_1(\delta\,\mathrm{FOLLOW}_1(B)) = \emptyset$
(4) $\beta = \Lambda$, $\delta = \Lambda$ and $\mathrm{FOLLOW}_1(A) \cap \mathrm{FOLLOW}_1(B) = \emptyset$

Then $G$ is said to be a simple LR(1) grammar (SLR(1) grammar).

4.5 The Combinability Criterion

4.5.1 Definitions

Each segment undergoes SLR(1) table generation. For each state in a segment's tables, the Segment Generator keeps track of the set of lexemes which cause a correct parse to continue. This set is called the continue-set for the state. The names of all segments which could be entered from each state are also kept on a segment-list for that state.

Each segment has a single entry point: its first state. The continue-set for a segment's first state will be called the segment's seed-set.

For instance, referring back to the example in section 2.3, and Figure 7 below, the seed-set of ASSIGN is the set IDENTIFIER of identifiers, the seed-set of NAME is the set IDENTIFIER, and the seed-set of EXPR is the set IDENTIFIER $\cup$ DIGITS. The first state in ASSIGN's parse table lists NAME as a segment which could be entered. State 3 in ASSIGN has "=" in its continue-set and nothing in its enterable-segment list.

4.5.2 The Criterion

Regardless of the parsing technique used for any of the segments, there is a general criterion for deterministic segmentation parsing.

Intuitively, the Combinability Criterion states that when all segments are combined to form a language, a fixed number of symbols will be sufficient to determine the next parsing action or the next entry into or exit from a segment. Genesis uses one symbol to make that decision.

Each state in each segment has a possibly-empty continue-set C. If that state's segment-list is not empty, then associated with that state is a collection S of seed-sets.
The algorithm to determine the members of collection S is:

1. Copy the state's segment-list into membership list M.
2. Examine each element of M, beginning with the first. If that segment has a segment-list in its first state, merge that list into M (if a member of the segment-list is already in M, don't do anything for that member).
3. Continue until there are no more elements in M to examine.
4. The group of seed-sets of all segments named in M is the collection S.

The combinability criterion is that C and all sets in S must be collectively disjoint. Let I be an index set for S. Then,

$$C \cap S_i = \emptyset \quad (\forall i \in I) \qquad \text{and} \qquad S_i \cap S_j = \emptyset \quad (\forall i, j \in I \ni i \neq j)$$

where $\emptyset$ is the empty set and $S_i$ is the i'th set in S.

Figure 7 shows that the three example segments meet the Combinability Criterion.

[Tables listing, for each state of the segments ASSIGN (seed-set: IDENTIFIER), NAME (seed-set: IDENTIFIER) and EXPR (seed-set: DIGITS, IDENTIFIER), the continue-set, the enterable segments and the resulting combinability check. EOF: End of File.]

Figure 7. Combinability Criterion Applied to Example Segments
(Appendix C contains the complete construction of the parse tables for these segments.)

5. ALGORITHM DESCRIPTION

5.1 Parser

The Genesis DPDA, discussed in section 4.2, is the model for parsing used in the Recognizer, but the algorithm used to implement it makes some necessary modifications to it. The two actions ENTER and EXIT do not exist at all in the Recognizer algorithm. They are replaced by extra code in the ACCEPT and ERROR actions. The reason for this change is one of efficiency.
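Returning briefly to section 4.5.2, the construction of collection S and the disjointness test can be sketched in Python. This is only an illustration under stated assumptions, not the Genesis code: the function names `seed_collection` and `is_combinable` and the dictionary layout are hypothetical, and each seed-set is taken to be the continue-set of the segment's first state, as defined in section 4.5.1.

```python
def seed_collection(segment_list, first_state_lists, seed_sets):
    """Steps 1-4: collect the seed-sets reachable from a state's segment-list.

    segment_list      -- segments enterable from the state being checked
    first_state_lists -- for each segment, the segment-list of its first state
    seed_sets         -- for each segment, its seed-set (a set of lexeme names)
    """
    members = list(segment_list)            # step 1: membership list M
    position = 0
    while position < len(members):          # steps 2-3: examine every element
        for name in first_state_lists.get(members[position], []):
            if name not in members:         # already present: do nothing
                members.append(name)
        position += 1
    return {name: seed_sets[name] for name in members}   # step 4: collection S


def is_combinable(continue_set, collection):
    """True if C and all sets in S are collectively disjoint."""
    seen = set(continue_set)
    for seeds in collection.values():
        if seen & seeds:                    # overlaps C or an earlier seed-set
            return False
        seen |= seeds
    return True
```

For two hypothetical segments Q (seed-set {DIGITS}, with R enterable from its first state) and R (seed-set {IDENTIFIER}), a state that can enter Q yields the collection {Q, R}; the state passes the check with an empty continue-set and fails it if IDENTIFIER is also in its continue-set.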
The ENTER and EXIT instructions cannot be generated by the Segment Generator since it cannot know the seed-set for segments to be entered and since it cannot know whether the segment being generated will be the major segment of a language (the major segment is the segment whose sentence symbol becomes the sentence symbol of the entire language). The Linker knows both of these things, but for efficiency, does not alter the parse tables of the segments which it is linking.

The Parser has four actions: SHIFT, REDUCE, ACCEPT, and ERROR.

SHIFT pushes an input symbol on top of the parse stack, then pushes the next parse state number onto the stack and causes a transfer to that state. A new input symbol is then read.

REDUCE causes twice the number of symbols which are on the right-hand side of a production to be removed from the stack (one for each right-hand side symbol plus one for each state number pushed on). Then, the top of the stack will hold a new state number which becomes the current state. The left-hand side nonterminal symbol is pushed onto the stack, and a table called the GOTO table is consulted to see what the next parse state will be. That next state is pushed onto the parse stack. The method for constructing the GOTO table will be described in section 5.3.5.

ERROR causes the machine to look in the segment-list for the current parse state and segment. If a segment name is on the list, that segment is entered by placing the current segment number on the stack, placing state number one on the stack (first state of the new segment), then changing the current segment number to the new segment number. If the segment-list for the current state and segment is empty, an ERROR is signalled and the parse stops. Thus, ERROR simulates the ENTER action of the Genesis DPDA in some cases.

The Genesis Linker only allows one segment to be enterable from any one state and it does not record the seed-set for the enterable segment.
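The stack discipline of SHIFT and REDUCE described above can be sketched as follows. This is a minimal Python illustration, not the PL/I Recognizer: the function names, the dictionary form of the GOTO table, the state numbers and the assumption that the stack initially holds only state 1 are all hypothetical.

```python
def shift(stack, symbol, next_state):
    """SHIFT: push the input symbol, then the next parse state number."""
    stack.append(symbol)
    stack.append(next_state)
    return next_state


def reduce(stack, lhs, rhs_length, goto):
    """REDUCE: pop two entries per right-hand side symbol, then push the
    left-hand side and the state given by the GOTO table."""
    # One symbol and one state number are removed for each right-hand side
    # symbol; slicing from an explicit index also handles an empty
    # right-hand side (rhs_length == 0) without removing anything.
    del stack[len(stack) - 2 * rhs_length:]
    exposed_state = stack[-1]               # becomes the current state
    next_state = goto[(exposed_state, lhs)]
    stack.append(lhs)
    stack.append(next_state)
    return next_state


# Usage with hypothetical state numbers: recognize NAME -> IDENTIFIER.
stack = [1]
shift(stack, "IDENTIFIER", 2)               # stack: [1, 'IDENTIFIER', 2]
reduce(stack, "NAME", 1, {(1, "NAME"): 3})  # stack: [1, 'NAME', 3]
```

Keeping symbols and state numbers interleaved on one stack is what makes REDUCE remove twice as many entries as there are right-hand side symbols.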
Whenever an error occurs and a segment is enterable, that segment is entered. To avoid a possible infinite loop, in which a particular lexeme does not appear in the seed-set of any segment on a circular path of enterable segments, a run-time check must be made to ensure that no segment is entered a second time for the same input symbol. For example, Figure 8 shows this type of situation.

[Figure 8 diagram: Segment A, whose state 1 has enterable segment B, and Segment B, whose state 1 has enterable segment A, each shown with its seed set.]

Figure 8. Potential Infinite Loop Situation for Segment Entrance

If the symbol ";" (not in the seed-set of either segment) is examined in the first state of segment A, the parser would find ERROR as its action, then notice it could enter B, which would find the same situation and re-enter A. An infinite loop would result if it were not stopped. A run-time check would stop execution with an ERROR when A was entered for the second time without ";" being consumed.

If the full Genesis DPDA were implemented and if the Linker implemented the full Combinability Criterion by altering the segments' parse tables with ENTER and EXIT actions, the run-time check would be unnecessary.

ACCEPT causes an "input accepted" message if the current segment is the major segment. If it is not, then the action is treated just as if it were a REDUCE for production number one of the segment. After the right-hand side of production one is removed from the stack, the last segment name is removed from the stack. It is changed to a nonterminal name and the GOTO table is consulted for the next parser state. The nonterminal name is pushed onto the stack, then so is the next parse state. ACCEPT models the EXIT action of the Genesis DPDA exactly for all minor segments (all segments are minor, except for the major segment).

5.2 Lexical Scanner

5.2.1 General Description

The Genesis lexical scanner reads the input program and converts it to tokens which it passes to the parser.
The token conversion process is guided by a lexical scanner table. The scanner recognizes lexemes of one or more characters as well as identifiers, digit strings, and literals. The latter are recorded in special tables. Comments are recognized, then discarded.

The model machine for the scanner is a finite state machine. The current input symbol together with the current state uniquely determines the actions which the machine will take. The actions are of four different types. Actions of type 1 and type 2 are carried out with every state transition. Sometimes three and possibly all four types are carried out during one state transition.

5.2.2 Lexical Scanner Actions

Type 1:
(a) CONSUME INPUT SYMBOL or
(b) DON'T CONSUME INPUT SYMBOL

The machine can choose to go to the next input symbol or not.

Type 2:
(a) TRANSFER TO <state>

This changes the current scanner state to <state>. The scanner starts in state 1 when called. For every state transition, the state is re-assigned with this action.

Type 3: (optional)
(a) INDICATE THAT <lexeme> IS RECOGNIZED or
(b) INDICATE THAT <lexeme> MIGHT BE RECOGNIZED or
(c) DENY PREVIOUS POSSIBLE LEXEME or
(d) CONFIRM PREVIOUS POSSIBLE LEXEME

When a lexeme has been recognized, either (a) or (d) will signal that fact. Whenever (b) is performed, one of (c) or (d) will be performed on the next state transition. If (c) is executed, the "possible" lexeme will be declared a false alarm and the machine will continue until one of (a) or (d) is executed.

Actions (b), (c), and (d) are necessary since some user-defined lexemes have the same structure as identifiers (like "BEGIN," "END," etc.). If "BEGIN" were a user-defined lexeme, then when "BEGIN" is seen on input, (b) is executed. Action (a) is not possible in this case because the next symbol will determine whether the lexeme is "BEGIN," or possibly some identifier like "BEGINNING." If the next character after "BEGIN" is a legal continuation for an identifier, then (c) is executed.
If the next character is not a legal continuation for an identifier, then (d) is executed.

Type 4: (optional)
(a) ENTER IN NAME TABLE or
(b) ENTER IN LITERAL TABLE or
(c) ENTER IN NUMBER TABLE

This action enters the input string into a table according to the type of symbol it has found. A global variable is set with the symbol's position in the table, so that the user-written semantics can access the symbol.

When the scanner reaches the end of the input file, the scanner immediately sets the recognized symbol to be EOF, then returns.

5.3 Segment Generator

5.3.1 General

Before the Segment Generator algorithms are described, a crucial term must be defined. The core of a parse state is the set of LR(0) items which exist in that state before closure is performed on that state. Closure is a process which generates all additional relevant LR(0) items for a state from the core items of that state. Closure will be described in greater detail later.

5.3.2 Generating the Parse Table

The parse table generation algorithm begins with the core of the first state being set to the item:

[S -> . α]

where S is the sentence symbol and S -> α is the first production in the segment's grammar.

The algorithm first performs closure on the core of the current state. Then, for each item in the state, it computes a parse action or a GOTO action and possibly generates a core item for some new state in a temporary state area. When all items in the current state are processed, a set of cores for new states will have been generated in the temporary state area. This set of temporary cores is then merged with all the existing states. If a temporary core matches the core of an existing state, all references to the temporary state are changed to refer to the existing state. If the temporary core does not match any existing core, then that temporary core is added to the list of existing cores.
When all temporary cores have been merged into the existing cores, the next successive existing state is processed (closure is performed, parse actions computed, and temporary states generated, then merged). This process continues until all existing states are processed.

5.3.3 Closure

Closure is performed by looking at each LR(0) item in the state (both core items and any generated items) in turn. If the grammar symbol to the right of the cursor is a terminal symbol, then nothing is done. If the symbol is a nonterminal A, then the state is augmented by all items of the form

[A -> . α]

such that A -> α is a production in the grammar.

5.3.4 Computing a Parse Action

In a state X, an item of the form

[A -> α . b β]

where b is a terminal symbol, produces a parse action of SHIFT, the creation of a core item for a new state Y of the form

[A -> α b . β]

and an indication to transfer to that new state after the SHIFT. Thus, whenever the terminal b is reported by the lexical scanner and the parse is in state X, the parser will SHIFT the b onto the stack and transfer to parse state Y.

An item of the form

[A -> α .]

produces a parse action of REDUCE for all terminal symbols in the follow set F of symbol A. Whenever a terminal symbol in F is reported, REDUCE is executed for the production.

5.3.5 Forming the GOTO Table

In a state X, an item of the form

[A -> α . B β]

where B is a nonterminal, causes the creation of a core item for a new state Y of the form

[A -> α B . β]

and produces a GOTO table entry to transfer to state Y. Several GOTO table entries could exist for each state, but only one for each unique nonterminal appearing to the immediate right of the cursor in that state.

5.3.6 Parse State Conflicts

A conflict occurs in a parse state for a grammar when more than one parse action is possible for a particular terminal symbol. If this occurs, the grammar is not an SLR(1) grammar.
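
The rules of sections 5.3.4 and 5.3.6 can be combined into a short Python sketch that computes the parse actions of one state and reports any terminal with more than one action. The item representation (a tuple of left-hand side, right-hand side, and cursor position), the function names, and the sample follow set are assumptions made for illustration; the state contains two items of the kind an ambiguous expression grammar produces.

```python
from collections import defaultdict

# Illustrative sketch only: the item encoding and names are hypothetical.

def parse_actions(items, follow, terminals):
    """Map each terminal to the set of actions the state's items call for.
    An item is (lhs, rhs, dot): production lhs -> rhs with the cursor
    in front of rhs[dot]."""
    actions = defaultdict(set)
    for lhs, rhs, dot in items:
        if dot == len(rhs):
            # Cursor at the end: REDUCE on every terminal in FOLLOW(lhs).
            for terminal in follow[lhs]:
                actions[terminal].add(("REDUCE", lhs, rhs))
        elif rhs[dot] in terminals:
            # Terminal after the cursor: SHIFT on that terminal.
            actions[rhs[dot]].add(("SHIFT",))
        # A nonterminal after the cursor yields a GOTO entry instead.
    return actions

def find_conflicts(actions):
    """A conflict is any terminal with more than one possible action."""
    return {t: acts for t, acts in actions.items() if len(acts) > 1}

terminals = {"*", "+", "DIGITS", "EOF"}
follow = {"EXPR": {"*", "+", "EOF"}}
state_x = [
    ("EXPR", ("EXPR", "*", "EXPR"), 3),   # [EXPR -> EXPR '*' EXPR .]
    ("EXPR", ("EXPR", "+", "EXPR"), 1),   # [EXPR -> EXPR . '+' EXPR]
]

conflicts = find_conflicts(parse_actions(state_x, follow, terminals))
for terminal, acts in conflicts.items():
    print(terminal, sorted(a[0] for a in acts))
```

Recording the production inside each REDUCE action means that two different reductions on the same terminal would also be reported, not only SHIFT against REDUCE.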
For instance, consider the following parse state in the SLR(1) construction for the example segment EXPR:

State X
    [EXPR -> EXPR '*' EXPR .]    on '*'  REDUCE
                                 on '+'  REDUCE (conflict)
                                 on EOF  REDUCE
    [EXPR -> EXPR . '+' EXPR]    on '+'  SHIFT (conflict)

    (the follow set of EXPR is {'*', '+', EOF})

The symbol "+" could trigger two possible actions, SHIFT and REDUCE. This is called a SHIFT-REDUCE conflict. It shows that EXPR is not in SLR(1) form.

Some SHIFT-REDUCE conflicts can be resolved by Genesis. Specifically, ambiguous expression grammars like the EXPR grammar above can be disambiguated if operator precedence and associativity information is included in the grammar.

5.3.7 Disambiguating Expressions

(Refer back to section 3.2 for the format of operator associativity and precedence information used for Genesis.)

Conflicts in the states of an expression grammar parse are of two kinds. The first is in a state with the following two types of items:

    [EXPR -> EXPR Op_1 EXPR .]
    [EXPR -> EXPR . Op_2 EXPR]

"Op_1" and "Op_2" denote different operators. When two different operators are involved in the conflict, as above, the operator with the higher precedence has control. In the above situation, if Op_1 had higher or equal precedence, the action would be REDUCE. If Op_2 had higher precedence, the action would be SHIFT.

The second kind of conflict is one with the following two types of items in one state:

    [EXPR -> EXPR Op_1 EXPR .]
    [EXPR -> EXPR . Op_1 EXPR]

This conflict is between two identical operators. The associativity of the operator determines the parse action. If the operator is left-associative, then REDUCE is the proper action. If the operator is right-associative, then SHIFT is the proper action. If the operator is non-associative, then ERROR is the proper action.

5.4 Linker

5.4.1 General

The Linker program performs the following six major tasks.
(1) Parse Table Construction

The Parse and GOTO table construction is done by stacking the parse and GOTO tables for all the segments together. A segment index is built which shows where each segment's tables begin. Figure 9 shows the structure of the combined parse tables. Section 5.5.2 discusses the Segment Generator's algorithm for producing the individual compressed parse tables.

(2) Terminal Symbol Cross-Reference

The terminal symbols from all the segments are merged into one list with one entry for each unique terminal symbol. Then, for each entry in that terminal symbol list, a terminal capsule is formed. A terminal capsule lists the numeric code used for a certain terminal in each segment. None of the parse table entries need to be changed, even though one terminal symbol is represented by different codes in different segments, since the proper code for the symbol in any segment can always be found in its capsule.

[Figure 9 diagram: a segment index pointing to the state index of each segment, which in turn points into the non-error parse table entries.]

Figure 9. Parse Table Structure for Linked Segments

(3) Segment Cross-Reference

Segment references are treated as undefined nonterminals in each segment. Each segment name, therefore, is represented by some numeric code in at least one other segment. The codes for each segment name in all segments are kept together in a table similar to the terminal cross-reference table.

(4) Lexical Scanner Table Generation

The Finite State Machine Generator program accepts a list of user-defined lexemes plus several constants and option-selections as input. Its output is an FSM table which can be used to recognize any of the user-defined lexemes plus any identifiers, literals, digit strings, and comments.

Before the table generation process begins, the list of user-defined lexemes is sorted by length, longest first. Then the list is partitioned into several classes.
The lexemes are classified according to the first character of each. The classes are: Alphabetic (A-Z), Blank, and Special ("*", "-", "$", etc.). No user-defined lexeme can begin with a digit.

The various parts of the table can be built in any order. The parts which have to be built are one each for: identifiers, literals, digit strings, comments, those lexemes beginning with a blank, those beginning with an alphabetic character, and those beginning with a special character.

The parts of the table built from the user-defined lexemes are constructed by starting in state 1 and building state transitions, character-by-character, until the end of the lexeme. If some of the state transitions were built previously, they are left intact. Figure 10 shows an example of states being built for the lexemes "BE" and "BEGIN." "BEGIN" would have been processed first since it is longer. The part of the table for "BE" would use the previously-built states and transitions.

It is recommended that no lexeme begin with a blank, since if one does, every blank in an input program must be checked to see whether it is the start of that lexeme. This greatly degrades scanner performance.

[Figure 10 diagram: the chain of scanner states built for "BE" and "BEGIN," with the signal and confirm actions marked.]

Figure 10. Finite State Machine States for "BE" and "BEGIN"

Each special token (identifier, literal, etc.) causes the construction of special states. Literals, comments, and digit strings each start with a character unique among all lexemes. When this character is found in scanner state 1, the machine jumps to the appropriate state.

There are two special identifier states. One is an identifier prefix state which checks to see whether a "possible" lexeme is really that symbol, or is really an identifier. The other state causes the machine to loop until the end of an identifier and then reports it has found an identifier, and enters that identifier in the name table.
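
The character-by-character state building for alphabetic lexemes, and the "possible lexeme" check against identifiers, can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the Finite State Machine Generator itself: the function names are hypothetical, states are plain integers, and a letter or digit is assumed to be a legal identifier continuation.

```python
# Illustrative sketch only: names and representation are hypothetical.

def build_lexeme_states(lexemes):
    """Build transitions from state 1, longest lexeme first, reusing
    any transition that an earlier lexeme already created."""
    transitions = {}      # (state, character) -> next state
    recognizes = {}       # state -> lexeme that ends in that state
    next_free = 2         # state 1 is the start state
    for lexeme in sorted(lexemes, key=len, reverse=True):
        state = 1
        for ch in lexeme:
            if (state, ch) not in transitions:
                transitions[(state, ch)] = next_free
                next_free += 1
            state = transitions[(state, ch)]
        recognizes[state] = lexeme
    return transitions, recognizes

def scan_word(text, transitions, recognizes):
    """Scan one alphabetic token at the start of text. A lexeme stays
    only 'possible' until the next character confirms or denies it."""
    state, pos, possible = 1, 0, None
    while pos < len(text) and text[pos].isalnum():
        state = transitions.get((state, text[pos]))
        pos += 1
        if state is None:
            # Left the lexeme states: deny, then loop to the identifier's end.
            while pos < len(text) and text[pos].isalnum():
                pos += 1
            return ("IDENTIFIER", text[:pos])
        possible = recognizes.get(state)   # replaces any earlier possibility
    if possible is not None:
        return ("LEXEME", possible)        # confirmed by a non-identifier character
    return ("IDENTIFIER", text[:pos])

transitions, recognizes = build_lexeme_states(["BE", "BEGIN"])
print(len(transitions), recognizes)
for sample in ["BEGIN X", "BEGINNING", "BE;", "BEG "]:
    print(repr(sample), scan_word(sample, transitions, recognizes))
```

Because "BEGIN" is processed first, "BE" adds no new transitions; it only marks an existing state as recognizing a shorter lexeme.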
The states in the table which represent DIGITS, LITERAL, and comments are constructed in the obvious manner, based on the definition of each.

(5) PL/I Code Generation

The Linker generates the INIT routine from user specifications and from the PL/I code given in the user's compiler description. The SEMANT routine is also generated. This routine is called by the parser whenever a semantic action is to be done. The SEMANT routine then calls the appropriate user-written semantic routine.

(6) Compiler Link-Edit

The final job of the Linker is not performed by the PL/I-coded Linker program at all. The linking of all the object code for the compiler is done by the system linkage editor after the INIT and SEMANT routines are compiled.

5.5 Implementation Description

5.5.1 Segment Generator Virtual Parse Table

The representation chosen for the parse table, the GOTO table, and two other tables used in the Segment Generator causes them to get very large for large grammars. These tables were made into virtual tables to enable the user to select any main memory size for them. For each table, the user specifies the number of parse states which are to reside in main memory during parse table generation.

Each of the virtual tables is referenced through a virtual table manager routine which first checks to see whether the requested state is in memory. If it isn't, the current block of states is written to secondary storage and the correct block is read from secondary storage. Then the correct offset into the main memory block is computed and returned.

5.5.2 Parse Table Encoding and Compression

The parse actions from the parse table are each encoded into one word, and all ERROR actions are left out to compress the parse table. The compressed parse table takes the form of a parse action list with one entry per parse action and an index with one entry per parse state.
The entries in the index point to the starting place in the parse action list of the entries for each parse state. Figure 11 shows this structure.

[Figure 11 diagram: a state index (states 1, 2, ...) pointing into the list of non-error parse table entries.]

Figure 11. Compressed Parse Table Structure for Each Segment

The parse actions are encoded as either REDUCE or SHIFT. The ACCEPT action is treated as a REDUCE for production 1. For SHIFT, the terminal symbol, the next parse state, and the semantic action (if any) are encoded into one word. The word is set negative to indicate SHIFT. For REDUCE, the terminal symbol, the production number, and the semantic action (if any) are encoded in one word. The word is positive for REDUCE.

5.5.3 The Lexical Scanner Table Encoding and Compression

The Genesis lexical scanner table has a unique character above each column. A row represents one state of the machine. The letters (A-Z), the special characters ("+", "-", "=", etc.), and the digits (0-9) form three distinct classes of entries. If the characters in these classes occupy adjacent columns in the table, then entries are fairly uniform as one follows a row across the table. For this reason, the method of encoding the table involves recording entries in a list only when they change, going across a row. Therefore, for one row (state), no two successive entries in the compressed table are the same.

A bit map is recorded for each row, showing which entries in the original table exist in the compressed table. An index points to the place in the entry list where a state's entries begin. The first entry for a state is always taken from the first column in the table. Thereafter, entries are recorded only when they differ from the last one recorded.

The bit map for a state contains one bit for every column in the table. When a bit is set to 1, the corresponding column entry exists in the compressed table. The first column entry always exists in the entry list, though the bit for that column is always set to zero.
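
A minimal Python sketch of this row encoding, together with the matching lookup (count the 1 bits up to and including the wanted column, then add the state's index entry), is given below. The function names and the use of 0-based state and column numbers are assumptions for illustration; the first sample row is the one shown in Figure 12, and the second is invented.

```python
# Illustrative sketch only: names and 0-based numbering are hypothetical.

def compress_table(table):
    """Return (bitmaps, index, entries) for a table given as rows.
    An entry is stored only when it differs from the last one recorded;
    the first column of each row is always stored, with its bit left 0."""
    bitmaps, index, entries = [], [], []
    for row in table:
        index.append(len(entries))     # where this state's entries begin
        entries.append(row[0])
        bits = [0]
        for value in row[1:]:
            changed = value != entries[-1]
            bits.append(1 if changed else 0)
            if changed:
                entries.append(value)
        bitmaps.append(bits)
    return bitmaps, index, entries

def lookup(bitmaps, index, entries, state, column):
    """Count the 1 bits up to and including the column, then offset
    from the start of the state's entries."""
    ones = sum(bitmaps[state][:column + 1])
    return entries[index[state] + ones]

table = ["QQQRSSSTTT", "AAAAABBBBB"]
bitmaps, index, entries = compress_table(table)
print("".join(map(str, bitmaps[0])), index, "".join(entries))
print(lookup(bitmaps, index, entries, state=0, column=4))
```

Leaving the first column's bit at zero is what lets the plain bit count double as the offset: column 0 counts no bits and lands on the state's first stored entry.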
The algorithm for decoding this table is to count how many bits are set to 1 in the bit map from bit one up to and including the bit for the column which is wanted from the table. That result is added to the index entry for the proper state to get the index of the requested element in the entry list. Figure 12 shows this structure.

    columns:                    A B C D E F G H I J

    original table:
        state x                 Q Q Q R S S S T T T

    compressed table:
        bit map for state x     0001100100
        pointer for state x     -> sequential list: Q R S T

Figure 12. Compressed Lexical Scanner Table Structure

6. FUTURE IMPROVEMENTS

Seven areas where improvement would be helpful in Genesis are described below:

(1) Building the Parse Table

While building the SLR(1) parse table in the Generator, the entire table need not be represented anywhere. Entries only need be allocated when they are being built. Each entry of the table could contain a pointer to the next entry of the table for a particular state. A set of pointers could point to the first entry in each state. This method of representing the parse table should save a significant amount of memory, since the great majority of entries in an SLR(1) table are ERROR entries, which needn't be built at all and would therefore never be allocated.

(2) Calculating FIRST and FOLLOW Sets

The calculation of the FIRST and FOLLOW sets is done using recursion in the current Genesis system. This process is simple, but expensive in both time and space requirements. If a bit map were kept for each set, both space and time usage would improve. The bit maps would be constructed once, then used continually throughout the building of the SLR(1) table.

To calculate the FIRST_1 set, first find all productions whose right-hand side begins with a terminal symbol and mark that symbol as being in the left-hand side's FIRST_1 set. When no more such productions exist, then a dependency graph must be built for nonterminals in the grammar.
For instance, if the following production appears in the grammar:

    A -> B C D

then calculation of the FIRST_1 set of A depends on the calculation of the FIRST_1 set of B. Thus,

    A --depends on--> B

would be added to the dependency graph. When the graph is built, it is checked to make sure that it has no cycles. If it has no cycles, then "starting points" can be found from which calculation of all FIRST_1 sets can be completed.

[Diagram: a starting point in the dependency graph.]

After the dependency graph has been built, all arrows are reversed. A "starting point" is any node with no incoming arrows. The FIRST_1 sets will flow in the direction of the arrows. The FIRST_1 set is copied from one node to the other in the direction of the reversed arrow.

Likewise, to calculate the FOLLOW_1 set, all productions would be sequentially searched. Whenever a terminal follows any symbol, that terminal is added to the symbol's FOLLOW_1 set. When a nonterminal follows a symbol, the FIRST_1 set of that nonterminal is added to the FOLLOW_1 set of the symbol.

Next, a dependency graph must be constructed for the symbols which are on the end of each production. For example, if

    A -> B C D

appears as a production in the grammar, the FOLLOW_1 set of D depends on the FOLLOW_1 set of A, so

    D --depends on--> A

would be added to the dependency graph. After the dependency graph is built, it would be checked to make sure that there are no cycles. If there are no cycles, starting points would be chosen and the FOLLOW_1 set calculation would continue in the same manner as in the FIRST_1 algorithm. The only difference between the two graphs is that the FOLLOW_1 set graph includes both terminals and nonterminals, while the FIRST_1 set graph only includes nonterminals.

(3) Parse Table Compression

Many techniques for the compression of an SLR(1) table exist. Several are mentioned in [Aho; 1973]. One of these techniques could substantially reduce the size of the parse table.
It is claimed in [Aho; 1973] that the SLR(1) table can be compressed to the point where it is competitive with a precedence parse table in size.

(4) Deterministic Parsing

Each segment which is generated should have its seed-set included within its tables so that the Linker can determine whether the resulting language meets the Combinability Criterion. Then, the restriction that only one segment reference can appear in any one state can be relaxed. Restriction to one segment per state is the present method of ensuring a deterministic parse.

(5) Linker

The Linker should be made more powerful. If it were, it could alter the parse tables for the various segments and include the ENTER and EXIT actions at the proper places in the table. It could also eliminate the need for the terminal cross-reference table, the segment name cross-reference table, and the sets of segment constants, which must be stored in the present system, by standardizing the codes for all symbols.

(6) Production Lengths

The table of production lengths which is currently passed along with the segments should be eliminated. Instead, the length information should be encoded in the parse tables by the Segment Generator.

(7) Error Recovery Mechanism

Some sort of error recovery mechanism should be included within the standard parser so that some degree of intelligent recovery from an error is possible. This is one major area which deserves more research.

It may be possible to specify an error recovery mechanism within the syntax of a segment. The error recovery notation should be concise and should fit into the normal syntax specification in a natural way. This is a very interesting and possibly fertile area for research.

7. CLOSING REMARKS

The Genesis system is currently being used at the Illinois State Geological Survey in the development of the Retrieval Request Language (RRL) of the Mineral Resources Evaluation System.
Genesis is especially well-suited to the special environment of the Computer Services Unit of the Survey. The programming staff is small and always has much more work to do than it can do. Genesis has allowed us to work on RRL in short bursts, as our schedules permit, while still accomplishing something.

RRL was divided into ten segments. We have been able to code a small group of segments at one time, then test it. Since segments can have only a limited interaction with each other, debugging is simple. With the additional assistance of a semantic trace, we have been able to quickly isolate bugs as they arise.

Thus far the Genesis system has proven to be a useful tool.

BIBLIOGRAPHY

Aho, A. V., & J. D. Ullman. The Theory of Parsing, Translation and Compiling. Volumes I and II. Englewood Cliffs, New Jersey: Prentice-Hall, 1972 and 1973.

Aho, A. V., S. C. Johnson, & J. D. Ullman. "Deterministic Parsing of Ambiguous Grammars." Comm. of the ACM, August 1975.

Balzer, R. M., & D. J. Farber. "APAREL--A Parse Request Language." Comm. of the ACM, November 1969.

Conway, M. E. "Design of a Separable Transition-Diagram Compiler." Comm. of the ACM, July 1963.

DeRemer, F. L. Practical Translators for LR(k) Languages. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, 1969.

Geschke, C. M., & J. G. Mitchell. "On the Problem of Uniform References to Data Structures." IEEE Trans. on Software Engineering, June 1975.

Hopcroft, J. E., & J. D. Ullman. Formal Languages and Their Relation to Automata. Reading, Massachusetts: Addison-Wesley Publishing Co., 1969.

Horning, J. J., & W. R. Lalonde. "Empirical Comparison of LR(k) and Precedence Parsers." ACM SIGPLAN Notices, November 1970.

Horning, J. J. "LR Grammars and Analysers." In Compiler Construction: An Advanced Course. Heidelberg: Springer-Verlag Berlin, 1974.

Johnson, W. L., J. H. Porter, S. I. Ackley, & D. T. Ross. "Automatic Generation of Efficient Lexical Processors Using Finite State Techniques." Comm. of the ACM, December 1968.

Korenjak, A. J. "A Practical Method for Constructing LR(k) Processors." Comm. of the ACM, November 1969.

Liskov, B. H., & S. N. Zilles. "Specification Techniques for Data Abstractions." IEEE Trans. on Software Engineering, March 1975.

McKeeman, W. M., J. J. Horning, & D. B. Wortman. A Compiler Generator. Englewood Cliffs, New Jersey: Prentice-Hall, 1970.

Mickunas, M. D., & V. B. Schneider. "A Parser Generating System for Constructing Compressed Compilers." Comm. of the ACM, November 1973.

Reynolds, J. C. "GEDANKEN--A Simple Typeless Language Based on the Principle of Completeness and the Reference Concept." Comm. of the ACM, May 1970.

APPENDIX A

Genesis Syntactic Meta-Language Syntax

    INPUT     -> INITIAL SEG;
    INITIAL   -> INITS;
    INITS     -> INITS INIT;
              -> INIT;
    INIT      -> SETTABLE '=' 21 DIGITS;
              -> CHARCONST '=' 22 LITERAL;
              -> 23 'LISTPARSE';
              -> 24 'LISTFSM';
    SETTABLE  -> 1 'MXPRODS';
              -> 2 'MXCORES';
              -> 3 'ACTNSIZ';
              -> 4 'GOSIZ';
              -> 5 'NAMSIZ';
              -> 6 'LITSIZ';
              -> 7 'LISTSIZ';
              -> 8 'MXRIGHT';
              -> 9 'LITLEN';
              -> 10 'NAMLEN';
              -> 11 'MXPSTAT';
              -> 12 'ITEMPER';
              -> 13 'ACTSIZ';
              -> 14 'GOTOSIZ';
              -> 15 'GOSMSIZ';
              -> 16 'ITEMSIZ';
              -> 17 'MXFENTRY';
              -> 18 'MXFSTAT';
              -> 19 'NUMSIZ';
              -> 20 'NUMLEN';
    CHARCONST -> 25 'NEWCHARS';
              -> 26 'COMSTART';
              -> 27 'COMEND';
    # THE BNF SYNTAX ;
    SEG       -> SYNT SEMANT;
    SYNT      -> 50 'SYNTAX:' PRODS;
    PRODS     -> PRODS PROD;
              -> PROD;
    PROD      -> LHS '->' RHS 59 ';';
    LHS       -> 51 IDENTIFIER;
    RHS       -> 52 RHSYMS;
    RHSYMS    -> RHSYMS RHSYM;
              -> RHSYM;
    RHSYM     -> 53 LITERAL;
              -> 54 IDENTIFIER;
              -> 55 '<';
              -> 56 '>';
              -> 57 DIGITS;
              -> 58 [illegible];
    SEMANT    -> 60 'SEMANTICS:';
    SEMANTICS.
APPENDIX B

Compiler Description Language Syntax

    SYNTAX:
    LINKINPUT -> INITS SEGMENTS FSM COMPSPEC 35;
    INITS     -> INITS INIT;
              -> INIT;
    INIT      -> SETTABLE '=' 21 DIGITS;
    SETTABLE  -> 22 'MAXPSTATE';
              -> 23 'MAXSEGS';
              -> 24 'MAXPRODS';
              -> 25 'MAXTRMLN';
              -> 26 'MAXNAMLN';
              -> 27 'GOTOSIZ';
              -> 28 'ACTSIZ';
              -> 29 'TRMSIZ';
              -> 30 'NTRMSIZ';
              -> 40 'NAMSIZ';
              -> 41 'LITSIZ';
    # SEGMENT SPECIFICATION ;
    SEGMENTS  -> MAJOR MINOR;
    MAJOR     -> 32 'MAJOR:' 1 IDENTIFIER [illegible];
    MINOR     -> 'MINOR:' IDLIST 3 '.';
              -> ;
    IDLIST    -> IDLIST ',' 2 IDENTIFIER;
              -> 2 IDENTIFIER;
    # COMPILER COMPOSITION ;
    COMPSPEC  -> [illegible] IDENTIFIER 34 ROUTINES LISTING INITIAL;
    ROUTINES  -> ROUTINES ',' ROUTINE;
              -> ROUTINE;
    ROUTINE   -> 4 IDENTIFIER '(' 5 IDENTIFIER ')';
    LISTING   -> 'LISTING:' OPTIONS [illegible];
              -> ;
    OPTIONS   -> OPTIONS OPTION;
              -> OPTION;
    OPTION    -> 'INDENT' '=' 6 DIGITS;
              -> 'RMARGIN' '=' 7 DIGITS;
              -> 'LMARGIN' '=' 8 DIGITS;
              -> 'PSTAKLN' '=' 9 DIGITS;
              -> 'NAMSIZ' '=' 50 DIGITS;
              -> 'NAMLEN' '=' 51 DIGITS;
              -> 'LITSIZ' '=' 52 DIGITS;
              -> 'LITLEN' '=' 53 DIGITS;
              -> 'NUMSIZ' '=' 54 DIGITS;
              -> 'NUMLEN' '=' 55 DIGITS;
              -> 'FAIL' '=' 10 DIGITS;
    INITIAL   -> 'INIT:' IDECLARE IBODY;
    IDECLARE  -> 11 'DECLARE:' #CARD IMAGES OF SOME PL/I DECLARES;
    IBODY     -> 12 'BODY:' #CARD IMAGES OF A PL/I INIT ROUTINE;
    # FINITE STATE MACHINE INITIALIZATION ;
    FSM       -> 'FSM:' INITSFSM;
              -> ;
    INITSFSM  -> INITSFSM INITFSM;
              -> INITFSM;
    INITFSM   -> FSETTABLE '=' 13 DIGITS;
              -> CHARCONST '=' 14 LITERAL;
              -> 15 'LISTFSM';
    FSETTABLE -> 16 'MXFENTRY';
              -> 17 'MXFSTAT';
    CHARCONST -> 18 'NEWCHARS';
              -> 19 'COMSTART';
              -> 20 'COMEND';
    SEMANTICS

APPENDIX C

SLR(1) CONSTRUCTION PERFORMED ON THE EXAMPLE SEGMENTS

Augmented Grammars:

    ASSIGN  -> ASSIGN$;
    ASSIGN$ -> NAME '=' EXPR;

    EXPR    -> EXPR$;
    EXPR$   -> EXPR$ '*' EXPR$;
    EXPR$   -> EXPR$ '+' EXPR$;
    EXPR$   -> NAME;
    EXPR$   -> DIGITS;

    NAME    -> NAME$;
    NAME$   -> IDENTIFIER SUBSCR;
    SUBSCR  -> '(' EXPR ')';
    SUBSCR  -> ;

SLR(1) CONSTRUCTION

ASSIGN

    STATE 1
    [ASSIGN -> . ASSIGN$]              on ASSIGN$ goto STATE 2
    [ASSIGN$ -> . NAME '=' EXPR]       on NAME goto STATE 3

    STATE 2
    [ASSIGN -> ASSIGN$ .]              on EOF REDUCE

    STATE 3
    [ASSIGN$ -> NAME . '=' EXPR]       on '=' SHIFT and goto STATE 4

    STATE 4
    [ASSIGN$ -> NAME '=' . EXPR]       on EXPR goto STATE 5

    STATE 5
    [ASSIGN$ -> NAME '=' EXPR .]       on EOF REDUCE

EXPR

    STATE 1
    [EXPR -> . EXPR$]                  on EXPR$ goto STATE 2
    [EXPR$ -> . EXPR$ '*' EXPR$]       on EXPR$ goto STATE 2
    [EXPR$ -> . EXPR$ '+' EXPR$]       on EXPR$ goto STATE 2
    [EXPR$ -> . NAME]                  on NAME goto STATE 3
    [EXPR$ -> . DIGITS]                on DIGITS SHIFT and goto STATE 4

    STATE 2
    [EXPR -> EXPR$ .]                  on EOF REDUCE
    [EXPR$ -> EXPR$ . '*' EXPR$]       on '*' SHIFT and goto STATE 5
    [EXPR$ -> EXPR$ . '+' EXPR$]       on '+' SHIFT and goto STATE 6

    STATE 3
    [EXPR$ -> NAME .]                  on '+' REDUCE
                                       on '*' REDUCE
                                       on EOF REDUCE

    STATE 4
    [EXPR$ -> DIGITS .]                on '+' REDUCE
                                       on '*' REDUCE
                                       on EOF REDUCE

    STATE 5
    [EXPR$ -> EXPR$ '*' . EXPR$]       on EXPR$ goto STATE 7
    [EXPR$ -> . EXPR$ '*' EXPR$]       on EXPR$ goto STATE 7
    [EXPR$ -> . EXPR$ '+' EXPR$]       on EXPR$ goto STATE 7
    [EXPR$ -> . NAME]                  on NAME goto STATE 3
    [EXPR$ -> . DIGITS]                on DIGITS SHIFT and goto STATE 4

    STATE 6
    [EXPR$ -> EXPR$ '+' . EXPR$]       on EXPR$ goto STATE 8
    [EXPR$ -> . EXPR$ '*' EXPR$]       on EXPR$ goto STATE 8
    [EXPR$ -> . EXPR$ '+' EXPR$]       on EXPR$ goto STATE 8
    [EXPR$ -> . NAME]                  on NAME goto STATE 3
    [EXPR$ -> . DIGITS]                on DIGITS SHIFT and goto STATE 4

    STATE 7
    [EXPR$ -> EXPR$ '*' EXPR$ .]       on '*' REDUCE
                                       on '+' REDUCE
                                       on EOF REDUCE
    [EXPR$ -> EXPR$ . '*' EXPR$]       on '*' REDUCE
    [EXPR$ -> EXPR$ . '+' EXPR$]       on '+' REDUCE

    STATE 8
    [EXPR$ -> EXPR$ '+' EXPR$ .]       on '*' SHIFT and goto STATE 5
                                       on '+' REDUCE
                                       on EOF REDUCE
    [EXPR$ -> EXPR$ . '*' EXPR$]       on '*' SHIFT and goto STATE 5
    [EXPR$ -> EXPR$ . '+' EXPR$]       on '+' REDUCE

NAME

    STATE 1
    [NAME -> . NAME$]                  on NAME$ goto STATE 2
    [NAME$ -> . IDENTIFIER SUBSCR]     on IDENTIFIER SHIFT and goto STATE 3

    STATE 2
    [NAME -> NAME$ .]                  on EOF REDUCE

    STATE 3
    [NAME$ -> IDENTIFIER . SUBSCR]     on SUBSCR goto STATE 4
    [SUBSCR -> .]                      on EOF REDUCE
    [SUBSCR -> . '(' EXPR ')']         on '(' SHIFT and goto STATE 5

    STATE 4
    [NAME$ -> IDENTIFIER SUBSCR .]     on EOF REDUCE

    STATE 5
    [SUBSCR -> '(' . EXPR ')']         on EXPR goto STATE 6

    STATE 6
    [SUBSCR -> '(' EXPR . ')']         on ')' SHIFT and goto STATE 7

    STATE 7
    [SUBSCR -> '(' EXPR ')' .]         on EOF REDUCE

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-77-854
Title and Subtitle: Genesis--A Compiler Generator Using Language Segmentation
Report Date: March 1977
Author(s): Jay Hoeflinger
Performing Organization Name and Address: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801

Abstract: This thesis describes the Genesis compiler generator system, which allows the generation and linkage of individual language segments. Language segments are described and the concept of segmentation parsing is developed using a model automaton. The implementation of segmentation parsing within Genesis is discussed. The various features of Genesis are also presented, including its self-generative capability.

Key Words and Document Analysis. Descriptors: compiler generators, segmentation, SLR(1), self-generative, expression disambiguation

Security Class (This Report): UNCLASSIFIED
Security Class (This Page): UNCLASSIFIED