 
 UIUCDCS-R-77-854 
 
 
 Genesis — A Compiler Generator Using 
 Language Segmentation 
 
 UILU-ENG 77 1707 
 
 
 March 1977 
 
 by 
Jay Philip Hoeflinger
 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
GENESIS--A COMPILER GENERATOR USING LANGUAGE SEGMENTATION 
 
 BY 
 JAY PHILIP HOEFLINGER 
 B.S., University of Illinois, 1974 
 
 THESIS 
 
 Submitted in partial fulfillment of the requirements 
 for the degree of Master of Science in Computer Science 
 in the Graduate College of the 
University of Illinois at Urbana-Champaign, 1977
 
 Urbana, Illinois 
 
 
 ACKNOWLEDGMENT 
 
 I would like to express my gratitude to my thesis 
 advisor, Professor Garry Kampen. He provided many stimu- 
 lating ideas and much guidance during the preparation of 
 this thesis. 
 
 I would also like to express my sincere appreciation 
 to my wife, Donna, whose support, patience, and hard work 
 have made the job of preparing this thesis much easier. 
 
 
TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Computing Environment
   1.2 Motivation
   1.3 Background
   1.4 Organization of this Thesis

2. OVERVIEW
   2.1 Genesis Description
   2.2 Segmentation
   2.3 Example

3. SYSTEM OPERATION
   3.1 General
   3.2 Segment Generator
   3.3 Linker

4. THEORETICAL BASIS
   4.1 Choice of Parsing Technique
   4.2 SLR(1) Parsing and Table Construction
   4.3 Basis for SLR(1) Construction
   4.4 Formal SLR(1) Definitions
   4.5 The Combinability Criterion

5. ALGORITHM DESCRIPTION
   5.1 Parser
   5.2 Lexical Scanner
   5.3 Segment Generator
   5.4 Linker
   5.5 Implementation Description

6. FUTURE IMPROVEMENTS

7. CLOSING REMARKS

BIBLIOGRAPHY

APPENDIX A

APPENDIX B

APPENDIX C
 
LIST OF FIGURES

Figure

1. Genesis Overall Structure
2. One-Step Compiler Generation with Special Version of Genesis
3. Segmentation Parse Tree for "A=B"
4. Segmentation Parse Tree for "A(A(5))=B"
5. Example Segments
6. Genesis Recognizer
7. Combinability Criterion Applied to Example Segments
8. Potential Infinite Loop Situation for Segment Entrance
9. Parse Table Structure for Linked Segments
10. Finite State Machine States for "BE" and "BEGIN"
11. Compressed Parse Table Structure for Each Segment
12. Compressed Lexical Scanner Table Structure
 
1. INTRODUCTION 
 
The Genesis (Generator System) compiler generator
system allows the user to generate language subsets
individually, then link together those which form a complete
language to get a compiler for the language. The language
subsets are called "segments" of the language. Segment-module
libraries may be built which make the modules available
to any language designer or compiler writer who may
need them.
 
 One segment-module is generated at a time by a pro- 
 gram called the Segment Generator. The segment syntax is 
 written in a syntactic meta-language similar to BNF and the 
 segment semantics is written in PL/I. The segment-modules 
 are linked together by a program called the Linker which 
 produces a compiler based on a description written in a 
 compiler description language. 
 
This thesis describes the theoretical aspects, the
algorithms, and the implementation details behind Genesis.
 
1.1 Computing Environment

The Genesis compiler generator is written in PL/I
and runs on an IBM 360/75 under HASP-OS/MVT, with 1M bytes
of fast core and 2M bytes of slow core, located on the campus
of the University of Illinois at Urbana-Champaign. The
Segment Generator runs in 128K bytes plus an input-dependent
area. The Linker runs in 108K bytes plus an input-dependent
area. The compiler generated by the system runs in 48K
bytes plus an area for its syntax tables and an area for its
semantic code.
 
1.2 Motivation

Genesis is a tool which should promote the "structured
design" of programming languages, just as "structured"
programming languages have promoted the structured design of
programs. With Genesis, a language can be divided into
segments which have one entry and one exit, much as a program
can be divided into subroutines which have one entry and one
exit. This modularized language design technique decreases
the complexity of the compiler-debugging task.
 
 Since the segment-modules are independent of each 
 other and are stored in a library, much like a subroutine 
 library, they are available to anyone who may need them. 
 Ideally, segments which are sufficiently general could be 
 used in many compilers. Segments such as one with standard 
 control structures, or one with expressions containing 
 standard operators could be useful in many languages. Each 
 "standard" segment used by a compiler writer would make the 
 task of writing and debugging the compiler easier. In ad- 
 dition, the use of "standard" segments would tend to stan- 
 dardize the structure of the compiler and therefore would 
 make the compiler easier to understand for someone familiar 
 with the standard components. 
 
1.3 Background

How does Genesis compare with other compiler generators
which have been developed? Where does Genesis fit in
the wide spectrum of approaches to the construction of
compiler generators? To answer these questions, three widely
different approaches to compiler generation have been selected
to compare and contrast with Genesis. The three are: XPL
[McKeeman; 1970], APAREL [Balzer; 1969], and PGS [Mickunas;
1973].

XPL and APAREL are compiler-writing languages. PGS
and Genesis rely on ordinary programming languages for the
actual implementation of a compiler.

Of the three systems, APAREL is probably the least
like Genesis. Both the syntax and semantics of a language
are expressible within the framework of the APAREL language.
The language is an extension to PL/I which includes pattern
statements resembling Backus-Naur Form for specifying
the syntax of the language. While APAREL encourages complete
mixing of syntax and semantics within a user-written
compiler, Genesis provides nearly complete separation of them.
The principle behind this separation in Genesis is that these
two dissimilar, very complex, and often unrelated entities
should be studied individually before they are studied
together. APAREL achieves much greater flexibility for the
compiler because syntax patterns can be invoked at any point
in the compiler.
 
McKeeman et al. not only provide a compiler writing 
 language in XPL, but they provide a compiler skeleton as 
 well. SKELETON, as it is called, consists of a structure of 
 subroutines written in XPL which is filled in by the user to 
 form a compiler. In addition, the language syntax may be 
 written in BNF and submitted to the ANALYSER program which 
 constructs parse tables for SKELETON. XPL differs from 
 Genesis in one major point — the compiler writer must know 
 the structure and details of the XPL system thoroughly to be 
 able to add his/her own code to that of SKELETON. Using that 
 knowledge, the user can tailor the lexical analysis to his 
 own special needs. One of the prime goals in the design of 
 Genesis is that a user need not be bothered with the details 
 of implementation. 
 
The PGS system is the most like Genesis of the three
systems. In PGS, the syntax is expressed in BNF with semantic
tags, showing which semantic subroutine is to be executed at
each point in the parse. Assembler, Algol, and Fortran
subroutines may be invoked as semantic routines in PGS. In
Genesis, "semantic numbers" are placed at various points
within the syntax and at those points the user-written PL/I
semantic routine for the current segment is called and the
semantic number is passed to it.
 
1.4 Organization of this Thesis

This thesis is organized in a top-down fashion. The
major ideas behind the system are presented in this Introduction.
In Section 2, a general overview of the system plus a
generalized system diagram are presented. Also, a simple
example of segmentation is presented. This example will be
referenced throughout the paper. In Sections 3 through 5,
the system components will be described in detail, the
theoretical basis for the system will be stated, and the
system algorithms will be presented. Section 6 discusses
possible improvements to the system.
 
2. OVERVIEW 
 
2.1 Genesis Description

Genesis consists of two major programs, the Segment
Generator and the Linker. The Segment Generator receives a
segment written in the Genesis syntactic meta-language
(syntax) and PL/I (semantics). It then generates a parse
table module, which it puts in a parse table library, and
uses a PL/I compiler to produce a semantic object module,
which it puts in the semantic library. These two modules
are logically associated because they are filed under the
same name in each library. The Linker reads a description
of the compiler which it must build and finds the appropriate
segments in both libraries. The Linker produces three
modules: a compiler load module, a parse table module and
a lexical scanner table module. Those three modules together
form the compiler for the language. Figure 1 shows the
Genesis overall structure.
 
 A special-case version of the Genesis system exists, 
 which generates an entire compiler in one step from one 
 syntax/semantics description. This version strips away all 
 segmentation overhead made unnecessary since only one "segment" 
 exists. Figure 2 shows a block diagram of this program. 
 
[Figure 1 is a block diagram. A segment is input to the Segment
Generator, and a Language X compiler description is input to the
Linker. The output, the Language X compiler, consists of the
Compiler Load Module (compiler semantics, parser, and lexical
scanner), the Parse Table Module, and the Scanner Table Module.]

Figure 1. Genesis Overall Structure
 
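[Figure 2 is a block diagram. A Language X syntax/semantics
description is input to the Language Generator, which produces
the Language X compiler: the Compiler Load Module (compiler
semantics, parser, and lexical scanner), the Parse Table Module,
and the Scanner Table Module. A program written in Language X
is the input to the generated compiler.]

Figure 2. One-Step Compiler Generation with Special Version
of Genesis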
 
2.2 Segmentation
 
 2.2.1 Definitions 
 
 The word "segment" will appear many times in this 
 thesis. It is important that the term be clearly understood. 
 A language segment is the combined syntax-semantics descrip- 
 tion for a subset of a language. 
 
 To distinguish the input to the Segment Generator 
 from its output, the syntactic meta-language and PL/I descrip- 
 tion will be called a segment, while the syntax tables and 
 object code produced will be called a segment-module. 
 
A language which is formed by the combination of two 
 or more segments is termed a "segmented language." The recog- 
 nition process for a segmented language which involves both 
 parsing within a segment, and moving between segments is 
 called "segmentation parsing." The segmentation process has 
 little to do with the actual parsing technique used within 
 each segment. Segmentation merely provides a superstructure 
 within which the parse is carried out. This would make it 
 possible for each segment to be parsed via a different tech- 
 nique. 
 
 For the most part, this thesis will deal with the 
 problems inherent in combining the syntax parts of several 
 segments. Thus, for simplicity, "segment" will be frequently 
 used in place of "syntax part of a segment" throughout the 
 rest of this paper. 
 
 2.2.2 Segment References 
 
 Within one segment, a reference to another segment 
 looks exactly like the reference to an ordinary nonterminal 
 symbol. The Segment Generator can distinguish between the 
 two since a segment reference is simply an undefined non- 
 terminal within some segment. 
 
 For example, consider the segments below. 
 
    A -> B C;
    B -> 'b';

        Segment A

    C -> 'c';

        Segment C
 
 
Within Segment A, A and B are nonterminals, 'b' is a
terminal symbol, and C is a segment reference. Now, suppose
the input sentence 'bc' were to be parsed according to the
syntax structure defined by segments A and C. The 'b' of
the input sentence would be found to be syntactically correct
according to Segment A. To examine the rest of the input
sentence, the segmentation superstructure causes an entrance
into Segment C and then parsing continues within C. When 'c'
has been recognized as syntactically correct within C, the
segmentation superstructure leaves Segment C and returns to
Segment A, indicating that C has been successfully recognized.
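
The rule that a segment reference is simply an undefined nonterminal can be sketched in a few lines of Python. This is only an illustration, not the Segment Generator's PL/I code: the function name, the list-of-pairs grammar encoding, and the convention that terminals are written in quotes are all assumptions, and Segment A is taken to have the shape A -> B C; B -> 'b'; as the discussion above implies.

```python
# Special lexeme-class names (see section 2.3) are not segment references.
LEXEME_CLASSES = ("IDENTIFIER", "DIGITS", "LITERAL")

def segment_references(productions):
    """Return, in order of first use, every nonterminal that appears on a
    right-hand side but is never defined within this segment."""
    defined = {lhs for lhs, _ in productions}
    references = []
    for _, rhs in productions:
        for symbol in rhs:
            is_terminal = symbol.startswith("'") or symbol in LEXEME_CLASSES
            if not is_terminal and symbol not in defined and symbol not in references:
                references.append(symbol)
    return references

# Hypothetical encodings of the two segments discussed above.
SEGMENT_A = [("A", ["B", "C"]), ("B", ["'b'"])]
SEGMENT_C = [("C", ["'c'"])]

print(segment_references(SEGMENT_A))
print(segment_references(SEGMENT_C))
```

Under these assumptions, C is the only symbol of Segment A reported as a reference, and Segment C refers to no other segment.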
 
2.3 Example

The segmentation concept may be best illustrated at
this point by an example. First, consider the grammar for a
very simple type of assignment statement which is made up of
three segments:

    ASSIGN -> NAME '=' EXPR;        Segment ASSIGN

    EXPR   -> NAME;                 Segment EXPR
           -> DIGITS;

    NAME   -> IDENTIFIER;           Segment NAME
 
In this example and throughout this thesis, the
special names IDENTIFIER, DIGITS, and LITERAL will denote
special sets of lexemes. IDENTIFIER is the name of the set
of lexemes which begin with a letter (A-Z) and continue with
either letters or digits (0-9). DIGITS is the name of the
set of lexemes which are strings of digits. LITERAL is the
name of the set of lexemes which begin with a single quote
and end with a single quote, with any character in between.
An internal single quote is represented by two consecutive
single quotes.
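
The three lexeme classes can be written down as regular expressions. The sketch below is an illustration in Python, not the finite state machine scanner that Genesis generates; the pattern and function names are assumptions, and only the upper-case letters A-Z are admitted, as stated above.

```python
import re

# Patterns follow the prose definitions of the three special lexeme sets.
IDENTIFIER = re.compile(r"[A-Z][A-Z0-9]*")      # letter, then letters or digits
DIGITS = re.compile(r"[0-9]+")                  # a string of digits
LITERAL = re.compile(r"'(?:[^']|'')*'")         # quotes; '' is an internal quote

def classify(lexeme):
    """Return the name of the lexeme class matching the whole string, or None."""
    for name, pattern in (("IDENTIFIER", IDENTIFIER),
                          ("DIGITS", DIGITS),
                          ("LITERAL", LITERAL)):
        if pattern.fullmatch(lexeme):
            return name
    return None

print(classify("A1"), classify("42"), classify("'IT''S'"), classify("1A"))
```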
 
This grammar describes a language which allows
sentences of two forms:

    IDENTIFIER = IDENTIFIER

and

    IDENTIFIER = DIGITS.

The syntax tables for each of these three segments would be
generated separately, then linked together. The segmentation
parse tree for the input sentence "A=B" is shown in Figure 3.
 
To explore the power provided by segmentation,
suppose a wider variety of names is to be accepted, namely
singly subscripted identifiers. To accomplish this, the NAME
segment would be rewritten, its syntax tables generated and
then linked with the existing ASSIGN and EXPR segments. The
new NAME segment is shown in Figure 4. The EXPR segment is
mentioned in the new NAME segment and therefore anything which
the existing EXPR segment accepts is immediately valid where
EXPR appears in NAME. Figure 4 illustrates this with the
input sentence "A(A(5))=B."
 
    ASSIGN -> NAME '=' EXPR;
    EXPR   -> NAME;
           -> DIGITS;
    NAME   -> IDENTIFIER;

        Grammar

[Parse tree diagram with nodes ASSIGN, NAME, EXPR, and IDENTIFIER.]

Figure 3. Segmentation Parse Tree for "A=B"

    ASSIGN -> NAME '=' EXPR;
    EXPR   -> NAME;
           -> DIGITS;
    NAME   -> IDENTIFIER SUBSCR;
    SUBSCR -> ;
           -> '(' EXPR ')';

        Grammar

[Parse tree diagram with nodes ASSIGN, NAME, EXPR, and IDENTIFIER.]

Figure 4. Segmentation Parse Tree for "A(A(5))=B"

Next, EXPR can be expanded to accept operators. The
new EXPR segment is shown in Figure 5. When it is generated
and linked with the existing NAME and ASSIGN segments,
expressions throughout the language are suddenly richer. The
input sentence "A(B+3)=B+A(B+2)" is acceptable with this
new set of segments.

    ASSIGN -> NAME '=' EXPR;

        Segment ASSIGN

    EXPR   -> EXPR < '*' EXPR;
           -> EXPR < '+' EXPR;
           -> NAME;
           -> DIGITS;

        Segment EXPR

    NAME   -> IDENTIFIER SUBSCR;
    SUBSCR -> ;
           -> '(' EXPR ')';

        Segment NAME

Figure 5. Example Segments
 
 While it is true that the grammar given in Figure 5 
 for the EXPR segment is ambiguous, it has been shown [Aho; 
 1975] that such a grammar plus some operator associativity 
 and precedence information is a perfectly valid expression 
 grammar. In fact, the Genesis system accepts an expression 
 grammar including sufficient associativity and precedence 
 information to remove ambiguity. 
 
3. SYSTEM OPERATION

3.1 General
 
 The most striking thing about the two major components 
 of Genesis (the Segment Generator and the Linker) and the 
 generated compiler is that they all have exactly the same 
 structure. This brings about one very important property 
 of the system: new versions of either the Segment Generator 
 or the Linker can be produced by the existing system, which 
 would automatically generate the new program. 
 
 The structure which the major components of Genesis 
 exhibit is shown in Figure 6. This structure consists of a 
 standard Recognizer which accesses a parse table and a 
 lexical scanner table and which calls user-written semantics 
 routines. 
 
The Recognizer is made up of a parser and a lexical
scanner. The parser accesses the parse table and performs
parse actions according to an SLR(1) parsing algorithm. In
addition, it causes entry into and exit from the various
segments in the language being parsed. The lexical scanner
accesses the lexical scanner table and performs actions
according to a finite state machine algorithm.
 
At various points in the course of parsing an input
program, the parse table may indicate that a semantic action
must be performed. When such a point occurs, the parser calls
the appropriate user-written semantic routine which then
performs the action and returns.

[Figure 6 is a block diagram of the Recognizer, showing the
SEMANTICS routines and the Parse Table. Shaded areas are
standard components.]

Figure 6. Genesis Recognizer
 
3.2 Segment Generator
 
 3.2.1 Syntactic Meta-language 
 
The Segment Generator's input is a grammar for a
single segment written in a syntactic meta-language. The
major extensions beyond BNF in the meta-language concern ways
of specifying certain Genesis options and constants, semantic
information, and operator associativity and precedence
information within expressions. The full syntax of the Genesis
syntactic meta-language appears in Appendix A.
 
In this syntax notation, a blank left-hand side means
the same left-hand side as that of the last production. In
addition, several productions with identical left-hand sides
may be written with one left-hand side and a series of
right-hand sides separated by the alternation symbol (|). Each
production or production-group with a single left-hand side
is terminated by a semicolon (;).

The semantic numbers are "associated" with the symbol
appearing to the immediate right of the number. If there
is no associated symbol (the semantic number appears at the
far right-hand end of the production), then the number is
associated with the reduction of the entire production. When
the associated symbol has been recognized, the semantic
package indicated by the semantic number is executed.
 
The special symbols which have to do with expression
disambiguation are left- or right-pointing arrows ("<" or
">"). An operator is identified by the presence of a
left-arrow to its left or a right-arrow to its right. A
left-pointing arrow on the left side of an operator symbol
denotes left-associativity. A right-pointing arrow on the
right side of an operator symbol denotes right-associativity.
If both a left-arrow and a right-arrow appear surrounding a
symbol, then the symbol is treated as non-associative.
 
 The precedence of an operator is determined by its 
 position in the grammar compared with the positions of the 
 other operators. The higher a production appears on the page 
 in a listing of the grammar, the higher the precedence of its 
 operator. If the operator is a nonterminal symbol, all terminal 
 symbols which that nonterminal can produce are defined to be 
 operators of equal precedence and associativity. 
 
 The expression grammar shown below contains six 
 operators. The exponentiation ('**') operator has the highest 
 precedence and is right-associative. The multiplication ('*') 
 and division ('/') operators have equal precedence and are 
 both left-associative. The addition ('+') and subtraction 
 ('-') operators are both of equal precedence and of lower 
 precedence than exponentiation, multiplication and division. 
 The less-than operator ('<') has the lowest priority and is 
 non-associative. This means that an expression like "A<B<C" 
 is not allowed according to this grammar. 
 
    EXPR  -> EXPR '**' > EXPR;
          -> EXPR < MULOP EXPR;
          -> EXPR < ADDOP EXPR;
          -> EXPR < '<' > EXPR;
          -> IDENTIFIER;
    MULOP -> '*' | '/';
    ADDOP -> '+' | '-';
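
A conventional way to use such precedence and associativity information is to settle each shift/reduce conflict by comparing the operator already on the stack with the operator in the lookahead. The Python sketch below illustrates that standard rule for the six operators described above. It is not taken from the Genesis implementation: the numeric levels, the table layout, and the function name are assumptions made for illustration.

```python
# (precedence level, associativity); a larger level binds more tightly.
# Levels mirror the ordering described in the text: '**' highest, '<' lowest.
OPERATORS = {
    "**": (4, "right"),
    "*": (3, "left"), "/": (3, "left"),
    "+": (2, "left"), "-": (2, "left"),
    "<": (1, "none"),
}

def resolve(stack_op, lookahead_op):
    """Choose the parse action when 'EXPR stack_op EXPR' is on the stack
    and lookahead_op is the next input symbol."""
    stack_level, stack_assoc = OPERATORS[stack_op]
    lookahead_level, _ = OPERATORS[lookahead_op]
    if stack_level > lookahead_level:
        return "reduce"
    if stack_level < lookahead_level:
        return "shift"
    # Equal precedence: associativity decides.
    return {"left": "reduce", "right": "shift", "none": "error"}[stack_assoc]

print(resolve("+", "*"))    # A+B*C: the multiplication is grouped first
print(resolve("**", "**"))  # A**B**C: grouped from the right
print(resolve("<", "<"))    # A<B<C: rejected
```

The last call shows why "A<B<C" is not allowed: with a non-associative operator neither shifting nor reducing is a legal resolution.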
 
 3.2.2 Segment Generator Semantics 
 
 The Segment Generator performs four major tasks. 
 First, and foremost, it builds the parse table by applying 
 the Simple LR(1) construction algorithm to the segment's 
 grammar. While that construction proceeds, the Generator 
 performs its second major task, that of keeping track of any 
 segment references in the grammar and the parse states in 
 which they occur. The third task takes place after the 
 construction algorithm has made one complete pass over the 
 grammar—all ambiguities caused by an expression grammar are 
 resolved. The fourth major task is that of compressing the 
 parse table. 
 
3.3 Linker
 
 3.3.1 Linker Input Language 
 
 The Linker's compiler description language can com- 
 pletely describe a compiler. The language provides a means 
 for setting various Linker constants, naming which segments 
 are to be linked together, specifying a body of initialization 
 code which will be executed in the compiler before compilation 
 begins, and selecting from among various compiler options. 
 The syntax of this compiler description language appears in 
 Appendix B. 
 
 3.3.2 Linker Operation 
 
The Linker performs six major tasks. First, it reads
the segments for the compiler and stacks the parse tables
together to form one parse table. Second, it builds a
 lexeme cross-reference table which keeps track of the way 
 each user-defined lexeme is coded in every segment. Third, 
 a similar cross-reference table is built for the coding of 
 each segment name in every segment. Fourth, a lexical scanner 
 table is generated for the compiler. Fifth, the customized 
 initialization routine and one other customized routine are 
 generated and compiled. Finally, the sixth task is the link- 
 editing of the standard Recognizer with the customized 
 routines and the semantic routines. 
 
 When all of the above has been completed by the Linker, 
 the compiler load module, the parse table and the lexical 
 scanner table are written to files and are ready for use. 
 
4. THEORETICAL BASIS

4.1 Choice of Parsing Technique

Basic to the success of the Genesis system is the
use of a theoretically sound parsing technique. The SLR(1)
technique was chosen. SLR(1) parsing was first introduced
by F. L. DeRemer in [DeRemer; 1969].
 
There are four reasons why the SLR(1) technique was
chosen. First, SLR grammars describe a large subset of the
deterministic context-free languages. Most programming
languages in use today can be described by an SLR
grammar. Second, I personally find that SLR grammars are
more natural to write than other types of grammars. Third,
SLR parsers report an error at the earliest possible time.
This is not the case for some other parsing techniques. For
instance, a precedence parser may examine an arbitrary
number of symbols after an error has occurred before it
reports the error. Finally, SLR parsers can be made
competitive in size and speed with other parsing techniques
through table transformations [Aho; 1973].

The SLR(1) technique has two big advantages over other
LR techniques. The computation time required to generate its
parse table is in general much less, and there are
usually far fewer table entries with SLR(1) than with other
LR techniques [Aho; 1973].
 
4.2 SLR(1) Parsing and Table Construction

The classic model machine for deterministic context-free
language recognition is the deterministic push down
automaton (DPDA). This machine is sufficient for SLR(1)
parsing, but is not quite sophisticated enough for
efficient SLR(1) parsing. The machine chosen to model SLR(1)
parsing efficiently for Genesis is called the Genesis DPDA.

The Genesis DPDA consists of three elements: an
input tape (where the input program comes from); a finite
state control (which controls the machine's actions); and
a push down stack. The machine's current condition is
characterized by its "state" (which is actually the state of
the finite control).
 
The Genesis DPDA can carry out six actions: SHIFT
an input symbol onto the stack; REDUCE a meta-language
production by taking its right-hand side off the stack and
putting its left-hand side on the stack; ACCEPT the input
string; report an ERROR; ENTER another Genesis DPDA; and EXIT
the current Genesis DPDA. Embedded in three of the six
actions is the possible execution of semantic routines.
After the input symbol has been SHIFTED onto the stack, a
semantic routine can be executed. Within REDUCE, both after
the right-hand side has been taken off the stack, and after
the left-hand side has been put on the stack, semantic
routines can be executed. Within ACCEPT, after the right-hand
side of production 1 has been removed from the stack, some
semantic routine may be executed. The inclusion of these
semantic actions makes the Genesis DPDA a translating
machine instead of just a parsing machine. The ENTER and
EXIT actions implement segmentation parsing.
 
The Genesis DPDA effectively models the parse tree
for an input sentence. The SHIFT action gets input symbols
onto the stack, then the REDUCE action has the effect of
replacing them with a single nonterminal symbol (their
father node in a parse tree). That nonterminal can then be
one of the symbols replaced by another REDUCE, and the
process continues until the sentence symbol alone is left
on the stack, at which point the input sentence is ACCEPTED.

As long as the finite control causes the proper
actions at the proper times, the Genesis DPDA can model any
deterministic context-free language parse. The finite
control built for Genesis is for SLR(1) grammars.
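
The six actions can be made concrete with a small table-driven sketch in Python for the three example segments of section 2.3. Everything below is an assumption made for illustration: the tables were written by hand rather than produced by the SLR(1) construction, the state numbers do not correspond to those of Figure 7, semantic routines are omitted, and the "EOF" entry of a non-major segment is taken to apply whenever the lookahead lies outside that segment.

```python
# Hypothetical tables: segment name -> (actions, gotos).
SEGMENTS = {
    "ASSIGN": (
        {(0, "IDENTIFIER"): ("enter", "NAME"),
         (1, "="): ("shift", 2),
         (2, "IDENTIFIER"): ("enter", "EXPR"),
         (2, "DIGITS"): ("enter", "EXPR"),
         (3, "EOF"): ("reduce", "ASSIGN", 3),
         (4, "EOF"): ("accept",)},
        {(0, "NAME"): 1, (2, "EXPR"): 3, (0, "ASSIGN"): 4},
    ),
    "EXPR": (
        {(0, "IDENTIFIER"): ("enter", "NAME"),
         (0, "DIGITS"): ("shift", 1),
         (1, "EOF"): ("reduce", "EXPR", 1),
         (2, "EOF"): ("reduce", "EXPR", 1),
         (3, "EOF"): ("exit",)},
        {(0, "NAME"): 2, (0, "EXPR"): 3},
    ),
    "NAME": (
        {(0, "IDENTIFIER"): ("shift", 1),
         (1, "EOF"): ("reduce", "NAME", 1),
         (2, "EOF"): ("exit",)},
        {(0, "NAME"): 2},
    ),
}

def parse(token_kinds, segments=SEGMENTS, major="ASSIGN"):
    """Return True if the token sequence is ACCEPTED, False on ERROR."""
    tokens = list(token_kinds) + ["EOF"]
    pos = 0
    frames = [(major, [0])]          # one (segment, state stack) per entered DPDA
    while True:
        name, states = frames[-1]
        actions, gotos = segments[name]
        action = actions.get((states[-1], tokens[pos]))
        if action is None and len(frames) > 1:
            # Lookahead is outside this segment: treat it as the segment's EOF.
            action = actions.get((states[-1], "EOF"))
        if action is None:
            return False             # ERROR
        op = action[0]
        if op == "shift":            # SHIFT: consume one input symbol
            states.append(action[1])
            pos += 1
        elif op == "reduce":         # REDUCE: pop right-hand side, push left-hand side
            _, lhs, length = action
            del states[len(states) - length:]
            states.append(gotos[(states[-1], lhs)])
        elif op == "enter":          # ENTER: start another DPDA
            frames.append((action[1], [0]))
        elif op == "exit":           # EXIT: caller sees the segment name as a symbol
            frames.pop()
            caller, caller_states = frames[-1]
            caller_gotos = segments[caller][1]
            caller_states.append(caller_gotos[(caller_states[-1], name)])
        else:                        # ACCEPT
            return True

print(parse(["IDENTIFIER", "=", "IDENTIFIER"]))  # the sentence "A=B"
print(parse(["IDENTIFIER", "="]))                # an incomplete sentence
```

In this sketch an EXIT hands the segment's name back to the calling DPDA, which then treats it exactly like a nonterminal that has just been reduced.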
 
4.3 Basis for the SLR(1) Construction

The SLR(1) table construction is done by building a
series of LR(0) items, each of which results in one parse
action entry in the SLR(1) table. The LR(0) items are built
from an SLR(1) grammar by moving a cursor through the
productions of the grammar to all the possible points which the
parse of an input sentence might reach.

Each LR(0) item has the form

    $[A \rightarrow \alpha \,.\, \beta]$

where $A$ is a nonterminal in the grammar, $\alpha$ and $\beta$
could be nonterminals, terminals, or $\Lambda$ (the empty symbol),
and $A \rightarrow \alpha\beta$ is a production in the grammar.
The cursor is represented by a period (.). An item represents a
parse that has reached the point between $\alpha$ and $\beta$.
 
 Each set of LR(0) items is called a state. For each 
 state there is a collection of valid terminal symbols which 
 will cause the parse to continue. These symbols are each 
 associated with a parse action. The procedure for determining 
 the parse actions is described in section 5.3.4. 
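
The way a state arises as a set of items can be sketched with the usual closure and goto operations on LR(0) items. The Python below is a generic illustration, not the procedure of section 5.3.4: the names Item, closure, and goto, the dictionary encoding of the grammar, and the augmented production EXPR' -> EXPR are assumptions.

```python
from typing import NamedTuple

class Item(NamedTuple):
    """An LR(0) item: production lhs -> rhs with the cursor before rhs[dot]."""
    lhs: str
    rhs: tuple
    dot: int

def closure(items, grammar):
    """Add an item 'B -> . gamma' for every nonterminal B right of a cursor."""
    result = set(items)
    work = list(items)
    while work:
        item = work.pop()
        if item.dot < len(item.rhs):
            symbol = item.rhs[item.dot]
            for rhs in grammar.get(symbol, ()):
                new_item = Item(symbol, tuple(rhs), 0)
                if new_item not in result:
                    result.add(new_item)
                    work.append(new_item)
    return frozenset(result)

def goto(state, symbol, grammar):
    """Move the cursor over 'symbol' in every item of 'state' that allows it."""
    moved = [Item(i.lhs, i.rhs, i.dot + 1)
             for i in state
             if i.dot < len(i.rhs) and i.rhs[i.dot] == symbol]
    return closure(moved, grammar)

# The first EXPR segment of section 2.3; NAME is undefined here
# (a segment reference) and DIGITS is a lexeme class.
GRAMMAR = {"EXPR'": [("EXPR",)], "EXPR": [("NAME",), ("DIGITS",)]}

START = closure([Item("EXPR'", ("EXPR",), 0)], GRAMMAR)
for item in sorted(START):
    print(item)
print(sorted(goto(START, "DIGITS", GRAMMAR)))
```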
 
4.4 Formal SLR(1) Definitions

A grammar is expressed as an ordered 4-tuple $G = (N, T, P, S)$.

    N is the set of all nonterminals in the grammar.
    T is the set of all terminals in the grammar.
    P is the set of all productions in the grammar.
    S is the Start symbol of the grammar, $S \in N$.

The symbol "$\Rightarrow$" means "produces." A grammar is used to
derive the string on the right of "$\Rightarrow$" from the string
on the left.

    "$\Rightarrow^{*}$" means "produces in zero or more steps."
    "$\Rightarrow^{*}_{\text{rightmost}}$" means "produces in zero or
    more steps using a right-most derivation."

The $\mathrm{EFF}_1$ (Epsilon-Free First) and $\mathrm{FOLLOW}_1$
sets will be used in the definition of SLR(1) grammars. The
$\mathrm{FIRST}_1$ set will be used to define $\mathrm{EFF}_1$.
 
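$$\mathrm{FIRST}_1(\alpha) = \{\, x \mid \alpha \Rightarrow^{*} x\delta \text{ and } \mathrm{length}(x) = 1 \,\}$$

$$\mathrm{EFF}_1(\alpha) =
\begin{cases}
\mathrm{FIRST}_1(\alpha) & \text{if } \alpha \text{ does not begin with a nonterminal; or} \\[4pt]
\{\, w \mid w \in \mathrm{FIRST}_1(\alpha) \text{ and there is a derivation }
\alpha \Rightarrow^{*}_{\text{rightmost}} \beta \Rightarrow_{\text{rightmost}} wx
\text{ where } \beta \neq Awx \text{ for any nonterminal } A \,\}
\end{cases}$$

$$\mathrm{FOLLOW}_1(\beta) = \{\, x \mid S \Rightarrow^{*} \alpha\beta\gamma \text{ and } x \in \mathrm{FIRST}_1(\gamma) \,\}$$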
 
SLR(1) grammars are defined as follows:
Let $G = (N, T, P, S)$ be a context-free grammar (not
necessarily LR(0)). Let $S_0$ be the collection of sets of LR(0)
items for $G$. Let $Q$ be any set of items in $S_0$. Suppose that
whenever $[A \rightarrow \alpha \,.\, \beta]$ and
$[B \rightarrow \gamma \,.\, \delta]$ are two distinct items in
$Q$, one of the following conditions is satisfied:

(1) Neither of $\beta$ and $\delta$ are $\Lambda$.

(2) $\beta \neq \Lambda$, $\delta = \Lambda$ and $\mathrm{FOLLOW}_1(B) \cap \mathrm{EFF}_1(\beta\,\mathrm{FOLLOW}_1(A)) = \phi$

(3) $\beta = \Lambda$, $\delta \neq \Lambda$ and $\mathrm{FOLLOW}_1(A) \cap \mathrm{EFF}_1(\delta\,\mathrm{FOLLOW}_1(B)) = \phi$

(4) $\beta = \Lambda$, $\delta = \Lambda$ and $\mathrm{FOLLOW}_1(A) \cap \mathrm{FOLLOW}_1(B) = \phi$

Then $G$ is said to be a simple LR(1) grammar (SLR(1) grammar).
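
As a worked illustration of these sets, the Python sketch below computes FIRST and FOLLOW sets of length one for the first example grammar of section 2.3, with the three segments merged into a single grammar. It assumes a grammar with no empty productions, in which case the EFF set of a string coincides with its FIRST set; the function names, the dictionary encoding, and the "EOF" end marker placed in the FOLLOW set of the sentence symbol are assumptions of the sketch.

```python
def first_sets(grammar, terminals):
    """FIRST sets of length one for a grammar without empty productions."""
    first = {t: {t} for t in terminals}
    first.update({n: set() for n in grammar})
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                new = first[rhs[0]] - first[lhs]
                if new:
                    first[lhs] |= new
                    changed = True
    return first

def follow_sets(grammar, first, start, eof="EOF"):
    """FOLLOW sets of length one; 'eof' marks the end of the input."""
    follow = {n: set() for n in grammar}
    follow[start].add(eof)
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for i, symbol in enumerate(rhs):
                    if symbol not in grammar:
                        continue                  # terminals have no FOLLOW set
                    source = first[rhs[i + 1]] if i + 1 < len(rhs) else follow[lhs]
                    new = source - follow[symbol]
                    if new:
                        follow[symbol] |= new
                        changed = True
    return follow

GRAMMAR = {
    "ASSIGN": [["NAME", "=", "EXPR"]],
    "EXPR": [["NAME"], ["DIGITS"]],
    "NAME": [["IDENTIFIER"]],
}
TERMINALS = {"=", "DIGITS", "IDENTIFIER"}

FIRST = first_sets(GRAMMAR, TERMINALS)
FOLLOW = follow_sets(GRAMMAR, FIRST, "ASSIGN")
for nonterminal in GRAMMAR:
    print(nonterminal, sorted(FIRST[nonterminal]), sorted(FOLLOW[nonterminal]))
```

For this grammar NAME can be followed either by '=' or by the end of the input, which is the information an SLR(1) table needs in order to decide when NAME -> IDENTIFIER may be reduced.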
 
4.5 The Combinability Criterion
4.5.1 Definitions

Each segment undergoes SLR(1) table generation. For
each state in a segment's tables, the Segment Generator keeps
track of the set of lexemes which cause a correct parse to
continue. This set is called the continue-set for the state.
The names of all segments which could be entered from each
state are also kept on a segment-list for that state.
 
Each segment has a single entry point: its first
state. The continue-set for a segment's first state will be
called the segment's seed-set.

For instance, referring back to the example in section
2.3, and Figure 7 below, the seed-set of ASSIGN is the set
IDENTIFIER of identifiers, the seed-set of NAME is the set
IDENTIFIER, and the seed-set of EXPR is the set
IDENTIFIER $\cup$ DIGITS. The first state in ASSIGN's parse
table lists NAME as a segment which could be entered. State 3
in ASSIGN has "=" in its continue-set and nothing in its
enterable-segment list.
 
 4.5.2 The Criterion 
 
Regardless of the parsing technique used for any of
the segments, there is a general criterion for deterministic
segmentation parsing. Intuitively, the Combinability Criterion
states that when all segments are combined to form a language,
a fixed number of symbols will be sufficient to determine the
next parsing action or the next entry into or exit from a
segment. Genesis uses one symbol to make that decision.
 
 Each state in each segment has a possibly-empty 
 continue-set C. If that state's segment-list is not empty, 
 then associated with that state is a collection S of seed-sets. 
 
 
 The algorithm to determine the members of collection 
 S is: 
 
 1. Copy the state's segment-list into membership list M. 
 
2. Examine each element of M, beginning with the first.
If that segment has a segment-list in its first
state, merge that list into M (if a member of the
segment-list is already in M, don't do anything for
that member).
 
 3. Continue until there are no more elements in M to 
 examine. 
 
 4. The group of seed-sets of all segments named in M 
 is the collection S. 
 
The combinability criterion is that C and all sets in
S must be collectively disjoint. Let I be an index set for S.
Then,

$$C \cap S_i = \phi \quad (\forall i \in I)$$

and

$$S_i \cap S_j = \phi \quad (\forall i, j \in I \text{ such that } i \neq j)$$

where $\phi$ is the empty set and $S_i$ is the i'th set in S.
 
 Figure 7 shows that the three example segments meet 
 the Combinability Criterion. 
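
 The four steps and the disjointness test above can be sketched
 in Python. This is only an illustration under stated assumptions:
 the function names and the dictionary layout (first_state_lists,
 seed_sets) are hypothetical and are not part of Genesis. The
 example data is state 1 of EXPR, whose continue-set is DIGITS and
 whose segment-list holds NAME.

```python
from itertools import combinations


def collect_seed_sets(segment_list, first_state_lists, seed_sets):
    """Steps 1-4: gather every segment reachable through first-state
    segment-lists, then return the seed-set of each one."""
    members = list(segment_list)              # step 1: copy into M
    position = 0
    while position < len(members):            # steps 2-3: scan M to its end
        for name in first_state_lists.get(members[position], []):
            if name not in members:           # names already in M are skipped
                members.append(name)
        position += 1
    return [frozenset(seed_sets[name]) for name in members]   # step 4


def is_combinable(continue_set, collection):
    """True when C and every set in S are pairwise disjoint."""
    all_sets = [frozenset(continue_set)] + list(collection)
    return all(a.isdisjoint(b) for a, b in combinations(all_sets, 2))


# State 1 of EXPR: continue-set {DIGITS}, segment-list [NAME].
seed_sets = {"NAME": {"IDENTIFIER"}}
first_state_lists = {"NAME": []}
collection = collect_seed_sets(["NAME"], first_state_lists, seed_sets)
expr_state_1_ok = is_combinable({"DIGITS"}, collection)
```

 A state whose continue-set also contained IDENTIFIER would fail
 the same check, since IDENTIFIER is in the seed-set of NAME.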
 
 [Figure 7: one table per example segment (ASSIGN, NAME, EXPR).
 Each table gives the segment's seed-set and, for every parse
 state, the continue-set, the enterable segments, and the
 combinability check. Seed-sets: ASSIGN is IDENTIFIER; NAME is
 IDENTIFIER; EXPR is DIGITS, IDENTIFIER. Checks shown:
 {} ∩ {IDENTIFIER} = ∅ and {} ∩ {DIGITS, IDENTIFIER} = ∅ for ASSIGN;
 {DIGITS} ∩ {IDENTIFIER} = ∅ for EXPR. EOF: End of File.]

 Figure 7. Combinability Criterion Applied to Example Segments
 (Appendix C contains the complete construction of
 the parse tables for these segments.)
 
 
 5. ALGORITHM DESCRIPTION 
 
 5.1 Parser
 
 The Genesis DPDA, discussed in section 4.2, is the 
 model for parsing used in the Recognizer, but the algorithm 
 used to implement it makes some necessary modifications to 
 it. The two actions ENTER and EXIT do not exist at all in 
 the Recognizer algorithm. They are replaced by extra code 
 in the ACCEPT and ERROR actions. The reason for this change 
 is one of efficiency. The ENTER and EXIT instructions can- 
 not be generated by the Segment Generator since it cannot 
 know the seed-set for segments to be entered and since it 
 cannot know whether the segment being generated will be the 
 major segment of a language (the major segment is the segment 
 whose sentence symbol becomes the sentence symbol of the 
 entire language). The Linker knows both of these things, 
 but for efficiency, does not alter the parse tables of the 
 segments which it is linking. 
 
 The Parser has four actions: SHIFT, REDUCE, ACCEPT, 
 and ERROR. SHIFT pushes an input symbol on top of the parse 
 stack, then pushes the next parse state number onto the stack 
 and causes a transfer to that state. A new input symbol is 
 then read. 
 
 REDUCE causes twice the number of symbols which are
 on the right-side of a production to be removed from the
 stack (one for each right-hand side symbol plus one for each
 state number pushed on). Then, the top of the stack will 
 hold a new state number which becomes the current state. 
 The left-hand side nonterminal symbol is pushed onto the 
 stack, and a table called the GOTO table is consulted to see 
 what the next parse state will be. That next state is pushed 
 onto the parse stack. The method for constructing the GOTO 
 table will be described in section 5.3.5. 
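
 The stack handling of SHIFT and REDUCE described above can be
 sketched as follows. This is a minimal illustration, assuming a
 Python list as the parse stack and a dictionary keyed by
 (state, nonterminal) as the GOTO table; both representations and
 the state numbers in the example are hypothetical.

```python
def shift(stack, symbol, next_state):
    """Push the input symbol, then the state to transfer to."""
    stack.append(symbol)
    stack.append(next_state)
    return next_state


def reduce(stack, lhs, rhs_length, goto_table):
    """Pop two entries per right-hand-side symbol, then push the
    left-hand side and the state found in the GOTO table."""
    if rhs_length:
        del stack[-2 * rhs_length:]
    exposed_state = stack[-1]                 # becomes the current state
    next_state = goto_table[(exposed_state, lhs)]
    stack.append(lhs)
    stack.append(next_state)
    return next_state


# Hypothetical fragment: state 1 shifts DIGITS and goes to state 2,
# then EXPR -> DIGITS is reduced with GOTO(1, EXPR) = 3.
stack = [1]
shift(stack, "DIGITS", 2)
reduce(stack, "EXPR", 1, {(1, "EXPR"): 3})
print(stack)   # [1, 'EXPR', 3]
```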
 
 ERROR causes the machine to look in the segment-list 
 for the current parse state and segment. If a segment name 
 is on the list, that segment is entered by placing the cur- 
 rent segment number on the stack, placing state number one on 
 the stack (first state of the new segment) , then changing 
 the current segment number to the new segment number. If 
 the segment-list for the current state and segment is empty, 
 an ERROR is signalled and the parse stops. 
 
 Thus, ERROR simulates the ENTER action of the Genesis 
 DPDA in some cases. The Genesis Linker only allows one segment 
 to be enterable from any one state and it does not record the 
 seed-set for the enterable segment. Whenever an error occurs 
 and a segment is enterable, that segment is entered. 
 
 A circular path of enterable segments can cause an
 infinite loop when a particular lexeme does not appear in the
 seed-set of any segment on the path. To avoid this, a run-time
 check must be made to make sure that no segment is entered for
 a second time for the same input symbol. For example, Figure 8
 shows this type of situation.

 [Figure 8: two segments, A and B, each with its own seed-set.
 State 1 of segment A lists B as an enterable segment, and
 state 1 of segment B lists A as an enterable segment.]

 Figure 8. Potential Infinite Loop Situation for Segment Entrance

 If the symbol ";" (not in the seed-set
 of either segment) is examined in the first state of segment 
 A, the parser would find ERROR as its action, then notice it 
 could enter B, which would find the same situation and re- 
 enter A. An infinite loop would result if it were not 
 stopped. A run-time check would stop execution with an ERROR 
 when A was entered for the second time without ";" being 
 consumed. 
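
 The ERROR-as-ENTER behaviour and the run-time check can be
 sketched as below. This is an illustration only: the function
 name, the `enterable` dictionary, and the `entered` set are
 assumptions, not the Recognizer's actual data structures. The
 caller is assumed to reset `entered` to the current segment each
 time a SHIFT consumes an input symbol.

```python
def enter_on_error(segment, state, enterable, entered, stack):
    """Handle ERROR for the current input symbol.

    `enterable` maps (segment, state) to the one enterable segment;
    `entered` holds the segments already entered for this symbol.
    Returns the new (segment, state) or raises on a true error.
    """
    target = enterable.get((segment, state))
    if target is None:
        raise SyntaxError("no enterable segment: parse stops")
    if target in entered:
        raise SyntaxError("segment %s entered twice for one symbol" % target)
    entered.add(target)
    stack.append(segment)   # remember the segment being left
    stack.append(1)         # first state of the new segment
    return target, 1


# The situation of Figure 8: A and B name each other in state 1.
enterable = {("A", 1): "B", ("B", 1): "A"}
stack, entered = [], {"A"}
segment, state = enter_on_error("A", 1, enterable, entered, stack)
```

 A second call from segment B, state 1, for the same symbol would
 try to enter A again and raise instead of looping.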
 
 If the full Genesis DPDA were implemented and if the 
 Linker implemented the full Combinability Criterion by 
 altering the segments' parse tables with ENTER and EXIT 
 actions, the run-time check would be unnecessary. 
 
 ACCEPT causes an "input accepted" message if the 
 current segment is the major segment. If it is not, then the 
 action is treated just as if it were a REDUCE for production 
 number one of the segment. After the right-hand side of
 production one is removed from the stack, the last segment
 name is removed from the stack. It is changed to a non- 
 terminal name and the GOTO table is consulted for the next 
 parser state. The nonterminal name is pushed onto the stack, 
 then so is the next parse state. 
 
 ACCEPT models the EXIT action of the Genesis DPDA
 exactly for all minor segments (all segments are minor, 
 except for the major segment) . 
 
 5.2 Lexical Scanner
 
 5.2.1 General Description 
 
 The Genesis lexical scanner reads the input program 
 and converts it to tokens which it passes to the parser. The 
 token conversion process is guided by a lexical scanner table. 
 
 The scanner recognizes lexemes of one or more charac- 
 ters as well as identifiers, digit strings, and literals. 
 The latter are recorded in special tables. Comments are recog- 
 nized, then discarded. 
 
 The model machine for the scanner is a finite state 
 machine. The current input symbol together with the current 
 state uniquely determines the actions which the machine will 
 take. The actions are of four different types. Actions of 
 type 1 and type 2 are carried out with every state transition. 
 Sometimes three and possibly all four types are carried out 
 during one state transition. 
 
 
 5.2.2 Lexical Scanner Actions 
 Type 1 : 
 
 (a) CONSUME INPUT SYMBOL 
 or (b) DON'T CONSUME INPUT SYMBOL 
 The machine can choose to go to the next input symbol or not. 
 Type 2 : 
 
 (a) TRANSFER TO <STATE> 
 This changes the current scanner state to <STATE>. 
 The scanner starts in state 1 when called. For every state 
 transition, the state is re-assigned with this action. 
 Type 3: (optional) 
 
 (a) INDICATE THAT <LEXEME> IS RECOGNIZED 
 or (b) INDICATE THAT <LEXEME> MIGHT BE RECOGNIZED 
 or (c) DENY PREVIOUS POSSIBLE LEXEME 
 or (d) CONFIRM PREVIOUS POSSIBLE LEXEME 
 When a lexeme has been recognized, either (a) or (d) 
 will signal that fact. Whenever (b) is performed, one of (c) 
 or (d) will be performed on the next state transition. If 
 (c) is executed, the "possible" lexeme will be declared a 
 false alarm and the machine will continue until one of (a) 
 or (d) is executed. 
 
 Actions (b) , (c) and (d) are necessary since some 
 user-defined lexemes have the same structure as identifiers 
 (like "BEGIN," "END," etc.). 
 
 If "BEGIN" were a user-defined lexeme, then when
 "BEGIN" is seen on input, (b) is executed. Action (a) is
 not possible in this case because the next symbol will
 determine whether the lexeme is "BEGIN," or possibly some
 identifier like "BEGINNING." If the next character after 
 "BEGIN" is a legal continuation for an identifier, then (c) 
 is executed. If the next character is not a legal continua- 
 tion for an identifier, then (d) is executed. 
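
 The interplay of actions (b), (c), and (d) can be sketched as a
 small loop. This is a hedged illustration rather than the
 table-driven scanner itself: the function name is hypothetical,
 and str.isalnum() stands in for "legal continuation for an
 identifier", which is an assumption.

```python
def scan_word(text, start, keywords):
    """Scan an identifier-shaped word starting at text[start] and
    report whether it is a user-defined lexeme or an identifier."""
    pos = start
    possible = None                       # lexeme flagged by action (b)
    while pos < len(text) and text[pos].isalnum():
        if possible is not None:
            possible = None               # action (c): deny, keep scanning
        pos += 1
        if text[start:pos] in keywords:
            possible = text[start:pos]    # action (b): might be recognized
    if possible is not None:
        return ("LEXEME", possible, pos)  # action (d): confirm
    return ("IDENTIFIER", text[start:pos], pos)


keyword_result = scan_word("BEGIN X", 0, {"BEGIN"})
identifier_result = scan_word("BEGINNING", 0, {"BEGIN"})
```

 With "BEGIN" as the only user-defined lexeme, the first call
 confirms the lexeme when the blank is seen, while the second
 denies it on the extra "N" and ends with an identifier.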
 
 Type 4: (optional) 
 
 (a) ENTER <SYMBOL> IN NAME TABLE 
 or (b) ENTER <SYMBOL> IN LITERAL TABLE
 or (c) ENTER <SYMBOL> IN NUMBER TABLE 
 
 This action enters the input string into a table 
 according to the type of symbol it has found. A global 
 variable is set with the symbol's position in the table, so 
 that the user-written semantics can access the symbol. 
 
 When the scanner reaches the end of the input file, 
 the scanner immediately sets the recognized symbol to be 
 <END OF FILE>, then returns. 
 
 5.3 Segment Generator
 5.3.1 General 
 
 Before the Segment Generator algorithms are described, 
 a crucial term must be defined. The core of a parse state is 
 the set of LR(0) items which exist in that state before 
 closure is performed on that state. Closure is a process 
 which generates all additional relevant LR(0) items for a 
 state from the core item of that state. Closure will be 
 described in greater detail later. 
 
 
 5.3.2 Generating the Parse Table 
 
 The parse table generation algorithm begins with the 
 core of the first state being set to the item: 
 
 [S → . α]
 where S is the sentence symbol and S → α is the first
 production in the segment's grammar.
 
 The algorithm first performs closure on the core of 
 the current state. Then, for each item in the state, it com- 
 putes a parse action or a GOTO action and possibly generates 
 a core item for some new state in a temporary state area. 
 
 When all items in the current state are processed, a 
 set of cores for new states will have been generated in the 
 temporary state area. This set of temporary cores is then 
 merged with all the existing states. If a temporary core 
 matches the core of an existing state, all references to the 
 temporary state are changed to refer to the existing state. 
 If the temporary core does not match any existing core, then 
 that temporary core is added to the list of existing cores. 
 
 When all temporary cores have been merged into the 
 existing cores, the next successive existing state is 
 processed (closure is performed, parse actions computed, and 
 temporary states generated, then merged) . This process con- 
 tinues until all existing states are processed. 
 
 5.3.3 Closure 
 
 Closure is performed by looking at each LR(0) item
 in the state (both core items and any generated items) in
 turn. If the grammar symbol to the right of the cursor is a
 terminal symbol, then nothing is done. If the symbol is a
 nonterminal A, then the state is augmented by all items of
 the form

 [A → . α]
 such that A → α is a production in the grammar.
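
 A minimal sketch of the closure step follows. The item
 representation (left-hand side, right-hand side tuple, cursor
 position) and the example grammar are assumptions made for the
 illustration; they are not taken from the Segment Generator.

```python
def closure(core, productions):
    """Return the LR(0) items of a state, given its core.

    An item is (lhs, rhs, cursor) with rhs a tuple of symbols;
    `productions` maps each nonterminal to its right-hand sides.
    """
    items = list(core)
    for lhs, rhs, cursor in items:        # the list grows while scanned
        if cursor < len(rhs) and rhs[cursor] in productions:
            nonterminal = rhs[cursor]     # symbol right of the cursor
            for alternative in productions[nonterminal]:
                new_item = (nonterminal, alternative, 0)
                if new_item not in items:
                    items.append(new_item)
    return items


# Hypothetical grammar: S -> E ; E -> E '+' T | T ; T -> DIGITS
productions = {
    "S": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("DIGITS",)],
}
state_items = closure([("S", ("E",), 0)], productions)
```

 Generated items are appended to the same list that is being
 scanned, so items added late are themselves examined, which
 matches the "both core items and any generated items" wording.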
 
 5.3.4 Computing a Parse Action 
 
 In a state X, an item of the form 
 
 [A → α . b β]
 where b is a terminal symbol, produces a parse action of 
 SHIFT, the creation of a core item for a new state Y of the 
 form 
 
 [A → α b . β]
 and an indication to transfer to that new state after the 
 SHIFT. Thus, whenever the terminal b is reported by the 
 lexical scanner and the parse is in state X, the parser will 
 SHIFT the b onto the stack and transfer to parse state Y. 
 
 An item of the form 
 
 [A → α .]
 produces a parse action of REDUCE for all terminal symbols in 
 the follow set F of symbol A. Whenever a terminal symbol in 
 F is reported, REDUCE is executed for the production. 
 
 5.3.5 Forming the GOTO Table 
 
 In a state X, an item of the form

 [A → α . B β]
 where B is a nonterminal, causes the creation of a core item
 for a new state Y of the form

 [A → α B . β]
 and produces a GOTO table entry to transfer to state Y.
 Several GOTO table entries could exist for each 
 state, but only one for each unique nonterminal appearing to 
 the immediate right of the cursor in that state. 
 
 5.3.6 Parse State Conflicts 
 
 A conflict occurs in a parse state for a grammar when 
 more than one parse action is possible for a particular 
 terminal symbol. If this occurs, the grammar is not an SLR(l) 
 grammar.
 
 For instance, consider the following parse state in 
 the SLR(l) construction for the example segment EXPR: 
 
 State X

 [EXPR → EXPR '*' EXPR .]
     on '*'            REDUCE
     on '+'            REDUCE (conflict)
     on <end of file>  REDUCE

 [EXPR → EXPR . '+' EXPR]
     on '+'            SHIFT (conflict)

 (the follow set of EXPR is {'*', '+', <end of file>})
 
 The symbol "+" could trigger two possible actions, 
 SHIFT and REDUCE. This is called a SHIFT-REDUCE conflict. 
 
 It shows that EXPR is not in SLR(l) form. 
 
 
 Some SHIFT-REDUCE conflicts can be resolved by Genesis.
 Specifically, ambiguous expression grammars like the EXPR 
 grammar above can be disambiguated if operator precedence and 
 associativity information is included in the grammar. 
 
 5.3.7 Disambiguating Expressions 
 
 (Refer back to section 3.2 for the format of operator
 associativity and precedence information used for Genesis.)
 
 Conflicts in the states of an expression grammar 
 parse are of two kinds. The first is in a state with the 
 following two types of items: 
 
 [EXPR → EXPR Op1 EXPR .]
 [EXPR → EXPR . Op2 EXPR]
 "Op1" and "Op2" denote different operators. When two different
 operators are involved in the conflict, as above, the
 operator with the higher precedence has control. In the
 above situation, if Op1 had higher or equal precedence, the
 action would be REDUCE. If Op2 had higher precedence, the
 action would be SHIFT.

 The second kind of conflict is one with the following
 two types of items in one state:

 [EXPR → EXPR Op1 EXPR .]
 [EXPR → EXPR . Op1 EXPR]

 This conflict is between two identical operators.
 The associativity of the operator determines the parse action.
 If the operator is left-associative, then REDUCE is the proper
 action. If the operator is right-associative, then SHIFT is
 the proper action. If the operator is non-associative, then
 ERROR is the proper action.
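
 The two resolution rules can be written as one small function.
 This is a sketch under assumptions: the function name, the
 numeric precedence levels (larger binds tighter), and the
 "left"/"right"/"none" labels are hypothetical, and the format of
 section 3.2 is not reproduced here.

```python
def resolve(reduce_op, shift_op, precedence, associativity):
    """Choose the action for a conflict between reducing by
    EXPR -> EXPR reduce_op EXPR and shifting shift_op."""
    if reduce_op != shift_op:
        # Different operators: higher precedence wins, ties reduce.
        if precedence[reduce_op] >= precedence[shift_op]:
            return "REDUCE"
        return "SHIFT"
    # Identical operators: associativity decides.
    actions = {"left": "REDUCE", "right": "SHIFT", "none": "ERROR"}
    return actions[associativity[reduce_op]]


precedence = {"*": 2, "+": 1, "**": 3, "<": 0}
associativity = {"*": "left", "+": "left", "**": "right", "<": "none"}
state_x_action = resolve("*", "+", precedence, associativity)
```

 For the State X conflict shown earlier (reduce by '*', shift
 '+'), these hypothetical levels give REDUCE, so multiplication
 binds before addition.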
 
 5.4 Linker
 
 5.4.1 General 
 
 The Linker program performs the following six major
 tasks.
 
 (1) Parse Table Construction 
 
 The Parse and GOTO table construction is done by 
 stacking the parse and GOTO tables for all the segments 
 together. A segment index is built which shows where each 
 segment's tables begin. Figure 9 shows the structure of the 
 combined parse tables. Section 5.5.2 discusses the Segment 
 Generator's algorithm for producing the individual compressed 
 parse tables. 
 
 (2) Terminal Symbol Cross-Reference 
 
 The terminal symbols from all the segments are merged 
 into one list with one entry for each unique terminal symbol. 
 Then, for each entry in that terminal symbol list, a terminal 
 capsule is formed. A terminal capsule lists the numeric code 
 used for a certain terminal in each segment. None of the parse 
 table entries need to be changed, even though one terminal 
 symbol is represented by different codes in different seg- 
 ments, since the proper code for the symbol in any segment 
 can always be found in its capsule. 
 
 [Figure 9: a segment index (segments 1, 2, 3, ...) points to each
 segment's state index (states 1, 2, 3, ...), whose entries point
 into the combined list of non-error parse table entries.]

 Figure 9. Parse Table Structure for Linked Segments
 
 (3) Segment Cross-Reference 
 
 Segment references are treated as undefined non- 
 terminals in each segment. Each segment name, therefore, is 
 represented by some numeric code in at least one other seg- 
 ment. The codes for each segment name in all segments are 
 kept together in a table similar to the terminal cross- 
 reference table. 
 
 (4) Lexical Scanner Table Generation 
 
 The Finite State Machine Generator program accepts 
 a list of user-defined lexemes plus several constants and 
 option-selections as input. Its output is an FSM table which
 can be used to recognize any of the user-defined lexemes plus
 any identifiers, literals, digit strings, and comments.
 
 Before the table generation process begins, the list 
 of user-defined lexemes is sorted by length, longest first. 
 Then the list is partitioned into several classes. The 
 lexemes are classified according to the first character of 
 each. The classes are: Alphabetic (A-Z), Blank, and 
 Special ("*", "-", "$", etc.). No user-defined lexeme can 
 begin with a digit. 
 
 The various parts of the table can be built in any 
 order. The parts which have to be built are one each for: 
 identifiers, literals, digit strings, comments, those lexemes 
 beginning with a blank, those beginning with an alphabetic 
 character, and those beginning with a special character. 
 
 The parts of the table built from the user-defined 
 lexemes are constructed by starting in state 1 and building 
 state transitions, character-by-character, until the end of 
 the lexeme. If some of the state transitions were built 
 previously, they are left intact. Figure 10 shows an example 
 of states being built for the lexemes "BE" and "BEGIN." 
 "BEGIN" would have been processed first since it is longer. 
 The part of the table for "BE" would use the previously-built 
 states and transitions. 
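
 The character-by-character construction can be sketched as a
 shared-prefix table. This is an illustration only: the function
 name and the dictionary representation are assumptions, and the
 sketch covers just the user-defined lexemes, not the identifier,
 literal, digit-string, or comment states.

```python
def build_lexeme_states(lexemes):
    """Build transitions[state][char] -> state, starting in state 1."""
    transitions = {1: {}}
    recognized = {}                           # state -> lexeme ending there
    for lexeme in sorted(lexemes, key=len, reverse=True):   # longest first
        state = 1
        for char in lexeme:
            if char not in transitions[state]:
                new_state = len(transitions) + 1
                transitions[state][char] = new_state
                transitions[new_state] = {}
            # transitions built earlier are left intact and reused
            state = transitions[state][char]
        recognized[state] = lexeme
    return transitions, recognized


transitions, recognized = build_lexeme_states(["BE", "BEGIN"])
```

 "BEGIN" is processed first and creates states 2 through 6; "BE"
 then walks the existing transitions from state 1 and only marks
 state 3 as the end of a lexeme.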
 
 It is recommended that no lexeme begin with a blank,
 since if one does, every blank in an input program must be
 checked to see whether it is the start of that lexeme. This
 greatly degrades scanner performance. Each special token
 (identifier, literal, etc.) causes the construction of special
 states. Literals, comments, and digit strings each start with
 a character unique among all lexemes. When this character is
 found in scanner state 1, the machine jumps to the appropriate
 state.

 [Figure 10: state diagram annotated with "Signal" and "Confirm".]

 Figure 10. Finite State Machine States for "BE" and "BEGIN"
 
 There are two special identifier states. One is an 
 identifier prefix state which checks to see whether a "pos- 
 sible" lexeme is really that symbol, or is really an identi- 
 fier. The other state causes the machine to loop until the 
 end of an identifier and then reports it has found an 
 identifier, and enters that identifier in the name table. 
 
 The states in the table which represent DIGITS, 
 LITERAL, and comments are constructed in the obvious manner, 
 based on the definition of each. 
 
 (5) PL/I Code Generation 
 
 The Linker generates the INIT routine from user 
 specifications and from the PL/I code given in the user's 
 compiler description. The SEMANT routine is also generated. 
 This routine is called by the parser whenever a semantic 
 action is to be done. The SEMANT routine then calls the 
 appropriate user-written semantic routine. 
 
 (6) Compiler Link-Edit 
 
 The final job of the Linker is not performed by the 
 PL/I coded Linker program at all. The linking of all the 
 object code for the compiler is done by the system linkage 
 editor after the INIT and SEMANT routines are compiled. 
 
 
 5.5 Implementation Description
 
 5.5.1 Segment Generator Virtual Parse Table 
 
 The representation chosen for the parse table, the 
 GOTO table, and two other tables used in the Segment Genera- 
 tor causes them to get very large for large grammars. These 
 tables were made into virtual tables to enable the user to 
 select any main memory size for them. 
 
 The user specifies the number of parse states which 
 are to reside in main memory during parse table generation 
 for each table. Each of the virtual tables is referenced
 through a virtual table manager routine which first checks to 
 see whether the requested state is in memory. If it isn't, 
 the current block of states is written to secondary storage 
 and the correct block is read from secondary storage. Then 
 the correct offset into the main memory block is computed and 
 returned. 
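
 A virtual table manager of this kind might look like the
 following sketch. Everything here is an assumption made for
 illustration: the class and method names, zero-based state
 numbers, and a dictionary standing in for secondary storage.

```python
class VirtualTable:
    """Keeps one block of parse states in memory; other blocks live
    in `backing`, a dict standing in for secondary storage."""

    def __init__(self, states_per_block, row_width):
        self.states_per_block = states_per_block   # chosen by the user
        self.row_width = row_width
        self.backing = {}
        self.current_block = 0
        self.memory = self._empty_block()

    def _empty_block(self):
        return [[0] * self.row_width for _ in range(self.states_per_block)]

    def locate(self, state):
        """Return (resident block, offset of `state` within it)."""
        block = state // self.states_per_block
        if block != self.current_block:
            self.backing[self.current_block] = self.memory   # write out
            self.memory = self.backing.pop(block, None)      # read in
            if self.memory is None:
                self.memory = self._empty_block()
            self.current_block = block
        return self.memory, state % self.states_per_block


table = VirtualTable(states_per_block=2, row_width=3)
rows, offset = table.locate(0)
rows[offset][1] = 7          # written while block 0 is resident
rows, offset = table.locate(5)   # forces block 0 out, block 2 in
rows[offset][0] = 9
```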
 
 5.5.2 Parse Table Encoding and Compression 
 
 The parse actions from the parse table are each en- 
 coded into one word and all ERROR actions are left out to 
 compress the parse table. The compressed parse table takes 
 the form of a parse action list with one entry per parse 
 action and an index with one entry per parse state. The 
 entries in the index point to the starting place in the 
 parse action list of the entries for each parse state. 
 Figure 11 shows this structure. 
 
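 [Figure 11: an index with one entry per state (1, 2, ...), each
 pointing to the start of that state's entries in the list of
 non-error parse table entries.]

 Figure 11. Compressed Parse Table Structure for Each Segment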
 
 The parse actions are encoded as either REDUCE or 
 SHIFT. The ACCEPT action is treated as a REDUCE for produc- 
 tion 1. 
 
 For SHIFT, the terminal symbol, the next parse state, 
 and the semantic action (if any) are encoded into one word. 
 The word is set negative to indicate SHIFT. 
 
 For REDUCE, the terminal symbol, the production 
 number, and the semantic action (if any) are encoded in one 
 word. The word is positive for REDUCE. 
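
 The one-word encoding can be illustrated as below. The field
 widths and their order inside the word are not given in the
 text, so the 8-bit fields used here are purely hypothetical;
 only the sign convention (negative for SHIFT, positive for
 REDUCE) comes from the description above. Terminal codes are
 assumed to start at 1 so that no word is zero.

```python
FIELD_BITS = 8                      # hypothetical field width
FIELD_MASK = (1 << FIELD_BITS) - 1


def encode(action, terminal, target, semantic=0):
    """Pack terminal, target (next state or production number), and
    semantic action into one word; SHIFT words are made negative."""
    word = (terminal << 2 * FIELD_BITS) | (target << FIELD_BITS) | semantic
    return -word if action == "SHIFT" else word


def decode(word):
    """Unpack a word produced by encode()."""
    action = "SHIFT" if word < 0 else "REDUCE"
    magnitude = abs(word)
    terminal = magnitude >> 2 * FIELD_BITS
    target = (magnitude >> FIELD_BITS) & FIELD_MASK
    semantic = magnitude & FIELD_MASK
    return action, terminal, target, semantic


shift_word = encode("SHIFT", terminal=5, target=12, semantic=3)
accept_word = encode("REDUCE", terminal=5, target=1)   # ACCEPT as REDUCE 1
```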
 
 5.5.3 The Lexical Scanner Table Encoding and Compression 
 The Genesis lexical scanner table has a unique char- 
 acter above each column. A row represents one state of the 
 machine. The letters (A-Z), the special characters ("+",
 "-", "=", etc.), and the digits (0-9) form three distinct
 classes of entries. If the characters in these classes 
 occupy adjacent columns in the table, then entries are fairly 
 uniform as one follows a row across the table. For this 
 reason, the method of encoding the table involves recording 
 entries in a list only when they change, going across in a 
 row. Therefore, for one row (state), no two successive 
 entries in the compressed table are the same. A bit map is 
 recorded for each row, showing which entries in the original 
 table exist in the compressed table. 
 
 An index points to the place in the entry list where 
 a state's entries begin. The first entry for a state is 
 always taken from the first column in the table. Thereafter, 
 entries are recorded only when they differ from the last one 
 recorded. The bit map for a state contains one bit for every 
 column in the table. When a bit is set to 1, the correspond- 
 ing column entry exists in the compressed table. The first 
 column entry always exists in the entry list, though the 
 bit for that column is always set to zero. 
 
 The algorithm for decoding this table is to count how 
 many bits are set to 1 in the bit map from bit one up to and 
 including the bit for the column which is wanted from the 
 table. That result is added to the index entry for the proper 
 state to get to the index of the requested element in the 
 entry list. Figure 12 shows this structure. 
 
 original table:

              A B C D E F G H I J
    state x   Q Q Q R S S S T T T

 compressed table:

              BITMAPS      POINTERS     SEQUENTIAL LIST
    state x   0001100100   -------->    Q R S T

 Figure 12. Compressed Lexical Scanner Table Structure
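
 The encoding and decoding of one row can be sketched as follows,
 using the row of Figure 12 (Q Q Q R S S S T T T). The function
 names and the use of Python lists for the bit map and entry list
 are assumptions made for the illustration.

```python
def compress_row(row):
    """Record an entry only when it differs from the one before it.
    The first column is always recorded, but its bit stays 0."""
    bits = [0]
    entries = [row[0]]
    for previous, current in zip(row, row[1:]):
        changed = current != previous
        bits.append(1 if changed else 0)
        if changed:
            entries.append(current)
    return bits, entries


def lookup(bits, entries, start, column):
    """Return the table entry for a 1-based column. `start` is the
    index entry for the state (where its entries begin)."""
    ones = sum(bits[:column])      # bits set from bit one through column
    return entries[start + ones]


row = list("QQQRSSSTTT")
bits, entries = compress_row(row)
bitmap_text = "".join(str(bit) for bit in bits)
decoded = [lookup(bits, entries, 0, column) for column in range(1, 11)]
```

 For this row the bit map comes out as 0001100100, and decoding
 every column reproduces the original row from the four stored
 entries.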
 
 
 
 6. FUTURE IMPROVEMENTS 
 
 Seven areas where improvement would be helpful in 
 Genesis are described below: 
 
 (1) Building the Parse Table 
 
 While building the SLR(l) parse table in the Gener- 
 ator, the entire table need not be represented anywhere. 
 Entries only need be allocated when they are being built. 
 Each entry of the table could contain a pointer to the next 
 entry of the table for a particular state. A set of pointers 
 could point to the first entry in each state. 
 
 This method of representing the parse table should 
 save a significant amount of memory since the great majority 
 of entries in an SLR(l) table are ERROR entries, which needn't 
 be built at all and would therefore never be allocated. 
 
 (2) Calculating FIRST and FOLLOW Sets 
 
 The calculation of the FIRST and FOLLOW sets is done 
 using recursion in the current Genesis system. This process 
 is simple, but expensive in both time and space requirements. 
 If a bit map were kept for each set, both space and time 
 usage would improve. The bit maps would be constructed once, 
 then used continually throughout the building of the SLR(l) 
 table. 
 
 To calculate the FIRST₁ set, first find all productions
 whose right-hand side begins with a terminal symbol and
 mark that symbol as being in the left-hand side's FIRST₁ set.
 When no more such productions exist, then a dependency graph
 must be built for nonterminals in the grammar.

 For instance, if the following production appears in
 the grammar:

 A → B C D

 then calculation of the FIRST₁ set of A depends on the
 calculation of the FIRST₁ set of B. Thus,

 (A) -----------> (B)
      depends on

 would be added to the dependency graph. When the graph is
 built, it is checked to make sure that it has no cycles. If
 it has no cycles, then "starting points" can be found from
 which calculation of all FIRST₁ sets can be completed.

 (B) -----------> (A)
 Starting
 Point

 After the dependency graph has been built, all arrows
 are reversed. A "starting point" is any node with no
 incoming arrows. The FIRST₁ sets will flow in the direction of
 the arrows. The FIRST₁ set is copied from one node to the
 other in the direction of the reversed arrow.

 Likewise, to calculate the FOLLOW₁ set, all productions
 would be sequentially searched. Whenever a terminal
 follows any symbol, that terminal is added to the symbol's
 FOLLOW₁ set. When a nonterminal follows a symbol, the FIRST₁
 set of that nonterminal is added to the FOLLOW₁ set of the
 symbol.

 Next, a dependency graph must be constructed for the
 symbols which are on the end of each production. For example,
 if

 A → B C D

 appears as a production in the grammar, the FOLLOW₁ set of D
 depends on the FOLLOW₁ set of A, so

 (D) -----------> (A)
      depends on

 would be added to the dependency graph. After the dependency
 graph is built, it would be checked to make sure that there
 are no cycles. If there are no cycles, starting points would
 be chosen and the FOLLOW₁ set calculation would continue in
 the same manner as in the FIRST₁ algorithm. The only difference
 between the two graphs is that the FOLLOW₁ set graph
 includes both terminals and nonterminals, while the FIRST₁
 set graph only includes nonterminals.
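
 The bit-map approach to the FIRST sets can be sketched with
 Python integers as bit maps and the standard-library
 graphlib.TopologicalSorter (Python 3.9 or later) for the cycle
 check and the ordering from the starting points. This is a
 hedged illustration, not the proposed implementation: the
 function name and grammar layout are hypothetical, right-hand
 sides are assumed non-empty, and a production that begins with
 its own left-hand side is simply skipped rather than treated as
 a cycle.

```python
from graphlib import TopologicalSorter


def first_sets(productions, terminals):
    """productions: list of (lhs, rhs tuple). Returns lhs -> bit map."""
    bit = {t: 1 << i for i, t in enumerate(terminals)}
    first = {lhs: 0 for lhs, _ in productions}
    depends_on = {lhs: set() for lhs in first}
    for lhs, rhs in productions:
        head = rhs[0]
        if head in bit:
            first[lhs] |= bit[head]        # terminal begins the right side
        elif head != lhs:
            depends_on[lhs].add(head)      # FIRST(lhs) needs FIRST(head)
    # static_order() raises graphlib.CycleError if the graph has a cycle
    # and otherwise yields each node after the nodes it depends on.
    for node in TopologicalSorter(depends_on).static_order():
        for dependency in depends_on.get(node, ()):
            first[node] |= first.get(dependency, 0)
    return first


# Hypothetical grammar: A -> B c ; B -> d ; B -> e A
terminals = ["c", "d", "e"]
productions = [("A", ("B", "c")), ("B", ("d",)), ("B", ("e", "A"))]
first = first_sets(productions, terminals)
```

 Here B is the starting point, its bit map holds d and e, and the
 same bits are then copied into the bit map for A.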
 
 (3) Parse Table Compression 
 
 Many techniques for the compression of an SLR(l) table 
 exist. Several are mentioned in [Aho; 1973]. One of these 
 techniques could substantially reduce the size of the parse 
 table. It is claimed in [Aho; 1973] that the SLR(l) table
 can be compressed to the point where it is competitive with
 a precedence parse table in size. 
 
 (4) Deterministic Parsing 
 
 Each segment which is generated should have its 
 seed-set included within its tables so that the Linker can 
 determine whether the resulting language meets the Combin- 
 ability Criterion. Then, the restriction that only one segment 
 reference can appear in any one state can be relaxed. Restric- 
 tion of one segment per state is the present method of ensur- 
 ing a deterministic parse. 
 
 (5) Linker 
 
 The Linker should be made more powerful. If it were, 
 it could alter the parse tables for the various segments and 
 include the ENTER and EXIT actions at the proper places in 
 the table. It could also eliminate the need for the terminal 
 cross-reference table, the segment name cross-reference table, 
 and the sets of segment constants, which must be stored in 
 the present system, by standardizing the codes for all 
 symbols.
 
 (6) Production Lengths 
 
 The table of production lengths which is currently 
 passed along with the segments should be eliminated. Instead, 
 the length information should be encoded in the parse tables 
 by the Segment Generator. 
 
 
 (7) Error Recovery Mechanism 
 
 Some sort of error recovery mechanism should be in- 
 cluded within the standard parser so that some degree of 
 intelligent recovery from an error is possible. This is 
 one major area which deserves more research. It may be 
 possible to specify an error recovery mechanism within the 
 syntax of a segment. 
 
 The error recovery notation should be concise and 
 should fit into the normal syntax specification in a natural 
 way. This is a very interesting and possibly fertile area 
 for research. 
 
 
 CLOSING REMARKS 
 
 The Genesis system is currently being used at the 
 Illinois State Geological Survey in the development of the 
 Retrieval Request Language (RRL) of the Mineral Resources 
 Evaluation System. Genesis is especially well-suited to the 
 special environment of the Computer Services Unit of the 
 Survey. The programming staff is small and always has much 
 more work to do than it can do. Genesis has allowed us to 
 work on RRL in short bursts, as our schedules permit, while 
 still accomplishing something. 
 
 RRL was divided into ten segments. We have been able 
 to code a small group of segments at one time, then test it. 
 Since segments can have only a limited interaction with each 
 other, debugging is simple. With the additional assistance 
 of a semantic trace, we have been able to quickly isolate 
 bugs as they arise. Thus far the Genesis system has proven 
 to be a useful tool. 
 
 
 BIBLIOGRAPHY 
 
 Aho, A. V., & J. D. Ullman. The Theory of Parsing, Translation
 and Compiling. Volumes I and II. Englewood
 Cliffs, New Jersey: Prentice-Hall, 1972 and 1973.

 Aho, A. V., S. C. Johnson, & J. D. Ullman. "Deterministic
 Parsing of Ambiguous Grammars." Comm. of the ACM,
 August 1975.

 Balzer, R. M., & D. J. Farber. "APAREL — A Parse Request
 Language." Comm. of the ACM, November 1969.

 Conway, M. E. "Design of a Separable Transition-Diagram
 Compiler." Comm. of the ACM, July 1963.

 DeRemer, F. L. Practical Translators for LR(k) Languages.
 Ph.D. Thesis, Massachusetts Institute of Technology,
 Cambridge, Massachusetts, 1969.

 Geschke, C. M., & J. G. Mitchell. "On the Problem of Uniform
 References to Data Structures." IEEE Trans. on
 Software Engineering, June 1975.

 Hopcroft, J. E., & J. D. Ullman. Formal Languages and Their
 Relation to Automata. Reading, Massachusetts:
 Addison-Wesley Publishing Co., 1969.

 Horning, J. J., & W. R. Lalonde. "Empirical Comparison of
 LR(k) and Precedence Parsers." ACM SIGPLAN Notices,
 November 1970.

 Horning, J. J. "LR Grammars and Analysers." In Compiler
 Construction: An Advanced Course. Heidelberg:
 Springer-Verlag Berlin, 1974.

 Johnson, W. L., J. H. Porter, S. I. Ackley, & D. T. Ross.
 "Automatic Generation of Efficient Lexical Processors
 Using Finite State Techniques." Comm. of the ACM,
 December 1968.

 Korenjak, A. J. "A Practical Method for Constructing LR(k)
 Processors." Comm. of the ACM, November 1969.

 Liskov, B. H., & S. N. Zilles. "Specification Techniques
 for Data Abstractions." IEEE Trans. on Software
 Engineering, March 1975.

 McKeeman, W. M., J. J. Horning, & D. B. Wortman. A Compiler
 Generator. Englewood Cliffs, New Jersey: Prentice-
 Hall, 1970.

 Mickunas, M. D., & V. B. Schneider. "A Parser Generating
 System for Constructing Compressed Compilers." Comm.
 of the ACM, November 1973.

 Reynolds, J. C. "GEDANKEN — A Simple Typeless Language Based
 on the Principle of Completeness and the Reference
 Concept." Comm. of the ACM, May 1970.
 
 
 APPENDIX A 
 
 Genesis Syntactic Meta-Language Syntax 
 
 INPUT      → INITIAL SEG;
 INITIAL    → INITS;
 INITS      → INITS INIT;
            → INIT;
 INIT       → SETTABLE '=' 21 DIGITS;
            → CHARCONST '=' 22 LITERAL;
            → 23 'LISTPARSE';
            → 24 'LISTFSM';
 SETTABLE   → 1 'MXPRODS';
            → 2 'MXCORES';
            → 3 'ACTNSIZ';
            → 4 'GOSIZ';
            → 5 'NAMSIZ';
            → 6 'LITSIZ';
            → 7 'LISTSIZ';
            → 8 'MXRIGHT';
            → 9 'LITLEN';
            → 10 'NAMLEN';
            → 11 'MXPSTAT';
            → 12 'ITEMPER';
            → 13 'ACTSIZ';
            → 14 'GOTOSIZ';
            → 15 'GOSMSIZ';
            → 16 'ITEMSIZ';
            → 17 'MXFENTRY';
            → 18 'MXFSTAT';
            → 19 'NUMSIZ';
            → 20 'NUMLEN';
 CHARCONST  → 25 'NEWCHARS';
            → 26 'COMSTART';
            → 27 'COMEND';

 # THE BNF SYNTAX ;

 SEG        → SYNT SEMANT;
 SYNT       → 50 'SYNTAX:' PRODS;
 PRODS      → PRODS PROD;
            → PROD;
 PROD       → LHS '→' RHS 59 ';';
 LHS        → 51 IDENTIFIER;
            → 52 ;
 RHS        → RHSYMS;
 RHSYMS     → RHSYMS RHSYM;
            → RHSYM;
 RHSYM      → 53 LITERAL;
            → 54 IDENTIFIER;
            → 55 '<';
            → 56 '>';
            → 57 DIGITS;
            → 58 >
 SEMANT     → 60 'SEMANTICS: SEMANTICS .
 
 
 
 
 APPENDIX B 
 
 Compiler Description Language Syntax 
 
 SYNTAX:

 LINKINPUT -> INITS SEGMENTS FSM COMPSPEC 35;

 INITS     -> INITS INIT;
           -> INIT;

 INIT      -> SETTABLE '=' 21 DIGITS;

 SETTABLE  -> 22 'MAXPSTATE';
           -> 23 'MAXSEGS';
           -> 24 'MAXPRODS';
           -> 25 'MAXTRMLN';
           -> 26 'MAXNAMLN';
           -> 27 'GOTOSIZ';
           -> 28 'ACTSIZ';
           -> 29 'TRMSIZ';
           -> 30 'NTRMSIZ';
           -> 40 'NAMSIZ';
           -> 41 'LITSIZ';

 # SEGMENT SPECIFICATION ;

 SEGMENTS  -> MAJOR MINOR;

 MAJOR     -> 32 'MAJOR:' 1 IDENTIFIER '.';

 MINOR     -> 'MINOR:' IDLIST 3 '.';
           -> ;

 IDLIST    -> IDLIST ',' 2 IDENTIFIER;
           -> 2 IDENTIFIER;

 # COMPILER COMPOSITION ;

 COMPSPEC  -> [illegible] IDENTIFIER 34 ROUTINES LISTING INITIAL;

 ROUTINES  -> ROUTINES ',' ROUTINE;
           -> ROUTINE;

 ROUTINE   -> 4 IDENTIFIER '(' 5 IDENTIFIER ')';

 LISTING   -> 'LISTING:' OPTIONS '.';
           -> ;

 OPTIONS   -> OPTIONS OPTION;
           -> OPTION;

 OPTION    -> 'INDENT' '=' 6 DIGITS;
           -> 'RMARGIN' '=' 7 DIGITS;
           -> 'LMARGIN' '=' 8 DIGITS;
           -> 'PSTAKLN' '=' 9 DIGITS;
           -> 'NAMSIZ' '=' 50 DIGITS;
           -> 'NAMLEN' '=' 51 DIGITS;
           -> 'LITSIZ' '=' 52 DIGITS;
           -> 'LITLEN' '=' 53 DIGITS;
           -> 'NUMSIZ' '=' 54 DIGITS;
           -> 'NUMLEN' '=' 55 DIGITS;
           -> 'FAIL' '=' 10 DIGITS;

 INITIAL   -> 'INIT:' IDECLARE IBODY;

 IDECLARE  -> 11 'DECLARE:'

 #CARD IMAGES OF SOME PL/I DECLARES;

 IBODY     -> 12 'BODY:'

 #CARD IMAGES OF A PL/I INIT ROUTINE;

 # FINITE STATE MACHINE INITIALIZATIONS ;

 FSM       -> 'FSM:' INITSFSM;
           -> ;

 INITSFSM  -> INITSFSM INITFSM;
           -> INITFSM;

 INITFSM   -> FSETTABLE '=' 13 DIGITS;
           -> CHARCONST '=' 14 LITERAL;
           -> 15 'LISTFSM';

 FSETTABLE -> 16 'MXFENTRY';
           -> 17 'MXFSTAT';

 CHARCONST -> 18 'NEWCHARS';
           -> 19 'COMSTART';
           -> 20 'COMEND';
 
 
 SEMANTICS 
 
 
 
 
 
 
 APPENDIX C 
 
 SLR(1) CONSTRUCTION PERFORMED ON THE EXAMPLE SEGMENTS
 
 Augmented Grammars:

 ASSIGN  -> ASSIGN$;
 ASSIGN$ -> NAME '=' EXPR;

 EXPR    -> EXPR$;
 EXPR$   -> EXPR$ '*' EXPR$;
 EXPR$   -> EXPR$ '+' EXPR$;
 EXPR$   -> NAME;
 EXPR$   -> DIGITS;

 NAME    -> NAME$;
 NAME$   -> IDENTIFIER SUBSCR;
 SUBSCR  -> ;
 SUBSCR  -> '(' EXPR ')';
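 The reduce entries in the construction that follows are taken on the
 symbols that may follow the reduced nonterminal. As a minimal sketch
 (not the Genesis implementation), the FOLLOW sets of the EXPR segment
 can be computed as below. The sketch assumes that NAME, which is
 defined in another segment, is treated here as a terminal, and that
 EOF follows the segment's start symbol; the function and variable
 names are illustrative only.

```python
# Sketch: FOLLOW sets for the augmented EXPR segment grammar of Appendix C.
# Assumption: NAME belongs to another segment, so it is a terminal here.
EOF = "EOF"
GRAMMAR = [
    ("EXPR", ["EXPR$"]),
    ("EXPR$", ["EXPR$", "'*'", "EXPR$"]),
    ("EXPR$", ["EXPR$", "'+'", "EXPR$"]),
    ("EXPR$", ["NAME"]),
    ("EXPR$", ["DIGITS"]),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def first_sets(grammar):
    """FIRST sets; valid because this segment has no empty right-hand sides."""
    first = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            sym = rhs[0]
            new = first[sym] if sym in NONTERMINALS else {sym}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


def follow_sets(grammar, start):
    """FOLLOW sets by fixed-point iteration over all productions."""
    first = first_sets(grammar)
    follow = {nt: set() for nt in NONTERMINALS}
    follow[start].add(EOF)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                if i + 1 < len(rhs):
                    nxt = rhs[i + 1]
                    new = first[nxt] if nxt in NONTERMINALS else {nxt}
                else:
                    # Last symbol: inherits whatever may follow the left side.
                    new = follow[lhs]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow


FOLLOW = follow_sets(GRAMMAR, "EXPR")
```

 Under these assumptions FOLLOW["EXPR$"] contains '*', '+' and EOF,
 which agrees with the three reduce entries listed for the items
 [EXPR$ -> NAME .] and [EXPR$ -> DIGITS .] in the EXPR construction.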
 
 
 SLR(1) CONSTRUCTION
 
 ASSIGN

 STATE 1

   [ASSIGN -> . ASSIGN$]
       on ASSIGN$ goto STATE 2
   [ASSIGN$ -> . NAME '=' EXPR]
       on NAME goto STATE 3

 STATE 2

   [ASSIGN -> ASSIGN$ .]
       on EOF REDUCE

 STATE 3

   [ASSIGN$ -> NAME . '=' EXPR]
       on '=' SHIFT and goto STATE 4

 STATE 4

   [ASSIGN$ -> NAME '=' . EXPR]
       on EXPR goto STATE 5

 STATE 5

   [ASSIGN$ -> NAME '=' EXPR .]
       on EOF REDUCE
 
 EXPR

 STATE 1

   [EXPR -> . EXPR$]
       on EXPR$ goto STATE 2
   [EXPR$ -> . EXPR$ '*' EXPR$]
       on EXPR$ goto STATE 2
   [EXPR$ -> . EXPR$ '+' EXPR$]
       on EXPR$ goto STATE 2
   [EXPR$ -> . NAME]
       on NAME goto STATE 3
   [EXPR$ -> . DIGITS]
       on DIGITS SHIFT and goto STATE 4

 STATE 2

   [EXPR -> EXPR$ .]
       on EOF REDUCE
   [EXPR$ -> EXPR$ . '*' EXPR$]
       on '*' SHIFT and goto STATE 5
   [EXPR$ -> EXPR$ . '+' EXPR$]
       on '+' SHIFT and goto STATE 6

 STATE 3

   [EXPR$ -> NAME .]
       on '+' REDUCE
       on '*' REDUCE
       on EOF REDUCE

 STATE 4

   [EXPR$ -> DIGITS .]
       on '+' REDUCE
       on '*' REDUCE
       on EOF REDUCE

 STATE 5

   [EXPR$ -> EXPR$ '*' . EXPR$]
       on EXPR$ goto STATE 7
   [EXPR$ -> . EXPR$ '*' EXPR$]
       on EXPR$ goto STATE 7
   [EXPR$ -> . EXPR$ '+' EXPR$]
       on EXPR$ goto STATE 7
   [EXPR$ -> . NAME]
       on NAME goto STATE 3
   [EXPR$ -> . DIGITS]
       on DIGITS SHIFT and goto STATE 4

 STATE 6

   [EXPR$ -> EXPR$ '+' . EXPR$]
       on EXPR$ goto STATE 8
   [EXPR$ -> . EXPR$ '*' EXPR$]
       on EXPR$ goto STATE 8
   [EXPR$ -> . EXPR$ '+' EXPR$]
       on EXPR$ goto STATE 8
   [EXPR$ -> . NAME]
       on NAME goto STATE 3
   [EXPR$ -> . DIGITS]
       on DIGITS SHIFT and goto STATE 4

 STATE 7

   [EXPR$ -> EXPR$ '*' EXPR$ .]
       on '*' REDUCE
       on '+' REDUCE
       on EOF REDUCE
   [EXPR$ -> EXPR$ . '*' EXPR$]
       on '*' REDUCE
   [EXPR$ -> EXPR$ . '+' EXPR$]
       on '+' REDUCE

 STATE 8

   [EXPR$ -> EXPR$ '+' EXPR$ .]
       on '*' SHIFT and goto STATE 5
       on '+' REDUCE
       on EOF REDUCE
   [EXPR$ -> EXPR$ . '*' EXPR$]
       on '*' SHIFT and goto STATE 5
   [EXPR$ -> EXPR$ . '+' EXPR$]
       on '+' REDUCE
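 The EXPR states above can be exercised with a small table-driven
 shift-reduce driver. The following is a sketch only, not the parser
 generated by Genesis: it assumes that a NAME has already been
 recognized and is delivered as a single token (in Genesis, NAME is a
 separate segment), and the table encoding, the tuple-shaped parse
 trees, and the names ACTION, GOTO, PRODUCTIONS and parse are all
 illustrative.

```python
# Sketch: driver for the EXPR segment, with tables copied from states 1-8.
# Assumption: NAME arrives as one pre-recognized token and is shifted.
SHIFT, REDUCE, ACCEPT = "shift", "reduce", "accept"

# production number -> (left side, right-side length, tree builder)
PRODUCTIONS = {
    1: ("EXPR$", 3, lambda a: ("*", a[0], a[2])),  # EXPR$ -> EXPR$ '*' EXPR$
    2: ("EXPR$", 3, lambda a: ("+", a[0], a[2])),  # EXPR$ -> EXPR$ '+' EXPR$
    3: ("EXPR$", 1, lambda a: a[0]),               # EXPR$ -> NAME
    4: ("EXPR$", 1, lambda a: a[0]),               # EXPR$ -> DIGITS
}

ACTION = {
    (2, "EOF"): (ACCEPT, None),   # [EXPR -> EXPR$ .] on EOF
    (2, "*"): (SHIFT, 5),
    (2, "+"): (SHIFT, 6),
    (8, "*"): (SHIFT, 5),         # '*' binds tighter than '+'
    (8, "+"): (REDUCE, 2),
    (8, "EOF"): (REDUCE, 2),
}
for state in (1, 5, 6):           # states that expect an operand
    ACTION[(state, "NAME")] = (SHIFT, 3)
    ACTION[(state, "DIGITS")] = (SHIFT, 4)
for lookahead in ("+", "*", "EOF"):
    ACTION[(3, lookahead)] = (REDUCE, 3)
    ACTION[(4, lookahead)] = (REDUCE, 4)
    ACTION[(7, lookahead)] = (REDUCE, 1)

GOTO = {(1, "EXPR$"): 2, (5, "EXPR$"): 7, (6, "EXPR$"): 8}


def parse(tokens):
    """Parse (kind, text) pairs; kind is NAME, DIGITS, '*' or '+'."""
    tokens = list(tokens) + [("EOF", "")]
    states, values, pos = [1], [], 0
    while True:
        kind, text = tokens[pos]
        action = ACTION.get((states[-1], kind))
        if action is None:
            raise SyntaxError(f"unexpected {kind} in state {states[-1]}")
        op, arg = action
        if op == SHIFT:
            states.append(arg)
            values.append(text)
            pos += 1
        elif op == REDUCE:
            lhs, length, build = PRODUCTIONS[arg]
            args = values[-length:]
            del values[-length:]
            del states[-length:]
            values.append(build(args))
            states.append(GOTO[(states[-1], lhs)])
        else:
            return values[0]


tree = parse([("NAME", "A"), ("+", "+"), ("DIGITS", "2"),
              ("*", "*"), ("NAME", "B")])
```

 With these tables, the shift on '*' in state 8 groups A + 2 * B as
 A + (2 * B), while the reduce on '+' in states 7 and 8 makes both
 operators group to the left.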
 
 
 
 NAME

 STATE 1

   [NAME -> . NAME$]
       on NAME$ goto STATE 2
   [NAME$ -> . IDENTIFIER SUBSCR]
       on IDENTIFIER SHIFT and goto STATE 3

 STATE 2

   [NAME -> NAME$ .]
       on EOF REDUCE

 STATE 3

   [NAME$ -> IDENTIFIER . SUBSCR]
       on SUBSCR goto STATE 4
   [SUBSCR -> .]
       on EOF REDUCE
   [SUBSCR -> . '(' EXPR ')']
       on '(' SHIFT and goto STATE 5

 STATE 4

   [NAME$ -> IDENTIFIER SUBSCR .]
       on EOF REDUCE

 STATE 5

   [SUBSCR -> '(' . EXPR ')']
       on EXPR goto STATE 6

 STATE 6

   [SUBSCR -> '(' EXPR . ')']
       on ')' SHIFT and goto STATE 7

 STATE 7

   [SUBSCR -> '(' EXPR ')' .]
       on EOF REDUCE
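 The item sets listed for the NAME segment can be reproduced with the
 usual closure and goto operations on LR(0) items. The sketch below
 is an illustration under stated assumptions rather than the Segment
 Generator's own code: items are encoded as (production index, dot
 position) pairs, and EXPR, which is defined in another segment, is
 assumed to be left unexpanded, as in the listing above.

```python
# Sketch: LR(0) closure and goto for the augmented NAME segment grammar.
# Assumption: EXPR belongs to another segment, so it is never expanded.
GRAMMAR = [
    ("NAME", ("NAME$",)),
    ("NAME$", ("IDENTIFIER", "SUBSCR")),
    ("SUBSCR", ()),
    ("SUBSCR", ("'('", "EXPR", "')'")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def closure(items):
    """Add [B -> . rhs] for every nonterminal B that follows a dot."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for i, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (i, 0) not in result:
                    result.add((i, 0))
                    work.append((i, 0))
    return result


def goto(items, symbol):
    """Move the dot over symbol wherever possible, then close the set."""
    moved = {(p, d + 1) for p, d in items
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(moved)


def show(item):
    """Render an item in the bracketed form used in the listing."""
    prod, dot = item
    lhs, rhs = GRAMMAR[prod]
    return f"[{lhs} -> {' '.join(rhs[:dot] + ('.',) + rhs[dot:])}]"


state1 = closure({(0, 0)})
state3 = goto(state1, "IDENTIFIER")
state5 = goto(state3, "'('")
```

 Here state3 holds the three items shown for STATE 3, and state5 holds
 only [SUBSCR -> '(' . EXPR ')'], since no EXPR productions are
 available within this segment.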
 
 BIBLIOGRAPHIC DATA SHEET

 1. Report No.

 UIUCDCS-R-77-854
 
 3. Recipient's Accession No. 
 
 5. Report Date 
 
 March 1977
 
 4. Title and Subtitle
 
 Genesis — A Compiler Generator Using Language Segmentation 
 
 6. 
 
 7. Author(s)
 
 Jay Hoeflinger 
 
 8. Performing Organization Rept. No.

 9. Performing Organization Name and Address

 Department of Computer Science

 University of Illinois at Urbana-Champaign
 
 Urbana, IL 61801 
 
 10. Project/Task/Work Unit No. 
 
 11. Contract /Grant No. 
 
 12. Sponsoring Organization Name and Address

 13. Type of Report & Period Covered
 
 14. 
 
 15. Supplementary Notes

 16. Abstracts
 
 This thesis describes the Genesis compiler generator system, which 
 allows the generation and linkage of individual language segments. Language 
 segments are described and the concept of segmentation parsing is developed 
 using a model automaton. The implementation of segmentation parsing within 
 Genesis is discussed. The various features of Genesis are also presented, 
 including its self-generative capability. 
 
 17. Key Words and Document Analysis. 17a. Descriptors
 
 compiler generators 
 
 segmentation 
 
 SLR(l) 
 
 self-generative 
 
 expression disambiguation 
 
 17b. Identifiers/Open-Ended Terms

 17c. COSATI Field/Group

 18. Availability Statement

 19. Security Class (This Report)

 UNCLASSIFIED

 20. Security Class (This Page)

 UNCLASSIFIED

 21. No. of Pages

 22. Price

 FORM NTIS-35 (10-70)

 USCOMM-DC 40329-P71