UIUCDCS-R-76-834

IMPLEMENTATION OF THE LANGUAGE CLEOPATRA: THE ANALYSIS PASS

by Scott Harley Fisher

October 1976

THESIS

Scott Harley Fisher, B.S., University of Illinois, 1972

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1976

Urbana, Illinois

Acknowledgement

As I write this acknowledgement and think back over the long months of work that went into this project, it is difficult to name the individuals who have influenced this work. To any deserving but unnamed individuals, I express my regrets for the omission.

I would like to thank my thesis advisor Dr. H. George Friedman, Jr. for suggesting this project, reading this thesis, and supplying the proper environment for the work contained herein.

When embarking on a new field, it is always beneficial to study the previous literature on the subject. In my case, the "previous literature" was my Director of Research, but more than that, my "previous literature" was embodied in an enthusiastic pedagogue and good friend, Dr. Axel T. Schreiner. Axel was tremendously helpful, and always willing to discuss difficult points. He also displayed confidence in my abilities and in the project even when confidence seemed out of order. It seems inadequate, but thanks, Axel.

I also extend a special thanks to my family. Although they do not understand the esoterica involved in the analysis pass of a compiler, they were always confident in and supportive of my work. It is in large part through their efforts that I have attained this goal. For that, I sincerely thank them all.
Lastly, I wish to express my thanks to two fellow graduate students and good friends, John Hodry and John Bowman, for helpful suggestions and technical advice on various aspects of this project.

It is to these people that I dedicate this thesis.

S. H. F.

Table of Contents

1.0 Introduction
2.0 Accessibility and Scope
    2.1 The Configuration
    2.2 Local vs. Global Scope
        2.2.1 Global Scope
        2.2.2 Local Scope
3.0 The Subset Language
    3.1 Structure Blocks
    3.2 Data Blocks
    3.3 Routine Blocks
4.0 Implementation of the Subset
    4.1 Implementation Rules for Structure Blocks
    4.2 Implementation Rules for Data Blocks
    4.3 Implementation Rules for Routine Blocks
    4.4 Statements
5.0 Further Considerations
    5.1 Pointer Variables
    5.2 Presentation of Blocks for Compilation
6.0 Structure of the Compiler
    6.1 Compiler Parameters
    6.2 Symbol Table Complex
        6.2.1 Name Recognition
        6.2.2 Type Analysis
        6.2.3 Level/Configuration Table
        6.2.4 Configuration Table
        6.2.5 Type Table
    6.3 Lexical Analysis
    6.4 Syntactic Analysis
    6.5 Semantic Analysis
7.0 Conclusions
References

1.0 Introduction

Programming languages come in two basic flavours: the easy to use, easy to write variety, and the more complicated but more powerful variety. Both have important applications and, as evidenced by sheer numbers, are quite viable. The former type, as exemplified by BASIC, is adequate for the "simple" program where intricate data structuring is not of primary concern, or for the novice programmer. The latter, typified by PL/I, allows more complex data bases and more powerful manipulations. It becomes the task of the programmer to select the language suitable for the particular project. CLEOPATRA (Comprehensive Language for Elegant OPerAting system and TRAnslator design) is a member of the latter class.
CLEOPATRA can be characterized by a rather detailed program text, a very modular logical structure, and powerful manipulative abilities. The language specifications as well as a discussion of the implications have been presented by Axel T. Schreiner in [1] and [2]. This thesis presents the initial implementation of the analysis pass of a CLEOPATRA compiler. (Certain extensions and restrictions have been imposed on the original language.)

The obvious question that arises when a new language is presented is, "Why another language?". This is indeed a valid question, and a language, to be a worthy and viable tool, must be able to address it. In the present case, the objectives have been to produce a compiler for a language that:

1) Allows user-defined data types;
2) Can act as a laboratory for new features such as decision tables and powerful control structures that facilitate the design of programs.

Within this framework, the user is allowed a certain amount of flexibility to produce a well structured and concise program for compilation. Further, in an effort to produce a "secure" program, full type checking is enforced.

For the compiler, a certain level of complexity is introduced since opening and closing of "blocks" or levels of scope may be more frequent. It is also necessary to maintain more complex data bases to provide these special services.

2.0 Accessibility and Scope

In days of old (relatively speaking!), a program "owned" the computer for the duration of its execution. As a result, the computer and all of its facilities became a servant to the user program. Nowadays, of course, the converse is true: the computer runs the program. This being the case, the program and in particular its contents have a life cycle. In a non-procedural language, the whole program is active for the duration of execution. In the more sophisticated languages (e.g.
ALGOL), only portions (often called procedures or blocks) are active at any given time. In CLEOPATRA, this basic unit is the configuration.

All scope rules revolve around the configuration. However, not all elements of a configuration have identical scopes. The scope of an element is a function of its placement in a local or global block. This relationship between global and local elements will be brought forth after an examination of the basic unit in CLEOPATRA, the configuration.

2.1 The Configuration

The structure of the source program written in any procedure oriented language can be described by a tree in the graph theoretic sense. That is, every procedure has a father; the father of the main procedure might be considered to be the run-time environment. It is from this tree that ALGOL-style languages develop the scope or accessibility rules.

CLEOPATRA also uses this tree structure concept. Each node of the resulting program tree is called a configuration. (At this stage, a configuration may be thought of as a procedure in the ALGOL or PASCAL sense. This definition is incomplete but sufficient for the moment.)

To the compiler, the underlying concept of the configuration is the configuration's position in the configuration tree. Each node in the tree is given a unique number. The crucial point is that each node in the tree defines the configuration's relation to the program as a whole. To the user, a configuration is denoted by a configuration name.

In most programming languages, the static program tree, corresponding to the configuration tree, is determined and constructed from the physical nesting of the source program. That is, one procedure is a descendant of another if the former's statements are physically placed within the latter's statements. In CLEOPATRA, the tree is determined by logical nesting rather than physical nesting. This is accomplished via the CLEOPATRA structure block.
The structure block defines the accessibility of elements in the tree defined from the present configuration (a node of the tree) toward its descendants.

To this point, the term used to describe the program's logical structure has been the static program tree. This is the tree that is constructed at compile-time and represents the true logical nesting of the source program. In contrast to the static program tree is the dynamic run-time tree. The dynamic tree is composed of the physically growing and shrinking program in memory. At run-time, configurations are allocated space in memory for their dynamic variables and required return addresses. This space, which grows upon entry to a configuration and shrinks upon exit, is a constituent of the dynamic program tree. The run-time environment is determined by the code generator, and is described in [3]. (The reader will find a good discussion of this topic as well as an excellent reference to compiler principles in [5] and [6].)

2.2 Local vs. Global Scope

To this point, the static program tree has been created. In a language such as ALGOL, the scope rules would now be automatically defined. This is not the case in CLEOPATRA. There are two basic scope rules rather than one. These can be explained within the framework of the static program tree, and the context of the current configuration tree.

The current configuration tree is defined as the currently active configuration plus all descendants of that configuration in the static program tree. In the case of the initial configuration (similar to the main procedure in PL/I) the current configuration tree is the whole program. In the case of a configuration at a maximum nesting level, the current configuration tree is the configuration itself.

2.2.1 Global Scope

The scope rule for global elements is the same as that of ALGOL. A global element will potentially be active and thus available throughout the current configuration tree.
However, a global element can become inactive in two ways, each producing different results. If a local element of the same name as the global element is declared at a deeper nesting level, the global element remains known throughout the current configuration tree except in the configuration in which the local element is defined. In the second case, if another global element of the same name as the first global element is defined at a deeper nesting level, the first element is known in its current configuration tree until the configuration in which the second element is declared. At this point, the second global element becomes active for its current configuration tree or until its scope is altered by one of these two situations.

An element is made global by being defined in a global data block or a type data block, or by means of a global structure block. The global structure block alters the "globalness" slightly. This alteration will be discussed in due time. For now, it can be stated that the general global scope implies activation throughout the current configuration tree unless the name is redefined.

2.2.2 Local Scope

The concept of the local scope is rather unique to CLEOPATRA. Further, its scope is easy to define. A local element is active only for the configuration in which it is defined. Thus, a local element is not known by any descendant of that configuration.

The purpose of such local scope is for scratch variables, counters, and the like. This scope concept helps contribute to program structure, since scratch variables should not be made global.

An element is made local by declaration in a local data block. The analysis pass detects local and global elements and indicates the type for the code generator. The code generator then allocates the local element upon invocation of its configuration and deallocates it upon exit.

3.0 The Subset Language

To this point, the term configuration has been defined as a procedure or a node in the static program tree.
A configuration is indeed those things, but in a more precise sense, a configuration is a collection of statements defining an accessibility sequence, the data availability, and the actions to be taken on these data.

From this definition, it can be concisely stated from what a configuration is formed. A configuration is composed of a routine block and possibly a combination of a structure and/or a data block. In the context of the previous definition, the following correspondences are formed:

CONCEPTUAL BLOCK   ACTUAL BLOCK                      USE
Structural         Structure and global structure    Accessibility sequences
Data               Local data, global data           Definition and scope of data
Routine            Procedure and operator block      Executable statements

3.1 Structure Blocks

As alluded to previously, the structure block defines the logical nesting of the program's configurations by constructing the static program tree. The scope of a structure block is the current configuration tree. This implies that procedures are logically nested in the manner of ALGOL procedures.

As indicated, the global structure block modifies slightly the global scope rule. A global structure block is associated with a user-defined data type. The global structure block "pulls" the definition of the type to the same level as the structure defining the user-defined type. That is, a user-defined type is declared in a structure block, which is a node in the static program tree. The global structure block pulls the nesting level in the tree to the same level as the defining configuration.

3.2 Data Blocks

Data blocks are used to define data items available to a configuration. A data block is not required for a configuration if the corresponding configuration does not need to possess any new variables. A configuration does own data items declared global in a predecessor configuration. However, good coding practice favours elimination of scratch global variables.
Local data blocks are used to create the scratch variables needed by a configuration. As previously stated, an identifier may be declared in the current configuration with the same name as a previously declared global identifier. In this case, the current identifier is the active element.

3.3 Routine Blocks

The last type of block is the routine block. To this point, the input has established the symbol table and configuration table with the proper declarations and calling sequences. That having been completed, the executable statements can be parsed.

The routine block is the only block which must be present in a configuration. This is the case because a "procedure" may use global data and may not need to possess nested configurations, but must contain executable statements. Two types of routine blocks exist for this purpose: the procedure block and the operator block.

The procedure block is used to describe the user's algorithm in the CLEOPATRA language. The operator block defines the actions to be taken by a user-defined operator.

4.0 Implementation of the Subset

As previously stated, this thesis is the presentation of one aspect of the subset language: the analysis of input through the generation of intermediate text. The purpose of the current implementation is not only to produce a working compiler, but also to determine the types of algorithms required to implement the language constructs.

The subset has been defined in [4]. The subset specifications will not be repeated here except in those areas where changes have been made. Though the code generator was complete before the analysis pass was begun, changes are still possible so long as the code generator receives the required input. In some cases, restrictions have been placed on input, but in other cases extensions have been added.

4.1 Implementation Rules for Structure Blocks

In [4], all configuration names were required to be unique. This was necessary for the linkage editor being used.
This restriction has been removed by the analysis pass. The analysis pass requires uniqueness within any given level in the configuration tree; however, duplication is allowed between levels. This extension was made so that the user can apply the same scope rules to all names presented for compilation. In fact, the analysis pass forms the unique names for the user and passes them to the code generator. This is done by concatenating "CLEO" with a three digit configuration number (which, as will be recalled, is unique).

There are two cases where this transformation does not occur. These are for the procedures "INPUT" and "OUTPUT". The code generator uses the input and output routines supplied by the linkage editor. It is, therefore, imperative that the linkage editor actually gets the names "INPUT" and "OUTPUT".

This conversion mechanism also allows another extension. Formerly, configuration names were restricted to seven characters. By the above method, there is no restriction on the length of these names. The name table constructed by the analysis pass indicates both the configuration name given by the user, and the configuration name passed to the code generator. It should also be noted that the linkage editor map will show the converted names (i.e. the unique names constructed by the analysis pass) and not the user's configuration names. The correspondence can be found by using the name table, whose output is a default option of the analysis pass.

In constructing the static program tree, the analysis pass accepts the first configuration name as the root of the tree (i.e. as the main program in the PL/I sense). This first configuration presented for compilation then is not predefined, but after that point the declare-before-use rule is strictly enforced for structure blocks. The implication is that no configuration may be presented before it is given in a structure block.
This is not solely to enforce the declaration rule: the analysis pass has no way of determining where a configuration fits into the configuration tree if its nesting has not been declared along with that of its predecessors. This does not contrast with other languages, since physical nesting implies the structure in other block structured languages.

An important distinction must be made: a structure block "Y" does not define configuration "Y". Rather, some structure block "X" defines "Y" where "X" is the ancestor of "Y", i.e. "X" owns "Y".

b) ITERATE, with options FOR, WHILE, WHEN and EXIT.

c) DECISION which allows selective execution of statements.

A key difference between CLEOPATRA and most other languages is that expressions are evaluated from right to left. (This is also the method employed in APL.) Further, there is no operator precedence (parenthesization forces precedence for the parenthesized quantity).

CLEOPATRA allows the user to define operators, as does ALGOL W. However, ALGOL W maintains operator precedence by requiring the user to assign a precedence number to each user-defined operator.

User-defined operators may also have parameter lists similar to those of a procedure, except that the parameter lists may exist on the right and/or the left side of the operator. In fact, the code generator only accepts right-side parameters. Therefore, the analysis pass converts all left parameters into right parameters.

The syntax for parameter lists to operators has been changed to remove an ambiguity in the use of operators with parameters. Formerly, parameters to operators were denoted by placing the parameters in parentheses. The problem arose that if a binary and a unary operator, each with parameters, were placed next to each other, the analysis pass would not be able to resolve the question of which parameter list belonged to which operator.
This is because of the ambiguity and the fact that the operators in question may have unary and binary components. Thus parameters to operators must be denoted by placing an apostrophe ( ' ) on the side of the parameter list closest to the operator.

The last restriction on expressions is that there may be no more than 50 elements in an expression. This is because the code generator places a limit on the length of an expression in intermediate text form.

One change has been made to the language specifications of the FOR statement. Previously [4], the following would be a legal albeit meaningless statement:

FOR identifier ; ; ;

Although most programmers would refrain from such a nebulous statement, the analysis pass would have to recognize the construction and the code generator would have to generate code for it. The meaning is certainly questionable, as would be the resultant code. Therefore, the following FOR statement has been implemented:

FOR Statement ::= FOR identifier [FROM expression] STEP expression [[;] UPTO | DOWNTO expression] ;

In the absence of the FROM clause, the initial value of the identifier used as the index is its present value. If the UPTO | DOWNTO expression option is not used, the effect is to increment the index and continue the loop indefinitely. This should only be used in conjunction with a WHEN, WHILE, or EXIT statement.

This form of the FOR statement is as flexible and powerful as the code generator, via the intermediate text, will accept. The first proposed revision was similar to the version implemented; however, it would have made the STEP expression phrase optional when using UPTO or DOWNTO expression:

FOR Statement ::= FOR identifier [FROM expression] [STEP expression] [[;] UPTO | DOWNTO expression] ;

The default when STEP is omitted would be one. This is the optimal solution, but because of the requirements of the intermediate text, it was not possible to implement this form.
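The index behaviour of the implemented FOR statement can be illustrated with a small Python sketch. This is not the code generator's algorithm: the function name `for_index_values` and its parameters are hypothetical, the bound is assumed to be inclusive, and a DOWNTO loop is assumed to be written with a negative STEP value. Only the two rules stated above (no FROM means start at the present value; no UPTO/DOWNTO means loop indefinitely) come from the text.

```python
from itertools import islice


def for_index_values(current, step, start=None, upto=None, downto=None):
    """Yield successive index values of a FOR loop (illustrative only).

    current -- present value of the index identifier (used when FROM is absent)
    step    -- value of the mandatory STEP expression
    start   -- value of the optional FROM expression
    upto / downto -- optional bound; assumed inclusive in this sketch
    """
    if upto is not None and downto is not None:
        raise ValueError("only one of UPTO and DOWNTO may be given")
    # Without FROM, the index starts at its present value.
    value = current if start is None else start
    while True:
        if upto is not None and value > upto:
            return
        if downto is not None and value < downto:
            return
        yield value
        # Without a bound this loop never ends, which is why the text
        # requires a WHEN, WHILE, or EXIT to accompany such a loop.
        value += step


# Bounded loop: FOR I FROM 1 STEP 2 UPTO 7
print(list(for_index_values(0, 2, start=1, upto=7)))
# Unbounded loop, cut off externally (standing in for EXIT)
print(list(islice(for_index_values(5, 1), 3)))
```

The unbounded case is cut off with `islice` here only because Python needs some external stop; in CLEOPATRA that role is played by the WHEN, WHILE, or EXIT options of ITERATE.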
All keywords in the subset have been reserved. A list of these reserved words along with their symbol table entry numbers can be found in Figure 1. Figure 2 gives a list of the built-in operators provided by the subset. These may be redefined in an operator block and thus are not reserved. The code generator requires that there be no more than 1500 symbol table entries in addition to the list of reserved words and operators given.

Although 94 keywords are given in the figures, there are actually 122 keywords. The discrepancy lies in the fact that most of the operators have more than one entry (e.g. 66 and 69). The operators, though apparently equivalent, are not, because they differ in arity or operate on different operand types. The negative sign in entry 66 is a unary negative and operates on integers. The negative sign in entry 69 is a binary negative and operates on integers. This same situation applies to most operators. The effect is invisible to the user, however, since a semantic analysis routine is called during the parsing of expressions to determine the semantically correct operator.
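The selection of the semantically correct operator entry can be sketched as a table lookup keyed on the operator symbol and its operand types. The sketch below is an assumption about how such a routine might look, not the thesis's implementation: the names `OPERATOR_ENTRIES` and `resolve_operator` are hypothetical, and the table holds only the two entries the text actually describes (66 and 69).

```python
# Hypothetical lookup table: (symbol, operand types) -> symbol table entry.
# Only entries 66 and 69 are taken from the text; a real table would hold
# one row for every variant of every built-in operator.
OPERATOR_ENTRIES = {
    ("-", ("integer",)): 66,            # unary negative on an integer
    ("-", ("integer", "integer")): 69,  # binary negative on two integers
}


def resolve_operator(symbol, operand_types):
    """Return the entry number matching the operator and its operand types."""
    key = (symbol, tuple(operand_types))
    try:
        return OPERATOR_ENTRIES[key]
    except KeyError:
        raise TypeError(
            f"no operator {symbol!r} for operand types {list(operand_types)}"
        ) from None


print(resolve_operator("-", ["integer"]))             # unary form
print(resolve_operator("-", ["integer", "integer"]))  # binary form
```

Keying on the full tuple of operand types means the same lookup performs both the arity distinction and the type check, which is in the spirit of the full type checking the language enforces.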
Entry   Symbol          Entry   Symbol
Number                  Number
  1     ACTION            2     ALLOCATE
  3     BEGIN             4     DECISION
  5     DOWNTO            6     ELSE
  7     END               8     EXIT
  9     FOR              10     FROM
 11     IF               12     ITERATE
 13     NIL              14     OPERATOR
 15     PROCEDURE        16     RELEASE
 17     RETURN           18     STEP
 19     THEN             20     WHEN
 21     UPTO             22     WHILE
 23     ADDRESS          24     ALIAS
 25     B                26     BIT
 27     BUILT            28     BY
 29     C                30     CHARACTER
 31     COMMENT          32     COMPILE

Figure 1a Reserved Words

Entry   Symbol          Entry   Symbol
Number                  Number
 33     CONSTANT         34     DATA
 35     DEFER            36     EXTENTS
 37     F                38     GLOBAL
 39     IN               40     INIT
 41     INTEGER          42     INTO
 43     LONGINTEGER      44     POINTER
 45     RETURNS          46     RIGHT
 47     S                48     TO
 49     VECTOR           50     TYPE
 51     X                52     STRUCTURE
 53     FALSE            54     FIRST
 55     LARGE            56     LAST
 57     SMALL            58     TRUE

Figure 1b Reserved Words (continued)

Entry   Symbol          Entry   Symbol
Number                  Number
 66     -                67     ABS
 68     +                69     -
 70     *                71     //
 72     **               73     MOD
 74     AND              75     LBOUND
 76     OR               77     -»
 78     ==               79     =
 80     >                81     <
 82     >=               83     <=
 84     •                85
 86     LENGTH           87     HBOUND
 88     ->               89     <-
 90     "->              91     ?->
 92     ||               93     LINT
 94     CHAR

Figure 2 Built-in Operators

5.0 Further Considerations

During the implementation of the analysis pass, several problems arose. In most cases, by some modification, they were overcome. However, two major problems have persisted. In both cases, implementation was attempted within the constraints of the language, facilities, and code generator. Unfortunately, the attempts met with limited success. Valuable information was gleaned from this process and is presented below so that future efforts might profit from the results thus far.

5.1 Pointer Variables

A major dilemma arose in the implementation of pointer variables. The BNF (Backus Naur Form) of the productions causing the problem is:

Basic_Ref_Type ... Pointer (ref_type)
Ref_Type ::= Basic_Ref_Type .. &

Parameter       Action                                        Default
BLK_DUMP        Output block tables.                          Yes
CONFIG_DUMP     Output configuration tables.                  Yes
DEBUG_CRD       Output a trace of the compiler along with     30000
                selected values beginning at this card
                number. To be used only on compiler error
                as it generates considerable output.
DEBUG_PROC      Same as above. It may be enabled either       No
                initially or by the above card number.

Figure 4a Compiler Parameters

Parameter       Action                                        Default
1ST TXT         Output a listing of the intermediate text.    No
                This requires a "//GO.ITEXT DD SYSOUT=A"
                card if the catalogued procedure is not
                used.
LINECT          Number of source cards to be listed on a      58
                page.
NAM_TAB_DUMP    Output the name table.                        Yes
OPEN_TST        Output block tables, configuration tables,    No
                and selected values upon entering a new
                block.
SOURCE          Output a source listing.                      Yes
SYM_TAB_DUMP    Output the symbol table.                      Yes
TYP_TAB_DUMP    Output the type tables.                       Yes

Figure 4b Compiler Parameters (continued)

Since many large tables are required by the analysis pass, compilation began to require large amounts of main memory. Therefore, the analysis pass has been overlayed to reduce this requirement. The overhead in swapping the overlays is minimal compared to the saving in space.

The analysis pass was originally designed to have expandable tables (symbol table, block table, etc.) for increased generality. Unfortunately, this feature had to be removed. As the amount of code increased, the cost of recompilation of the analysis pass increased prohibitively. Therefore, other methods (external procedures in PL/I, and use of load modules) which do not allow dynamic bounds on arrays were employed. This saved on the cost of implementation, but forced the use of statically bounded tables. In fact, this is not too limiting since the code generator has a fixed bound on the size of the symbol table.

6.2 Symbol Table Complex

The Symbol Table Complex is the major data base for the compiler. This is due not only to its information content, but also to its physical size and the time spent in its manipulations. This size is in part due to the power of the compiler. The symbol table complex is used by all phases, although not all phases may change an entry.
The symbol table complex is composed of five basic modules:

1) Name Recognition
2) Type Analysis
3) Level/Configuration Table
4) Configuration Table
5) Type Table

In the following sections each table will be presented for an overview of its function.

6.2.1 Name Recognition

The Name Recognition Table contains the actual character representation of the input tokens. This table consists of a linear string of all tokens concatenated together. The table also contains a vector of three pointers for each symbol table entry. The first is a pointer to the start of the token in the string. The second is a length count. The last is a link of the like-names (like but not equivalent). As previously mentioned, two similarly named symbols may exist provided they have different scopes. However, only one entry is kept in the name table in order to save space.

Name recognition is the first action taken by the symbol table manager in searching for an entry. The search technique employed is a hash table with chaining for duplicates. After some experimentation a suitable hash function was chosen and mapped into 256 hashing buckets. When two different tokens hash to the same bucket, the first is inserted into the table in the usual fashion, but the second is chained by a link from the first.

6.2.2 Type Analysis

The Type Analysis table contains the type representation and parts of the scope values for symbols. As discussed above, the lexical analysis phase inserts values into the type analysis table. The fields and their values are given below:

Field       Function
Type        Type of symbol or returned type:
                 1 .. Long_integer
                 2 .. Character
                 3 .. Error
                 7 .. Pointer
                 8 .. Bit
                 9 .. User Defined
                13 .. Label
                15 .. Integer
Constant    Set if constant data.
Array       Set if array.
Literal     Set if self-defining constant.
Initial     Set if symbol is initialized.
Local       Set if local symbol.
Defer       Set if deferred storage.
Formal      Set if formal parameter.
Alias       Set if alias name.
In_type     Set if part of user-defined type.
Link        Set if link item (e.g. configuration name).
Entry       Set if an entry item, with the following types:
                01 .. if procedure
                10 .. if unary operator
                11 .. if binary operator
Ptr         Type of symbol pointed to (not implemented).
Aptr        Set if points to array (not implemented).
Unused      Analysis pass sets bit number 1 if the identifier is a data
            item, and 2 if it is a read-only data item.
Blk         Configuration number of surrounding configuration.
Blklevel    Nesting level of the configuration if the item is a
            configuration name, or the nesting level of the surrounding
            configuration if it is not a configuration.
Atr1        Depends on the item.
Atr2        Depends on the item.

Most of the entries are self-explanatory from the context of use. Those not clear are expanded upon below.

Entry Type            Special Field Usage
Configuration name    Reserve two symbol table rows. Set type returned in
                      TYPE of row two. Link bit is set in rows one and two.
                      Entry bit is set in row two. Atr1 in row one contains
                      the configuration number. Atr2 in row one points to
                      the configuration name in the constant table. Atr1 in
                      row two contains the number of parameters to the
                      procedure/operator. Atr2 in row two contains a pointer
                      to the position of the entry pointed to if the
                      procedure returns a pointer (not used).
Data item             Atr1 is the maximum length if the item is of type
                      character. Otherwise Atr1 is the configuration number
                      of the surrounding configuration. Atr2 points to the
                      row of the value to which the data item is initialized
                      if the item has the initial attribute.
Character             Length in Atr1. If initialized, Atr2 points to the
                      initial value in the constant table.
Bit, Integer,         The value is in Atr2 unless long_integer, in which
Long_integer,         case the value is overlayed in Atr1 and Atr2.
Constant
Alias                 Set alias bit and link if necessary. Atr2 points to
                      the major name.
User-defined types    Blk set to zero for elements and in_type bit is set.

The type analysis table is one of two tables passed to the code generator. The other is the constant table. The constant table contains the following entries:

1) Seven character unique configuration names as described previously.
2) The number of extents and bounds for arrays.
3) The value of character literals.

Both tables are passed to the code generator in the intermediate text file. On entry to a routine block, the type analysis table entries from the last element transmitted through to the current top of the table and the whole constant table are placed in the intermediate text file. (The last element of the type analysis table is denoted by eleven ones in the unused field.) This is followed by the intermediate text for the routine.

6.2.3 Level/Configuration Table

The Level/Configuration table contains the information about the activation of symbols. One element of the table is a field that links together all symbols of the same configuration. This field is used for activation of symbols upon entry to a new block. Other fields are used to form a chain of entries and their predecessors.

6.2.4 Configuration Table

The Configuration Table is a presentation of the static program tree. It consists of a list of configuration numbers along with their immediate predecessors. There is also a pointer to a symbol number in the level/configuration table which thus links all elements of the same configuration. This table is built mainly from the structure blocks.

6.2.5 Type Table

The Type Table holds the attributes of parameters to operators and procedures. When operators and procedures are declared, only the type of the parameter is given. It is at the point when the operator or procedure is called that the actual identifier is found. Parameters can be of any available type, can be arrays, and, in the case of operators, can be left or right parameters.
There is no specific bit in the type table to denote an array entry. This is done indirectly by recording the number of extents for the entry. The number of extents is always zero unless the item is an array, in which case it has a positive value. The semantic processor requires this information in type checking.

6.3 Lexical Analysis

As shown in Figure 3, the Lexical Analysis phase is the first phase encountered by the input. The analysis pass runs as a single pass, so there is an interaction among the phases.

The lexical analysis phase performs the most rudimentary, albeit important, work on the input. First, the input stream is tokenized according to the productions in the language specification. Comments and blanks are eliminated. (See Figure 3.) The scanner is able to recognize certain types and resolve them: bit values, integers, and long_integers. Where the scanner is able to determine the type (due to the syntax), the type may be inserted into the symbol table. When the scanner is not able to make this determination, it is still able to reduce the possibilities. It thus inserts an unresolved type.

The scanner calls upon a routine of the Symbol Table Manager to look up the symbol in the symbol table. Depending on the context, another routine can be called to insert the token into the Symbol Table Complex. Other utilities include reading routines and a routine to convert radixes, since CLEOPATRA accepts integer input in binary, decimal, or hexadecimal. There is also a routine to convert characters to their integer representation, and integers to their character representation.

The Lexical Analysis phase is driven by the Syntax Analyzer. The Syntax Analyzer "knows" what to look for and the context in which it resides. It thus knows whether it is in a declarative context or merely looking for a defined token.
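The radix conversion utility mentioned above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the routine from the analysis pass. How CLEOPATRA marks the radix of a literal in source text is not stated here, so the base is simply passed as an argument; the name convert_radix is hypothetical.

```python
DIGITS = "0123456789ABCDEF"


def convert_radix(text: str, base: int) -> int:
    """Convert a digit string in base 2, 10, or 16 to an integer."""
    if base not in (2, 10, 16):
        raise ValueError(f"unsupported radix: {base}")
    if not text:
        raise ValueError("empty literal")
    value = 0
    for ch in text.upper():
        digit = DIGITS.find(ch)          # -1 if ch is not a digit at all
        if digit < 0 or digit >= base:
            raise ValueError(f"digit {ch!r} is not valid in base {base}")
        value = value * base + digit     # shift left one place, add the digit
    return value
```

For example, convert_radix("1011", 2) yields 11, and convert_radix("FF", 16) yields 255. A digit that does not belong to the stated base raises an error rather than being silently accepted, which is the behaviour a scanner would want when reporting a malformed literal.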
6.4 Syntactic Analysis

If one phase of the compiler could be considered the driver for the analysis pass, the Syntactic Analysis phase could qualify. It is the syntax of the language upon which the parse is based.

The syntax analyzer is broken (and overlaid) into five major modules. These modules follow the same lines as the block structure: Local Data, Global Data, Local Structure, Global Structure, and Type Data. The basic flow of the syntax analyzer is: While source text exists, the main program determines the type of block that is being presented, and calls the proper block processor. The block processors are not completely self-contained; they all share certain modules such as the error handler. However, it is clear that once parsing begins on one type of block, the other block parsers are unnecessary. The selected processor then continues parsing until the block terminates. At that point, control returns to the main program for selection of the next block.

Several different parsing methods were considered for the current implementation. Upon examination of the BNF specifications, it was found that in almost all cases, the parsing machinery could determine "where it was" in the parse by looking at the current symbol. In the worst case, one symbol look-ahead was necessary. For this reason, a recursive descent parser was chosen for all elements in the language. Although this is not as fast as an LR parser, implementation was quicker and did not require the calculation of the parsing tables.

Error correction is handled on a "need to know" basis. The parser continues parsing after an error is encountered until it is unable to determine where to continue in the parse. When this occurs, the parser flushes the input until it is able to determine where to continue. In the worst case, this means flushing the current statement, which may be a compound statement (e.g. a FOR loop). In other cases only part of a statement is flushed.
In any case, one or more error messages will be output. If a compiler error occurs, at least one error message is printed (which would likely be the cause of the severe error), and all tables are printed along with selected compiler variable values.

An important function of the syntax analyzer is to insert values into the symbol table complex. It is at this point that the types of symbols are resolved. Recall that the lexical analysis phase inserted a type which was usually unresolved. The syntax analyzer is able to determine the types that the lexical analyzer is not able to distinguish.

6.5 Semantic Analysis

It has been stated on several occasions that CLEOPATRA is a very type conscious language. That is, mixed mode operations are not allowed except by user-defined operators designed for this purpose. It is the task of the semantic processing modules to monitor all types and their usage.

The primary semantic processor is associated with the expression parser. Each operand is pushed onto a type stack. As the associated operator arrives, the type of the top entry (or entries for binary operators) is compared to the type of the operator. If the types match, the parse continues. If the types are unmatched, the semantic processor searches the symbol table via the configuration link to find another entry for the operator which has the proper types. If the proper type is not found, an error message is emitted and the call to the code generator is inhibited by setting the condition code.

Another important aspect of semantics in CLEOPATRA is the analysis of parameter lists. The full connection is a three-way association: The declaration of a procedure or operator specifies the attributes of the parameters. The data block for the respective block must declare a variable of each of the attributes to serve as a target for the parameter. At this point no parameter checking is done, however, since the positional association has not yet been made.
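As an aside on the expression check performed by the primary semantic processor, the type stack mechanism can be sketched as follows. This is an illustration in Python under several assumptions and does not reproduce the thesis code: the candidate entries that the compiler reaches through the configuration link are modeled as a plain list per operator name, the result type pushed after a successful match and the "error" placeholder are inventions of this sketch, and the inhibit_codegen flag stands in for the condition code.

```python
from typing import Dict, List, Tuple

# (operand types, result type) for one declared operator entry.
Signature = Tuple[Tuple[str, ...], str]

# Hypothetical operator entries; each list stands in for the chain of
# same-named entries reachable via the configuration link.
OPERATORS: Dict[str, List[Signature]] = {
    "+": [
        (("integer", "integer"), "integer"),
        (("long_integer", "long_integer"), "long_integer"),
    ],
    "-": [(("integer",), "integer")],  # unary minus
}


class TypeChecker:
    def __init__(self) -> None:
        self.type_stack: List[str] = []
        self.errors: List[str] = []
        self.inhibit_codegen = False  # stands in for the condition code

    def push_operand(self, type_name: str) -> None:
        self.type_stack.append(type_name)

    def apply_operator(self, name: str, arity: int) -> None:
        """Pop `arity` (1 or 2) operand types and match them to an entry."""
        operands = tuple(self.type_stack[-arity:])
        del self.type_stack[-arity:]
        for wanted, result in OPERATORS.get(name, []):
            if wanted == operands:
                self.type_stack.append(result)  # assumed: push result type
                return
        # No entry with the proper types: report and suppress code generation.
        self.errors.append(f"no operator {name!r} for operand types {operands}")
        self.inhibit_codegen = True
        self.type_stack.append("error")  # placeholder so parsing can go on
```

Pushing two integer operands and applying "+" leaves a single integer on the stack; pushing an integer and a bit and applying "+" finds no matching entry, records one error, and sets the flag that suppresses code generation.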
The third component of the association ties the package together. In the definition of the routine block, a name_list is specified if there are parameters to the routine. At this point identifiers must be placed positionally according to their type and the type in the declaration of the procedure or operator. A semantic processor then compares the type expected at each position against the type of the identifier placed there. If there is a mismatch (including too few or too many parameters), an error message is generated.

7.0 Conclusions

The objective of the current research has been to implement a parser and intermediate text generator for the CLEOPATRA subset language. Further, this implementation was to serve as a test of the feasibility of the language itself. It was to serve as an analysis of the algorithms and data bases required to provide the facilities of the CLEOPATRA language. In general, the research has been successful in this regard. Certain limitations were placed on the full language, but every attempt was made to keep the language intact.

Is the subset implementation the perfect language? Certainly we would like to respond affirmatively to that question but, in fact, the answer is "not really". All of the major control structures and blocks have been implemented. But other important features have been omitted. Some of these omissions include (recalling from Chapter 5):

1) Pointers;
2) Complete freedom in the order of presentation of blocks for compilation;
3) Allowing arrays within a user-defined type (see [4]).

Notwithstanding these omissions, the language CLEOPATRA and the present subset are viable tools in the programmer's repertoire. During the coding of the analysis pass, the facilities of CLEOPATRA would have made implementation much easier and cleaner. Further, the language leads quite naturally to a well structured and clean program. The data bases used in the subset should be sufficient for later implementations.
Indeed, as a test for feasibility, the present implementation has demonstrated that the language is feasible.

The present research has indicated a course of action for further efforts on the implementation of the full language. Initially, a study of the use, need and desirability of the basic type POINTER should be undertaken. As stated in Chapter 5, it has been argued that pointers are an artifact from compilers gone by. A careful analysis of whether pointers should be available, and of what should be a legal target for a pointer if they are to be allowed, should be done first. If the type POINTER is to be retained, the method of implementation should be examined. This would answer such questions as:

1) "How is type checking to be handled?"; and
2) "How can arrays of pointers be implemented?".

The next area of study should be in symbol table design. In Chapter 5, it was stated that the present symbol table is insufficient for indirect pointers. This problem, along with some general restructuring, should be undertaken, possibly in conjunction with the study of pointers. (Suggestions have been given in Chapter 5.)

The presentation order of blocks should be examined relative to the bit map discussed in Chapter 5. As stated, the bit map seems to be the "cheapest" method to allow maximum flexibility in presentation order while retaining the general two pass method. This problem is somewhat less crucial, since the restriction used in the subset does not appear too restrictive. However, the algorithm should not be too difficult if the bit map scheme holds in all cases.

The last major area to be examined before implementing the full language is a study of optimal parsing and code generation methods for the CLEOPATRA language. A table driven method might be considered, although table size might make such a method impractical. In all considerations, the design of a well structured compiler with the capability of being overlaid should be paramount.
The above analyses having been completed, the full CLEOPATRA language could be implemented. With the lessons of the subset and the suggested analyses, CLEOPATRA would indeed be a very effective programming language.

References

[1] Schreiner, Axel T., "A Proposal for Another System Implementation Language", Ph.D. Thesis, Department of Computer Science, University of Illinois, Urbana, Illinois, 1974.

[2] Schreiner, Axel T., "Comprehensive Language for Elegant Operating System and Translator Design", Technical Report UIUCDCS-R-74-646, Department of Computer Science, University of Illinois, Urbana, Illinois, 1974.

[3] Halbur, John D., "A Code Generator for the CLEOPATRA Language", Master's Thesis, Department of Computer Science, University of Illinois, Urbana, Illinois, 1975.

[4] Halbur, John D., "CLEOPATRA Code Generator User's Guide", Technical Report UIUCDCS-R-76-740, Department of Computer Science, University of Illinois, Urbana, Illinois, 1976.

[5] Gries, David, Compiler Construction for Digital Computers, John Wiley and Sons, Inc., New York, 1971.

[6] Bauer, F. L., Eickel, J., Compiler Construction: An Advanced Course, Springer-Verlag, New York, 1974.

[7] Hoare, C.A.R., "Notes on Data Structuring", in Dahl, Dijkstra, and Hoare, Structured Programming, Academic Press, New York, pp. 83-174, 1972.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-76-834
Title and Subtitle: IMPLEMENTATION OF THE LANGUAGE CLEOPATRA: THE ANALYSIS PASS
Report Date: October 1976
Author: Scott Harley Fisher
Performing Organization: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Sponsoring Organization: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Type of Report: Master's Thesis

Abstract: CLEOPATRA is a general purpose language with features suitable for systems programming. A compiler for the language CLEOPATRA has been implemented in two passes. This report describes the analysis pass, which produces an intermediate text suitable for the code generation pass. The analysis pass was written in PL/I for the IBM 360 computer. Due to the facilities of the language, the analysis pass requires innovative data structures and algorithms; these are reported herein.

Descriptors: Block Structured Language, Compilation, Compilers, Intermediate Text, Programming Languages, Symbol Table Management, Parsing
Identifiers/Open-Ended Terms: Systems Implementation Languages, CLEOPATRA
Availability Statement: Release unlimited
Security Class (This Report): UNCLASSIFIED
Security Class (This Page): UNCLASSIFIED