Report No. UIUCDCS-R-73-596

A GENERALIZED LEXICAL SCANNER FOR A TRANSLATOR WRITING SYSTEM*

by

Albert Cannon Baker, Jr.

October 1973

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF-GJ-328 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, October 1973.

A GENERALIZED LEXICAL SCANNER FOR A TRANSLATOR WRITING SYSTEM

Albert Cannon Baker, Jr., M.S.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1973

This is an expository paper that is concerned with a Lexical Scanner for a translator writing system that has been in use at the University of Illinois. Its significant features include a structured, binary-tree symbol table, a parameterized macro expander, and a compile-time flexibility for assigning characters that make up the basic terminal symbols. A comprehensive example of the scanner's operation is also included.

ACKNOWLEDGMENT

I wish to express my deepest thanks to Professor J. R. Phillips for his continuing support as academic advisor and thesis supervisor; to Professor R. S. Northcote, who struck in me the spark of interest in compiler design and defined much of the material for the initial lexical scanner; and to Norma E. Able for her patience through many discussions leading me through the maze of TRANQUIL and the TWS.
Finally, to my wife Carol, thanks for the continued encouragement and motivation, without which the successful completion of these studies would never have been possible.

The Air Force Institute of Technology, Wright Patterson Air Force Base, Ohio, sponsored the author's studies at the University of Illinois.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 The Translator Writing System
   1.2 The Lexical Scanner
2. SCANNER DATA STRUCTURES
   2.1 General Considerations
   2.2 BIGTAB - The Symbol Table
       2.2.1 Basic Format
       2.2.2 Storage for Keywords, Identifiers and String Literals
       2.2.3 Storage for Numeric Literals
   2.3 The SCAN Descriptor
   2.4 MACROTAB - The Macro Text Table
       2.4.1 Concept of a Definitional Facility
       2.4.2 The TWS Macro Facility
       2.4.3 MACROTAB Descriptors
   2.5 CHARCLASS - The Character Class Table
3. MAIN SCANNER PROCEDURES
   3.1 READACARD
   3.2 NEXTCHAR
   3.3 Boolean Procedure TABLESEARCH
       3.3.1 Table Lookup
       3.3.2 Processing of an Entry Already in BIGTAB
       3.3.3 Insertion of a New Entry Into BIGTAB
   3.4 Procedures to Build BIGTAB Entries
   3.5 The Macro Facility
   3.6 Alpha Procedure SCAN
   3.7 Integer Procedure SHAKEOUTBIGTAB
APPENDIX
   A SYNTAX AND SEMANTICS OF SCANNER-DEFINED ITEMS
   B SAMPLE PROGRAM
LIST OF REFERENCES

LIST OF FIGURES

1. BIGTAB Basic Format
2. Format of the SCAN Descriptors
3. Table Lookup Algorithm
4. Examples of Balanced Tree Structures
5. Unbalanced Tree Structure
6. Balanced Tree Structure

1. INTRODUCTION

1.1 The Translator Writing System

One of the software efforts undertaken for the Illiac IV project at the University of Illinois at Urbana-Champaign was the development of a Translator Writing System (TWS) for implementing compilers. The compilers for the Illiac IV problem-oriented languages TRANQUIL [1] and GLYPNIR [2] are the prime examples.
The main components of the TWS are:

a) syntax meta-languages (TWINKLE [7] and TBNF [4]), which are extensions to Backus-Naur Form [8, 9], and in which the syntax of a programming language L is specified;

b) a semantics meta-language (Illinois Semantics Language, ISL [5]), which is an extension of Burroughs Extended Algol that includes special constructs to manipulate tables and stacks and to generate object code, and in which the semantics of a programming language L is specified;

c) a basic core for each translator consisting of the lexical scanner, either the skeleton parser for a direct parsing algorithm translated into Algol code or a complete table-driven parser for an interpretive parsing algorithm [4], and miscellaneous auxiliary procedures which are independent of the source and object languages of the translator;

d) a system consisting of syntax preprocessors to generate either the tables for the interpretive parser or a set of Algol source statements that will parse the language specified; the ISL translator to translate the ISL extensions into Algol; and a program to generate, from the outputs of the syntax preprocessor, the ISL translator, and the basic core, a complete Algol program which, when compiled by the standard Algol compiler, will be the required translator for the language L as specified by the syntax and semantics.

1.2 The Lexical Scanner

The function of a lexical scanner in a compiler is to scan characters from a source program, combining one or more of them together to form single terminal symbols when the syntactic recognizer (parser) makes a request for a new symbol.
As far as the TWS and the scanner are concerned, the following symbols are deemed to be terminal symbols:

a) special single characters ($,:-+=, etc.);

b) key words in the language, including either reserved identifiers or special identifiers for which a special character (nominally "#") precedes the identifier (BEGIN, #ELSE, etc.);

c) identifiers (X, CASHIN1STNATIONALBANK, etc.);

d) string and numeric literals ("THIS IS A STRING", 3.14159, @234, etc.).

Internal to the scanner is a powerful parameterized text-type [3] macro expander which has the capability to recognize and store declarations of defined identifiers, and to regurgitate the stored text when the identifier is used subsequently. This facility is transparent to the syntactic recognizer and, except for block structure considerations, is transparent to the semantics as well. Section 2.4 contains a discussion of the data structure of the macros, and Appendix A includes a discussion of the syntax and semantics of the macro generator.

The symbol table used by the scanner is a straightforward binary-tree structure, with disjoint trees for the several terminal symbol classes interleaved within the same table. Binary Coded Decimal (BCD) information is stored in this table packed six characters per 48-bit Burroughs B-5500 word. Section 2.2 contains a complete description of the symbol table.

There is much flexibility built into the scanner to make the resultant compilers both more general and easier to use. Through control card options, the user of the compiler may specify that non-standard symbols be used to define the terminal classes, such as using "8" instead of "@" to mark the exponent part of a numeric literal, or that the key words in the language be marked by a special symbol, freeing these identifiers for the programmer's use.
The inclusion of a macro facility gives the programmer the power to extend the basic language, to make one language resemble another, or to make his source code more readable. For example, if the compiler for an Algol-like language were written using the TWS, one could extend the language at compile time by adding appropriate macro definitions to a program written in it to make it resemble COBOL:

DEFINE ADDING = MEND,
       TO = + MEND,
       COMPUTE = MEND,
       BY = := MEND;

where MEND terminates a macro definition. Then the source language statement:

COMPUTE XYZ BY ADDING A TO B;

would be compiled as:

XYZ := A+B

since ADDING and COMPUTE were both defined to be null.

Thus, to reiterate, the main functions of this lexical scanner are to assemble the terminal symbols from the source string, to pass simple representations of those symbols to the syntactic recognizer, to maintain the BCD symbol table, and to perform macro expansion. The data structures behind these functions are the subject of Chapter 2, and a functional description of these functions in terms of the Algol procedures that implement them is the subject of Chapter 3.

2. SCANNER DATA STRUCTURES

2.1 General Considerations

The structure of the internal tables and the algorithms that use them have a great effect on the speed and efficiency of any program. The lexical scanner is one of the most used procedures in any compiler, and attention must be paid to making it as efficient as possible. In the implementation described here, one of the main considerations is the structure of the language into which the compilers are translated, Burroughs extended Algol for the B-5500. A brief introduction to the B-5500 and its constraints on the Algol language is appropriate here.

The B-5500 [8] is a multiprogramming, multiprocessor computer system.
With a limited (32K 48-bit words) main memory, it relies heavily on segmentation of both programs and data to make the most effective use of that memory in servicing the various programs in the mix. Specifically, programs and arrays are broken down into segments, each no larger than 1024 words. The program segments are stored on the disk. When a program enters the mix to be run, it is assigned a fixed, non-overlayable, contiguous area for a run-time stack and program reference table. This latter contains storage for single variables and descriptors relating to each program and array segment. Then, program and data segments are read off the disk as they are needed, and assigned space in core, possibly overlaying previous information from any of the programs in the mix. If the area being overlaid contains only program segments, or array segments that have not been written into, it is simply overwritten; the information is still on the disk. However, an area containing array segments with words that have been changed causes those segments to be written back onto the disk before being overlaid.

The restriction that array segments be no longer than 1024 words is of primary interest here. The segmentation is by array rows, with each row occupying a segment. Thus, no row may be longer than 1024 words. A one-dimensional linear array is one row, so no linear array may be longer than 1024 words. Larger linear tables must be simulated as two-dimensional arrays. For instance, an 8192-word table could be declared with array bounds [0:15, 0:511], so there would be sixteen segments each containing 512 words. When simulating large linear arrays, it is wise to make the range of each subscript a power of two, so that an entry in the table can be accessed using a single index. In the case above, the row subscript requires exactly four bits, whereas the column subscript requires exactly nine. Thus, given a single 48-bit index I, I.
[35:4] would select the proper row, whereas I.[39:9] would select the proper column position within that row (in the partial word notation of Burroughs extended Algol).

2.2 BIGTAB - The Symbol Table

In any compiler, storage must be provided for the representations of the terminal symbols. The efficiency of the compiler can be greatly affected by the choice of data structure for the symbol table. The specific functions that must be optimized in the use of the symbol table are, in order of importance:

a) lookup
b) insertion
c) traversing.

Furthermore, separate lists must be maintained for the four classes of multi-character terminal symbols: <*I>, <*N>, <*S>, and <*R>.

The basic structure of BIGTAB was chosen so as to store data as a forest of binary trees, having four trees interleaved within one 8192-word table. The advantages of this approach are:

a) because trees are naturally linked lists, they can be interleaved in the same table; thus, an identifier entry could be adjacent in the table to a numeric literal, and space has to be reserved for only one table;

b) lookup is fast compared to a linearly linked list;

c) insertion can be made in the next sequential location in the table with no need to change links already established.

Knuth [6] discusses extensively the characteristics of binary tree structures.

2.2.1 Basic Format

Each tree has a head node in a fixed table location: BIGTAB[1] for identifiers, BIGTAB[2] for numeric literals, BIGTAB[3] for string literals and BIGTAB[15] for key words. This head node is a pointer to the root of its tree. Each BIGTAB entry consists of an entry head plus one to eight data words to store the BCD characters of the text.

HEAD NODE

ENTRY HEAD:
    Semantic Part          0:16
    Number of Characters   16:6
    Left Pointer           22:13
    Right Pointer          35:13

DATA WORDS:
    (unlabeled field)      0:12
    BCD CHARACTERS

Figure 1.
BIGTAB Basic Format

2.2.2 Storage for Keywords, Identifiers and String Literals

The syntax preprocessor will extract from the syntax definition the language-specific keywords and non-terminals and place them, linked together in one tree in BIGTAB format, into a disc file called "<name>/TABLES", where <name> is the name assigned by the compiler designer. At the start of every execution of the TWS-built compiler, this file is read, initializing the run-time BIGTAB. The language non-terminals (such as <ARITHMETIC PRIMARY>) will allow the table-driven parser to provide a trace of the parsing path. The non-terminals are inserted into the table with a leading blank so they will never be recognized as identifiers in the source string. Appendix B lists the TBNF syntax of a simple language DEMALGOL, and gives an example of the initial BIGTAB produced by this syntax.

The scanner assumes as the nominal condition that these keywords will always be preceded by the special symbol "#" (e.g., #BEGIN), and that BIGTAB[15] will point to this initial syntax-preprocessor-built tree. Thus, all occurrences of "#" followed by an identifier will cause reference to this tree. But, by control card option, the nominal condition can be replaced by a reserved word option. In this condition, all occurrences in the source string of all syntax-defined keyword identifiers (e.g., BEGIN) will be reserved to have only the keyword meaning, and BIGTAB[1], the identifier tree, is set to point to the syntax-preprocessor-built tree. Thereafter, all identifiers in the source string will be checked against this table, and newly-defined identifiers will be linked into it. The scanner will recognize the presence of a reserved identifier by the fact that its BIGTAB address is within the range of the initial table.

For keywords, identifiers and string literals, the basic format is exactly as specified in section 2.2.1. The only difference among the three classes is in the use of the semantic part.
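The partial-word designators used for the entry head in section 2.2.1, and for the row and column selection in section 2.1, can be mimicked with shifts and masks. The following Python sketch is illustrative only and is not part of the TWS. It assumes that bit 0 is the most significant bit of the 48-bit word, which is consistent with the layouts in this report (the four entry-head fields sum to 48 bits, and a 13-bit table index occupies [35:13]). The helper names field, with_field, pack_entry_head and split_index are hypothetical.

```python
WORD_BITS = 48  # width of a B-5500 word


def field(word, start, length):
    """Return word.[start:length], counting bit 0 as the most significant bit."""
    shift = WORD_BITS - (start + length)
    return (word >> shift) & ((1 << length) - 1)


def with_field(word, start, length, value):
    """Return a copy of word with word.[start:length] replaced by value."""
    shift = WORD_BITS - (start + length)
    mask = ((1 << length) - 1) << shift
    return (word & ~mask) | ((value << shift) & mask)


def pack_entry_head(semantic, nchars, left, right):
    """Build an entry head using the field positions of section 2.2.1."""
    word = 0
    word = with_field(word, 0, 16, semantic)   # semantic part
    word = with_field(word, 16, 6, nchars)     # number of characters
    word = with_field(word, 22, 13, left)      # left pointer
    word = with_field(word, 35, 13, right)     # right pointer
    return word


def split_index(index):
    """Split a single table index into (row, column) for bounds [0:15, 0:511]."""
    return field(index, 35, 4), field(index, 39, 9)


head = pack_entry_head(semantic=0, nchars=5, left=100, right=0)
print(field(head, 22, 13))   # left pointer read back from the packed word
print(split_index(1000))     # (1, 488), since 1000 = 1 * 512 + 488
```

A pointer value of zero in either link is treated elsewhere in this report as a null link, which is why the sketch leaves the right pointer at zero.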
The BCD characters are stored six per 48-bit word, allowing a maximum of 48 significant characters.

For identifiers, headword bit [1:1] is reserved by the scanner to indicate this identifier is defined as a macro or macro formal parameter. If set, then bits [4:12] point to the address in MACROTAB of the stored text, and bits [2:1] indicate a formal parameter. If the identifier is not defined as a macro, bits [2:13] may be set by the compiler semantic routines as desired. In GLYPNIR [2], as implemented using the TWS, pointers to the semantic IDTAB and the parser MSTACK are inserted in the semantic part.

For keywords, the syntax preprocessor places in the semantic part a unique symbol number for each keyword in the syntax. This allows the parser to consider the keyword as it would a single special symbol.

For string literals, the semantic part is reserved for the semantic routines, typically to point to a literal table.

2.2.3 Storage for Numeric Literals

A numeric literal is a string of characters that carries an inherent semantic value: the specific quantity that this string represents. As this semantic value will be variable and machine dependent, the TWS will not convert these literals to an internal machine representation, but rather transform the literal string to a normalized, consistent BIGTAB entry, with enough analysis performed on the source string to make the semantic conversion of the numeric literal to internal representation relatively straightforward for the semantic part of the compiler.

In BIGTAB, the same basic header word and data word structure applies here as in the identifier, keyword and string literal tables. The semantic part of the header word can be used, as in the TRANQUIL compiler implemented using the TWS, to store a literal type, and a pointer to a semantic table. But, in the data words, quite a different structure is used.
Instead of the BCD characters packed six per 48-bit word, the full eight character capacity of each word is used, with the first two characters in the first data word being used to describe certain semantic attributes of the numeric literal:

ENTRY HEAD:
    Semantic Part          0:16
    Number of Characters   16:6
    Left Tree Pointer      22:13
    Right Tree Pointer     35:13

DATA WORDS (1-8):
    Char 0   Char 1   Char 2   Char 3   Char 4   Char 5   Char 6   Char 7
    0:6      6:6      12:6     18:6     24:6     30:6     36:6     42:6

The first two characters (12 bits) of the first data word have the following values:

0:1 - Unused, always zero
1:1 - Base indicator
      =0 Decimal numeral, base 10
      =1 Nondecimal numeral, base 2 to 36
2:1 - Numeric type
      =0 Integer
      =1 Real
3:1 - Sign of exponent
      =0 Positive
      =1 Negative
4:2 - Number of exponent digits (I), 0-3 (i.e., exponent 0-999, decimal)
6:6 - Number of mantissa digits (N).

There follow "N" characters of the mantissa; followed by "I" characters of the exponent (for real type numeric literals only); followed by one character containing the base (for nondecimal base numeric literals only), range 2-36 (decimal). The last data word is zero filled.

For numeric literals containing a radix point, the mantissa is normalized; that is, the exponent is recomputed as though the radix point were to the right of the rightmost mantissa digit.

For nondecimal bases, there must be provision for up to 36 different digits. The scanner considers 0-9 and A-Z as the 36 digits. The internal character codes for the decimal digits 0-9 exactly correspond to the "digit values" 0-9. But this is not true for the alphabetic letters. To correct for this, the input alphabetic character will be converted to a true digit value in the range 0-35 for storage in the data words. This is accomplished by subtracting a bias from the character code, depending on the letter:

A-I subtract 7
J-R subtract 14
S-Z subtract 22.

Consider as an example the hexadecimal numeric literal 3A42E.5690@-354(16).
The semantic descriptor would be composed as follows:

0:1 =0
1:1 =1, nondecimal base
2:1 =1, real
3:1 =1, negative exponent
4:2 =3, 3 digits of exponent
6:6 =8, 8 digits of mantissa

This produces for the first 12 bits 011 111 001 000, or as six-bit characters, "←8" (character codes 31 and 8). This will produce the following data words:

Data word 1: ← 8 3 # 4 2 > 5
Data word 2: 6 9 3 5 7 + 0 0

Note that the exponent has been changed from 354 to 357, representing the normalization; that the base is represented by "+", or 16 (decimal); that the second word is padded with zeros; that hex A (character code 17) is converted into "#" (character code 10); and that hex E (character code 21) is converted into ">" (character code 14).

The simple decimal integer 1 would be converted for storage to:

0:1 =0
1:1 =0, decimal base
2:1 =0, integer
3:1 =0, positive exponent
4:2 =0, no exponent part
6:6 =1, 1 mantissa digit.

This produces the six-bit characters "01", and the following data word:

Data word 1: 0 1 1 0 0 0 0 0

2.3 The SCAN Descriptor

The ALPHA procedure SCAN is called by the parser (and recursively from within the scanner itself) when a new terminal symbol is required. The 48-bit value assigned to SCAN as a function is referred to as the SCAN descriptor, and has the format shown in Figure 2.

For keywords <*R>, the symbol number is assigned by the syntax preprocessor, starting with 66 (decimal). Special single characters have a symbol number equal to their internal 6-bit character code, thus varying from 0 (the numeral zero) to 63 (the character "). This allows the keywords and the special single characters to be considered the same in the parsing routines.
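Returning to the numeric literal format of section 2.2.3, the two worked examples there can be checked mechanically. The Python sketch below is a hedged illustration, not the scanner's Algol code: it builds the 12-bit descriptor from its six fields, converts an internal character code to a digit value using the three biases, and recomputes the exponent for a mantissa with fractional digits. The function names are hypothetical, and the number of fractional digits is passed in directly rather than derived from source text.

```python
def descriptor(nondecimal, real, negative_exponent, exponent_digits, mantissa_digits):
    """Build the 12-bit descriptor held in the first two characters."""
    if not 0 <= exponent_digits <= 3:
        raise ValueError("exponent digit count must fit in 2 bits")
    if not 0 <= mantissa_digits <= 63:
        raise ValueError("mantissa digit count must fit in 6 bits")
    return (
        (int(nondecimal) << 10)            # 1:1 base indicator
        | (int(real) << 9)                 # 2:1 numeric type
        | (int(negative_exponent) << 8)    # 3:1 sign of exponent
        | (exponent_digits << 6)           # 4:2 number of exponent digits
        | mantissa_digits                  # 6:6 number of mantissa digits
    )


def descriptor_chars(*fields):
    """Split the descriptor into its two six-bit character codes."""
    bits = descriptor(*fields)
    return bits >> 6, bits & 0o77


def digit_value(code):
    """Map an internal character code for 0-9 or A-Z to a digit value 0-35."""
    if 0 <= code <= 9:
        return code
    if 17 <= code <= 25:    # A-I
        return code - 7
    if 33 <= code <= 41:    # J-R
        return code - 14
    if 50 <= code <= 57:    # S-Z
        return code - 22
    raise ValueError("not a digit or letter code")


def normalized_exponent(exponent, fraction_digits):
    """Exponent after moving the radix point to the right of the mantissa."""
    return exponent - fraction_digits


print(descriptor_chars(True, True, True, 3, 8))   # (31, 8) for the hexadecimal example
print(digit_value(17), digit_value(21))           # 10 14, the values of hex A and E
print(normalized_exponent(-354, 3))               # -357
```

The hexadecimal example stores eight mantissa digits for a literal written with nine, so the sketch assumes the trailing fractional zero has already been dropped before the digits are counted.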
Field positions:            0:2      2:4          6:12                   18:13               31:4         35:13

<*I>, Identifiers:          Unused   Class = 1    BIGTAB Semantic Part   Pointer to BIGTAB   Class = 1    Pointer to BIGTAB
<*N>, Numeric Literals:     Unused   Class = 2    BIGTAB Semantic Part   Pointer to BIGTAB   Class = 2    Pointer to BIGTAB
<*S>, String Literals:      Unused   Class = 3    BIGTAB Semantic Part   Pointer to BIGTAB   Class = 3    Pointer to BIGTAB
<*R>, Keywords:             Unused   Class = 15   Symbol Number          Pointer to BIGTAB   Class = 15   Symbol Number
Special Single Characters:  Unused   Class = 15   Symbol Number          Symbol Number       Class = 15   Symbol Number

Figure 2. Format of the SCAN Descriptors

2.4 MACROTAB - The Macro Text Table

2.4.1 Concept of a Definitional Facility

Many compilers in current use (B-5500 ALGOL, IBM PL/I, JOVIAL, etc.) have a definition facility, that is, a capability to define compile-time procedure-like constructs. One can compare a text-type definition or macro facility with a run-time procedure construct as follows:

A procedure

a) is considered syntactically as a complete syntactic unit;

b) produces one set of machine code that may be executed by jumps and parameter linkages from different parts of the main program.

A macro

a) may be an incomplete syntactic fragment composed of a sequence of terminal symbols;

b) produces a separate set of machine code for each invocation;

c) is strictly a compile-time device that is transparent to the parsing and semantic portions of the compiler.
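The contrast above can be made concrete with a small sketch of text-type expansion, using the COBOL-like definitions from the Introduction. This Python fragment is an assumption-laden illustration rather than the TWS mechanism: tokens are plain strings instead of SCAN descriptors, macros take no parameters, and nested definitions are resolved at invocation time. The function name expand and the dictionary layout are hypothetical.

```python
def expand(tokens, macros, active=()):
    """Replace each defined identifier by its stored token list.

    macros maps an identifier to the list of tokens it stands for.
    active holds the macros currently being expanded, so that a macro
    calling itself directly or indirectly is rejected.
    """
    out = []
    for token in tokens:
        if token in macros:
            if token in active:
                raise ValueError("recursive macro call: " + token)
            out.extend(expand(macros[token], macros, active + (token,)))
        else:
            out.append(token)
    return out


# DEFINE ADDING = MEND, TO = + MEND, COMPUTE = MEND, BY = := MEND;
macros = {"ADDING": [], "TO": ["+"], "COMPUTE": [], "BY": [":="]}

statement = ["COMPUTE", "XYZ", "BY", "ADDING", "A", "TO", "B", ";"]
print(expand(statement, macros))   # ['XYZ', ':=', 'A', '+', 'B', ';']
```

Because the expansion happens on the token stream, the parser sees only the resulting symbols, and each invocation yields its own copy of the text.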
2.4.2 The TWS Macro Facility

As the TWS was developed using Burroughs B-5500 ALGOL, it became apparent that the definition facility implemented on this compiler made the compiler easier to use, and actually allowed local extensions to be implemented in a rather straightforward manner. Therefore, as a practical matter, a similar parameterized macro expander was included as a part of the core compiler for all TWS-written compilers.

The storage scheme selected was to store the macro text as entire SCAN descriptors, with two header words for each definition. Some elements of the storage scheme are:

a) one 48-bit word per terminal symbol;

b) the scope of identifiers (a semantic concept) used in the macro text will be defined at the point in the program where the macro is declared, since the SCAN descriptor includes the BIGTAB semantic part at the time the macro was declared;

c) accessing the pre-stored macro text by the scanner may be faster than scanning the text from the source string, as the time-consuming assembling of the characters into the numeric strings and identifier strings, and table lookup in BIGTAB, is performed only once, no matter how many times the macro is invoked;

d) block structure considerations are made to allow an identifier to be defined, for example, as a label in one block and redefined as a macro in an inner block, with the old semantic definition being restored upon block exit;

e) a defined identifier may be redefined within the block in which it was declared, in which case the new text will replace the old text for subsequent invocation (this implies that the macro declaration does not necessarily have to be placed in the block head for a block-structured language);

f) no parsing or syntax checking of the text is made until the defined identifier is invoked;

g) defined identifiers (i.e., calls on other macros) may occur anywhere within the macro text, but the value is defined at the point of the macro declaration;

h) defined
identifiers may occur anywhere within the actual parameter part of any macro call;

i) no recursion (i.e., one macro directly or indirectly calling itself) is permitted.

Appendix A describes the detail of the syntax and semantics of the elements of the macro facility.

2.4.3 MACROTAB Descriptors

As the macro text is processed by the scanner from the DEFINE declaration, two header words are set up in MACROTAB:

WORD1:
    Unused                                          0:6
    Address of Return Descriptor                    6:12
    Address of Actual Parameter Table               18:12
    Number of Parameters                            30:12
    Unused                                          42:6

WORD2:
    BIGTAB Semantic Part of Defined Identifier      0:16
    Pointer to BIGTAB Address of Defined Identifier 16:13
    Link to Previous MACROTAB Entry                 29:12
    Block Nesting Level                             41:7

The text is then scanned (by SCAN) into the table, one word per terminal symbol. If a defined identifier is encountered in the source string, a special macro call descriptor is inserted into the table. If a formal parameter is encountered, a special formal parameter descriptor is inserted. Finally, at the end of the text, a return descriptor is inserted:

SCAN DESCRIPTOR:
    Unused                                          0:2
    Class                                           2:4
    Symbol #                                        6:12
    BIGTAB Pointer for <*I> <*N> <*S>               18:13
    Class                                           31:4
    BIGTAB Pointer                                  35:13

MACRO CALL DESCRIPTOR:
    Unused                                          0:2
    Class = 8                                       2:4
    Pointer to Called Macro                         6:12
    Address of Where to Continue after the Call     18:12
    Contents of Called Macro's Return Word          30:12

RETURN DESCRIPTOR:
    Unused                                          0:2
    Class = 9                                       2:4
    Where to Continue Processing upon Return;
      = 0 Means Outermost Macro                     6:12
    Address of Macro Call Descriptor                18:12

FORMAL PARAMETER DESCRIPTOR:
    Unused                                          0:2
    Class = 10                                      2:4
    Address of Macro Header Word                    6:12
    Parameter Number                                18:12

When the macro is invoked, the actual parameters must be stored, in a manner similar to the macro itself, as SCAN descriptors. In addition to the stored text, for each actual parameter, there will be one return descriptor as described above, plus one parameter address and length word for each two actual parameters.
ADDRESSES AND LENGTH DESCRIPTOR:
    First Length      0:12
    First Address     12:12
    Second Length     24:12
    Second Address    36:12

As the formal parameters used within the macro definitions are strictly local, provision has been made to use the high-order end of the macro table (from location 4095 down) as temporary storage for the semantic part of the parameter identifiers during scanning of the macro text. This semantic part is then restored when the MEND terminating the definition is scanned:

FORMAL PARAMETER SAVE WORDS:
    BIGTAB Semantic Part    0:16
    BIGTAB Pointer          16:13

2.5 CHARCLASS - The Character Class Table

The term "terminal head symbol" refers to the first character of a terminal symbol. The scanner needs a way to determine from the terminal head symbol what class of terminal symbol is to follow. For example, a decimal digit, radix point or exponent sign will indicate that a numeric literal must be formed from the following characters. Similarly, a string quote indicates that a string literal follows. For this and other decision points in the scanner, a table of character classes has been established, assigning to each six-bit BCD character a bit string:

Character Class                             Class Value   CHARCLASS Bit Positions 41-47
Digits 0-9                                  58            0 1 1 1 0 1 0
Special Keyword Delimiter (#)               4             0 0 0 0 1 0 0
Numeric Literal Exponent Delimiter (@)      18            0 0 1 0 0 1 0
Radix Point (.)                             34            0 1 0 0 0 1 0
Numeric Literal Base Delimiter "("          64            1 0 0 0 0 0 0
String Quote (")                            3             0 0 0 0 0 1 1
All Other Special Symbols                   0             0 0 0 0 0 0 0
Letters A-I                                 89            1 0 1 1 0 0 1
Letters J-R                                 105           1 1 0 1 0 0 1
Letters S-Z                                 121           1 1 1 1 0 0 1

Table 1. CHARCLASS Table

When certain lexical decisions must be made about a character in the source string, a branch is made, indexed by some subset of the bits in the CHARCLASS entry.

When the scanner is ready to search for a new terminal symbol, a branch is made on bits [45:3] of the CHARCLASS of the terminal head symbol, with the following results:

[45:3] Value   Characters in This Class
1              Letters A-Z. Identifier follows.
2              "@", ".", digits 0-9.
               Numeric literal follows.
3              String quote ("). String literal follows.
4              "#". If the special word option for keywords was chosen, a language keyword follows.
0              All other special characters. Process as a special single character.

During the assembly of a numeric literal, when it is known the terminal head symbol has bits [45:3]=2, a further branch is made on bits [42:2], with the following results:

[42:2] Value   Characters in This Class
1              "@". Exponent delimiter.
2              ".". Radix point.
3              Digits 0-9.

These correspond to special processing depending on the numeric type.

Later, during assembly of the interior symbols of the numeric literal, a branch is made on bits [41:3] of the incoming source symbol, with the following results:

[41:3] Value   Characters in This Class
1              "@". Branch to exponent logic.
2              ".". Branch to fractional digit logic.
3              Digits 0-9. Assemble in normal manner.
4              "(". Branch to base logic.
5              A-I. Branch to logic to convert character codes 17-25 to true digit values 10-18.
6              J-R. Branch to logic to convert character codes 33-41 to true digit values 19-27.
7              S-Z. Branch to logic to convert character codes 50-57 to true digit values 28-35.
0              All other special symbols. Terminate numeric literal processing.

During assembly of the alphanumeric internal characters of an identifier, CHARCLASS bit [44:1] is used to indicate an alphanumeric character.

[44:1] Value   Characters in This Class
1              Digits 0-9, letters A-Z. Assemble into the identifier.
0              All other characters. Terminate identifier processing.

This procedure of using the internal character code of the source characters to index a table of character classes allows the TWS-supplied control card options below to be easily implemented:

Control card options: ALPHABETIC a, ALPHANUMERIC b, RSWD, SPWD c, EXPONENT d.

ALPHABETIC a: Let a have CHARCLASS.[44:4]=9. Thus a can be an identifier terminal head symbol ([45:3]=1) or an internal identifier symbol ([44:1]=1).

ALPHANUMERIC b: Let b have CHARCLASS.[44:1]=1. Thus b can be an internal identifier symbol.
RSWD: Choose the reserved word option for keywords. Let "#" have CHARCLASS = zero.

SPWD c: Retain the nominal special word option for keywords, but designate c to be the delimiter by setting the CHARCLASS of c to 4, and reset the CHARCLASS of "#" to zero.

EXPONENT d: Change the CHARCLASS of d to 18; reset the CHARCLASS of "@" to zero.

Specifying "ALPHANUMERIC -" on a control card would allow a COBOL-like identifier CASH-IN-FIRST-NATIONAL-BANK.

3. MAIN SCANNER PROCEDURES

This chapter will give a functional description of major parts of the lexical scanner in terms of the ALGOL procedures that perform the various functions.

3.1 READACARD

This procedure defines the basic format of the source program cards accepted by the TWS scanner.

a) The source program card images are read from the disc file "SOURCE" as 80-character records, into an array CARDBUF[0] to CARDBUF[9], appropriately recognizing the end-of-file condition.

b) The text in card image columns 1-72 is then transferred into another array CHARBUF[1] to CHARBUF[72], one BCD character per word.

c) The card image is analyzed to identify leading and trailing blanks, setting items "FCR" as the column with the first non-blank character, "LCR" as the last non-blank character, and "NCR" as the moving character pointer initially set equal to "FCR".

d) The card image counter "CARDCOUNT" is incremented by one, with the current value placed in CARDBUF[10] for printing as columns 81-89 of the card image.

e) The number in columns 72-80 of the card image is translated to internal form, and made available to the rest of the compiler procedures as "CARDNUM". If this field is blank on the first card image in the source stream, "CARDNUM" will be subsequently set equal to "CARDCOUNT" for each card image.

f) If column one is "$", the card image is printed on printer backup disc file "LINE", and control is passed to the procedure "CONTROLCARD" for analysis of the control card information.
g) If the card image was not a control card, and if the "PRINT" or "LIST" control card options had been chosen previously, the contents of array CARDBUF, including the inserted card counter, are printed on file "LINE".

Thus, externally, the following are the user-oriented TWS features implemented through READACARD:

a) control cards start with a "$" in column one;
b) source text occupies columns 1-72;
c) columns 72-80 may contain a card count that will be made available, if the semantic routines store it, for program traces.

3.2 NEXTCHAR

This procedure provides other scanner routines with text one character at a time. Its output is a variable NXTCHR containing the six-bit BCD code of the character scanned in the source string. In addition, the following functions are performed:

a) If a "%" is detected anywhere on the card image, the rest of the card image is ignored. This is the basic COMMENT capability provided by the TWS. An ALGOL-like facility may be implemented by the semantics, with the caveat that all the text in the comment would be scanned, with all words and numbers placed into BIGTAB.
b) When not internal to a string literal, multiple blanks in the source stream are reduced to a single blank for parsing purposes.

3.3 Boolean Procedure TABLESEARCH

Prior to the execution of TABLESEARCH, other procedures will have been run to assemble from the source string one of the classes <*I>, <*R>, <*S>, or <*N> of multi-character terminal symbols into SYMBUF[0] to SYMBUF[7], in the format of the BIGTAB data words described in section 2.2. The value of the procedure will be TRUE if a successful BIGTAB table lookup has been made; in that case the global variable NEXTSYM is set to the SCAN descriptor (see section 2.3). The value is set to FALSE if a macro definition or call is encountered, indicating that the main SCAN procedure must then either assemble further text from the source string, or obtain SCAN descriptors from the macro table.
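The two NEXTCHAR conventions of section 3.2 (comment removal and blank reduction) can be illustrated with a small sketch. It is written in Python rather than the scanner's ALGOL and handles one card image at a time. The name `next_chars` and the `quote` parameter are assumptions made for illustration, and string-literal tracking is reduced to toggling on the quote character; this is not the TWS code.

```python
def next_chars(card, quote='"'):
    """Yield the characters of one card image roughly as NEXTCHAR might.

    Hypothetical helper: text from "%" onward is dropped, and runs of
    blanks outside string literals collapse to a single blank.
    """
    in_string = False
    prev_blank = False
    for ch in card[:72]:          # source text occupies columns 1-72
        if ch == "%":
            break                 # "%" anywhere: ignore the rest of the card
        if ch == quote:
            in_string = not in_string
        if ch == " " and not in_string:
            if prev_blank:
                continue          # squeeze adjacent blanks to one
            prev_blank = True
        else:
            prev_blank = False
        yield ch


print(repr("".join(next_chars('A   :=  "X  Y"  % note'))))
```

Blanks inside the string literal are kept as written, while the comment text never reaches the caller.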
The TABLESEARCH procedure has three main portions that will be described in detail: table lookup in BIGTAB, processing of an entry already in BIGTAB, and insertion of a new entry into BIGTAB.

3.3.1 Table Lookup

BIGTAB is a straightforward binary tree structure. The basic algorithm below does not reflect the complication in the scanner that is required by the fact that entries to be compared may be of different lengths, extending over one to eight words.

Given SYMBUF[0] to SYMBUF[7] containing an entry in BIGTAB format; TYP, the terminal symbol class (1, 2, 3, 15) of the entry; NWDCH, the length of the SYMBUF entry; LEFTPOINTER and RIGHTPOINTER referring to pointers in the BIGTAB entry head:

  L1 (Find root)                     Set ENTRYPTR ← BIGTAB[TYP].
  L2 (Test if link null)             Is ENTRYPTR = 0? Yes, go to EXIT1.
  L3 (Compare SYMBUF with BIGTAB)    If SYMBUF = BIGTAB entry, and length of SYMBUF = length of BIGTAB entry, then go to EXIT2.
  L4 (SYMBUF < BIGTAB)               If SYMBUF < BIGTAB entry, set ENTRYPTR ← BIGTAB[ENTRYPTR].LEFTPTR, set K ← 1, go to L2.
  L5 (SYMBUF > BIGTAB)               Set ENTRYPTR ← BIGTAB[ENTRYPTR].RIGHTPOINTER, set K ← 2, go to L2.
  EXIT1                              Entry not in BIGTAB. See section 3.3.3.
  EXIT2                              Entry already in BIGTAB, ENTRYPTR is location of head node. See section 3.3.2.

Figure 3. Table Lookup Algorithm

3.3.2 Processing of an Entry Already in BIGTAB

If the symbol scanned in the source string is already in BIGTAB, three cases must be distinguished:

a) The keyword DEFINE is encountered. A macro definition follows. Process the text into MACROTAB format (see section 2.4). Exit TABLESEARCH, indicating an unsuccessful table lookup.
b) A defined identifier indicating a macro call is encountered. Process the actual parameters of the call (see section 2.4), if any, and change the scan mode to indicate that subsequent scan descriptors are to come from the macro table. Exit TABLESEARCH, indicating an unsuccessful table lookup.
c) A normal entry is encountered.
Build the scan descriptor NEXTSYM according to the symbol class. Exit TABLESEARCH, indicating successful table lookup.

3.3.3 Insertion of a New Entry Into BIGTAB

When the symbol scanned in the source string is not found in the BIGTAB table lookup, it must then be inserted into the symbol table:

a) The head word is created, inserting only the number of characters. The semantic routines will set the semantic part, and both right and left tree links will be empty when the entry is created.
b) The head word and the data word(s) are inserted into the table in the next available sequential location. A check is made to ensure that the complete entry will fit into one array row, since splitting the elements of one symbol across array rows would cause undue overhead due to array row segmentation in the B-5500.
c) It was found that the entry was not in BIGTAB when either the right or the left pointer of the entry head in location ENTRYPOINTER was null. If in L4, K was set to one, then the left pointer of the entry head at location ENTRYPOINTER must be set to point to this new entry address. Otherwise, in L5, K was set to two, so the right pointer must be set to point to the new entry.
d) Build the scan descriptor NEXTSYM according to the symbol class. Exit TABLESEARCH, indicating a successful table lookup.

3.4 Procedures to Build BIGTAB Entries

There exist three major procedures, NUMERICLIT, STRINGET, and ALPHAGET, to assemble into SYMBUF[0] to SYMBUF[7] the numeric literal, string literal and identifier data types. They are functionally described by their outputs in section 2.2.

3.5 The Macro Facility

Section 2.4, describing the data storage for the macro text, gives an adequate functional description of the procedures PROCESSMACRODECLARATION, MACROINVOCATION, and PROCESSMACROACTUALPARAMETERPART. This section will discuss the procedure GETDESCRIPTORFROMMACROTAB, to illustrate how it "executes" the descriptors placed in the macro table by the other procedures.
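The lookup of Figure 3 and the insertion steps of section 3.3.3 can be sketched together. Python stands in for ALGOL here; the table is modelled as a list of small dictionaries rather than packed 48-bit words, index 0 is reserved so that 0 can serve as the null link, and the names `table`, `root` and `search_or_insert` are assumptions of this sketch. Unlike the simplified figure, the sketch keeps the parent's address in a separate variable so that the link selected by K can be patched after the search falls off the tree.

```python
def search_or_insert(table, root, key):
    """Return (address, found, root) for key in a BIGTAB-like binary tree.

    table[0] is a dummy entry so that a link value of 0 means "null".
    """
    parent, k = 0, 0
    ptr = root                                  # L1: find root
    while ptr != 0:                             # L2: test if link null
        entry = table[ptr]
        if key == entry["key"]:                 # L3: compare
            return ptr, True, root              # EXIT2: already present
        parent = ptr
        if key < entry["key"]:                  # L4: go left, remember K = 1
            ptr, k = entry["left"], 1
        else:                                   # L5: go right, remember K = 2
            ptr, k = entry["right"], 2
    # EXIT1: not found; append at the next sequential location.
    table.append({"key": key, "left": 0, "right": 0})
    new = len(table) - 1
    if parent == 0:
        root = new                              # first entry becomes the root
    elif k == 1:
        table[parent]["left"] = new
    else:
        table[parent]["right"] = new
    return new, False, root


table, root = [None], 0
for word in ["M", "C", "T", "A"]:
    _, _, root = search_or_insert(table, root, word)
print(search_or_insert(table, root, "T"))
```

Entries never move once stored; only the two link fields of the parent change, which is the property the balancing procedure of section 3.7 relies on.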
The central idea in designing this descriptor-based macro system, which was to allow nearly arbitrary text in the parameters of a macro invocation and arbitrary (except recursive) macro invocations either within the text or within an actual parameter, was that the descriptors could be considered "instructions" directing the flow of data from the table, to be "interpreted" by the Alpha procedure GETDESCRIPTORFROMMACROTAB. If a formal parameter or a call on another macro is detected during scanning of the text in the declaration, a special "jump instruction" is placed in the sequential macro table to direct the flow of data. At the end of the macro text itself, and at the end of an actual parameter, a "return" word is inserted to direct the flow back to the point where it was interrupted.

The procedure is called from the SCAN procedure when a new symbol is needed in SCANMODE 5. The global variable NEXTMACRO contains the macro table address of the next sequential entry to be chosen. The macro table entry at this address is examined and "executed", depending on the class field, bits [2:4] of the entry:

  Class Value   Action
  1, 2, 3, 15   Normal SCAN descriptor. Set GETDESCRIPTORFROMMACROTAB ← MACROTAB[NEXTMACRO]. Increment NEXTMACRO by one. Exit procedure.
  8             Macro invocation descriptor.
                1) Set up the called macro's return descriptor. The return location is either NEXTMACRO + 1 or the address following the actual parameter table. This location has been inserted in bits [18:12] of the macro invocation descriptor by the procedure PROCESSMACRODECLARATION.
                2) Set in the called macro's entry head word one the address of the actual parameter table. If parameters are present, their location will be NEXTMACRO + 1.
                3) Set NEXTMACRO to the first word of the called macro's text, located immediately following the second header word.
                4) Branch to code to examine a new macro table entry.
  9             Return descriptor.
                Either a complete macro call or an actual parameter has been "executed". Consider the following two cases:
                1) Return address is zero. This is true only for an outermost macro call. Set GETDESCRIPTORFROMMACROTAB ← 0, set SCANMODE ← 0. Set PTMACROTAB (the pointer to the next available location for insertion of next text) to the value it had at the time the defined identifier was scanned in the source string. This will "erase" the actual parameters stored for this call. Exit from the procedure.
                2) Return address is not zero. Set NEXTMACRO ← return address, branch to code to examine a new macro table entry.
  10            Formal parameter. Extract from descriptor bits [6:12] the address of the macro head word one. From bits [18:12] of the head word, extract the location of the actual parameter table as set when the actual parameters were scanned. From bits [18:12] of the formal parameter descriptor, extract the parameter number. Determine from the addresses and lengths descriptor in the actual parameter table the location of the specific actual parameter needed, as well as the address of its return descriptor. Set the actual parameter return address to NEXTMACRO + 1. Set NEXTMACRO ← actual parameter address. Branch to code to examine a new macro table entry.

Note that none of the "special" descriptors cause output from the procedure, but just a redirecting of the flow, followed by "execution" of the descriptor in the new location. Note also that this procedure will work on either empty macros or empty parameters. In both cases, the stored text will be simply a return descriptor.

3.6 Alpha Procedure SCAN

SCAN is the procedure that controls the actions of all the other procedures mentioned above. It is the prime interface with the parsing routines, and is called when a new terminal symbol is required.
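The descriptor "execution" of section 3.5 can be sketched as a small interpreter. This is an illustration in Python, not the ALGOL procedure: each macro table entry is a three-element list instead of a packed word, the per-macro head words are kept in a separate dictionary `heads`, and an invocation descriptor carries its return address and actual parameter locations directly. All of these layouts and names are assumptions of the sketch; only the class values 8, 9 and 10 and the flow of control follow the description above.

```python
SYM, CALL, RET, PARAM = 1, 8, 9, 10   # SYM stands for classes 1, 2, 3, 15


def get_descriptors(table, heads, start):
    """Yield normal descriptors, following jump and return descriptors.

    table: list of [class, a, b]; address 0 is unused so that a return
           address of 0 can mean "outermost call finished".
    heads: per-macro record with 'text' (first text address), 'ret'
           (address of the macro's return descriptor) and 'actuals'
           (list of (first address, return descriptor address) pairs).
    """
    nxt = start
    while True:
        cls, a, b = table[nxt]
        if cls == SYM:                        # normal descriptor: output it
            yield a
            nxt += 1
        elif cls == CALL:                     # macro invocation descriptor
            head = heads[a]
            table[head["ret"]][1] = b["ret"]  # where the called macro returns
            head["actuals"] = b["actuals"]
            nxt = head["text"]
        elif cls == RET:                      # return descriptor
            if a == 0:
                return                        # outermost call is complete
            nxt = a
        elif cls == PARAM:                    # formal parameter descriptor
            first, ret = heads[a]["actuals"][b]
            table[ret][1] = nxt + 1           # come back after this descriptor
            nxt = first
        else:
            raise ValueError(f"unknown descriptor class {cls}")


# A macro INC(X) with text  X := X + 1  stored at addresses 1-6,
# followed by the actual parameter A[I] of an outermost call (7-11).
table = [
    None,
    [PARAM, "INC", 0], [SYM, ":=", None], [PARAM, "INC", 0],
    [SYM, "+", None], [SYM, "1", None], [RET, 0, None],
    [SYM, "A", None], [SYM, "[", None], [SYM, "I", None],
    [SYM, "]", None], [RET, 0, None],
]
heads = {"INC": {"text": 1, "ret": 6, "actuals": [(7, 11)]}}
print(" ".join(get_descriptors(table, heads, heads["INC"]["text"])))
```

Because a macro's single return descriptor is overwritten at each call, the scheme handles nested calls but not recursion, which matches the restriction stated for the facility.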
The procedures to build the macro table are declared in the SCAN procedure head, and thus call SCAN recursively to obtain the descriptors to store in the macro table.

The value of SCAN as a function is normally the scan descriptor as discussed at length in section 2.3. In SCANMODE 4, its value will be the contents of SYMBUF[0].

There are several modes of operation of SCAN, depending on the way the source characters are to be assembled. Setting the global variable SCANMODE prior to a call of SCAN will cause one of the following actions to be taken:

  SCANMODE   Action Taken by the Scanner
  0          Normal operational mode. Ignore all embedded blanks outside of string literals. Return normal SCAN descriptors.
  1          As in SCANMODE 0, but reduce adjacent embedded blanks to one blank and report as a single special symbol.
  2          Scan the text between FCR and LCR. Return a descriptor on each character in the source string as a single special character SCAN descriptor, but ignore blanks.
  3          As SCANMODE 2, but reduce adjacent blanks to one and report.
  4          As SCANMODE 0, but return contents of SYMBUF[0], i.e., the first BCD characters of the terminal symbol, as the SCAN descriptor. Do not look up or enter the symbol in BIGTAB.
  5          Fetch SCAN descriptor from the macro table.

When in SCANMODE 0, 1 or 4, a branch is made on CHARCLASS[45:3] of the terminal head symbol to determine whether an identifier, a numeric literal, a string literal, a special word keyword or a single special symbol follows. Based on the specific branch made, the terminal symbol is assembled in the proper format into the array SYMBUF. In SCANMODE 0 or 1, a BIGTAB table lookup is performed, obtaining the BIGTAB semantic part and address of the symbol. With this information, the SCAN descriptor is assembled.

3.7 Integer Procedure SHAKEOUTBIGTAB

It was noticed when working with the initial BIGTAB produced by the 1969 version of TRANQUIL that there was quite a large imbalance in the tree structure.
Specifically, the initial BIGTAB contained 198 entries consisting of 109 keywords and 89 language terminals. A reflection on the properties of binary trees shows that in the worst case all nodes could be "strung out", requiring 198 levels, and in the best case eight levels ($\lceil \log_2 198 \rceil$). The importance of the number of levels is in the speed of lookup: the more levels in the tree, the more comparisons must be made to find an entry in the tree.

On the 198-entry tree actually produced by the syntax preprocessor, the level number of the nodes varied from one (for the head of the tree) to eighteen. The average level of all 198 nodes was nine. For comparison, a fully balanced binary tree with eight levels could contain 255 nodes, and would have a maximum level of eight and an average level of 7.03.

An algorithm was developed by this author that will balance the tree structure of any input BIGTAB-type tree, modifying the left and right tree pointers, but leaving all nodes in the same locations as previously. The maximum level of the balanced tree will be $\lceil \log_2 (N+1) \rceil$, where N is the total number of nodes in the tree to be balanced.

Before the algorithm is discussed in detail, some observations can be made about the structure of the balanced tree. Given the nodes in lexical order in the sequential table TAB[1] to TAB[N], the tree developed by this algorithm will have all odd TAB entries, i.e., TAB[1], TAB[3], TAB[5], etc., as terminal nodes, with both left and right tree pointers null. Conversely, all even TAB entries, i.e., TAB[2], TAB[4], etc., will have at least one non-null link. Furthermore, as the tree is "grown", the left sub-tree of any node will always be complete. If the tree is not full, it will be the right sub-trees that will be partially empty. Figure 4 illustrates these points.
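These structural observations can be checked mechanically. The sketch below, in Python rather than ALGOL, computes the links of a balanced tree over positions 1 to N of the ordered list TAB, using a halving offset (here called `delta`) and a queue that visits the nodes level by level. It works on TAB positions only and ignores the packed BIGTAB word layout; the function names and the dictionary of links are assumptions of the sketch. The level count is taken to be $\lceil \log_2 (N+1) \rceil$, which for a positive integer equals Python's `N.bit_length()`.

```python
from collections import deque


def balance(n):
    """Return (root, links) for a balanced tree over TAB[1..n].

    links[p] = (left, right), with 0 as the null link.
    """
    i = 1
    while i < n:                     # smallest full size 2**m - 1 that is >= n
        i = 2 * i + 1
    root = (i + 1) // 2
    delta = {root: root // 2}
    links = {}
    queue = deque([root])
    while queue:                     # level by level, left to right
        p = queue.popleft()
        d = delta[p]
        if d == 0:                   # terminal node: both pointers null
            links[p] = (0, 0)
            continue
        left = p - d                 # left sub-tree is always complete
        delta[left] = d // 2
        queue.append(left)
        while d > 0 and p + d > n:   # right sub-tree may be incomplete
            d //= 2
        right = p + d if d > 0 else 0
        if right:
            delta[right] = d // 2
            queue.append(right)
        links[p] = (left, right)
    return root, links


def depth(links, p):
    """Number of levels in the sub-tree rooted at position p."""
    if p == 0:
        return 0
    left, right = links[p]
    return 1 + max(depth(links, left), depth(links, right))


root, links = balance(198)
print(root, depth(links, root))
```

For N = 198 the root is position 128 and the tree has eight levels. For a completely full eight-level tree the average level is (1·1 + 2·2 + 3·4 + ... + 8·128)/255 = 1793/255, about 7.03, in agreement with the figure quoted above.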
The essence of the algorithm is to order the nodes into a linear list TAB, and then to visit each node on each level sequentially, from left to right, computing and setting the new BIGTAB tree links as each node is visited. To control sequencing of the algorithm, a queue is constructed, initialized both front and rear with the new head node. When the right and left tree links are computed for a node, the TAB addresses of these sub-nodes are inserted into the rear of the queue. This results in the visit of all nodes in a certain level before progressing to the next lower level.

Use of this procedure has been made a control card option. If "BALANCE" appears on a control card, the initial BIGTAB is balanced. Thereafter, the procedure may be called from procedure TABLESEARCH whenever a BIGTAB array row fills up.

On a series of benchmark tests using a 1943-card-image input deck and the 1969 TRANQUIL BIGTAB, the balancing added about 2.8 seconds to the two-minute total scan time. Use of the balanced BIGTAB, however, increased throughput of the scanner by between six and seventeen percent over that using the unbalanced BIGTAB.

[Figure 4. Examples of Balanced Tree Structures: balanced trees for 1, 2, 3 and 4 nodes]

ALGORITHM B - Balance the Tree Structure

Let BIGTABWORD be considered a pointer with bits [22:13] as a left pointer field, and bits [35:13] as a right pointer field. Let each TAB entry have three fields: bits [15:10] as the queue link, bits [25:10] as the delta field, and bits [35:13] as the BIGTAB address.

B1   Traverse the tree in postorder (see Algorithm T, Knuth [6], page 317), i.e., in symmetric order so that the nodes emerge in lexical sequence, using an auxiliary stack and placing the BIGTAB addresses of the nodes visited into a sequential array TAB[1] to TAB[N].
B2   Find the size of the smallest fully balanced tree that has at least N nodes. Let I be this number, with a value 1, 3, 7, 15, 31, ..., $2^m - 1$ ($m \ge 1$). The exponent m is the maximum number of levels in the balanced tree.
B3   Set SHAKEOUTBIGTAB to be the BIGTAB address of the root of the balanced tree. This will be found in TAB[$\lceil I/2 \rceil$]; for example, if I = 15, TAB[8] will contain the BIGTAB address of the root of the tree.
B4   Set F ← R ← P ← $\lceil I/2 \rceil$, the root of the tree. Set the delta field of TAB[P] ← P/2.
B5   Is F = 0? (Output queue exhausted?) Yes, go to B13 (exit).
B6   Set P ← F (front of queue). Compute right and left tree pointers for node P. Set DELTA ← delta field of TAB[P]. Set Q ← BIGTAB address stored in TAB[P].
B7   If DELTA = 0, then node P is a terminal node with both pointers zero. Set BIGTABWORD to zero. Go to B12.
B8   DELTA ≠ 0. Compute left tree pointer. Set LINK ← P - DELTA. Set left part of BIGTABWORD to the BIGTAB address field of TAB[LINK]. Set the delta field of TAB[LINK] to DELTA/2 (as it is on the next lower level), and insert LINK into the rear of the output queue.
B9   Compute the right tree pointer. Set LINK ← P + DELTA. Since the right sub-tree may be incomplete, recompute DELTA: if LINK > N, then set DELTA ← DELTA/2, go to B9.
B10  If DELTA = 0, then the right sub-tree is empty. Set the right part of BIGTABWORD to zero. Since there is no right sub-tree, nothing needs to be put into the output queue. Go to B12.
B11  DELTA ≠ 0. Set right part of BIGTABWORD to the BIGTAB address field of TAB[LINK]. Set DELTA/2 into the delta field of TAB[LINK]. Insert LINK into the rear of the output queue.
B12  Set BIGTAB[Q].[22:26] ← BIGTABWORD. Set F to the next node from the front of the queue. Go to B5.
B13  Exit procedure. Entire tree is balanced.

APPENDIX A

SYNTAX AND SEMANTICS OF SCANNER-DEFINED ITEMS

<*I>, the Identifier Metaclass

TBNF SYNTAX

<ALPHABETIC CHARACTER> ::= A | B | C | D | ... | Z
<DIGIT> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<ALPHANUMERIC CHARACTER> ::= <ALPHABETIC CHARACTER> | <DIGIT>
<*I> ::= <ALPHABETIC CHARACTER> | <*I> <ALPHANUMERIC CHARACTER>

SEMANTICS

All identifiers are limited to 48 characters in length. Identifiers may not extend over card-image boundaries.
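The identifier rules tie back to the CHARCLASS table of section 2.5, and a table-driven sketch may make the connection concrete. It is written in Python, models only the two fields used for identifiers ([45:3] for the head symbol and [44:1] for interior symbols), and assumes that a partial word designator [start:length] counts bit 0 as the most significant bit of a 48-bit word. The names `field`, `make_charclass` and `scan_identifier` are hypothetical, and the `extra_alphanumeric` argument mimics the effect of an "ALPHANUMERIC" control card.

```python
def field(word, start, length):
    """Extract partial word [start:length] from a 48-bit word.

    Assumption of this sketch: bit 0 is the most significant bit.
    """
    return (word >> (48 - start - length)) & ((1 << length) - 1)


def make_charclass(extra_alphanumeric=""):
    """Build a reduced CHARCLASS table keyed by character."""
    cc = {}
    for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        cc[ch] = 9                     # [45:3]=1 and [44:1]=1, i.e. [44:4]=9
    for ch in "0123456789":
        cc[ch] = 8 | 2                 # [44:1]=1 and [45:3]=2 (numeric head)
    for ch in extra_alphanumeric:
        cc[ch] = cc.get(ch, 0) | 8     # set only the interior bit [44:1]
    return cc


def scan_identifier(card, start, charclass, limit=48):
    """Return the identifier beginning at card[start], or None."""
    if field(charclass.get(card[start], 0), 45, 3) != 1:
        return None                    # not an identifier head symbol
    end = start + 1
    while end < len(card) and field(charclass.get(card[end], 0), 44, 1) == 1:
        end += 1
    if end - start > limit:
        raise ValueError("identifier longer than 48 characters")
    return card[start:end]


print(scan_identifier("CASH-IN-FIRST-NATIONAL-BANK := 0", 0, make_charclass("-")))
```

With the default table the same card yields only CASH, since "-" then terminates identifier processing.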
The standard syntax of <ALPHABETIC CHARACTER> or <ALPHANUMERIC CHARACTER> may be augmented by control card option (see section 2.5).

<*R>, the Keyword Metaclass

TBNF SYNTAX

<*R> ::= <*I> | "#" <*I>

SEMANTICS

If the "RSWD" control card option is selected, a simple identifier may be used as a keyword, and cannot be re-declared by the compiler-user. The nominal state is the "SPWD" or special word option, where a "#" must precede the syntax-defined keyword identifier for it to have keyword meaning. The programmer may then choose freely the identifiers he uses with no fear of encountering reserved identifiers he did not know about.

<*N>, the Numeric Literal Metaclass

TBNF SYNTAX

FRACTIONAL PART> <*N> = list = | = * _ M it = * = "@" = [+|-]? _ n / ii _ ii \ ii = = [|]? | FIXED POINT> = *| list = [|]?

SEMANTICS

A numeric literal may be split across card images. The length of a numeric literal must satisfy the following limits:

  I + N < 62, if base is decimal
  I + N < 61, for non-decimal base,

where I represents the total number of integer and fractional digits, not counting the radix point, and N represents the number of digits in the normalized exponent, not counting the exponent delimiter. No blanks may be embedded within a numeric literal. The exponent part may not exceed the range ± 999. The base part must be in the range 2-36.

EXAMPLES

  INTEGER       1   0A3456 (must start with a decimal digit)
  FIXED POINT   1.   1.ABCDE   .34291
  REAL          023 1.043 24897320-728 77A34Q.9L70+3
  <*N>          ABCDE(16)   3.489023(12)   1011110001110(2)

<*S>, the String Literal Metaclass

TBNF SYNTAX

<*S> = [ but [" not "]]*

The syntax indicates that if the string quote appears within the body, it must be doubled. When the string is reduced to a BIGTAB entry, the redundant quote is deleted. The string literal must be completely contained in one card image, and may not exceed 48 characters (not counting redundant double string quotes).
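The quote-doubling rule can be illustrated with a short sketch of string literal assembly. It is written in Python, not taken from the STRINGET procedure, and the function name `scan_string`, its parameters, and the choice of `"` as the string quote are assumptions made for illustration. The 48-character limit is applied to the stored body, that is, after redundant quotes have been removed.

```python
def scan_string(card, start, quote='"', limit=48):
    """Return (body, next_index) for the string literal opening at card[start].

    A doubled quote inside the body is stored as a single quote.
    """
    if card[start] != quote:
        raise ValueError("no string literal at this column")
    body = []
    i = start + 1
    while i < len(card):
        ch = card[i]
        if ch == quote:
            if i + 1 < len(card) and card[i + 1] == quote:
                body.append(quote)        # doubled quote: keep one copy
                i += 2
                continue
            if len(body) > limit:
                raise ValueError("string literal exceeds 48 characters")
            return "".join(body), i + 1   # closing quote found
        body.append(ch)
        i += 1
    # A literal must be completely contained in one card image.
    raise ValueError("string literal not closed on this card image")


print(scan_string('X := "SAY ""HI"""; ', 5))
```

Here the stored body is SAY "HI", eight characters long, although the source text between the outer quotes occupies ten columns.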
EXAMPLES

INPUT (stored as "INPUT")
"EXPAND", "ILLIAC IV TRANSLATOR WRITING SYSTEM"

THE TWS MACRO FACILITY

TBNF SYNTAX

DEFINE list separator "," ";" <*I> <MACRO FORMAL PARAMETER PART> " = " [ but but DEFINE]* "[" list <*I> separator "," "]"| "(" list <*I> separator "," ")" "?"|MEND| <*I> <MACRO ACTUAL PARAMETER PART> "[" list separator "," "]"| ")" list <MACRO ACTUAL PARAMETER> separator "," ")" but. DEFINE but. ]* ::= ","

SEMANTICS

The limitation that a macro definition not contain "DEFINE" enforces the rule that macro declarations not be nested. Macro calls may exist either within the defined text, or in an actual parameter. The restriction exists that a macro may not contain a call on itself, either directly or indirectly.

The unbracketed comma is a recognition that an actual parameter may contain virtually any text. Specifically, if another macro call is in the actual parameter part, or a call on a procedure containing its own actual parameters delimited by commas, a way must be devised for defining the parameter delimiter comma. The definition that has been implemented is:

<UNBRACKETED COMMA> ::= A level zero comma

where, when the initial "[" or "(" is recognized, the level is set to zero, and is incremented by one for each subsequent "[" or "(" and decremented by one for each subsequent "]" or ")".

This concept could be expanded by modifying the procedure PROCESSMACROACTUALPARAMETERPART in a language-specific manner to accommodate such additional bracketing pairs that might occur in an actual parameter as INTEGER - ";" or REAL - ";".

The macro end symbol is set by a control card option "MACROEND X", where X may be "MEND" or a single special character. If "MEND" appears, then this keyword will mark the end of the macro definition. If a single special character appears, then that character will end the definition.

EXAMPLES

::= ; ::= BEGIN * list separator ";" END ; ::= [INTEGER | BOOLEAN | LABEL] [ list <*I> separator ","|; ::= [