University of Illinois Library at Urbana-Champaign. Digitized by the Internet Archive in 2013. http://archive.org/details/lingolreadablefo889kamp

UIUCDCS-R-77-889
UILU-ENG 77 1766

LINGOL: A Readable Formalism for Programming Language Semantics

by

Garry R. Kampen
Assistant Professor of Computer Science
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

September 1977

Abstract

This paper describes a metanotation for defining the syntax and semantics of a programming language in a rigorously formal manner. Definitions are operational: A semantic definition is a set of string transformation rules that operate on concrete representations of programs and their environments. The formalism is simple and easy to learn, and produces relatively readable language descriptions. To illustrate the formalism, and to facilitate comparison with other metalanguages, a formal definition of the simple programming language ASPLE is presented. The method is compared in detail with the W-grammar approach, and some techniques for verifying the consistency of definitions are discussed.

Outline

1. Introduction
2. Informal Description of LINGOL
3. Informal Description of ASPLE
4. Formal Definition of ASPLE
5. Evaluation
6. Verification
7. Summary

Figures

3.1. ASPLE Memory Structure
4.1. Syntax of ASPLE Programs
4.2. Syntax of ASPLE States
4.3. ASPLE Interpreter, Part I
4.4. ASPLE Interpreter, Part II
4.5. Expression Evaluation
4.6. Operators
4.7. Unary Functions
4.8. Transition Diagram
6.1. Derivation Tree
6.2. Domains of Productions
6.3. Internal Representation of Strings

1. Introduction

Although BNF and similar syntactic metanotations have found wide acceptance, the same cannot be said of formal means of specifying the semantics of programming languages. A surprising variety of semantic metanotations exists, and some of these have been used to define full-size programming languages, but none has achieved widespread use. This is due in part to the difficulty of learning the notation, and in part to the size, complexity, and sheer unreadability of the definitions themselves.

In this paper we describe a syntactic and semantic metanotation, LINGOL, which has several desirable properties:

- It is complete. Any language whose sentences are strings of symbols from a finite alphabet and whose semantics are definable by a Turing machine can be entirely defined in LINGOL.

- It is simple. A small number of familiar mathematical objects (sets, tuples, functions, and strings of symbols) are related by two kinds of production rule. Standard notational conventions are used wherever possible.

- It is readily adaptable to mechanical verification and processing. Definitions are operational, that is, they provide an algorithm for executing programs in the defined language.

To illustrate LINGOL, we will use it to define a simple programming language, ASPLE. The original definition of ASPLE is due to Cleaveland and Uzgalis [1]. Its use here is motivated by the fact that ASPLE has become a kind of benchmark for evaluating semantic formalisms. In a paper by Marcotty, Ledgard, and Bochmann [3], ASPLE is defined using four very different methods: W-grammars, Production Systems, the Vienna Definition Language, and Attribute Grammars. To facilitate comparison, we have followed the style of [3] in our definition where possible. Since LINGOL is most nearly related to W-grammars, the W-grammar definition in particular has been used as a model.
A longer example of LINGOL is given in [2], where a larger and more realistic programming language is defined. The language is interactive, block-structured, and self-extensible, and it contains a full complement of data and control structures. For a survey of other formal definition methods and an extensive bibliography, the reader is referred to [2] and [3].

The remainder of this paper is organized as follows: Sections 2 and 3 contain informal descriptions of LINGOL and ASPLE respectively. Section 4 contains a context-free grammar for ASPLE and its formal semantic definition in LINGOL. In section 5 we compare the LINGOL and W-grammar approaches to semantics by means of a detailed example, and in section 6 we discuss methods for verifying the consistency of formal definitions.

2. Informal Description of LINGOL

The LINGOL formalism consists of two metalanguages, L1 and L2. L1 is a syntactic metalanguage that resembles an extended version of BNF; L2 is a semantic metalanguage whose 'programs' resemble Markov algorithms or SNOBOL string transformation rules. L2 is used to define functions whose domain and range are strings or n-tuples of strings belonging to syntactic classes defined in L1.

We will illustrate LINGOL by defining the syntax and semantics of a very simple language consisting of arithmetic expressions. Its grammar is written in L1 as follows:

          Exp     -> Int Partexp
    x:    Partexp -> (Op Int)*
          Op      -> "+" | "="
          Int     -> Digit+
          Digit   -> "0" | "1" | ... | "9"

Informally, an expression is an integer followed by a partial expression. A partial expression is an operator-integer pair repeated zero or more times. An operator is either the symbol + or the symbol =, and an integer is a sequence of one or more digits.

Depending on the media used for presentation, strings of terminal symbols may be indicated by using italics, underlining them, or enclosing them in quotes. When quotes are used, the quote terminal symbol is represented by a pair of quotes.
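Because none of the classes in this small grammar is recursive, it can be transcribed directly into a regular expression. The Python sketch below is only an illustration and is not part of LINGOL; the names `EXP` and `is_exp` are hypothetical, chosen for this example.

```python
import re

# Each syntactic class of the example grammar becomes a regex fragment.
DIGIT = r"[0-9]"                      # Digit   -> 0 | 1 | ... | 9
INT = rf"{DIGIT}+"                    # Int     -> Digit+
OP = r"[+=]"                          # Op      -> + | =
PARTEXP = rf"(?:{OP}{INT})*"          # Partexp -> (Op Int)*
EXP = re.compile(rf"{INT}{PARTEXP}")  # Exp     -> Int Partexp


def is_exp(s: str) -> bool:
    """Return True if the whole string s belongs to the class Exp."""
    return EXP.fullmatch(s) is not None
```

Under these assumptions, `is_exp("4+4=8")` accepts the string, while `is_exp("4+")` rejects it because an operator must be followed by an integer.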
Nonterminal symbols are the names of syntactic classes. Adopting a standard mathematical convention, we capitalize class names and use the same name in lower case (possibly with an integer suffix) to denote a member of the class; thus expressions from the class Exp will be denoted by exp, exp1, exp2, and so on. In addition, the grammar above indicates that x will denote a terminal string in the class Partexp.

The semantics of our simple language can be defined in various ways. One possibility is to define a function F that maps every expression exp into its value. Using L2, we write

    F:    int              -> int
          int "+" int1     -> Sum(int, int1)
          int "=" int1     -> Compare(int, int1)
          int x op int1    -> F(F(int x) op int1)

F(exp) is evaluated by scanning the left-hand sides of the productions above, starting with the topmost production. When a match with the string exp is found, the expression on the right is evaluated. Sum and Compare are functions that take strings of digits as arguments and return a string of digits as a result. The function Compare returns 1 if its arguments denote the same integer, and 0 otherwise; for example:

    Compare(008, 8) = 1

Note that the last production will only be applied when exp contains two or more operators, as in the example below.

    F(4+4=8) = F(F(4+4)=8) = F(8=8) = 1

An alternative is to define the semantics of a language by an interpreter that executes programs in that language. The interpreter is defined by a function I that maps the current state s_i of the interpreter into its successor state s_{i+1}. Given an initial state s_0, I determines the computation s_0, s_1, ..., s_n; when I(s_n) is undefined, the computation is said to terminate and s_n is the terminal state of the computation. For our example language, states correspond to expressions and terminal states to integers.
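The top-to-bottom evaluation of F can be sketched in Python as follows. This is an illustration under stated assumptions rather than a LINGOL implementation: regular expressions stand in for the patterns, Python's greedy matching plays the role of a longest-match convention, `None` models an undefined result, and the names `sum_`, `compare`, `RULES`, and `f` are hypothetical.

```python
import re

INT = r"[0-9]+"
OP = r"[+=]"
PARTEXP = rf"(?:{OP}{INT})*"


def sum_(a: str, b: str) -> str:
    """Sum: add two digit strings and return a digit string."""
    return str(int(a) + int(b))


def compare(a: str, b: str) -> str:
    """Compare: '1' if both digit strings denote the same integer, else '0'."""
    return "1" if int(a) == int(b) else "0"


# Productions of F in their original order; the first match wins.
RULES = [
    (re.compile(rf"({INT})"), lambda m: m.group(1)),
    (re.compile(rf"({INT})\+({INT})"), lambda m: sum_(m.group(1), m.group(2))),
    (re.compile(rf"({INT})=({INT})"), lambda m: compare(m.group(1), m.group(2))),
    # int x op int1 -> F(F(int x) op int1); group 1 holds "int x".
    (re.compile(rf"({INT}{PARTEXP})({OP})({INT})"),
     lambda m: f(f(m.group(1)) + m.group(2) + m.group(3))),
]


def f(exp: str):
    """Evaluate exp with the first production whose pattern matches it."""
    for pattern, action in RULES:
        m = pattern.fullmatch(exp)
        if m:
            return action(m)
    return None  # no production matches: F(exp) is undefined
```

For instance, `f("4+4=8")` first reduces the left part `4+4` to `8` and then evaluates `8=8`, mirroring the derivation shown above.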
The function I is defined by

    I:    int "+" int1 x    -> Sum(int, int1) x
          int "=" int1 x    -> Compare(int, int1) x

The initial state s_0 = 4+4=8 determines the following computation:

    I(s_0) = s_1 = Sum(4, 4)=8 = 8=8
    I(s_1) = s_2 = Compare(8, 8) = 1

The definitions of the functions F and I illustrate the L2 component of LINGOL. The syntax and semantics of the L2 metalanguage are described below.

A description in L2 is a function name followed by a sequence of semantic productions of the form p -> e, where p is a pattern or an n-tuple (p_1, p_2, ..., p_n) of patterns and e is a string expression or an n-tuple of string expressions. A pattern is a sequence of string variables and strings of terminal symbols; a string expression is a sequence of string variables, terminal strings, and string-valued functions with string expressions as arguments.

To find the value F(x) of a function defined by a set of L2 productions, we match x with the left side of one of the productions and evaluate the right side. The values of string variables in the expression are determined by the pattern match. If no match can be found or if the string expression is undefined, F(x) is undefined.

The semantics and context-sensitive syntax of L2 are specified further by the five rules below. Rules (3) through (5) assure that a given set of productions determines at most one value F(x).

1. Every variable must be associated with a syntactic class.

2. If the same variable appears more than once on the left side, it must match the same string of symbols each time it occurs.

3. Every variable that appears on the right side must appear on the left side.

4. If more than one rule matches a given string, the first rule in sequence is chosen.

5. If a pattern matches a string x in more than one way, the parse that assigns the longest substring of x to the first (leftmost) pattern variable is chosen.
   If there are several such parses, the ones that assign the longest string to the second variable are selected, and so on until a unique binding of pattern variables to strings is obtained.

Patterns and tuples of patterns have the same semantics; in particular, rules (4) and (5) apply to an entire tuple and not to individual patterns within the tuple. For example, suppose we wish to evaluate Compare(0012, 012), where the function Compare is defined by

    Compare:    (zeros int, zeros2 int)    -> "1"
                (int, int2)                -> "0"

and the syntactic class Zeros is defined by

    Zeros -> "0"*

Both productions match, but by rule (4) the first one is selected, variable zeros is bound to 00 by rule (5), int matches both occurrences of 12 by rule (2), and zeros2 is consequently bound to 0. Evaluating the right side, we have Compare(0012, 012) = 1.

3. Informal Description of ASPLE

An ASPLE program consists of a sequence of declarations followed by a sequence of executable statements. Declarations serve to associate a 'mode' with each identifier used in the program. There are five types of statement: assignment statements, if-then-else conditionals, while-do loops, and input and output statements.

Statements contain expressions composed of Boolean and integer constants, identifiers, and the operators +, *, =, and ≠. The operators + and * placed between integer values represent addition and multiplication respectively; between Boolean values they represent the logical 'or' and 'and' operations. The operators = and ≠ take integer arguments and return a Boolean value. Every identifier used in an expression must appear in exactly one declaration. The example ASPLE program below is taken from [3].

    begin
        int X, Y, Z;
        input X;
        Y := 1;
        Z := 1;
        if (X ≠ 0) then
            while (Z ≠ X) do
                Z := Z + 1;
                Y := Y * Z
            end
        fi;
        output Y
    end

The program above reads a positive integer value X from an input file, computes its factorial, and prints the result Y on an output file.
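For readers more familiar with conventional languages, the control flow of the factorial program can be mirrored in Python. This is only a rough sketch: the function name `asple_factorial` is hypothetical, the input and output files are replaced by a parameter and a return value, and ASPLE's modes and references are not modelled.

```python
def asple_factorial(x: int) -> int:
    """Mirror the ASPLE factorial program statement by statement."""
    y = 1                  # Y := 1
    z = 1                  # Z := 1
    if x != 0:             # if (X ≠ 0) then
        while z != x:      #     while (Z ≠ X) do
            z = z + 1      #         Z := Z + 1;
            y = y * z      #         Y := Y * Z
    return y               # output Y
```

As in the ASPLE original, the loop terminates only for non-negative inputs, since Z counts upward from 1 until it equals X.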
Variables X, Y, and Z are declared to reference only integer values; their mode is thus reference-to-integer.

Just as integers can be assigned to variables of mode reference-to-integer, references to integers can be stored in variables of mode reference-to-reference-to-integer, and so on for as many levels of indirection as are desired. Consider the program below:

    begin
        ref ref int A, B;
        ref int C, D;
        int E;
        E := 50;
        C := E;
        A := C;
        D := A;
        input D;
        output D
    end

In this program the integer 50 is assigned to variable E, a reference to E is assigned to variable C, and a reference to C is assigned to variable A. Since D expects a value of mode reference-to-integer, A is 'dereferenced' twice and the resulting reference to E is stored in variable D. The input statement reads an integer value into the variable E, and the output statement prints the value of E. Assuming that the value input was 25, the final state of memory is as shown in Figure 3.1. B is undefined.

    [Figure 3.1. ASPLE Memory Structure]

The Boolean or integer constant obtained by repeatedly dereferencing a variable is called the 'primitive value' of that variable, and its mode is called the 'primitive mode' of the variable. In the example above, the primitive mode of variables A, C, D, and E is integer, and their primitive value is 25 at program termination.

An assignment statement is legal if

    (a) the right side is defined,

    (b) both sides have the same primitive mode, and

    (c) if n_L and n_R are the numbers of occurrences of 'reference-to' in the modes of the left and right sides respectively, then n_L - 1 ≤ n_R.

A legal assignment statement is executed as follows:

    (1) If the right side is a constant or identifier and n_L - 1 = n_R, then the value on the right is assigned to the variable on the left.
    (2) If the right side is an identifier and n_L - 1 < n_R, the identifier is dereferenced until a value is obtained whose mode contains n_L - 1 occurrences of 'reference-to,' and this value is assigned.

    (3) If the right side is an expression other than a constant or identifier, n_R = 0. Identifiers in the expression are replaced by their primitive values, the expression thus formed is evaluated, and the resulting constant is assigned.

In the example program below, the first three statements illustrate rules (1), (2), and (3) respectively; the last three statements are illegal since they violate conditions (a), (b), and (c) respectively.

    begin
        ref int C, D;
        int E, F, G;
        bool H;
        E := 10;      [n_L = 1, n_R = 0]
        F := E;       [n_L = 1, n_R = 1]
        G := (E);     [n_L = 1, n_R = 0]
        C := D;       [n_L = 2, n_R = 2]
        H := E;       [n_L = 1, n_R = 1]
        C := (E)      [n_L = 2, n_R = 0]
    end

The argument of an input statement must be an identifier whose primitive mode is the same as the mode of the next value to be input. The identifier is dereferenced to obtain a reference to a constant, and the input value is assigned to the referenced location. Expressions appearing in other statements are always evaluated or dereferenced to a constant as the first step in executing the statement.

4. Formal Definition of ASPLE

A formal grammar for ASPLE programs is given in Figure 4.1. It is an almost direct translation into L1 of the BNF grammar on page 195 of [3], except that two-word nonterminals have been renamed, productions [B18] and [B19] are slightly changed, and some compression of the grammar has been achieved by using the Kleene * and + operators.

Since we intend to define the semantics of ASPLE by means of an interpreter, we need to extend the ASPLE grammar to include a definition of the class of computational states. The definition consists of the productions [B23] through [B35] in Figure 4.2.
[B36] through [B43] are stand-alone productions that define syntactic classes used by the interpreter but not by other syntax rules. In particular, [B38] through [B43] define implementation-dependent restrictions on the length of programs, memory, integer constants, identifiers, and files.

For convenience in defining classes of fixed-length strings, we let N*k denote a sequence of k instances of the syntactic class N. Thus Digit*10 is the class of all 10-digit integers, and Digit*10 Digit+ is the class of all integers with more than 10 digits.

Interpreter States

A state of the ASPLE interpreter is represented by a program or a sequence of declarations and statements, followed by a snapshot that describes the current contents of memory and of the input and output files associated with every program. Memory also serves as a symbol table: Each entry includes the mode of an identifier as well as its contents. An identifier may be undefined, or may contain (refer to) a Boolean constant, an integer constant, or a reference to another identifier, as in the example below:

    memory;
    A ref bool undefined;
    B ref int 12;
    C ref ref int B;

          [B01] Program     -> "begin" Decls ";" Stmts "end"

    [Declaration]
          [B02] Decls       -> (Declaration ";")* Declaration
          [B03] Stmts       -> (Statement ";")* Statement
          [B04] Declaration -> Mode Idlist
          [B05] Mode        -> "bool" | "int" | "ref" Mode
          [B06] Idlist      -> (Id ",")* Id

    [Statements]
          [B07] Statement   -> Assignment | Conditional | Loop | Transput
          [B08] Assignment  -> Id ":=" Exp
          [B09] Conditional -> "if" Exp "then" Stmts "fi"
                             | "if" Exp "then" Stmts "else" Stmts "fi"
          [B10] Loop        -> "while" Exp "do" Stmts "end"
          [B11] Transput    -> "input" Id | "output" Exp

    [Expressions]
          [B12] Exp         -> Factor | Exp "+" Factor
          [B13] Factor      -> Primary | Factor "*" Primary
          [B14] Primary     -> Id | Constant | "(" Exp ")" | "(" Compare ")"
          [B15] Compare     -> Exp "=" Exp | Exp "≠" Exp

    [Constants and Identifiers]
          [B16] Constant    -> Bool | Int
          [B17] Bool        -> "true" | "false"
          [B18] Int         -> Digits Digit
          [B19] Digits      -> Digit*
          [B20] Digit       -> "0" | "1" | ... | "9"
          [B21] Id          -> Letter+
          [B22] Letter      -> "A" | "B" | ... | "Z"

    Figure 4.1. Syntax of ASPLE Programs

    [States]
          [B23] State       -> Initial | Declaring | Executing | Final
          [B24] Initial     -> Program Snap
          [B25] Declaring   -> Decls ";" Stmts ";" Snap
          [B26] Executing   -> Stmts ";" Snap
          [B27] Final       -> Snap | Lexemes "error" Lexemes
          [B28] Snap        -> "memory" ";" Loc* "infile" Record* "outfile" Record*
          [B29] Record      -> Constant ";"
          [B30] Loc         -> Id Mode Box ";"
          [B31] Box         -> Val | "undefined"
          [B32] Val         -> Id | Constant
    x,y,z: [B33] Lexemes    -> (Box | Operator | Keyword | Mode)*
          [B34] Operator    -> ";" | ":=" | "+" | "*" | "=" | "≠" | "(" | ")" | ","
          [B35] Keyword     -> "if" | "then" | "else" | "fi" | "while" | "do" | "end"
                             | "input" | "output" | "memory" | "infile" | "outfile"
          [B36] Zero        -> "0"*
          [B37] Con         -> Constant | "undefined"

    [Limitations]
          [B38] Longprogram -> Lexeme*10000 Lexeme+
          [B39] Longmemory  -> Loc*2000 Loc+
          [B40] Longint     -> Digit*10 Digit+
          [B41] Longid      -> Letter*6 Letter+
          [B42] Maxint      -> "4095"
          [B43] Longfile    -> Record*500 Record+

    Figure 4.2. Syntax of ASPLE States

In the W-grammar for ASPLE the same state of memory would be represented by

    memory
    loc A has ref bool refers undefined end
    loc B has ref int refers 12 end
    loc C has ref ref int refers B end

We have chosen an abbreviated representation of memory in the belief that a generally useful formal definition should be tied closely to concrete programs and to concrete representations of memory of the sort that might be generated by a symbolic dump routine. Such a representation ought to be both compact and syntactically similar to the programs it accompanies.

A compact state description permits example computations that are not excessively bulky. For example, the execution of the ASPLE program

    begin int X; X := 0 end

is represented by the following sequence of states:

    [S1]  begin int X; X := 0 end memory; infile outfile
    [S2]  int X; X := 0; memory; infile outfile
    [S3]  X := 0; memory; X ref int undefined; infile outfile
    [S4]  memory; X ref int 0; infile outfile

The strings [S1], [S2], [S3], and [S4] belong respectively to the subsets Initial, Declaring, Executing, and Final of the set of states.

Interpreter Definition

The interpreter for ASPLE is defined by a state transition function I and seven auxiliary functions E, Plus, Times, Equal, Unequal, Suc, and Pred. The last two are the successor and predecessor functions for the class of non-negative integers. E is an expression evaluator, and the other functions define the ASPLE operators +, *, =, and ≠. The domains and ranges of the functions are as follows:

    I:          State - Final    -> State
    E:          (Exp, State)     -> Con
    Plus:       (Con, Con)       -> Con
    Times:      (Con, Con)       -> Con
    Equal:      (Con, Con)       -> Con                (4.1)
    Unequal:    (Con, Con)       -> Con
    Suc:        Int              -> Int
    Pred:       Int - Zero       -> Int

Con is defined by [B37] as the class of integer and Boolean constants together with undefined; when arithmetic overflow occurs, or a binary operator is supplied the wrong arguments, or an undefined identifier is used in an expression, the result undefined is passed through the expression evaluation process and ultimately returned by E.

The definition of I consists of the semantic productions [101] through [129] displayed in Figures 4.3 and 4.4. The definition of function E is shown in Figure 4.5; Figure 4.6 contains the definitions of Plus, Times, Equal, and Unequal; and Figure 4.7 contains the definitions of Suc and Pred. We will consider each of these definitions in turn.

In the definition of I, productions [101] through [105] serve to enforce implementation-dependent limitations on ASPLE programs. [101] through [104] cause a transition to an error state when a 'compile time' error is detected: excessive program length, too many declarations, an oversize constant or identifier. [101], [103], and [104] apply only to the initial state, but [102] may be invoked at any time while declarations are being processed.
[105] causes a transition to an error state when the output file overflows during execution.

    I: [Interpreter]

    [Limitations]
    [101] longprog snap            -> "error PROGRAM TOO LONG"
    [102] x longmemory y           -> "error EXCESSIVE MEMORY REQUIRED"
    [103] "begin" x longint y      -> "error OVERSIZE INTEGER"
    [104] "begin" x longid y       -> "error IDENTIFIER TOO LONG"
    [105] x "outfile" longfile     -> "error OUTPUT FILE OVERFLOW"

    [Declarations]
    [106] "begin" decls ";" stmts "end" x
              -> decls ";" stmts ";" x
    [107] mode id "," idlist ";" x
              -> mode id ";" mode idlist ";" x
    [108] mode id ";" x ";" id mode2 y
              -> x ";" id mode2 y "error" id "ALREADY DECLARED"
    [109] mode id ";" x "memory" ";" y
              -> x "memory" ";" id "ref" mode "undefined" ";" y

    [Assignment]
    [110] id ":=" int ";" x ";" id "ref int" box y
              -> x ";" id "ref int" int y
    [111] id ":=" bool ";" x ";" id "ref bool" box y
              -> x ";" id "ref bool" bool y
    [112] id ":=" id2 ";" x ";" id "ref" mode box y ";" id2 mode val z
              -> x ";" id "ref" mode id2 y ";" id2 mode val z
    [113] id ":=" id2 ";" x ";" id2 mode val y ";" id "ref" mode box z
              -> x ";" id2 mode val y ";" id "ref" mode id2 z
    [114] id ":=" id2 ";" x ";" id2 mode val y
              -> id ":=" val ";" x ";" id2 mode val y
    [115] id ":=" box ";" x
              -> x "error ILLEGAL ASSIGNMENT" id ":=" box
    [116] id ":=" exp ";" x
              -> id ":=" E(exp, x) ";" x

    Figure 4.3. ASPLE Interpreter, Part I

    [Conditions]
    [117] "if true then" stmts "fi" ";" x
              -> stmts ";" x
    [118] "if false then" stmts "fi" ";" x
              -> x
    [119] "if true then" stmts "else" stmts2 "fi" ";" x
              -> stmts ";" x
    [120] "if false then" stmts "else" stmts2 "fi" ";" x
              -> stmts2 ";" x
    [121] "if" con "then" x
              -> x "error ILLEGAL CONDITIONAL"
    [122] "if" exp "then" x
              -> "if" E(exp, x) "then" x

    [Loops]
    [123] "while" exp "do" stmts "end" ";" x
              -> "if" exp "then" stmts ";" "while" exp "do" stmts "end" "fi" ";" x

    [Transput]
    [124] "input" id ";" x ";" id mode id2 ";" y
              -> "input" id2 ";" x ";" id mode id2 ";" y
    [125] "input" id ";" x "infile" constant ";" y
              -> id ":=" constant ";" x "infile" constant ";" y
    [126] "input" id ";" x
              -> x "error ATTEMPT TO READ EMPTY FILE"
    [127] "output" constant ";" x
              -> x constant ";"
    [128] "output undefined" ";" x
              -> x "error OUTPUT UNDEFINED"
    [129] "output" exp ";" x
              -> "output" E(exp, x) ";" x

    Figure 4.4. ASPLE Interpreter, Part II

    E: [Expression Evaluation]
    [E1] (exp "+" factor, x)            -> Plus(E(exp, x), E(factor, x))
    [E2] (factor "*" primary, x)        -> Times(E(factor, x), E(primary, x))
    [E3] (id, x ";" id mode val ";" y)  -> E(val, x ";" y)
    [E4] (id, x)                        -> "undefined"
    [E5] (constant, x)                  -> constant
    [E6] ("(" exp ")", x)               -> E(exp, x)
    [E7] ("(" exp "=" exp2 ")", x)      -> Equal(E(exp, x), E(exp2, x))
    [E8] ("(" exp "≠" exp2 ")", x)      -> Unequal(E(exp, x), E(exp2, x))

    Figure 4.5. Expression Evaluation

    Plus: [Addition and Boolean 'or']
    [P1] (int, zero)          -> int
    [P2] (maxint, int)        -> "undefined"
    [P3] (int, int2)          -> Plus(Suc(int), Pred(int2))
    [P4] ("false", "false")   -> "false"
    [P5] (bool, bool2)        -> "true"
    [P6] (con, con2)          -> "undefined"

    Times: [Multiplication and Boolean 'and']
    [T1] (int, zero)          -> "0"
    [T2] (int, digit)         -> Plus(Times(int, Pred(digit)), int)
    [T3] (int, digits digit)  -> Plus(Times(int "0", digits), Times(int, digit))
    [T4] ("true", "true")     -> "true"
    [T5] (bool, bool2)        -> "false"
    [T6] (con, con2)          -> "undefined"

    Equal: [Compare Integers for Equality]
    [EQ1] (zero int, zero2 int)  -> "true"
    [EQ2] (int, int2)            -> "false"
    [EQ3] (con, con2)            -> "undefined"

    Unequal: [Compare Integers for Inequality]
    [U1] (zero int, zero2 int)   -> "false"
    [U2] (int, int2)             -> "true"
    [U3] (con, con2)             -> "undefined"

    Figure 4.6. Operators

    Suc: [Successor Function]
    [S01] digits "0"    -> digits "1"
    [S02] digits "1"    -> digits "2"
      ...
    [S09] digits "8"    -> digits "9"
    [S10] "9"           -> "10"
    [S11] int "9"       -> Suc(int) "0"

    Pred: [Predecessor Function]
    [PR01] digits "1"   -> digits "0"
    [PR02] digits "2"   -> digits "1"
      ...
    [PR09] digits "9"   -> digits "8"
    [PR10] "10"         -> "9"
    [PR11] int "0"      -> Pred(int) "9"

    Figure 4.7. Unary Functions

The remaining productions are organized to reflect the structure of the grammar. [106] initializes the execution process by reducing the program to a sequence of declarations and statements. Other productions operate in one of two ways: They remove a declaration or statement from the left of the sequence, execute it, and modify the snapshot accordingly, or they replace a declaration or statement with equivalent ASPLE code to which another production applies. Both modes of operation are illustrated by the sample computation below. The productions used are listed in their order of application.
             begin int X,X end memory; infile outfile
    [106]    int X,X; memory; infile outfile
    [107]    int X; int X; memory; infile outfile
    [109]    int X; memory; X ref int undefined; infile outfile
    [108]    memory; X ref int undefined ... error ...

The order of productions in a definition may be significant. For example, the domain of [108] is a subset of the domain of [109]; if these two productions were interchanged the second one would never be applied and redundant declarations would not be detected.

Productions [110] through [129] define the semantics of statement execution. Since the definition of the assignment statement is the most complex, we will discuss it at some length; the remaining definitions will be left to the reader.

The productions that define assignment are arranged in three groups that correspond to the three cases of the informal description of assignment in section 3:

    (1) If the right side of the assignment is a constant or identifier and n_L - 1 = n_R, then the value on the right is stored in the variable on the left.

        a) Integer constants are stored by [110].
        b) Boolean constants are stored by [111].
        c) Identifiers that are defined are stored by [112] or [113].

    (2) If the right side is an identifier that has been defined but does not satisfy (1), [114] is applied to replace the right side with its value. If the resulting assignment statement still fails to satisfy (1), [114] will dereference the right side again, and this will continue until n_L - 1 = n_R or the right side is a constant.

    (3) When the right side is an expression other than a constant or identifier or undefined, [116] causes the expression to be replaced by its value (or undefined).

Single-item expressions that fail to satisfy (1) or (2) are intercepted by [115], which generates an appropriate error message.

The process of executing an assignment statement is represented by the state transition graph in Figure 4.8.
Each state in the diagram corresponds to the set of interpreter states matched by one of the productions [110] through [116]. For example, if we are in state 16, production [116] will be applied and the resulting interpreter state will belong to state 10, 11, or 15 of the diagram. Either [110], [111], or [115] will be applied subsequently.

From the graph it is easy to verify that an assignment statement will eventually be processed. The only circular path passes through state 14, and every time [114] is applied n_R decreases. When n_R = 0 another state is reached and processing is complete.

    [Figure 4.8. Transition Diagram]

We will complete our discussion of assignment by providing a detailed example of the operation of [112]. The first step is shown below: a string that represents the current state of the interpreter has been matched with the left side of [112]. The binding of pattern variables to substrings is indicated by vertical alignment.

    id  ":="  id2  ";"  x       ";"  id  "ref"  mode     box  y  ";"  id2  mode     val  z
    A    :=   B     ;   memory   ;   A    ref   ref int  G        ;   B    ref int  9    ; ...

The second step is to evaluate the right side of [112] using the bindings above. The result is given on the second line:

    x       ";"  id  "ref"  mode     id2  y  ";"  id2  mode     val  z
    memory   ;   A    ref   ref int  B        ;   B    ref int  9    ; ...

If A and B had appeared in memory in the reverse order, [113] would be used instead. Note that B must be defined for assignment to take place; if B were undefined the match would fail, since the pattern variable val cannot take undefined as a value. Note also that the variable y is bound to the empty string because A and B occupy adjacent locations in memory.

Some care must be taken in writing semantic productions. For example, if the second and third instances of ; were omitted from the pattern, the following situation could occur:

    id  ":="  id2  ";"  x         id  "ref"  mode     box  y  id2 mode
    A    :=   B     ;   memory ; PA    ref   ref int  G    ; MB   ref int ...
In this example, the statement A := B assigns B to the variable PA if MB is defined, regardless of the mode or status of A and B.

Auxiliary Functions

The auxiliary functions defined in Figures 4.5, 4.6, and 4.7 require fewer productions than I but make use of recursion. To prove that the recursion terminates is not difficult: We simply note that E is applied to fewer symbols at each successive call, and that the second argument of Plus and Times is decremented at each call until it reaches zero and a value is returned. Note that production [E3] is applied repeatedly to obtain the primitive value of an identifier; since the semantics of ASPLE rule out circular chains of pointers, the declaration of id need not be passed on to the next call of E.

5. Evaluation

A number of criteria for evaluating formal definition techniques are proposed in [3]. In particular, the authors point out that an important measure of a formal definition technique is its ability to provide the answer to detailed questions about the language it describes. A sample question is posed and each of four definitions is used to answer it. For purposes of comparison, we will show how the same question is answered by the definition in section 4. The remainder of this section is a detailed comparison of the LINGOL and W-grammar approaches to language definition.

A question that might be posed about ASPLE is: In the example program below, is the assignment of an integer constant to the variable X valid?

    begin ref int X; X := 2 end

To answer the question, we execute the program starting with the initial state

    begin ref int X; X := 2 end memory; ...

We ignore the input and output files since they are not used. Applying the interpreter productions [106] and [109] we obtain successively the states

    ref int X; X := 2; memory; ...
    X := 2; memory; X ref ref int undefined; ...

Now we examine the productions for assignment.
[110] does not apply, since it requires a mode of ref int; the next rule that admits an integer on the right of the assignment is [115], and applying it we obtain the error state below. The assignment is clearly invalid.

    memory; X ref ref int undefined; ... error ILLEGAL ASSIGNMENT X := 2

Comparison with W-grammars

As with LINGOL, the W-grammar method is based upon strings of symbols and rewrite rules, and this similarity suggests that a comparison between the two will be especially meaningful. An obvious comparison can be made by counting rewrite rules; if we do so we find that the LINGOL definition requires 43 syntax productions and a total of 77 semantic productions, while the W-grammar definition in [3] requires 38 context-free productions (metaproductions) and 100 additional productions (hyperrules), not including the 22 productions of a standard context-free syntax for ASPLE.

This comparison is overly simplistic for several reasons. First, the semantic productions differ greatly in complexity; one fairly elaborate hyperrule can be equivalent to several simpler LINGOL productions, and both descriptions contain sequences of trivial productions. Second, there are differences of style as well as notation. The authors of the W-grammar definition have attempted to separate the context-sensitive and semantic aspects of ASPLE; in the LINGOL definition they are intertwined. A more fundamental difference is that the LINGOL definition is operational while the W-grammar definition is essentially axiomatic: In effect, a computation must be deduced from a set of relations rather than generated by an algorithm.

We will compare the two methods by applying them both to a simple class of expressions. First, however, we must lay the notational groundwork for a description of W-grammars and their semantics.

We begin by using L1 to define another syntactic metalanguage W.
The nonterminal and terminal symbols of grammars in W are defined as follows: Any non-empty sequence of lower-case letters followed by a comma is a nonterminal; the symbols 0, T, F, +, ( and ) are terminal symbols.

    W           -> Production+
    Production  -> N '->' Form
    x,y,z: Form -> (N | T)*                                    (5.1)
    n: N        -> ('a' | 'b' | ... | 'z')+ ','
    T           -> '0' | 'T' | 'F' | '+' | '(' | ')'

A sentence of W is shown below. Since we choose to regard it as a grammar rather than a character string, it is not underlined and spaces are inserted for readability. It defines a language containing two kinds of expressions, integer and Boolean; the type of an expression is the same as the type of its operands.

    exp,     -> intexp,
    exp,     -> boolexp,
    intexp,  -> ( intexp, + intexp, )
    boolexp, -> ( boolexp, + boolexp, )                        (5.2)
    intexp,  -> 0
    boolexp, -> T
    boolexp, -> F

Two integer expressions and a Boolean expression are shown below:

    0    (0+0)    (T+(F+T))

Now we introduce a new metanotation consisting of sequences of productions of the form p => e, where p and e have the same syntax as L2 patterns and expressions. A sequence of these productions defines a binary relation on strings rather than a function, since a string may have any number of successors. Only the first two of the defining rules for L2 must be satisfied: (1) Every variable in p or e must be associated with a syntactic class. (2) A variable must match the same string of symbols each time it occurs.

We can use these productions to model the semantics of various classes of grammars, including W-grammars and grammars in L1. For example, the meaning of grammar (5.2) is defined by the relation R1 below.

    R1: exp,     => intexp,
        exp,     => boolexp,
        intexp,  => ( intexp, + intexp, )
        boolexp, => ( boolexp, + boolexp, )                    (5.3)
        intexp,  => 0
        boolexp, => T
        boolexp, => F

The relation R1 determines a larger relation D1 (derives) defined by

    x n z  D1  x y z    iff    n R1 y

where n is a nonterminal from the class N defined in (5.1) and x, y, and z are members of Form. The class of expressions defined by (5.
3) is the set of terminal strings derivable from exp, that is, the set of strings y such that y ∈ T* and

    exp, D1 x1 D1 x2 D1 ... D1 xn D1 y

for some x1, ..., xn in Form. For example,

    exp, D1 intexp, D1 (intexp, + intexp,) D1 (0 + intexp,) D1 (0+0)

Because our new metanotation admits string-valued variables as well as string literals, we can give a somewhat more compact definition of the relation R1, as follows:

    Intbool -> int | bool

    R1: exp,         => intbool exp,
        intbool exp, => ( intbool exp, + intbool exp, )        (5.4)
        intexp,      => 0
        boolexp,     => T | F

This is an example of a two-level grammar or W-grammar. The first-level grammar defines a set Intbool of modes, and the second-level grammar uses the variable intbool to avoid writing a production for each mode of expression. Since the set Intbool could have been defined to contain an infinite number of modes, we see that a two-level grammar can be used to represent an infinite number of context-free productions. The last production uses our standard abbreviation for two productions having the same left side.

The same two-level grammar expressed in a more standard notation is shown below. We have followed the lead of [3] in using '+' instead of 'plus symbol' to denote the terminal symbol +, and similar abbreviations for the other terminals.

    INTBOOL :: int; bool.
    exp: INTBOOL exp.
    INTBOOL exp: (, INTBOOL exp, +, INTBOOL exp, ).
    int exp: 0.
    bool exp: T; F.

We can use a two-level grammar to define the semantics of expressions as well as their syntax, but to do so we must adopt a different strategy. The first-level grammar will be used to generate an infinite set of nonterminals that includes as a proper subset an encoding of all legal expressions. For example, the expression (0+0) is encoded as the nonterminal int left zero plus zero right,.
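The derivation relation D1 for grammar (5.2) can be made concrete with a small search over forms. The following is only a sketch under stated assumptions: the names R1, d1_successors, and terminal_strings, the representation of a form as a tuple of symbols, and the length bound used to keep the search finite are all choices made for this illustration and are not part of the metanotation.

```python
from collections import deque

# Relation R1 for grammar (5.2): each nonterminal maps to its alternative forms.
# A form is a tuple of symbols; nonterminals keep their trailing comma.
R1 = {
    "exp,": [("intexp,",), ("boolexp,",)],
    "intexp,": [("(", "intexp,", "+", "intexp,", ")"), ("0",)],
    "boolexp,": [("(", "boolexp,", "+", "boolexp,", ")"), ("T",), ("F",)],
}


def d1_successors(form):
    """Yield every form x y z such that form = x n z and n R1 y (one D1 step)."""
    for i, symbol in enumerate(form):
        for replacement in R1.get(symbol, ()):
            yield form[:i] + replacement + form[i + 1:]


def terminal_strings(start, max_symbols):
    """Terminal strings derivable from start via forms of at most max_symbols symbols.

    No production shortens a form, so the bound loses no string of that length.
    """
    seen = {start}
    queue = deque([start])
    found = set()
    while queue:
        form = queue.popleft()
        if not any(symbol in R1 for symbol in form):
            found.add("".join(form))  # no nonterminal left: a terminal string
            continue
        for successor in d1_successors(form):
            if len(successor) <= max_symbols and successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return found
```

Calling terminal_strings(("exp,",), 5) yields the expressions of at most five symbols, including (0+0); a mixed expression such as (0+T) never appears, because no form derivable from exp, contains both intexp, and boolexp,.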
As before, we define a relation between nonterminals and forms (R2), and extend it to a derivation relation (D2); but this time the set of terminal strings derivable from the nonterminal exp, is the set of expressions with their values. In the example derivation below, the initial choice of nonterminals permits the derivation of the terminal string 0 0; hence 0 is an expression and 0 is its value.

    exp, D2 int zero, int zero, eval zero giving zero, D2 ... D2 0 0

The first-level grammar and the second-level rules (called hyperrules) that generate the set of legal expressions are given below.

    Intbool -> int | bool
    Exp     -> left Exp plus Exp right | Value
    Value   -> zero | true | false

    R2: exp, => intbool exp, intbool value, eval exp giving value,
        intbool left exp1 plus exp2 right,
             => ( intbool exp1, + intbool exp2, )              (5.5)
        int zero,   => 0
        bool true,  => T
        bool false, => F

The semantics of expression evaluation is defined by the additional hyperrules given below. These rules ensure that a nonterminal of the form eval exp giving value, will derive a terminal string (the empty string) only when intbool exp, derives an expression and intbool value, derives the value of that expression.

    eval value giving value, =>
    eval left exp1 plus exp2 right giving value,
         => eval exp1 giving value2, eval exp2 giving value3,
            where value equals value2 plus value3,             (5.6)
    where zero equals zero plus zero,    =>
    where true equals true plus value,   =>
    where true equals value plus true,   =>
    where false equals false plus false, =>

Notice that most of the hyperrules above derive the empty string.

To illustrate their use, we sketch the derivation of the expression (0+0) with value 0:

    exp, D2 int left zero plus zero right, int zero,
            eval left zero plus zero right giving zero,
         D2 ... D2 (0+0) 0 eval zero giving zero, eval zero giving zero,
            where zero equals zero plus zero,
         D2 ... D2 (0+0) 0

The first two hyperrules (5.6) are equivalent to the two axioms below.
The statement Eval(value) = value is true because the nonterminal eval value giving value, generates the empty string. The left and right sides of the second axiom are logically equivalent because the left side of the corresponding hyperrule generates the empty string only when the right side does.

    Eval(value) = value
    Eval(left exp1 plus exp2 right) = value  iff  Eval(exp1) = value2 and      (5.7)
                                                  Eval(exp2) = value3 and
                                                  Plus(value2, value3) = value

The axioms (5.7) provide a recursive definition of the string-valued function Eval. The same definition written in L2 would look like this:

    Eval: value -> value
          left exp1 plus exp2 right -> Plus(Eval(exp1), Eval(exp2))

A syntactic and semantic definition equivalent to the W-grammar definition in (5.5) and (5.6) is given below using L1 and L2. In this case Eval operates on concrete rather than abstract or encoded expressions, so its definition is somewhat shorter.

    Exp   -> ( Exp + Exp ) | Value
    Value -> 0 | Bool
    Bool  -> T | F

    Eval: value -> value
          ( exp1 + exp2 ) -> Plus(Eval(exp1), Eval(exp2))

    Plus: (0, 0)    -> 0
          (T, bool) -> T
          (bool, T) -> T
          (F, F)    -> F

The class of legal expressions is defined to be the subset of Exp whose members are mapped to values by the function Eval. To enable the function Plus to discriminate between legal and illegal expressions, a production for the class Bool of Booleans has been included in the first-level grammar.

6. Verification

Because formal language definitions tend to be large and complex, and because they presently must be checked by hand rather than by a compiler or interpreter, typographic and logical errors have a way of creeping in and remaining undetected. Clearly, it is important to identify those aspects of a definition that can be checked in a routine manner, and to develop mechanical means of verification wherever possible. For LINGOL, several forms of verification are possible.

An obvious first step is to verify that a definition is well-formed.
It must satisfy the context-free syntax of L1 and L2 and the context-sensitive restrictions given in rules (1) and (3) of section 2: Every variable in an L2 production must be defined by an L1 production, and every variable on the right of an L2 production must also appear on the left.

As a second step, we can attempt to verify that functions have the expected domain and range; see (4.1). In checking a function we make use of the properties of other functions. For example, the assertion that the range of E is the set Con rests on the assertion that Con is the range of the functions Plus, Times, Equal, and Unequal.

The domain of E is actually a superset of (Exp, State), namely the set (Exp, Lexemes). To verify this, we note that the domains of the productions [E1] through [E8] correspond to the leaf nodes of a tree generated from (Exp, Lexemes) by applying the syntax productions [B12] through [B15] (see figure 6.1). Since the grammar for expressions is unambiguous, the leaves of the tree form a partition of (Exp, Lexemes). Production [E3] is omitted since its domain is a subset of the domain of [E4].

    (Exp, Lexemes)
      (Exp + Factor, Lexemes)                [E1]
      (Factor, Lexemes)
        (Factor * Primary, Lexemes)          [E2]
        (Primary, Lexemes)
          (Id, Lexemes)                      [E4]
          (Constant, Lexemes)                [E5]
          ((Exp), Lexemes)                   [E6]
          ((Compare), Lexemes)
            ((Exp = Exp), Lexemes)           [E7]
            ((Exp ≠ Exp), Lexemes)           [E8]

    Figure 6.1. Tree of Alternative Derivations

We can compute the domains of a set of productions by taking their left sides and replacing each variable with the name of the syntactic class it denotes.
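The substitution of class names for variables is mechanical, as the following sketch suggests. It rests on assumptions that go beyond the text: a left side is represented as a tuple whose items are either literal strings or Var objects, and an undeclared variable is taken to denote the class whose name is the variable's stem, capitalized (exp1 and exp2 denote Exp, value denotes Value). The names Var, class_of, and domain are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable occurring in the left side of a production."""
    name: str


def class_of(var, declared=None):
    """Name of the syntactic class a variable denotes.

    Explicit declarations (as in "x, y, z: Form") take precedence; otherwise
    the class is assumed to be the variable's stem, capitalized.
    """
    declared = declared or {}
    if var.name in declared:
        return declared[var.name]
    return var.name.rstrip("0123456789").capitalize()


def domain(left_side, declared=None):
    """Replace each variable in a left side with the name of its class."""
    return tuple(
        class_of(item, declared) if isinstance(item, Var) else item
        for item in left_side
    )
```

For the left side ( exp1 + exp2 ) of the concrete Eval production in section 5, domain returns ( Exp + Exp ); comparing such tuples for two productions is a first, purely syntactic step toward deciding whether their domains are disjoint or nested.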
If we express the result as a Venn diagram like the ones shown in figure 6.2, it is easy to determine which productions can be:

- interchanged without affecting the definition (those with disjoint domains);
- removed without affecting the definition (those whose domains are contained in the domain of an earlier production);
- removed without changing the domain of the function (those whose domains are contained in the domain of a later production).

    Figure 6.2. Domains of Productions (Venn diagram of the domains of productions E1 through E8)

Some of the information in figure 6.2 can be mechanically generated (or verified) using a tree of derivations like the one in figure 6.1. The fact that productions [I12] and [I13] have disjoint domains cannot, since it depends on the context-sensitive property of ASPLE that no variable can be declared twice (and thus id cannot both precede and follow id2 in memory).

Having computed the domains and ranges of the defining productions for the interpreter, we can construct a transition diagram resembling the one in figure 4.8. Transition diagrams are a useful abstraction that reveals properties of both the definition (for example, the fact that while and input are defined in terms of if and assignment) and the defined language (for example, the fact that assignment can never be a non-terminating computation).

An important property of a definition is locality: It is easier to trace the execution of a statement through the definition if the productions involved are closely grouped, preferably on the same page of the defining document. We can use a transition diagram to identify rules that should be rearranged, and a Venn diagram to determine if the rearrangement is possible.

Since LINGOL descriptions are operational definitions, they can be used to guide the execution of an example program for the language being defined.
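As a small illustration of executing a definition directly, the concrete Eval and Plus productions given at the end of section 5 can be transcribed almost line for line. This is a sketch, not the SNOBOL translation mentioned in this section: the function names eval_exp, plus, and split_sum are hypothetical, spaces are ignored, and None is used here to stand for "no production applies".

```python
BOOLS = ("T", "F")
VALUES = ("0",) + BOOLS


def split_sum(s):
    """Split '(e1+e2)' at its top-level '+'; return None if s lacks that shape."""
    if len(s) < 5 or s[0] != "(" or s[-1] != ")":
        return None
    depth = 0
    for i in range(1, len(s) - 1):
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
            if depth < 0:
                return None
        elif s[i] == "+" and depth == 0:
            return s[1:i], s[i + 1:-1]
    return None


def plus(a, b):
    """Plus: (0, 0) -> 0; (T, bool) -> T; (bool, T) -> T; (F, F) -> F."""
    if a == "0" and b == "0":
        return "0"
    if a == "T" and b in BOOLS:
        return "T"
    if a in BOOLS and b == "T":
        return "T"
    if a == "F" and b == "F":
        return "F"
    return None  # no production applies: the operands have different modes


def eval_exp(s):
    """Eval: value -> value; ( exp1 + exp2 ) -> Plus(Eval(exp1), Eval(exp2))."""
    s = s.replace(" ", "")
    if s in VALUES:
        return s
    parts = split_sum(s)
    if parts is None:
        return None
    left, right = eval_exp(parts[0]), eval_exp(parts[1])
    if left is None or right is None:
        return None
    return plus(left, right)
```

Under these assumptions eval_exp("(T+(F+T))") returns "T", while eval_exp("(0+T)") returns None, which corresponds to the expression lying outside the class of legal expressions.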
If assignments actually assign and loops really loop, we have some additional assurance that the definition describes the language we intended. The process of executing test programs and generating sample computations can be mechanized, and in fact this has been done in a limited way. A portion of the definition of SIBYL was transformed into a SNOBOL program which was then applied to some sample computations; as a result several errors were detected in the original definition.

Not surprisingly, the SNOBOL implementation was extremely inefficient. We can do much better by building an interpreter or compiler for LINGOL definitions that takes advantage of their structure. For example, the search for a matching production can be greatly speeded up if, for each production, we examine the parse tree(s) of the current state rather than the underlying character string. If productions are implemented as transformations on parse trees, we can minimize the amount of parsing and string manipulation required.

We can also make use of the fact that productions can be mapped onto a state transition diagram. We need not scan all 29 productions of the ASPLE Interpreter (or the 108 productions that define SIBYL); instead we can limit the matching process at each cycle to just those productions reachable from the current state.

    Figure 6.3. Internal Representation of Strings: (a) the string ABCD := 15 of class Exp; (b) the same string represented by descriptors (Exp, Id, Oper, Int), characters (ABCD, :=), and a binary word (000F)

Finally, efficient hand-coded versions of standard functions like integer addition can be provided in a library. A further step is to encode lexemes of standard types in a way that facilitates processing. In figure 6.3 (b), for example, the string '15' of type Int has been encoded in binary form.
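The contrast between a string-level function and its hand-coded replacement can be sketched as follows. The productions of Figure 4.6 are not reproduced in this section, so plus_by_decrement is only a guess at their general shape (the second argument is decremented until it reaches zero); succ, pred, encode_word, and plus_hand_coded are likewise hypothetical names, and the four-digit hexadecimal word is an assumption suggested by figure 6.3 (b).

```python
DIGITS = "0123456789"


def succ(n):
    """Successor of a decimal numeral, computed on the digit string."""
    digits = list(n)
    i = len(digits) - 1
    while i >= 0 and digits[i] == "9":
        digits[i] = "0"
        i -= 1
    if i < 0:
        return "1" + "".join(digits)
    digits[i] = DIGITS[DIGITS.index(digits[i]) + 1]
    return "".join(digits)


def pred(n):
    """Predecessor of a nonzero decimal numeral, computed on the digit string."""
    digits = list(n)
    i = len(digits) - 1
    while digits[i] == "0":
        digits[i] = "9"
        i -= 1
    digits[i] = DIGITS[DIGITS.index(digits[i]) - 1]
    return "".join(digits).lstrip("0") or "0"


def plus_by_decrement(a, b):
    """String-level addition: decrement b at each call until it reaches zero."""
    if b == "0":
        return a
    return plus_by_decrement(succ(a), pred(b))


def encode_word(lexeme):
    """Encode an Int lexeme as a four-digit hexadecimal word (assumed format)."""
    return format(int(lexeme), "04X")


def plus_hand_coded(a, b):
    """Library-style replacement: one machine addition on the encoded values."""
    return str(int(a) + int(b))
```

Both versions compute the same function on numerals, but the first makes a number of calls equal to the value of its second argument, whereas the second performs a single addition once the lexemes are encoded, which is the kind of saving a library of hand-coded functions is meant to provide.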
If we continued this process of replacing strings and string functions with storage structures and hand-coded subroutines, our formal definition would gradually evolve into an interpretive implementation of the language.

7. Summary

The use of string transformations in semantic definitions appears to have several advantages: Definitions can be written that are reasonably compact and readable, at least by comparison with some existing formal approaches. Semantic productions can be grouped to form a highly modular description. The semantic metalanguage is simple and easy to learn. In addition, the notation lends itself to mechanical verification. Because definitions are operational rather than axiomatic, they can be used to drive an interpreter that generates example computations.

Basing the definition on concrete programs represented by character strings rather than abstract programs represented by, say, labelled parse trees offers advantages as well as disadvantages. On the one hand, computations can be represented compactly and the reader is spared the effort of translating between concrete and abstract syntax. On the other hand, questions of syntax may become entangled with semantics, and care must be taken to avoid unintended results in the string transformation rules. In general, more of the burden is placed on the authors of a definition and less on the users. Assuming that the latter outnumber the former, this seems like a reasonable choice.

References

1. Cleaveland, J. and Uzgalis, R. What Every Programmer Should Know About Grammars, Department of Computer Science, University of California, Los Angeles, California, 1973.
2. Kampen, G. "A Formal Definition of the SIBYL Programming Language," UIUCDCS-R-77-852, Department of Computer Science, University of Illinois, Urbana, Illinois, 1977.
3. Marcotty, M., Ledgard, H. F., and Bochmann, G. V. "A Sampler of Formal Definitions," Computing Surveys 8:2, pp. 155-267.