TWINKLE--A SYNTAX LANGUAGE FOR A
TRANSLATOR WRITING SYSTEM
by
Robert Leroy Mercer
ILLIAC IV Document No. 218
Report No. 396
TWINKLE--A SYNTAX LANGUAGE FOR A
TRANSLATOR WRITING SYSTEM*
by
Robert Leroy Mercer
May 15, 1970
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801
*This work was supported in part by the Advanced Research Projects
Agency as administered by the Rome Air Development Center under
Contract No. USAF 30(602)-4144 and submitted in partial fulfillment
of the requirements for the degree of Master of Science in Computer
Science, February 1970.
ABSTRACT
TWINKLE is a language designed to aid in the syntactic
specification of programming languages. In addition to the constructs
available in BNF, TWINKLE provides for easy specification of lists and
other frequently used linguistic structures. By providing a large number
of alternatives for its various constructs, TWINKLE allows the language
designer to specify a language in terms that approach natural language.
The implementation of a compiler for TWINKLE is described.
This compiler is the first phase of the ILLIAC IV Translator Writing
System.
ACKNOWLEDGEMENTS
The author wishes to express his appreciation for the advice and
efforts of Dr. Robert S. Northcote who has helped immeasurably in the creation
of this paper.
Thanks are also due the author's colleagues Alan Beals, Nelson
Machado, and Jacques LaFrance whose contributions to the language described
herein and discussions on the translator writing system have been invaluable.
For financial support, I wish to acknowledge the National
Science Foundation for its award of a fellowship. I also acknowledge support
by the ILLIAC IV project for willing provision of the necessary computer and
other physical facilities.
Finally, deep gratitude is expressed to Mrs. Sandy McCabe and Mrs.
Shirley Brown for their time and effort in typing the manuscript.
TABLE OF CONTENTS
1. INTRODUCTION
   1.1 Backus Naur Form
   1.2 Translatable Backus Naur Form
2. THE TWINKLE METALANGUAGE FOR SYNTACTIC SPECIFICATION
   2.1 Syntactic Symbols
      2.1.1 Terminals
         2.1.1.1 Characters
         2.1.1.2 Special Words
         2.1.1.3 Character Mode Terminals
         2.1.1.4 Meta-Terminals
         2.1.1.5 Blanks
      2.1.2 Nonterminals
      2.1.3 Any Symbols
      2.1.4 Square Brackets
      2.1.5 Maybe Symbols
      2.1.6 Enclosures
      2.1.7 Unordered List
      2.1.8 Precedence Structures
      2.1.9 Lists
         2.1.9.1 Head
         2.1.9.2 Type
         2.1.9.3 Base
         2.1.9.4 Separator
         2.1.9.5 Tail
      2.1.10 Seeded Lists
   2.2 Semantic Symbols
      2.2.1 Actions and Tests
      2.2.2 Simple Calls
      2.2.3 Declaration Calls
      2.2.4 Implicit Calls
      2.2.5 Parameterized Semantic Calls
      2.2.6 Bit Actions and Tests
      2.2.7 The Tail
      2.2.8 Placement of Calls
   2.3 Null and Empty Symbols
3. CONTROL OF THE TWINKLE TRANSLATION
   3.1 The Translator Writing System
   3.2 Control Statements
      3.2.1 Language Name Designation
      3.2.2 Print Options
      3.2.3 The Parser Type Option
      3.2.4 Zip Control
      3.2.5 Program Parameter Control
      3.2.6 Executable Compiler Options
      3.2.7 Miscellaneous Control Options
   3.3 Burroughs B-5500 Control Cards for Executing TWINKLE
4. IMPLEMENTATION OF THE TWINKLE TRANSLATOR
   4.1 NONTAB, SYMTAB, and OPRTAB
   4.2 PROTAB, PRODS and PDLIST
   4.3 PRODSTACK and LPSTACK
   4.4 Any Patterns
   4.5 Grammar Transformations
      4.5.1 Back Context Absorption
      4.5.2 Empty Removal
      4.5.3 Back Substitution of Singly Defined Nonterminals
      4.5.4 Dummy Insertion
   4.6 TWINKLE Output Files
      4.6.1 TABLESF
      4.6.2 ACTIONS
   4.7 ZIP Files
5. SUMMARY
APPENDIX
   A. Reserved Words for TWINKLE
   B. The Syntax of TWINKLE written in TWINKLE and in BNF
LIST OF REFERENCES
LIST OF FIGURES
Figure
1. The syntax of the list with separator
2. Block diagram of the Translator Writing System
3. An entry in the NONTAB table
4. PROTAB word format
5. The format of a PRODSTACK
6. The format of a LPSTACK
7. Left recursive and right recursive lists
1. INTRODUCTION
The recent proliferation of digital computers has spawned an ever
increasing number of formal languages for computer programming and related
purposes. Creating a compiler for such a formal language is a decidedly non-
trivial task, often requiring several man-years of effort. Therefore, from
this burgeoning stock of languages and compilers, several widely applicable
compiler writing techniques have been extracted which at once lead to a deeper
understanding of the compiler writing process and to a considerable reduction
of the effort involved.
Because of its importance in obtaining a clear and precise definition
of a formal language, the development of syntax metalanguages has been inti-
mately related to the development of compiler writing techniques. These meta-
languages range from Backus Naur Form (BNF) [1], and its many variants
[2, 3, 4, 5], to languages more suitable to syntactic recognition, such as
the Floyd production language (FPL) [6] and operator precedence tables [7],
and even to the conventional programming languages, FORTRAN [8] and PL1 [9].
Each of these languages has certain advantages: relative compactness and clarity
of syntactic structure in the case of BNF and its derivatives; a very clear and
explicit statement of the recognition algorithm in the case of FPL and operator
precedence tables; and, finally, virtually immediate implementation in the
case of FORTRAN and PL1. As is to be expected, ease of producing a linguistic
description decreases rapidly as the description itself approaches an imple-
mented compiler.
The primary aim of the TWINKLE metalanguage is to provide a major
increase in the ease with which a syntactic specification may be created by
a language designer and in the ease with which that syntax may be understood
by a user of the language unfamiliar with metalanguages in general. This has
been achieved through the introduction of a wide variety of syntactic symbols
for designating many of the common syntactic structures such as lists, en-
closures, etc., and through the provision of numerous English words and
phrases, which may be used with commonly understood meanings, as an integral
part of the syntactic specification.
As a secondary aim, TWINKLE has been designed to present a unified
front for the University of Illinois Translator Writing System (TWS). TWINKLE
is the input language and the TWINKLE translator is the first phase of the TWS.
Thus, TWINKLE combines BNF as described by Beals [3] and translatable BNF
(TBNF) as described by Trout [4]. Before progressing to a detailed description
of the TWINKLE language and translator which occupies the remainder of
this thesis, a brief description is provided of both BNF and TBNF.
1.1 Backus Naur Form
The basic unit of the BNF description of a language is the produc-
tion. A production consists of a nonterminal (the left hand side) followed
by the symbol triple "::=", followed by a list of terminals and nonterminals
(the right hand side). Each nonterminal consists of a string of characters
enclosed in either quotes (" ") or angle brackets (< >) . The string may
not include ", <, or > and may not start with *. Terminals are special words
(strings of alphanumeric characters preceded by #) or characters (A, B, C,
etc.). Characters used in the metalanguage (#, ", <, /, etc.) must also be
preceded by a # when used as terminals. Two productions with the same left
hand side may be combined into one by including the right hand side of the
second with the right side of the first and separating it from the latter with
the metacharacter, "/". Productions themselves are separated from one another
by the metacharacter, ";".
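The conventions just described can be made concrete with a small reader for BNF text. What follows is only a minimal sketch in Python, not the TWS scanner: the function name `parse_bnf` and the token pattern are assumptions made for illustration, and the sketch handles nothing beyond "::=", "/", ";", sharp-prefixed items, and nonterminals in quotes or angle brackets.

```python
import re

# Token pattern (an assumption made for this sketch, not the TWS scanner):
#   <name> or "name"  -> nonterminal
#   #WORD or #c       -> sharp-prefixed special word or metacharacter
#   ::=  /  ;         -> metacharacters
#   any other nonblank character -> a character terminal
TOKEN = re.compile(r'<[^<>"]*>|"[^<>"]*"|#[A-Za-z][A-Za-z0-9]*|#\S|::=|[/;]|\S')


def parse_bnf(text):
    """Return {left hand side: [alternative, ...]}; each alternative is a token list."""
    grammar = {}
    tokens = TOKEN.findall(text)
    i = 0
    while i < len(tokens):
        lhs = tokens[i]
        if i + 1 >= len(tokens) or tokens[i + 1] != "::=":
            raise ValueError("expected ::= after " + lhs)
        i += 2
        alternatives, current = [], []
        # Collect symbols up to the ";" that ends the production.
        while i < len(tokens) and tokens[i] != ";":
            if tokens[i] == "/":
                alternatives.append(current)
                current = []
            else:
                current.append(tokens[i])
            i += 1
        alternatives.append(current)
        i += 1  # skip the ";"
        # Productions with the same left hand side are combined into one.
        grammar.setdefault(lhs, []).extend(alternatives)
    return grammar
```

Note that "#;" is read as a single terminal (the semicolon character), while an unescaped ";" ends the production, which is the distinction the sharp exists to make.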
1.2 Translatable Backus Naur Form
TBNF is, in itself, a large step toward simplifying syntactic spec-
ification. In addition to the BNF structures described in the last section,
TBNF allows:
(i) Kleene star--
<a>* = λ | <a> | <a><a> | <a><a><a> | ... ;
(ii) Ampersand for optional presence of some symbol--
<a>& = <a> | λ ;
(iii) Square bracket construct to delimit groups of symbols and
alternatives--
<a> ::= [<b> | <c>] z
is equivalent to
<a> ::= <d> z
<d> ::= <b> | <c> ;
(iv) list <a> = <a> <a>* ;
(v) list <a> separator <b> = <a> [<b> <a>]* ;
(vi) <ANY> = Any symbol at all ;
(vii) BUT, used in conjunction with <ANY> to reduce its generality.
Thus TBNF is considerably more general than BNF. Note, however, that TBNF
does not allow left recursion because the parser generated employs recursive
descent.
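Each of the TBNF shorthands can be rewritten as ordinary productions. The sketch below, in Python, shows one possible rewriting under stated assumptions: the function names, the naming of the helper nonterminal ("-REST"), and the use of "λ" for the empty string are all hypothetical and are not taken from Trout's implementation. Right recursion is used throughout because, as noted above, a recursive descent parser cannot accept left recursion.

```python
LAMBDA = "λ"  # stands for the empty string in this sketch


def star(x, name):
    """name stands for x*: zero or more occurrences of x (right recursive)."""
    return {name: [[LAMBDA], [x, name]]}


def maybe(x, name):
    """name stands for x&: a single optional occurrence of x."""
    return {name: [[x], [LAMBDA]]}


def list_with_separator(x, sep, name):
    """name stands for x [sep x]*: one or more x separated by sep.

    Assumes name is written in angle brackets, e.g. "<S-LIST>"; the helper
    nonterminal is formed by inserting "-REST" before the closing bracket.
    """
    rest = name[:-1] + "-REST>"
    return {name: [[x, rest]], rest: [[LAMBDA], [sep, x, rest]]}
```

For instance, `list_with_separator("<S>", "#;", "<S-LIST>")` yields one production for `<S-LIST>` and one for the helper `<S-LIST-REST>`, neither of which is left recursive.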
2. THE TWINKLE METALANGUAGE FOR SYNTACTIC SPECIFICATION
When work was begun on the TWINKLE language, two metalanguages,
BNF and TBNF, were already in use at the University of Illinois as input
languages to the TWS. The BNF input yielded a deterministic parsing algorithm
based on the Floyd Production Language (FPL), as described by Beals
[3]; while TBNF input yielded a recursive descent (RD) parsing algorithm,
as described by Trout [4]. Each parser has certain advantages but,
once either BNF or TBNF has been chosen as the metalanguage, it requires
a major effort to convert the description to the alternate form. Thus,
although it would be desirable, because of its relatively rapid generation,
to create a RD parser during the debugging phases of the language descrip-
tion, it would be equally desirable, because of the rigorous exclusion of
ambiguity inherent in the nature of the deterministic FPL parser, to
create a FPL parser when the last phase of the compiler creation is reached.
Unfortunately this ideal has not been attainable, in the past, primarily
due to the difficulty of translating TBNF into BNF by hand. These considera-
tions, therefore, dictate that TWINKLE be a superset of both BNF and TBNF,
so that existing language specifications may be accepted by the new system
with little or no change, and that the TWINKLE translator output be either
BNF or TBNF.
The basic form of TWINKLE, therefore, is cast in the familiar
BNF mode. That is, a TWINKLE syntactic specification consists of one or
more productions, each of which has a left hand side, which is the non-
terminal being defined (wholly, or in part) by the production, and a
right hand side which comprises a set of alternative definitions for the
nonterminal in the left hand side. Each definition, in turn, comprises
a string of TWINKLE syntactic and semantic symbols. In BNF the produc-
tions are separated from one another by semicolons; the definitions by
vertical bars which are actually rendered in the implementation by a
slash; and the left and right hand sides of a production by the character,
triple "::=" . It is one of the aims of the TWINKLE project to make
possible the rigorous specification of a language syntax in a way that is
at once acceptable to both human beings and computers. To this end,
English phrases have been provided for the replacement and/or embellishment
of the metalinguistic symbolism of BNF and TBNF. In addition, several
features not present in either BNF or TBNF have been introduced. For
example, the BNF productions,
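<PROGRAM> ::= #BEGIN <STATEMENT LIST> #END /
              #BEGIN #END;
<STATEMENT LIST> ::= <STATEMENT> / <STATEMENT LIST> #;
                     <STATEMENT>;
may be written in TWINKLE in the much more readable form:
A <PROGRAM> CONSISTS OF A POSSIBLY EMPTY LIST OF <STATEMENT>S
SEPARATED BY SEMICOLONS ENCLOSED IN #BEGIN AND #END;              (1)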
At first glance this might appear to have two distinct interpretations:
particularly, a program may be either
BEGIN s ; s ; s ; ... ; s END
or
s BEGIN; END s BEGIN; END s ... BEGIN; END s
where "s" here stands for <STATEMENT>. In fact, the former interpretation
is made. If the language designer wishes to express the latter form of
program, he may write :
A <PROGRAM> CONSISTS OF A POSSIBLY EMPTY LIST OF <STATEMENT>S
SEPARATED BY [SEMICOLONS ENCLOSED IN #BEGIN AND #END];            (2)
In TWINKLE, the square brackets ([]) serve the function of delineating
clause and phrase structure in productions. The enclosure operator always
acts on the immediately preceding syntactic symbol which is at the same
nesting level as the enclosure operator, itself. Thus, in production (1),
it is the possibly empty list that is to be enclosed and not the semicolons.
In production (2), on the other hand, the square brackets associate the
enclosure operator with the semicolons and indicate that this construct,
taken as a whole, is to be the separator of the possibly empty list.
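The difference between the two readings is easiest to see by generating the token sequences each one admits. The following Python sketch does this for a given list of statements; the function names `production_1` and `production_2` are hypothetical labels for productions (1) and (2), and each statement is treated as an opaque token.

```python
def production_1(statements):
    """Reading of production (1): the whole list is enclosed in BEGIN and END."""
    tokens = ["BEGIN"]
    for i, s in enumerate(statements):
        if i > 0:
            tokens.append(";")  # plain semicolon as separator
        tokens.append(s)
    tokens.append("END")
    return tokens


def production_2(statements):
    """Reading of production (2): each separator is a semicolon enclosed in BEGIN and END."""
    tokens = []
    for i, s in enumerate(statements):
        if i > 0:
            tokens += ["BEGIN", ";", "END"]  # the bracketed group acts as the separator
        tokens.append(s)
    return tokens
```

With three statements, `production_1` gives BEGIN s ; s ; s END while `production_2` gives s BEGIN ; END s BEGIN ; END s, matching the two interpretations shown earlier. With an empty list, `production_1` still yields BEGIN END, whereas `production_2` yields nothing at all.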
It is to be noted, however, that the TWINKLE language does not
enforce strict grammatical usage of English but, rather, allows for such
usage by the language designer. Thus, it will be found that, in the TWINKLE
syntax, the articles "A" and "AN" are treated equally with the result that
a determined corruptor of English might write (1) as
AN <PROGRAM> CONSISTS OF AN POSSIBLY EMPTY LIST OF <STATEMENT>S
SEPARATED BY SEMICOLONS ENCLOSED IN #BEGIN AND #END;              (3)
The TWINKLE program, in its most general form, is made up of
three distinct portions, of which the language syntactic description is
the second. The first portion is a set of control statements which deter-
mine, among other things, the nature and volume of the TWINKLE output and
the processing options for the remainder of the TWS. The construction
and use of this portion is dealt with in Chapter 3. The third portion,
the semantic tail, conveys the necessary semantic information about the
language being described. This is discussed briefly in 2.2.5, and more
fully by Machado [10]. The remainder of the current chapter is a
discussion of the syntactic and semantic symbols available to the language
designer for the syntactic portion of the TWINKLE system.
2.1 Syntactic Symbols
2.1.1 Terminals
Terminals are those symbols of which the object program is
ultimately composed. They may occur only on the right hand side of a
production and are represented in the syntax in a number of different
ways. Terminals fall into the following classes: characters, special
words, character mode terminals, meta- or special class terminals, and
blanks.
2.1.1.1 Characters
The simplest form of the terminal is the character. Here
character refers to any of the twenty-seven special characters accepted
by equipment of the Burroughs B-5500 computer system. These are included,
together with the other thirty-seven B-5500 characters, in Table I. In
the syntax, a character may be represented by prefixing it with a sharp
(#) when the character standing alone would have some special significance
to TWINKLE (i.e., when the character is a meta-character). For example,
a comma is represented in the syntax by the symbol pair "#,". However,
since more than half of the special characters available are meta-characters,
it is probably safest to include the sharp in all cases. This is a point
at which TWINKLE diverges from the standard BNF and TBNF. The latter have
fewer meta-characters and, as such, require fewer sharps. Although it is
very easy to insert the necessary sharps, it may be desirable to make a
preliminary run through the TWINKLE translator alone to isolate what trouble
spots there may be.

Table I
The Burroughs B-5500 Character Set
(Code = internal code; Bases = classes of any base symbols)

Terminal Code Bases   | Terminal Code Bases   | Terminal Code Bases   | Terminal     Code Bases
0          0  0,1,3   | A         17  0,1,2   | K         34  0,1,2   | T             51  0,1,2
1          1  0,1,3   | B         18  0,1,2   | L         35  0,1,2   | U             52  0,1,2
2          2  0,1,3   | C         19  0,1,2   | M         36  0,1,2   | V             53  0,1,2
3          3  0,1,3   | D         20  0,1,2   | N         37  0,1,2   | W             54  0,1,2
4          4  0,1,3   | E         21  0,1,2   | O         38  0,1,2   | X             55  0,1,2
5          5  0,1,3   | F         22  0,1,2   | P         39  0,1,2   | Y             56  0,1,2
6          6  0,1,3   | G         23  0,1,2   | Q         40  0,1,2   | Z             57  0,1,2
7          7  0,1,3   | H         24  0,1,2   | R         41  0,1,2   | ,             58  0,1,4
8          8  0,1,3   | I         25  0,1,2   | $         42  0,1,4   | %             59  0,1,4
9          9  0,1,3   | .         26  0,1,4   | *         43  0,1,4,6 | ≠             60  0,1,4,5
#         10  0,1,4   | [         27  0,1,4   | -         44  0,1,4,6 | =             61  0,1,4,5
@         11  0,1,4   | &         28  0,1,4   | )         45  0,1,4   | ]             62  0,1,4
?         12  0,1,4   | (         29  0,1,4   | ;         46  0,1,4   | "             63  0,1,4
:         13  0,1,4   | <         30  0,1,4,5 | ≤         47  0,1,4,5 | special word   -  0,7
>         14  0,1,4,5 | ←         31  0,1,4   | blank     48  0,1     | <*I>           -  0,7
≥         15  0,1,4,5 | ×         32  0,1,4,6 | /         49  0,1,4,6 | <*N>           -  0,7
+         16  0,1,4,6 | J         33  0,1,2   | S         50  0,1,2   | <*S>           -  0,7

Classes of any base symbols:
0 = Any Terminal
1 = Any Character
2 = Any Letter
3 = Any Digit
4 = Any Special Character
5 = Any Relational Operator
6 = Any Algebraic Operator
7 = Any Non-Character
An alternative method for indicating a character in the syntax
which avoids the details of the sharp, and which provides for a more
readable syntax, consists of writing down the English word or phrase which
identifies the character in question. Thus, in production (1) above,
SEMICOLONS is used in place of the equally acceptable "#;". While this
form is not available for all of the special characters, it proves quite
useful in practice. A complete list of these alternatives is given in
the TWINKLE syntax (see Appendix B).
2.1.1.2 Special Words
Many times it is convenient to consider a group of letters and
digits as being, conceptually, a single terminal. Thus, in languages of
the ALGOL family, the letter strings BEGIN and END are each taken as
single terminals. This approach has an advantage in the milieu of the
TWS in that these conceptual units, or special words, are compiled rela-
tively quickly by the scanner as opposed to the more laborious and time
consuming letter by letter compilation through the syntax and semantics.
Any string of letters and digits which begins with a letter may
be used as a special word. It must not have embedded blanks, and the
character immediately after it must not be alphanumeric. As with characters,
a special word must be prefixed by a sharp in the syntax if it would
otherwise be of special significance to the TWINKLE translator. Since
there are well over one hundred such words in TWINKLE (see Appendix A),
it is safest to use sharps liberally. Again, there is a divergence of TWINKLE
at this point from BNF and TBNF which is easily overcome.
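The two rules for special words (the form of the word itself, and the requirement that the following character not be alphanumeric) can be sketched in Python as below. This is an illustration only, not the TWS scanner: the function names and the `words` parameter (the set of special words of the language being described) are assumptions, and lower-case letters are accepted for convenience even though the B-5500 character set has none.

```python
import re

SPECIAL_WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_valid_special_word(word):
    """True if word is a string of letters and digits beginning with a letter."""
    return SPECIAL_WORD.fullmatch(word) is not None


def match_special_word(text, pos, words):
    """Return the special word that starts at text[pos], or None.

    The longest run of alphanumeric characters is taken, so that the
    character immediately after the match is never alphanumeric.
    """
    end = pos
    while end < len(text) and text[end].isalnum():
        end += 1
    candidate = text[pos:end]
    return candidate if candidate in words else None
```

Because the whole alphanumeric run is taken before the lookup, the input BEGINNING is not mistaken for the special word BEGIN followed by NING.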
2.1.1.3 Character Mode Terminals
While the special word is often the better way of entering
alphanumeric information, there are times when character by character
input is actually preferable. For example, the parsing of FORTRAN and
B-5500 ALGOL FORMAT statements is simplified if done in the character mode.
More generally, any time there is an abundance of single character signifi-
cance in a syntactic entity, it is better parsed and more compactly
described in the character mode.
If the sequence of characters to be dealt with consists entirely
of digits, it may be written into the syntax directly because an unadorned
number has no special significance to the TWINKLE translator. If, however, a
more general sequence must be handled, the sequence must be preceded by the
word ALPHA, which indicates to TWINKLE that it must consider the following
sequence of characters specially. Since ALPHA is a bit long, it behooves one to
provide a means of keeping its use to a minimum. To this end, the construct,
[ALPHA A / ALPHA B / ALPHA C / ... / ALPHA Z] ,
is equivalent to the more compact form,
ALPHA [A / B / C / . . . / Z]
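The equivalence between the compact and the expanded ALPHA forms amounts to distributing the leading word over the bracketed alternatives. A minimal Python sketch of that rewriting follows; the function name `expand_factored` is hypothetical, and the sketch assumes that no alternative contains a sharp-escaped slash or a nested bracket.

```python
import re


def expand_factored(text):
    """Rewrite 'WORD [a / b / c]' as '[WORD a / WORD b / WORD c]'."""
    m = re.fullmatch(r"\s*(\w+)\s*\[(.*)\]\s*", text)
    if m is None:
        raise ValueError("expected WORD [alternative / alternative / ...]")
    word, body = m.groups()
    alternatives = [a.strip() for a in body.split("/")]
    return "[" + " / ".join(word + " " + a for a in alternatives) + "]"
```

For example, `expand_factored("ALPHA [A / B / C]")` returns the string `[ALPHA A / ALPHA B / ALPHA C]`.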
Another, more obscure, method of specifying alphanumeric
characters (or, in fact, any of the Burroughs B-5500 characters) is the
code construct which is based on the internal binary representations of
the various characters. This form of character representation is a
carry over from TBNF where it was adopted primarily because the question
mark is not a valid character on punched cards in the B-5500 system. It
consists of the word CODE followed by an integer between zero and sixty-
three which is the internal code of the character being indicated.
2.1.1.4 Meta-Terminals
Because of the advantages attendant to allowing the scanner to
perform a certain amount of simple syntactic analysis immediately on the
input string (as, for example, in the recognition of special words),
the TWS scanner also recognizes members of three special classes of
terminals: identifiers, numbers, and strings. These meta-terminals are
represented in the syntax by the symbols <*I>, <*N>, and <*S>, respectively.
In an English-like syntax, they may be represented by IDENTIFIER (or
IDENTIFIERS), NUMBER (or NUMBERS), and STRING (or STRINGS). An identifier
is any sequence of alphanumeric characters beginning with a letter,
provided the sequence is not a special word of the language. A string
is any sequence of characters (excluding the quote) enclosed in quotes.
A discussion of how the scanner handles these items has been given by
Machado [10].
In addition to these three meta-terminals, TWINKLE allows for
the syntactic specification of up to twenty other meta-terminal symbol
classes. In the syntax, these are represented by the symbol <*n> where
n is an integer between four and twenty-three and identifies the
meta-terminal. A special scanner is necessary to take advantage of this
facility.
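The definitions of identifier and string given above translate directly into patterns. The Python sketch below classifies a single token accordingly; it is an illustration under stated assumptions rather than a description of the TWS scanner. The function name `classify` and the `special_words` parameter are hypothetical, and numbers are left out because their form is not defined in this section.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # letter, then letters and digits
STRING = re.compile(r'"[^"]*"')                   # any characters but the quote, in quotes


def classify(token, special_words):
    """Return '<*S>', '<*I>', 'special word', or None for a single token."""
    if STRING.fullmatch(token):
        return "<*S>"
    if IDENTIFIER.fullmatch(token):
        # An identifier must not be a special word of the language.
        return "special word" if token in special_words else "<*I>"
    return None
```

The check against `special_words` comes after the pattern match, so a word such as BEGIN is never reported as an identifier in a language that reserves it.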
2.1.1.5 Blanks
Blanks are specified in the syntax by the word BLANK. A blank
can only be scanned in the character mode.
2.1.2 Nonterminals
Nonterminals are specified in TWINKLE as strings of characters,
called nonterminal names, enclosed in either angle brackets or quotes
(< > or " "). For obvious reasons the nonterminal name may
not include either angle brackets or quotes. Furthermore, any blanks
which appear in the nonterminal name are disregarded. Thus, two nonterminal
names which differ only in their blanks are treated identically. In
the BNF or TBNF output resulting from a TWINKLE translation, the blanks
in nonterminal names displayed are, in fact, removed. To retain a
modicum of readability in this compact form it is advisable to hyphenate
multi-word nonterminal names.
2.1.3 Any Symbols
In BNF, to define a nonterminal such as <LETTER>, which is any of the
twenty-six letters of the alphabet, one must write--
<LETTER> ::= A/B/C/.../X/Y/Z
Trout, in TBNF, introduced the pseudo-nonterminal, <ANY>, which stands for
any terminal symbol. If not all terminals are to be indicated, the
exceptions, if small in number, may follow the pseudo-nonterminal, each
preceded by the special word, BUT. For example, any terminals except
BEGIN and END may be written:
<ANY> BUT #BEGIN BUT #END
This construct has been used primarily in error recovery in TBNF languages.
In TWINKLE, the any symbol has been generalized and has become
a powerful programming tool. The syntax of the any symbol is shown below:
       { #TERMINAL             }
       { #CHARACTER            }
       { #LETTER               }   { LIST OF {#BUT <TERMINAL>}               }
       { #DIGIT                }   {                                         }
#ANY   { #SPECIAL #CHARACTER   }   { #BUT #[ LIST OF <TERMINAL>S SEP {#/} #] }
       { #RELATIONAL #OPERATOR }   {                                         }
       { #ALGEBRAIC #OPERATOR  }   { λ                                       }
       { #NONCHARACTER         }
       { <NONTERMINAL>         }
                BASE                           EXCEPTION LIST
Use has been made of a rather simple two-dimensional extension of TWINKLE:
square brackets have been replaced by vertical braces with the alternatives
occupying one line each; the Greek letter "λ" is used instead of the special
word, LAMBDA.
The terminals in each of the bases, except for the nonterminal base, are
shown in Table I. The nonterminal base is unique in that, first,
the elements that it contains depend upon the actual nonterminal symbol
used and, second, they are not restricted to terminals but include all
of the alternative definitions of the nonterminal. Any terminals which
are in the base, but which are not desired, may be written in the exception
list following the base. The TBNF form of the exception list
is still accepted by TWINKLE but, as with the ALPHA list, it is possible
to use only one BUT and enclose the terminals of the exception list in
square brackets immediately following it. Thus, in place of
BUT #BEGIN BUT #END BUT #LEFT BUT #RIGHT ,
one may write
BUT [#BEGIN / #END / #LEFT / #RIGHT]
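An any symbol with an exception list denotes a set of terminals: the members of the base, less the exceptions. The Python sketch below models this for the character bases only. The class memberships are an assumption, read from the classes of any base symbols in Table I; the names `BASES` and `any_symbol` are hypothetical; and the TERMINAL, NONCHARACTER, and nonterminal bases are omitted because they depend on the language being described.

```python
import string

# Character classes as read from Table I (an assumption of this sketch).
DIGITS = set(string.digits)
LETTERS = set(string.ascii_uppercase)
RELATIONAL = set("<>=≤≥≠")
ALGEBRAIC = set("+-*/×")
OTHER_SPECIAL = set('#@?:.[&(←$);,%]"')
SPECIAL = RELATIONAL | ALGEBRAIC | OTHER_SPECIAL
BLANK = {" "}

BASES = {
    "CHARACTER": DIGITS | LETTERS | SPECIAL | BLANK,
    "LETTER": LETTERS,
    "DIGIT": DIGITS,
    "SPECIAL CHARACTER": SPECIAL,
    "RELATIONAL OPERATOR": RELATIONAL,
    "ALGEBRAIC OPERATOR": ALGEBRAIC,
}


def any_symbol(base, but=()):
    """Return the set of characters denoted by ANY <base> BUT [exceptions]."""
    return BASES[base] - set(but)
```

Both spellings of the exception list, repeated BUT or a single BUT with brackets, correspond to the same `but` argument here, since only the set of excluded terminals matters.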
2.1.4 Square Brackets
In English, clauses are separated at one level by commas and
at a higher level by semicolons. Beyond this either more than one sentence
is used or the clause separation must be done by the reader from context.
Even this, however, does not prevent ambiguity beyond four levels or so. A
language such as TWINKLE, in which it is necessary to indicate clause
nesting to an arbitrary level, must have a more powerful mechanism
available.
Since the semicomma and the demisemicolon do not yet exist, it
was decided that clauses and other such ensembles, which are intended as
single syntactic entities, be enclosed in square brackets. Two examples of
this have already been encountered: the terminal list following ALPHA,
and the exception list following BUT. Beyond these the square brackets find
several other uses; whenever one or more symbols, or groups of symbols,
appear at one spot in a production they may be enclosed in square brackets.
Thus, the productions,
AN <ANY SYMBOL> CONSISTS OF #ANY #TERMINAL;
AN <ANY SYMBOL> CONSISTS OF #ANY #CHARACTER;
AN <ANY SYMBOL> CONSISTS OF #ANY #LETTER ,
may be written more compactly as
AN <ANY SYMBOL> CONSISTS OF #ANY FOLLOWED BY [#TERMINAL OR
#CHARACTER OR #LETTER] .
For purposes of adding semantic symbols, which will be discussed later,
the special word, ANY, and the bracket construct, taken as a whole, are
considered to be at level zero of the production, while the special
words, TERMINAL, CHARACTER, and LETTER, are considered to be at level one.
Alternatively, an entire production may be nested in square
brackets and one may write--
AN <ANY SYMBOL> CONSISTS OF #ANY FOLLOWED BY AN [<BASE>
WHICH IS DEFINED TO BE #TERMINAL OR #CHARACTER OR #LETTER] .      (4)
Note that although all previous forms of arrow are still valid in the
nested production, it is also permissible to include the special word,
WHICH, so that the construct will look more like an English clause. When
a nonterminal, such as <BASE>, is defined in a nested production it
may then appear anywhere else in the syntax just as if it had been defined
in the usual manner. There are, however, some precautions to be
taken with nested productions. These stem from the fact that the nonterminal
so defined may have additional definitions elsewhere in the
syntax. In this case the nonterminal represents the totality of its
alternative definitions, except when it appears on the left hand side
of a nested production in which case it represents only those definitions
which appear on the right hand side of the same nested production. To
illustrate this, suppose that in addition to the production above, in
which <BASE> is defined, one also has the production:
A <SOME SYMBOL> CONSISTS OF #SOME FOLLOWED BY AN
[<BASE> WHICH IS #DIGIT OR #DIGITS]                               (5)
The two productions (4) and (5) are then equivalent to the BNF productions:
<ANY SYMBOL> ::= #ANY <BASE-1>;
<SOME SYMBOL> ::= #SOME <BASE-2>;
<BASE> ::= <BASE-1> / <BASE-2>;
<BASE-1> ::= #TERMINAL / #LETTER / #CHARACTER;
<BASE-2> ::= #DIGIT / #DIGITS
The square bracket construct may also be applied to a single
string of symbols to indicate that the string is to be taken as a unit
itself. This is useful in constructs such as the maybe symbol, enclosures,
and lists described below.
2.1.5 Maybe Symbols
Frequently a syntactic structure has one or more substructures
which may be omitted without syntactic error. Thus, for example, in ALGOL
a list of labels preceding a statement is optional. Many other examples may
be found in algorithmic languages; several may be found in TWINKLE,
itself. To make the specification of such structures as easy as possible,
Trout [4] adopted the Brooker and Morris question mark [11] --changing it, in
the process, into an ampersand because the question mark is an illegal char-
acter on B-5500 cards. In TWINKLE, either an ampersand or a question mark may
be placed after a symbol (or group of symbols enclosed in square brackets) to
indicate that it is optional. The English-like form of this construct con-
sists of preceding the optional symbol by the special word phrase, POSSIBLY
ONE. This form is more general, in that it may be applied directly to lists
and enclosures; whereas, they must be enclosed in brackets when followed by a
question mark or ampersand.
An example of the English form of the maybe symbol follows. The
production,
AN <A> CONSISTS OF AN <B> FOLLOWED BY
POSSIBLY ONE <C>,
is equivalent to the two BNF productions:
<A> ::= <B> <C>;
<A> ::= <B>
Under more complex conditions, the maybe symbol can account for a considerable
increase in readability of the syntax.
2.1.6 Enclosures
Another very common construct in computer languages is that in which
some structure, such as a list of subscripts, is enclosed in delimiters, such
as parentheses. The delimiters may be different: e.g., the special words,
BEGIN and END, which bracket compound statements in ALGOL; they may be different
but very closely related: e.g., the left and right parentheses which enclose
subscript lists in FORTRAN and PL1; they may be identical: e.g., quotes which
delimit strings in ALGOL or periods which enclose logical operators in some
dialects of FORTRAN. Corresponding to these possibilities there are three
forms of enclosure. The general representative of the first form may be
symbolized as:
<SYMBOL> #ENCLOSED #IN <SYMBOL> #AND <SYMBOL>
where <SYMBOL> represents any basic TWINKLE symbol or group of symbols
enclosed in brackets. The latter two symbols, if not enclosed in brackets, may
not be enclosure symbols themselves. Using this construct and the list symbol
discussed below a compound statement in ALGOL may be defined by the very clear
production:
A <COMPOUND STATEMENT> IS DEFINED TO BE A POSSIBLY EMPTY LIST OF
<STATEMENT>S SEPARATED BY SEMICOLONS ENCLOSED IN #BEGIN AND #END;
The second form of enclosure actually applies to only three sets of characters
in the Burroughs character set: parentheses, square brackets, and angle
brackets. It may be symbolized, using the two dimensional bracket construct
described earlier, as:
                         { #PARENTHESES      }
<SYMBOL> #ENCLOSED #IN   { #ANGLE #BRACKETS  }
                         { #SQUARE #BRACKETS }
Finally, the third form of enclosure symbol is simply:
<SYMBOL> #ENCLOSED #IN <SYMBOL>
The special word, S, may be added to make the second symbol plural.
2.1.7 Unordered List
The unordered list provides a means of indicating that a group of
items may appear in any order. One very simple example of the use of this is
the PL1 iterated DO loop. The leading statement may have an initial value,
increment, and final value for the control variable; the increment and
final value may appear in either order. Symbolically the unordered list has
the form:
#UNORDERED #LIST #OF [<SYMBOL> #AND ... #AND <SYMBOL>]
or
[<SYMBOL> #AND ... #AND <SYMBOL>] #IN #ANY #ORDER
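In plain BNF an unordered list must be written out as one alternative per ordering. The Python sketch below shows that expansion, assuming that every item of the list appears exactly once (the text above does not say whether items may be omitted); the function name `unordered_alternatives` and the nonterminal names in the usage note are hypothetical.

```python
from itertools import permutations


def unordered_alternatives(items):
    """Return one alternative (a list of symbols) for each ordering of items."""
    return [list(p) for p in permutations(items)]
```

For the two trailing clauses of the DO loop, `unordered_alternatives(["<INCREMENT>", "<FINAL VALUE>"])` gives the two orderings. The number of alternatives grows as n factorial, which is one reason a dedicated construct is more convenient than writing the alternatives by hand.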
2.1.8 Precedence Structures
The precedence structure was introduced in TBNF to allow for the
specification of operator precedence in a BNF environment. Since the prece-
dence structure does not lend itself readily to a simple English alternative,
its syntax has not been expanded from the TBNF version. A precedence structure
consists of: the special word, OPERATOR, followed by a list of precedence
groups enclosed in square brackets; followed by the special word, ON; and,
finally, by the operand on which the precedence is based. Each precedence
group consists of a list of symbols, the operators, followed by the special
word, PRECEDENCE, and a pair of integers separated by a comma enclosed in paren-
theses. The integers indicate the precedence of the preceding operators in
the stack and in the input stream, respectively. Succeeding precedence groups
are separated from one another by slashes. The following is an example of
the use of a precedence structure :
<ARITHMETIC EXPRESSION> ::=
OPERATOR[#+ #- PRECEDENCE(1,1) / #/ #× PRECEDENCE(2,2)] ON <PRIMARY>
2.1.9 Lists
The list is one of the most useful of the TWINKLE constructs. Con-
sequently it occurs in a wide variety of English-like and symbolic forms. To
indicate a simple list without separators, the right square bracket of a nested
production or square bracket construct may be followed by an asterisk or by a
plus sign. The asterisk denotes a possibly empty list while the plus sign
denotes a list that must have at least one element.
In BNF, lists are usually formed either by left recursive productions
such as:
<LIST> ::= <ELEMENT> / <LIST> <ELEMENT>;
or by right recursive productions such as:
<LIST> ::= <ELEMENT> / <ELEMENT> <LIST>
This underlying structure is masked by the simplicity of the TWINKLE constructs
but may be made explicit, if desired, by the insertion of the qualifying spe-
cial word, OPEN, for left recursion or CLOSE, for right recursion between the
right square bracket and the asterisk. If no qualifier is present, left re-
cursion is implied.
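The difference between the two underlying forms shows up only in the shape of the parse tree. The following Python sketch is a hedged illustration, not part of TWINKLE: the function names and the use of nested tuples as trees are assumptions made for the example.

```python
from functools import reduce


def left_recursive(items):
    """Tree for a left recursive list: nesting grows to the left."""
    return reduce(lambda tree, item: (tree, item), items)


def right_recursive(items):
    """Tree for a right recursive list: nesting grows to the right."""
    tree = items[-1]
    for item in reversed(items[:-1]):
        tree = (item, tree)
    return tree


print(left_recursive(["a", "b", "c"]))   # (('a', 'b'), 'c')
print(right_recursive(["a", "b", "c"]))  # ('a', ('b', 'c'))
```

Both functions accept the same nonempty sequence of items; only the grouping differs, which matters to any semantic routine that inspects the stack while the list is being built.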
The syntax of list structures allowing for separators is shown in
figure 1. As indicated it consists of five portions: head, type, base, sepa-
rator, and tail. Only the type and base are, necessarily, nonempty but at least
one of the head and tail must be empty. The functions of these various com-
ponents are discussed below.
2.1.9.1 Head
In the absence of the list tail, the list head determines
whether the list is left or right recursive. If it is
empty, the special word, REDUCED, or the phrase, LEFT RECURSIVE,
the list is left recursive. If it is R, EXPANDED, or RIGHT
RECURSIVE, the list is right recursive.
[Figure 1: syntax of list structures]
2.1.9.2 Type
Lists may be either possibly empty or nonempty, as noted
above. It is the function of the list type, which
may not be empty, to determine this characteristic. A nonempty
list is indicated by the list types: L, LIST, STRING,
SEQUENCE, NONEMPTY STRING, NONEMPTY LIST, and NONEMPTY
SEQUENCE. For a possibly empty list, the available
types are: EL, KLEENE, KLEENE STAR, POSSIBLY EMPTY LIST,
POSSIBLY EMPTY STRING, and POSSIBLY EMPTY SEQUENCE. To
improve readability, the list type may be followed by
the special word, OF. If the list type is either
STRING or KLEENE, OF is necessary to avoid syntactic
ambiguity.
2.1.9.3 Base
The base may be a single symbol or group of symbols
enclosed in square brackets. It must not be a list
itself and is considered to be nested one level deeper
than the list of which it is the base. An S may follow
the base, in some cases, to indicate its plural character
in the grammatical structure of the list. In addition,
phrases which may be used as character terminals
all have plural forms which may be used profitably here.
2.1.9.4 Separator
The power of the list construct is greatly enhanced
by the possibility of specifying a separator. Like
the base, the separator may be either a single symbol
or a group of symbols enclosed in square brackets. The
separator, like the base, is considered to be at the next
higher bracket nesting level from that of the list, itself.
The plural forms valid in the base are also valid in the
separator. The appearance of the separator between
successive base items is either required or optional
according as the separator type is definite or
questionable. Definite separator types are indicated
by SEP, SEPARATOR, and SEPARATED BY; questionable
separator types are indicated by Q, and by the special
word, POSSIBLY, followed by any one of the definite
separator types.
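The interplay of list type and separator type can be sketched as a small recognizer. The Python below is an illustration under stated assumptions and not the TWINKLE implementation: the base and the separator are each taken to be a single token, and the name match_list and its parameters are hypothetical.

```python
def match_list(tokens, is_base, is_separator=None, separator_optional=False,
               allow_empty=False):
    """Return how many leading tokens form the list, or None on failure."""
    pos = 0
    count = 0
    while pos < len(tokens):
        start = pos
        if count > 0 and is_separator is not None:
            if is_separator(tokens[pos]):
                pos += 1
            elif not separator_optional:
                break  # a definite separator is missing: the list ends here
        if pos < len(tokens) and is_base(tokens[pos]):
            pos += 1
            count += 1
        else:
            pos = start  # do not consume a trailing separator
            break
    if count == 0 and not allow_empty:
        return None  # a nonempty list type requires at least one base item
    return pos


def is_s(token):
    return token == "s"


def is_semicolon(token):
    return token == ";"


print(match_list(list("s;s;s"), is_s, is_semicolon))
print(match_list(list("ss"), is_s, is_semicolon, separator_optional=True))
```

Here allow_empty plays the role of a possibly empty list type, and separator_optional that of a questionable separator type.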
2.1.9.5 Tail
In the absence of a list head, the tail determines
whether the list takes the left recursive or right
recursive form. If it is empty, or the special word,
CLOSE, the list is left recursive. If it is the
special word, OPEN, the list is right recursive.
Note once again that either the tail or the head must
be empty in any list structure.
The examples below illustrate the various aspects
of the list structure:
A S SEPARATED
BY SEMICOLONS;
ARITHMETIC EXPRESSION ::= LIST [LIST
SEP [#* /#/]] SEP [#+/ #-];
: := [ #:] * CONSISTS OF A LEFT RECURSIVE LIST OF
S POSSIBLY SEPARATED BY SLASHES
ENCLOSED IN SQUARE BRACKETS;
2.1.10 Seeded Lists
Any of the list structures, the syntax for which appears in figure 1
above, becomes a seeded list when followed by a list seed. The syntax of the
list seed is
{ #STARTING / #BEGINNING } #WITH <seed> ,
where <seed> may be any TWINKLE symbol, or group of symbols, enclosed in
square brackets, with the exception of enclosures and maybe symbols.
The seed list may be used to indicate that the first element of a
particular list is distinguished in some way from the rest. For example,
the syntax of an ALGOL block may be written:
A <BLOCK> CONSISTS OF A POSSIBLY EMPTY LIST OF <STATEMENT>S
SEPARATED BY SEMICOLONS STARTING WITH A POSSIBLY EMPTY
LIST OF <DECLARATION>S SEPARATED BY SEMICOLONS ENCLOSED
IN #BEGIN AND #END
Note that, since the list seed may not be an enclosure, the enclosure operator
applies to the entire seeded list and not simply to the list of <DECLARATION>S.
2.2 Semantic Symbols
The TWINKLE symbols and constructs described up to this point may
be used in the syntactic specification of a language. A compiler, however,
must be more than simply a recognizer for a language. It must assign appro-
priate meanings (e.g., in the form of equivalent machine code) to the various
syntactic entities that it recognizes from the input stream. These meanings,
taken as a whole, make up the semantic description of the language or, more
simply, the semantics of the language. In the TWS, the semantics is written
in the Illinois Semantics Language (ISL), a complete description of which is
provided by Machado [10]. For the purposes of this discussion, it suffices
to consider the ISL semantic description as a number of individual semantic
blocks, each of which is associated with a semantic name through which it may
be accessed by the parser. Before describing the manner in which the parser
is directed to initiate a semantic block, it is necessary to consider the pos-
sible results of the execution of a semantic block.
2.2.1 Actions and Tests
Based on their effect on the parser, semantic blocks may be divided
into two groups: semantic actions and semantic tests. Actions have no direct
effect on the parser and are used primarily for such functions as table
creation, code emission, etc. Tests, on the other hand, are actually more of a
syntactic character than a strictly semantic character. Thus, a test is
called by the parser when the nature of a particular entity is syntactically
undeterminable and requires investigation on a higher level. For example,
in an ALGOL assignment statement a boolean variable must be assigned the
value of a boolean expression while an integer or real variable must be
assigned the value of an arithmetic expression. Since the declaration in
which the type of the variable in question was determined may precede its
usage by an arbitrary amount, and since an arithmetic expression and a
boolean expression may, in general, be identical for arbitrarily many symbols,
the parser is unable to determine whether an arithmetic or boolean assignment
is being made. The question is resolved by calling a test which compares
the variable with tables made when the declarations were recognized. The
result of this comparison is communicated to the parser which then is able
to proceed correctly with the parse.
Communication between test and parser is by means of the globally
declared boolean variable, SEMANTICTEST. If the value of SEMANTICTEST after
the execution of the test is true, the parser continues along the
indicated branch of the parsing tree. Otherwise the parser abandons this
branch and must decide among the remaining branches, possibly invoking further
tests in the process. It is important, therefore, that each block, which may
be called as a test, set SEMANTICTEST at some point in its execution. If
this is not done, the parser will determine its future course of action
from the previous value of SEMANTICTEST with the attendant possibility of
an erroneous parse. Since any block may be called, both as an action and as
a test, at different times during the parse, it is permissible for an action
to set SEMANTICTEST although the variable is disregarded under these circum-
stances.
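The protocol between test and parser can be pictured with a short Python sketch. It is an assumption-laden illustration rather than TWS code: only the name SEMANTICTEST comes from the text, while DECLARED_TYPES, current_variable, choose_branch and the two test routines are hypothetical.

```python
SEMANTICTEST = False          # global flag shared by tests and the parser
DECLARED_TYPES = {"B": "boolean", "X": "real"}   # hypothetical declaration table
current_variable = None       # hypothetical: the variable under inspection


def is_boolean_variable():
    """A semantic test: it always sets SEMANTICTEST before returning."""
    global SEMANTICTEST
    SEMANTICTEST = DECLARED_TYPES.get(current_variable) == "boolean"


def is_arithmetic_variable():
    """A second test, likewise setting SEMANTICTEST on every call."""
    global SEMANTICTEST
    SEMANTICTEST = DECLARED_TYPES.get(current_variable) in ("integer", "real")


def choose_branch(variable, branches):
    """Try each (name, test) branch in order; keep the first whose test passes."""
    global current_variable
    current_variable = variable
    for name, test in branches:
        test()
        if SEMANTICTEST:
            return name
    return None


BRANCHES = [("boolean assignment", is_boolean_variable),
            ("arithmetic assignment", is_arithmetic_variable)]
print(choose_branch("B", BRANCHES))
print(choose_branch("X", BRANCHES))
```

If one of the test routines returned without assigning SEMANTICTEST, choose_branch would act on whatever value the previous test left behind, which is the erroneous parse warned of above.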
An indication to the parser of the points in the syntactic recognition
at which a particular action or test must be invoked is given by placing
semantic call symbols at appropriate points in the syntax. These calls may
have any of the three forms described below.
2.2.2 Simple Calls
Simple action (test) calls consist of the special symbols, "@S" ("@T"),
followed by the name of the action (test) being called. Any string of digits
or alphanumeric characters beginning with a letter may be used as a name.
A particular semantic block may be called from as many places in the syntax
as desired and may be called as a test at one point while being called as an
action at another point. Several forms of the simple call linger from earlier
versions of the TWS. Thus, in addition to "@S", an action name may be pre-
ceded either by "@Q" or "#", and a test name may be preceded by "#" and
enclosed in either quotes or angle brackets. When "#" is used in an action
call, the action name must be a string of digits. The reason for this is that
the call would appear to be a special word if the action name were to begin
with a letter.
2.2.3 Declaration Calls
Any of the simple calls (except the "#" form) may be
extended into a declaration call. The action or test name is followed by a
colon and then a description of the action or test in ISL code. This descrip-
tion, or declaration, must be enclosed either in square brackets, or in the
special words, BEGIN and END. Each name may be declared no more than once in
this way, although such a declaration is not necessary. Any name so declared
may be used in the syntax in simple calls, both before and after the appearance
of the declaration. When the semantic block in question is brief, the overall
clarity of the linguistic description may be considerably enhanced by using the
declaration call.
2.2.4 Implicit Calls
When a block is of such a nature that a declaration call would be
appropriate, and yet is used in only one place, it is clearly not necessary
to give a name to that block. Under these circumstances the name and colon
in the declaration call may be omitted, thereby creating an implicit call. For
each implicit call in the syntax, the TWINKLE translator generates a unique name
through which the relevant block of code may be referenced by the ISL translator.
2.2.5 Parameterized Semantic Calls
Semantic calls, with the exception of implicit calls, may be modified
by a list of integer constant parameters separated by commas enclosed in paren-
theses and placed immediately following the action or test name. These constants
are used by the parser to set a group of global variables (the array row PARAM)
that may be referenced by the semantic routine when it is called. This is
frequently very useful when a number of portions of the syntax, which would
otherwise require different semantic actions, may be serviced by a single,
appropriately parameterized, action. For example, in recognizing written out
characters, TWINKLE employs a single semantic action whose parameter is the
internal code number of the symbol recognized.
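A minimal Python sketch of this mechanism follows. Only the name PARAM is taken from the text; the size of the row, the helper call_action, the action emit_character and the parameter values are assumptions chosen for illustration.

```python
PARAM = [0] * 4        # hypothetical size for the global parameter row
emitted = []           # stands in for whatever the action produces


def emit_character():
    """One action serving many call sites: PARAM[0] holds the character code."""
    emitted.append(PARAM[0])


def call_action(action, *constants):
    """Load the constants of the call into PARAM, then run the semantic block."""
    for index, value in enumerate(constants):
        PARAM[index] = value
    action()


call_action(emit_character, 17)   # a call written with the parameter list (17)
call_action(emit_character, 42)   # the same action, called from another place
print(emitted)                    # [17, 42]
```

The single routine replaces what would otherwise be one nearly identical action per character.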
2.2.6 Bit Actions and Tests
Frequently, a semantic action involves nothing more than the setting of
a single bit. Similarly, a semantic test is frequently based on the condition of
such a bit. Calling a semantic block to perform these manipulations requires a
disproportionate amount of overhead and it was, therefore, considered appropriate
to introduce special action and test types specifically for performing bit opera-
tions. The syntax of the bit action is:
    #@ #S { #SET / #RESET } #BIT <bit name>
where <bit name> is either a number, identifier, or TWINKLE special word. The
designated bit is correspondingly either set or reset. The syntax for the bit test
is:
    #@ #T #BIT <bit name> { #ON / #OFF } { #SET / #RESET }
                            Condition      Action
The test is true if the designated bit is in the condition specified (i.e., ON or
OFF). The default condition is ON. If the test is true the indicated action is
performed on the designated bit. Up to 48 different bit names may be used by the
language designer. These are assigned by the TWINKLE translator to the 48 bits of
the variable, ACTIONBITS.
2.2.7 The Tail
In most languages it will not be desirable to declare each block
(implicitly or otherwise) directly in the syntax specification. Also, all
but the simplest of semantics will require a number of variables and procedures
declared globally to the individual semantic blocks. For the sake of complete-
ness, these global declarations and undeclared blocks may be enclosed between
the special words, BEGIN and END, (thus forming the semantic tail) and appended
to the syntax specification. This tail will then be passed directly to the
ISL translator upon completion of the TWINKLE translation. In this way a
language may be completely processed by the TWS from a single complete specifi-
cation of the language. During the debugging phase of the language development,
it is more natural to process the syntax and semantics separately. The
details of coordinating the ISL translator with the rest of TWS in an
independent run are discussed by Machado [10].
2.2.8 Placement of Calls
A semantic call may appear anywhere in the syntax that a syntactic
symbol may appear, except at the beginning of an alternative. Thus, a semantic
call may not appear immediately after the arrow of a production, the left square
bracket of a square bracket construct, or the separator ("/" or OR) of a
list of alternatives. The reason for this is that a semantic call cannot be
made by the parser until it has determined exactly what stage the parse has
reached. Clearly the parser cannot, in general, determine this at the
beginning of an alternative.
Unfortunately, there is a good deal more to placing semantic calls
than simply knowing where they will be legal. The ideal time to place them
would be after the FPL form (see chapter 3 and the paper by Beals [3]) had
been generated. The condition of the stack and the phase of the parse
would then be known explicitly. It is, however, fairly straightforward to
place them directly into BNF as was done in earlier versions of the TWS. The
problems present in TWINKLE, with respect to call placement, arise chiefly from
the complex structures (lists, enclosures, etc.) available and from the large
amount of grammar transformation that is inherent in TWINKLE translation.
Thus, virtually all of the TWINKLE constructs not present in BNF employ some
form of TWINKLE generated nonterminals in their implementation. Because of
this, the configuration of the stack at the moment of a semantic call may not
be easy to determine. The following guidelines will be helpful in creating
and placing semantic calls to achieve a given end:
1. A semantic call is made following the recognition of the symbol
or construct at the same bracket nesting level immediately
preceding its occurrence in the syntax. For example, if one
writes
::= LIST SEP @S1 ,
semantic action "1" will be called after the entire list
has been recognized and not after each separator. The
latter effect may be achieved by
::= LIST SEP [ @S1]
2. A semantic routine should not reference the stack for symbols
at the same nesting level as its call — provided that the call
and the symbol are separated by one of the non-BNF TWINKLE
constructs. For example, in the TWINKLE production,
: := LIST @S A ,
the semantic action, "A", may reliably reference the nonterminals,
and , but may not reference the nonterminals, and ,
from which its call is separated by the list construct.
3. A semantic routine should not reference symbols which occur at
different nesting levels from its call. The only exception to
this rule is the case in which a semantic call immediately follows
the right square bracket of a square bracket construct. The call
is then, in essence, copied onto the end of each of the alternatives
within the square brackets.
A detailed example showing the placement of semantic calls in a TWINKLE
grammar for a subset of ALGOL is provided by Machado [10].
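The exception stated in the third guideline, in which a call following a right square bracket is in essence copied onto the end of each alternative, can be pictured as a simple rewriting step. The Python sketch below is only an illustration: the list-of-lists representation, the name distribute_call and the nonterminal <TERM> are assumptions, not TWINKLE internals.

```python
def distribute_call(alternatives, call):
    """[a1 / a2 / ...] call  becomes  [a1 call / a2 call / ...]."""
    return [alternative + [call] for alternative in alternatives]


group = [["#+", "<TERM>"], ["#-", "<TERM>"]]
for alternative in distribute_call(group, "@S1"):
    print(" ".join(alternative))
```

After this step the call sits at the same nesting level as the symbols of each alternative, which is why the routine may then reference them.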
2.3 Null and Empty Symbols
Several forms of context analysis which are performed automatically
by the TWS must be provided by the user under the TBNF system. The three
special words (BACK, AHEAD, and NOT), which are used in TBNF to provide con-
textual information, have no meaning to the BNF half of the TWS. Therefore,
when translating into BNF, these special words and the constructs that they
herald are meaningless to TWINKLE and are referred to as null symbols. In
addition to these null symbols, TWINKLE provides two forms of comments which
are meaningless and therefore qualify as null symbols. Any string of symbols
enclosed in parentheses, the left-most parenthesis of which is not preceded
by either a sharp or a semantic call, constitutes a comment and is deleted
by the scanner. Any string of symbols preceded by the special words, COMMENT
or C, and not including the symbols [, ], ;, ., or the special words, BEGIN and
END, also constitutes a comment.
An empty symbol denotes a string of zero length. It is written as
either one of the special words, EMPTY or LAMBDA, or as an adjacent pair of
left and right nonterminal parentheses (i.e., <> or "").
3. CONTROL OF THE TWINKLE TRANSLATION
The TWINKLE translation is merely the first step in a chain of opera-
tions undertaken in generating a compiler for a language. Control options
specified in the TWINKLE input may be intended for use in a later phase of
the TWS. To make the meaning of these options clear a brief description of
the entire TWS is now given.
3.1 The Translator Writing System
Figure 2 presents a block diagram of the interrelations between the
programs which make up the Translator Writing System when creating a compiler
for a language, L. As indicated, the TWS can generate either a recursive
descent compiler or a deterministic Floyd production compiler, the decision
being made by the user through the PARSER control card. Consideration will
be given first to the Floyd production section of the TWS which comprises the
TWINKLE translator, the ISL translator (ISLTRAN), BNF2FPL, FPL2PAR, PAR2ALG
and finally, the ALGOL compiler.
A unified syntactic and semantic description of L is provided as
input to the TWINKLE translator. The translator extracts the syntactic
information, which it transforms into BNF and places, together with several
other tables, in a disk file labelled L/TABLESF. Similarly, the semantic portion
of the input is placed in file, L/ACTIONS, for use by ISLTRAN. The TWINKLE
translator then initiates execution of both BNF2FPL and ISLTRAN. The BNF
syntax of L is transformed by BNF2FPL into Floyd productions (FPL) which are
placed in L/FLOYDP while additional tables are placed in L/TABLESF and the
control information in the first record of L/TABLESF is updated. BNF2FPL
then initiates execution of FPL2PAR which transforms the FPL syntax from
L/FLOYDP into a stream of pseudo-orders which are returned to L/TABLESF.
The control information in the first record is updated.
[Figure 2: block diagram of the Translator Writing System]
;
where the language name is any string of letters and digits beginning with a
letter. These characters, or the first seven if there are more, are used as
a prefix for all the interlinking and output files generated by the TWS,
including the TWINKLE translator.
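The prefix rule can be stated in a few lines of Python. This is a sketch under assumptions: the function names and the example language name MINIALGOL are hypothetical, and the slash form of the file name simply follows the L/TABLESF pattern used in the text.

```python
def file_prefix(language_name):
    """First seven characters of the language name (all of it if shorter)."""
    if not (language_name[:1].isalpha() and language_name.isalnum()):
        raise ValueError("name must be letters and digits, starting with a letter")
    return language_name[:7]


def file_name(language_name, suffix):
    """Name of one interlinking file, e.g. the TABLESF file of a language."""
    return f"{file_prefix(language_name)}/{suffix}"


print(file_name("MINIALGOL", "TABLESF"))   # MINIALG/TABLESF
```

Two languages whose names agree in their first seven characters would therefore share file names under this reading.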
3.2.2 Print Options
The print control statement consists of the special word, PRINT,
followed by a colon, followed by a list of print options separated by commas,
followed by a semicolon. The options available are defined below.
1. TABLES SIFESF: causes the printing of a table displaying the sizes and
locations of all of the tables in the file, TABLESF.
2. TERMINALS ALPHABETICALLY, or TRMALF: causes the printing of an alphabetic
list of all of the special words used in the language. If BNF is being
generated then the list includes an index of each occurrence of the
special words in PROTAB.
3. TERMINALS NUMERICALLY, or TRMNUM: causes the printing of a numerically
ordered list of all of the special words used in the language.
4. CHARACTERS, or TRMCHR: causes the printing of an index of the occurrences
of the 64 characters in PROTAB.
5. TERMINALS: is equivalent to 2, 3, and 4 taken together.
6. NONTERMINALS ALPHABETICALLY, or NTALF: causes the printing of an alphabetical
list of all of the nonterminals used in the language. If BNF is
generated, the list includes an index of each occurrence of the
nonterminals in PROTAB.
7. NONTERMINALS NUMERICALLY, or NTNUM: causes the printing of a numerically
ordered list of all of the nonterminals used in the language.
8. NONTERMINALS: is equivalent to 6 and 7 taken together.
9. SYNTAX, or INPUT: causes the printing of the TWINKLE input as it is read.
10. INDEX, or XREF, or CROSS REFERENCE: causes the printing of an index of
occurrences of all nonterminals, terminals, and actions in the syntax
by card number.
11. ACTIONS ALPHABETICALLY, or ACTALF: causes the printing of an alphabetical
list of all of the semantic actions and tests used in the language. If
BNF is generated, the list includes an index of each occurrence of
the actions and tests in PROTAB.
12. ACTIONS NUMERICALLY, or ACTNUM: causes the printing of a numerically
ordered list of all of the semantic actions and tests used in the language.
13. ACTIONS: is equivalent to 11 and 12 taken together.
14. PROTAB: is the name of the table into which TWINKLE places the BNF
equivalent of TWINKLE syntax in the input. This option causes the
printing of this table.
15. FLOYD: BNF2FPL transforms PROTAB into a set of Floyd productions in the
disk file, L/FLOYDP. This option causes the printing of these Floyd
productions.
16. COMBINED GROUPS: causes the printing of the components of all of the
combined groups required by the language. For a discussion on the use
of combined groups, see the paper by Beals [3].
17. PARSER: FPL2PAR transforms the Floyd productions in FLOYDP into a stream
of pseudo-orders which make up the parser. This option causes the
printing of this stream of pseudo-orders.
18. PATTERNS: causes the printing of a table of the patterns
created in the TWINKLE translator processing of L, as well as any
additional patterns created by either BNF2FPL or FPL2PAR.
19. STANDARD: is the union of options 1, 5, 8, 9, 10, 13, 14, and 18.
20. DEBUGGING, or DEBUGN: is the union of options 15, 16 and 19, i.e., of
everything but option 17.
21. EVERYTHING: this is the union of all options.
22. NOTHING: this option, when used by itself, inhibits all printing.
If no print control statement appears, the print options are set
to the default option, STANDARD.
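The composite options can be checked mechanically by expanding them into the basic options they stand for. The Python sketch below uses the option numbers of the list above; the names COMPOSITES, BASIC and expand are assumptions introduced for the example.

```python
COMPOSITES = {
    5: {2, 3, 4},                         # TERMINALS
    8: {6, 7},                            # NONTERMINALS
    13: {11, 12},                         # ACTIONS
    19: {1, 5, 8, 9, 10, 13, 14, 18},     # STANDARD
    20: {15, 16, 19},                     # DEBUGGING
}
# Options that print something themselves rather than naming other options.
BASIC = {1, 2, 3, 4, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17, 18}


def expand(option):
    """Return the set of basic options implied by an option number."""
    if option in COMPOSITES:
        result = set()
        for member in COMPOSITES[option]:
            result |= expand(member)
        return result
    return {option}


print(sorted(BASIC - expand(20)))   # [17]
```

The result agrees with the remark that DEBUGGING covers everything but option 17, the parser listing.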
3.2.3 The Parser Type Option
As mentioned several times above, the TWS is equipped to produce
compilers based on either recursive descent or Floyd production language parsers.
It is appropriate, therefore, to have a control statement for determining which
is to be generated. The relevant control statement is :
<PARSER TYPE> PARSER;
where <PARSER TYPE> is either RECURSIVE DESCENT or FLOYD PRODUCTION.
3.2.4 Zip Control
In Burroughs B-5500 ALGOL it is possible for one ALGOL program
to initiate execution of another by executing a zip statement (i.e., by
zipping to the other program). The component programs of the TWS use the
zip statement to initiate their successors. In normal operation zipping
continues through final compilation by the ALGOL compiler. Frequently, a
user does not desire execution of the entire TWS but may wish, for example,
to check just the syntax, or just the semantics, of the input. This possibility
is allowed for in the TWS by the zip control statements which are listed below:
ZIP TO ISL;
DONT ZIP TO ISL;
DONT ZIP;
ZIP THROUGH <PROGRAM NAME>;
where
<PROGRAM NAME> ::= TWST/BNF2FPL/FPL2PAR/PAR2ALG/ISL/ALGOL.
The use of these control statements is self-evident.
3.2.5 Program Parameter Control
Each of the programs in the TWS has certain program parameters
which are normally assigned default values that permit compiler generation
for many small languages. It is possible, however, that a particular language
may require more execution time, a larger stacksize, or a higher B-5500 core
memory estimate to run successfully through some phase of the TWS. Corres-
pondingly, it may be desirable when processing some smaller languages to
decrease the values of some of these program parameters. This can be done with
the three program parameter control statements shown below:
PRIORITY = <*N>;
CORE = <*N>;
STACK = <*N>;
where the program name was defined in the last section and <*N> is a positive
integer. These set the priority (and, implicitly, the time limit), the core
estimate, and the stacksize, respectively, of the program designated. These
parameters are then used in zipping to the program. If the program name
is COMPILER, the parameters are passed to the ALGOL compiler and become
the default parameters for the generated language compiler.
3.2.6 Executable Compiler Options
It was noted in section 3.1 that a Floyd production parser generated
by the TWS may be, to a greater or a lesser extent, an executable parser.
The default option is a parser which is wholly interpretive, but an executable
version of any of the three parser sections may be requested by use of the
control statements shown below:
EXECUTABLE LOOKAHEAD;
EXECUTABLE FILL TABLES;
EXECUTABLE FLOYD PRODUCTIONS;
If the lookahead and fill tables portions of the parser are interpretive,
the resultant compiler, L/DISK, may only be executed if L/TABLESF is resident
on disk. By making these two portions executable, the parser becomes
a self-contained unit and compilation in L requires only L/DISK.
3.2.7 Miscellaneous Control Options
CLOSE, CLOSE LP, CLOSE LINEPRINTER, or CLOSE LINE PRINTER: applies
to BNF2FPL; it causes a separate file of output to be created each time an
error occurs during execution of BNF2FPL. In this way a user can ascertain
the cause of some errors before BNF2FPL runs to completion.
LONG LOOKAHEAD: applies to BNF2FPL; it specifies a four symbol
lookahead to be used in differentiating before deciding that the group cannot
be built. If the Floyd productions of a group being generated cannot be
differentiated by a three symbol lookahead and if combination is not possible,
the group is not normally built and the BNF2FPL translation fails. In
practice it has been found that when a lookahead of three symbols fails, no
additional amount of lookahead will help.
COMBINE FIRST: applies to BNF2FPL; it specifies that Floyd production
combination be attempted after a one symbol lookahead has failed to
differentiate, but before attempting a two or three symbol lookahead; if
combination is not possible, two and three symbol lookaheads will be attempted
before abandoning the group.
FLOYD PRODUCTIONS PER PROCEDURE: <*N>: applies to PAR2ALG. When
creating an executable parser, PAR2ALG generates procedures, each containing
some specified number of the Floyd productions of the language. This number
is typically 100, but may be set by the language designer to any desired
value.
GROUPS PER PROCEDURE: <*N>: applies to PAR2ALG; determines the
number of groups of Floyd productions in each executable parser procedure.
PROGRAM SYMBOL: is followed by a nonterminal name,
which is taken to be the unique objective symbol of the language in question;
if this option is not used, the first nonterminal to appear in the syntax
specification is taken as the unique objective symbol for the language.
SPECIAL SYMBOLS: may be used to force a
particular ordering of the special words of the language which are otherwise
numbered in the order in which they first appear in the syntax.
3.3 Burroughs B-5500 Control Cards for Executing TWINKLE
The TWINKLE translator is executed like a compiler on the B-5500
system. When the syntax to be translated is on cards, the following deck
set up may be used:
? USER = Language designer's user code
? COMPILE A/B WITH TWINKLE LIBRARY
? DATA CARD
input syntax
? END.
Since TWINKLE does not create executable code, the file, A/B, is not used,
and the name may be specified arbitrarily by the language designer. Because
this file is not used, either of the following forms may be used when the
input syntax is a file on disk, say PL1/SYNTAX:
? USER = Language designer's user code
? COMPILE A/B WITH TWINKLE LIBRARY
? TWINKLE FILE CARD = PL1/SYNTAX SERIAL
? END.
or
? USER = Language designer's user code
? COMPILE PL1/SYNTAX WITH TWINKLE LIBRARY
? END.
In the latter case, TWINKLE discovers that the input is not on cards and that
no file has been equated to file, CARD. It then investigates the code file and,
if it exists on disk, takes it as the file, CARD. In the former case a file
has been equated to file, CARD, so this is taken as the input syntax. In this
case, the code file, A/B, is not used and may be named arbitrarily.
4. IMPLEMENTATION OF THE TWINKLE TRANSLATOR
The TWINKLE translator has been implemented with the TWS in a
bootstrapping fashion. The preliminary version of the translator was written
in BNF and processed on the portion of the TWS then existing, which was
essentially equivalent to the BNF2FPL, FPL2PAR and PAR2ALG stages of the
current TWS. Each subsequent revision to the TWINKLE translator was imple-
mented with the aid of its predecessor. Thus, although the present syntax
is much more sophisticated than the initial syntax, it is also shorter and
considerably more readable. The following sections detail some of the
salient features of the TWINKLE translator.
4.1 NONTAB, SYMTAB and OPRTAB
As each nonterminal is read from the input syntax, its name is com-
pared against all those presently entered in NONTAB. If a match is found,
the corresponding nonterminal number is extracted from the relevant
field of the header word for the matching table entry. If no match is dis-
covered, a new entry is made. The entries are linked through the header
words in a binary tree which is alphabetically ordered by the nonterminal
names. The format of the entries is shown below in figure 3. Associated
with each new entry into NONTAB is an entry into NTINDX pointing to the header
word of the nonterminal in NONTAB which facilitates printing out the non-
terminal names when necessary. Also, if the CROSS REFERENCE print option has
been activated by the user, the NTINDX word for a given nonterminal contains
a pointer to the base of an inter-linked list of the occurrences of that
nonterminal in the input syntax by line number. The actual repository
for this list, as well as those for the other nonterminals, terminals, and
action symbols from the input syntax, is an array called OVERALLINDEX, each
of whose entries contains a line number, a bit showing whether the specified
occurrence was on the right or left hand side of a production, and a pointer
to the entry for the next occurrence of the item in whose list the entry
resides.
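The lookup just described can be sketched in Python as a binary tree keyed on the nonterminal name. This is an illustration only: the class and method names are hypothetical, Python object references stand in for the pointer fields of the header word, and a plain list per entry replaces the linked list kept in OVERALLINDEX.

```python
class Entry:
    """One table entry; the attributes loosely mirror the header word."""

    def __init__(self, name, number):
        self.name = name
        self.number = number        # plays the role of the nonterminal number
        self.left = None            # entries that sort before this name
        self.right = None           # entries that sort after this name
        self.occurrences = []       # (line number, on left hand side?) pairs


class NonterminalTable:
    def __init__(self):
        self.root = None
        self.index = []             # like NTINDX: number -> entry

    def lookup(self, name, line, on_lhs):
        """Find or create the entry for name and record this occurrence."""
        parent, node = None, self.root
        while node is not None and node.name != name:
            parent, node = node, (node.left if name < node.name else node.right)
        if node is None:
            node = Entry(name, len(self.index))
            self.index.append(node)
            if parent is None:
                self.root = node
            elif name < parent.name:
                parent.left = node
            else:
                parent.right = node
        node.occurrences.append((line, on_lhs))
        return node.number


table = NonterminalTable()
for name, line, on_lhs in [("STATEMENT", 1, True), ("BLOCK", 1, False),
                           ("STATEMENT", 2, False)]:
    print(name, table.lookup(name, line, on_lhs))
```

The separate index list gives direct access by number, which is what makes printing the names in numerical order straightforward.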
Figure 3. An entry in the NONTAB table
Figure 3 shows the details of a single entry in NONTAB. Consider,
first, the header word. Nonterminals (with the exception of the unique
objective symbol) used in the syntactic input must appear on the left hand
side of at least one production and on the right hand side of at least one
(not necessarily different) production. The INLHS bit is set on recognizing
the nonterminal as the left hand side of a production and the INRHS bit is
set on recognizing it in the right hand side of a production. These bits are
checked at the conclusion of syntax input and any discrepancies are reported
as errors on the TWINKLE output file, LINE. The SYMBOLVALUE field contains
the serial number (code) of the nonterminal symbol represented by this
NONTAB entry. Given a nonterminal name of k characters, n of the WORDS
field and m of the CHARS field are given by n = [k/l6] and m = k - 6* (n+l).
These two fields, taken together, determine the extent of useful information in
the remaining words of the NONTAB entry. Finally, the LEFTPOINTER and RIGHT-
POINTER fields contain pointers to subsequent entries in the alphabetic binary
tree which NONTAB comprises. The remaining words in the NONTAB entry consist
of n words containing 6 characters each of the nonterminal name right justified
with 2 unused characters at the left, and, in the last word, 2 unused charac-
ters, m characters from the nonterminal name, and blanks filled to the right.
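The check made with the INLHS and INRHS bits at the conclusion of syntax input can be sketched as follows. The Python is a hedged illustration: two sets stand in for the two bits, the function name and the error wording are assumptions, and the exemption of the objective symbol is applied only to the right hand side requirement, which is one reading of the rule.

```python
def check_usage(productions, objective):
    """productions: list of (left hand side, [right hand side nonterminals])."""
    in_lhs, in_rhs = set(), set()
    for lhs, rhs in productions:
        in_lhs.add(lhs)            # corresponds to setting INLHS
        in_rhs.update(rhs)         # corresponds to setting INRHS
    errors = []
    for name in sorted(in_lhs | in_rhs):
        if name not in in_lhs:
            errors.append(f"{name} never appears on a left hand side")
        if name not in in_rhs and name != objective:
            errors.append(f"{name} never appears on a right hand side")
    return errors


grammar = [("PROGRAM", ["BLOCK"]), ("BLOCK", ["STATEMENT"]), ("ORPHAN", [])]
for message in check_usage(grammar, "PROGRAM"):
    print(message)
```

In this example STATEMENT is used but never defined and ORPHAN is defined but never used, so both are reported.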
Corresponding to NONTAB and NTINDX for nonterminal storage
are the pairs of tables (SYMTAB, STINDX), and (OPRTAB, OTINDX) for storing
special symbols, and semantic symbols, respectively. As mentioned above, the
line by line index information for both these types of symbols is stored in
OPERALLINDEX along with that for nonterminals. While STINDX and OTINDX are
identical counterparts to NTINDX, entries in SYMTAB and OPRTAB differ slightly
from those in NONTAB and, in fact, from one another. In the case of SYMTAB,
there is, clearly, no need for the INLHS and INRHS bits since a special symbol,
if it appears at all, must appear on the right-hand side of a production.
Consequently, in the header words for SYMTAB, these bits are included as a
portion of the SYMBOLVALUE field. In OPRTAB header words it is also clear
that the INLHS and INRHS bits are unnecessary, but here the first bit is unused
and the second bit becomes the USED bit denoting an action or test symbol
that has been declared in the syntax. This bit is read by ISL/DISK to deter-
mine which actions it must get from the file,/ACTIONS . It is
also used by TWINKLE to catch duplicate declarations of the same semantic name.
4.2 PROTAB, PRODS and PDLIST
The primary table into which TWINKLE collects the BNF productions which
it produces from its TWINKLE input is PROTAB. Figure 4 shows the fields of a
PROTAB word.
[Figure 4. PROTAB word format. Fields shown: FLAGS, NEXT, LHS, TYPE, ENTRY
and SYMBOL; bit positions marked: 6, 12, 18, 30 and 36.]
The FLAGS field comprises a set of six one-bit flags carrying
various pieces of information. These flags are referred to as: IREC, NOBACK,
TRMDER, LASTNT, SFLAG and REC. The IREC flag is set only in the first word
of a production and denotes an indirectly left recursive production. That is,
a production of the form:
     <A> ::= <B> α
for which <A> is a headsymbol of <B>.
When the NOBACK flag is set, the symbol in the SYMBOL field is not
to be back-substituted into this PROTAB location. If the symbol in the SYMBOL
field has a terminal derivation, then the TRMDER flag is set. The LASTNT
flag is set when the symbol in the SYMBOL field is a nonterminal and, except
for possible trailing semantic symbols, is the last symbol in the production.
The SFLAG flag heralds a semantic symbol as the next symbol in the production.
Finally, the REC flag is set in the first symbol of a production if the symbols
in the LHS and SYMBOL fields are the same (i.e., if the production is left
recursive). The remaining fields are: the NEXT field which gives the number
of words to the beginning of the next production, the LHS field which contains
the number of the nonterminal being defined, and the TYPE and ENTRY fields of
this particular right-hand side symbol.
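
As an illustration of how these fields fit together, the following sketch
models a PROTAB word in Python. It is not the original B5500 layout: the bit
positions of the individual flags, the TYPE code NONTERMINAL and the helper
first_word are all assumptions made for the example. Only the REC rule stated
above is implemented.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Flags(Flag):
    """The six one-bit flags; the ordering here is an assumption."""
    IREC = auto()
    NOBACK = auto()
    TRMDER = auto()
    LASTNT = auto()
    SFLAG = auto()
    REC = auto()


NONTERMINAL = 1  # hypothetical TYPE code for nonterminal symbols


@dataclass
class ProtabWord:
    flags: Flags
    next_offset: int  # NEXT: words to the beginning of the next production
    lhs: int          # LHS: number of the nonterminal being defined
    sym_type: int     # TYPE of this right-hand side symbol
    sym_entry: int    # ENTRY of this right-hand side symbol


def first_word(lhs: int, sym_type: int, sym_entry: int, length: int) -> ProtabWord:
    """Build the first word of a production, setting REC when LHS == SYMBOL."""
    flags = Flags(0)
    if sym_type == NONTERMINAL and sym_entry == lhs:
        flags |= Flags.REC  # the production is left recursive
    return ProtabWord(flags, length, lhs, sym_type, sym_entry)
```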
TWINKLE performs a number of grammar transformations on a local level.
Frequently more than one production is being built simultaneously, as happens
when nested definitions are being translated, or when lists are being
implemented. These difficulties make it very cumbersome for TWINKLE to put
productions directly into PROTAB. To circumvent these problems TWINKLE uses a
256 element entry table (PRODS) as a directory and status table for 255
thirty-two-symbol productions which are stored in the PDLIST array. The words of
PRODS form a list linked forward through the NEXTPD fields and backwards
through the LASTPD fields. The first entry of PRODS acts as a base for the
productions currently in use. The base of available productions is given by
the integer variable, FIRSTPDAVAIL. To manipulate these two structures, TWINKLE
uses two procedures: GETPROD and GIVEUP. GETPROD is an integer-typed procedure
which has no arguments and, when called, returns the address of the next
available element of PRODS after incorporating it into the link structure and
making the necessary modifications to the various list pointers. GIVEUP has
as its sole argument the address of an element of PRODS to be removed from
the link structure and returned to the available pool. The PDLIST symbols
corresponding to PRODS(N) are PDLIST(32 × N) through PDLIST(32 × N + 31).
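
The link structure can be sketched as follows. This is a minimal model under
stated assumptions, not the original procedures: the in-use list is taken to
be circular through entry 0, new entries are inserted at its front, and the
available pool is chained through the NEXTPD fields. The class name Prods and
the helpers in_use and pdlist_range are hypothetical.

```python
class Prods:
    """Doubly linked in-use list plus a free list (assumed organisation)."""

    def __init__(self, size: int = 256):
        self.nextpd = [0] * size
        self.lastpd = [0] * size
        # Free list: 1 -> 2 -> ... -> size-1 -> 0 (0 marks the end).
        for i in range(1, size - 1):
            self.nextpd[i] = i + 1
        self.firstpdavail = 1

    def getprod(self) -> int:
        n = self.firstpdavail
        if n == 0:
            raise MemoryError("no production slots left")
        self.firstpdavail = self.nextpd[n]
        first = self.nextpd[0]          # link n in right after the base
        self.nextpd[n], self.lastpd[n] = first, 0
        self.lastpd[first] = n
        self.nextpd[0] = n
        return n

    def giveup(self, n: int) -> None:
        before, after = self.lastpd[n], self.nextpd[n]
        self.nextpd[before] = after     # unlink n from the in-use list
        self.lastpd[after] = before
        self.nextpd[n] = self.firstpdavail
        self.firstpdavail = n           # return n to the available pool

    def in_use(self) -> list[int]:
        out, n = [], self.nextpd[0]
        while n != 0:
            out.append(n)
            n = self.nextpd[n]
        return out


def pdlist_range(n: int) -> range:
    """PDLIST indices holding the symbols of PRODS(n)."""
    return range(32 * n, 32 * n + 32)
```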
In addition to serving as the link structure for the productions in
PDLIST, each entry of PRODS contains the following information about the
production with which it is associated:
(i) COMPLETE: a flag which indicates that the associated production is no
longer being extended;
(ii) LEVEL: an eight bit field recording the level of bracket nesting at
which the production originated;
(iii) LAMDA: a flag which indicates whether the production has participated
in empty context absorption;
(iv) NEXT SYMBOL: a five bit field containing a count of the symbols currently
in the production;
(v) DORNO: a three bit field which identifies whether the left-hand side
of the production is a simple nonterminal or one of the several
TWINKLE generated nonterminal types;
(vi) LHS: a twelve bit field containing the nonterminal number of the
left-hand side of the production.
Each word of PDLIST contains only the left-hand side of the production
in the fields, LHSDORNO and LHS, and one symbol from the right-hand side in
the fields: DORNO, TYPE, and ENTRY. The remaining information necessary in
PROTAB is not filled in until a production actually enters PROTAB.
Productions which are being created in PDLIST may be extended by
calling the procedure, ADDON. ADDON has two arguments: the first is the number
of the production to be extended; the second is the symbol by which it is to
be extended. If the symbol to be added is a TWINKLE -generated nonterminal
representing the alternatives of a nested definition set, the procedure
for adding it to a production is somewhat complex. Each of the alternatives
must be added on individually and, if more than one alternative is present,
new productions must be created. Thus, for example, if α represents the
string of symbols in the production being extended, and β1 through βn represent
the strings of symbols in the alternatives of the symbol being added on,
ADDON will replace the production:
     <A> ::= α
by n productions:
     <A> ::= α β1
     <A> ::= α β2
        ...
     <A> ::= α βn
At the same time ADDON removes the n productions:
     <D> ::= β1
     <D> ::= β2
        ...
     <D> ::= βn
where <D> is the TWINKLE-generated nonterminal referred to above. By expanding
nested definitions in this way, TWINKLE ensures that each of the productions
which it creates retains all of the context that the language designer
specified in the original TWINKLE production.
If the symbol being added on is any other type of symbol, say the
symbol, a, ADDON replaces the production:
     <A> ::= α
by the production:
     <A> ::= α a
where α is as above.
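
The two cases of ADDON can be sketched in Python as below. The sketch works
on a plain list of (left-hand side, right-hand side) pairs rather than on
PRODS and PDLIST, and the function signature, including the dummies argument
that identifies TWINKLE-generated nonterminals, is an assumption made for
illustration.

```python
def addon(productions, p, symbol, dummies):
    """Extend production number p by symbol; return the new production list.

    productions: list of (lhs, rhs) pairs, rhs being a list of symbols.
    dummies: names of generated nonterminals standing for nested alternatives.
    """
    lhs, alpha = productions[p]
    expand = symbol in dummies
    if expand:
        # One extension per alternative of the nested definition set.
        betas = [rhs for left, rhs in productions if left == symbol]
    else:
        betas = [[symbol]]
    result = []
    for i, (left, rhs) in enumerate(productions):
        if i == p:
            result.extend((lhs, alpha + beta) for beta in betas)
        elif not (expand and left == symbol):
            result.append((left, rhs))  # the dummy's own productions are dropped
    return result
```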
When it has been ascertained that production, P, is completed and is
ready to be put into PROTAB, the procedure, PUTINPROTAB, is called with the
parameter, P. PUTINPROTAB fills in the NEXT and FLAGS fields of the production
and writes it directly into the next available locations in PROTAB. Once this
has been accomplished the production is returned to the free pool.
4.3 PRODSTACK and LPSTACK
The TWINKLE language offers the user only two recursive constructs.
These are nested definitions and list structures, which are implemented with
PRODSTACK and LPSTACK, respectively. These are dimensioned to allow nesting
of either definitions or lists to a depth of thirty, but this may, of course,
be easily altered in the unlikely event that it needs to be. The formats of
words in these two stacks are shown below (in figures 5 and 6, respectively).
[Figure 5. The format of a PRODSTACK entry. Fields shown: SYMBOL, EXSYMBOL.]
[Figure 6. The format of a LPSTACK entry. Fields shown: SEP, LISTTYPE,
SEPDORNO, SEPTYPE, SEPENTRY, SEPSYMBOL, EXSEPSYMBOL, LBDORNO, LBTYPE, LBENTRY,
LBSYMBOL, EXLBSYMBOL.]
PRODSTACK is simply a stack of single symbols- -the top symbol being
the left-hand side of the set of productions currently being built. Whenever
the left-hand side of a TWINKLE production is encountered, the left-hand side
nonterminal symbol is pushed into PRODSTACK. Similarly, when the left square
bracket of a square brackets construct is encountered, a TWINKLE-generated
temporary dummy (the DORNO field is set to one) is created and pushed into
PRODSTACK. Right square brackets and semicolons, which end square bracket
constructs and productions, respectively, cause the top of PRODSTACK to be
popped. PRODSTACK is used in assigning names to TWINKLE-generated permanent
dummy nonterminals in the following manner. When such a dummy is required
(e.g., in the generation of lists, see below), PRODSTACK is searched downward
from the top for a natural nonterminal (which is identified by a zero in the
DORNO field) . There must always be such an entry because PRODSTACK always
extends to the beginning of some TWINKLE production which must start with a
natural nonterminal. The alphanumeric characters which make up the name of
the nonterminal are obtained from NONTAB. The desired name is then created
by appending to these characters a blank, the characters "DUMMY", another blank,
and finally a number unique to this nonterminal. Since blanks cannot appear
in natural nonterminal names, these serve to ensure that no duplication of
nonterminal names can arise by this procedure. As an example of this, the list
structure in the TWINKLE production:
A <PROGRAM> CONSISTS OF A LIST OF S ;
would be implemented with a permanent dummy nonterminal named PROGRAM DUMMY 1.
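
The naming rule can be sketched as follows. The representation is assumed for
the example: PRODSTACK is modelled as a list of (DORNO, nonterminal number)
pairs with the top of the stack last, NONTAB as a dictionary from number to
name, and the unique number as a counter passed in by the caller.

```python
import itertools


def dummy_name(prodstack, nontab, counter):
    """Name a permanent dummy after the nearest natural nonterminal."""
    for dorno, number in reversed(prodstack):  # search downward from the top
        if dorno == 0:                         # zero marks a natural nonterminal
            return f"{nontab[number]} DUMMY {next(counter)}"
    raise ValueError("PRODSTACK holds no natural nonterminal")


# Hypothetical stack: <PROGRAM> at the bottom, a temporary dummy on top.
stack = [(0, 5), (1, 9)]
names = {5: "PROGRAM"}
first = dummy_name(stack, names, itertools.count(1))
```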
Each entry of LPSTACK eventually contains all the information neces-
sary to construct the BNF equivalent productions for the list structure which
generated it. Whenever the list type of a list structure is encountered, a
new word is pushed into LPSTACK with an appropriate setting of the LISTTYPE
bit (i.e., 1 for a nonempty list and 0 for a possibly empty list). When a
list base is recognized, the sub-fields of the EXLBSYMBOL field are set in the
top word of LPSTACK to identify the base. Similarly, recognition of a list
separator causes the subfields of the EXSEPSYMBOL to be filled in the top
word of LPSTACK; the SEP bit is set to zero for a definite separator
and to one for a possibly empty separator. If the list does not have a
separator, the EXSEPSYMBOL field is set to indicate an empty separator
and the SEP bit is set to zero.
The LPSTACK entry does not reflect the type of recursion desired for
the list, i.e., left recursive or right recursive. However, this is determined
syntactically and is transmitted to the semantics via the choice of the action
called. Given that EXSEPSYMBOL contains the symbol, S, and that EXLBSYMBOL
contains the symbol, B, figure 7 shows the productions which are generated
to implement the list for various choices of SEP and LISTTYPE where <PD> is
the TWINKLE-generated permanent dummy which implements the list structure
and <TD> is a TWINKLE-generated temporary dummy. Recall that a SEP bit of
one designates a definite separator and a zero bit designates a possibly
empty separator; whereas, a LISTTYPE bit of one indicates a nonempty list and a
zero bit indicates a possibly empty list.

   List:                   possibly empty            nonempty
   Separator:          poss. empty  definite   poss. empty  definite

   <TD> ::= <PD>           yes        yes          no         no
   <TD> ::= < >            yes        yes          no         no
   <PD> ::= <PD> S B       yes        yes          yes        yes
   <PD> ::= <PD> B         yes        no           yes        no
   <PD> ::= B              yes        yes          yes        yes
                              (a) Left

   List:                   possibly empty            nonempty
   Separator:          poss. empty  definite   poss. empty  definite

   <TD> ::= <PD>           yes        yes          no         no
   <TD> ::= < >            yes        yes          no         no
   <PD> ::= B S <PD>       yes        yes          yes        yes
   <PD> ::= B <PD>         yes        no           yes        no
   <PD> ::= B              yes        yes          yes        yes
                              (b) Right

Figure 7. Left recursive and right recursive lists.
Note that possibly empty lists are characterized by the nonterminal <TD>;
whereas, nonempty lists are characterized directly by the list implementing
nonterminal <PD>.
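
The selection of productions in figure 7 can be expressed as a small
function. This is a sketch, not TWINKLE's own code: the separator and list
type are passed as booleans named by their meaning rather than as the SEP and
LISTTYPE bits, and the default names "<PD>" and "<TD>" for the permanent and
temporary dummies are placeholders chosen for the example.

```python
def list_productions(base, sep, nonempty, sep_may_be_empty,
                     left=True, perm="<PD>", temp="<TD>"):
    """Return the (lhs, rhs) productions implementing one list structure."""

    def recursive(*middle):
        # Left recursion puts the dummy first, right recursion puts it last.
        return [perm, *middle, base] if left else [base, *middle, perm]

    prods = []
    if not nonempty:
        prods.append((temp, [perm]))   # the list itself ...
        prods.append((temp, []))       # ... or nothing at all
    prods.append((perm, recursive(sep)))
    if sep_may_be_empty:
        prods.append((perm, recursive()))
    prods.append((perm, [base]))
    return prods
```

A nonempty left recursive list with a definite separator thus needs only two
productions, while a possibly empty list with a possibly empty separator
needs all five.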
4.4 Any Patterns
An any pattern comprises 384 bits (eight words of 48 bits each) for
which there exists a one-to-one correspondence between the lower numbered bits
and the terminal symbols of the language. Bits zero through 63 correspond to
the 64 characters of the Burroughs B5500 system; bits 66 through 85 correspond
to the possible special terminal classes; bits 86 and above correspond
to the special words of the language. Bits 64 and 65 are never set in an any
pattern because they correspond to a terminating character and an illegal
character, respectively. Any patterns are stored end-to-end in a 512-word
array, ANYPAT. Each bit that is set in an any pattern indicates that
the corresponding terminal is represented by the any pattern.
During the preliminary translation, any patterns are actually created
in the negative- -that is, the bits are set if the corresponding terminal is
not represented by the particular any pattern. This condition is rectified
when the syntax input has been completed. Each pattern is transmitted to the
procedure, CLEANUP, which converts it to the required form which may involve
recursive calls of CLEANUP if the any base for the pattern was a nonterminal.
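
The conversion from the negative form can be sketched with Python integers
used as bit sets. The sketch makes several assumptions: bit zero is taken to
be the least significant bit, the number of bits actually in use is passed in
as used_bits, and the recursive treatment of nonterminal any bases is left
out entirely. The function name cleanup merely echoes the procedure named
above.

```python
PATTERN_BITS = 8 * 48                  # 384 bits per any pattern
NEVER_SET = (1 << 64) | (1 << 65)      # terminating and illegal characters


def cleanup(negative: int, used_bits: int) -> int:
    """Turn a negative any pattern into the required positive form."""
    if not 0 < used_bits <= PATTERN_BITS:
        raise ValueError("used_bits out of range")
    mask = ((1 << used_bits) - 1) & ~NEVER_SET
    # Complement, then keep only bits that correspond to real terminals.
    return ~negative & mask


# Hypothetical pattern: "anything except character 3 and terminal class 70".
positive = cleanup((1 << 3) | (1 << 70), used_bits=100)
```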
4.5 Grammar Transformations
The PROTAB generated by a TWINKLE translation is not, in general, in
a form acceptable to BNF2FPL. A number of transformations must therefore be
performed on a BNF grammar to increase the probability of its acceptance to
the overall TWS. In addition, a few transformations are applied to increase
the efficiency of the resultant compiler. These are described below in
the order in which they are performed.
4.5.1 Back Context Absorption
In translating from TWINKLE into BNF, it is important that TWINKLE
retain any context in the BNF that was inherent in the original TWINKLE
production. When a left recursive list is created, it is necessary for the
non-recursive productions of the list- implementing nonterminal to absorb
any of the context prior to it that may have arisen from constructs within
the TWINKLE production that initially gave rise to the list. As may perhaps
be anticipated, problems begin to crop up when more than one list is included
in a single TWINKLE production- -as in the case of nested lists.
Each time a definition is complete (i.e., whenever a slash, right
square bracket, semicolon, or special word, OR, is encountered), the following
context-absorbing algorithm is invoked. Let {P_n | n = 1, ..., N} be the set
of productions in PDLIST and {D_n | n = 1, ..., N} be the corresponding entries
in PRODS. Further, let P_n(j) denote the j-th symbol on the right-hand side
of the n-th production and let R_n denote the corresponding left-hand side.
The right-hand sides of P_n are then scanned for the occurrence of a left-
recursive TWINKLE-generated dummy nonterminal, say P_n(j) (such a nonterminal
is characterized by a 3 in the DORNO field discussed in section 4.2). If
j > 1, or if the LAMDA bit of PRODS is set, a new production, P_k (k = N + 1),
is created as follows:
     P_k(i - j + 1) = P_n(i)     i = j, j + 1, ...
     ...
     ::= a LIST [LIST b SEP c] d LIST e
where a, b, c, and d represent terminal symbols. The productions, before
context absorption, are:
P : :: = \a d
P : : : - *»
P : : : =
P, : : : = Kb
P : : : = c b
P^ : : : = Xe
P : : : = e
where λ denotes productions, P_i, for which the LAMDA bit of D_i is set to 1.
Then:
P : : : = X *
P^ : : : = X b *
P c : ::= c b
5
P^: ::= X e
P : : : = e
Po : : : = c
P : :
P 1Q : :
P : :
- \ a
-
= b
are the productions remaining after P, (l) and P~(2) have participated
in context absorption. Note that P^l) does not participate because the
asterisk indicates that the LEVEL field of D p is equal to OLDMARKER. Since
P (l) and P 7 (l) cannot participate because they lack a X :
10'
11'
12'
13'
:
: =
\ *
:
: =
X b *
:
: =
c b
:
: =
\e *
:
: =
e
:
: =
:
: =
b
:
: =
:
: =
c e
:
* s;
\ a b
show the productions remaining after context has been absorbed from Po(3) and
2). Note that the duplication of P from P is inhibited. Finally:
P c .: :
5
P : :
:= c b
:= e
\ : ::=
P :
P 12 :
P :
P . :
= b
=
= c e
= X a b
show the results of eliminating all productions marked with an asterisk. Note
also that RD1 is no longer explicitly left-recursive but retains implicit
left-recursiveness through P and P. It is easy to see that the terminal
strings defined by the last set of productions above are exactly those
defined by P through P originally.
Context absorption is a local grammar transformation and is, therefore,
carried out in PDLIST before the productions are entered into PROTAB.
All of the remaining grammar transformations are global and are performed in
PROTAB itself. To discuss these with some facility, the notation of this
section will be altered and augmented as follows. The i-th production in
PROTAB will be denoted by P_i, its right-hand side symbols by P_i(j), where j
ranges from 1 through ℓ_i, the number of symbols on the right-hand side. The
left-hand side will be denoted by R_i. Finally, for each nonterminal n, N(n)
and T(n) will be the sets of nonterminal and terminal head symbols of n,
respectively.
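
The head symbol sets can be computed by a simple fixed-point iteration, as in
the sketch below. It assumes that a head symbol of n is any symbol that can
begin a string derived from n, that productions are given as (left-hand side,
right-hand side tuple) pairs, and that no right-hand side is empty. The
function name head_symbols is hypothetical.

```python
def head_symbols(protab, nonterminals):
    """Return (N, T): nonterminal and terminal head symbols per nonterminal."""
    heads = {n: set() for n in nonterminals}
    changed = True
    while changed:                      # iterate until nothing new is added
        changed = False
        for lhs, rhs in protab:
            first = rhs[0]
            new = {first}
            if first in nonterminals:
                new |= heads[first]     # heads of a head are heads too
            if not new <= heads[lhs]:
                heads[lhs] |= new
                changed = True
    N = {n: {h for h in hs if h in nonterminals} for n, hs in heads.items()}
    T = {n: {h for h in hs if h not in nonterminals} for n, hs in heads.items()}
    return N, T


# Hypothetical grammar: E ::= E + T | T ;  T ::= x
grammar = [("E", ("E", "+", "T")), ("E", ("T",)), ("T", ("x",))]
N, T = head_symbols(grammar, {"E", "T"})
```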
4.5.2 Empty Removal
The control of ADDON and PUTINPROTAB is such that empty symbols may
appear in PROTAB only as productions in themselves. That is, if a production
contains an empty symbol, that must be its only symbol. Even in this
relatively mild form, however, empty symbols are unacceptable to BNF2FPL and
it falls to TWINKLE to remove them by back-substituting and collapsing PROTAB
around them. PROTAB always contains the initial production:
     <GOAL> ::= ⊥ <OBJECTIVE> ⊥
where <OBJECTIVE> is the unique objective symbol of the language, and
"⊥" is a special termination terminal symbol that appears nowhere else in
PROTAB. Furthermore, there is no nonterminal for which the only production
is LAMBDA since there is a check that each nonterminal has a terminal string
derivation.
Each production, P_i, for which ℓ_i = 1 is tested to determine whether
P_i(1) is the empty symbol LAMBDA. When such a production is found, PROTAB
is scanned for occurrences of R_i. Such an occurrence, say P_n(j) = R_i,
generates a new production, P_k, at the end of PROTAB according to the rules:
(i) if ℓ_n = 1 then
     a) ℓ_k = 1
     b) R_k = R_n
     c) P_k(1) is an empty symbol;
(ii) if ℓ_n > 1 then
     a) ℓ_k = ℓ_n - 1
     b) R_k = R_n
     c) P_k(m) = P_n(m), m = 1, ..., j - 1; P_k(m) = P_n(m + 1), m = j, ..., ℓ_k.
In the first case, P_k will be picked up in its turn as an empty production.
In the second case, P_k will eventually be scanned for occurrences of R_i and
possibly generate further new productions. During the course of this
procedure new productions are compared against existing productions in PROTAB
for identity. If a match is found the new production is not entered into
PROTAB.
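
The rules above can be sketched in Python as follows. Productions are
modelled as (left-hand side, right-hand side tuple) pairs and the empty
symbol as the string "LAMBDA". The final step, which drops the empty
productions themselves once their effect has been propagated, is an
assumption about what collapsing PROTAB amounts to; the text does not spell
it out.

```python
EMPTY = "LAMBDA"


def remove_empties(protab):
    """Propagate empty productions by back-substitution, then drop them."""
    protab = list(protab)
    i = 0
    while i < len(protab):              # new productions are reached in turn
        lhs_i, rhs_i = protab[i]
        if rhs_i == (EMPTY,):
            n = 0
            while n < len(protab):      # the scan also covers new entries
                lhs_n, rhs_n = protab[n]
                for j, sym in enumerate(rhs_n):
                    if sym != lhs_i:
                        continue
                    if len(rhs_n) == 1:
                        new_rhs = (EMPTY,)                   # rule (i)
                    else:
                        new_rhs = rhs_n[:j] + rhs_n[j + 1:]  # rule (ii)
                    if (lhs_n, new_rhs) not in protab:       # identity check
                        protab.append((lhs_n, new_rhs))
                n += 1
        i += 1
    # Assumed final step: the empty productions themselves are discarded.
    return [p for p in protab if p[1] != (EMPTY,)]
```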
4.5.3 Back Substitution of Singly Defined Nonterminals
Unlike LAMBDA removal, without which PROTAB is unacceptable to
BNF2FPL, back- substitution of singly-defined nonterminals merely serves to
increase the overall efficiency of the resultant compiler by decreasing the
number of reductions required to recognize certain nonterminals.
The algorithm for back-substitution is very straightforward and
proceeds as follows. The set, A, of singly defined nonterminals is determined
by one pass through PROTAB. Then, for each nonterminal, n, in A, PROTAB is
scanned for productions, P_i, in which P_i(j) = n for some j such that
1 ≤ j ≤ ℓ_i. A new production, P_k, is then created such that:
(i) R_k = R_i;
(ii) ℓ_k = ℓ_i + ℓ_n' - 1;
(iii) P_k(m) = P_i(m),               m = 1, ..., j - 1;
      P_k(m) = P_n'(m - j + 1),      m = j, ..., j + ℓ_n' - 1;
      P_k(m) = P_i(m - ℓ_n' + 1),    m = j + ℓ_n', ..., ℓ_k;
where P_n' is the single production for which R_n' = n. The production, P_i, is
then deleted and the scan for occurrences of n continues, eventually reaching
P_k and checking for possible further occurrences of n.
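
A minimal sketch of this back-substitution is given below, again on
(left-hand side, right-hand side tuple) pairs. Two points are assumptions of
the sketch and are not taken from the text: the NOBACK flag is ignored, and a
guard skips any singly defined nonterminal whose own production mentions it,
so that the loop is certain to terminate.

```python
from collections import Counter


def back_substitute(protab):
    """Replace each use of a singly defined nonterminal by its definition."""
    protab = list(protab)
    counts = Counter(lhs for lhs, _ in protab)
    singly = [lhs for lhs, c in counts.items() if c == 1]   # the set A
    for n in singly:
        body = next(rhs for lhs, rhs in protab if lhs == n)
        if n in body:
            continue                    # guard (assumed): avoid endless growth
        i = 0
        while i < len(protab):
            lhs, rhs = protab[i]
            if n in rhs:
                j = rhs.index(n)
                # New production goes to the end and is rescanned later.
                protab.append((lhs, rhs[:j] + body + rhs[j + 1:]))
                del protab[i]           # the old production is deleted
            else:
                i += 1
    return protab
```

In this sketch the defining production of n is kept even though nothing
refers to n afterwards.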
4.5.4 Dummy Insertion
It has been mentioned, briefly, that two otherwise indistinguishable
Floyd productions may be differentiated by a look ahead of at most three symbols.
If this much look ahead is insufficient, BNF2FPL attempts to combine the
productions, deciding (in essence) that the differentiation may be postponed. Of
course differentiation must eventually be accomplished by look ahead, if at all.
There are two situations in which this combination cannot be performed. First,
a terminal symbol may not be combined with a nonterminal. Second, if A and B
are two nonterminals for which a combination is proposed, that combination
cannot be carried out if A is a head symbol of B (or, of course, if B is a
head symbol of A). The reason for this is that the parser, in the course of
looking for an A or a B, will then be satisfied by finding an A, even though
that A may, in fact, be the beginning of a B which a correct parse would
discover.
The approach to this problem in the original TWS was to look for BNF
productions of the form:
     <C> ::= α A β
     <D> ::= α <B> γ                                        (1)
where α is a nonempty string of terminals and/or nonterminals, and A is a
headsymbol of the nonterminal, <B> (A may be either a terminal or a nonterminal).
These productions were modified to: