 
 

 UIUCDCS-R-76-833
 
 Syntactic Error Recovery for LR Parsers 
 by 
 John A. Modry 
 
 October 1976 
 
 
SYNTACTIC ERROR RECOVERY 
 FOR LR PARSERS 
 
 BY 
 JOHN ARTHUR MODRY 
 B.S., University of Illinois, 1974
 
 THESIS 
 
 Submitted in partial fulfillment of the requirements
 for the degree of Master of Science in Computer Science
 in the Graduate College of the
 University of Illinois at Urbana-Champaign, 1976
 
 Urbana, Illinois 
 
 
 ACKNOWLEDGEMENT 
 
 I would like to thank my thesis advisor, Professor M.
 D. Mickunas, for both his technical help and his
 friendship. His suggestions throughout the course of this
 work proved to be invaluable and his enthusiasm was greatly
 appreciated. A special thanks is in order for the special
 effort he made to read the final drafts of this paper during
 his vacation.
 
 I would also like to thank fellow graduate students
 John Bowman and Scott Fisher for their encouragement and
 advice on many subjects, both technical and otherwise, as
 well as all my friends who have made the last six years at
 the University of Illinois very enjoyable ones.
 
 
 TABLE OF CONTENTS 
 
 1. INTRODUCTION
    1.1 Survey of Previous Work
    1.2 Overview of Thesis
 
 2. RHODES' METHOD
    2.1 General Description
    2.2 Condensation Phase
    2.3 Correction Phase
 
 3. EVOLUTION OF THE ERROR RECOVERY METHOD
 
 4. THE ERROR RECOVERY METHOD
    4.1 General Description
    4.2 Setup of Possible Parses
    4.3 Parallel Parsing
    4.4 Correction Phase
    4.5 Special Tables
    4.6 Modifications to LR Parser
 
 5. IMPLEMENTATION AND SAMPLE PROGRAMS
 
 6. EVALUATION OF THE METHOD
    6.1 Effectiveness
    6.2 Time Requirements
    6.3 Space Requirements
    6.4 Ease of Implementation
 
 7. CONCLUSIONS
 
 LIST OF REFERENCES
 
 APPENDIX A
 
1. INTRODUCTION 
 
 There are few things more frustrating than spending a 
 great deal of time debugging syntax errors in a program. 
 Often, an error causes a compiler to make an incorrect 
 assumption which leads to a confusing error message as well 
 as to the generation of many additional error messages. The
 final recovery often necessitates the skipping of large 
 portions of the input string. This means that any 
 additional errors that were skipped over will go undetected 
 until future runs of the program. 
 
 A good syntactic error recovery method should detect 
 and give an accurate and meaningful error message for each
 error in a program. This means completely recovering, and
 resuming the parse at the point of each error, so as not to
 miss detecting any subsequent errors. The advantages of a
 compiler having a good error recovery method are obvious. 
 The disadvantages are that it may be either very costly to 
 develop or very inefficient to use. An automatically 
 generated error recovery method solves the first of these 
 two problems. 
 
 
1.1 Survey of Previous Work
 
 The following is a brief survey of work done on
 automatically generated error recovery schemes. More
 detailed surveys are contained in LaFrance [8] and in Rhodes
 [12].
 
 The simplest error recovery scheme is commonly referred
 to as "panic mode". Upon detecting an error, the parser
 enters its "panic mode" and discards symbols from the input
 string until one is encountered that belongs to a set of
 special symbols. The parser then backs up on the parsing
 stack to a point where that special symbol is a legal input
 symbol. This method does not qualify as a good error
 recovery scheme. It is an extremely fast method, but may
 skip over large portions of the input string and its error
 messages generally consist only of an indication of the
 input symbol on which the error was detected.
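 The two steps of panic mode can be illustrated with a minimal
 Python sketch. The table layout (a dictionary keyed by state
 and symbol), the function name, and the toy data are
 assumptions made for illustration only; they are not taken
 from any of the surveyed systems.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical table: a key (state, symbol) is present when `symbol`
# is a legal input symbol in `state`.
Action = Dict[Tuple[int, str], str]


def panic_mode_recover(stack: List[int], tokens: List[str], pos: int,
                       sync: Set[str], action: Action) -> Tuple[List[int], int]:
    """Return the trimmed state stack and the input position to resume at."""
    # Step 1: discard input symbols until one of the special symbols is found.
    while pos < len(tokens) and tokens[pos] not in sync:
        pos += 1
    if pos == len(tokens):
        return list(stack), pos  # input exhausted, nothing to resume on
    # Step 2: back up on the parsing stack until the special symbol is legal.
    trimmed = list(stack)
    while trimmed and (trimmed[-1], tokens[pos]) not in action:
        trimmed.pop()
    return trimmed, pos


# Toy data: ";" is the only special symbol and is legal only in state 3.
toy_action = {(3, ";"): "shift 9"}
new_stack, new_pos = panic_mode_recover(
    [0, 3, 7], ["x", "+", "+", ";", "y"], 2, {";"}, toy_action)
```

 In this toy run the second "+" is skipped and state 7 is
 popped, which shows why the method is fast but loses whatever
 errors lie in the skipped input.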
 
 In 1963, Irons [5] published an automatically generated
 error recovery method which he had developed and implemented
 for a non-backtrack top-down parsing algorithm. In 1970,
 Leinius [9] described but did not implement a scheme which
 appears to be a more sophisticated version of the "panic
 mode" method. Leinius' method is based primarily on a
 simple precedence parsing algorithm [14], but he also
 discusses the application of his method to LR parsing
 algorithms [1]. In 1972, L. R. James [6] implemented
 Leinius' error recovery ideas on an LALR(k) parser. In
 1971, LaFrance [8] developed and implemented an automatic
 error recovery method as part of a translator writing system
 which produces a Floyd-Evans production parser. The
 LaFrance method produces good results, but it is restricted
 by a bounded look-ahead and by the fact that it does not
 attempt to modify the parsing stack. Also in 1971, Levy
 [10] proposed but did not implement an automatic error
 recovery scheme. Levy's scheme has both unbounded
 look-ahead and the ability to modify the parsing stack.
 However, it may run into combinatorial problems, which makes
 its practicality questionable. In 1972, Partridge [11]
 developed and implemented an automatic error recovery system
 that has the ability to collect statistics on the programs
 run through it and to accordingly modify itself somewhat in
 order to increase its efficiency. Partridge's method does
 entail a considerable amount of overhead, even while parsing
 correct portions of programs. In 1971, Rhodes [12]
 developed a good automatic error recovery method for simple
 precedence parsers. Rhodes implemented his method for both
 full Pascal [13] and for a subset of Algol W. Rhodes'
 method is effective as well as being efficient to use. A
 paper by Graham and Rhodes [3] contains a brief overview of
 simple precedence parsing and error detection in simple
 precedence parsers, as well as a description of Rhodes'
 method.
 
 1.2 Overview of Thesis 
 
 
 The motivation for this work came from reading of
 Rhodes' method in the paper by Graham and Rhodes [3]. The
 original idea was to develop an error recovery scheme that
 would extend Rhodes' ideas to LR parsers [7]. Note that the
 reader is presumed to have a knowledge of LR parsing
 techniques [1]. Section 2 contains a description of Rhodes'
 error recovery method. Rhodes' ideas proved to be difficult
 to apply directly to LR parsers and some of our original
 ideas changed during the course of this work. Section 3
 describes the evolution from the original ideas to the
 method which was finally adopted. Section 4 consists of a
 detailed explanation of the error recovery scheme. Section
 5 describes its implementation and discusses the results of
 some sample programs run on it. Section 6 evaluates the
 effectiveness of the error recovery method, gives an idea of
 its efficiency both in terms of memory and execution time,
 and discusses the ease with which it can be implemented.
 Section 7 suggests some improvements and restrictions that
 could be placed on the error recovery method in order to
 improve its efficiency, and presents some conclusions that
 can be reached about the effectiveness and practicality of
 the method.
 
 
2. RHODES' METHOD 
 
 2.1 General Description
 
 
 Rhodes' method is an automatic recovery method without 
 a fixed bound on lookahead. It determines the most likely 
 correction by using cost vectors, which indicate a cost for 
 the insertion, deletion, and replacement of both terminals 
 and nonterminals. 
 
 The method consists basically of two parts. The first 
 is a Condensation Phase, where an attempt is made to 
 localize the occurrence of the error, and the second is a 
 Correction Phase, where an attempt is made to determine what 
 changes are necessary to correct the error.
 
 2.2 Condensation Phase 
 
 A ?1 is used to mark the point of the error at the
 juncture of the parsing stack and the input string.
 
 The Condensation Phase consists of both a backward move
 and a forward move. For the backward move, the ?1 is
 assumed to be a ·> simple precedence relation [14], which
 means that the symbol on the top of the stack has greater
 precedence than the input symbol on which the error was
 detected. The backward move consists of making all possible
 reductions to the left of the ?1, as long as the result
 forms a prefix of a valid right-part (RP) of a production of
 the grammar and a precedence relation holds between the
 symbol to the left of the prospective RP and the
 corresponding left-part (LP).
 
 For the forward move, the ?1 is assumed to be a <·
 simple precedence relation, which means the input symbol
 will be shifted onto the stack. The forward move consists
 of continuing the parse to the right of the ?1 until another
 error is detected (which is marked by a ?2 on the parsing
 stack). A second error will always be detected by the time
 the parser attempts to reduce over the ?1, since the ?1
 cannot be contained in a valid RP. At this point the ?2 is
 assumed to be a ·> simple precedence relation and all
 possible reductions are made using the symbols to the
 immediate left of the ?2 in the same manner as for the ?1 in
 the backward move.
 
 
 This completes the Condensation Phase. At this point
 the error recovery method continues to the Correction
 Phase, assuming that the error has been localized to a
 section of the parsing stack bounded on the left by the
 first <· simple precedence relation to the left of ?1 and
 bounded on the right by the ?2.
 
 2.3 Correction Phase
 
 
 The Correction Phase assumes the error is contained in
 the localized area determined by the Condensation Phase and
 considers three possible substrings of that area as
 candidates for change.
 
                left bound                        right bound
 Candidate #1-  The first <· to the left of ?1    ?1
 Candidate #2-  ?1                                ?2
 Candidate #3-  The first <· to the left of ?1    ?2
 
 The three possible substrings are pattern matched
 against the RP's of all productions of the grammar whose
 corresponding LP's have precedence relations with both the
 symbols to the left and to the right of the substrings. By
 using the costs obtained from the Insertion, Deletion, and
 Replacement Vectors, a cost is computed for each attempted
 pattern match, and the solution with the minimum cost is
 used.
 
 
 
 3. EVOLUTION OF THE ERROR RECOVERY METHOD
 
 The original idea was to develop an error recovery
 method that would extend Rhodes' ideas to LR parsers.
 
 
 The problem with applying Rhodes' method directly to LR
 parsers is that it is not easy to determine the possible
 left and right ends of the RP as Rhodes has so neatly done
 with the precedence parser. A set of possible right ends of
 the RP can be determined by continuing all possible parses
 in parallel from the point of error detection on (one
 possible parse for each state from which the error symbol
 may be legally read), until each possible parse either
 encounters another error or attempts to make a reduction
 which extends past the top of the stack, as it existed when
 the error was detected. (We call this "reducing over the
 error point".) Then for each possible parse, the right end
 of the RP is the point at which that parse attempted to
 reduce over the error point.
 
 The problem then becomes choosing the most likely
 possible parse, which need not necessarily be among those
 indicated. If the input symbol at the error point is a
 completely erroneous symbol which should be deleted, then
 all of the indicated parses are incorrect and a new set of
 possible parses will have to be discovered. The frequency
 of occurrence of an error due to an erroneous symbol that is
 detected immediately is affected by the similarity of the
 constructs of the language as well as the programs being
 run. An idea of this frequency of occurrence would be useful
 in planning a strategy for determining the set of possible
 parses to start with. If erroneous symbols frequently
 occurred and usually were immediately detected, then it would
 be more efficient to skip the input symbol at the point of
 error detection and use the immediately following symbol to
 set up the possible parses.
 
 This still leaves the problem of finding the left end
 of the RP for each possible parse. This problem could be
 deferred until the pattern match by using a right-biased
 right to left pattern match similar to the left-biased left
 to right pattern match used in Rhodes' method. For each
 possible parse, the target of the pattern match would be the
 RP of the production that the parser was attempting when it
 tried to reduce over the error point. The symbols implied
 by the states on the parsing stack of each possible parse
 would be matched against the corresponding target RP.
 However, merely matching a target RP without accumulating a
 cost too large to be acceptable would not guarantee success.
 A check for correct left context must also be made.
 
 
 A problem is presented by the fact that the error may
 have caused a reduction that otherwise would not have
 occurred or the error may have prevented a reduction from
 occurring that otherwise would have. Making all possible
 reductions at the error point, regardless of the next symbol
 in the input string, would be the LR parser equivalent of
 the Condensation Phase in Rhodes' method, and it was thought
 that performing such reductions would provide some help.
 
 However, the real problem is in not knowing the exact
 location on the parsing stack of the left end of the RP
 being looked for. In fact, an incorrect reduction may have
 previously occurred in the vicinity of the left end, and the
 exact left end may no longer exist on the parsing stack.
 Rhodes' method did not seem to be bothered by the
 possibility of reductions occurring that should not have or
 reductions not occurring that should have. For the LR parser
 though, this problem combined with the inability to
 accurately determine the left end of the RP, would most
 likely cause much poorer results than Rhodes' method did for
 the precedence parser. Whenever necessary, the pattern
 match could expand the nonterminals implied by the states on
 the parsing stack into their possible RP's. Also, if
 necessary, a series of symbols that form a RP could be
 condensed into their corresponding LP's.
 
 
 However, expanding and condensing symbols while pattern
 matching leads to a very large number of possibilities.
 During the pattern match, each time a symbol did not match,
 a check would have to be made for all possible expansions of
 the symbol as well as all possible nonterminals which the
 symbol could be an expansion of.
 
 Another problem to consider is that the possible
 expansions of nonterminals are only possibilities. Once a
 reduction into a nonterminal is made, there is no way of
 knowing which RP reduced into that nonterminal. Thus there
 is no way to be sure what the original input string looked
 like. This means the cost of deleting a nonterminal called
 <expression> would have to be the same regardless of whether
 it originally consisted of a single identifier, or of a very
 large arithmetic expression. Another example of this
 problem would be in a language with the productions
 
 <left-part> => <identifier> := <left-part>
             => <identifier> :=
 
 where the symbol := is an assignment operator. In this
 case, deleting an identifier and an assignment operator
 (i.e., "A:=") which reduced into the nonterminal <left-part>
 would have the same cost as deleting a <left-part> which
 originally consisted of multiple assignment statements
 (i.e., "A:=B:=C:=D:=E:=").
 
 
 All of the above considerations led to the decision to
 save the original input string in tokenized form, and only 
 consider insertions and deletions of terminals. Besides 
 guaranteeing that the original input is precisely known, 
 this makes error messages much clearer since they refer to 
 specific terminal symbols instead of nonterminals in the 
 grammar. This is especially helpful to the user who would 
 have no idea of what a nonterminal was. This is important, 
 since the inexperienced programmer is precisely the type of 
 user who would derive the most benefit from a thorough 
 syntactic error recovery method. 
 
 Although the idea of pattern matching to the closest RP 
 worked well and efficiently for precedence parsers, it 
 cannot easily be adapted to LR parsers. Also to be taken 
 into consideration is that the states in an LR parser 
 contain more information about the previous input than 
 simply the symbol on which they were entered. This 
 additional information would not be utilized by merely 
 pattern matching against the symbols the states were entered 
 on. 
 
 The idea of pattern matching to the closest RP was
 finally abandoned. The method used, which is described in
 the next section, attempts to use the information contained
 in the states to make each possible parse continuous across
 its error point. The method works with the tokenized input
 string and only considers insertions and deletions of
 terminals.
 
 
 4. THE ERROR RECOVERY METHOD
 
 4.1 General Description
 
 The method is to continue a series of possible parses
 in parallel from the point of error detection on. There is
 no fixed bound on how far they may continue. They all
 proceed until either encountering another error or
 attempting to reduce over the error point. Any possible
 parse attempting to reduce over the error point is
 considered to be a candidate for correction. Insertions and
 deletions of terminals are considered for each candidate for
 correction in an attempt to make them continuous parses
 across the error point. There is an Insertion Cost Vector
 and a Deletion Cost Vector, which respectively contain the
 insertion and deletion costs of each terminal in the
 language. A record is kept of the individual changes and
 the total cost of all the changes for each candidate until
 either the possible parse is corrected or the total cost
 exceeds some fixed limit. All candidates that are corrected
 without accumulating a total cost greater than the limit are
 considered as possible solutions and are printed as such.
 The error recovery method then indicates its choice of the
 most probable of these possible solutions, which is the
 solution with the lowest total cost.
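 The cost bookkeeping described above can be sketched in
 Python as follows. The representation of a change as a
 (kind, terminal) pair, the function names, and the sample
 cost values are illustrative assumptions, not the data
 structures of the actual implementation. Breaking ties by
 taking the first minimum is also an assumption.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, str]  # ("insert" or "delete", terminal)


def total_cost(changes: List[Change], insert_cost: Dict[str, int],
               delete_cost: Dict[str, int]) -> int:
    """Sum the per-terminal costs taken from the two cost vectors."""
    return sum(insert_cost[t] if kind == "insert" else delete_cost[t]
               for kind, t in changes)


def choose_solution(candidates: List[List[Change]], insert_cost: Dict[str, int],
                    delete_cost: Dict[str, int],
                    limit: int) -> Optional[List[Change]]:
    """Keep candidates whose cost does not exceed the limit; pick the cheapest."""
    priced = [(total_cost(c, insert_cost, delete_cost), c) for c in candidates]
    solutions = [(cost, c) for cost, c in priced if cost <= limit]
    if not solutions:
        return None
    return min(solutions, key=lambda pair: pair[0])[1]


# Hypothetical cost vectors and three corrected candidates.
INSERT_COST = {")": 1, ";": 2}
DELETE_COST = {"+": 3, ";": 4}
CANDIDATES = [
    [("delete", "+")],
    [("insert", ")")],
    [("delete", ";"), ("insert", ";")],
]
best = choose_solution(CANDIDATES, INSERT_COST, DELETE_COST, limit=4)
```

 With these sample values the third candidate exceeds the
 limit and is not a possible solution, and the single
 insertion of ")" is chosen as the most probable one.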
 
 4.2 Setup of Possible Parses
 
 A possible parse is set up for each state for which the 
 parser action would be a "shift" given the input symbol at 
 the point of error detection. It is not necessary to set up 
 possible parses for the states which indicate a "reduce" 
 parser action on the error input symbol. States from which 
 the parser action would be a reduction are considered 
 ancestral to the states which could result from that 
 reduction. The error recovery method is aware of this 
 intended ancestral relation when it is working with the 
 candidates for correction. Therefore, for this error 
 recovery method, setting up one possible parse for each 
 state that would shift on the input symbol is sufficient to 
 cover all possible cases where the input symbol would be 
 legal.
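 As a minimal sketch of this setup step, the following Python
 fragment scans a toy action table for the states that would
 shift on the error input symbol. The table encoding and the
 name `shift_states` are assumptions made for illustration;
 states whose action is a reduction are deliberately skipped,
 as explained above.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: (state, symbol) -> (action kind, target).
Action = Dict[Tuple[int, str], Tuple[str, int]]


def shift_states(action: Action, error_symbol: str) -> List[int]:
    """States for which the parser action on `error_symbol` is a shift.

    One possible parse would be set up for each returned state.
    """
    return sorted(state
                  for (state, symbol), (kind, _target) in action.items()
                  if symbol == error_symbol and kind == "shift")


TOY_ACTION: Action = {
    (1, ";"): ("shift", 5),
    (2, ";"): ("reduce", 3),   # reduce on ";": no possible parse is set up
    (4, ";"): ("shift", 6),
    (4, "+"): ("shift", 7),
}
starts = shift_states(TOY_ACTION, ";")
```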
 
 A ?1 is assumed to exist at the discontinuous point in
 each possible parse, which is the point between the state on
 which the error was detected and the state that was assumed
 correct when setting up that possible parse. While parsing,
 each possible parse is aware of where its ?'s are.
 
 4.3 Parallel Parsing 
 
 All of the possible parses proceed in parallel. Each 
 one makes all possible reductions, but as soon as it makes a 
 shift, control is passed to the next possible parse. Any 
 parse encountering an error is discarded and the remaining 
 parses continue. Upon attempting to reduce over the ?1, a 
 possible parse is considered as a candidate for correction. 
 If the correction attempt fails then the possible parse is
 discarded. If the correction attempt succeeds, then the 
 corrections are tentatively recorded, and the process 
 continues as long as at least one possible parse has not 
 attempted a reduction over the ?1. This continues either 
 until all possible parses have been discarded, or until all 
 those remaining have successfully provided a correction. 
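 The round-robin control flow of this paragraph can be
 sketched in Python. This is only a simulation under stated
 assumptions: each possible parse is modelled as a scripted
 iterator whose items stand for "all reductions up to the next
 shift" and yield one of the outcomes "shift", "error", or
 "reduce_over_barrier", and the correction attempt is a
 caller-supplied predicate. Re-instatement of discarded parses
 and recursive calls are not modelled.

```python
from typing import Callable, Dict, Iterator


def run_parallel(parses: Dict[str, Iterator[str]],
                 try_correct: Callable[[str], bool]) -> Dict[str, str]:
    """Advance all possible parses in turn until each is settled."""
    live = dict(parses)
    result: Dict[str, str] = {}
    while live:
        for name in list(live):
            outcome = next(live[name])
            if outcome == "shift":
                continue  # control passes to the next possible parse
            del live[name]
            if outcome == "error":
                result[name] = "discarded"
            else:
                # Attempted reduction over the ?1: candidate for correction.
                result[name] = "corrected" if try_correct(name) else "discarded"
    return result


# Scripted toy parses; every script ends with a settling outcome.
scripts = {
    "a": iter(["shift", "error"]),
    "b": iter(["shift", "shift", "reduce_over_barrier"]),
    "c": iter(["reduce_over_barrier"]),
}
outcome = run_parallel(scripts, try_correct=lambda name: name == "b")
```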
 
 For the case where the process ends with all the
 remaining parses having been corrected, the corrections that
 were tentatively recorded for each of the remaining parses
 are printed as possible solutions. A decision is then made
 as to the most probable solution, the possible parse of that
 solution is indicated, and the error recovery routine is
 exited. Usually the error recovery routine returns to the
 regular parser. However, it could return from a recursive
 call of itself. The conditions under which the error
 recovery routine recursively calls itself are discussed in
 the next paragraphs. Regardless of where the error recovery
 routine returns, or how many of the possible solutions had
 costs equal to the minimum, only one solution is returned.
 
 For the case where all of the possible parses have been 
 discarded, there are still two possibilities. The input 
 symbol that was originally assumed correct in setting up the 
 possible parses can be deleted, or the error recovery 
 routine can assume that a second error caused all the parses 
 to be discarded and can call itself recursively in an 
 attempt to independently correct the second error. 
 
 Not all of the possible parses discarded are
 irretrievable. Those that became candidates for correction,
 but could not correct the error within the cost limit are
 completely disregarded, as are all parses that encounter a
 second error within one symbol of the original error.
 However, for each pass through the possible parses, until at
 least one of the possible parses has been able to perform a
 shift, a pointer is kept to each parse discarded that
 encountered its error two or more symbols after the original
 error. Whenever all the possible parses have been
 discarded, but two or more symbols have been shifted since
 the possible parses were set up, it is assumed that the
 original input symbol assumed in setting up the possible
 parses may still be correct and that the parses were all
 discarded because of detecting a second error in the input
 string. By re-instating those parses that encountered their
 second error on the current input symbol and then calling
 itself recursively, the error recovery routine attempts to
 tackle this new error independently of the first error.
 
 If all of the possible parses have been discarded
 before shifting two or more symbols, then the input symbol
 that was originally assumed to be correct is permanently
 deleted, and an error message to that effect is printed at
 this point. The error recovery routine then restarts
 itself, setting up new possible parses using the symbol
 immediately following the one that was just deleted.
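 The choice between the two outcomes just described can be
 written as a small decision function. The sketch below is in
 Python; the function name, the returned labels, and the map
 from parse names to the input position of their second error
 are hypothetical and serve only to restate the rule.

```python
from typing import Dict, List, Tuple


def handle_all_discarded(symbols_shifted: int,
                         second_error_pos: Dict[str, int],
                         current_pos: int) -> Tuple[str, List[str]]:
    """Decide what to do once every possible parse has been discarded.

    symbols_shifted:  symbols shifted since the possible parses were set up.
    second_error_pos: input position at which each remembered parse failed.
    """
    if symbols_shifted >= 2:
        # The assumed input symbol may still be correct: treat the failure
        # as a second error and re-instate the parses that hit it here.
        reinstated = sorted(name for name, pos in second_error_pos.items()
                            if pos == current_pos)
        return "recurse", reinstated
    # Otherwise delete the symbol that was assumed correct and restart
    # with the symbol immediately following it.
    return "delete_symbol", []


decision = handle_all_discarded(3, {"p1": 12, "p2": 10, "p3": 12}, 12)
```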
 
 There is one other situation that could arise. A
 possible parse could attempt to reduce over the error point,
 enter the Correction Phase, supposedly correct the error,
 but then be unable to reparse up to the point in the input
 string where the reduction over the error point was
 attempted. An example of this is shown by the following
 statements.
 
 x := ( y + ( z*5 ;
 
 z := z + 1 ;
 
 An apparent solution is to insert two right parentheses.
 The error is detected when the ";" is the input symbol. The
 possible parses "shift" the ";" before attempting to reduce
 over the error point. (Notice that according to the grammar
 [Appendix A], a semicolon is used as a statement
 terminator.) The Correction Phase inserts a ")" after the
 "5". However, since two right parentheses were needed, it
 encounters an error at the same point during the reparse. A
 possible correction has apparently been found, but with that
 correction the possible parse is unable to reparse up to the
 point in the input string that the other possible parses
 have parsed to. Since all possible parses must proceed in
 parallel, it must be discarded. A list of pointers is also
 kept to the possible parses discarded under these
 conditions. If another possible parse either finds a
 correction and is able to reparse further, or parses further
 than the discarded ones did before attempting to reduce over
 the error point, then these parses are permanently
 discarded. If not, then these discarded parses are
 re-instated and the error routine calls itself recursively
 from the point at which they encountered the error on the
 attempted reparse. This is in fact what happens in the
 above example. Upon entering the error routine again, a
 second ")" is inserted and this time the parse is completely
 corrected.
 
 4.4 Correction Phase
 
 A possible parse becomes a candidate for correction as
 soon as it attempts to reduce over a ?. This ? can best be
 looked on as a barrier across which the possible parse is
 not continuous. The parse is continuous from its beginning
 up to the ?1, since the parser got that far before
 encountering its first error. The possible parse is
 continuous between ?'s (if there is more than one) and is
 continuous from the last ? to the point where the attempted
 reduction took place. Thus, if the possible parse can be
 made continuous across the ?'s, then it can be considered as
 corrected.
 
 Remember that the possible parses were created on the
 assumption that the input symbol at the error point was
 correct. Each individual possible parse is based on the
 additional assumption that the state to the immediate right
 of the ? (the state from which that possible parse was
 continued after the error) is a reasonable choice. Should
 either of these assumptions be false, the correction should
 not be possible within the fixed cost limit, resulting in
 the possible parse being discarded. Ideally, if both
 assumptions are true, then the minimal changes to the input
 string that are necessary to correct the possible parse will
 be found.
 
 The basic idea is to find a way to get from the state 
 on the left of the barrier to the state on the right of the 
 barrier, while considering only insertions and deletions of 
 terminal symbols. There is actually a set of states on both 
 the left and the right of the barrier. A state can have 
 many ancestral states. For the scope of this text, 
 ancestral states are defined as follows: 
 
      A state S' is ancestral to state S if given a
      certain state stack with state S' on top, one or
      more "reduce moves" could be performed which would
      leave state S on top of the state stack.
 
 Since the error recovery method will only be working with
 insertions and deletions of terminals, it is sufficient to
 work with terminal-entry ancestral states. For the scope of
 this text, terminal-entry ancestral states are defined as
 follows:
 
      The set of terminal-entry ancestral (TEA)
      states of state S is the set consisting of all
      ancestral states of state S which are enterable on
      a terminal symbol (as a result of a shift).
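 The two definitions can be illustrated by a short Python
 sketch. It assumes a precomputed, stack-independent relation
 `reduces_to` giving, for each state, the states that a
 single reduce move could leave on top, and a map
 `entry_symbol` giving the symbol on which each state is
 entered; both are hypothetical inputs, and how such a
 relation would be derived from the LR tables is not shown.
 The sketch follows the definition literally ("one or more"
 reduce moves), so a state is not counted as its own ancestor
 unless the relation contains a cycle.

```python
from typing import Dict, Set


def ancestral_states(state: int, reduces_to: Dict[int, Set[int]]) -> Set[int]:
    """States from which one or more reduce moves can leave `state` on top."""
    found: Set[int] = set()
    frontier = [state]
    while frontier:
        current = frontier.pop()
        for source, results in reduces_to.items():
            if current in results and source not in found:
                found.add(source)
                frontier.append(source)
    return found


def tea_states(state: int, reduces_to: Dict[int, Set[int]],
               entry_symbol: Dict[int, str], terminals: Set[str]) -> Set[int]:
    """Ancestral states of `state` that are entered on a terminal symbol."""
    return {s for s in ancestral_states(state, reduces_to)
            if entry_symbol[s] in terminals}


# Toy data: 7 reduces to 5, 5 reduces to 3, and 8 reduces to 2.
REDUCES_TO = {5: {3}, 7: {5}, 8: {2}}
ENTRY_SYMBOL = {2: "E", 3: "E", 5: "T", 7: "id", 8: "id"}
anc = ancestral_states(3, REDUCES_TO)
tea = tea_states(3, REDUCES_TO, ENTRY_SYMBOL, {"id"})
```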
 
 The set of TEA states to the immediate left of the
 barrier will be called the leftstates and the set of TEA
 states to the immediate right of the barrier will be called
 the rightstates. The objective is to find a "simple way" to
 get from one of the leftstates to one of the rightstates.
 This "simple way" is by the insertion of a single terminal,
 though the method could be extended to consider the
 insertion of multiple terminals. If the single insertion
 will not break the barrier, it is assumed that the error
 occurred at an earlier point in the input string and an
 attempt is made to back the barrier up one symbol. The
 symbol to be backed over is the previous symbol in the input
 string, which is always known since the original input
 string is saved in tokenized form. Backing the barrier over
 this terminal means backing up both the leftstates and the
 rightstates over the terminal. Backing a set of states over
 a terminal has the following effect. The set of states
 after the backup consists of all states from which a shift
 (on the terminal) can be made to one of the states that was
 in the set prior to the backup.
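 Backing a set of states over a terminal amounts to taking a
 predecessor set under the shift transitions. A minimal
 Python sketch follows; the dictionary encoding of the shift
 transitions and the name `back_over` are assumptions for
 illustration, and the subsequent update of the result to TEA
 states is omitted.

```python
from typing import Dict, Set, Tuple

# Hypothetical encoding: SHIFT[(p, t)] == q means state p shifts
# terminal t and enters state q.
Shift = Dict[Tuple[int, str], int]


def back_over(states: Set[int], terminal: str, shift: Shift) -> Set[int]:
    """All states from which a shift on `terminal` enters a state in `states`."""
    return {source
            for (source, symbol), target in shift.items()
            if symbol == terminal and target in states}


SHIFT: Shift = {(1, "a"): 4, (2, "a"): 4, (3, "a"): 5, (1, "b"): 6}
backed = back_over({4}, "a", SHIFT)
```

 An empty result means that none of the states in the set can
 be backed over the terminal.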
 
 All of the leftstates which can be backed up over the 
 terminal are backed up, and those that cannot be backed up 
 over the terminal are discarded. At least one of the 
 leftstates will always be able to back over the terminal, 
 since one of them is the state that was in the actual parse 
 before the error was detected. 
 
 If one or more of the rightstates can also back over
 the terminal, then those that can are backed up and the
 barrier has successfully been backed up one symbol without
 accumulating any cost. The process then repeats itself by
 alternately looking for a "simple way" (a single terminal
 insertion) to break the barrier and then backing up the
 barrier. This continues until either the barrier is broken
 (the possible parse is considered corrected) or a total cost
 is accumulated which is greater than the fixed limit.
 
 However, it is very likely that none of the rightstates
 will be able to back over the previous terminal from the
 input string. Remember that each possible parse is based on
 the assumption that the state that was assumed when setting
 up the possible parse (the original rightstate) was a
 reasonable choice. Therefore, if the possible parse is
 still assumed a reasonable one, but none of the rightstates
 can back over the terminal from the input string, then that
 terminal must be an incorrect one. The terminal is
 considered deleted as far as the possible parse is
 concerned, and the cost of its deletion is added to the
 total cost being accumulated for the possible parse. Even
 though the rightstates are unsuccessful in backing up,
 because of this deletion, the barrier itself has been backed
 up. The process then continues, provided that the total
 cost accumulated is still less than the fixed limit.
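 One step of backing the barrier up, including the deletion
 case, might be sketched in Python as below. The encoding of
 the shift transitions, the function names, and the sample
 cost are illustrative assumptions; the search for a single
 insertion that breaks the barrier and the update of each set
 to TEA states are left out.

```python
from typing import Dict, Set, Tuple

Shift = Dict[Tuple[int, str], int]  # (state, terminal) -> state entered


def back_over(states: Set[int], terminal: str, shift: Shift) -> Set[int]:
    """All states from which a shift on `terminal` enters a state in `states`."""
    return {source
            for (source, symbol), target in shift.items()
            if symbol == terminal and target in states}


def back_barrier(left: Set[int], right: Set[int], terminal: str,
                 shift: Shift,
                 delete_cost: Dict[str, int]) -> Tuple[Set[int], Set[int], int]:
    """Back the barrier over the previous input terminal.

    Returns the new leftstates, the new rightstates, and the cost incurred.
    """
    new_left = back_over(left, terminal, shift)
    new_right = back_over(right, terminal, shift)
    if new_right:
        return new_left, new_right, 0  # backed up at no cost
    # No rightstate can back over the terminal: consider it deleted,
    # keep the rightstates unchanged, and charge its deletion cost.
    return new_left, set(right), delete_cost[terminal]


SHIFT_TABLE: Shift = {(1, "a"): 4, (2, "a"): 5, (9, "a"): 8}
free_step = back_barrier({4}, {8}, "a", SHIFT_TABLE, {"a": 3})
paid_step = back_barrier({4}, {7}, "a", SHIFT_TABLE, {"a": 3})
```

 A caller would add the returned cost to the running total for
 the possible parse and stop once the fixed limit is exceeded.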
 
 
 Each time either the set of leftstates or the set of 
 rightstates is backed over a symbol, that set of states is 
 updated so that it only contains TEA states. In the case of 
 the set of rightstates (for reasons explained in the next 
 several paragraphs), the set of states is saved before the 
 update occurs. The set of rightstates prior to the update
 will be referred to as the "true" rightstates. 
 
 There is a problem with nondeterminism both in backing
 up the set of leftstates and in backing up the set of 
 rightstates. 
 
 At the beginning of the Correction Phase, the set of 
 leftstates consists of the TEA states of a single state, 
 which is the state that was on the top of the parsing stack 
 at the error point. Ideally, while backing up over the 
 
27 
 
 input string, the leftstates should always consist of the 
 TEA states of a single state (the "true" leftstate). That 
 "true" leftstate should be the state that was on the top of 
 the parsing stack when the original forward parse was at the 
 same point in the input string. If that "true" leftstate 
 still exists in the parsing stack, then it can be 
 determined, since each state pushed on the parsing stack is 
 accompanied by a pointer into the tokenized input string. 
 
 The nondeterminism in backing up the set of leftstates 
 only occurs when backing over input symbols that have 
 already been reduced into a nonterminal. When this occurs, 
 the leftstates are only a set of possibilities, and the 
 "true" leftstate will not be known until backing up into a 
 state which still exists on the parsing stack. During this 
 period of uncertainty about the validity of the set of 
 leftstates, a break of the barrier is not necessarily a 
 correct solution. Each time a way is found to break the 
 barrier, the leftstate used is checked to guarantee that it 
 is a "true" leftstate or that it can be backed up into a 
 "true" leftstate. If the leftstate used does not satisfy 
 this check, then the solution is ignored and that leftstate 
 is discarded from the set of leftstates. 
 
 The nondeterminism encountered in backing up the set of 
 rightstates is not as easy to control as that encountered in 
 backing up the set of leftstates. If not checked, the set 
 of rightstates can expand very rapidly and the 
 nondeterminism involved could lead to a potentially 
 confusing situation for the Correction Phase which attempts 
 to find a simple way to get from one of the leftstates to 
 one of the rightstates. 
 
 This problem appears to be checked nicely by always 
 attempting to back the barrier over a nonterminal before 
 attempting to back it over the previous terminal in the 
 input string. This can be done whenever three conditions 
 are satisfied. The "true" leftstate must be known, that 
 "true" leftstate must be enterable on a nonterminal, and at 
 least one of the "true" rightstates must be enterable on 
 that same nonterminal. When these three conditions are 
 satisfied, the "true" leftstate is backed over this 
 nonterminal into another "true" leftstate. Those "true" 
 rightstates that can be backed up over the same nonterminal 
 are backed up and those that cannot are discarded. There is 
 no problem in determining the point in the input string to 
 which the barrier has been backed up, since there is still a 
 "true" leftstate which exists on the parsing stack and which 
 is accompanied by a pointer into the tokenized input string. 
 If any one of the three conditions is not satisfied, then an 
 attempt is made to back the barrier over the previous 
 terminal in the input string in the manner previously 
 described. 
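
 The three conditions can be expressed as a short test. The sketch below is only illustrative: the names `entry_symbol` and `predecessors` stand for the two halves of the Predecessor States Table, the parsing stack is modelled as a plain list of states, and the subsequent reduction of each set to its TEA states is omitted.

```python
def back_over_nonterminal(stack, left_index, true_rights,
                          entry_symbol, predecessors, nonterminals):
    """Try to back the barrier over a nonterminal.

    stack        -- list of states, bottom first (hypothetical layout)
    left_index   -- index in `stack` of the "true" leftstate, or None if unknown
    true_rights  -- set of "true" rightstates
    Returns (new_left_index, new_rightstates), or None if any of the
    three conditions fails.
    """
    # Condition 1: the "true" leftstate must be known (and have a state below it).
    if left_index is None or left_index == 0:
        return None
    # Condition 2: it must be enterable on a nonterminal.
    symbol = entry_symbol.get(stack[left_index])
    if symbol not in nonterminals:
        return None
    # Condition 3: at least one "true" rightstate is enterable on the same symbol.
    survivors = {s for s in true_rights if entry_symbol.get(s) == symbol}
    if not survivors:
        return None
    # Back the survivors up; rightstates that cannot be backed up are discarded.
    new_rights = set()
    for state in survivors:
        new_rights |= predecessors.get(state, set())
    return left_index - 1, new_rights
```

 When the function returns None, the caller would fall back to backing over the previous terminal.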
 
 In all the examples we tested, backing over a 
 nonterminal whenever possible prevented the Correction Phase 
 from getting into trouble due to the nondeterminism involved 
 in backing up the rightstates. In some cases it would have 
 run into trouble had it only backed up over the terminals in 
 the input string. Though this method of controlling the 
 nondeterminism problem worked very well, it is not clear 
 that it is sufficient to handle any possible case that could 
 be contrived. 
 
 There is an additional advantage to backing over a 
 nonterminal whenever possible: it is much more efficient 
 than backing over the individual terminals which have 
 already been reduced into the nonterminal. For example, 
 backing over a nonterminal <statement> is much more 
 efficient than backing over each of the individual 
 terminals of that statement. 
 
 Eventually the process will complete. If it ends by 
 accumulating too great a cost, then it has failed, and as a 
 result the possible parse will be discarded. If it ends by 
 breaking the barrier, then the possible parse has been 
 corrected. Taking the corrections into consideration, the 
 input string is reparsed beginning with the "true" leftstate 
 closest to the barrier which has not seen the point in the 
 input string where the earliest correction was made. 
 
 Although only insertions and deletions of terminals are 
 considered at any point, the error messages may suggest 
 changing one terminal to another. Each time an insertion is 
 to be made for a possible parse, a check is made to see if 
 the terminal is to be inserted next to a terminal which the 
 Correction Phase has just deleted for the same possible 
 parse. If so, the corrections are combined into one and the 
 cost accumulated for that possible parse is changed to 
 reflect only the maximum of the two individual costs. 
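
 A minimal sketch of this combining step is given below. The `Correction` record and the convention that an insertion is described by the gap before token `gap` are assumptions made for the illustration; they are not taken from the implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Correction:
    kind: str                  # "insert", "delete", or "change"
    index: int                 # token index (delete/change) or gap index (insert)
    terminal: str              # terminal deleted, inserted, or changed to
    cost: int
    old: Optional[str] = None  # for "change": the terminal that was replaced

def add_insertion(corrections, total_cost, gap, terminal, cost):
    """Record an insertion before token `gap`; return the new total cost.

    If the most recent correction deleted a token adjacent to that gap,
    the two are merged into one "change" costing max(delete, insert).
    """
    if corrections:
        last = corrections[-1]
        if last.kind == "delete" and gap in (last.index, last.index + 1):
            merged = max(last.cost, cost)
            corrections[-1] = Correction("change", last.index, terminal,
                                         merged, old=last.terminal)
            return total_cost - last.cost + merged
    corrections.append(Correction("insert", gap, terminal, cost))
    return total_cost + cost
```

 With equal insertion and deletion costs, a merged change therefore costs the same as a single insertion or a single deletion.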
 
 4.5 Special Tables 
 
 The error recovery method requires three special tables 
 other than the normal parsing tables required for an LR 
 parser. 
 
 1.) The Legal State Table. 
 
 2.) The Predecessor States Table. 
 
 3.) The TEA States Table. 
 
 The Legal State Table is an indexed sequential table 
 which is indexed by terminals of the language. For each 
 terminal, the table contains all states from which the 
 parser action with that terminal as an input symbol is a 
 shift. The Legal State Table is used in setting up the 
 possible parses. 
 
 The Predecessor States Table is really two indexed 
 sequential tables which are indexed by the states of the LR 
 parser. For each state, the first table contains the symbol 
 that the state is enterable on, and the second table 
 contains all states from which the state could be entered. 
 The Predecessor States Table is used in backing up a set of 
 states over the input string as well as in creating the TEA 
 States Table. 
 
 The TEA States Table is an indexed sequential table 
 which is indexed by the states of the LR parser. For each 
 state, the table contains all of that state's TEA states. 
 This table is also used in backing up a set of states over 
 the input string. 
 
 All three of these tables are automatically generated 
 directly from the tables of the LR parser. 
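
 As an illustration of how the first two tables might be derived, the sketch below assumes a hypothetical dictionary encoding of the LR tables: `action[state][terminal]` is a pair such as `("shift", target)` or `("reduce", production)`, and `goto[state][nonterminal]` is the target state. The TEA States Table is not built here, since its construction depends on the definition of TEA states given earlier.

```python
from collections import defaultdict

def build_tables(action, goto):
    """Derive the Legal State Table and the Predecessor States Table.

    Returns (legal, entry_symbol, predecessors):
      legal[terminal]      -- states whose action on `terminal` is a shift
      entry_symbol[state]  -- the symbol the state is enterable on
      predecessors[state]  -- states from which the state can be entered
    """
    legal = defaultdict(set)
    entry_symbol = {}
    predecessors = defaultdict(set)
    for state, row in action.items():
        for terminal, (kind, target) in row.items():
            if kind == "shift":
                legal[terminal].add(state)
                entry_symbol[target] = terminal
                predecessors[target].add(state)
    for state, row in goto.items():
        for nonterminal, target in row.items():
            entry_symbol[target] = nonterminal
            predecessors[target].add(state)
    return dict(legal), entry_symbol, dict(predecessors)
```

 A single pass over the parser tables suffices because every LR state is entered on exactly one symbol.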
 
 4.6 Modifications to LR Parser 
 
 This error recovery method requires some modifications 
 to the original LR parser. The error recovery method 
 depends on having access to the original input string in 
 tokenized form. Therefore the lexical analysis phase must 
 save each token that it processes. If the original input 
 string were ordinarily available and there were a desire to 
 reduce overhead for the parsing of completely correct 
 programs, then the error routine could re-tokenize when it 
 needed a symbol. However, for incorrect programs, it would 
 be much more efficient to save the input string in tokenized 
 form. 
 
 A problem is presented by the fact that the error 
 recovery method can decide to back up and restart a set of 
 possible parses. This requires the lexical analysis phase 
 to check if the next token it is supposed to process has 
 already been consumed. If it has already been consumed, 
 then it can be found in the tokenized input string that was 
 saved. 
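
 One way to arrange this is a small buffer between the scanner and the parser, sketched below. The class and method names are hypothetical, and the scanner is modelled as any iterable of tokens.

```python
class TokenBuffer:
    """Saves every token handed to the parser so that it can be replayed."""

    def __init__(self, scanner):
        self._scanner = iter(scanner)  # underlying lexical analyzer
        self.saved = []                # tokenized input seen so far
        self.position = 0              # index of the next token to deliver

    def next_token(self):
        if self.position < len(self.saved):
            # Already consumed once: take it from the saved input.
            token = self.saved[self.position]
        else:
            # Not yet seen: ask the scanner and remember the result.
            # (Raises StopIteration when the scanner is exhausted.)
            token = next(self._scanner)
            self.saved.append(token)
        self.position += 1
        return token

    def back_up_to(self, index):
        """Restart delivery at a token that has already been consumed."""
        if not 0 <= index <= len(self.saved):
            raise ValueError("cannot back up past the saved input")
        self.position = index
```

 The parser never needs to know whether a token came from the scanner or from the saved input.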
 
 Another problem is found in trying to make a 
 correspondence between the states on the parsing stack and 
 the tokenized input string. Suppose the error recovery 
 routine makes a change to the input string. The parse must 
 be resumed at the point of the change. However, there is a 
 problem in determining the state closest to the top of the 
 parsing stack which has not seen the point in the input 
 string where the change was made. Without this 
 correspondence between the parsing stack and the input 
 string, the input string would have to be reparsed entirely 
 after each correction. Therefore, each state that is pushed 
 onto the stack is accompanied by a pointer into the 
 tokenized input string. 
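
 The lookup this pointer makes possible can be sketched as follows. The representation is an assumption made for the example: each stack entry is a pair (state, tokens consumed when the state was pushed), and a state is taken not to have seen the change if it was pushed before the token at the change point was consumed.

```python
def restart_index(stack, change_point):
    """Return the index of the stack entry from which to resume parsing.

    stack        -- list of (state, tokens_consumed) pairs, bottom first
    change_point -- index of the first token affected by the correction
    The entry chosen is the one closest to the top whose pointer does not
    extend past the change point.
    """
    for i in range(len(stack) - 1, -1, -1):
        _state, tokens_consumed = stack[i]
        if tokens_consumed <= change_point:
            return i
    raise ValueError("no stack entry precedes the change point")
```

 The caller would pop every entry above the returned index and resume reading tokens at that entry's pointer.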
 
 5. IMPLEMENTATION AND SAMPLE PROGRAMS 
 
 The error recovery method was implemented with an LR 
 parser for a language whose BNF is listed in Appendix A. 
 The LR parser has 356 states. The implementation is written 
 in Pascal [4,13] and was run on a DEC-10 timesharing system. 
 The sample programs were all run with a constant cost of 10 
 for the insertion or deletion of any terminal in the 
 language. The cost limit for each correction attempt of a 
 possible parse was set at 31. 
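
 As a quick check of what these settings permit, the following sketch counts how many unit-cost corrections fit under the limit. It assumes that an attempt is abandoned once the accumulated cost is no longer less than the limit; the function name is hypothetical.

```python
def max_corrections(unit_cost, limit):
    """Largest number of corrections whose total cost stays below `limit`."""
    count = 0
    while (count + 1) * unit_cost < limit:
        count += 1
    return count

# With the settings used for the sample programs:
allowed = max_corrections(10, 31)  # three corrections cost 30; a fourth would cost 40
```

 Under this assumption, a single correction attempt for a possible parse can contain at most three insertions, deletions, or combined changes.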
 
 The remainder of this section consists of a discussion 
 of the results from ten sample programs. Sample program #1 
 has a missing statement terminator, which is solved by a 
 simple insertion. Sample program #2 also has a missing 
 statement terminator, but it is in such a context that the 
 error routine cannot determine which of two possible 
 solutions is the most likely. In sample program #3, the 
 error routine provides a single solution with the option of 
 inserting either of two symbols at a specific point in the 
 input string. Three reasonable solutions, including two not 
 so obvious ones, are provided for sample program #4. This 
 example demonstrates consecutive insertion and deletion 
 corrections being combined into a single "change" 
 correction. In sample program #5, the error is detected 
 immediately and consequently the possible parses are set up 
 for a symbol that should be deleted. The error routine must 
 discard all of the possible parses, delete the incorrect 
 symbol, and set up another set of possible parses before 
 finding a reasonable solution. Sample program #6 contains 
 an "if" statement that is missing the "IF", and the error 
 routine correctly inserts it. This example demonstrates the 
 advantages of an unbounded look-ahead scheme. Sample 
 program #7 contains an expression with three unmatched left 
 parentheses and requires the error routine to call itself 
 recursively in order to provide a solution. In sample 
 program #8, the restriction of not considering the insertion 
 of multiple terminals as a "simple way" to break the barrier 
 prevents the error routine from finding an obvious solution, 
 but it continues and finds another equally acceptable 
 solution. In sample program #9, the same restriction 
 prevents the error routine from finding any reasonable 
 solutions. This example demonstrates the problems that 
 arise when the error routine is unable to provide a 
 solution. This is the only one of the ten sample programs 
 for which the error routine is unable to provide at least 
 one reasonable solution. Finally, to end on a positive 
 note, sample program #10 is presented, and it demonstrates 
 the error routine providing several good solutions. 
 
 Sample Program #1 
 
 A B C[20]. 
 READ A B C[20] 
 WRITE A B; 
 WRITE C[20]; . 
 
 The configuration of the parser at the point of error 
 detection is as follows: 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> READ 
 <input-list> <identifier> [ <expression> ] ?1    "WRITE" 
                                                  in line #3 
 
 The error is detected at this point because the symbol 
 "WRITE" is not a legal right context to make the 
 reduction <subscripted-variable> => <identifier> [ 
 <expression> ]. The possible parses are set up for the 
 symbol "WRITE". The error recovery method supplies one 
 solution. 
 
 The possible parses which yield solution #1 are in the 
 following configuration when they attempt to reduce over the 
 error point. 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> READ 
 <input-list> <identifier> [ <expression> ] 
 ?1 <statement> ;                                 "WRITE" 
                                                  in line #4 

 The attempted reduction is: 

 <statement-list> => <statement-list> <statement> ; 
 
 The Correction Phase immediately finds that the insertion of 
 a ";" will break the ?1 barrier and yields the solution: 
 
 INSERT ";" AFTER "]" WHICH IS TOKEN #7 IN LINE #2 
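
 The token numbers quoted in these messages can be reproduced with a small tokenizer. The sketch below is not the lexical analyzer of the implementation; it assumes that identifiers and keywords are runs of letters, that numbers are runs of digits, and that every other non-blank character (including ":" and "=") is a separate token.

```python
import re

# Letters, digits, or any single non-blank character (assumed token classes).
TOKEN = re.compile(r"[A-Za-z]+|\d+|\S")

def tokenize(line):
    """Split one source line into tokens, numbered from 1 by position."""
    return TOKEN.findall(line)

def insert_message(line_text, line_no, token_no, terminal):
    """Format an insertion message in the style used by the error routine."""
    token = tokenize(line_text)[token_no - 1]
    return (f'INSERT "{terminal}" AFTER "{token}" '
            f'WHICH IS TOKEN #{token_no} IN LINE #{line_no}')

message = insert_message("READ A B C[20]", 2, 7, ";")
```

 For the second line of sample program #1, the seventh token is the "]", which matches the message above.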
 
 Sample Program #2 
 
 A X Y. 
 READ A 
 X := Y; . 
 
 The configuration of the parser at the point of error 
 detection is as follows: 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> READ 
 <input-list> <identifier> ?1                     ":" 
 
 The error is detected at this point because, with this stack 
 configuration, the symbol ":" is not a legal right context 
 to make the reduction <variable> => <identifier>. (Note 
 that this is an LR, not an SLR [2] parser.) The possible 
 parses are set up for the symbol ":". The error recovery 
 method supplies two solutions. 
 
 The possible parse which yields solution #1 is in the 
 following configuration when it attempts to reduce over the 
 error point. 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> READ 
 <input-list> <identifier> ?1 : =                 "Y" 
                                                  in line #3 
 
 The attempted reduction is: 
 <left-part> => <identifier> : = 
 
 The Correction Phase finds no "simple way" to break the 
 original ?1 barrier. It successfully backs the barrier over 
 the symbol "X", leaving the possible parse in the following 
 configuration: 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> READ 
 <input-list> ?1 <identifier> : = 

 It then finds that the insertion of a ";" will break 
 the barrier and yields the solution: 
 
 INSERT ";" AFTER "A" WHICH IS TOKEN #2 IN LINE #2 
 
 The possible parse which yields solution #2 is in the 
 following configuration when it attempts to reduce over the 
 error point. 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> READ 
 <input-list> <identifier> ?1 : =                 "Y" 
                                                  in line #3 

 The attempted reduction is: 
 <left-part> => <identifier> : = 

 The Correction Phase finds no "simple way" to break the 
 original ?1 barrier. It successfully backs the barrier over 
 the symbol "X". It then finds that the insertion of a "[" 
 will break the barrier and yields the solution: 

 INSERT "[" AFTER "A" WHICH IS TOKEN #2 IN LINE #2 
 
 Both solutions have the same cost attributed to them, 
 so the error recovery method arbitrarily chooses solution 
 #1, which happens to be the correct one. For this sample 
 program, the error recovery method thus provides one 
 correct solution and one incorrect solution. This is a 
 case where the error recovery method almost gets fooled by 
 a language construct being legal in more than one context. 
 According to the language description [Appendix A], an 
 assignment statement can appear as the subscript of an 
 array. The second solution is based on the assumption that 
 the "X:=Y" is a subscript, and that the "read" statement 
 was intended to be "READ A[X:=Y]". If there is a "]" 
 between the "Y" and the ";", then solution #2 is the most 
 likely. If not, then solution #1 is the most likely. The 
 problem is that the error recovery routine must make a 
 decision without knowing what symbols are to the right of 
 the "Y". This is a case where our method of parsing until 
 an attempt is made to reduce over the error point does not 
 provide sufficient look-ahead to decide between the 
 possibilities. Luckily, it chooses the correct solution. 
 If it chose the incorrect one, it would insert the "[" 
 after the "A", return to the regular parser, continue the 
 parse until discovering that the corresponding "]" was 
 missing, and call the error routine again, at which point 
 the "]" would be inserted. If the original program was 

 A X Y. 
 READ A X:=Y]; . 

 then the error routine would insert a ";" after the "A" in 
 line #2 and return to the regular parser. The regular 
 parser would encounter another error and call the error 
 routine with the "]" as the input symbol. The error routine 
 would set up a set of possible parses for the symbol "]", 
 and proceed to find the solution of "changing" the ";" (that 
 it just inserted) to a "[". 
 
 Sample Program #3 
 
 A[20] B X. 
 X := A[ B:= ]; 
 WRITE X; . 
 
 The configuration of the parser at the point of error 
 detection is as follows: 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <left-part> <identifier> [ <identifier> : = ?1   "]" 
                                                  in line #2 

 The error is detected at this point because the symbol "]" 
 is not a legal right context to make the reduction 
 <left-part> => <identifier> : =. The possible parses are 
 set up for the symbol "]". The error recovery method 
 supplies one solution that contains an option of two 
 symbols. 
 
 The possible parses which yield solution #1 are in the 
 following configuration when they attempt to reduce over the 
 error point. 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <left-part> <identifier> [ <identifier> : = 
 ?1 ]                                             ";" 
                                                  in line #2 

 The attempted reduction is: 

 <subscripted-var> => <identifier> [ <assignment> ] 
 
 The Correction Phase immediately finds that the insertion of 
 either an identifier or a string of digits will break the ?1 
 barrier and yields the solution: 
 
 INSERT "identifier" or "digits" AFTER "=" WHICH IS 
 TOKEN #8 IN LINE #2. 
 
 Sample Program #4 

 A B. 
 TOGO A; 
 READ B; . 
 
 The configuration of the parser at the point of error 
 detection is as follows: 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <identifier> ?1                                  "A" 
                                                  in line #2 

 The error is detected at this point because a statement 
 cannot start with two consecutive identifiers. The possible 
 parses are set up for the identifier "A". The error 
 recovery method supplies three solutions. 

 The possible parse which yields solution #1 is in the 
 following configuration when it attempts to reduce over the 
 error point. 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <identifier> ?1 <identifier>                     ";" 
                                                  in line #2 

 The attempted reduction is: 
 <statement> => GOTO <identifier> 
 
 The Correction Phase finds no "simple way" to break the ?1 
 barrier. It attempts to back the barrier over the symbol 
 "TOGO", but to do so must delete the "TOGO". It then finds 
 that the insertion of the keyword "GOTO" will break the 
 barrier. The consecutive insertion and deletion are 
 combined into a change and the solution emitted is: 
 
 CHANGE "TOGO" WHICH IS TOKEN #1 IN LINE #2 TO "GOTO" 
 
 The possible parse which yields solution #2 is in the 
 following configuration when it attempts to reduce over the 
 error point. 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <identifier> ?1 <variable>                       in line #2 

 The attempted reduction is: 

 <input-list> => <input-list> <variable> 

 The Correction Phase finds no "simple way" to break the 
 original barrier. It successfully backs the barrier over 
 the symbol "TOGO". It then finds that the insertion of the 
 keyword "READ" will break the barrier and yields the 
 solution: 

 INSERT "READ" AFTER "." WHICH IS TOKEN #3 IN LINE #1 
 
 The possible parse which yields solution #3 is in the 
 following configuration when it attempts to reduce over the 
 error point. 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <identifier> ?1 <variable>                       in line #2 

 The attempted reduction is: 

 <output-list> => <output-list> <variable> 

 The Correction Phase finds no "simple way" to break the ?1 
 barrier. It successfully backs the barrier over the symbol 
 "TOGO". It then finds that the insertion of the keyword 
 "WRITE" will break the barrier and yields the solution: 

 INSERT "WRITE" AFTER "." WHICH IS TOKEN #3 IN LINE #1 
 
 Solution #1 makes an insertion and a deletion, but they 
 are combined into a single change at the cost of the maximum 
 of the costs of the two individual corrections. Since all 
 of the sample programs were run with constant insertion and 
 deletion costs for every terminal, all three possible 
 solutions have the same cost attributed to them. The error 
 recovery method arbitrarily chooses solution #1. Note that 
 the solutions provided have nothing to do with the 
 similarity between the spelling of the symbols "TOGO" and 
 "GOTO". The same solutions will be provided if "TOGO" is 
 spelled "WXYZ" or "PEED". 

 Sample Program #5 
 
 X. 
 
 X ;= 2; 
 
 WRITE X; . 
 
 The configuration of the parser at the point of error 
 detection is as follows: 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <identifier> ?1                                  ";" token #2 
                                                  in line #2 
 
 The error is detected at this point because a statement 
 cannot start with an identifier followed by a ";". The 
 possible parses are set up for the symbol ";". However, all 
 of the possible parses encounter an error on the very next 
 symbol ("="). The ";" is deleted, the possible parses are 
 set up again, and the error routine is restarted. The error 
 recovery method supplies one solution. 
 
 The possible parse which yields solution #1 is in the 
 following configuration when it attempts to reduce over the 
 error point. 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <identifier> ?1 =                                "2" 

 The attempted reduction is: 
 <left-part> => <identifier> : = 

 The Correction Phase immediately finds that the insertion of 
 a ":" will break the ?1 barrier and yields the solution: 

 DELETE ";" WHICH IS TOKEN #2 IN LINE #2 

 INSERT ":" AFTER "X" WHICH IS TOKEN #1 IN LINE #2 
 
 In this sample program, the consecutive insertion and 
 deletion are not combined into a single change. This is 
 because the "DELETE ;" message originates from the 
 discarding of the first set of possible parses and 
 consequently its cost is not represented in the total cost 
 of the possible parse for which the final correction is 
 found. 

 Sample Program #6 
 
 X Y Z. 
 X=Y THEN Z: = ELSE Z:=Z*1; 
 WRITE Z; . 
 
 The configuration of the parser at the point of error 
 detection is as follows: 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <identifier> ?1                                  "=" token #2 
                                                  in line #2 

 The error is detected at this point because a statement 
 cannot start with an identifier followed by an "=". The 
 possible parses are set up for the symbol "=". The error 
 recovery method supplies one solution. 

 The possible parse which yields solution #1 is in the 
 following configuration when it attempts to reduce over the 
 error point. 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <identifier> ?1 <relational-op> <expression>     "THEN" 

 The attempted reduction is: 

 <boolean-expr> => <expression> <relational-op> <expression> 
 
 
 The Correction Phase finds no "simple way" to break the ?1 
 barrier. It successfully backs the barrier over the symbol 
 "X". It then finds that the insertion of the keyword "IF" 
 will break the barrier and yields the solution: 

 INSERT "IF" AFTER "." WHICH IS TOKEN #4 IN LINE #1 
 
 This sample program is an example of why unbounded 
 look-ahead is so important and why this method continues 
 parallel parsing until every possible parse has either been 
 discarded or attempted to reduce over the error point, 
 regardless of how many possible solutions have been found. 
 In this case, there is another possible parse which thinks 
 it is parsing an assignment statement. It attempts to 
 reduce over the error point (using the production 
 <left-part> => <identifier> : =) immediately after shifting 
 the "=", and finds the first solution of inserting a ":" 
 after the "X" in line #2. Meanwhile the possible parse 
 which eventually provides the correct solution thinks it is 
 parsing an "if" statement, but has not yet attempted to 
 reduce over the error point. If the statement was intended 
 to be an assignment statement, then the first solution is 
 correct. If the statement was intended to be an "if" 
 statement, as is apparently the case here, then the first 
 solution is wrong. There is no way of knowing which is the 
 case, until the symbol immediately after the expression to 
 the right of the "=" is known. In this case it is a "THEN" 
 and the first solution is wrong. Since another possible 
 parse is still parsing, the possible parse of the first 
 solution must continue also. Upon seeing the "THEN", the 
 possible parse of the first solution encounters an error and 
 is discarded. At this point, the single remaining possible 
 parse attempts to reduce over the error and provides the 
 final solution. 
 
 The importance of unbounded look-ahead is demonstrated 
 by the fact that the expression to the right of the "=" (in 
 this case a single "Y") can be of any length. If the 
 look-ahead is bounded, and the length of the expression is 
 greater than the bound, then the symbol following the 
 expression will not be known and no error recovery method 
 can determine which statement was most likely intended by 
 the programmer. 
 
 Sample Program #7 
 
 X Y Z. 
 X := (( Y + ( Z*5; 
 WRITE X; . 
 
 The configuration of the parser at the point of error 
 detection is as follows: 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <left-part> ( ( <expression> + ( <term> * 
 <digits> ?1                                      ";" 
                                                  in line #2 

 The error is detected at this point because, with this stack 
 configuration, the symbol ";" is not a legal right context 
 to make the reduction <primary> => <digits>. The possible 
 parses are set up for the symbol ";". The error recovery 
 method supplies one solution. 

 The possible parse which yields solution #1 is in the 
 following configuration when it attempts to reduce over the 
 error point. 

 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <left-part> ( ( <expression> + ( <term> * 
 <digits> ?1 ;                                    "WRITE" 
 
 The attempted reduction is: 

 <statement-list> => <statement-list> <statement> ; 
 
 The Correction Phase finds that the insertion of a ")" will 
 break the barrier. In this case though, there are three 
 unmatched left parentheses and inserting a single right 
 parenthesis does not provide a correct solution. This is a 
 case where the Correction Phase is temporarily fooled. The 
 Correction Phase inserts the ")" and attempts to reparse 
 back up to the symbol which was the input symbol when the 
 attempt was made to reduce over the error point. However, 
 the possible parse encounters an error before reparsing to 
 that symbol (the ";" in line #2). The possible parse is 
 temporarily discarded. However, no other possible parse 
 either "shifts" on the ";", or attempts to reduce over the 
 error point and provides a solution which allows reparsing 
 any further than the discarded one did. Therefore, the 
 possible parses discarded in this way (two in this example) 
 are re-instated and the error routine is called recursively 
 from the point at which they encountered the error during 
 the reparse attempt. 
 
 The process then repeats itself, with the Correction 
 Phase breaking the barrier by inserting a ")", but then 
 encountering an error on the reparse. Again the error 
 routine is called recursively, and again the Correction 
 Phase breaks the barrier by inserting a ")". This time the 
 reparse finally succeeds and the error routine yields the 
 solution: 

 INSERT ")" AFTER "5" WHICH IS TOKEN #11 IN LINE #2 
 INSERT ")" AFTER "5" WHICH IS TOKEN #11 IN LINE #2 
 INSERT ")" AFTER "5" WHICH IS TOKEN #11 IN LINE #2 

 This sample program demonstrates that the error 
 recovery method can handle any number of unmatched left 
 parentheses, as long as a restriction is not placed on the 
 number of levels of recursion allowed. 

 This error recovery method can also handle any number 
 of unmatched right parentheses. This is much less 
 complicated, since it does not require any recursion. The 
 unmatched right parentheses are encountered one at a time, 
 and as each one is encountered, the corresponding left 
 parenthesis is inserted. 
 
 Sample Program #8 
 
 A X. 
 BEGIN 
 READ A; 
 X := 2*A; 
 WRITE X; . 
 
 The configuration of the parser at the point of error 
 detection is as follows: 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 BEGIN <statement-list> <statement> ; ?1          "." 
 
 The error is detected at this point because, with this stack 
 configuration, the symbol "." is not a legal right context 
 to make the reduction <statement-list> => <statement-list> 
 <statement> ;. The possible parses are set up for the 
 symbol ".". The error recovery method supplies one 
 solution. 
 
 The possible parse which yields solution #1 is in the 
 following configuration when it attempts to reduce over the 
 error point. 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 BEGIN <statement-list> <statement> ; ?1 .        end-of-file 

 The attempted reduction is: 
 
 <program> => <declaration-list> . <statement-list> . 
 
 The Correction Phase finds no "simple way" to break the ?1 
 barrier. It then successfully backs the barrier over the 
 ";" in line #5. It again finds no "simple way" to break the 
 barrier. This time it finds that the "true" leftstate and 
 one of the "true" rightstates are both enterable on the 
 nonterminal <statement>. The barrier is successfully backed 
 over the nonterminal <statement> (which originally consisted 
 of "WRITE X"). There is still no "simple way" to break the 
 barrier, and the barrier is again backed up. This time the 
 "true" leftstate and one of the "true" rightstates are both 
 enterable on the nonterminal <statement-list>. The barrier 
 is successfully backed over the nonterminal <statement-list> 
 (which originally consisted of "READ A; X := 2*A;"). Once 
 again there is no "simple way" to break the barrier. It 
 then attempts to back the barrier over the symbol "BEGIN", 
 but to do so must delete the "BEGIN". At this point, the 
 Correction Phase realizes that deleting the "BEGIN" has 
 broken the barrier, and it yields the solution: 
 
 DELETE "BEGIN" WHICH IS TOKEN #1 IN LINE #2 
 
 The reason the Correction Phase does not insert an 
 "END" instead of deleting the "BEGIN" is because, according 
 to the syntax of the language [Appendix A], it needs to 
 insert "END ;". Since the method was implemented so that it 
 considers a "simple way" to break the barrier as being the 
 insertion of a single terminal, it cannot insert "END ;" and 
 must continue backing up until it deletes the "BEGIN". 
 
 Sample Program #9 

 X. 
 X < 2*3; . 
 
 The configuration of the parser at the point of error 
 detection is as follows: 
 
 Symbols Implied By States On Parsing Stack       Input Symbol 

 <declaration-list> . <statement-list> 
 <identifier> ?1                                  "<" 
 
 The error is detected at this point because a statement 
 cannot start with an identifier followed by the symbol "<". 
 The possible parses are set up for the symbol "<". The 
 error recovery method does not find any solutions. 
 
 
 There is originally only one possible parse, since 
 there is only one state from which it is legal to "shift" on 
 the input symbol "<". That possible parse believes it is 
 parsing the Boolean expression in an "if" statement, and
 encounters an error with the ";" being an illegal right 
 context to make the reduction <primary> => <digits>. At 
 this point the possible parse is in the following 
 configuration. 
 
 Symbols Implied By States On Parsing Stack        Input Symbol

 <declaration-list> . <statement-list>
 <identifier> ?1 <relational-op> <term> *
 <digits> ?2                                       ";"
 
 At this point, the error routine calls itself recursively
 and a new set of possible parses is set up for the symbol
 ";". All of the possible parses attempt to reduce over the
 error point immediately after "shifting" the ";". None can
 successfully break the barrier, and the ";" is deleted. The
 error recovery routine tries again, this time setting up the
 possible parses for the input symbol ".". All of the
 possible parses attempt to reduce over the error point
 immediately after "shifting" the ".". Again, none can
 successfully break the barrier, and the "." is deleted. At
 this point, the error recovery routine recognizes that it
 would be pointless to set up possible parses for the
 end-of-file condition, so it admits defeat and indicates
 that it has found zero solutions.
 
 If the method is implemented to consider a "simple way"
 to break the barrier as being the insertion of two terminals
 instead of only one, then a solution will be found. While
 the possible parses are set up for the ";", the Correction
 Phase will discover the solution of deleting the "<" and
 inserting a ":" followed by an "=".
 
 The reason this sample program is presented here is to 
 demonstrate what happens when the error recovery method 
 cannot find a solution. It should be expected that the 
 error recovery method will not always be able to provide a 
 solution. If the method allows the insertion of N 
 terminals, there could always be a case requiring the
 insertion of N+1 terminals. 
 
 The problem with the method, as it is described in 
 Section 4, is that it does not know when to "give up". In 
 this sample program, there are no more statements following
 the erroneous one. However, if there are additional 
 statements, the error routine will parse through them all, 
 still attempting to correct the first error. Suppose the 
 erroneous statement is followed by several completely 
 correct statements, each correctly terminated by a ";". For 
 each one, the error routine sets up possible parses for the 
 first symbol of the statement, parses the statement until it
 reduces into <statement>, "shifts" the terminating ";", and
 attempts the reduction <statement-list> => <statement-list>
 <statement> ;. For all of the possible parses, this
 reduction means reducing over the first error point. The
 Correction Phase fails as it did previously and consequently
 the input symbol which was assumed correct in setting up the
 possible parses is deleted. (The symbol deleted is the
 first symbol of the correct statement.) The possible parses
 are then set up for the next input symbol (the second symbol 
 of the correct statement) and the process continues. 
 
 
 In summary, each correct statement is parsed correctly,
 but cannot be reduced into <statement-list> because of the
 previous erroneous statement, for which the Correction Phase
 could not find a solution. Each correct statement is
 then deleted symbol by symbol. Occasionally, the remaining
 portion of a statement will resemble another language
 construct and the error routine will call itself recursively
 before resuming its pattern of deleting symbols.
 
 Clearly, these spurious error messages throughout the
 remainder of the program are not acceptable. What is needed
 is a way for the error routine to know when a solution cannot
 be found, so that it can enter "panic mode" and "clean up"
 the stack before continuing. This error recovery method
 should be implemented with a "panic mode" which keys on one
 or more special symbols of the language. For this language
 [Appendix A], a single special symbol of ";" should be
 sufficient. The "panic mode" should be entered whenever the
 error routine deletes a ";" as a result of discarding all
 possible parses that were originally set up for that ";".
 This is preferable to entering "panic mode" each time a ";"
 is deleted.
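
 The entry condition for "panic mode" described above can be
 sketched as a small predicate. This is only an illustration in
 Python: the function name, its three arguments, and the set
 SPECIAL_SYMBOLS are hypothetical, and the "clean up" of the
 stack is not shown because the text does not specify it.

```python
# Special symbols that panic mode keys on; ";" is the one suggested
# for the Appendix A language. The set itself is an assumption.
SPECIAL_SYMBOLS = {";"}


def should_enter_panic_mode(deleted_symbol, parses_were_set_up_for_it,
                            all_parses_discarded):
    """Decide whether deleting `deleted_symbol` should trigger panic mode.

    Panic mode is entered only when the deleted symbol is a special
    symbol AND it is being deleted because every possible parse that
    was originally set up for that symbol has been discarded.
    """
    return (deleted_symbol in SPECIAL_SYMBOLS
            and parses_were_set_up_for_it
            and all_parses_discarded)
```

 The second and third arguments are what distinguish this rule
 from the cruder policy of entering "panic mode" each time a ";"
 is deleted.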
 
 Sample Program #10

 A B X Y.
 X := Y:
 A := B;
 WRITE X A; .
 
 The configuration of the parser at the point of error 
 detection is as follows: 
 
 Symbols Implied By States On Parsing Stack        Input Symbol

 <declaration-list> . <statement-list>
 <left-part> <identifier> : ?1                     "A" in line #3

 The error is detected at this point because "X:=Y:" cannot
 be followed by any symbol but an "=". The possible parses
 are set up for the identifier "A". The error recovery
 method yields three solutions.
 
 
 The possible parse which yields solution #1 is in the 
 following configuration when it attempts to reduce over the 
 error point. 
 
 Symbols Implied By States On Parsing Stack        Input Symbol

 <declaration-list> . <statement-list>
 <left-part> <identifier> : ?1 <statement>         ";" in line #3
 
 The attempted reduction is: 
 
 <statement> => <identifier> : <statement> 
 
 The Correction Phase finds no "simple way" to break the ?1 
 barrier. It then successfully backs the barrier over the
 ":" (token #5 in line #2). It again finds no "simple way"
 to break the barrier and it successfully backs the barrier
 over the "Y" (token #4 in line #2). There is still no
 "simple way" to break the barrier. It attempts to back the
 barrier over the "=" (token #3 in line #2), but to do so
 must delete that "=". At this point, the Correction Phase
 realizes that deleting the "=" has broken the barrier, and
 it yields the solution:

 DELETE "=" WHICH IS TOKEN #3 IN LINE #2
 
 
 The possible parse which yields solution #2 is in the
 following configuration when it attempts to reduce over the
 error point.

 Symbols Implied By States On Parsing Stack        Input Symbol

 <declaration-list> . <statement-list>
 <left-part> <identifier> : ?1 <assignment>        ";" in line #3
 
 The attempted reduction is:

 <assignment> => <left-part> <assignment>

 The Correction Phase immediately finds that the insertion of
 an "=" will break the ?1 barrier and yields the solution:

 INSERT "=" AFTER ":" WHICH IS TOKEN #5 IN LINE #2
 
 The possible parse which yields solution #3 is in the 
 following configuration when it attempts to reduce over the 
 error point. 
 
 Symbols Implied By States On Parsing Stack        Input Symbol

 <declaration-list> . <statement-list>
 <left-part> <identifier> : ?1 <statement> ;       "WRITE"

 The attempted reduction is:

 <statement-list> => <statement-list> <statement> ;
 
 
 The Correction Phase finds no "simple way" to break the ?1 
 barrier. It attempts to back the barrier over the symbol 
 ":" (token #5 in line #2), but to do so must delete that 
 ":". It then finds that the insertion of a ";" will break 
 the barrier. The consecutive insertion and deletion are 
 combined into a change and the solution emitted is:
 
 CHANGE ":" WHICH IS TOKEN #5 IN LINE #2 TO ";" 
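
 The merging of a consecutive deletion and insertion into a
 single change can be sketched as follows. This is a minimal
 illustration in Python, not the implementation used here; the
 tuple layout of an edit and the name combine_edits are
 assumptions.

```python
def combine_edits(edits):
    """Merge a DELETE immediately followed by an INSERT at the same
    token position into a single CHANGE.

    Each edit is a hypothetical tuple:
        ("DELETE", (line, token), symbol)
        ("INSERT", (line, token), symbol)
    A merged edit has the form ("CHANGE", (line, token), old, new).
    """
    result = []
    for edit in edits:
        prev = result[-1] if result else None
        if (prev is not None and prev[0] == "DELETE"
                and edit[0] == "INSERT" and prev[1] == edit[1]):
            # Same position: report one change instead of two edits.
            result[-1] = ("CHANGE", prev[1], prev[2], edit[2])
        else:
            result.append(edit)
    return result


# Solution #3 of Sample Program #10: ":" (token #5, line #2) becomes ";".
solution = combine_edits([("DELETE", (2, 5), ":"), ("INSERT", (2, 5), ";")])
```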
 
 All three possible solutions have the same cost
 attributed to them and the error recovery routine 
 arbitrarily chooses solution #1. 
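
 One way such a choice among equal-cost solutions might be made
 is sketched below in Python. The (cost, message) pairs and the
 cost value of 1 are hypothetical; the only point illustrated is
 that taking the first minimum resolves ties in favor of the
 earliest solution found.

```python
def choose_solution(solutions):
    """Return the cheapest (cost, message) pair.

    Python's min() returns the first minimal element it encounters,
    so solutions of equal cost are resolved in favor of the earliest.
    """
    return min(solutions, key=lambda s: s[0])


# Hypothetical equal costs for the three solutions of Sample Program #10.
solutions = [
    (1, 'DELETE "=" WHICH IS TOKEN #3 IN LINE #2'),
    (1, 'INSERT "=" AFTER ":" WHICH IS TOKEN #5 IN LINE #2'),
    (1, 'CHANGE ":" WHICH IS TOKEN #5 IN LINE #2 TO ";"'),
]
chosen = choose_solution(solutions)
```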
 
 
 6. EVALUATION OF THE METHOD 
 
 6.1 Effectiveness
 
 This recovery method meets all of the criteria 
 necessary for an effective error recovery method. Upon 
 correcting an error, it resumes parsing smoothly and does 
 not skip over any of the input string. If a second error is
 encountered while attempting to correct the original error, 
 the error routine calls itself recursively and independently 
 corrects the second error before resuming the correction 
 attempt on the first error. This allows the error recovery 
 method to handle several errors in close proximity. The 
 recovery method has the ability to make corrections at any 
 point in the input string, even if the corrections require
 modifying the parsing stack. It uses an unbounded 
 look-ahead scheme. The advantages of an unbounded 
 look-ahead scheme are demonstrated in Sample Program #6. 
 The recovery method supplies extremely helpful error 
 messages. The messages explicitly state how the program 
 should be changed in order to make it syntactically correct. 
 
 
 A disadvantage is that the method, as implemented, only
 considers single insertions. There are cases where this is
 insufficient. Of course, the method could be extended to
 consider multiple insertions, but this would be less
 efficient time-wise and it is not clear that it would always
 be more effective overall. For instance, the Correction
 Phase might provide a solution by inserting two terminals
 and then stop short of finding a less costly solution at an
 earlier point in the input string. Regardless of how many
 insertions are allowed in a correction attempt, there will
 be times when the error routine will fail to find a
 solution. When this occurs, the error routine must enter
 "panic mode" and "clean up" the stack in order to resume the
 parse smoothly. This is explained in more detail in the
 discussion following Sample Program #9 in Section 5.
 
 This error recovery method can produce some unexpected
 results. For instance, it might provide a single solution
 when there are several equally obvious solutions that it did
 not find. At other times, it will provide all reasonable
 solutions, including some that are not so obvious. This is
 because the Correction Phase returns the first acceptable
 solution that it finds for each possible parse. Multiple
 solutions only occur when more than one possible parse
 yields an acceptable solution. Overall, this error recovery
 method is very effective. It usually provides at least one,
 and often provides several reasonable solutions.
 
 6.2 Time Requirements 
 
 The following are rough estimates of the actual DEC-10 
 CPU time required to execute some of the sample programs. 
 They must be considered as rough estimates only, since the 
 CPU time recorded varied up to 10% for identical sample 
 programs running in the same amount of core. 
 
 The LR parser itself has an overhead of approximately 
 4.75 seconds, which is due to it reading in the parsing 
 tables. The additional time required for the LR parser to
 parse correct versions of the sample programs is negligible 
 relative to the parser overhead. For the first error 
 encountered, the error recovery routine has an overhead of 
 approximately 5.5 seconds. This is due to it reading in the 
 special tables that the error routine needs. 
 
 The time to correct each error varies depending on the
 amount of core in which the program is executing. The
 sample programs were always executed in 80K of virtual
 memory. The DEC-10 CPU time was recorded both while
 executing the programs in 30K of real memory and in 60K of
 real memory.
 
 The number of possible parses that are set up is a big
 factor in the time required to correct an error. Even
 though most possible parses are discarded fairly quickly,
 just setting them up takes a significant amount of time.
 Another big factor is the number of possible parses which
 are discarded before attempting to reduce over the error
 point and consequently never enter the Correction Phase.
 While executing in 30K of real memory, the missing ";" in
 sample program #1 takes 2.8 seconds to correct. (8 possible
 parses are set up for the input symbol "WRITE".) When the
 ";" is missing from the last statement instead of the first
 statement, it only takes 2.3 seconds to correct. (9
 possible parses are set up for the input symbol ".".) Though
 the number of possible parses that are set up is
 approximately the same, the latter correction is quicker,
 since only one possible parse (as opposed to three for the
 former correction) attempts to reduce over the error point.
 In sample program #4, the error is detected with an
 identifier as the input symbol and consequently 82 possible
 parses are set up. The error routine requires .1.9 seconds
 to provide the solutions for sample program #4. Recursion
 is also a big factor in the time required, since a new set
 of possible parses is set up for each level of recursion and
 the number of possible parses usually increases with
 each successive level of recursion. In sample program #7,
 the three missing right parentheses take a total of 26
 seconds to correct.
 
 While executing in 60K of real memory, the results are 
 significantly better. Sample program #1 takes 1.2 seconds 
 to correct the missing ";". When the ";" is missing from 
 the last statement instead of the first statement, it only 
 takes .9 seconds to correct. Sample program #4, with the 82
 original possible parses, takes 2.9 seconds to correct, and
 the three missing right parentheses in sample program #7
 take a total of 17 seconds to correct. 
 
 In summary, while executing in 30K of real memory, the 
 error routine takes anywhere from 2 seconds to 10 seconds to 
 correct a single error. (The three missing right
 parentheses are considered three errors.) While executing in
 60K of real memory, the error routine takes anywhere from
 less than 1 second up to 6 seconds to correct a single
 error. Remember that these figures should only be
 considered as rough estimates, and that they are based on
 DEC-10 CPU time. A significantly faster machine would
 require significantly less time.
 
 
 6.3 Space Requirements 
 
 
 An LR parser implemented with this error recovery
 method requires considerably more space than the same LR
 parser without any error recovery method. There are four
 main contributors to the additional space requirements, and
 they will be discussed in the next several paragraphs. The
 "big four" are the error routine code, the tokenized input
 string, the special tables, and the error routine's main data
 structures. There are many other data structures required
 by the error routine, but the sum of their space
 requirements is negligible when compared to the requirements
 of the "big four".
 
 The space required by the error routine code is 
 approximately 10K. 
 
 The input string which is saved in tokenized form can
 also require a large amount of space. That space is determined
 by the largest program (in terms of number of tokens) which can
 be run on the system.
 
 The Legal State Table in our implementation is an array
 of length 518 and is indexed by an array of length 27 (one
 for each terminal). The Predecessor States Table is an
 array of length 984 indexed by an array of length 356 (one
 for each state). The Predecessor States Table also includes
 another array of length 356 to hold the symbol each state is
 enterable on. The TEH States Table is an array of length
 17ft<l indexed by an array of length 356 (one for each state).
 The total space required for these three tables sums up to
 approximately 4.4K.
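
 The "array indexed by an array" arrangement used for these
 tables can be pictured with the following Python sketch. The
 layout shown (an index array of start offsets into one flat
 array) is an assumption about how such a table could be
 organized, and the tiny index and table contents are invented
 purely for illustration.

```python
def lookup(index, table, key):
    """Return the entries of the flat `table` that belong to `key`.

    index[key] is the offset of the first entry for `key`; the entries
    run up to the offset of the next key, or to the end of the table.
    """
    start = index[key]
    end = index[key + 1] if key + 1 < len(index) else len(table)
    return table[start:end]


# Hypothetical miniature of a Legal State Table: three terminals
# (keys 0..2) and the states from which each may legally be shifted.
index = [0, 2, 3]
table = [4, 9, 7, 1, 5, 8]
states_for_terminal_0 = lookup(index, table, 0)
```

 With this layout, the storage needed is the length of the flat
 array plus the length of the index array, which matches the
 way the table sizes are quoted above.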
 
 Storage space is needed for each tentative correction
 of every possible parse, as well as for pointers into the
 tokenized input string indicating where each correction
 should take place. These require two 2-dimensional arrays,
 each of length (maximum number of corrections per possible
 parse) * (maximum number of possible parses). Even more space
 is required by the possible parses themselves and their
 accompanying pointers into the tokenized input string.
 These require two 2-dimensional arrays, each of length
 (parsing stack limit) * (maximum number of possible parses).
 
 Just to get a rough estimate of the space required,
 assume that the maximum number of corrections per possible
 parse is restricted to 5, and that the parsing stack is
 restricted to a maximum depth of 50. This means that the
 space required by the error routine's data structures will be
 approximately (2*5 + 2*50) * (maximum number of possible
 parses), which is equal to 110 * (maximum number of possible
 parses). The problem is that the number of possible parses
 needed can be quite large, especially if several levels of
 recursion are allowed. The absolute minimum number of
 possible parses, even if no recursion is allowed, is still
 equal to the largest number of states from which it is legal
 to "shift" for any given input symbol. For the language
 implemented [Appendix A], there are 82 states from which it
 is legal to "shift" when the input symbol is an identifier.
 Thus, for our implementation, the absolute minimum space
 required is 110*82 words of memory, which is approximately
 10K.
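
 The estimate above is simple enough to restate as a short
 calculation. The Python function below is only a worked
 version of that arithmetic; its name and parameter names are
 not taken from the implementation.

```python
def error_routine_words(max_corrections, stack_limit, max_parses):
    """Words needed by the four 2-dimensional arrays described above.

    Two arrays of (max_corrections x max_parses) hold the tentative
    corrections and their input pointers; two arrays of
    (stack_limit x max_parses) hold the possible parses and their
    input pointers.
    """
    words_per_parse = 2 * max_corrections + 2 * stack_limit
    return words_per_parse * max_parses


# 5 corrections per parse, stack depth 50, 82 possible parses.
minimum_words = error_routine_words(5, 50, 82)
```

 With these figures the result is 9020 words, which is the
 quantity rounded to approximately 10K in the text.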
 
 
 This absolute minimum increases fairly rapidly as the 
 number of levels of recursion that are allowed is increased. 
 The problem is that usually more than one possible parse is 
 re-instated immediately prior to a recursive call of the 
 error routine. For example, in Sample Program #7 (three
 unmatched left parentheses) the error routine recursively
 calls itself twice. Both recursive calls, plus the original
 call from the regular parser, occur on the input symbol ";".
 There are only two states from which it is legal to "shift"
 on the input symbol ";". However, both possible parses are
 re-instated immediately prior to the first recursive call.
 The pattern repeats itself and the first and second levels
 of recursion have four and eight possible parses
 respectively. This increase in the number of possible
 parses needed for each successive level of recursion is even
 more troublesome if the input symbol is an identifier
 instead of a ";". Also remember that in this example the
 number of possible parses which are needed for each
 successive level of recursion only doubles. It is possible
 for the number to be multiplied several times over for
 each successive level of recursion.
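
 The growth pattern in this example can be written out as a
 short calculation. The Python sketch below assumes a constant
 multiplication factor per level, which holds for Sample
 Program #7 but need not hold in general; the function name is
 hypothetical.

```python
def parses_per_level(initial, factor, levels):
    """Possible parses needed at the original call and at each of
    `levels` successive levels of recursion, assuming the number is
    multiplied by `factor` at every level."""
    return [initial * factor ** k for k in range(levels + 1)]


# Sample Program #7: two parses for ";", doubling at each level.
sample_7 = parses_per_level(2, 2, 2)
```

 For Sample Program #7 this gives 2, 4, and 8 possible parses.
 Under the same doubling assumption, an error detected on an
 identifier would start from 82.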
 
 Remember that all of these figures are based on our 
 implementation and will vary with the language that is 
 implemented. 
 
 6.4 Ease of Implementation 
 
 There is no problem in constructing the special tables 
 that are needed by the error routine. The tables are 
 automatically generated from the parsing tables. 
 
 The LR parser itself must be modified somewhat. The 
 lexical analysis phase must be modified to save each token 
 that it processes, as well as to check if the next token it 
 is supposed to process has already been consumed. The 
 parser must be modified to store a corresponding pointer 
 into the tokenized input string for each state that it 
 pushes onto the parsing stack. These modifications and the
 rationale behind them are described in more detail in
 Section 4.6.
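
 The two modifications can be pictured with the following
 Python sketch. It is an illustration under assumed names
 (TokenBuffer, ParsingStack) and is not the code of the
 implementation: the lexical phase saves every token and
 re-serves saved tokens on request, and the parsing stack keeps
 a pointer into the saved tokens beside each state.

```python
class TokenBuffer:
    """Lexical phase wrapper that saves each token it processes."""

    def __init__(self, source):
        self._source = iter(source)  # the underlying token stream
        self.saved = []              # tokenized input string
        self.next_index = 0          # index of the next token to serve

    def next_token(self):
        # Read from the source only if this token was not consumed before.
        if self.next_index == len(self.saved):
            self.saved.append(next(self._source))
        token = self.saved[self.next_index]
        self.next_index += 1
        return token


class ParsingStack:
    """Parsing stack with a parallel stack of input pointers."""

    def __init__(self):
        self.states = []
        self.token_ptrs = []

    def push(self, state, token_ptr):
        self.states.append(state)
        self.token_ptrs.append(token_ptr)

    def pop(self):
        return self.states.pop(), self.token_ptrs.pop()
```

 Resetting next_index to an earlier position lets the error
 routine re-read tokens that the parser has already consumed.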
 
 
 The Cost Vectors must be filled in and a cost limit 
 chosen. This can be done in a trivial amount of time, and 
 still provide good results. Our sample programs produced 
 good results and were run with a constant cost for all 
 insertions and deletions. More sophisticated costs should 
 not take too much longer to determine. For example,
 insertions and deletions of single character tokens could be 
 given a relatively lower cost, while assigning a relatively 
 high cost to the deletion of keywords of the language. 
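
 A possible way of filling in such cost vectors is sketched
 below in Python. The numeric costs, the function name, and the
 representation of the vectors as dictionaries are all
 assumptions chosen for illustration; only the keyword list is
 taken from the grammar in Appendix A.

```python
# Keywords of the Appendix A language.
KEYWORDS = {"GOTO", "READ", "WRITE", "IF", "THEN", "ELSE", "BEGIN", "END"}


def build_cost_vectors(terminals, base=2, cheap=1, keyword_delete=4):
    """Return (insert_cost, delete_cost) dictionaries keyed by terminal.

    Single character tokens get the relatively low `cheap` cost for
    both insertion and deletion; deleting a keyword gets the relatively
    high `keyword_delete` cost; everything else gets `base`.
    """
    insert_cost = {}
    delete_cost = {}
    for t in terminals:
        single_char = len(t) == 1
        insert_cost[t] = cheap if single_char else base
        if t in KEYWORDS:
            delete_cost[t] = keyword_delete
        else:
            delete_cost[t] = cheap if single_char else base
    return insert_cost, delete_cost


insert_cost, delete_cost = build_cost_vectors([";", "BEGIN", "<identifier>"])
```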
 
 
 7. CONCLUSIONS 
 
 The implementation of this error recovery method in an 
 actual compiler appears to be feasible. Most major
 programming languages are larger than the language [Appendix
 A] on which the method was tested. However, the language we
 implemented is large enough and in particular contains 
 enough similar constructs to risk some nondeterminism 
 problems. Thus, we do not believe that the performance 
 would be greatly affected by implementing a larger language. 
 
 The time required to correct each error is greater than
 is desirable. However, the additional time required can
 certainly be justified in an environment where clear,
 accurate error messages are essential (i.e., any environment
 where the programmer frequently has a minimal knowledge of
 the language). The experienced programmer would also
 benefit in that most errors are corrected, and the
 programmer does not have to search through many spurious
 error messages.
 
 For some languages, the average time to correct an 
 error might be significantly improved by always skipping the 
 input symbol on which an error is detected, and then setting
 up the possible parses for the immediately following symbol.
 For the case where the error is detected immediately, with a 
 symbol that should be deleted being the current input 
 symbol, this would save the time involved in setting up and 
 then discarding the first set of possible parses. 
 
 A slight improvement in the average time to correct an 
 error could be achieved by "fine tuning" the cost vectors 
 and the cost limit. This would allow erroneous possible 
 parses to be discarded more quickly.
 
 The biggest problem of this method is its space
 requirements. One possibility is to obtain additional space
 by not even attempting to execute a program with one or more
 syntax errors. If this were the policy, then the space
 normally taken up by semantic routines could be used for the
 error routine.
 
 However, even if the semantic routine space were to be 
 available to the error routine, some sort of restriction 
 would still be necessary on the levels of recursion allowed. 
 The memory cannot possibly be big enough to handle every 
 hypothetical case. However, those cases which result in a
 multitude of possible parses occur relatively
 infrequently. Allowing many levels of recursion is not
 really that advantageous. Restricting the algorithm to N
 levels of recursion would merely mean that only N and not
 N+1 additional errors in close proximity to the first
 error could be corrected. (Errors in close proximity are
 errors which are encountered before the first error is
 corrected.) Still, allowing a few levels of recursion is
 desirable in order to be able to correct a few errors in 
 close proximity. The solution to this problem appears to be 
 conditional recursion. An absolute limit on the number of 
 possible parses allowed could be set. This limit would be 
 dependent on the amount of memory available to the error 
 routine. Recursion could be allowed as long as the number 
 of possible parses was under the absolute limit. This would 
 be preferable to placing a strict limit on the number of 
 levels of recursion allowed. 
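
 The test for conditional recursion might look like the
 following Python sketch. It assumes that the possible parses
 of all active levels of recursion are held in memory at once,
 so that their counts add; the function and parameter names are
 hypothetical.

```python
def may_recurse(parses_in_use, parses_needed, parse_limit):
    """Allow another level of recursion only if the possible parses
    already in use, plus those the new level would set up, stay
    within the absolute limit."""
    return parses_in_use + parses_needed <= parse_limit
```

 For example, with an absolute limit of 300 possible parses, a
 first recursive call needing 164 parses on top of 82 already
 in use would be allowed, while a second call needing 328 more
 would be refused.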
 
 Being unable to execute an incorrect program is not a
 significant drawback to the error recovery method. 
 Executing incorrect programs might even tend to encourage 
 careless programming. In any event, a programmer can hardly 
 complain if an error recovery method detects and provides a 
 helpful error message for every error in a program, but then 
 fails to execute it. 
 
 This error recovery method does an excellent job of 
 providing a reasonable correction for most syntax errors in 
 a program. The error messages provided are very helpful and 
 even the most inexperienced programmer should not have any
 trouble interpreting them. One might even want to implement
 an LR parser with this error recovery method strictly as a
 syntax checker.
 
 
APPENDIX A

 <program>          => <declaration-list> . <statement-list> .

 <declaration-list> =>
                    => <declaration-list> <identifier>
                    => <declaration-list> <identifier> [ <digits> ]

 <statement-list>   =>
                    => <statement-list> <statement> ;

 <statement>        => GOTO <identifier>
                    => READ <input-list>
                    => WRITE <output-list>
                    => IF <boolean-expr> THEN <statement> ELSE <statement>
                    => <identifier> : <statement>
                    => BEGIN <statement-list> END
                    => <assignment>

 <input-list>       =>
                    => <input-list> <variable>

 <output-list>      =>
                    => <output-list> <variable>
                    => <output-list> <character>

 <boolean-expr>     => <expression> <relational-op> <expression>

 <relational-op>    => <
                    => =

 <assignment>       => <left-part> <assignment>
                    => <left-part> <expression>

 <left-part>        => <identifier> : =
                    => <subscripted-var> : =

 <subscripted-var>  => <identifier> [ <expression> ]
                    => <identifier> [ <assignment> ]

 <variable>         => <identifier>
                    => <subscripted-var>

 <character>        => ' <identifier>
                    => ' <digits>
                    => ' ,
                    => ' .

 <expression>       => <term>
                    => <expression> + <term>
                    => <expression> - <term>

 <term>             => <factor>
                    => <term> * <factor>
                    => <term> / <factor>

 <factor>           => <primary>
                    => <primary> ** <factor>

 <primary>          => <variable>
                    => <digits>
                    => <character>
                    => ( <expression> )
                    => ( <assignment> )
 
 
 BIBLIOGRAPHIC DATA SHEET

 Report No.: UIUCDCS-R-76-833

 Title and Subtitle: Syntactic Error Recovery for LR Parsers

 Report Date: 10-76

 Author(s): John A. Modry

 Performing Organization Name and Address:
 Department of Computer Science
 University of Illinois at Urbana-Champaign
 Urbana, Illinois 61801

 Contract/Grant No.: NSF DCR 72-03740

 Sponsoring Organization Name and Address:
 National Science Foundation
 Washington, D.C.

 Abstracts

 In this thesis, Section 3 describes the evolution from the original ideas to
 the method which was finally adopted. Section 4 consists of a detailed explanation
 of the error recovery scheme. Section 5 describes its implementation and discusses
 the results of some sample programs run on it. Section 6 evaluates the effectiveness
 of the error recovery method, gives an idea of its efficiency both in terms of
 memory and execution time, and discusses the ease with which it can be implemented.
 Section 7 suggests some improvements and restrictions that could be placed on the
 error recovery method in order to improve its efficiency, and presents some
 conclusions that can be reached about the effectiveness and practicality of the method.

 Key Words and Document Analysis. Descriptors

 Programming languages
 Error correction
 Automatic correction
 Parsing
 LR
 Syntax errors
 Compilers

 Security Class (This Report): UNCLASSIFIED
 Security Class (This Page): UNCLASSIFIED
 