UIUCDCS-R-76-833

SYNTACTIC ERROR RECOVERY FOR LR PARSERS

BY

JOHN ARTHUR MODRY

B.S., University of Illinois, 1974

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1976

Urbana, Illinois

October 1976

ACKNOWLEDGEMENT

I would like to thank my thesis advisor, Professor M. D. Mickunas, for both his technical help and his friendship. His suggestions throughout the course of this work proved to be invaluable and his enthusiasm was greatly appreciated. A special thanks is in order for the special effort he made to read the final drafts of this paper during his vacation.

I would also like to thank fellow graduate students John Bowman and Scott Fisher for their encouragement and advice on many subjects, both technical and otherwise, as well as all my friends who have made the last six years at the University of Illinois very enjoyable ones.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Survey of Previous Work
   1.2 Overview of Thesis
2. RHODES' METHOD
   2.1 General Description
   2.2 Condensation Phase
   2.3 Correction Phase
3. EVOLUTION OF THE ERROR RECOVERY METHOD
4. THE ERROR RECOVERY METHOD
   4.1 General Description
   4.2 Setup of Possible Parses
   4.3 Parallel Parsing
   4.4 Correction Phase
   4.5 Special Tables
   4.6 Modifications to LR Parser
5. IMPLEMENTATION AND SAMPLE PROGRAMS
6. EVALUATION OF THE METHOD
   6.1 Effectiveness
   6.2 Time Requirements
   6.3 Space Requirements
   6.4 Ease of Implementation
7. CONCLUSIONS
LIST OF REFERENCES
APPENDIX A

1. INTRODUCTION

There are few things more frustrating than spending a great deal of time debugging syntax errors in a program. Often, an error causes a compiler to make an incorrect assumption which leads to a confusing error message as well as to the generation of many additional error messages. The final recovery often necessitates the skipping of large portions of the input string. This means that any additional errors that were skipped over will go undetected until future runs of the program.

A good syntactic error recovery method should detect and give an accurate and meaningful error message for each error in a program. This means completely recovering, and resuming the parse at the point of each error, so as not to miss detecting any subsequent errors.

The advantages of a compiler having a good error recovery method are obvious. The disadvantages are that it may be either very costly to develop or very inefficient to use. An automatically generated error recovery method solves the first of these two problems.

1.1 Survey of Previous Work

The following is a brief survey of work done on automatically generated error recovery schemes. More detailed surveys are contained in LaFrance [8] and in Rhodes [12].

The simplest error recovery scheme is commonly referred to as "panic mode". Upon detecting an error, the parser enters its "panic mode" and discards symbols from the input string until one is encountered that belongs to a set of special symbols. The parser then backs up on the parsing stack to a point where that special symbol is a legal input symbol. This method does not qualify as a good error recovery scheme. It is an extremely fast method, but may skip over large portions of the input string, and its error messages generally consist only of an indication of the input symbol on which the error was detected.
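The panic-mode scheme described above can be illustrated with a minimal sketch. It is not taken from any of the surveyed systems. The function name `panic_mode_recover`, the set `sync_symbols` (the "special symbols"), and the map `legal` (from a parser state to the input symbols that are legal in that state) are all hypothetical; a real parser would consult its action table instead.

```python
def panic_mode_recover(stack, tokens, pos, sync_symbols, legal):
    """Sketch of panic-mode recovery (all names are illustrative assumptions).

    stack        -- list of parser states, top of stack last
    tokens       -- the input string as a list of symbols
    pos          -- index of the symbol on which the error was detected
    sync_symbols -- the set of special symbols to resynchronize on
    legal        -- dict: state -> set of symbols legal in that state

    Returns (new_stack, new_pos), or None if the input is exhausted.
    """
    while pos < len(tokens):
        symbol = tokens[pos]
        if symbol in sync_symbols:
            # Back up on the stack to the nearest state in which the
            # special symbol is a legal input symbol.
            for depth in range(len(stack), 0, -1):
                if symbol in legal.get(stack[depth - 1], ()):
                    return stack[:depth], pos
        # Otherwise discard the symbol and keep scanning.
        pos += 1
    return None


# Toy data: state 3 expects a ";", state 5 is somewhere inside an expression.
demo_legal = {0: {"READ", "WRITE", "id"}, 3: {";"}, 5: {"id", "("}}
demo_tokens = ["+", "*", "x", ";", "WRITE"]
result = panic_mode_recover([0, 3, 5], demo_tokens, 0, {";"}, demo_legal)
```

In the toy run, three symbols are discarded before the ";" is reached and one state is popped from the stack, which shows why the method is fast but can skip errors that lie inside the discarded input.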
In 1963, Irons [5] published an automatically generated error recovery method which he had developed and implemented for a non-backtrack top-down parsing algorithm.

In 1970, Leinius [9] described but did not implement a scheme which appears to be a more sophisticated version of the "panic mode" method. Leinius' method is based primarily on a simple precedence parsing algorithm [14], but he also discusses the application of his method to LR parsing algorithms [1]. In 1972, L. R. James [6] implemented Leinius' error recovery ideas on an LALR(k) parser.

In 1971, LaFrance [8] developed and implemented an automatic error recovery method as part of a translator writing system which produces a Floyd-Evans production parser. The LaFrance method produces good results, but it is restricted by a bounded look-ahead and by the fact that it does not attempt to modify the parsing stack.

Also in 1971, Levy [10] proposed but did not implement an automatic error recovery scheme. Levy's scheme has both unbounded look-ahead and the ability to modify the parsing stack. However, it may run into combinatorial problems, which makes its practicality questionable.

In 1972, Partridge [11] developed and implemented an automatic error recovery system that has the ability to collect statistics on the programs run through it and to modify itself accordingly in order to increase its efficiency. Partridge's method does entail a considerable amount of overhead, even while parsing correct portions of programs.

In 1971, Rhodes [12] developed a good automatic error recovery method for simple precedence parsers. Rhodes implemented his method for both full Pascal [13] and for a subset of Algol W. Rhodes' method is effective as well as being efficient to use. A paper by Graham and Rhodes [3] contains a brief overview of simple precedence parsing and error detection in simple precedence parsers, as well as a description of Rhodes' method.
1.2 Overview of Thesis

The motivation for this work came from reading of Rhodes' method in the paper by Graham and Rhodes [3]. The original idea was to develop an error recovery scheme that would extend Rhodes' ideas to LR parsers [7]. Note that the reader is presumed to have a knowledge of LR parsing techniques [1].

Section 2 contains a description of Rhodes' error recovery method. Rhodes' ideas proved to be difficult to apply directly to LR parsers and some of our original ideas changed during the course of this work. Section 3 describes the evolution from the original ideas to the method which was finally adopted. Section 4 consists of a detailed explanation of the error recovery scheme. Section 5 describes its implementation and discusses the results of some sample programs run on it. Section 6 evaluates the effectiveness of the error recovery method, gives an idea of its efficiency both in terms of memory and execution time, and discusses the ease with which it can be implemented. Section 7 suggests some improvements and restrictions that could be placed on the error recovery method in order to improve its efficiency, and presents some conclusions that can be reached about the effectiveness and practicality of the method.

2. RHODES' METHOD

2.1 General Description

Rhodes' method is an automatic recovery method without a fixed bound on look-ahead. It determines the most likely correction by using cost vectors, which indicate a cost for the insertion, deletion, and replacement of both terminals and nonterminals. The method consists basically of two parts. The first is a Condensation Phase, where an attempt is made to localize the occurrence of the error, and the second is a Correction Phase, where an attempt is made to determine what changes are necessary to correct the error.

2.2 Condensation Phase

A ?1 is used to mark the point of the error at the juncture of the parsing stack and the input string.
The Condensation Phase consists of both a backward move and a forward move. For the backward move, the ?1 is assumed to be a •> simple precedence relation [14], which means that the symbol on the top of the stack has greater precedence than the input symbol on which the error was detected. The backward move consists of making all possible reductions to the left of the ?1, as long as the result forms a prefix of a valid right-part (RP) of a production of the grammar and a precedence relation holds between the symbol to the left of the prospective RP and the corresponding left-part (LP).

For the forward move, the ?1 is assumed to be a <• simple precedence relation, which means the input symbol will be shifted onto the stack. The forward move consists of continuing the parse to the right of the ?1 until another error is detected (which is marked by a ?2 on the parsing stack). A second error will always be detected by the time the parser attempts to reduce over the ?1, since the ?1 cannot be contained in a valid RP. At this point the ?2 is assumed to be a •> simple precedence relation and all possible reductions are made using the symbols to the immediate left of the ?2 in the same manner as for the ?1 in the backward move.

This completes the Condensation Phase. At this point the error recovery method continues to the Correction Phase, assuming that the error has been localized to a section of the parsing stack bounded on the left by the first <• simple precedence relation to the left of ?1 and bounded on the right by the ?2.

2.3 Correction Phase

The Correction Phase assumes the error is contained in the localized area determined by the Condensation Phase and considers three possible substrings of that area as candidates for change.
                 left bound                         right bound
Candidate #1     The first <• to the left of ?1     ?1
Candidate #2     ?1                                 ?2
Candidate #3     The first <• to the left of ?1     ?2

The three possible substrings are pattern matched against the RP's of all productions of the grammar whose corresponding LP's have precedence relations with both the symbols to the left and to the right of the substrings. By using the costs obtained from the Insertion, Deletion, and Replacement Vectors, a cost is computed for each attempted pattern match, and the solution with the minimum cost is used.

3. EVOLUTION OF THE ERROR RECOVERY METHOD

The original idea was to develop an error recovery method that would extend Rhodes' ideas to LR parsers.

The problem with applying Rhodes' method directly to LR parsers is that it is not easy to determine the possible left and right ends of the RP as Rhodes has so neatly done with the precedence parser.

A set of possible right ends of the RP can be determined by continuing all possible parses in parallel from the point of error detection on (one possible parse for each state from which the error symbol may be legally read), until each possible parse either encounters another error or attempts to make a reduction which extends past the top of the stack, as it existed when the error was detected. (We call this "reducing over the error point".) Then for each possible parse, the right end of the RP is the point at which that parse attempted to reduce over the error point.

The problem then becomes choosing the most likely possible parse, which need not necessarily be among those indicated. If the input symbol at the error point is a completely erroneous symbol which should be deleted, then all of the indicated parses are incorrect and a new set of possible parses will have to be discovered.
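The notion of "reducing over the error point" introduced above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the implementation used in this work: the table layout (`action`, `goto`, `productions`) and the function name `run_possible_parse` are hypothetical. Each possible parse starts with a private stack holding a single assumed state, so a reduction whose RP is longer than the number of symbols pushed since the error point must reach to the left of the error point.

```python
def run_possible_parse(start_state, tokens, pos, action, goto, productions):
    """Run one possible parse from the error point (illustrative sketch).

    action      -- dict: (state, terminal) -> ("shift", state) or ("reduce", prod)
    goto        -- dict: (state, nonterminal) -> state
    productions -- dict: prod -> (left-part, length of right-part)

    Returns ("error", pos), ("reduce_over", pos, prod) or ("end_of_input", pos).
    """
    stack = [start_state]  # assumed state from which the error symbol is legal
    while pos < len(tokens):
        entry = action.get((stack[-1], tokens[pos]))
        if entry is None:
            return ("error", pos)  # this possible parse hit another error
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)
            pos += 1
        else:
            lhs, length = productions[arg]
            # Only len(stack) - 1 symbols lie to the right of the error point.
            if length >= len(stack):
                return ("reduce_over", pos, arg)
            del stack[len(stack) - length:]
            stack.append(goto[(stack[-1], lhs)])
    return ("end_of_input", pos)


# Toy tables: production 0 has a three-symbol right-part, production 1 has one.
demo_action = {
    (1, "="): ("shift", 2), (2, "id"): ("shift", 3), (3, ";"): ("reduce", 0),
    (4, "="): ("shift", 5),
    (6, "id"): ("shift", 7), (7, ";"): ("reduce", 1), (8, ";"): ("shift", 9),
}
demo_goto = {(6, "expr"): 8}
demo_productions = {0: ("stmt", 3), 1: ("expr", 1)}
demo_tokens = ["=", "id", ";"]

outcomes = [
    run_possible_parse(state, demo_tokens, 0, demo_action, demo_goto, demo_productions)
    for state in (1, 4)  # the states from which "=" may be legally read
]
```

In this toy run the parse started in state 1 attempts to reduce over the error point when it sees the ";", while the parse started in state 4 encounters another error on the "id" and would be discarded.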
The frequency of occurrence of an error due to an erroneous symbol that is detected immediately is affected by the similarity of the constructs of the language as well as by the programs being run. An idea of this frequency of occurrence would be useful in planning a strategy for determining the set of possible parses to start with. If erroneous symbols frequently occurred and usually were immediately detected, then it would be more efficient to skip the input symbol at the point of error detection and use the immediately following symbol to set up the possible parses.

This still leaves the problem of finding the left end of the RP for each possible parse. This problem could be deferred until the pattern match by using a right-biased right to left pattern match similar to the left-biased left to right pattern match used in Rhodes' method. For each possible parse, the target of the pattern match would be the RP of the production that the parser was attempting when it tried to reduce over the error point. The symbols implied by the states on the parsing stack of each possible parse would be matched against the corresponding target RP. However, merely matching a target RP without accumulating a cost too large to be acceptable would not guarantee success. A check for correct left context must also be made.

A problem is presented by the fact that the error may have caused a reduction that otherwise would not have occurred or the error may have prevented a reduction from occurring that otherwise would have. Making all possible reductions at the error point, regardless of the next symbol in the input string, would be the LR parser equivalent of the Condensation Phase in Rhodes' method, and it was thought that performing such reductions would provide some help. However, the real problem is in not knowing the exact location on the parsing stack of the left end of the RP being looked for.
In fact, an incorrect reduction may have previously occurred in the vicinity of the left end, and the exact left end may no longer exist on the parsing stack. Rhodes' method did not seem to be bothered by the possibility of reductions occurring that should not have or reductions not occurring that should have. For the LR parser though, this problem, combined with the inability to accurately determine the left end of the RP, would most likely cause much poorer results than Rhodes' method did for the precedence parser.

Whenever necessary, the pattern match could expand the nonterminals implied by the states on the parsing stack into their possible RP's. Also, if necessary, a series of symbols that form a RP could be condensed into their corresponding LP's.

However, expanding and condensing symbols while pattern matching leads to a very large number of possibilities. During the pattern match, each time a symbol did not match, a check would have to be made for all possible expansions of the symbol as well as all possible nonterminals which the symbol could be an expansion of.

Another problem to consider is that the possible expansions of nonterminals are only possibilities. Once a reduction into a nonterminal is made, there is no way of knowing which RP reduced into that nonterminal. Thus there is no way to be sure what the original input string looked like. This means the cost of deleting a particular nonterminal would have to be the same regardless of whether it originally consisted of a single identifier, or of a very large arithmetic expression. Another example of this problem would be in a language with the productions

    => :=
    => :=

where the symbol := is an assignment operator. In this case, deleting an identifier and an assignment operator (i.e., "A:=") which reduced into the nonterminal would have the same cost as deleting a nonterminal which originally consisted of multiple assignment statements (i.e., "A:=B:=C:=D:=E:=").
... is much more efficient than backing over each of the individual terminals of that statement.

Eventually the process will complete. If it ends by accumulating too great a cost, then it failed, and as a result the possible parse will be discarded. If it ends by breaking the barrier, then the possible parse has been corrected. Taking the corrections into consideration, the input string is reparsed beginning with the "true" leftstate closest to the barrier which has not seen the point in the input string where the earliest correction was made.

Although only insertions and deletions of terminals are considered at any point, the error messages may suggest changing one terminal to another. Each time an insertion is to be made for a possible parse, a check is made to see if the terminal is to be inserted next to a terminal which the Correction Phase has just deleted for the same possible parse. If so, the corrections are combined into one and the cost accumulated for that possible parse is changed to reflect only the maximum of the two individual costs.

4.5 Special Tables

The error recovery method requires three special tables other than the normal parsing tables required for an LR parser.

1.) The Legal State Table.
2.) The Predecessor States Table.
3.) The TEA States Table.

The Legal State Table is an indexed sequential table which is indexed by terminals of the language. For each terminal, the table contains all states from which the parser action with that terminal as an input symbol is a shift. The Legal State Table is used in setting up the possible parses.

The Predecessor States Table is really two indexed sequential tables which are indexed by the states of the LR parser. For each state, the first table contains the symbol that the state is enterable on, and the second table contains all states from which the state could be entered.
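As an illustration of how the Legal State Table and the Predecessor States Table could be derived from ordinary LR tables, here is a minimal sketch. The representation is an assumption made for the example only: `action` maps a (state, terminal) pair to ("shift", state) or ("reduce", production), and `goto` maps a (state, nonterminal) pair to a state. The function names are hypothetical, and the TEA States Table is not covered.

```python
from collections import defaultdict


def build_legal_state_table(action):
    """For each terminal, collect the states whose action on it is a shift."""
    table = defaultdict(set)
    for (state, terminal), (kind, _) in action.items():
        if kind == "shift":
            table[terminal].add(state)
    return dict(table)


def build_predecessor_tables(action, goto):
    """Return (entry_symbol, predecessors), both indexed by parser state.

    entry_symbol[s] -- the symbol on which state s is entered
    predecessors[s] -- the states from which state s can be entered
    """
    entry_symbol = {}
    predecessors = defaultdict(set)
    # Every transition of the parser is either a shift or a goto.
    transitions = [
        (state, symbol, target)
        for (state, symbol), (kind, target) in action.items()
        if kind == "shift"
    ]
    transitions += [(state, symbol, target) for (state, symbol), target in goto.items()]
    for state, symbol, target in transitions:
        entry_symbol[target] = symbol  # an LR state has a single entry symbol
        predecessors[target].add(state)
    return entry_symbol, dict(predecessors)


# Toy tables, invented for the example.
demo_action = {
    (0, "id"): ("shift", 1),
    (2, "id"): ("shift", 1),
    (1, ";"): ("reduce", 3),
    (3, ";"): ("shift", 2),
}
demo_goto = {(0, "stmt"): 3, (2, "stmt"): 3}

legal_states = build_legal_state_table(demo_action)
entry_symbol, predecessors = build_predecessor_tables(demo_action, demo_goto)
```

With these toy tables, `legal_states["id"]` holds states 0 and 2, which is the set from which possible parses would be set up if "id" were the error symbol, and `predecessors[3]` lists the states a backup over "stmt" could return to.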
The Predecessor States Table is used in backing up a set of states over the input string as well as in creating the TEA States Table.

The TEA States Table is an indexed sequential table which is indexed by the states of the LR parser. For each state, the table contains all TEA states. This table is also used in backing up a set of states over the input string.

All three of these tables are automatically generated directly from the tables of the LR parser.

4.6 Modifications to LR Parser

This error recovery method requires some modifications to the original LR parser.

The error recovery method depends on having access to the original input string in tokenized form. Therefore the lexical analysis phase must save each token that it processes. If the original input string were ordinarily available and there were a desire to reduce overhead for the parsing of completely correct programs, then the error routine could re-tokenize when it needed a symbol. However, for incorrect programs, it would be much more efficient to save the input string in tokenized form.

A problem is presented by the fact that the error recovery method can decide to back up and restart a set of possible parses. This requires the lexical analysis phase to check if the next token it is supposed to process has already been consumed. If it has already been consumed, then it can be found in the tokenized input string that was saved.

Another problem is found in trying to make a correspondence between the states on the parsing stack and the tokenized input string. Suppose the error recovery routine makes a change to the input string. The parse must be resumed at the point of the change. However, there is a problem in determining the state closest to the top of the parsing stack which has not seen the point in the input string where the change was made.
Without this correspondence between the parsing stack and the input string, the input string would have to be reparsed entirely after each correction. Therefore, each state that is pushed onto the stack is accompanied by a pointer into the tokenized input string.

5. IMPLEMENTATION AND SAMPLE PROGRAMS

The error recovery method was implemented with an LR parser for a language whose BNF is listed in Appendix A. The LR parser has 356 states. The implementation is written in Pascal [4,13] and was run on a DEC-10 timesharing system. The sample programs were all run with a constant cost of 10 for the insertion or deletion of any terminal in the language. The cost limit for each correction attempt of a possible parse was set at 31.

The remainder of this section consists of a discussion of the results from ten sample programs.

Sample program #1 has a missing statement terminator, which is solved by a simple insertion.

Sample program #2 also has a missing statement terminator, but it is in such a context that the error routine cannot determine which of two possible solutions is the most likely.

In sample program #3, the error routine provides a single solution with the option of inserting either of two symbols at a specific point in the input string.

Three reasonable solutions, including two not so obvious ones, are provided for sample program #4. This example demonstrates consecutive insertion and deletion corrections being combined into a single "change" correction.

In sample program #5, the error is detected immediately and consequently the possible parses are set up for a symbol that should be deleted. The error routine must discard all of the possible parses, delete the incorrect symbol, and set up another set of possible parses before finding a reasonable solution.

Sample program #6 contains an "if" statement that is missing the "IF", and the error routine correctly inserts it. This example demonstrates the advantages of an unbounded look-ahead scheme.
Sample program #7 contains an expression with three unmatched left parentheses and requires the error routine to call itself recursively in order to provide a solution.

In sample program #8, the restriction of not considering the insertion of multiple terminals as a "simple way" to break the barrier prevents the error routine from finding an obvious solution, but it continues and finds another equally acceptable solution.

In sample program #9, the same restriction prevents the error routine from finding any reasonable solutions. This example demonstrates the problems that arise when the error routine is unable to provide a solution. This is the only one of the ten sample programs for which the error routine is unable to provide at least one reasonable solution.

Finally, to end on a positive note, sample program #10 is presented, and it demonstrates the error routine providing several good solutions.

Sample Program #1

    A B C[20].
    READ A B C[20]
    WRITE A B;
    WRITE C[20];
    .

The configuration of the parser at the point of error detection is as follows:

    Symbols Implied By States On Parsing Stack:   . READ [ ] ?1
    Input Symbol:                                 "WRITE" in line #3

The error is detected at this point because the symbol "WRITE" is not a legal right context to make the reduction => [ ]. The possible parses are set up for the symbol "WRITE". The error recovery method supplies one solution.

The possible parses which yield solution #1 are in the following configuration when they attempt to reduce over the error point.

    Symbols Implied By States On Parsing Stack:   . READ [ ] ?1 ;
    Input Symbol:                                 "WRITE" in line #4

The attempted reduction is: => ;

The Correction Phase immediately finds that the insertion of a ";" will break the ?1 barrier and yields the solution:

    INSERT ";" AFTER "]" WHICH IS TOKEN #7 IN LINE #2

Sample Program #2

    A X Y.
    READ A
    X := Y;
    .
The configuration of the parser at the point of error detection is as follows:

    Symbols Implied By States On Parsing Stack:   . READ ?1
    Input Symbol:                                 ":"

The error is detected at this point because, with this stack configuration, the symbol ":" is not a legal right context to make the reduction => . (Note that this is an LR, not an SLR [2] parser.) The possible parses are set up for the symbol ":". The error recovery method supplies two solutions.

The possible parse which yields solution #1 is in the following configuration when it attempts to reduce over the error point.

    Symbols Implied By States On Parsing Stack:   . READ ?1 : =
    Input Symbol:                                 "Y" in line #3

The attempted reduction is: => : =

The Correction Phase finds no "simple way" to break the original ?1 barrier. It successfully backs the barrier over the symbol "X", leaving the possible parse in the following configuration:

    Symbols Implied By States On Parsing Stack:   . READ ?1 : =

It then finds that the insertion of a ";" will break the barrier and yields the solution:

    INSERT ";" AFTER "A" WHICH IS TOKEN #2 IN LINE #2

The possible parse which yields solution #2 is in the following configuration when it attempts to reduce over the error point.

    Symbols Implied By States On Parsing Stack:   . READ ?1 : =
    Input Symbol:                                 "Y" in line #3

The attempted reduction is: => : =

The Correction Phase finds no "simple way" to break the original ?1 barrier. It successfully backs the barrier over the symbol "X". It then finds that the insertion of a "[" will break the barrier and yields the solution:

    INSERT "[" AFTER "A" WHICH IS TOKEN #2 IN LINE #2

Both solutions have the same cost attributed to them and the error recovery method arbitrarily chooses solution #1.

For this sample program, the error recovery method provides one correct solution and one incorrect solution.
Since both solutions have the same cost attributed to them, it chooses the first one, which happens to be the correct solution. This is a case where the error recovery method almost gets fooled by a language construct being legal in more than one context. According to the language description [Appendix A], an assignment statement can appear as the subscript of an array. The second solution is based on the assumption that the "X:=Y" is a subscript, and that the "read" statement was intended to be "READ A[X:=Y]". If there is a "]" between the "Y" and the ";", then solution #2 is the most likely. If not, then solution #1 is the most likely.

The problem is that the error recovery routine must make a decision without knowing what symbols are to the right of the "Y". This is a case where our method of parsing until an attempt is made to reduce over the error point does not provide sufficient look-ahead to decide between the possibilities. Luckily, it chooses the correct solution. If it chose the incorrect one, it would insert the "[" after the "A", return to the regular parser, continue the parse until discovering that the corresponding "]" was missing, and call the error routine again, at which point the "]" would be inserted.

If the original program was

    A X Y.
    READ A
    X:=Y];
    .

then the error routine would insert a ";" after the "A" in line #2 and return to the regular parser. The regular parser would encounter another error and call the error routine with the "]" as the input symbol. The error routine would set up a set of possible parses for the symbol "]", and proceed to find the solution of "changing" the ";" (that it just inserted) to a "[".

Sample Program #3

    A[20] B X.
    X := A[ B:= ];
    WRITE X;
    .

The configuration of the parser at the point of error detection is as follows:

    Symbols Implied By States On Parsing Stack:   . [ : = ?1
    Input Symbol:                                 "]" in line #2

The error is detected at this point because the symbol "]" is not a legal right context to make the reduction => : =. The possible parses are set up for the symbol "]". The error recovery method supplies one solution that contains an option of two symbols.

The possible parses which yield solution #1 are in the following configuration when they attempt to reduce over the error point.

    Symbols Implied By States On Parsing Stack:   . [ : = ?1 ]
    Input Symbol:                                 ";" in line #2

The attempted reduction is: => [ ]

The Correction Phase immediately finds that the insertion of either an identifier or a string of digits will break the ?1 barrier and yields the solution:

    INSERT "identifier" or "digits" AFTER "=" WHICH IS TOKEN #8 IN LINE #2

Sample Program #4

    A B.
    TOGO A;
    READ B;
    .

The configuration of the parser at the point of error detection is as follows:

    Symbols Implied By States On Parsing Stack:   . ?1
    Input Symbol:                                 "A" in line #2

The error is detected at this point because a statement cannot start with two consecutive identifiers. The possible parses are set up for the identifier "A". The error recovery method supplies three solutions.

The possible parse which yields solution #1 is in the following configuration when it attempts to reduce over the error point.

    Symbols Implied By States On Parsing Stack:   . ?1
    Input Symbol:                                 ";" in line #2

The attempted reduction is: => GOTO

The Correction Phase finds no "simple way" to break the ?1 barrier. It attempts to back the barrier over the symbol "TOGO", but to do so must delete the "TOGO". It then finds that the insertion of the keyword "GOTO" will break the barrier. The consecutive insertion and deletion are combined into a change and the solution emitted is:

    CHANGE "TOGO" WHICH IS TOKEN #1 IN LINE #2 TO "GOTO"

The possible parse which yields solution #2 is in the following configuration when it attempts to reduce over the error point.
    Symbols Implied By States On Parsing Stack:   . ?1
    Input Symbol:                                 ";" in line #2

The attempted reduction is: =>

The Correction Phase finds no "simple way" to break the original barrier. It successfully backs the barrier over the symbol "TOGO". It then finds that the insertion of the keyword "READ" will break the barrier and yields the solution:

    INSERT "READ" AFTER "." WHICH IS TOKEN #3 IN LINE #1

The possible parse which yields solution #3 is in the following configuration when it attempts to reduce over the error point.

    Symbols Implied By States On Parsing Stack:   . ?1
    Input Symbol:                                 ";" in line #2

The attempted reduction is: =>

The Correction Phase finds no "simple way" to break the ?1 barrier. It successfully backs the barrier over the symbol "TOGO". It then finds that the insertion of the keyword "WRITE" will break the barrier and yields the solution:

    INSERT "WRITE" AFTER "." WHICH IS TOKEN #3 IN LINE #1

Solution #1 makes an insertion and a deletion, but they are combined into a single change at the cost of the maximum of the costs of the two individual corrections. Since all of the sample programs were run with constant insertion and deletion costs for every terminal, all three possible solutions have the same cost attributed to them. The error recovery method arbitrarily chooses solution #1. Note that the solutions provided have nothing to do with the similarity between the spelling of the symbols "TOGO" and "GOTO". The same solutions will be provided if "TOGO" is spelled "WXYZ" or "PEED".

Sample Program #5

    X.
    X ;= 2;
    WRITE X;
    .

The configuration of the parser at the point of error detection is as follows:

    Symbols Implied By States On Parsing Stack:   . ?1
    Input Symbol:                                 ";" token #2 in line #2

The error is detected at this point because a statement cannot start with an identifier followed by a ";". The possible parses are set up for the symbol ";". However, all of the possible parses encounter an error on the very next symbol ("=").
The ";" is deleted, the possible parses are set up again, and the error routine is restarted. The error recovery method supplies one solution.

The possible parse which yields solution #1 is in the following configuration when it attempts to reduce over the error point.

    Symbols Implied By States On Parsing Stack:   . ?1 =

The attempted reduction is: => : =

The Correction Phase immediately finds that the insertion of a ":" will break the ?1 barrier and yields the solution:

    DELETE ";" WHICH IS TOKEN #2 IN LINE #2
    INSERT ":" AFTER "X" WHICH IS TOKEN #1 IN LINE #2

In this sample program, the consecutive insertion and deletion are not combined into a single change. This is because the "DELETE ;" message originates from the discarding of the first set of possible parses and consequently its cost is not represented in the total cost of the possible parse for which the final correction is found.

Sample Program #6

    X Y Z.
    X=Y THEN Z:= ELSE Z:=Z*1;
    WRITE Z;
    .

The configuration of the parser at the point of error detection is as follows:

    Symbols Implied By States On Parsing Stack:   . ?1
    Input Symbol:                                 "=" token #2 in line #2

The error is detected at this point because a statement cannot start with an identifier followed by an "=". The possible parses are set up for the symbol "=". The error recovery method supplies one solution.

The possible parse which yields solution #1 is in the following configuration when it attempts to reduce over the error point.

    Symbols Implied By States On Parsing Stack:   . ?1
    Input Symbol:                                 "THEN"

The attempted reduction is: =>

The Correction Phase finds no "simple way" to break the ?1 barrier. It successfully backs the barrier over the symbol "X". It then finds that the insertion of the keyword "IF" will break the barrier and yields the solution:

    INSERT "IF" AFTER "." WHICH IS TOKEN #4 IN LINE #1
This sample program is an example of why unbounded look-ahead is so important and why this method continues parallel parsing until every possible parse has either been discarded or attempted to reduce over the error point, regardless of how many possible solutions have been found. In this case, there is another possible parse which thinks it is parsing an assignment statement. It attempts to reduce over the error point (using the production => : =) immediately after shifting the "=", and finds the first solution of inserting a ":" after the "X" in line #2. Meanwhile the possible parse which eventually provides the correct solution thinks it is parsing an "if" statement, but has not yet attempted to reduce over the error point. If the statement was intended to be an assignment statement, then the first solution is correct. If the statement was intended to be an "if" statement, as is apparently the case here, then the first solution is wrong. There is no way of knowing which is the case until the symbol immediately after the expression to the right of the "=" is known. In this case it is a "THEN" and the first solution is wrong. Since another possible parse is still parsing, the possible parse of the first solution must continue also. Upon seeing the "THEN", the possible parse of the first solution encounters an error and is discarded. At this point, the single remaining possible parse attempts to reduce over the error and provides the final solution.

The importance of unbounded look-ahead is demonstrated by the fact that the expression to the right of the "=" (in this case a single "Y") can be of any length. If the look-ahead is bounded, and the length of the expression is greater than the bound, then the symbol following the expression will not be known and no error recovery method can determine which statement was most likely intended by the programmer.

Sample Program #7

X Y Z.
X := (( Y + ( Z*5; WRITE X; .
The configuration of the parser at the point of error detection is as follows:

Symbols Implied By States On Parsing Stack: . ( ( + ( * ?1
Input Symbol: ";" in line #2

The error is detected at this point because, with this stack configuration, the symbol ";" is not a legal right context to make the reduction => . The possible parses are set up for the symbol ";". The error recovery method supplies one solution.

The possible parse which yields solution #1 is in the following configuration when it attempts to reduce over the error point.

Symbols Implied By States On Parsing Stack: <declaration-list> . ( ( + ( * ?1 ;
Input Symbol: "WRITE"

The attempted reduction is: => ;

The Correction Phase finds that the insertion of a ")" will break the barrier. In this case, though, there are three unmatched left parentheses, and inserting a single right parenthesis does not provide a correct solution. This is a case where the Correction Phase is temporarily fooled. The Correction Phase inserts the ")" and attempts to reparse back up to the symbol which was the input symbol when the attempt was made to reduce over the error point. However, the possible parse encounters an error before reparsing to that symbol (the ";" in line #2). The possible parse is temporarily discarded. However, no other possible parse either "shifts" on the ";" or attempts to reduce over the error point and provides a solution which allows reparsing any further than the discarded one did. Therefore, the possible parses discarded in this way (two in this example) are reinstated and the error routine is called recursively from the point at which they encountered the error during the reparse attempt. The process then repeats itself, with the Correction Phase breaking the barrier by inserting a ")", but then encountering an error on the reparse.
Again the error routine is called recursively, and again the Correction Phase breaks the barrier by inserting a ")". This time the reparse finally succeeds and the error routine yields the solution:

INSERT ")" AFTER "5" WHICH IS TOKEN #11 IN LINE #2
INSERT ")" AFTER "5" WHICH IS TOKEN #11 IN LINE #2
INSERT ")" AFTER "5" WHICH IS TOKEN #11 IN LINE #2

This sample program demonstrates that the error recovery method can handle any number of unmatched left parentheses, as long as a restriction is not placed on the number of levels of recursion allowed.

This error recovery method can also handle any number of unmatched right parentheses. This is much less complicated, since it does not require any recursion. The unmatched right parentheses are encountered one at a time, and as each one is encountered, the corresponding left parenthesis is inserted.

Sample Program #8

A X.
BEGIN
READ A;
X := 2*A;
WRITE X;
.

The configuration of the parser at the point of error detection is as follows:

Symbols Implied By States On Parsing Stack: . BEGIN ; ?1
Input Symbol: "."

The error is detected at this point because, with this stack configuration, the symbol "." is not a legal right context to make the reduction => ;. The possible parses are set up for the symbol ".". The error recovery method supplies one solution.

The possible parse which yields solution #1 is in the following configuration when it attempts to reduce over the error point.

Symbols Implied By States On Parsing Stack: . BEGIN ; ?1 .
Input Symbol: end-of-file

The attempted reduction is: => . .

The Correction Phase finds no "simple way" to break the ?1 barrier. It then successfully backs the barrier over the ";" in line #5. It again finds no "simple way" to break the barrier. This time it finds that the "true" leftstate and one of the "true" rightstates are both enterable on the nonterminal . The barrier is successfully backed over the nonterminal (which originally consisted of "WRITE X").
There is still no "simple way" to break the barrier, and the barrier is again backed up. This time the "true" leftstate and one of the "true" rightstates are both enterable on the nonterminal . The barrier is successfully backed over the nonterminal (which originally consisted of "READ A; X := 2*A;"). Once again there is no "simple way" to break the barrier. It then attempts to back the barrier over the symbol "BEGIN", but to do so must delete the "BEGIN". At this point, the Correction Phase realizes that deleting the "BEGIN" has broken the barrier, and it yields the solution:

DELETE "BEGIN" WHICH IS TOKEN #1 IN LINE #2

The reason the Correction Phase does not insert an "END" instead of deleting the "BEGIN" is that, according to the syntax of the language [Appendix A], it needs to insert "END ;". Since the method was implemented so that it considers a "simple way" to break the barrier as being the insertion of a single terminal, it cannot insert "END ;" and must continue backing up until it deletes the "BEGIN".

Sample Program #9

X.
X < 2*3; .

The configuration of the parser at the point of error detection is as follows:

Symbols Implied By States On Parsing Stack: . ?1
Input Symbol: "<"

The error is detected at this point because a statement cannot start with an identifier followed by the symbol "<". The possible parses are set up for the symbol "<". The error recovery method does not find any solutions.

There is originally only one possible parse, since there is only one state from which it is legal to "shift" on the input symbol "<". That possible parse believes it is parsing the Boolean expression in an "if" statement, and encounters an error with the ";" being an illegal right context to make the reduction => . At this point the possible parse is in the following configuration.

Symbols Implied By States On Parsing Stack: . ?1 * ?2
Input Symbol: ";"
At this point, the error routine calls itself recursively and a new set of possible parses is set up for the symbol ";". All of the possible parses attempt to reduce over the error point immediately after "shifting" the ";". None can successfully break the barrier, and the ";" is deleted. The error recovery routine tries again, this time setting up the possible parses for the input symbol ".". All of the possible parses attempt to reduce over the error point immediately after shifting the ".". Again, none can successfully break the barrier, and the "." is deleted. At this point, the error recovery routine realizes that it would be pretty silly to set up possible parses for the end-of-file condition, so it admits defeat and indicates that it has found zero solutions.

If the method is implemented to consider a "simple way" to break the barrier as being the insertion of two terminals instead of only one, then a solution will be found. While the possible parses are set up for the ";", the Correction Phase will discover the solution of deleting the "<" and inserting a ":" followed by an "=".

The reason this sample program is presented here is to demonstrate what happens when the error recovery method cannot find a solution. It should be expected that the error recovery method will not always be able to provide a solution. If the method allows the insertion of N terminals, there could always be a case requiring the insertion of N+1 terminals.

The problem with the method, as it is described in Section 4, is that it does not know when to "give up". In this sample program, there are no more statements following the erroneous one. However, if there are additional statements, the error routine will parse through them all, still attempting to correct the first error. Suppose the erroneous statement is followed by several completely correct statements, each correctly terminated by a ";".
For each one, the error routine sets up possible parses for the first symbol of the statement, parses the statement until it reduces into , "shifts" the terminating ";", and attempts the reduction => ;. For all of the possible parses, this reduction means reducing over the first error point. The Correction Phase fails as it did previously and consequently the input symbol which was assumed correct in setting up the possible parses is deleted. (The symbol deleted is the first symbol of the correct statement.) The possible parses are then set up for the next input symbol (the second symbol of the correct statement) and the process continues.

In summary, each correct statement is parsed correctly, but is unable to reduce into because of the previous erroneous statement, for which the Correction Phase could not find a solution. Each correct statement is then deleted symbol by symbol. Occasionally, the remaining portion of a statement will resemble another language construct and the error routine will call itself recursively before resuming its pattern of deleting symbols.

Clearly, these spurious error messages throughout the remainder of the program are not acceptable. What is needed is a way for the error routine to know when a solution cannot be found, so that it can enter "panic mode" and "clean up" the stack before continuing. This error recovery method should be implemented with a "panic mode" which keys on one or more special symbols of the language. For this language [Appendix A], a single special symbol of ";" should be sufficient. The "panic mode" should be entered whenever the error routine deletes a ";" as a result of discarding all possible parses that were originally set up for that ";". This is preferable to entering "panic mode" each time a ";" is deleted.

Sample Program #10

A B X Y.
X := Y:
A := B;
WRITE X A;
.

The configuration of the parser at the point of error detection is as follows:

Symbols Implied By States On Parsing Stack: .
: ?1
Input Symbol: "A" in line #3

The error is detected at this point because "X:=Y:" cannot be followed by any symbol but an "=". The possible parses are set up for the identifier "A". The error recovery method yields three solutions.

The possible parse which yields solution #1 is in the following configuration when it attempts to reduce over the error point.

Symbols Implied By States On Parsing Stack: . : ?1
Input Symbol: ";" in line #3

The attempted reduction is: => :

The Correction Phase finds no "simple way" to break the ?1 barrier. It then successfully backs the barrier over the ":" (token #5 in line #2). It again finds no "simple way" to break the barrier and it successfully backs the barrier over the "Y" (token #4 in line #2). There is still no "simple way" to break the barrier. It attempts to back the barrier over the "=" (token #3 in line #2), but to do so must delete that "=". At this point, the Correction Phase realizes that deleting the "=" has broken the barrier, and it yields the solution:

DELETE "=" WHICH IS TOKEN #3 IN LINE #2

The possible parse which yields solution #2 is in the following configuration when it attempts to reduce over the error point.

Symbols Implied By States On Parsing Stack: . : ?1
Input Symbol: in line #3

The attempted reduction is: =>

The Correction Phase immediately finds that the insertion of an "=" will break the ?1 barrier and yields the solution:

INSERT "=" AFTER ":" WHICH IS TOKEN #5 IN LINE #2

The possible parse which yields solution #3 is in the following configuration when it attempts to reduce over the error point.

Symbols Implied By States On Parsing Stack: . : ?1 ;
Input Symbol: "WRITE"

The attempted reduction is: => ;

The Correction Phase finds no "simple way" to break the ?1 barrier. It attempts to back the barrier over the symbol ":" (token #5 in line #2), but to do so must delete that ":". It then finds that the insertion of a ";" will break the barrier.
The consecutive insertion and deletion are combined into a change and the solution emitted is:

CHANGE ":" WHICH IS TOKEN #5 IN LINE #2 TO ";"

All three possible solutions have the same cost attributed to them and the error recovery routine arbitrarily chooses solution #1.

6. EVALUATION OF THE METHOD

6.1 Effectiveness

This recovery method meets all of the criteria necessary for an effective error recovery method. Upon correcting an error, it resumes parsing smoothly and does not skip over any of the input string. If a second error is encountered while attempting to correct the original error, the error routine calls itself recursively and independently corrects the second error before resuming the correction attempt on the first error. This allows the error recovery method to handle several errors in close proximity.

The recovery method has the ability to make corrections at any point in the input string, even if the corrections require modifying the parsing stack. It uses an unbounded look-ahead scheme. The advantages of an unbounded look-ahead scheme are demonstrated in Sample Program #6.

The recovery method supplies extremely helpful error messages. The messages explicitly state how the program should be changed in order to make it syntactically correct.

A disadvantage is that the method, as implemented, only considers single insertions. There are cases where this is insufficient. Of course, the method could be extended to consider multiple insertions, but this would be less efficient time-wise and it is not clear that it would always be more effective overall. For instance, the Correction Phase might provide a solution by inserting two terminals and then stop short of finding a less costly solution at an earlier point in the input string. Regardless of how many insertions are allowed in a correction attempt, there will be times when the error routine will fail to find a solution.
When this occurs, the error routine must enter "panic mode" and "clean up" the stack in order to resume the parse smoothly. This is explained in more detail in the discussion following Sample Program #9 in Section 5.

This error recovery method can produce some unexpected results. For instance, it might provide a single solution when there are several equally obvious solutions that it did not find. At other times, it will provide all reasonable solutions, including some that are not so obvious. This is because the Correction Phase returns the first acceptable solution that it finds for each possible parse. Multiple solutions only occur when more than one possible parse yields an acceptable solution. Overall, this error recovery method is very effective. It usually provides at least one, and often provides several, reasonable solutions.

6.2 Time Requirements

The following are rough estimates of the actual DEC-10 CPU time required to execute some of the sample programs. They must be considered as rough estimates only, since the CPU time recorded varied up to 10% for identical sample programs running in the same amount of core. The LR parser itself has an overhead of approximately 4.75 seconds, which is due to it reading in the parsing tables. The additional time required for the LR parser to parse correct versions of the sample programs is negligible relative to the parser overhead.

For the first error encountered, the error recovery routine has an overhead of approximately 5.5 seconds. This is due to it reading in the special tables that the error routine needs.

The time to correct each error varies depending on the amount of core in which the program is executing. The sample programs were always executed in 80K of virtual memory. The DEC-10 CPU time was recorded both while executing the programs in 30K of real memory and in 60K of real memory.

The number of possible parses that are set up is a big factor in the time required to correct an error.
Even though most possible parses are discarded fairly quickly, just setting them up takes a significant amount of time. Another big factor is the number of possible parses which are discarded before attempting to reduce over the error point and consequently never enter the Correction Phase.

While executing in 30K of real memory, the missing ";" in sample program #1 takes 2.8 seconds to correct. (8 possible parses are set up for the input symbol "WRITE".) When the ";" is missing from the last statement instead of the first statement, it only takes 2.3 seconds to correct. (9 possible parses are set up for the input symbol ".".) Though the number of possible parses that are set up is approximately the same, the latter correction is quicker, since only one possible parse (as opposed to three for the former correction) attempts to reduce over the error point. In sample program #4, the error is detected with an identifier as the input symbol and consequently 32 possible parses are set up. The error routine requires .1.9 seconds to provide the solutions for sample program #4. Recursion is also a big factor in the time required, since a new set of possible parses is set up for each level of recursion and the number of possible parses is usually increasing with each successive level of recursion. In sample program #7, the three missing right parentheses take a total of 26 seconds to correct.

While executing in 60K of real memory, the results are significantly better. Sample program #1 takes 1.2 seconds to correct the missing ";". When the ";" is missing from the last statement instead of the first statement, it only takes 0.9 seconds to correct. Sample program #4, with the 82 original possible parses, takes 2.9 seconds to correct, and the three missing right parentheses in sample program #7 take a total of 17 seconds to correct.

In summary, while executing in 30K of real memory, the error routine takes anywhere from 2 seconds to 10 seconds to correct a single error.
(The three missing right parentheses are considered three errors.) While executing in 60K of real memory, the error routine takes anywhere from less than 1 second up to 6 seconds to correct a single error. Remember that these figures should only be considered as rough estimates, and that they are based on DEC-10 CPU time. A significantly faster machine would require significantly less time.

6.3 Space Requirements

An LR parser implemented with this error recovery method requires considerably more space than the same LR parser without any error recovery method. There are four main contributors to the additional space requirements, and they will be discussed in the next several paragraphs. The "big four" are the error routine code, the tokenized input string, the special tables, and the error routine's main data structures. There are many other data structures required by the error routine, but the sum of their space requirements is negligible when compared to the requirements of the "big four".

The space required by the error routine code is approximately 10K.

The input string which is saved in tokenized form can also require a large amount of space. That space is simply the largest program (in terms of number of tokens) which can be run on the system.

The Legal State Table in our implementation is an array of length 518 and is indexed by an array of length 27 (one for each terminal). The Predecessor States Table is an array of length 984 indexed by an array of length 156 (one for each state). The Predecessor States Table also includes another array of length 356 to hold the symbol each state is enterable on. The TEH States Table is an array of length 17ft

=> . . => => => [ ] => => ; => GOTO => READ => WRITE => IF <boolean-expr> THEN ELSE => : => BEGIN END => => => => => => => => < => = => => => : = => : = => [ ] => [ ] => => => * => ■ => ' , => ' .
=> => => - => => * => / => => 4* => => => => ( ) => ( )

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-76-833
Title and Subtitle: Syntactic Error Recovery for LR Parsers
Report Date: 10-76
Author(s): John A. Modry
Performing Organization Name and Address: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Contract/Grant No.: NSF DCR 72-03740
Sponsoring Organization Name and Address: National Science Foundation, Washington, D.C.

Abstracts: In this thesis, Section 3 describes the evolution from the original ideas to the method which was finally adopted. Section 4 consists of a detailed explanation of the error recovery scheme. Section 5 describes its implementation and discusses the results of some sample programs run on it. Section 6 evaluates the effectiveness of the error recovery method, gives an idea of its efficiency both in terms of memory and execution time, and discusses the ease with which it can be implemented. Section 7 suggests some improvements and restrictions that could be placed on the error recovery method in order to improve its efficiency, and presents some conclusions that can be reached about the effectiveness and practicality of the method.

Key Words and Document Analysis. Descriptors: Programming languages Error correction Automatic correction Parsing LR syntax errors compilers

Security Class (This Report): UNCLASSIFIED
Security Class (This Page): UNCLASSIFIED