The GNAT scanner implements some basic error recovery techniques which simplify the implementation of the parser. The GNAT parser has what is considered to be the best set of error recovery strategies of any Ada compiler in use. In this chapter we briefly present these strategies. It is structured as follows: Section 3.1 introduces the error recovery techniques implemented in the scanner, and Section 3.2 presents the mechanisms used to resynchronize the parser. Section 3.2.1 presents the parser scope-stack which is used to handle nested scopes parsing; Section 3.1.1 discusses the use of the programmer casing convention to distinguish keywords from user-defined identifiers, and finally Sections 3.2.3 and 3.2.4 discuss the special handling of the keyword 'is' used in place of semicolon.
The GNAT scanner implements a simple error recovery technique which simplifies error handing to the parser. After detecting a lexical error, the scanner posts the corresponding error message and returns one heuristic Token which masks the error to the next phases of the compiler. For example:
&&'' (which is the syntax of the C and operator), it assumes that the programmer intention was to use the Ada and operator. Thus it posts the precise error message ``&& should be 'and then' '' and returns the ``and then'' token to the Parser. The scanner has many error messages specific for C programmers.
In most cases this simple but powerful mechanism helps masking lexical errors to the parser. This simplifies the implementation of the parser, which does not need to repeatedly handle them in many contexts. For additional details, read the Scan, Nlit, and Slit subprograms in package Scn.
The GNAT scanner assumes that the user has some consistent policy regarding the casing convention used to distinguish keywords from user-defined entities. The scanner deduces this convention from the first keyword and identifier that it encounters (cf. Package Casing). In the following example GNAT makes use of the upper/lower case rule for identifiers to treat the word exception as an indented identifier rather than the beginning of an exception handler (cf. subprogram scan_reserved_identifier).
The GNAT parser includes a sophisticated error recovery system which, among other things, takes indentation into account when attempting to correct scope errors. When an error is encountered, a call is made to one parser routine to record the error (cf. Package Errout). If the parser can recover locally, it masks the failure to the next stages of the front-end by generating the AST nodes as if the syntax were right, and the parsing continues unimpeded. On the other hand, if the error represents a situation from which the parser cannot recover locally, the exception Error_Resync) is raised after the call to the routine that records the error. Exception handlers are located at strategic points to resynchronize the parser. For example, when an error occurs in a statement, the handler skips to the next semicolon and continues the scan from there.
In the GNAT sources, each parsing routine has a note with the heading ``Error recovery'' which shows if it can propagate the Error_Resync exception (cf. Files par-ch2.adb to par-ch13.adb). To avoid propagating the exception, a procedure must either contain its own handler for this exception, or it must not call any other routines which propagate the exception.
Many rules of the language define a syntax scope (rules ended with 'end'). For example, the syntax rules of packages, subprograms and all the flow-of-control statements. The GNAT parser uses a scope-stack to record the scope context. An entry is made when the parser encounters the opening of a nested construct, and then package Endh uses this stack to deal with 'end' lines (including properly dealing with 'end' nesting errors).
In case of parsing resynchronization by means of the exception mechanism (cf. Section 3.2), the arrangement of the exception handlers is such that it should never be possible to transfer control through a procedure which made an entry in the scope stack, invalidating the contents of the stack.
In some cases, at the end of a syntax scope, the programmer is allowed to specify a name (ie. at the end of a subprogram body); other scope rules have a rigid format (ie. at the end of a record-type definition). In the first case, it is a semantic error to open a syntax scope with a name and to close it with a different name. Although many Ada compilers detect this error in the phase of semantic analysis, GNAT uses the parser scope-stack to detect it as soon as possible and thus simplify the semantics.
The next example combines the use of the scope-stack plus the indentation to match the statement:
Note that a more conventional approach to error recovery would have produced two errors: it would have identified the end with the if-statement, complained about a missing ``if'', and then complained about the missing end for the procedure itself.
The two contexts in which a semicolon may have been erroneously used in place of 'is' are at the outer level, and within a declarative region. The first case corresponds to the following example:
In this case the GNAT parser knows that something is wrong as soon as it encounters Q (or 'begin', if there are no declarations), and it can immediately diagnose that the semicolon should have been 'is'. The situation in a declarative region is more complex, and corresponds to the following example:
In this case, the syntax error (line
<1>) has the syntax of a
[AAR95, Section 6-1].
Therefore, the declaration of Q is read as belonging to the outer
region. The parser does not know that it was an error until it
encounters the 'begin' (line
<2>). It is still not clear at
this point from a syntactic point of view that something is wrong,
because the 'begin' could belong to the enclosing syntax
scope. However, the GNAT parser incorporates a bit of semantic
knowledge and note that the body of X is missing, so it diagnoses the
error as semicolon in place of 'is' on the subprogram line. To control
this analysis, some global variables with prefix ``SIS_'' are used to
indicate that we have a subprogram declaration whose body is required
and has not yet been found. For example, the variable
indicates that a subprogram declaration has been encountered, but no
body for this subprogram has been encountered yet. The prefix stands
for ``Subprogram IS'' handling. Five things can happen to an active
This is a somewhat trickier situation, and although the GNAT parser can not catch it in all cases, it does its best to detect common situations resulting from a ``cut and paste'' operation which forgets to change the 'is' to semicolon. Let us consider the following example:
The trouble is that the section of text from line
<1> to line
<2> syntactically constitutes a valid procedure body, and the
danger is that the parser finds out far too late that something is
wrong (indeed most compilers will behave uncomfortably on the above
To control this situation the GNAT parser avoids swallowing the last 'end' if it can be sure that some error will result from doing so. In particular, the parser will not accept the 'end' it if it is immediately followed by end of file, 'with' or 'separate' (all tokens that signal the start of a compilation unit, and which therefore allow us to reserve the 'end' for the outer level). For more details on this aspect of the handling, see package Endh. Similarly, if the enclosing package has no 'begin', then the result is a missing 'begin' message, which refers back to the subprogram header. Such an error message is not too bad (it's already a big improvement over what many parsers do), but it's not ideal, because the declarations following the 'is' have been associated with the wrong scope.
To catch at least some of these cases, the GNAT parser carries out the following additional steps. First, a subprogram body is marked as having a suspicious 'is' if the declaration line is followed by a line which starts with a symbol that can start a declaration in the same column, or to the left of the column in which the 'function' or 'procedure' starts (normal style is to indent any declarations which really belong a subprogram). If such a subprogram encounters a missing 'begin' or missing 'end', then the parser decides that the 'is' should have been a semicolon, and the subprogram body node is marked (by setting the Bad_Is_Detected flag to true). This is not done for library level procedures since they must have a body.
The processing for a declarative part checks to see if the last declaration scanned is marked in this way, and if it is, the tree is modified to reflect the 'is' being interpreted as a semicolon.
GNAT implements an heuristic error recovery mechanism on it which simplifies the implementation of the parser: when an error is detected, the scanner generates the corresponding error message, estimates a substitute token and returns it to the parser. In most cases this simple, but powerful, mechanism helps the parser to continue as if the source program had no lexical errors.
In case of simple errors the parser masks the failure and generates the right nodes as if the source program were correct. In case of complex errors, the parser implements a resynchronization mechanism based on exception handlers.