next up previous contents index
Next: II. Second Part: Semantic Up: I. First Part: Introduction Previous: 2. Overview of the   Contents   Index

Subsections


3. Error Recovery

The GNAT scanner implements some basic error recovery techniques which simplify the implementation of the parser. The GNAT parser has what is considered to be the best set of error recovery strategies of any Ada compiler in use. In this chapter we briefly present these strategies. It is structured as follows: Section 3.1 introduces the error recovery techniques implemented in the scanner, and Section 3.2 presents the mechanisms used to resynchronize the parser. Section 3.2.1 presents the parser scope-stack which is used to handle nested scopes parsing; Section 3.1.1 discusses the use of the programmer casing convention to distinguish keywords from user-defined identifiers, and finally Sections 3.2.3 and 3.2.4 discuss the special handling of the keyword 'is' used in place of semicolon.


3.1 Scanner Error Recovery

The GNAT scanner implements a simple error recovery technique which simplifies error handing to the parser. After detecting a lexical error, the scanner posts the corresponding error message and returns one heuristic Token which masks the error to the next phases of the compiler. For example:

In most cases this simple but powerful mechanism helps masking lexical errors to the parser. This simplifies the implementation of the parser, which does not need to repeatedly handle them in many contexts. For additional details, read the Scan, Nlit, and Slit subprograms in package Scn.


3.1.1 Use of the Casing of Identifiers

The GNAT scanner assumes that the user has some consistent policy regarding the casing convention used to distinguish keywords from user-defined entities. The scanner deduces this convention from the first keyword and identifier that it encounters (cf. Package  Casing). In the following example GNAT makes use of the upper/lower case rule for identifiers to treat the word exception as an indented identifier rather than the beginning of an exception handler (cf. subprogram scan_reserved_identifier).

\begin{figure}\begin{lstlisting}{}
procedure Wrong1 is
Exception : Integer;
\v...
...nnot be used as identifier
begin
null;
end Wrong1;
\end{lstlisting}\end{figure}


3.2 Parser Error Recovery

The GNAT parser includes a sophisticated error recovery system which, among other things, takes indentation into account when attempting to correct scope errors. When an error is encountered, a call is made to one parser routine to record the error (cf. Package Errout). If the parser can recover locally, it masks the failure to the next stages of the front-end by generating the AST nodes as if the syntax were right, and the parsing continues unimpeded. On the other hand, if the error represents a situation from which the parser cannot recover locally, the exception Error_Resync) is raised after the call to the routine that records the error. Exception handlers are located at strategic points to resynchronize the parser. For example, when an error occurs in a statement, the handler skips to the next semicolon and continues the scan from there.

In the GNAT sources, each parsing routine has a note with the heading ``Error recovery'' which shows if it can propagate the Error_Resync exception (cf. Files par-ch2.adb to par-ch13.adb). To avoid propagating the exception, a procedure must either contain its own handler for this exception, or it must not call any other routines which propagate the exception.


3.2.1 The Parser Scope-Stack

Many rules of the language define a syntax scope (rules ended with 'end'). For example, the syntax rules of packages, subprograms and all the flow-of-control statements. The GNAT parser uses a scope-stack to record the scope context. An entry is made when the parser encounters the opening of a nested construct, and then package Endh uses this stack to deal with 'end' lines (including properly dealing with 'end' nesting errors).

In case of parsing resynchronization by means of the exception mechanism (cf. Section 3.2), the arrangement of the exception handlers is such that it should never be possible to transfer control through a procedure which made an entry in the scope stack, invalidating the contents of the stack.

In some cases, at the end of a syntax scope, the programmer is allowed to specify a name (ie. at the end of a subprogram body); other scope rules have a rigid format (ie. at the end of a record-type definition). In the first case, it is a semantic error to open a syntax scope with a name and to close it with a different name. Although many Ada compilers detect this error in the phase of semantic analysis, GNAT uses the parser scope-stack to detect it as soon as possible and thus simplify the semantics.

3.2.2 Example 1: Use of indentation

The next example combines the use of the scope-stack plus the indentation to match the statement:

\begin{figure}\begin{lstlisting}{}
procedure Wrong2 is
A, B : Integer;
begin
...
...\vert
>>> ''end if;'' expected for ''if'' at line 4
\end{lstlisting}\end{figure}

Note that a more conventional approach to error recovery would have produced two errors: it would have identified the end with the if-statement, complained about a missing ``if'', and then complained about the missing end for the procedure itself.


3.2.3 Example 2: Handling Semicolon Used in Place of 'is'

The two contexts in which a semicolon may have been erroneously used in place of 'is' are at the outer level, and within a declarative region. The first case corresponds to the following example:

\begin{figure}\begin{lstlisting}{}
-- Case 1: At the outer level
procedure X (Y : Integer);
Q : Integer;
begin
...
end;
\end{lstlisting}\end{figure}

In this case the GNAT parser knows that something is wrong as soon as it encounters Q (or 'begin', if there are no declarations), and it can immediately diagnose that the semicolon should have been 'is'. The situation in a declarative region is more complex, and corresponds to the following example:

\begin{figure}\begin{lstlisting}{}
-- Case 2: Within a declarative region
declar...
...: Integer;
begin -- <2>
...
end;
begin
...
end;
\end{lstlisting}\end{figure}

In this case, the syntax error (line <1>) has the syntax of a subprogram declaration [AAR95, Section 6-1]. Therefore, the declaration of Q is read as belonging to the outer region. The parser does not know that it was an error until it encounters the 'begin' (line <2>). It is still not clear at this point from a syntactic point of view that something is wrong, because the 'begin' could belong to the enclosing syntax scope. However, the GNAT parser incorporates a bit of semantic knowledge and note that the body of X is missing, so it diagnoses the error as semicolon in place of 'is' on the subprogram line. To control this analysis, some global variables with prefix ``SIS_'' are used to indicate that we have a subprogram declaration whose body is required and has not yet been found. For example, the variable SIS_Entry_Active indicates that a subprogram declaration has been encountered, but no body for this subprogram has been encountered yet. The prefix stands for ``Subprogram IS'' handling. Five things can happen to an active SIS entry:

  1. If a 'begin' is encountered with an SIS entry active, then we have exactly the situation in which we know the body of the subprogram is missing. After posting an error message, the parser rebuilds the AST: it changes the specification to a body, re-chaining the declarations found between the specification and the word 'begin'.

  2. Another subprogram declaration or body is encountered. In this case the entry gets overwritten with the information for the new subprogram declaration. Therefore, the GNAT parser does not catch some nested cases, but it does not seem worth the effort.

  3. A nested declarative region (e.g. package declaration or package body) is encountered. The SIS active indication is reset at the start of such a nested region. Again, similar to the previous case, the parser misses some nested cases, but it does not seem worth the effort to stack and unstack the SIS information.

  4. A valid pragma 'interface' or 'import' supplies the missing body. In this case the parser resets the entry.

  5. The parser encounters the end of the declarative region without encountering a 'begin' first. In this situation the parser simply resets the entry: there is a missing body, but it seems more reasonable to let the later semantic checking discover this.


3.2.4 Example 3: Handling 'is' Used in Place of Semicolon

This is a somewhat trickier situation, and although the GNAT parser can not catch it in all cases, it does its best to detect common situations resulting from a ``cut and paste'' operation which forgets to change the 'is' to semicolon. Let us consider the following example:

\begin{figure}\begin{lstlisting}{}
package body X is
procedure A;
procedure B ...
...edure D is
begin
...
end;
begin
...
end; -- <2>
\end{lstlisting}\end{figure}

The trouble is that the section of text from line <1> to line <2> syntactically constitutes a valid procedure body, and the danger is that the parser finds out far too late that something is wrong (indeed most compilers will behave uncomfortably on the above example).

To control this situation the GNAT parser avoids swallowing the last 'end' if it can be sure that some error will result from doing so. In particular, the parser will not accept the 'end' it if it is immediately followed by end of file, 'with' or 'separate' (all tokens that signal the start of a compilation unit, and which therefore allow us to reserve the 'end' for the outer level). For more details on this aspect of the handling, see package Endh. Similarly, if the enclosing package has no 'begin', then the result is a missing 'begin' message, which refers back to the subprogram header. Such an error message is not too bad (it's already a big improvement over what many parsers do), but it's not ideal, because the declarations following the 'is' have been associated with the wrong scope.

To catch at least some of these cases, the GNAT parser carries out the following additional steps. First, a subprogram body is marked as having a suspicious 'is' if the declaration line is followed by a line which starts with a symbol that can start a declaration in the same column, or to the left of the column in which the 'function' or 'procedure' starts (normal style is to indent any declarations which really belong a subprogram). If such a subprogram encounters a missing 'begin' or missing 'end', then the parser decides that the 'is' should have been a semicolon, and the subprogram body node is marked (by setting the Bad_Is_Detected flag to true). This is not done for library level procedures since they must have a body.

The processing for a declarative part checks to see if the last declaration scanned is marked in this way, and if it is, the tree is modified to reflect the 'is' being interpreted as a semicolon.

3.3 Summary

GNAT implements an heuristic error recovery mechanism on it which simplifies the implementation of the parser: when an error is detected, the scanner generates the corresponding error message, estimates a substitute token and returns it to the parser. In most cases this simple, but powerful, mechanism helps the parser to continue as if the source program had no lexical errors.

In case of simple errors the parser masks the failure and generates the right nodes as if the source program were correct. In case of complex errors, the parser implements a resynchronization mechanism based on exception handlers.


next up previous contents index
Next: II. Second Part: Semantic Up: I. First Part: Introduction Previous: 2. Overview of the   Contents   Index
(C) Javier Miranda and Edmond Schonberg, 2004