<html lang="en">
<head>
<title>Lexer - The GNU C Preprocessor Internals</title>
<meta http-equiv="Content-Type" content="text/html">
<meta name="description" content="The GNU C Preprocessor Internals">
<meta name="generator" content="makeinfo 4.13">
<link title="Top" rel="start" href="index.html#Top">
<link rel="prev" href="Conventions.html#Conventions" title="Conventions">
<link rel="next" href="Hash-Nodes.html#Hash-Nodes" title="Hash Nodes">
<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css"><!--
pre.display { font-family:inherit }
pre.format { font-family:inherit }
pre.smalldisplay { font-family:inherit; font-size:smaller }
pre.smallformat { font-family:inherit; font-size:smaller }
pre.smallexample { font-size:smaller }
pre.smalllisp { font-size:smaller }
span.sc { font-variant:small-caps }
span.roman { font-family:serif; font-weight:normal; }
span.sansserif { font-family:sans-serif; font-weight:normal; }
--></style>
</head>

<body>
<div class="node">
<a name="Lexer"></a>
<p>
Next: <a rel="next" accesskey="n" href="Hash-Nodes.html#Hash-Nodes">Hash Nodes</a>,
Previous: <a rel="previous" accesskey="p" href="Conventions.html#Conventions">Conventions</a>,
Up: <a rel="up" accesskey="u" href="index.html#Top">Top</a>
<hr>
</div>
<h2 class="unnumbered">The Lexer</h2>

<p><a name="index-lexer-3"></a><a name="index-newlines-4"></a><a name="index-escaped-newlines-5"></a>

<h3 class="section">Overview</h3>

<p>The lexer is contained in the file <samp><span class="file">lex.c</span></samp>. It is a hand-coded
lexer, and not implemented as a state machine. It can understand C, C++
and Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language. The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file. It
returns preprocessing tokens individually, not a line at a time.

<p>It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, <code>cpp_get_token</code>, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary. However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
<code>cpp_spell_token</code> and <code>cpp_token_len</code>. These functions are
useful when generating diagnostics, and for emitting the preprocessed
output.
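<p>As a rough illustration, a client might drive the lexer with a loop
along the lines of the sketch below, spelling each token into a scratch
buffer. This is an illustrative sketch, not code from cpplib itself:
the routine name <code>dump_tokens</code> is made up, and the exact signatures
of <code>cpp_get_token</code>, <code>cpp_token_len</code> and <code>cpp_spell_token</code>
vary between releases, so check the <samp><span class="file">cpplib.h</span></samp> you are building
against.

<pre class="smallexample">     /* Illustrative sketch only.  Assumed declarations (later releases
        give cpp_spell_token an extra argument):
          const cpp_token *cpp_get_token (cpp_reader *);
          unsigned int cpp_token_len (const cpp_token *);
          unsigned char *cpp_spell_token (cpp_reader *, const cpp_token *,
                                          unsigned char *);  */
     #include &lt;stdio.h&gt;
     #include &lt;stdlib.h&gt;
     #include "cpplib.h"

     static void
     dump_tokens (cpp_reader *pfile)
     {
       const cpp_token *token;

       while ((token = cpp_get_token (pfile))->type != CPP_EOF)
         {
           /* cpp_token_len gives an upper bound on the spelling length.  */
           unsigned int len = cpp_token_len (token);
           unsigned char *buf = (unsigned char *) malloc (len + 1);
           unsigned char *end = cpp_spell_token (pfile, token, buf);

           /* Preserve the whitespace the lexer recorded; newlines are not
              returned as tokens, so this dump comes out on one line.  */
           if (token->flags &amp; PREV_WHITE)
             putchar (' ');
           fwrite (buf, 1, end - buf, stdout);
           free (buf);
         }
     }
</pre>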
<h3 class="section">Lexing a token</h3>

<p>Lexing of an individual token is handled by <code>_cpp_lex_direct</code> and
its subroutines. In its current form the code is quite complicated,
with read-ahead characters and the like, since it strives not to step
back in the character stream, in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

<p>The job of <code>_cpp_lex_direct</code> is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping. It necessarily has a minor rôle to play in memory
management of lexed lines. I discuss these issues in a separate section
(see <a href="Lexing-a-line.html#Lexing-a-line">Lexing a line</a>).

<p>The lexer places the token it lexes into storage pointed to by the
variable <code>cur_token</code>, and then increments it. This variable is
important for correct diagnostic positioning. Unless a specific line
and column are passed to the diagnostic routines, they will examine the
<code>line</code> and <code>col</code> values of the token just before the location
that <code>cur_token</code> points to, and use that location to report the
diagnostic.

<p>The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
<code>PREV_WHITE</code> bit in the token's flags. Each token has its
<code>line</code> and <code>col</code> variables set to the line and column of the
first character of the token. This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.
<p>The first token on a logical, i.e. unescaped, line has the flag
<code>BOL</code> set for beginning-of-line. This flag is intended for
internal use, both to distinguish a ‘<samp><span class="samp">#</span></samp>’ that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it. Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the <code>BOL</code> flag set. The macro expansion may even be empty, and the
next token on the line certainly won't have the <code>BOL</code> flag set.
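<p>For example, given input like the following (a made-up fragment), the
first token the client sees on the second line is <code>int</code>. That token
was not the first one lexed on its line, so it does not carry <code>BOL</code>;
only the library, which saw <code>EMPTY</code> before expanding it away, knows
that a new line of tokens has started.

<pre class="smallexample">     #define EMPTY
     EMPTY int x = 1;
</pre>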
<p>New lines are treated specially; exactly how the lexer handles them is
context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion. Therefore, if the state variable
<code>in_directive</code> is set, the lexer returns a <code>CPP_EOF</code> token,
which is normally used to indicate end-of-file, to indicate
end-of-directive. In a directive a <code>CPP_EOF</code> token never means
end-of-file. Conveniently, if the caller was <code>collect_args</code>, it
already handles <code>CPP_EOF</code> as if it were end-of-file, and reports an
error about an unterminated macro argument list.
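<p>Input along these lines illustrates the point (a made-up example): the
newline after ‘<samp><span class="samp">f(1</span></samp>’ ends the <code>#if</code> directive, so
<code>collect_args</code> receives <code>CPP_EOF</code> while still looking for the
closing parenthesis, and diagnoses an unterminated argument list rather
than reading on into the next line.

<pre class="smallexample">     #define f(x) x
     #if f(1
     #endif
</pre>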
<p>The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This white space is
important in case the macro argument is stringified. The state variable
<code>parsing_args</code> is nonzero when the preprocessor is collecting the
arguments to a macro call. It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
<code>PREV_WHITE</code> flag of a token if it meets a new line when
<code>parsing_args</code> is set to 2. It doesn't set it if it meets a new
line when <code>parsing_args</code> is 1, since then code like

<pre class="smallexample">     #define foo() bar
     foo
     baz
</pre>
<p class="noindent">would be output with an erroneous space before ‘<samp><span class="samp">baz</span></samp>’:

<pre class="smallexample">     foo
      baz
</pre>

<p>This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.
<p>The lexer is written to treat each of ‘<samp><span class="samp">\r</span></samp>’, ‘<samp><span class="samp">\n</span></samp>’, ‘<samp><span class="samp">\r\n</span></samp>’
and ‘<samp><span class="samp">\n\r</span></samp>’ as a single new line indicator. This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.

<p>We also decided to treat a backslash, either ‘<samp><span class="samp">\</span></samp>’ or the trigraph
‘<samp><span class="samp">??/</span></samp>’, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline. It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars. Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.

<p>Handling newlines like this is made simpler by doing it in one place
only. The function <code>handle_newline</code> takes care of all newline
characters, and <code>skip_escaped_newlines</code> takes care of arbitrarily
long sequences of escaped newlines, deferring to <code>handle_newline</code>
to handle the newlines themselves.
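<p>The essential idea is easy to sketch in isolation. The helper below is
illustrative only, and is not the body of <code>handle_newline</code>: it
assumes the caller has already seen a ‘<samp><span class="samp">\r</span></samp>’ or ‘<samp><span class="samp">\n</span></samp>’ at
<code>cur</code>, and that the buffer ends in a sentinel so that looking one
character ahead is always safe.

<pre class="smallexample">     /* Illustrative sketch: given CUR pointing at '\r' or '\n', step
        over a single logical newline, folding the two-character
        sequences "\r\n" and "\n\r" into one.  */
     static const unsigned char *
     skip_one_newline (const unsigned char *cur)
     {
       unsigned char first = *cur++;

       /* "\r\n" and "\n\r" count as one newline, not two.  */
       if ((first == '\r' &amp;&amp; *cur == '\n')
           || (first == '\n' &amp;&amp; *cur == '\r'))
         cur++;
       return cur;
     }
</pre>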
<p>The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph ‘<samp><span class="samp">??/</span></samp>’ to introduce an escaped newline.

<p>Escaped newlines are tedious because theoretically they can occur
anywhere—between the ‘<samp><span class="samp">+</span></samp>’ and ‘<samp><span class="samp">=</span></samp>’ of the ‘<samp><span class="samp">+=</span></samp>’ token,
within the characters of an identifier, and even between the ‘<samp><span class="samp">*</span></samp>’
and ‘<samp><span class="samp">/</span></samp>’ that terminate a comment. Moreover, you cannot be sure
there is just one—there might be an arbitrarily long sequence of them.

<p>So, for example, the routine that lexes a number, <code>parse_number</code>,
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the ‘<samp><span class="samp">\</span></samp>’
introducing an escaped newline, or the ‘<samp><span class="samp">?</span></samp>’ introducing the trigraph
sequence that represents the ‘<samp><span class="samp">\</span></samp>’ of an escaped newline. If it
encounters a ‘<samp><span class="samp">?</span></samp>’ or ‘<samp><span class="samp">\</span></samp>’, it calls <code>skip_escaped_newlines</code>
to skip over any potential escaped newlines before checking whether the
number has been finished.
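<p>In outline the scanning loop has roughly the shape below. This is a
heavily simplified sketch, not the real <code>parse_number</code>: the
signature assumed for <code>skip_escaped_newlines</code> is hypothetical,
<code>ISIDNUM</code> is the character-class macro from GCC's
<samp><span class="file">safe-ctype.h</span></samp>, and the real routine also accumulates the
spelling and handles exponents.

<pre class="smallexample">     /* Simplified sketch of the shape of parse_number (not the real code).  */
     for (;;)
       {
         unsigned char c = *cur;

         if (c == '?' || c == '\\')
           {
             /* Might be a trigraph or an escaped newline; let the helper
                decide.  The signature shown is assumed, not cpplib's.  */
             const unsigned char *next = skip_escaped_newlines (pfile, cur);
             if (next == cur)
               break;            /* A literal '?' or '\\' ends the number.  */
             cur = next;
           }
         else if (ISIDNUM (c) || c == '.')
           cur++;                /* Still inside the pp-number.  */
         else
           break;                /* First character not part of the number.  */
       }
</pre>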
<p>Similarly, code in the main body of <code>_cpp_lex_direct</code> cannot simply
check for a ‘<samp><span class="samp">=</span></samp>’ after a ‘<samp><span class="samp">+</span></samp>’ character to determine whether it
has a ‘<samp><span class="samp">+=</span></samp>’ token; it needs to be prepared for an escaped newline of
some sort. Such cases use the function <code>get_effective_char</code>, which
returns the first character after any intervening escaped newlines.
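<p>Schematically the ‘<samp><span class="samp">+</span></samp>’ case then looks something like this. It is
a sketch only: the token-type names <code>CPP_PLUS</code>, <code>CPP_PLUS_EQ</code>
and <code>CPP_PLUS_PLUS</code> are cpplib's, but the surrounding control flow,
and what happens to a character that turns out not to belong to this
token, are simplified away.

<pre class="smallexample">     case '+':
       /* First significant character after any escaped newlines.  */
       c = get_effective_char (pfile);
       if (c == '+')
         result->type = CPP_PLUS_PLUS;
       else if (c == '=')
         result->type = CPP_PLUS_EQ;
       else
         {
           result->type = CPP_PLUS;
           /* C was not part of this token; the real code arranges for it
              to be seen again on the next call.  */
         }
       break;
</pre>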
<p>The lexer needs to keep track of the correct column position, including
counting tabs as specified by the <samp><span class="option">-ftabstop=</span></samp> option. This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
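<p>With 1-based columns, stepping the column count past a tab amounts to
rounding up to the next tab stop, along the lines of the helper below.
This illustrates the arithmetic only; it is not the code in
<samp><span class="file">lex.c</span></samp>, which also has to validate the value given to
<samp><span class="option">-ftabstop=</span></samp>.

<pre class="smallexample">     /* Advance a 1-based column number past a tab, assuming tab stops
        every TABSTOP columns (the -ftabstop= value, 8 by default).
        Illustrative only.  */
     static unsigned int
     column_after_tab (unsigned int col, unsigned int tabstop)
     {
       return ((col - 1) / tabstop + 1) * tabstop + 1;
     }
</pre>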
<p><a name="Invalid-identifiers"></a>Some identifiers, such as <code>__VA_ARGS__</code> and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
<code>parse_identifier</code>. In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state. For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using <code>__VA_ARGS__</code> in the expansion of a variable-argument macro.
Therefore <code>parse_identifier</code> makes use of state flags to determine
whether a diagnostic is appropriate. Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

<p>Another place where state flags are used to change behavior is whilst
lexing header names. Normally, a ‘<samp><span class="samp">&lt;</span></samp>’ would be lexed as a single
token. After a <code>#include</code> directive, though, it should be lexed as
a single token as far as the nearest ‘<samp><span class="samp">&gt;</span></samp>’ character. Note that we
don't allow the terminators of header names to be escaped; the first
‘<samp><span class="samp">"</span></samp>’ or ‘<samp><span class="samp">&gt;</span></samp>’ terminates the header name.

<p>Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, ‘<samp><span class="samp">::</span></samp>’ is a single token in C++, but in C it is
two separate ‘<samp><span class="samp">:</span></samp>’ tokens and almost certainly a syntax error. Such
cases are handled by <code>_cpp_lex_direct</code> based upon command-line
flags stored in the <code>cpp_options</code> structure.
<p>Once a token has been lexed, it leads an independent existence. The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with <code>_cpp_pop_buffer</code>.
The storage holding the spellings of such tokens remains until the
client program calls <code>cpp_destroy</code>, probably at the end of the
translation unit.

<p><a name="Lexing-a-line"></a>

<h3 class="section">Lexing a line</h3>
<p><a name="index-token-run-6"></a>
When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid. This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

<p>Occasionally the preprocessor wants to be able to peek ahead in the
token stream. For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
<code>#pragma</code> directive and not recognizing it as a registered pragma,
it wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full <code>#pragma</code> token stream. The stand-alone
preprocessor wants to be able to test the current token against the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid. The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

<p>The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a <dfn>token run</dfn>, and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run. Note that we do <em>not</em> lex a line of
tokens at once; if we did that <code>parse_identifier</code> would not have
state flags available to warn about invalid identifiers (see <a href="Invalid-identifiers.html#Invalid-identifiers">Invalid identifiers</a>).
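<p>Concretely, a token run is little more than a block of tokens chained to
its neighbours. The definition below reflects the shape of the structure
in cpplib's internal header; take it as a sketch of the idea rather than
as authoritative reference material.

<pre class="smallexample">     /* A run of tokens; long lines are handled by chaining runs together.
        cpp_token comes from cpplib.h.  */
     typedef struct tokenrun tokenrun;
     struct tokenrun
     {
       tokenrun *next, *prev;    /* Neighbouring runs in the chain.  */
       cpp_token *base, *limit;  /* The tokens, and one past the end.  */
     };
</pre>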
<p>In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable. In particular,
since a directive only occupies a single logical line, this means that
directive handlers like the <code>#pragma</code> handler can jump around
in the directive's tokens if necessary.

<p>Two issues remain: what about tokens that arise from macro expansions,
and what happens when we have a long line that overflows the token run?

<p>Since we promise clients that we preserve the validity of pointers that
we have already returned for tokens that appeared earlier in the line,
we cannot reallocate the run. Instead, on overflow it is expanded by
chaining a new token run on to the end of the existing one.
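<p>The overflow path is correspondingly simple: when <code>cur_token</code>
reaches the run's <code>limit</code>, a fresh run is chained on, or one chained
previously is reused, rather than reallocating the existing run. The
sketch below uses the <code>tokenrun</code> layout shown earlier and plain
<code>malloc</code>; the real code in <samp><span class="file">lex.c</span></samp> uses cpplib's own
allocation helpers and its preferred run size.

<pre class="smallexample">     #include &lt;stdlib.h&gt;   /* For the sketch's malloc; cpplib uses its own.  */

     #define RUN_SIZE 250  /* An arbitrary number of tokens per run.  */

     /* Return the run following RUN, chaining a new one on first use.
        Illustrative sketch only.  */
     static tokenrun *
     next_run (tokenrun *run)
     {
       if (run->next == NULL)
         {
           tokenrun *fresh = (tokenrun *) malloc (sizeof (tokenrun));

           fresh->base = (cpp_token *) malloc (RUN_SIZE * sizeof (cpp_token));
           fresh->limit = fresh->base + RUN_SIZE;
           fresh->next = NULL;
           fresh->prev = run;
           run->next = fresh;
         }
       return run->next;
     }
</pre>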
<p>The tokens forming a macro's replacement list are collected by the
<code>#define</code> handler, and placed in storage that is only freed by
<code>cpp_destroy</code>. So if a macro is expanded in the line of tokens,
the pointers to the tokens of its expansion that are returned will always
remain valid. However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens. They are the built-in
macros like <code>__LINE__</code>, and the ‘<samp><span class="samp">#</span></samp>’ and ‘<samp><span class="samp">##</span></samp>’ operators
for stringification and token pasting. I handled this by allocating
space for these tokens from the lexer's token run chain. This means
they automatically receive the same lifetime guarantees as lexed tokens,
and we don't need to concern ourselves with freeing them.

<p>Lexing into a line of tokens solves some of the token memory management
issues, but not all. The opening parenthesis after a function-like
macro name might lie on a different line, and the front ends definitely
want the ability to look ahead past the end of the current line. So
cpplib only moves back to the start of the token run at the end of a
line if the variable <code>keep_tokens</code> is zero. Line-buffering is
quite natural for the preprocessor, and as a result the only time cpplib
needs to increment this variable is whilst looking for the opening
parenthesis to, and reading the arguments of, a function-like macro. In
the near future cpplib will export an interface to increment and
decrement this variable, so that clients can share full control over the
lifetime of token pointers too.
<p>The routine <code>_cpp_lex_token</code> handles moving to new token runs,
calling <code>_cpp_lex_direct</code> to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream. It also
checks each token for the <code>BOL</code> flag, which might indicate a
directive that needs to be handled, or require a start-of-line call-back
to be made. <code>_cpp_lex_token</code> also handles skipping over tokens in
failed conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive. In other words, its callers do not need to concern
themselves with such issues.

</body></html>