<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Copyright (C) 1987-2018 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation. A copy of
the license is included in the
section entitled "GNU Free Documentation License".

This manual contains no Invariant Sections. The Front-Cover Texts are
(a) (see below), and the Back-Cover Texts are (b) (see below).

(a) The FSF's Front-Cover Text is:

A GNU Manual

(b) The FSF's Back-Cover Text is:

You have freedom to copy and modify this GNU Manual, like GNU
software. Copies published by the Free Software Foundation raise
funds for GNU development. -->
<!-- Created by GNU Texinfo 6.4, http://www.gnu.org/software/texinfo/ -->
<head>
<title>Tokenization (The C Preprocessor)</title>

<meta name="description" content="Tokenization (The C Preprocessor)">
<meta name="keywords" content="Tokenization (The C Preprocessor)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href="index.html#Top" rel="start" title="Top">
<link href="Index-of-Directives.html#Index-of-Directives" rel="index" title="Index of Directives">
<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Overview.html#Overview" rel="up" title="Overview">
<link href="The-preprocessing-language.html#The-preprocessing-language" rel="next" title="The preprocessing language">
<link href="Initial-processing.html#Initial-processing" rel="prev" title="Initial processing">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
blockquote.smallindentedblock {margin-right: 0em; font-size: smaller}
blockquote.smallquotation {font-size: smaller}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
div.smalldisplay {margin-left: 3.2em}
div.smallexample {margin-left: 3.2em}
div.smalllisp {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
pre.smalldisplay {font-family: inherit; font-size: smaller}
pre.smallexample {font-size: smaller}
pre.smallformat {font-family: inherit; font-size: smaller}
pre.smalllisp {font-size: smaller}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
ul.no-bullet {list-style: none}
-->
</style>


</head>

<body lang="en">
<a name="Tokenization"></a>
<div class="header">
<p>
Next: <a href="The-preprocessing-language.html#The-preprocessing-language" accesskey="n" rel="next">The preprocessing language</a>, Previous: <a href="Initial-processing.html#Initial-processing" accesskey="p" rel="prev">Initial processing</a>, Up: <a href="Overview.html#Overview" accesskey="u" rel="up">Overview</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index-of-Directives.html#Index-of-Directives" title="Index" rel="index">Index</a>]</p>
</div>
<hr>
<a name="Tokenization-1"></a>
<h3 class="section">1.3 Tokenization</h3>

<a name="index-tokens"></a>
<a name="index-preprocessing-tokens"></a>
<p>After the textual transformations are finished, the input file is
converted into a sequence of <em>preprocessing tokens</em>. These mostly
correspond to the syntactic tokens used by the C compiler, but there are
a few differences. White space separates tokens; it is not itself a
token of any kind. Tokens do not have to be separated by white space,
but white space is often necessary to avoid ambiguities.
</p>
<p>When faced with a sequence of characters that has more than one possible
tokenization, the preprocessor is greedy. It always makes each token,
starting from the left, as big as possible before moving on to the next
token. For instance, <code>a+++++b</code> is interpreted as
<code>a ++ ++ + b<!-- /@w --></code>, not as <code>a ++ + ++ b<!-- /@w --></code>, even though the
latter tokenization could be part of a valid C program and the former
could not.
</p>
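<p>The greedy rule can be observed in ordinary C with the shorter sequence
<code>a+++b</code>, which is split as <code>a ++ + b<!-- /@w --></code>. The following is a
minimal sketch; the function names <code>greedy_sum</code> and
<code>incremented_operand</code> are illustrative and do not come from this manual.
</p>

```c
/* "a+++b" is tokenized greedily as a ++ + b, that is (a++) + b. */
int greedy_sum(int a, int b)
{
    return a+++b;            /* old value of a, plus b */
}

/* Evaluate the same expression and report which operand changed. */
int incremented_operand(int a, int b)
{
    int before_a = a, before_b = b;

    (void)(a+++b);
    if (a != before_a)
        return 'a';
    if (b != before_b)
        return 'b';
    return 0;
}
```

<p>Because the first <code>++</code> is taken as large as possible, it attaches to
<code>a</code>; the alternative reading <code>a + ++b<!-- /@w --></code> is never considered.
</p>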
<p>Once the input file is broken into tokens, the token boundaries never
change, except when the ‘<samp>##</samp>’ preprocessing operator is used to paste
tokens together (see <a href="Concatenation.html#Concatenation">Concatenation</a>). For example,
</p>
<div class="smallexample">
<pre class="smallexample">#define foo() bar
foo()baz
→ bar baz
<em>not</em>
→ barbaz
</pre></div>

<p>The compiler does not re-tokenize the preprocessor’s output. Each
preprocessing token becomes one compiler token.
</p>
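<p>As a sketch of both halves of this rule, the macros <code>MINUS</code> and
<code>PASTE</code> below are hypothetical helpers, not part of GCC. The first
function shows that two adjacent tokens produced by macro expansion are not
merged into a larger token; the second shows ‘<samp>##</samp>’ deliberately forming
a single identifier.
</p>

```c
#define MINUS -
#define PASTE(x, y) x##y

int double_negation(void)
{
    /* Expands to the tokens "-" "-" "5", i.e. -(-5).  The two minus
       signs are not re-read as a single "--" token. */
    return -MINUS 5;
}

int pasted_name(void)
{
    int PASTE(bar, baz) = 7;   /* ## forms the one identifier barbaz */

    return barbaz;
}
```

<p>If the expansion of <code>-MINUS 5<!-- /@w --></code> were re-tokenized as text, it would
read <code>--5</code>, which is not valid C.
</p>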
<a name="index-identifiers"></a>
<p>Preprocessing tokens fall into five broad classes: identifiers,
preprocessing numbers, string literals, punctuators, and other. An
<em>identifier</em> is the same as an identifier in C: any sequence of
letters, digits, or underscores, which begins with a letter or
underscore. Keywords of C have no significance to the preprocessor;
they are ordinary identifiers. You can define a macro whose name is a
keyword, for instance. The only identifier which can be considered a
preprocessing keyword is <code>defined</code>. See <a href="Defined.html#Defined">Defined</a>.
</p>
<p>This is mostly true of other languages which use the C preprocessor.
However, a few of the keywords of C++ are significant even in the
preprocessor. See <a href="C_002b_002b-Named-Operators.html#C_002b_002b-Named-Operators">C++ Named Operators</a>.
</p>
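<p>A small sketch of a keyword used as a macro name, and of <code>defined</code>
in a conditional. The names <code>FEATURE_WIDE</code>,
<code>size_while_redefined</code>, <code>size_after_undef</code> and
<code>wide_enabled</code> are made up for illustration. The redefinition is placed
after every <code>#include</code> and removed immediately, since standard headers
should not be included while a keyword is defined as a macro.
</p>

```c
#include <stddef.h>

#define FEATURE_WIDE 1

/* To the preprocessor "double" is an ordinary identifier, so it can
   name a macro.  Here it is temporarily replaced by "float". */
#define double float
size_t size_while_redefined(void) { return sizeof(double); }
#undef double

/* With the macro gone, "double" is the C keyword again. */
size_t size_after_undef(void) { return sizeof(double); }

int wide_enabled(void)
{
#if defined(FEATURE_WIDE)
    return 1;
#else
    return 0;
#endif
}
```

<p>Redefining keywords this way is shown only to illustrate the rule; it is
rarely a good idea in real code.
</p>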
<p>In the 1999 C standard, identifiers may contain letters which are not
part of the “basic source character set”, at the implementation’s
discretion (such as accented Latin letters, Greek letters, or Chinese
ideograms). This may be done with an extended character set, or the
‘<samp>\u</samp>’ and ‘<samp>\U</samp>’ escape sequences. GCC only accepts such
characters in the ‘<samp>\u</samp>’ and ‘<samp>\U</samp>’ forms.
</p>
<p>As an extension, GCC treats ‘<samp>$</samp>’ as a letter. This is for
compatibility with some systems, such as VMS, where ‘<samp>$</samp>’ is commonly
used in system-defined function and object names. ‘<samp>$</samp>’ is not a
letter in strictly conforming mode, or if you specify the <samp>-$</samp>
option. See <a href="Invocation.html#Invocation">Invocation</a>.
</p>
<a name="index-numbers"></a>
<a name="index-preprocessing-numbers"></a>
<p>A <em>preprocessing number</em> has a rather bizarre definition. The
category includes all the normal integer and floating point constants
one expects of C, but also a number of other things one might not
initially recognize as a number. Formally, preprocessing numbers begin
with an optional period, a required decimal digit, and then continue
with any sequence of letters, digits, underscores, periods, and
exponents. Exponents are the two-character sequences ‘<samp>e+</samp>’,
‘<samp>e-</samp>’, ‘<samp>E+</samp>’, ‘<samp>E-</samp>’, ‘<samp>p+</samp>’, ‘<samp>p-</samp>’, ‘<samp>P+</samp>’, and
‘<samp>P-</samp>’. (The exponents that begin with ‘<samp>p</samp>’ or ‘<samp>P</samp>’ are
used for hexadecimal floating-point constants.)
</p>
<p>The purpose of this unusual definition is to isolate the preprocessor
from the full complexity of numeric constants. It does not have to
distinguish between lexically valid and invalid floating-point numbers,
which is complicated. The definition also permits you to split an
identifier at any position and get exactly two tokens, which can then be
pasted back together with the ‘<samp>##</samp>’ operator.
</p>
<p>It’s possible for preprocessing numbers to cause programs to be
misinterpreted. For example, <code>0xE+12</code> is a single preprocessing number
which does not translate to any valid numeric constant, and is therefore a
syntax error. It does not mean <code>0xE + 12<!-- /@w --></code>, which is what you
might have intended.
</p>
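<p>The following sketch contrasts the spaced and unspaced spellings and shows
preprocessing numbers being pasted. <code>PASTE</code> and the function names are
hypothetical helpers introduced for this example only.
</p>

```c
#define PASTE(x, y) x##y

/* With white space, "0xE + 12" is three tokens: 14 + 12. */
int spaced_sum(void) { return 0xE + 12; }

/* "0x" is a preprocessing number (a digit followed by a letter) even
   though it is not a valid constant by itself; pasting it with the
   identifier FF yields the valid constant 0xFF. */
int pasted_hex(void) { return PASTE(0x, FF); }

/* Two preprocessing numbers pasted into one. */
int pasted_decimal(void) { return PASTE(12, 34); }

/* Without white space the text below is ONE preprocessing number,
   because "E+" is read as an exponent; it would not compile:

       int broken = 0xE+12;
*/
```

<p>Writing the operator with surrounding white space, as in
<code>spaced_sum</code>, avoids the problem entirely.
</p>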
<a name="index-string-literals"></a>
<a name="index-string-constants"></a>
<a name="index-character-constants"></a>
<a name="index-header-file-names"></a>
<p><em>String literals</em> are string constants, character constants, and
header file names (the argument of ‘<samp>#include</samp>’).<a name="DOCF2" href="#FOOT2"><sup>2</sup></a> String constants and character
constants are straightforward: <tt>"…"</tt> or <tt>'…'</tt>. In
either case embedded quotes should be escaped with a backslash:
<tt>'\''</tt> is the character constant for ‘<samp>'</samp>’. There is no limit on
the length of a character constant, but the value of a character
constant that contains more than one character is
implementation-defined. See <a href="Implementation-Details.html#Implementation-Details">Implementation Details</a>.
</p>
<p>Header file names either look like string constants, <tt>"…"</tt>, or are
written with angle brackets instead, <tt><…></tt>. In either case,
backslash is an ordinary character. There is no way to escape the
closing quote or angle bracket. The preprocessor looks for the header
file in different places depending on which form you use. See <a href="Include-Operation.html#Include-Operation">Include Operation</a>.
</p>
<p>No string literal may extend past the end of a line. You may use continued
lines instead, or string constant concatenation.
</p>
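<p>A short sketch of the two ways to write a long string constant, plus an
escaped quote in a character constant. All identifiers here are illustrative.
Note that the continued line must start in the first column, because any
leading white space on it would become part of the string.
</p>

```c
#include <string.h>

/* Adjacent string constants are concatenated by the compiler. */
static const char joined[] = "first part, " "second part";

/* A backslash at the very end of a line continues the line. */
static const char continued[] = "first part, \
second part";

/* An embedded single quote is escaped with a backslash. */
static const char quote = '\'';

int same_text(void) { return strcmp(joined, continued) == 0; }

char quote_char(void) { return quote; }
```

<p>Concatenation is usually the more readable choice, since it lets each piece
keep its own indentation.
</p>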
<a name="index-punctuators"></a>
<a name="index-digraphs"></a>
<a name="index-alternative-tokens"></a>
<p><em>Punctuators</em> are all the usual bits of punctuation which are
meaningful to C and C++. All but three of the punctuation characters in
ASCII are C punctuators. The exceptions are ‘<samp>@</samp>’, ‘<samp>$</samp>’, and
‘<samp>`</samp>’. In addition, all the two- and three-character operators are
punctuators. There are also six <em>digraphs</em>, which the C++ standard
calls <em>alternative tokens</em>, which are merely alternate ways to spell
other punctuators. Digraphs are a second attempt to work around missing
punctuation in obsolete systems. They have no negative side effects,
unlike trigraphs, but do not cover as much ground. The digraphs and
their corresponding normal punctuators are:
</p>
<div class="smallexample">
<pre class="smallexample">Digraph:        <%  %>  <:  :>  %:  %:%:
Punctuator:      {   }   [   ]   #    ##
</pre></div>

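<p>For illustration, here is a fragment spelled entirely with digraphs. It is a
sketch that assumes a compiler with digraph support enabled, which is the
case for GCC in its default C mode; <code>COUNT</code>, <code>GLUE</code> and the
function names are invented for this example.
</p>

```c
%:define COUNT 2
%:define GLUE(a, b) a %:%: b

/* Equivalent to:  int v[2] = { 3, 4 };  return v[0] + v[1];  */
int digraph_sum(void)
<%
    int v<:COUNT:> = <% 3, 4 %>;

    return v<:0:> + v<:1:>;
%>

/* %:%: is the digraph spelling of ##, so GLUE(ab, cd) names abcd. */
int glued_value(void)
<%
    int GLUE(ab, cd) = 5;

    return abcd;
%>
```

<p>Each digraph is a single token with the same meaning as the punctuator it
stands for, so the two spellings can be mixed freely.
</p>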
<a name="index-other-tokens"></a>
<p>Any other single character is considered “other”. It is passed on to
the preprocessor’s output unchanged. The C compiler will almost
certainly reject source code containing “other” tokens. In ASCII, the
only “other” characters are ‘<samp>@</samp>’, ‘<samp>$</samp>’, ‘<samp>`</samp>’, and control
characters other than NUL (all bits zero). (Note that ‘<samp>$</samp>’ is
normally considered a letter.) All characters with the high bit set
(numeric range 0x80–0xFF) are also “other” in the present
implementation. This will change when proper support for international
character sets is added to GCC.
</p>
<p>NUL is a special case because of the high probability that its
appearance is accidental, and because it may be invisible to the user
(many terminals do not display NUL at all). Within comments, NULs are
silently ignored, just as any other character would be. In running
text, NUL is considered white space. For example, these two directives
have the same meaning.
</p>
<div class="smallexample">
<pre class="smallexample">#define X^@1
#define X 1
</pre></div>

<p>(where ‘<samp>^@</samp>’ is ASCII NUL). Within string or character constants,
NULs are preserved. In the latter two cases (running text, and string or
character constants) the preprocessor emits a warning message.
</p>
<div class="footnote">
<hr>
<h4 class="footnotes-heading">Footnotes</h4>

<h3><a name="FOOT2" href="#DOCF2">(2)</a></h3>
<p>The C
standard uses the term <em>string literal</em> to refer only to what we are
calling <em>string constants</em>.</p>
</div>
</body>
</html>