You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
108 lines
4.8 KiB
HTML
108 lines
4.8 KiB
HTML
<html lang="en">
|
|
<head>
|
|
<title>Character sets - The C Preprocessor</title>
|
|
<meta http-equiv="Content-Type" content="text/html">
|
|
<meta name="description" content="The C Preprocessor">
|
|
<meta name="generator" content="makeinfo 4.13">
|
|
<link title="Top" rel="start" href="index.html#Top">
|
|
<link rel="up" href="Overview.html#Overview" title="Overview">
|
|
<link rel="next" href="Initial-processing.html#Initial-processing" title="Initial processing">
|
|
<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
|
|
<!--
|
|
Copyright (C) 1987-2015 Free Software Foundation, Inc.
|
|
|
|
Permission is granted to copy, distribute and/or modify this document
|
|
under the terms of the GNU Free Documentation License, Version 1.3 or
|
|
any later version published by the Free Software Foundation. A copy of
|
|
the license is included in the
|
|
section entitled ``GNU Free Documentation License''.
|
|
|
|
This manual contains no Invariant Sections. The Front-Cover Texts are
|
|
(a) (see below), and the Back-Cover Texts are (b) (see below).
|
|
|
|
(a) The FSF's Front-Cover Text is:
|
|
|
|
A GNU Manual
|
|
|
|
(b) The FSF's Back-Cover Text is:
|
|
|
|
You have freedom to copy and modify this GNU Manual, like GNU
|
|
software. Copies published by the Free Software Foundation raise
|
|
funds for GNU development.
|
|
-->
|
|
<meta http-equiv="Content-Style-Type" content="text/css">
|
|
<style type="text/css"><!--
|
|
pre.display { font-family:inherit }
|
|
pre.format { font-family:inherit }
|
|
pre.smalldisplay { font-family:inherit; font-size:smaller }
|
|
pre.smallformat { font-family:inherit; font-size:smaller }
|
|
pre.smallexample { font-size:smaller }
|
|
pre.smalllisp { font-size:smaller }
|
|
span.sc { font-variant:small-caps }
|
|
span.roman { font-family:serif; font-weight:normal; }
|
|
span.sansserif { font-family:sans-serif; font-weight:normal; }
|
|
--></style>
|
|
</head>
|
|
<body>
|
|
<div class="node">
|
|
<a name="Character-sets"></a>
|
|
<p>
|
|
Next: <a rel="next" accesskey="n" href="Initial-processing.html#Initial-processing">Initial processing</a>,
|
|
Up: <a rel="up" accesskey="u" href="Overview.html#Overview">Overview</a>
|
|
<hr>
|
|
</div>
|
|
|
|
<h3 class="section">1.1 Character sets</h3>
|
|
|
|
<p>Source code character set processing in C and related languages is
|
|
rather complicated. The C standard discusses two character sets, but
|
|
there are really at least four.
|
|
|
|
<p>The files input to CPP might be in any character set at all. CPP's
|
|
very first action, before it even looks for line boundaries, is to
|
|
convert the file into the character set it uses for internal
|
|
processing. That set is what the C standard calls the <dfn>source</dfn>
|
|
character set. It must be isomorphic with ISO 10646, also known as
|
|
Unicode. CPP uses the UTF-8 encoding of Unicode.
|
|
|
|
<p>The character sets of the input files are specified using the
|
|
<samp><span class="option">-finput-charset=</span></samp> option.
|
|
|
|
<p>All preprocessing work (the subject of the rest of this manual) is
|
|
carried out in the source character set. If you request textual
|
|
output from the preprocessor with the <samp><span class="option">-E</span></samp> option, it will be
|
|
in UTF-8.
|
|
|
|
<p>After preprocessing is complete, string and character constants are
|
|
converted again, into the <dfn>execution</dfn> character set. This
|
|
character set is under control of the user; the default is UTF-8,
|
|
matching the source character set. Wide string and character
|
|
constants have their own character set, which is not called out
|
|
specifically in the standard. Again, it is under control of the user.
|
|
The default is UTF-16 or UTF-32, whichever fits in the target's
|
|
<code>wchar_t</code> type, in the target machine's byte
|
|
order.<a rel="footnote" href="#fn-1" name="fnd-1"><sup>1</sup></a> Octal and hexadecimal escape sequences do not undergo
|
|
conversion; <tt>'\x12'</tt> has the value 0x12 regardless of the currently
|
|
selected execution character set. All other escapes are replaced by
|
|
the character in the source character set that they represent, then
|
|
converted to the execution character set, just like unescaped
|
|
characters.
|
|
|
|
<p>In identifiers, characters outside the ASCII range can only be
|
|
specified with the ‘<samp><span class="samp">\u</span></samp>’ and ‘<samp><span class="samp">\U</span></samp>’ escapes, not used
|
|
directly. If strict ISO C90 conformance is specified with an option
|
|
such as <samp><span class="option">-std=c90</span></samp>, or <samp><span class="option">-fno-extended-identifiers</span></samp> is
|
|
used, then those escapes are not permitted in identifiers.
|
|
|
|
<div class="footnote">
|
|
<hr>
|
|
<h4>Footnotes</h4><p class="footnote"><small>[<a name="fn-1" href="#fnd-1">1</a>]</small> UTF-16 does not meet the requirements of the C
|
|
standard for a wide character set, but the choice of 16-bit
|
|
<code>wchar_t</code> is enshrined in some system ABIs so we cannot fix
|
|
this.</p>
|
|
|
|
<hr></div>
|
|
|
|
</body></html>
|
|
|