<html lang="en"> <head> <title>Character sets - The C Preprocessor</title> <meta http-equiv="Content-Type" content="text/html"> <meta name="description" content="The C Preprocessor"> <meta name="generator" content="makeinfo 4.13"> <link title="Top" rel="start" href="index.html#Top"> <link rel="up" href="Overview.html#Overview" title="Overview"> <link rel="next" href="Initial-processing.html#Initial-processing" title="Initial processing"> <link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage"> <!-- Copyright (C) 1987-2015 Free Software Foundation, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation. A copy of the license is included in the section entitled ``GNU Free Documentation License''. This manual contains no Invariant Sections. The Front-Cover Texts are (a) (see below), and the Back-Cover Texts are (b) (see below). (a) The FSF's Front-Cover Text is: A GNU Manual (b) The FSF's Back-Cover Text is: You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development. --> <meta http-equiv="Content-Style-Type" content="text/css"> <style type="text/css"><!-- pre.display { font-family:inherit } pre.format { font-family:inherit } pre.smalldisplay { font-family:inherit; font-size:smaller } pre.smallformat { font-family:inherit; font-size:smaller } pre.smallexample { font-size:smaller } pre.smalllisp { font-size:smaller } span.sc { font-variant:small-caps } span.roman { font-family:serif; font-weight:normal; } span.sansserif { font-family:sans-serif; font-weight:normal; } --></style> </head> <body> <div class="node"> <a name="Character-sets"></a> <p> Next: <a rel="next" accesskey="n" href="Initial-processing.html#Initial-processing">Initial processing</a>, Up: <a rel="up" accesskey="u" href="Overview.html#Overview">Overview</a> <hr> </div> <h3 class="section">1.1 Character sets</h3> <p>Source code character set processing in C and related languages is rather complicated. The C standard discusses two character sets, but there are really at least four. <p>The files input to CPP might be in any character set at all. CPP's very first action, before it even looks for line boundaries, is to convert the file into the character set it uses for internal processing. That set is what the C standard calls the <dfn>source</dfn> character set. It must be isomorphic with ISO 10646, also known as Unicode. CPP uses the UTF-8 encoding of Unicode. <p>The character sets of the input files are specified using the <samp><span class="option">-finput-charset=</span></samp> option. <p>All preprocessing work (the subject of the rest of this manual) is carried out in the source character set. If you request textual output from the preprocessor with the <samp><span class="option">-E</span></samp> option, it will be in UTF-8. <p>After preprocessing is complete, string and character constants are converted again, into the <dfn>execution</dfn> character set. This character set is under control of the user; the default is UTF-8, matching the source character set. Wide string and character constants have their own character set, which is not called out specifically in the standard. Again, it is under control of the user. The default is UTF-16 or UTF-32, whichever fits in the target's <code>wchar_t</code> type, in the target machine's byte order.<a rel="footnote" href="#fn-1" name="fnd-1"><sup>1</sup></a> Octal and hexadecimal escape sequences do not undergo conversion; <tt>'\x12'</tt> has the value 0x12 regardless of the currently selected execution character set. All other escapes are replaced by the character in the source character set that they represent, then converted to the execution character set, just like unescaped characters. <p>In identifiers, characters outside the ASCII range can only be specified with the ‘<samp><span class="samp">\u</span></samp>’ and ‘<samp><span class="samp">\U</span></samp>’ escapes, not used directly. If strict ISO C90 conformance is specified with an option such as <samp><span class="option">-std=c90</span></samp>, or <samp><span class="option">-fno-extended-identifiers</span></samp> is used, then those escapes are not permitted in identifiers. <div class="footnote"> <hr> <h4>Footnotes</h4><p class="footnote"><small>[<a name="fn-1" href="#fnd-1">1</a>]</small> UTF-16 does not meet the requirements of the C standard for a wide character set, but the choice of 16-bit <code>wchar_t</code> is enshrined in some system ABIs so we cannot fix this.</p> <hr></div> </body></html>