Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Antoine Leca (
Date: Thu May 25 2000 - 07:56:57 EDT

Markus Scherer wrote:
> Markus Scherer wrote:
> > Given the discussion about \xhh...h without {} curly braces, I believe I will also
> > suggest to keep our current interpretation of \xhh (fixed-length, 2 hex digits)
> > that we have at least in one place.
> Oops - mistake. I just looked again at the source code, and we actually have the
> \x this way:
> \xhh (variable-length, 1..2 hex digits)
> which is pretty consistent with C

OK so far.

> (except that C may accept more than 2 hex digits in wide strings).

This is a common misconception. One may believe that, provided
the C wchar_t type uses Unicode or UCS-4 (UTF-32) as its internal
encoding, you can write

  wchar_t agrave = L'\x00E0';


  wchar_t Amacron = L'\x0100';

Unfortunately, it won't work this way (more exactly, it should not!
If your compiler accepts this, it is broken!)

Here, \x00E0 and \x0100 are hexadecimal escape sequences for the source
character set (refer to Валерий's post), which is normally some variation
of US-ASCII, and in which 256 is usually not a valid code. So you
get an error...

The correct way is to write

  wchar_t agrave = 0x00E0;
  wchar_t Amacron = 0x0100;

if you are sure that wchar_t uses Unicode for encoding. Or, of course,
use the \u escape sequence, which has been designed for just that
purpose, *even if wchar_t uses another encoding*.
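
To make the two spellings concrete, here is a minimal sketch, assuming a
C99 compiler that accepts \u universal character names in wide character
constants. The function name ucn_matches_code_point is made up for this
illustration. __STDC_ISO_10646__ is the standard macro an implementation
defines when wchar_t values are ISO 10646 code points; only in that case
are the numeric form and the \u form expected to agree.

```c
#include <wchar.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns  1 if the \u spelling and the numeric spelling give the same
 *            wchar_t values,
 *          0 if they differ,
 *         -1 if the implementation does not promise that wchar_t holds
 *            ISO 10646 code points, so the comparison is meaningless.
 */
int ucn_matches_code_point(void)
{
#ifdef __STDC_ISO_10646__
    /* \u names the character itself, whatever encoding wchar_t uses. */
    wchar_t agrave_ucn  = L'\u00E0';
    wchar_t amacron_ucn = L'\u0100';

    /* Plain integers: correct only because wchar_t is ISO 10646 here. */
    wchar_t agrave_num  = 0x00E0;
    wchar_t amacron_num = 0x0100;

    return agrave_ucn == agrave_num && amacron_ucn == amacron_num;
#else
    return -1;
#endif
}
```

On an implementation that does not define __STDC_ISO_10646__, only the \u
form still names the intended characters; the integer form would then be
just a number in whatever encoding wchar_t happens to use.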

That said, theoretically, C can accept more than 2 digits in a \x sequence,
provided the C compiler uses a wider-than-8-bit source character set,
i.e. for practical purposes the source character repertoire is Unicode.
Such compilers might exist, but I believe they are quite rare.

