Re: Question about \uxxxx etc. for 21-bit code points - need advice

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Thu May 25 2000 - 07:56:57 EDT

Next message: Antoine Leca: "Re: Question about \uxxxx etc. for 21-bit code points - need advice"
Previous message: Michael Everson: "Re: Tamil number system"
Maybe in reply to: Markus Scherer: "Question about \uxxxx etc. for 21-bit code points - need advice"
Next in thread: Antoine Leca: "Re: Question about \uxxxx etc. for 21-bit code points - need advice"

Markus Scherer wrote:
>
> Markus Scherer wrote:
> > Given the discussion about \xhh...h without {} curly braces, I believe I will also
> > suggest to keep our current interpretation of \xhh (fixed-length, 2 hex digits)
> > that we have at least in one place.
>
> Oops - mistake. I just looked again at the source code, and we actually have the
> \x this way:
> \xhh (variable-length, 1..2 hex digits)
>
> which is pretty consistent with C

OK so far.

> (except that C may accept more than 2 hex digits in wide strings).

This is a common misconception: one may believe that, provided the C
wchar_t type uses Unicode or UCS-4 (UTF-32) as its internal encoding,
you can write

wchar_t agrave = L'\x00E0';

and

wchar_t Amacron = L'\x0100';

Unfortunately, it won't work this way (more exactly, it should not!
If your compiler accepts this, it is broken!).

Here, \x00E0 and \x0100 are hexadecimal escape sequences for the source
character set (refer to Валерий's post), which is normally some variation
of US-ASCII, and where 256 is usually not a valid code. So you get an
error...

The correct way is to write

wchar_t agrave = 0x00E0;
wchar_t Amacron = 0x0100;

if you are sure that wchar_t uses Unicode for encoding. Or, of course,
use the \u escape sequence, which has been designed for just that
purpose, *even if wchar_t uses another encoding*.
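
The two spellings above can be put side by side in a minimal sketch. It
assumes a C99 compiler that accepts universal character names in wide
character constants. The names agrave_num, agrave_ucn, amacron_num,
amacron_ucn and spellings_agree are made up for this illustration. The
__STDC_ISO_10646__ macro is how a C99 implementation advertises that
wchar_t values are ISO 10646 code points; where it is not defined, the
comparison is skipped, because the numeric value of L'\u00E0' is then
whatever the implementation's wide encoding assigns.

```c
#include <stddef.h>   /* wchar_t */

/* Numeric initializers: only meaningful if wchar_t holds Unicode code points. */
static const wchar_t agrave_num  = 0x00E0;
static const wchar_t amacron_num = 0x0100;

/* Universal character names: the compiler maps them to its own wchar_t encoding. */
static const wchar_t agrave_ucn  = L'\u00E0';
static const wchar_t amacron_ucn = L'\u0100';

/*
 * Returns 1 if both spellings give the same values, 0 if they differ,
 * and -1 if the implementation does not promise ISO 10646 for wchar_t
 * (in which case no comparison is attempted).
 */
static int spellings_agree(void)
{
#ifdef __STDC_ISO_10646__
    return agrave_num == agrave_ucn && amacron_num == amacron_ucn;
#else
    return -1;
#endif
}
```

Calling spellings_agree() from main and printing the result is enough to
see which case a given compiler falls into.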

That said, theoretically, C can accept more than 2 digits in a \x sequence,
provided the C compiler uses a wider-than-8-bit source character set,
i.e. for practical purposes the source character repertoire is Unicode.
Such compilers might exist, but I believe they are quite rare.
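
For reference, the "\xhh (variable-length, 1..2 hex digits)" rule quoted
at the top could be sketched as follows. This is only an illustration of
that rule, not the code Markus was describing; hex_value and
parse_hex_escape are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit; the caller guarantees that isxdigit(c) is true. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/*
 * Parses the digits that follow "\x": at least 1, at most 2 hex digits.
 * s points just past the 'x'. On success, stores the value in *out and
 * returns the number of digits consumed (1 or 2). Returns 0, leaving
 * *out untouched, if s does not start with a hex digit.
 */
static size_t parse_hex_escape(const char *s, unsigned *out)
{
    unsigned value = 0;
    size_t n = 0;

    while (n < 2 && isxdigit((unsigned char)s[n])) {
        value = value * 16 + (unsigned)hex_value((unsigned char)s[n]);
        n++;
    }
    if (n > 0)
        *out = value;
    return n;
}
```

Under this rule the input \x0100 is read as the value 0x01 followed by
the two ordinary characters "00", since parsing stops after two digits.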

Antoine

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:03 EDT