RE: How will software source code represent 21 bit unicode characters?

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Tue Apr 17 2001 - 13:40:55 EDT


Yves,

<or if there's some implicit assumption that '\U0010000' is
<of type wchar_t,

I don't see why it would have anything to do with wchar_t. The
representation is not important. The representation can be UTF32 or
UTF16.

Is it '\UXXXXXXXX', so that it must be written '\U00XXXXXX' to be valid?
Would it not be better to use '\UXXXXXX'? Your example has 7 hex digits.
Normally hex digits appear in pairs.

I assume that non-zero plane characters can be represented as \uxxxx\uxxxx
with the high/low surrogate encoding. Is \U0000xxxx invalid if xxxx is a
high or low surrogate code point? I would assume so.
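
To make the high/low surrogate arithmetic concrete, here is a minimal sketch in C. The function names (is_surrogate, to_surrogates, from_surrogates) are made up for illustration and do not come from any standard or library; the code only shows the UTF16 pair calculation and a range check of the kind a compiler or library might apply to a lone surrogate value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if cp lies in the surrogate range
   U+D800..U+DFFF, which is reserved for UTF16 pairs. */
static int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

/* Hypothetical helper: split a non-zero plane code point
   (U+10000..U+10FFFF) into a high and a low surrogate. */
static void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    cp -= 0x10000u;                              /* leaves a 20-bit value */
    *high = (uint16_t)(0xD800u + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* bottom 10 bits */
}

/* Hypothetical helper: recombine a valid surrogate pair. */
static uint32_t from_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800u && high <= 0xDBFFu);
    assert(low >= 0xDC00u && low <= 0xDFFFu);
    return 0x10000u
         + (((uint32_t)high - 0xD800u) << 10)
         + ((uint32_t)low - 0xDC00u);
}
```

With these, U+10000 splits into D800 DC00 and U+10FFFF into DBFF DFFF, and from_surrogates reverses the split.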

Carl

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
Behalf Of Yves Arrouye
Sent: Tuesday, April 17, 2001 8:30 AM
To: 'Marco Cimarosti'; unicode@unicode.org; 'William Overington'
Cc: archive@ngo.globalnet.co.uk
Subject: RE: How will software source code represent 21 bit unicode characters?

> > Has this matter already been addressed anywhere?
>
> I think the C standard is in the process of making a decision
> about this. If
> memory helps, we will have escapes like '\uXXXX' and '\UXXXXXXXX'.

I think they made the decision already. It is in the latest editions of the
standards. The only ambiguity (for me) is whether one can write:

        uint32_t codepoint = '\U0010000';

and have it work, or if there's some implicit assumption that '\U0010000' is
of type wchar_t, in which case the construction is not portable because of
the fact that the size of wchar_t is implementation-specific, and can be as
small as 8 bits. I am sure we have a C/C++ expert (or many!) here that can
clear that up though.
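
One way to sidestep the question, sketched here under the assumption that only the numeric code point is needed, is to avoid the character constant and write the value as a plain integer. The names codepoint_u10000 and wchar_can_hold are invented for this illustration; WCHAR_MAX is the standard macro from <wchar.h>.

```c
#include <stdint.h>
#include <wchar.h>

/* A numeric constant does not depend on how the compiler types or
   encodes a character constant. U+10000 is used as the example. */
static const uint32_t codepoint_u10000 = UINT32_C(0x10000);

/* Hypothetical helper: nonzero if wchar_t on this implementation
   has enough range to hold the code point cp. */
static int wchar_can_hold(uint32_t cp)
{
    return (uintmax_t)cp <= (uintmax_t)WCHAR_MAX;
}
```

On an implementation with a 16 bit wchar_t, wchar_can_hold(codepoint_u10000) would return 0, which is the portability concern described above.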

YA


