Re: 32'nd bit & UTF-8

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Fri Jan 21 2005 - 07:32:38 CST

Next message: Clark Cox: "Re: 32'nd bit & UTF-8"

Previous message: Arcane Jill: "Conformance (was UTF, BOM, etc)"
Maybe in reply to: Hans Aberg: "32'nd bit & UTF-8"
Next in thread: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"

-----Original Message-----
From: Philippe Verdy [mailto:vpi92@yahoo.fr]
Sent: 21 January 2005 13:06
To: Arcane Jill
Cc: unicode@unicode.org
Subject: Re: 32'nd bit & UTF-8

>Arcane Jill <arcanejill@ramonsky.com> wrote:
>> The existence of wchar_t does not imply UTF-32. It does imply UTF-16.

That was a typo of course. It should have read "It does NOT imply UTF-16".

> I like this definition, but what is interesting here are the phrases
> "character set" and "supported by the compilation environment".
>
> "character set": the definition implies that this is necessarily a
> *coded* character set, because it makes an equation between what it
> calls a "character" and an "integer character constant". Unfortunately,
> the definition of "character" is weak. It does not have the same
> meaning as the "abstract character" defined in Unicode/ISO/IEC, so it
> could map to Unicode's "code units". This would make UTF-16 suitable.
>
> But if it needs to match with "abstract characters", then there's no
> choice for a C++ compiler: the integer datatype representing "wchar_t"
> must be able to contain at least as many distinct values as the ISO/IEC
> 10646 repertoire, and must contain the value 0.

Well, wchar_t on Windows is 16 bits wide, and hence /not/ able to contain as
many distinct values as the ISO/IEC 10646 repertoire. Gotta be code units then.
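
To make the code-unit point concrete, here is a minimal sketch in C. The names
utf16_encode and wchar_t_fits_any_code_point are hypothetical helpers invented
for illustration, not part of any standard library. The sketch assumes UTF-16
as the encoding and shows that a code point above U+FFFF cannot fit in a single
16-bit unit, so it has to be split into two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Nonzero if this platform's wchar_t can hold every code point up to
   U+10FFFF in a single value. With a 16-bit wchar_t this is 0. */
int wchar_t_fits_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}

/* Hypothetical helper: encode one code point as UTF-16 code units.
   Returns the number of 16-bit units written (1 or 2), or 0 if cp is
   a surrogate or lies beyond U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                      /* not a valid scalar value */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* fits in one code unit */
        return 1;
    }

    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Under these assumptions, utf16_encode(0x1D11E, units) returns 2 and fills units
with 0xD834 and 0xDD1E, whereas utf16_encode(0x41, units) returns 1.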

> The definition also does not say that the value 0 will necessarily be
> the same as a NULL character (U+0000). This depends on the "supported
> character set" in compile-time locales. There may as well exist a
> supported encoded charset that maps U+0000 to the integer value -2
> (because there's no requirement that integer values match ISO/IEC 10646
> codepoints). The definition relates only to the "null character" i.e.
> the one that "\0" maps to in string or character constants, but makes
> no assumption about whether this null matches the ISO/IEC 10646 NULL (U+0000)
> character.

It is fortunate, then, that C was never implemented on the ZX80 or ZX81, for
which '\0' would have been the SPACE character (U+0020). (See
http://web.ukonline.co.uk/sinclair.zx81/appxa.html). On the ZX80/81, every
space would have terminated a string!
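
A rough sketch of that failure mode in C, for illustration only. The helpers
zx81_encode and c_string_length are hypothetical, and the mapping assumes the
ZX81 table puts SPACE at code 0 and A to Z at codes 38 to 63; check the linked
appendix before relying on those numbers. Anything other than an uppercase
letter is mapped to 0 as a simplification, and the host is assumed to use
ASCII, where 'A'..'Z' are contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed ZX81-style mapping: 'A'..'Z' -> 38..63, everything else
   (including space) -> 0. A simplification, not the full table. */
static unsigned char zx81_code(char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(38 + (c - 'A'));
    return 0;
}

/* Hypothetical helper: re-encode an ASCII string into out, then append
   a terminating 0. Returns the number of characters encoded, not
   counting the terminator. */
size_t zx81_encode(const char *ascii, unsigned char *out, size_t out_size)
{
    assert(ascii != NULL && out != NULL);
    if (out_size == 0)
        return 0;

    size_t n = strlen(ascii);
    if (n > out_size - 1)
        n = out_size - 1;              /* leave room for the terminator */

    for (size_t i = 0; i < n; i++)
        out[i] = zx81_code(ascii[i]);
    out[n] = 0;                        /* same value as an encoded space */
    return n;
}

/* Length as any zero-terminated C string routine would see it. */
size_t c_string_length(const unsigned char *s)
{
    return strlen((const char *)s);
}
```

Encoding "HELLO WORLD" this way stores 11 characters, but c_string_length
reports 5, because the encoded space is indistinguishable from the terminator.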

Fun, eh?
Jill


This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 07:40:05 CST