Re: UCS-4, UCS-2, UTF-16, UTF-8

From: Doug Ewell (dewell@compuserve.com)
Date: Thu Feb 17 2000 - 19:58:05 EST


"Robert A. Rosenberg" <bob.rosenberg@digitscorp.com> wrote:

> At 01:18 PM 02/17/2000 -0800, Yung-Fong Tang wrote:
>> Not only that. UCS-4 does not specify byte order, but UTF-32BE and
>> UTF-32LE does. I think UTF-32 itself (not UTF-32BE neither UTF32-LE)
>> does not make too much sense. But remember byte order is essential in
>> network transmission.
>
> I thought that UTF-32 was able to handle byte-order by starting with
> either FEFF0000 or 0000FFFE (the "byte order signal character" [who
> code name I forgot]).
>
> Note: I might have the codepoint wrong and it is FEFF.

I don't think of the difference between UTF-32 and UCS-4 as having
anything to do with byte order. I know that there are charsets called
"UTF-32BE" and "UTF-32LE" that specify the byte order, but UCS-4 can be
handled and transmitted in either byte order as well. And unlike the case
with UCS-2 or UTF-16, the heuristic to determine the endianness of UCS-4
or UTF-32 text is a good one, since the high-order octet is guaranteed to
be 0.
So the suffixes "BE" and "LE" might not be as sorely needed for UTF-32
as for UTF-16.
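
To illustrate the heuristic, here is a minimal Python sketch. It assumes
the input is a whole number of 4-octet units (any trailing octets are
ignored) and simply compares how often the first and the last octet of
each unit is zero. The function name guess_utf32_byte_order is
hypothetical, made up for this example, and this is one possible way to
apply the heuristic, not a standard algorithm.

```python
def guess_utf32_byte_order(data: bytes) -> str:
    """Guess the byte order of BOM-less UCS-4/UTF-32 data.

    Returns "BE", "LE", or "unknown". Relies only on the observation that
    the high-order octet of every code unit is 0.
    """
    usable = len(data) - len(data) % 4  # ignore a trailing partial unit
    # Count units whose first octet is zero (high-order octet if big-endian).
    zeros_first = sum(1 for i in range(0, usable, 4) if data[i] == 0)
    # Count units whose last octet is zero (high-order octet if little-endian).
    zeros_last = sum(1 for i in range(0, usable, 4) if data[i + 3] == 0)
    if zeros_first > zeros_last:
        return "BE"
    if zeros_last > zeros_first:
        return "LE"
    return "unknown"


if __name__ == "__main__":
    print(guess_utf32_byte_order("Hi".encode("utf-32-be")))
    print(guess_utf32_byte_order("Hi".encode("utf-32-le")))
```

Text consisting only of U+0000, or empty input, gives "unknown", since
both octet positions are then zero equally often.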

Upon reading Technical Report #19 again, I discovered another important
difference between UTF-32 and UCS-4. Calling text "UTF-32" commits it
to the character semantics of Unicode, which are more specific and more
detailed than those of ISO 10646. This, plus the upper-end limit of
U-0010FFFF, is the primary difference between UTF-32 and UCS-4.
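
The range difference can be sketched in a few lines of Python. The
U-0010FFFF limit is the one mentioned above; the 31-bit UCS-4 limit of
U-7FFFFFFF is not stated in this message and is assumed here from ISO
10646 as it stood at the time. Both function names are hypothetical. The
difference in character semantics is not something a range check can
capture.

```python
UTF32_MAX = 0x0010FFFF  # upper limit for UTF-32, as stated above
UCS4_MAX = 0x7FFFFFFF   # assumption: 31-bit UCS-4 code space of ISO 10646


def in_utf32_range(value: int) -> bool:
    """True if value does not exceed the UTF-32 upper limit."""
    return 0 <= value <= UTF32_MAX


def in_ucs4_range(value: int) -> bool:
    """True if value fits in the (assumed) 31-bit UCS-4 code space."""
    return 0 <= value <= UCS4_MAX


if __name__ == "__main__":
    for value in (0xFEFF, 0x10FFFF, 0x110000):
        print(f"U-{value:08X}", in_utf32_range(value), in_ucs4_range(value))
```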

BTW, BYTE ORDER MARK (code point U-0000FEFF) is serialized as FF FE 00 00
in little-endian UCS-4 or UTF-32 ("UTF-32LE") and as 00 00 FE FF in
big-endian UCS-4 or UTF-32 ("UTF-32BE").
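
These octet sequences can be checked, and used for detection, with a
short Python sketch. It uses the standard codecs constants BOM_UTF32_LE
and BOM_UTF32_BE; the helper names hex_octets and detect_utf32_bom are
hypothetical, introduced only for this example.

```python
import codecs


def hex_octets(data: bytes) -> str:
    """Format octets as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def detect_utf32_bom(data: bytes) -> str:
    """Return "LE" or "BE" if data starts with a UTF-32 BOM, else "none"."""
    if data.startswith(codecs.BOM_UTF32_LE):
        return "LE"
    if data.startswith(codecs.BOM_UTF32_BE):
        return "BE"
    return "none"


if __name__ == "__main__":
    bom = "\ufeff"  # BYTE ORDER MARK
    print(hex_octets(bom.encode("utf-32-le")))  # FF FE 00 00
    print(hex_octets(bom.encode("utf-32-be")))  # 00 00 FE FF
    print(detect_utf32_bom(bom.encode("utf-32-le") + "A".encode("utf-32-le")))
```

Note that FF FE is also the little-endian UTF-16 BOM, so a detector that
handles both encodings would need to test for the 4-octet form first.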

-Doug Ewell
 Fullerton, California


