RE: What does it mean to "not be a valid string in Unicode"? from Whistler, Ken on 2013-01-07 (Unicode Mail List Archive)

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Mon, 7 Jan 2013 21:33:11 +0000

Philippe Verdy said:

> Well then I don't know why you need a definition of an "Unicode 16-bit
> string". For me it just means exactly the same as "16-bit string", and
> the encoding in it is not relevant given you can put anything in it
> without even needing to be conformant to Unicode. So a Java string is
> exactly the same, a 16-bit string. The same also as Windows API 16-bit
> strings, or "wide strings" in a C compiler where "wide" is mapped by a
> compiler option to 16-bit code units for wchar_t ...

And elaborating on Mark's response a little:

[0x0061,0x0062,0x4E00,0xFFFF,0x0410]

Is a "Unicode 16-bit string". It contains "a", "b", a Han character, a noncharacter, and a Cyrillic character.

Because it is also well-formed as UTF-16, it is also a "UTF-16 string", by the definitions in the standard.

[0x0061,0xD800,0x4E00,0xFFFF,0x0410]

Is a "Unicode 16-bit string". It contains "a", a high-surrogate code unit, a Han character, a noncharacter, and a Cyrillic character.

Because an unpaired high-surrogate code unit is not allowed in UTF-16, this is *NOT* a "UTF-16 string".
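
As a rough illustration of that distinction, here is a minimal Python sketch that checks whether a sequence of 16-bit code units is well-formed UTF-16. The function name is_well_formed_utf16 is made up for this example and is not taken from the standard or any library. It only checks surrogate pairing, so noncharacters such as 0xFFFF pass, which matches the two examples above.

```python
def is_well_formed_utf16(units):
    """Return True if the 16-bit code units form well-formed UTF-16.

    Hypothetical helper for illustration: it only verifies that every
    high surrogate is immediately followed by a low surrogate and that
    no low surrogate appears on its own.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (u,))
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


# The two sequences discussed above.
first = [0x0061, 0x0062, 0x4E00, 0xFFFF, 0x0410]
second = [0x0061, 0xD800, 0x4E00, 0xFFFF, 0x0410]

print(is_well_formed_utf16(first))
print(is_well_formed_utf16(second))
```

Both lists are acceptable as "Unicode 16-bit strings"; only the first passes this check and so also counts as a "UTF-16 string".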

On the other hand, consider:

[0x0061,0x0062,0x88EA,0x8440]

That is *NOT* a Unicode 16-bit string. It contains "a", "b", a Han character, and a Cyrillic character. How do I know? Because I know the character set context. It is a wchar_t implementation of the Shift-JIS code page 932.

The difference is the declaration of which standard one uses to interpret what the 16-bit units mean. In a "Unicode 16-bit string" I go to the Unicode Standard to figure out how to interpret the numbers. In a "wide Code Page 932 string" I go to the specification of Code Page 932 to figure out how to interpret the numbers.
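
The same point can be sketched in Python by interpreting one list of 16-bit units under each declaration. The helper names are invented for this example, and the wide Code Page 932 layout is an assumption: each unit above 0xFF is taken to hold the lead byte in its high 8 bits and the trail byte in its low 8 bits, which may not match every wchar_t implementation.

```python
def decode_as_unicode16(units):
    """Interpret 16-bit units as UTF-16 code units (hypothetical helper)."""
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    return raw.decode("utf-16-be")


def decode_as_wide_cp932(units):
    """Interpret 16-bit units as wide Code Page 932 values (hypothetical helper).

    Assumption: units up to 0xFF are single-byte characters, and larger
    units pack a lead byte and a trail byte, high byte first.
    """
    raw = bytearray()
    for u in units:
        if u <= 0xFF:
            raw.append(u)
        else:
            raw.extend(u.to_bytes(2, "big"))
    return bytes(raw).decode("cp932")


units = [0x0061, 0x0062, 0x88EA, 0x8440]

# Same numbers, two different character set contexts.
as_unicode = decode_as_unicode16(units)
as_cp932 = decode_as_wide_cp932(units)

print([hex(ord(c)) for c in as_unicode])
print([hex(ord(c)) for c in as_cp932])
```

Under the Code Page 932 reading, the last two units come out as the Han character and the Cyrillic character described above. Under the Unicode reading, the same two numbers are simply the code points U+88EA and U+8440.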

This is no different, really, than talking about a "Latin-1 string" versus a "KOI-8 string".

--Ken
Received on Mon Jan 07 2013 - 15:36:09 CST
