Philippe Verdy said:
> Well then I don't know why you need a definition of an "Unicode 16-bit
> string". For me it just means exactly the same as "16-bit string", and
> the encoding in it is not relevant given you can put anything in it
> without even needing to be conformant to Unicode. So a Java string is
> exactly the same, a 16-bit string. The same also as Windows API 16-bit
> strings, or "wide strings" in a C compiler where "wide" is mapped by a
> compiler option to 16-bit code units for wchar_t ...
And elaborating on Mark's response a little:
[0x0061,0x0062,0x4E00,0xFFFF,0x0410]
Is a "Unicode 16-bit string". It contains "a", "b", a Han character, a noncharacter, and a Cyrillic character.
Because it is also well-formed as UTF-16, it is also a "UTF-16 string", by the definitions in the standard.
[0x0061,0xD800,0x4E00,0xFFFF,0x0410]
Is a "Unicode 16-bit string". It contains "a", a high-surrogate code unit, a Han character, a noncharacter, and a Cyrillic character.
Because an unpaired high-surrogate code unit is not allowed in UTF-16, this is *NOT* a "UTF-16 string".
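
To make the distinction between the two sequences above concrete, here is a minimal sketch of a well-formedness check. The function name `is_well_formed_utf16` is made up for this illustration and is not taken from the standard or from any library. It only tests surrogate pairing, which is what separates a "UTF-16 string" from a general "Unicode 16-bit string"; noncharacters such as 0xFFFF do not affect the result.

```python
def is_well_formed_utf16(units):
    """Return True if a sequence of 16-bit code units is well-formed UTF-16.

    Hypothetical helper for illustration: every high surrogate
    (0xD800..0xDBFF) must be immediately followed by a low surrogate
    (0xDC00..0xDFFF), and no low surrogate may appear on its own.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (u,))
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate needs a low surrogate right after it.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


first = [0x0061, 0x0062, 0x4E00, 0xFFFF, 0x0410]
second = [0x0061, 0xD800, 0x4E00, 0xFFFF, 0x0410]

first_ok = is_well_formed_utf16(first)    # noncharacter 0xFFFF is still allowed
second_ok = is_well_formed_utf16(second)  # unpaired 0xD800 makes this ill-formed
```

Both lists are Unicode 16-bit strings; only the first also passes this check.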
On the other hand, consider:
[0x0061,0x0062,0x88EA,0x8440]
That is *NOT* a Unicode 16-bit string. It contains "a", "b", a Han character, and a Cyrillic character. How do I know? Because I know the character set context. It is a wchar_t implementation of the Shift-JIS code page 932.
The difference lies in which standard is declared as the basis for interpreting the 16-bit units. In a "Unicode 16-bit string" I go to the Unicode Standard to figure out how to interpret the numbers. In a "wide Code Page 932 string" I go to the specification of Code Page 932 to figure out how to interpret the numbers.
This is no different, really, than talking about a "Latin-1 string" versus a "KOI-8 string".
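
As a rough illustration of how the declared character set changes the reading of the same numbers, the sketch below interprets [0x0061,0x0062,0x88EA,0x8440] both ways. The helper names are hypothetical. The wide packing is an assumption made for this example (single-byte characters stored as 0x00XX, double-byte characters as lead byte shifted left by 8 plus trail byte); actual wchar_t implementations may differ. Python's built-in `cp932` codec stands in for the Code Page 932 specification.

```python
def wide_cp932_to_text(units):
    """Interpret 16-bit units as a wide Code Page 932 string.

    Assumption for this sketch: single-byte characters are stored as
    0x00XX and double-byte characters as (lead << 8) | trail.
    """
    raw = bytearray()
    for u in units:
        if u < 0x100:
            raw.append(u)                     # single-byte character
        else:
            raw += bytes([u >> 8, u & 0xFF])  # lead byte, then trail byte
    return raw.decode("cp932")


def unicode16_to_text(units):
    """Interpret 16-bit units as Unicode code units.

    Simplification: each unit is read directly as a code point, which is
    enough here because the example contains no surrogates.
    """
    return "".join(chr(u) for u in units)


units = [0x0061, 0x0062, 0x88EA, 0x8440]

as_cp932 = wide_cp932_to_text(units)   # "a", "b", U+4E00 (Han), U+0410 (Cyrillic)
as_unicode = unicode16_to_text(units)  # "a", "b", U+88EA, U+8440 (two Han characters)
```

The numbers are identical in both calls; only the standard consulted to interpret them differs.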
--Ken
Received on Mon Jan 07 2013 - 15:36:09 CST