From: Phillips, Addison (addison@amazon.com)
Date: Sat Jun 27 2009 - 12:21:16 CDT
Venu,
Thanks for the detailed description.
The input is always readable text in some language (not necessarily English), not an arbitrary UTF-16 stream.
Let me put the question a different way.
Is it possible that a readable/valid string in any other language has a U+0000 in the middle?
AP> No. It doesn’t matter what the language is. The only character in Unicode (and thus UTF-16) that uses the code unit 0x0000 is NULL.
I understand that U+0000 is used to represent the NULL character. But is it always NULL irrespective of language/charset?
AP> Yes. Always.
One possibility I could think of is that, e.g., some Chinese character might have
one code point = two 16-bit code units,
AP> Some Chinese (and other characters from other scripts) in fact do use two 16-bit code units. These are called a “surrogate pair” and are restricted to a specific range of code units which are never null.
where the first 16-bit unit is something and the next 16-bit unit is U+0000. Is that possible?
AP> No.
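AP> To make that concrete: in Java a char is a UTF-16 code unit, so a small sketch (using U+20BB7 purely as an example supplementary character) shows that both halves of a surrogate pair fall in the surrogate range 0xD800..0xDFFF and can never be 0x0000.

    public class SurrogatePairDemo {
        public static void main(String[] args) {
            // U+20BB7 is an arbitrary example of a supplementary CJK character;
            // in UTF-16 it must be stored as a surrogate pair.
            String s = new String(Character.toChars(0x20BB7));
            for (int i = 0; i < s.length(); i++) {
                // s.charAt(i) is one 16-bit UTF-16 code unit
                System.out.printf("code unit %d: 0x%04X%n", i, (int) s.charAt(i));
            }
            // Prints 0xD842 and 0xDFB7: both inside the surrogate range
            // U+D800..U+DFFF, so neither half of any pair can be 0x0000.
        }
    }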
Is there any real-world character with such an encoding value? Does Unicode allow character sets to choose U+0000 for their code point representation?
AP> Unicode is the character set. It encodes the various scripts used to write the world’s languages, assigning each character a unique code point. The code point U+0000 is assigned (solely, uniquely) to NULL.
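AP> In practice this is why scanning UTF-16 code units for 0x0000 (say, to find a NUL terminator) can never split a character in half. A rough sketch, using a made-up buffer:

    public class NulScanDemo {
        public static void main(String[] args) {
            // Hypothetical buffer: two CJK characters followed by a NULL and
            // some trailing data. The 0x0000 unit can only be NULL itself.
            char[] units = "\u65E5\u672C\u0000tail".toCharArray();
            int end = 0;
            while (end < units.length && units[end] != 0x0000) {
                end++;  // safe: no other character's encoding contains 0x0000
            }
            System.out.println(new String(units, 0, end));  // prints the two CJK characters
        }
    }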
Addison
Addison Phillips
Globalization Architect -- Lab126
Internationalization is not a feature.
It is an architecture.