From: Venugopalan G (venunet@gmail.com)
Date: Sat Jun 27 2009 - 12:10:05 CDT
Hi Guys,
Thanks for the detailed description.
The input is always readable text from some language (not necessarily
English), not an arbitrary UTF-16 stream.
Let me put the question a different way:
Is it possible that a readable/valid string in any other language has a
U+0000 in the middle?
I understand that U+0000 is used to represent the NULL character. But is it
always NULL irrespective of language/charset?
One possibility I could think of is, e.g., that some Chinese character
might have one code point = two 16-bit code units,
where the first 16-bit unit is something and the next 16-bit unit is
U+0000. Is that possible?
Is there any real-world character with such an encoding? Does Unicode
allow character sets to choose U+0000 for their code point representation?
Regards,
Venu
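
A minimal C sketch of the question above, assuming nothing beyond how UTF-16
itself works (the function and sample characters are illustrative, not from
any particular library): a BMP character becomes a single code unit equal to
its code point, and a supplementary character becomes a surrogate pair whose
units lie in the range 0xD800..0xDFFF, so a 0x0000 code unit can only come
from U+0000 itself.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Encode one Unicode scalar value (not a surrogate) as UTF-16.
     * Returns the number of code units written to buf (1 or 2). */
    static int utf16_encode(uint32_t cp, uint16_t buf[2])
    {
        if (cp < 0x10000) {                         /* BMP: one code unit  */
            buf[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;                              /* supplementary plane */
        buf[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate      */
        buf[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate       */
        return 2;
    }

    int main(void)
    {
        /* U+4E2D (a BMP CJK character) and U+1D11E (a supplementary one) */
        uint32_t samples[] = { 0x4E2D, 0x1D11E };
        uint16_t buf[2];
        for (int i = 0; i < 2; i++) {
            int n = utf16_encode(samples[i], buf);
            for (int j = 0; j < n; j++) {
                printf("U+%04X unit %d: 0x%04X\n",
                       (unsigned)samples[i], j, (unsigned)buf[j]);
                assert(buf[j] != 0x0000);  /* zero only if cp were U+0000 */
            }
        }
        return 0;
    }
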
On Sat, Jun 27, 2009 at 10:28 PM, Doug Ewell <doug@ewellic.org> wrote:
> John (Eljay) Love-Jensen <eljay at adobe dot com> replied to Venugopalan G:
>
>>> Like, if I iterate through each Unicode character (16 bits), will I find
>>> zero at any time?
>>>
>>
>> It is possible.
>>
>>> Basically, can I use zero to represent termination of a UTF-16 string?
>>>
>>
>> If your *OWN* encoding reserves 0x0000 for UTF-16 termination, and what
>> you are encoding itself does not have U+0000 code points in it, then using
>> 0x0000 as your own string termination instead of representing a UTF-16 code
>> point is a reasonable compromise.
>>
>> But if what you are trying to represent is any valid Unicode sequence,
>> then U+0000 would be a valid Unicode code point which your string class
>> could not contain. Unless you re-encode U+0000 as something else... but
>> then your string class would not be UTF-16, it would be something
>> close-to-but-not-quite UTF-16, and could not stream in "pure" UTF-16 without
>> translating it into your close-but-not-quite UTF-16.
>>
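
As a rough sketch of the compromise described above, with names made up for
the example: before storing text as a NUL-terminated UTF-16 buffer, one can
check that the data itself contains no U+0000, so that the terminator never
collides with real content.

    #include <stddef.h>
    #include <stdint.h>

    /* Returns 1 if the 'len' UTF-16 code units contain no 0x0000 unit and
     * can therefore be stored NUL-terminated without loss; 0 otherwise. */
    static int utf16_nul_terminatable(const uint16_t *units, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (units[i] == 0x0000)
                return 0;  /* embedded U+0000 would be cut off by a NUL */
        return 1;
    }
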
>
> To clarify slightly, the problem is no different for Unicode from what it
> is for ASCII or ISO 8859. ASCII itself does not prohibit 0x00 as part of a
> string, because the definition of "string" is outside its scope. Likewise,
> Unicode does not prohibit U+0000. However, most modern protocols do treat
> "null" as invalid within a string, usually in the role of string terminator.
>
> Strings that did not contain 0x00 in an 8-bit character set will not
> contain U+0000 when converted to Unicode.
>
> If you want to process *any arbitrary sequence of Unicode characters* as a
> string, then you may have problems with U+0000 -- but that would have been
> true if you wanted to process any arbitrary sequence of bytes as an ASCII
> string.
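
A brief sketch of the alternative implied here, with illustrative names only:
if arbitrary sequences (including U+0000) must be representable, carry an
explicit length instead of relying on a terminating code unit.

    #include <stddef.h>
    #include <stdint.h>

    /* A counted UTF-16 string: the code units may legitimately include
     * 0x0000, because the length is stored explicitly rather than implied
     * by a terminator. */
    typedef struct {
        const uint16_t *units;  /* UTF-16 code units, 0x0000 allowed */
        size_t          len;    /* number of code units              */
    } utf16_view;
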
>
> --
> Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
> http://www.ewellic.org
> http://www1.ietf.org/html.charters/ltru-charter.html
> http://www.alvestrand.no/mailman/listinfo/ietf-languages
>
>