From: Doug Ewell (doug@ewellic.org)
Date: Sat Jun 27 2009 - 11:58:10 CDT
John (Eljay) Love-Jensen <eljay at adobe dot com> replied to Venugopalan G:
>> Like, if I iterate through each Unicode character (16 bits), will I
>> find zero at any time?
>
> It is possible.
>
>> Basically, can I use zero to represent termination of a U16 string?
>
> If your *OWN* encoding reserves 0x0000 for UTF-16 termination, and
> what you are encoding itself does not have U+0000 code points in it,
> then using 0x0000 as your own string termination instead of
> representing a UTF-16 code point is a reasonable compromise.
>
> But if what you are trying to represent is any valid Unicode sequence,
> then U+0000 would be a valid Unicode code point which your string
> class could not contain. Unless you re-encode U+0000 as something
> else... but then your string class would not be UTF-16; it would be
> something close-to-but-not-quite UTF-16, and could not stream in
> "pure" UTF-16 without translating UTF-16 into your close-but-not-quite
> UTF-16.
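
To make Eljay's point concrete, here is a minimal C++ sketch (my own
illustration, not from the thread; the helper name u16len is hypothetical):
a terminator-based length scan over UTF-16 code units cannot tell an
embedded U+0000 apart from the end of the string.

    #include <cstddef>
    #include <iostream>

    // Hypothetical helper: naive terminator-based length scan over UTF-16
    // code units, analogous to strlen(). It is only safe if the data is
    // guaranteed never to contain U+0000.
    std::size_t u16len(const char16_t* s) {
        std::size_t n = 0;
        while (s[n] != u'\0') ++n;
        return n;
    }

    int main() {
        // "AB", an embedded U+0000, "CD", then the intended terminator.
        const char16_t data[] = { u'A', u'B', u'\0', u'C', u'D', u'\0' };

        // The scan reports 2, not 5: everything after the embedded U+0000
        // is invisible to any terminator-based consumer.
        std::cout << u16len(data) << "\n";
        return 0;
    }
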
To clarify slightly, the problem is no different for Unicode from what
it is for ASCII or ISO 8859. ASCII itself does not prohibit 0x00 as
part of a string, because the definition of "string" is outside its
scope. Likewise, Unicode does not prohibit U+0000. However, most
modern protocols do treat "null" as invalid within a string, usually in
the role of string terminator.
Strings that did not contain 0x00 in an 8-bit character set will not
contain U+0000 when converted to Unicode.
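
A small sketch of why the conversion preserves that property (assuming a
straightforward widening conversion such as ISO 8859-1 to UTF-16; the
function name is mine): each byte 0xNN maps to code point U+00NN, so the
output can contain U+0000 only where the input contained 0x00.

    #include <string>

    // Widen an ISO 8859-1 (Latin-1) byte string to UTF-16 code units.
    // Byte 0xNN maps directly to code point U+00NN, so the result can
    // contain U+0000 only where the input contained 0x00.
    std::u16string latin1_to_utf16(const std::string& in) {
        std::u16string out;
        out.reserve(in.size());
        for (unsigned char c : in)
            out.push_back(static_cast<char16_t>(c));
        return out;
    }
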
If you want to process *any arbitrary sequence of Unicode characters* as
a string, then you may have problems with U+0000 -- but that would have
been true if you wanted to process any arbitrary sequence of bytes as an
ASCII string.
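
One common way out, offered here only as a hedged sketch rather than a
recommendation from the thread, is to carry the length explicitly instead of
relying on a terminator; then U+0000 is just another code unit. In C++,
std::u16string stores its length separately, while a terminator-based view
of the same buffer sees only the prefix before the first NUL.

    #include <iostream>
    #include <string>

    int main() {
        // Explicit-length storage: U+0000 is just another code unit.
        std::u16string s = { u'A', u'B', u'\0', u'C', u'D' };
        std::cout << s.size() << "\n";  // prints 5

        // A terminator-based view of the same buffer sees only the "AB" prefix.
        std::cout << std::char_traits<char16_t>::length(s.c_str()) << "\n";  // prints 2
        return 0;
    }
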
--
Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages