From: Doug Ewell (doug@ewellic.org)
Date: Sat Jun 27 2009 - 11:58:10 CDT
John (Eljay) Love-Jensen <eljay at adobe dot com> replied to Venugopalan G:
>> Like, if I iterate through each Unicode character (16 bits), will I
>> find zero at any time?
>
> It is possible.
>
>> Basically, can I use zero to represent termination of a U16 string?
>
> If your *OWN* encoding reserves 0x0000 for UTF-16 termination, and
> what you are encoding itself does not have U+0000 code points in it,
> then using 0x0000 as your own string termination instead of
> representing a UTF-16 code point is a reasonable compromise.
>
> But if what you are trying to represent is any valid Unicode sequence,
> then U+0000 would be a valid Unicode code point which your string
> class could not contain. Unless you re-encode U+0000 as something
> else... but then your string class would not be UTF-16; it would be
> something close-to-but-not-quite UTF-16, and could not stream in
> "pure" UTF-16 without translating UTF-16 into your close-but-not-quite
> UTF-16.
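
To make Eljay's point concrete, here is a minimal C++ sketch (my own
illustration, not from the thread; the helper name u16len is hypothetical):
a terminator-based length scan over UTF-16 code units cannot tell an
embedded U+0000 apart from the end of the string.

    #include <cstddef>
    #include <iostream>

    // Hypothetical helper: naive terminator-based length scan over UTF-16
    // code units, analogous to strlen(). It is only safe if the data is
    // guaranteed never to contain U+0000.
    std::size_t u16len(const char16_t* s) {
        std::size_t n = 0;
        while (s[n] != u'\0') ++n;
        return n;
    }

    int main() {
        // "AB", an embedded U+0000, "CD", then the intended terminator.
        const char16_t data[] = { u'A', u'B', u'\0', u'C', u'D', u'\0' };

        // The scan reports 2, not 5: everything after the embedded U+0000
        // is invisible to any terminator-based consumer.
        std::cout << u16len(data) << "\n";
        return 0;
    }
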
To clarify slightly, the problem is no different for Unicode from what
it is for ASCII or ISO 8859. ASCII itself does not prohibit 0x00 as
part of a string, because the definition of "string" is outside its
scope. Likewise, Unicode does not prohibit U+0000. However, most
modern protocols do treat "null" as invalid within a string, usually in
the role of string terminator.
Strings that did not contain 0x00 in an 8-bit character set will not
contain U+0000 when converted to Unicode.
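
A small sketch of why the conversion preserves that property (assuming a
straightforward widening conversion such as ISO 8859-1 to UTF-16; the
function name is mine): each byte 0xNN maps to code point U+00NN, so the
output can contain U+0000 only where the input contained 0x00.

    #include <string>

    // Widen an ISO 8859-1 (Latin-1) byte string to UTF-16 code units.
    // Byte 0xNN maps directly to code point U+00NN, so the result can
    // contain U+0000 only where the input contained 0x00.
    std::u16string latin1_to_utf16(const std::string& in) {
        std::u16string out;
        out.reserve(in.size());
        for (unsigned char c : in)
            out.push_back(static_cast<char16_t>(c));
        return out;
    }
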
If you want to process *any arbitrary sequence of Unicode characters* as
a string, then you may have problems with U+0000 -- but that would have
been true if you wanted to process any arbitrary sequence of bytes as an
ASCII string.
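
One common way out, offered here only as a hedged sketch rather than a
recommendation from the thread, is to carry the length explicitly instead of
relying on a terminator; then U+0000 is just another code unit. In C++,
std::u16string stores its length separately, while a terminator-based view
of the same buffer sees only the prefix before the first NUL.

    #include <iostream>
    #include <string>

    int main() {
        // Explicit-length storage: U+0000 is just another code unit.
        std::u16string s = { u'A', u'B', u'\0', u'C', u'D' };
        std::cout << s.size() << "\n";  // prints 5

        // A terminator-based view of the same buffer sees only the "AB" prefix.
        std::cout << std::char_traits<char16_t>::length(s.c_str()) << "\n";  // prints 2
        return 0;
    }
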
--
Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages