Re: Zero termination

From: Venugopalan G (venunet@gmail.com)
Date: Sat Jun 27 2009 - 12:10:05 CDT


    Hi Guys,

    Thanks for the detailed description.
    The input is always readable text in some language (not necessarily
    English), not an arbitrary UTF-16 stream.
    Let me put the question a different way.

    Is it possible that a readable/valid string in some other language has a
    U+0000 in the middle?
    I understand that U+0000 is used to represent the NULL character. But is it
    always NULL irrespective of language/charset?

    One possibility I could think of: some Chinese character, for example, might
    have one code point encoded as two 16-bit code units, where the first 16-bit
    unit is something and the next 16-bit unit is 0x0000. Is that possible?
    Is there any real-world character with such an encoded value? Does Unicode
    allow character sets to use 0x0000 as part of their code point
    representation?

    Regards,
    Venu
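
    As background for the surrogate-pair question above, here is a minimal C
    sketch (my own illustration, not part of the original thread) of how UTF-16
    encodes a supplementary-plane code point. Both halves of a surrogate pair
    fall in the range 0xD800..0xDFFF, so a pair can never contain a 0x0000 code
    unit; only U+0000 itself encodes as the single code unit 0x0000.

        #include <assert.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Encode one Unicode scalar value (<= 0x10FFFF, not a surrogate) as
           UTF-16. Returns the number of 16-bit code units written (1 or 2). */
        static int encode_utf16(uint32_t cp, uint16_t out[2])
        {
            if (cp < 0x10000) {
                out[0] = (uint16_t)cp;                      /* BMP: one code unit */
                return 1;
            }
            cp -= 0x10000;                                  /* supplementary plane */
            out[0] = (uint16_t)(0xD800 | (cp >> 10));       /* high surrogate */
            out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));     /* low surrogate */
            return 2;
        }

        int main(void)
        {
            uint16_t units[2];
            int n = encode_utf16(0x20000, units);           /* U+20000, a CJK ideograph */
            assert(n == 2);
            printf("U+20000 -> 0x%04X 0x%04X\n", units[0], units[1]); /* 0xD840 0xDC00 */
            return 0;
        }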

    On Sat, Jun 27, 2009 at 10:28 PM, Doug Ewell <doug@ewellic.org> wrote:

    > John (Eljay) Love-Jensen <eljay at adobe dot com> replied to Venugopalan G:
    >
    >>> Like, if I iterate through each Unicode character (16 bits), will I find
    >>> zero at any time?
    >>>
    >>
    >> It is possible.
    >>
    >>> Basically, can I use zero to represent termination of a UTF-16 string?
    >>>
    >>
    >> If your *OWN* encoding reserves 0x0000 for UTF-16 termination, and what
    >> you are encoding itself does not have U+0000 code points in it, then using
    >> 0x0000 as your own string termination instead of representing a UTF-16 code
    >> point is a reasonable compromise.
    >>
    >> But if what you are trying to represent is any valid Unicode sequence,
    >> then U+0000 would be a valid Unicode code point which your string class
    >> could not contain. Unless you re-encode U+0000 as something else... but
    >> then your string class would not be UTF-16, it would be something
    >> close-to-but-not-quite UTF-16, and could not stream in "pure" UTF-16 without
    >> translating that UTF-16 into your close-but-not-quite UTF-16.
    >>
    >
    > To clarify slightly, the problem is no different for Unicode from what it
    > is for ASCII or ISO 8859. ASCII itself does not prohibit 0x00 as part of a
    > string, because the definition of "string" is outside its scope. Likewise,
    > Unicode does not prohibit U+0000. However, most modern protocols do treat
    > "null" as invalid within a string, usually in the role of string terminator.
    >
    > Strings that did not contain 0x00 in an 8-bit character set will not
    > contain U+0000 when converted to Unicode.
    >
    > If you want to process *any arbitrary sequence of Unicode characters* as a
    > string, then you may have problems with U+0000 -- but that would have been
    > true if you wanted to process any arbitrary sequence of bytes as an ASCII
    > string.
    >
    > --
    > Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
    > http://www.ewellic.org
    > http://www1.ietf.org/html.charters/ltru-charter.html
    > http://www.alvestrand.no/mailman/listinfo/ietf-languages
    >
    >
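
    Following up on the points quoted above: whether 0x0000 is safe as an
    in-band terminator depends entirely on whether the data can ever contain
    U+0000. A minimal C sketch (the function name is illustrative, not from the
    thread) that checks a length-counted UTF-16 buffer for an embedded 0x0000
    code unit before treating it as NUL-terminated:

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Returns true if the buffer of 'len' UTF-16 code units contains no
           0x0000 code unit, i.e. it can be handled as a NUL-terminated string
           without silently truncating at an embedded U+0000. */
        static bool utf16_nul_safe(const uint16_t *buf, size_t len)
        {
            for (size_t i = 0; i < len; i++) {
                if (buf[i] == 0x0000)
                    return false;
            }
            return true;
        }

        int main(void)
        {
            const uint16_t ok[]  = { 0x0048, 0x0069 };          /* "Hi" */
            const uint16_t bad[] = { 0x0048, 0x0000, 0x0069 };  /* embedded U+0000 */
            printf("%d %d\n", utf16_nul_safe(ok, 2), utf16_nul_safe(bad, 3)); /* 1 0 */
            return 0;
        }

    If the check fails, the data has to stay length-counted (or U+0000 has to be
    re-encoded, at which point the buffer is no longer plain UTF-16, as noted
    above).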


