RE: Zero termination

From: John (Eljay) Love-Jensen (eljay@adobe.com)
Date: Sat Jun 27 2009 - 11:24:57 CDT


    Hi Venu,

    > I just want to know if a valid UTF16 string can contain the value zero(0), not the character zero but the 16bit value zero.

    In UTF-16, the Unicode code point U+0000 is encoded as 0x0000.

    > Like, if i iterate through each unicode character(16 bits), will i find zero at any time?

    It is possible.

    > Is Zero a valid code point or a part of a code point?

    Yes, U+0000 is a valid Unicode code point, and 0x0000 is the UTF-16 encoding of that U+0000 code point.
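
    As a minimal sketch of the point above, assuming Python 3 and little-endian UTF-16 without a byte order mark (the variable names are only illustrative), you can encode a string containing U+0000 and look at the resulting 16-bit code units:

```python
import struct

# Three code points: U+0041, U+0000, U+0042.
text = "A\u0000B"

# Encode as little-endian UTF-16 with no byte order mark.
data = text.encode("utf-16-le")

# Unpack the bytes into unsigned 16-bit code units.
code_units = list(struct.unpack("<%dH" % (len(data) // 2), data))

for unit in code_units:
    print("0x%04X" % unit)
```

    The middle code unit is 0x0000, so iterating over the 16-bit values of valid UTF-16 can indeed hit zero.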

    > Basically can i use zero to represent termination of a U16 string?

    If your *OWN* format reserves 0x0000 as the string terminator, and the text you are encoding never contains U+0000, then treating 0x0000 as a terminator rather than as a UTF-16 code unit is a reasonable compromise.

    But if you need to represent any valid Unicode sequence, then U+0000 is a valid Unicode code point that your string class could not contain. Unless you re-encode U+0000 as something else... but then your string class would not be UTF-16; it would be something close-to-but-not-quite UTF-16, and it could not stream in "pure" UTF-16 without translating that UTF-16 into your close-but-not-quite UTF-16.

    > because if zero is in the middle of str, then the program will terminate in wrong place.

    Yep, that is correct.
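
    To illustrate the truncation, here is a rough sketch in Python 3. The helpers to_code_units, from_code_units, and read_zero_terminated are hypothetical names made up for this example, and the reader simply mimics what a C-style zero-terminated routine would do:

```python
import struct

def to_code_units(text):
    """Encode text as a list of little-endian UTF-16 code units."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def from_code_units(units):
    """Decode a list of UTF-16 code units back into text."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

def read_zero_terminated(units):
    """Return the code units before the first 0x0000, C-style."""
    out = []
    for unit in units:
        if unit == 0x0000:
            break
        out.append(unit)
    return out

# A string with an embedded U+0000, followed by the intended terminator.
buffer = to_code_units("abc\u0000def") + [0x0000]

seen = from_code_units(read_zero_terminated(buffer))
print(repr(seen))
```

    The reader stops at the embedded U+0000 and never reaches "def". A representation that carries an explicit length does not have this problem.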

    Different languages handle the issue in different ways.

    For example, C++ has a std::basic_string template class, which can contain UTF-16 code units. Such a std::basic_string<utf16> can hold U+0000, since 0x0000 is not used as a terminator. (You'd have to roll your own utf16 data type, since wchar_t may or may not be suitable as UTF-16 and certainly isn't portable.)

    Python 3 has Unicode support for strings. I haven't tried to ingest UTF-16 data files that contain U+0000.
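
    A small sketch of such a check, assuming Python 3 and an in-memory byte string standing in for a UTF-16 data file (the sample text and the leading byte order mark are assumptions made for the example):

```python
import codecs

# Simulated file contents: a little-endian BOM followed by UTF-16 text
# that has U+0000 in the middle.
raw = codecs.BOM_UTF16_LE + "one\u0000two".encode("utf-16-le")

# The "utf-16" codec reads the BOM to pick the byte order and drops it.
text = raw.decode("utf-16")

print(len(text), text.index("\u0000"))
```

    Python 3 strings store their length rather than relying on a terminator, so the decoded string keeps all seven code points, including the U+0000 at index 3.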

    Ruby is getting there.

    Perl 5.8 has Unicode-savvy I/O, and I presume the forthcoming Perl 6 will be even better.

    Sincerely,
    --Eljay


