From: Kenneth Whistler (kenw@sybase.com)
Date: Thu May 22 2003 - 18:42:09 EDT
Philippe Verdy said:
> > > If you want to store a NUL ASCII in your serialization (so that null
> > > Unicode codepoints will be preserved by the encoding), you may use an
> > > exception, by escaping it (like does Java internally in the JNI
> > > interface).
> > >
> > > This is NOT allowed in UTF-8 but is a trivial extension, used also in
> > > the alternate CESU-8 encoding (which is an encoding "scheme" "similar"
> > > to UTF-8, except that it is derived from the UTF-16 encoding "form",
> > > instead of the UTF-32 encoding "form"): encode a NULL codepoint with
> > > the pair of bytes (0xC0; 0x80).
> >
And Doug Ewell replied:
> > This is not allowed anywhere except in internal processing, where
> > anything goes. Do not recommend this. (Fortunately, the issue seldom
> > comes up in the real world because most people don't need to store
> > U+0000 in plain text files.)
>
Philippe Verdy followed up:
> No, this feature is needed because U+0000 is a legal Unicode
> character/codepoint that may be needed. (Note the IF that I
> used at the beginning of the paragraph).
Abdij Bhat did not indicate that he needed any such thing. He
said he had a problem trying to use UTF-16 serializations, since
the device he was communicating with understood only ASCII (which
I take to mean was limited to an 8-bit string communication
protocol), and thus could not handle the UTF-16 serialization,
since it stopped on NUL byte values.
If he is using some string communication protocol which is
ASCII compatible, then if there was some behavior (such as
signaling the end of string input) for an ASCII NUL before, then
he will want equivalent behavior for a NUL in the UTF-8 data,
and hence would want it to actually *be* a 0x00 byte, not
something escaped. In other words, UTF-8, as it is, is exactly
what he would need for the ASCII compatibility, including NUL.
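To make the byte-level difference concrete, here is a minimal Python
sketch. The sample text "OK" and the helper name contains_nul_byte are
illustrative assumptions, not anything taken from Abdij Bhat's device
or protocol. It compares the two serializations of the same ASCII-range
text and then shows how standard UTF-8 encodes U+0000 itself.

```python
# Sketch: compare the bytes produced by UTF-16 and UTF-8 for the same text.
# "OK" is an arbitrary sample string chosen for illustration.
text = "OK"

utf16_bytes = text.encode("utf-16-le")  # two bytes per ASCII-range character
utf8_bytes = text.encode("utf-8")       # one byte per ASCII-range character


def contains_nul_byte(data: bytes) -> bool:
    """Return True if the serialized data contains any 0x00 byte."""
    return 0 in data


# UTF-16 pads every ASCII-range character with a 0x00 byte, so a receiver
# that stops at NUL will cut the data off early.
print(utf16_bytes, contains_nul_byte(utf16_bytes))

# UTF-8 leaves ASCII-range text byte-for-byte identical to ASCII.
print(utf8_bytes, contains_nul_byte(utf8_bytes))

# In standard UTF-8, U+0000 is the single byte 0x00, exactly as in ASCII.
nul_as_utf8 = "\u0000".encode("utf-8")
print(nul_as_utf8)
```

A receiver that treats 0x00 as end of input therefore keeps the same
behavior when it is fed UTF-8 instead of ASCII.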
So Doug is correct. 0xC0 0x80 is not a permissible representation
of U+0000 in UTF-8, and it is bad advice to recommend to people
that they should use it.
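This can be checked against any conforming decoder. The sketch below
uses Python's built-in strict UTF-8 codec as one example of such a
decoder; the function name is_valid_utf8 is a hypothetical helper
introduced only for this illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# The one-byte form is the only UTF-8 representation of U+0000.
print(is_valid_utf8(b"\x00"))

# 0xC0 0x80 is an overlong (non-shortest) form, so a strict decoder
# rejects it rather than decoding it as U+0000.
print(is_valid_utf8(b"\xc0\x80"))
```

Data containing 0xC0 0x80 would therefore be refused by a receiver that
validates its input as UTF-8.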
> There are many uses of the (0xC0;0x80) bytes sequences if one
> wants to store NUL characters within strings that are NUL
> terminated (and not delimited by a separate encoding length field).
You can of course do such things internally (as Java does), but
in general this is not recommended even for internal use. If
you implement strings with NUL termination, don't try to
embed NUL characters in them -- it is just a needless complication,
particularly if you have to add epicycles on top of the character
encoding to accomplish it. It tends to lead to program bugs,
interoperability problems, and, in some cases, security problems.
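A small Python sketch of the kind of trouble meant here. Both functions
are hypothetical stand-ins written for this illustration:
c_string_view imitates what a NUL-terminated string API sees, and
decode_with_c0_80_escape imitates a non-standard decoder that accepts
0xC0 0x80 as U+0000. Neither is taken from Java or any real library.

```python
def c_string_view(buffer: bytes) -> bytes:
    """Return the bytes a NUL-terminated string API would see."""
    end = buffer.find(b"\x00")
    return buffer if end == -1 else buffer[:end]


def decode_with_c0_80_escape(data: bytes) -> str:
    """Hypothetical non-standard decoder: maps 0xC0 0x80 to U+0000.

    0xC0 never occurs in well-formed UTF-8, so the replacement only
    touches the escape sequence itself.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


escaped = b"safe\xc0\x80hidden"

# A check for raw NUL bytes made before decoding finds nothing.
passes_nul_check = b"\x00" not in escaped

# After decoding, the string does contain U+0000.
decoded = decode_with_c0_80_escape(escaped)

# Any later layer that re-serializes to real UTF-8 and hands the result
# to a NUL-terminated API silently loses everything after the NUL.
seen_by_c_api = c_string_view(decoded.encode("utf-8"))

print(passes_nul_check, repr(decoded), seen_by_c_api)
```

The two layers disagree about where the string ends, which is how the
bugs and security problems tend to arise.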
> I clearly said that this sort of encoding was NOT standard,
> implying that this is used only as an "upper level protocol"
> using the Unicode terminology. This is fully allowed in this
> context, because he did not specify clearly the compatibility
> features needed by its hardware device (as part of its
> specification, one can fully describe its interface as using
> this exception on top of UTF-8, or on top of CESU-8).
One could, but it is still bad advice. UTF-8 is, I suspect,
exactly what would address Abdij Bhat's problem, and there is
no reason to layer some protocol on top of that to represent
NUL characters in some non-standard way.
BTW, the Unicode Technical Committee does *not* recommend
CESU-8 for open data interchange. It should not be
mentioned as parallel to UTF-8 as an option for people's
implementations. But that is another story...
--Ken