From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 23 2003 - 21:46:19 EDT
Philippe Verdy continued:
> From: "Kenneth Whistler" <kenw@sybase.com>
> > And I think this is a *terrible* idea, which will be roundly
> > rejected. Let me state it one last time: it is bad advice to
> > recommend that people use <0xC0 0x80> to represent U+0000
> > (as any kind of extension to UTF-8).
>
> Say that to Sun;
People have said it to Sun. ;-)
> I think that it will not break its backward compatibility for JNI
> or break its support for NULL characters within Strings just
> because UTF-8 forbids it. All Sun will do is update its
> documentation by saying that its interface is not UTF-8 but an
> extension to it.
What Philippe is referring to can be found at:
http://java.sun.com/j2se/1.3/docs/guide/jni/spec/types.doc.html#16542
"UTF-8 Strings"
That documentation *already* recognizes that the "UTF-8 Strings" for
JNI are not conformant UTF-8, to wit:
"There are two differences between this format and the 'standard'
UTF-8 format. First, the null byte (byte)0 is encoded using
the two-byte format rather than the one-byte format. This means
that Java VM UTF-8 strings never have embedded nulls. Second,
only the one-byte, two-byte, and three-byte formats are used. The
Java VM does not recognize the longer UTF-8 formats."
This has been a long-known fact about the internal Java VM "UTF-8 String"
format. And Sun and Java know that this internal Java VM format,
also used for the JNI interface, should not be exchanged openly,
labelled as "UTF-8".
The java.io InputStreamReader and OutputStreamWriter are for
public interchange, and they *do* use conformant UTF-8:
http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html#charenc
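To see the difference concretely, here is a minimal Java sketch (an
illustration, not from the spec text above): DataOutputStream.writeUTF
produces the modified form that the JNI documentation describes, while
String.getBytes("UTF-8") produces standard UTF-8, so an embedded U+0000
comes out differently in the two byte streams.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class NulEncodingDemo {
        public static void main(String[] args) throws Exception {
            String s = "A\u0000B";  // a string with an embedded U+0000

            // Standard UTF-8: U+0000 is the single byte 0x00.
            printHex("standard UTF-8: ", s.getBytes("UTF-8"));   // 41 00 42

            // Modified UTF-8 (what writeUTF and the JNI string functions
            // use): U+0000 becomes the two-byte sequence 0xC0 0x80, so
            // the encoded bytes never contain an embedded null byte.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            printHex("modified UTF-8: ", bos.toByteArray());     // 00 04 41 C0 80 42
            // (writeUTF prepends a two-byte length prefix, here 00 04)
        }

        static void printHex(String label, byte[] bytes) {
            StringBuilder sb = new StringBuilder(label);
            for (byte b : bytes) sb.append(String.format("%02X ", b));
            System.out.println(sb);
        }
    }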
> The "string terminator" semantic of byte 0x00 is a standardized
> convention widely used, independantly of Unicode which does
> not specify this semantic,
Correct.
> and admittedly considers U+0000 as a plain abstract character,
> with a clear Control Character semantic, which does not prohibit
> its use in the significant part of the string or in the middle of it.
Also correct. But other than use of "U+0000" to refer to NULL, that
is also correct for US ASCII and ISO 8859-1 (and all the other ISO
8859 parts). For them, too, 0x00 is a plain abstract character,
a control code (whose semantics are defined by ISO 6429 or other
standards for control functions); they also do not prohibit its
use in the middle of a string. They, quite wisely, have nothing to
say in the matter.
You seem to keep missing the point that the behavior of NULL,
represented as 0x00 in UTF-8, represented as 0x00 in US ASCII,
represented as 0x00 in ISO 8859-1, and represented as 0x00 in
GB 2312-1980, MacRoman, Code Page 1252, ... is exactly the same
in strings for any of those encodings. The issues for embedding
NULLs in strings when those strings are used with C runtime
libraries or other C APIs that use NULL-terminated string
conventions are exactly the same. This has nothing to do with
some Unicode-specific differences in how NULL is interpreted
or handled in C environments.
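A small Java check of that point (a sketch for illustration): each of
these encodings yields the same single 0x00 byte for U+0000, so a
NUL-terminated C consumer would truncate all of them at exactly the
same place.

    public class NulIsNulDemo {
        public static void main(String[] args) throws Exception {
            String s = "abc\u0000def";
            for (String enc : new String[] { "US-ASCII", "ISO-8859-1", "UTF-8" }) {
                byte[] b = s.getBytes(enc);
                // In each encoding U+0000 is the single byte 0x00 at index 3,
                // so a strlen()-style consumer stops right there.
                System.out.printf("%-10s length=%d byte[3]=0x%02X%n",
                                  enc, b.length, b[3]);
            }
        }
    }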
> In fact I have found several applications that now use the
> forbidden sequences of UTF-8 as a way to insert "escaped"
> markup within a UTF-8 string.
Then you have found non-conforming applications, if they claim
that they are using UTF-8 with such conventions.
> There are other escaping conventions in use, notably in XML,
> which makes special use of the "<" and ">" characters, quotes
> for attributes, and ampersands.
What has that to do with the price of onions? I've never said
that escape mechanisms for the quoting of characters are bad.
It is obvious that many formal language syntaxes and markup
systems make use of them, for obvious reasons.
> Unicode cannot forbid or even recommend not using escaping
> mechanisms on top of any of its UTF encoding schemes, simply
> because there's no other way to build actual applications
> without such additional mechanism.
The Unicode Technical Committee cannot forbid people from doing
silly things or prevent people from making mistakes in their
string handling.
It can (and does) declare what conformant UTF-8 means. And people
who notice implementations that do things with UTF-8 which do
not follow the specification are within their rights to declare
such implementations to be nonconformant to the Unicode Standard
(and to ISO/IEC 10646).
And I dispute your claim that "there's no other way to build actual
applications without such additional mechanism," if what you
are talking about is specifically UTF-8. Lots of people have
done so, including me, and many of us use C runtime libraries
and NULL-terminated strings in doing so.
>
> You are worried about the term "trivial extension". Consider
> it clearly: any trivial extension is an extension and thus
> not the standard. The term "trivial" just designates the ease
> with which it can be encoded as an exception to the standard,
> without breaking the encoding of all other characters.
Trivial extensions are often the most damaging, because the
differences tend not to be obvious to most implementers, who
get hit after the fact with subtle problems and interoperability
concerns that they were unaware of up front. Non-shortest UTF-8
and CESU-8 both fall in that category, since people can go along
assuming they are UTF-8 for a long time, and then suddenly get
whacked with a problem they didn't anticipate.
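A concrete instance of getting whacked, as a sketch: Java's own strict
UTF-8 decoder (java.nio, with malformed input set to REPORT) rejects the
<0xC0 0x80> sequence outright, so data written with that "trivial
extension" fails the moment it reaches a conformant consumer.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class StrictDecodeDemo {
        public static void main(String[] args) {
            // "A", the non-shortest-form pair <C0 80>, then "B"
            byte[] bytes = { 0x41, (byte) 0xC0, (byte) 0x80, 0x42 };
            CharsetDecoder strict = Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                strict.decode(ByteBuffer.wrap(bytes));
                System.out.println("accepted");
            } catch (CharacterCodingException e) {
                // A conformant UTF-8 decoder rejects <C0 80>; Java's strict
                // decoder reports it as malformed input.
                System.out.println("rejected: " + e);
            }
        }
    }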
>
> I took enough precautions to explain it (notably by using "if",
> or "you can", "you could", and "extension") so that this cannot
> be confused with the UTF-8 standard... I also did not want to
> explain fully the details of the UTF-8 algorithm, pointing the
> user to the standard document for all details needed for its
> implementation. That's enough for me and should be enough for
> everybody. The question was not really about Unicode but about
> a concrete application of it, due to constraints. This makes a
> clear difference.
The problem that occasioned this thread was the result of
someone trying to push byte-serialized UTF-16 at a device
API that choked on embedded null bytes. The generic answer for
such a problem is to use UTF-8 instead.
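The reason that answer works, shown as a minimal Java sketch:
byte-serialized UTF-16 is full of 0x00 bytes even for plain ASCII text,
while UTF-8 contains 0x00 only for an actual U+0000.

    public class NullByteDemo {
        public static void main(String[] args) throws Exception {
            String s = "ABC";
            printHex("UTF-16BE: ", s.getBytes("UTF-16BE")); // 00 41 00 42 00 43
            printHex("UTF-8:    ", s.getBytes("UTF-8"));    // 41 42 43
        }

        static void printHex(String label, byte[] bytes) {
            StringBuilder sb = new StringBuilder(label);
            for (byte b : bytes) sb.append(String.format("%02X ", b));
            System.out.println(sb);
        }
    }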
All the subsequent analysis and suggestions to use <0xC0, 0x80>
for NULL in UTF-8, and the wandering on about higher-level
protocols and whether the UTC can or cannot prevent people from
using them, were basically irrelevant to the problem.
--Ken