Re: UTF-8 syntax

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jun 08 2001 - 15:32:04 EDT


Jianping wrote:

> From your analysis, I am more convinced that we need a UTF-8S, not only for
> the binary order but also for this ambiguity, which applies to both UTF-8S
> and UTF-16. As the proposed UTF-8S encoding is logically equivalent to
> UTF-16, the two share the same property, which is different from UTF-8 and
> UTF-32.

I agree that both UTF-8s and UTF-16 share a property in that they
cannot create unique mappings that distinguish a supplementary
character from a sequence of surrogate code points.
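This shared property is easy to demonstrate. A minimal Python 3 sketch (mine, not part of the original discussion; it uses the "surrogatepass" error handler to force lone surrogates through the codec):

```python
# In UTF-16, the supplementary character U+10000 and the sequence of the
# two surrogate code points U+D800, U+DC00 produce the identical code
# unit sequence <D800 DC00>, so the mapping cannot be inverted uniquely.
char = "\U00010000".encode("utf-16-be")                     # one character
pair = "\ud800\udc00".encode("utf-16-be", "surrogatepass")  # two code points

assert char == pair == b"\xd8\x00\xdc\x00"
```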

However, I think you err in assuming that UTF's must be "logically
equivalent" in some strong sense. The UTF's are defined as transforms
of each Unicode scalar value into a unique sequence of code units.

Logically, there was nothing that would have prevented UTF-8 from
being defined by performing all the bit slicing and OR-ing, and then
inverting all the bits on the resulting code units, so you would get:

U-00000000 --> 0xFF
U-00000001 --> 0xFE
U-0000007F --> 0x80
U-00000080 --> 0x3D 0x7F
U-00000081 --> 0x3D 0x7E
U-000007FF --> 0x20 0x40
U-00000800 --> 0x1F 0x5F 0x7F
U-0000FFFF --> 0x10 0x40 0x40
U-0010FFFF --> 0x0B 0x70 0x40 0x40

Pretty obviously this UTF (UTF-8bi for "bit-inverted") would have
a drastically different binary ordering than the code points --
essentially backwards, in fact. It would also be totally useless
for interworking with ASCII. But it would nonetheless fit the
*definition* of a UTF. Not only that, it would share all the length
and allocation properties with the existing UTF-8.
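The hypothetical UTF-8bi transform can be sketched in a couple of lines of Python 3 (the name `utf8bi` is of course invented for illustration): encode as ordinary UTF-8, then invert every bit of each code unit.

```python
def utf8bi(scalar: int) -> bytes:
    """Hypothetical 'UTF-8bi': standard UTF-8 with every bit inverted."""
    return bytes(unit ^ 0xFF for unit in chr(scalar).encode("utf-8"))

# Reproduces the table above:
assert utf8bi(0x00000000) == b"\xff"
assert utf8bi(0x0000007F) == b"\x80"
assert utf8bi(0x00000080) == b"\x3d\x7f"
assert utf8bi(0x0010FFFF) == b"\x0b\x70\x40\x40"
```

Because the transform is a bijection on each code unit, it preserves sequence lengths and invertibility exactly as UTF-8 does, while scrambling the binary order.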

> Here we need either to fix UTF-16
> to make it have the same property as UTF-8,

Mark, Peter, and others have already indicated that it would be
completely out of the question to try to "fix UTF-16". It would
be ruinous -- catastrophic, in fact. It would completely
destroy the standard and break software that was trying to use it.

So in the future, please refrain from continuing to raise this
one as a possible alternative. It is only an alternative if
one is a bomb-thrower intent upon destroying the Unicode Standard.

> or to make another one as UTF-8S.

This one has more subtle difficulties, which is why people
are still continuing to discuss it at all. Since some people
apparently have "UTF-8" implementations that behave as UTF-8s,
and since the differences only impact supplemental characters,
which we are only starting to encounter, it seems like a much
lesser deal to advocate adding a UTF-8s.

I still claim that it would be very damaging to the standard,
but of course it wouldn't be credible for me to make the
same apocalyptic claims about it as for "fixing" UTF-16.

>
> This will fix the following problem, for example:
> For a search engine searching for the character U-00010000 in a UTF-8 string,
> it could not find it. But when the UTF-8 is converted into UTF-16, it can
> find it there, because <ED A0 80> and <ED B0 80> are converted into
> U-00010000 in UTF-16.

No, it does not.

If the task is for a search engine to find U-00010000 in a
UTF-8 string, it should be searching for <F0 90 80 80>.

If the UTF-8 is converted into UTF-16, it will be converted
into <D800 DC00> (note, *not* U-00010000, which is just the
Unicode scalar value). So if the search engine is looking
to find U-00010000 in UTF-16 string, it should be searching
for <D800 DC00>.
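Both well-formed search keys are trivial to produce; in Python 3, for instance (a sketch of mine, not from the original exchange):

```python
s = "\U00010000"  # the supplementary character U+10000

# The unique well-formed representation in each encoding form:
assert s.encode("utf-8") == b"\xf0\x90\x80\x80"      # <F0 90 80 80>
assert s.encode("utf-16-be") == b"\xd8\x00\xdc\x00"  # code units <D800 DC00>
```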

Where is the problem?

The problem comes when someone, contrary to the conformance
requirements of the standard, has emitted irregular UTF-8
for the character in question, so that instead of <F0 90 80 80>,
the string has <ED A0 80 ED B0 80> in it.

A *lenient* search engine could also search for the
irregular pattern, i.e., it could consider <F0 90 80 80> and
<ED A0 80 ED B0 80> both to be matches for U-00010000, but
that would slow it down. It turns into a normalization problem,
since thinking this way means that there would be two representations
for all supplementary characters: a regular UTF-8 form and an
irregular UTF-8 form. And this also sneaks the non-shortest
UTF-8 issue in the back door again, creating potential
security issues because of the alternate representations.
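The lenient strategy amounts to treating every supplementary character as having two byte patterns and matching either. A sketch under that assumption (names such as `lenient_find` and `irregular_utf8` are illustrative, not from any standard API):

```python
def surrogate_pair(scalar: int) -> tuple[int, int]:
    """Split a supplementary-plane scalar into its UTF-16 surrogate pair."""
    v = scalar - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

def irregular_utf8(scalar: int) -> bytes:
    """Irregular form: each surrogate code point encoded as three bytes."""
    out = bytearray()
    for cu in surrogate_pair(scalar):
        out += bytes([0xE0 | (cu >> 12),
                      0x80 | ((cu >> 6) & 0x3F),
                      0x80 | (cu & 0x3F)])
    return bytes(out)

def lenient_find(haystack: bytes, scalar: int) -> bool:
    """Match either the regular or the irregular byte pattern."""
    return (chr(scalar).encode("utf-8") in haystack
            or irregular_utf8(scalar) in haystack)

assert irregular_utf8(0x10000) == b"\xed\xa0\x80\xed\xb0\x80"
assert lenient_find(b"\xf0\x90\x80\x80", 0x10000)
assert lenient_find(b"\xed\xa0\x80\xed\xb0\x80", 0x10000)
```

Note the cost: every supplementary character now needs two searches instead of one, which is exactly the normalization burden described above.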

Thus a *strict* search engine looking at *UTF-8* data should
match <F0 90 80 80> but reject <ED A0 80 ED B0 80> as irregular.
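As it happens, a modern strict UTF-8 decoder behaves exactly this way; Python 3's built-in codec, for example, accepts the regular form and raises on the irregular surrogate bytes:

```python
regular = b"\xf0\x90\x80\x80"              # well-formed UTF-8 for U+10000
irregular = b"\xed\xa0\x80\xed\xb0\x80"    # encoded surrogate pair

assert regular.decode("utf-8") == "\U00010000"

try:
    irregular.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected
```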

Introduction of UTF-8s as a formal UTF would create an exactly
equal but opposite problem for supplementary characters.
A search engine looking at a UTF-8s string should expect
to find <ED A0 80 ED B0 80> but *not* <F0 90 80 80>.

In practice, however, both of these forms would leak all over
the place, especially (as Carl Brown points out) because Oracle
is labelling its UTF-8s as "UTF8", and search engines (and
every other process) are going to run up against completely
unnormalized data for supplementary characters and have to
deal with both forms.

The point is, you can't force everyone using the conformant
UTF-8 to switch over and only use UTF-8s. But unless you
do so, we are *all* going to have to deal with 4 UTF's,
instead of 3, and two of them are going to be mixed up
and confused for each other, and will result in data with
multiple representations of the same characters.

That is not a pretty picture. It is just *adding* to the
trouble, and not making it more consistent.

--Ken



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT