Re: UTF8 vs AL32UTF8

From: DougEwell2@cs.com
Date: Tue Jun 12 2001 - 11:29:44 EDT


In a message dated 2001-06-12 1:07:17 Pacific Daylight Time,
Peter_Constable@sil.org writes:

> There's a mistake being made here that has been made repeatedly throughout
> our discussion: that's to assume that there are two kinds of UTF-8: the
> original, in which the code unit sequence < ED A0 80 ED B0 80 > meant the
> coded character sequence < U-0000D800, U-0000DC00 >, and the new UTF-8 in
> which this sequence means U-00010000. The only sensible interpretation of
> the definitions of Unicode is that UTF-8 maps exactly one coded character
> to exactly one code unit sequence. As far as I know, the UTF-8 mapping
> hasn't changed; all that has changed are the range of USVs that are mapped
> into it, and the introduction of some terms like "irregular".

There has only ever been one kind of UTF-8, but the Unicode underneath it has
changed: from version 1.x, where there were no surrogates and U+D800, U+DC00
was just an ordinary sequence of two characters, to version 2.x and beyond,
where U+D800, U+DC00 is either (a) a surrogate pair representing U+10000 or
(b), much less likely, two loose surrogates that happen to appear together
by chance.

UTF-16, alone among UTFs (until this proposal), does not allow the
distinction required by definition D29, but UTF-8 does have this power. You
can say F0 90 80 80 to mean U+10000, or if you really want to, you can also
say ED A0 80 ED B0 80 to mean U+D800, U+DC00.
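
To make the distinction concrete, here is a small Python sketch. It is an
illustration only: Python's strict UTF-8 codec rejects the three-byte
surrogate forms, so the "surrogatepass" error handler stands in for the
permissive reading described above, and the variable names are made up for
the example.

```python
# One supplementary character versus two loose surrogate code points.
supplementary = "\U00010000"
loose_pair = "\ud800\udc00"  # Python keeps these as two separate code points

# UTF-8 gives the two cases different byte sequences.
utf8_supp = supplementary.encode("utf-8")                # F0 90 80 80
utf8_pair = loose_pair.encode("utf-8", "surrogatepass")  # ED A0 80 ED B0 80

# UTF-16 gives both cases the same code units, so the distinction is lost.
utf16_supp = supplementary.encode("utf-16-be")
utf16_pair = loose_pair.encode("utf-16-be", "surrogatepass")

print(utf8_supp.hex(" "), "|", utf8_pair.hex(" "))
print(utf16_supp.hex(" "), "|", utf16_pair.hex(" "))
```

The first line printed shows two different UTF-8 sequences; the second shows
the same four UTF-16 bytes on both sides.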

So Toby is correct -- UTF-8s is not UTF-8, but a completely different
encoding scheme (although it walks and talks just like UTF-8 as long as the
text in question contains no surrogates or supplementary characters).
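
A rough sketch of what such an encoder might look like, assuming UTF-8s
means "split each supplementary character into its UTF-16 surrogate pair and
write each surrogate as a three-byte sequence". The function name
encode_utf8s is hypothetical, and this is my reading of the proposal rather
than anyone's actual implementation.

```python
def encode_utf8s(text: str) -> bytes:
    """Hypothetical UTF-8s encoder: supplementary characters become two
    three-byte sequences (one per surrogate) instead of one four-byte one."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" lets Python emit the three-byte form.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


bmp_text = "caf\u00e9 \u20ac"
print(encode_utf8s(bmp_text) == bmp_text.encode("utf-8"))
print(encode_utf8s("\U00010000").hex(" "))
print("\U00010000".encode("utf-8").hex(" "))
```

For BMP-only text the output matches UTF-8 byte for byte; the two encodings
part ways only once a supplementary character shows up.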

There are problems, though. UTF-8s looks *so much* like UTF-8 that, as Peter
notes, there is considerable opportunity for the two to become mixed up.
Toby admits that although UTF-8s is intended to remain internal, sometimes
"internal" things leak out into the external world. Oracle's choice of names
("UTF8" to mean UTF-8s, the non-intuitive "AL32UTF8" to mean UTF-8) doesn't
help matters one bit. And I still don't think UTF-8s is truly capable of
round-tripping unpaired surrogates in the manner spelled out in D29.
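
One way to read the round-tripping concern, sketched in Python under the
assumption that a UTF-8s decoder joins any adjacent high and low surrogate
into a supplementary character. The name decode_utf8s is hypothetical, and
for brevity this sketch also accepts ordinary four-byte sequences, which a
real UTF-8s decoder would presumably reject.

```python
def decode_utf8s(data: bytes) -> str:
    """Hypothetical UTF-8s decoder: three-byte surrogate sequences are
    decoded first, then adjacent high/low surrogates are joined."""
    units = data.decode("utf-8", "surrogatepass")
    out = []
    i = 0
    while i < len(units):
        hi = ord(units[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(units):
            lo = ord(units[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                # A high surrogate followed by a low one becomes one character.
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out.append(chr(cp))
                i += 2
                continue
        # Anything else, including a lone surrogate, passes through as-is.
        out.append(units[i])
        i += 1
    return "".join(out)


# Two loose surrogates that merely happen to sit next to each other.
loose_pair = "\ud800\udc00"
wire = loose_pair.encode("utf-8", "surrogatepass")  # ED A0 80 ED B0 80
round_tripped = decode_utf8s(wire)
print(round_tripped == loose_pair)    # the original pair is not recovered
print(round_tripped == "\U00010000")  # it comes back as one character
```

A single lone surrogate does survive in this sketch; what gets lost is the
case of two loose surrogates side by side, which becomes indistinguishable
from U+10000.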

All of these technical considerations need to be taken into account, as well
as those presented by the database vendors. The worst thing would be for
UTF-8s to be just swept into the standard because of the political clout
wielded by the proponents.

-Doug Ewell
 Fullerton, California


