Re: FSS-UTF, UTF-2, UTF-8, and UTF-16

From: DougEwell2@cs.com
Date: Tue Jun 19 2001 - 12:05:51 EDT


In a message dated 2001-06-19 6:46:14 Pacific Daylight Time,
mark@macchiato.com writes:

> If you take the original UCS-2 to UTF-8 mechanism
> (back when UTF-8 was called UTF-FSS) and apply it to surrogates, the
> sequence D800 DC00 would map to the sequence ED A0 80 ED B0 80.

Very true:
U+D800 U+DC00 == ED A0 80 ED B0 80
(assuming those are valid code points, which was true before 1993)
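
Just to make that concrete, here is a minimal sketch in C (my own
illustration, not anyone's actual converter) of the pre-UTF-16 rule: each
16-bit UCS-2 code unit is encoded by itself, with no notion of surrogate
pairs, so D800 and DC00 each land in the three-byte range:

    #include <stdio.h>

    /* Sketch of a pre-UTF-16 converter: encode one UCS-2 code unit to
       UTF-8 on its own, with no knowledge of surrogate pairs. */
    static int ucs2_to_utf8(unsigned short u, unsigned char *out)
    {
        if (u < 0x80) {                  /* 0xxxxxxx */
            out[0] = (unsigned char)u;
            return 1;
        } else if (u < 0x800) {          /* 110xxxxx 10xxxxxx */
            out[0] = 0xC0 | (u >> 6);
            out[1] = 0x80 | (u & 0x3F);
            return 2;
        } else {                         /* 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = 0xE0 | (u >> 12);
            out[1] = 0x80 | ((u >> 6) & 0x3F);
            out[2] = 0x80 | (u & 0x3F);
            return 3;
        }
    }

    int main(void)
    {
        unsigned short in[2] = { 0xD800, 0xDC00 };
        unsigned char buf[3];
        int i, j, n;
        for (i = 0; i < 2; i++) {
            n = ucs2_to_utf8(in[i], buf);
            for (j = 0; j < n; j++)
                printf("%02X ", buf[j]);
        }
        printf("\n");                    /* prints: ED A0 80 ED B0 80 */
        return 0;
    }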

> The sequence D800 DC00 was changed in UTF-16 to represent U+10000. If one
> did not correct the UCS-2 software,

EXACTLY. That is my point. It is the transformation from UCS-2 to UTF-16
that needs to be corrected, NOT the conversion to and from UTF-8.

> and simply interpreted it according to UTF-16 semantics,
> then one would end up with a (flawed) UTF-8 sequence representing U+10000.

U+10000 ==> (UTF-16) D800 DC00 ==> (UTF-8) F0 90 80 80
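
For comparison, a minimal sketch (again my own illustration, not any
vendor's code) of the corrected path: combine the surrogate pair into a
scalar value first, and only then encode it, which is where the four-byte
form comes in:

    #include <stdio.h>

    /* Combine a UTF-16 surrogate pair into a single scalar value. */
    static unsigned long utf16_pair(unsigned short hi, unsigned short lo)
    {
        return 0x10000UL + (((unsigned long)(hi - 0xD800) << 10)
                            | (lo - 0xDC00));
    }

    /* Encode a scalar value in UTF-8; values above FFFF need 4 bytes. */
    static int scalar_to_utf8(unsigned long c, unsigned char *out)
    {
        if (c < 0x80) {
            out[0] = (unsigned char)c;
            return 1;
        } else if (c < 0x800) {
            out[0] = 0xC0 | (c >> 6);
            out[1] = 0x80 | (c & 0x3F);
            return 2;
        } else if (c < 0x10000) {
            out[0] = 0xE0 | (c >> 12);
            out[1] = 0x80 | ((c >> 6) & 0x3F);
            out[2] = 0x80 | (c & 0x3F);
            return 3;
        } else {
            out[0] = 0xF0 | (c >> 18);
            out[1] = 0x80 | ((c >> 12) & 0x3F);
            out[2] = 0x80 | ((c >> 6) & 0x3F);
            out[3] = 0x80 | (c & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int i, n = scalar_to_utf8(utf16_pair(0xD800, 0xDC00), buf);
        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);
        printf("\n");                    /* prints: F0 90 80 80 */
        return 0;
    }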

ED A0 80 ED B0 80 represents the two unpaired (but coincidentally
consecutive) code points 0xD800 and 0xDC00, which is why it fulfills
definition D29, which states that non-characters and unpaired surrogates
have to be round-tripped.
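
To show what that round trip looks like (my sketch, not language from the
standard), decoding those three-byte sequences with pre-UTF-16 semantics
simply gives back the two unpaired code points:

    #include <stdio.h>

    /* Decode one three-byte UTF-8 sequence back to a 16-bit value,
       without asking whether that value is a surrogate. */
    static unsigned short utf8_3byte(const unsigned char *p)
    {
        return (unsigned short)(((p[0] & 0x0F) << 12)
                                | ((p[1] & 0x3F) << 6)
                                | (p[2] & 0x3F));
    }

    int main(void)
    {
        const unsigned char seq[6] = { 0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80 };
        printf("%04X %04X\n", utf8_3byte(seq), utf8_3byte(seq + 3));
        /* prints: D800 DC00 */
        return 0;
    }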

> This doesn't mean it was the correct thing to do. The ideal case would have
> been to correct the software when there were no supplementary characters
> (those requiring representation with surrogate pairs) that would cause a
> difference in interpretation between UTF-16 and UCS-2. People like database
> vendors often have a huge requirement for stability, and must provide their
> customers with solutions that are bug-for-bug compatible with older versions
> for quite some time into the future. Yet there was a long period of time in
> which to deprecate the older UCS-2 solution.

Absolutely. There was a time when every line of code that I wrote having to
do with Unicode assumed that all code points were 16 bits long and could fit
in an unsigned short, and everything was nice and neat and orderly. Like
many others, I was somewhat disappointed when surrogates came along and I had
to start playing the variable-length game. Some of my code (for internal use
only) was not corrected until well after 1993.

But none of that is the fault of the Unicode Consortium or ISO/IEC
JTC1/SC2/WG2; nobody can say they failed to warn me that supplementary
characters were coming, some day.

I would not knowingly write code that failed to handle the Unicode code point
U+0220, even though no character is currently assigned to that position. The
same is true of U+10000 through U+10FFFF. Even the non-characters have to be
handled, in their own way.

What I am trying to do is refute claims like this one:

> As a matter of fact, Oracle supported UTF-8 far earlier than surrogate or
> 4-byte encoding was introduced.

when in fact there was NEVER a time in the history of UTF-FSS, UTF-2, or
UTF-8 when 4-byte encodings were not part of the specification.
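
The length of a sequence has always been driven purely by the magnitude of
the value being encoded. A minimal sketch of that rule as I read the old
FSS-UTF / UTF-2 proposals (my paraphrase, not the text of any spec):

    #include <stdio.h>

    /* Number of UTF-8/FSS-UTF bytes needed for a given value; the
       four-, five-, and six-byte forms were there from the start. */
    static int utf8_length(unsigned long c)
    {
        if (c < 0x80)       return 1;
        if (c < 0x800)      return 2;
        if (c < 0x10000)    return 3;
        if (c < 0x200000)   return 4;    /* covers U+10000..U+10FFFF */
        if (c < 0x4000000)  return 5;
        return 6;                        /* up to 7FFFFFFF in the old spec */
    }

    int main(void)
    {
        printf("%d %d\n", utf8_length(0xFFFF), utf8_length(0x10000));
        /* prints: 3 4 */
        return 0;
    }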

And I am trying to show that, while actual assigned supplementary characters
may not have appeared until Unicode 3.1, the *mechanism* to support them has
been in place for years and years. Waiting until characters were assigned
outside the BMP to start working on the UCS-2 problem is like waiting until
2000-01-01 to start working on the Y2K problem.

I think I am basically in agreement with Mark Davis here, which is good,
because he is the expert and authority and I should try to ensure that my
understanding matches his knowledge.

-Doug Ewell
 Fullerton, California


