Re: FSS-UTF, UTF-2, UTF-8, and UTF-16

From: Mark Davis (mark@macchiato.com)
Date: Tue Jun 19 2001 - 09:47:09 EDT


This is too strong a statement. Yes, UTF-FSS was designed to represent code
points above FFFF in 4 bytes. But let's look at the path that the software
would take over history. If you take the original UCS-2 to UTF-8 mechanism
(back when UTF-8 was called UTF-FSS) and apply it to surrogates, the
sequence D800 DC00 would map to the sequence ED A0 80 ED B0 80. The sequence
D800 DC00 was changed in UTF-16 to represent U+10000. If one did not correct
the UCS-2 software, and simply interpreted it according to UTF-16 semantics,
then one would end up with a (flawed) UTF-8 sequence representing U+10000.
Nobody pulled this out of a hat. It is simply the natural result of not
fixing your mapping to UTF-8 when starting to reinterpret 16-bit codes as
UTF-16.
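
To make the byte sequences above concrete, here is a minimal sketch in
Python. It assumes the uncorrected UCS-2 software simply applied the 1- to
3-byte patterns to each 16-bit unit with no surrogate handling; the function
names are hypothetical and are not taken from any vendor's implementation.

```python
def ucs2_unit_to_utf8(unit: int) -> bytes:
    """Encode one 16-bit code unit with the 1- to 3-byte patterns only.

    No surrogate handling: this models software written for UCS-2.
    """
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def legacy_ucs2_to_utf8(units) -> bytes:
    """Map each 16-bit unit independently, as uncorrected software would."""
    return b"".join(ucs2_unit_to_utf8(u) for u in units)


def combine_surrogates(high: int, low: int) -> int:
    """UTF-16 semantics: combine a surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = [0xD800, 0xDC00]

# Uncorrected path: each surrogate becomes its own 3-byte sequence.
legacy = legacy_ucs2_to_utf8(pair)

# Corrected path: combine the pair first, then encode the code point.
correct = chr(combine_surrogates(*pair)).encode("utf-8")

print(legacy.hex(" ").upper())   # ED A0 80 ED B0 80
print(correct.hex(" ").upper())  # F0 90 80 80
```

The two outputs differ only because the second path combines the surrogate
pair into U+10000 before encoding; the per-unit mapping itself is identical.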

This doesn't mean it was the correct thing to do. The ideal case would have
been to correct the software when there were no supplementary characters
(those requiring representation with surrogate pairs) that would cause a
difference in interpretation between UTF-16 and UCS-2. People like database
vendors often have a huge requirement for stability, and must provide their
customers with solutions that are bug-for-bug compatible with older versions
for quite some time into the future. Yet there was a long period of time in
which to deprecate the older UCS-2 solution.

We designed UTF-16 to be as backwards compatible as possible with UCS-2,
so that UCS-2 software would continue to work. Most importantly, if UTF-16
data was sent through a UCS-2 process, and that process ignored the
surrogate codes (since they were unassigned), all would be well. The one
area where it really does make a difference -- since you are transforming
unassigned code points -- is in mapping to other UTFs.

Mark

----- Original Message -----
From: <DougEwell2@cs.com>
To: <unicode@unicode.org>
Cc: <Jianping.Yang@oracle.com>
Sent: Monday, June 18, 2001 23:01
Subject: Re: FSS-UTF, UTF-2, UTF-8, and UTF-16

> In a message dated 2001-06-18 12:56:47 Pacific Daylight Time,
> Jianping.Yang@oracle.com writes:
>
> > As matter of fact, Oracle supported UTF-8 far earlier than surrogate or
> > 4-byte encoding was introduced. As database vendor, Oracle took fully
> > advantages of Unicode and also a victim of Unicode in sense of
> > compatibility. As no burden of fonts and IME issue for a database to
> > store Unicode at its server. Oracle supported very early version of
> > Unicode in its Oracle 7 release as database character set AL24UTFFSS
> > which means 3-byte encoding for UTF-FSS. When Unicode came to version
> > 2.1, we found our AL24UTFFSS had trouble for 2.1 as Hangul's
> > reallocation, and we could not simply update AL24UTFFSS to 2.1
> > definition as it would mess existing users' data in their database. So
> > we came up with a new character set as UTF8 which is still 3-byte
> > encoding to support Unicode 2.1. The choice of 3-byte encoding is also
> > bound to AL24UTFFSS implementation as it would not break when users
> > migrate AL24UTFFSS into UTF8.
>
> The Hangul mess took place with Unicode 2.0, not 2.1. And this is a red
> herring anyway when we are talking about UTF-8. As stated before, UTF-8
> has never changed even though the Unicode beneath it has changed:
>
> * by moving the Hangul block in version 2.0
> * by creating the UTF-16 mechanism to support surrogates in 1993 (not
>   2001)
>
> The mechanism in UTF-8 to encode characters from U+10000 to U+10FFFF
> (actually U+1FFFFF) in 4 bytes was part of the original FSS-UTF specified
> in 1992. Check the records. It was never "added on" at some later date,
> causing existing conformant UTF-8 to break. If Oracle or any other vendor
> or developer originally interpreted UTF-8 to use a maximum of 3 bytes to
> encode a character, that is either their own misreading of the
> specification or a deliberate subsetting of the problem, but in any case
> that company cannot claim to be a "victim of Unicode" when they have
> implemented a clearly specified Unicode standard incorrectly.
>
> -Doug Ewell
> Fullerton, California
>


