Re: FSS-UTF, UTF-2, UTF-8, and UTF-16

From: DougEwell2@cs.com
Date: Tue Jun 19 2001 - 02:01:36 EDT


In a message dated 2001-06-18 12:56:47 Pacific Daylight Time,
Jianping.Yang@oracle.com writes:

> As matter of fact, Oracle supported UTF-8 far earlier than surrogate or
> 4-byte encoding was introduced. As database vendor, Oracle took fully
> advantages of Unicode and also a victim of Unicode in sense of
> compatibility. As no burden of fonts and IME issue for a database to store
> Unicode at its server. Oracle supported very early version of Unicode in
> its Oracle 7 release as database character set AL24UTFFSS which means
> 3-byte encoding for UTF-FSS. When Unicode came to version 2.1, we found our
> AL24UTFFSS had trouble for 2.1 as Hangul's reallocation, and we could not
> simply update AL24UTFFSS to 2.1 definition as it would mess existing users'
> data in their database. So we came up with a new character set as UTF8
> which is still 3-byte encoding to support Unicode 2.1. The choice of 3-byte
> encoding is also bound to AL24UTFFSS implementation as it would not break
> when users migrate AL24UTFFSS into UTF8.

The Hangul mess took place with Unicode 2.0, not 2.1. And this is a red
herring anyway when we are talking about UTF-8. As stated before, UTF-8 has
never changed even though the Unicode beneath it has changed:

* by moving the Hangul block in version 2.0
* by creating the UTF-16 mechanism to support surrogates in 1993 (not 2001)

The mechanism in UTF-8 to encode characters from U+10000 to U+10FFFF
(actually U+1FFFFF) in 4 bytes was part of the original FSS-UTF specified in
1992. Check the records. It was never "added on" at some later date,
causing existing conformant UTF-8 to break. If Oracle or any other vendor or
developer originally interpreted UTF-8 to use a maximum of 3 bytes to encode
a character, that is either their own misreading of the specification or a
deliberate subsetting of the problem, but in any case that company cannot
claim to be a "victim of Unicode" when they have implemented a clearly
specified Unicode standard incorrectly.
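
To make the byte layouts concrete, here is a minimal sketch in Python of the
1- to 4-byte forms. The function name encode_utf8 is my own, chosen for
illustration and not taken from any specification or product. The sketch
covers only the forms up to 4 bytes and does not reject surrogate code points.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the 1- to 4-byte UTF-8 forms.

    Illustrative sketch only: no surrogate check, and the 5- and 6-byte
    forms of the original FSS-UTF are left out.
    """
    if cp < 0 or cp > 0x1FFFFF:
        raise ValueError("code point outside the range of the 4-byte form")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# U+AC00, the first Hangul syllable in the post-2.0 block, takes 3 bytes.
print(encode_utf8(0xAC00).hex())    # eab080
# U+10000, the first code point beyond the BMP, takes 4 bytes.
print(encode_utf8(0x10000).hex())   # f0908080
```

The 4-byte branch has the same shape as the shorter ones: a lead byte that
announces the length, followed by continuation bytes carrying 6 bits each.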

-Doug Ewell
 Fullerton, California


