RE: UTF-16 vs UTF-32 (was IBM AIX 5 and GB18030)

From: John McConnell (johnmcco@windows.microsoft.com)
Date: Fri Nov 15 2002 - 12:28:48 EST


    My experience is that the UCS-2 to UTF-16 conversion can be much easier than the SBCS to DBCS conversion, depending on how your original code is organized.

    In the case of Windows, much of the text processing was already done by modules (e.g. Uniscribe, NLS) that processed text elements rather than individual characters. This was necessary even with UCS-2 because of combining characters, and because globalized text operations must always work on multiple characters, e.g. sorting Traditional Spanish or displaying Hindi. In Windows XP, we've achieved significant UTF-16 support with a modest effort to upgrade these modules and haven't had to touch most of the OS code.
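
    A minimal sketch of what "processing text elements rather than individual characters" looks like at the lowest level, not the actual Uniscribe/NLS code: walk a UTF-16 buffer one code point at a time, pairing surrogates instead of assuming one 16-bit unit per character as UCS-2 code could. The function name and the sample string are mine, for illustration only.

        #include <stdint.h>
        #include <stdio.h>

        /* Return the code point starting at index *i, advancing *i past it.
           An unpaired surrogate is returned as-is. */
        static uint32_t next_code_point(const uint16_t *s, size_t len, size_t *i)
        {
            uint16_t hi = s[(*i)++];
            if (hi >= 0xD800 && hi <= 0xDBFF && *i < len &&
                s[*i] >= 0xDC00 && s[*i] <= 0xDFFF) {
                uint16_t lo = s[(*i)++];
                return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (lo - 0xDC00));
            }
            return hi;
        }

        int main(void)
        {
            /* "A" followed by U+10400 DESERET CAPITAL LETTER LONG I */
            const uint16_t text[] = { 0x0041, 0xD801, 0xDC00 };
            size_t i = 0, len = sizeof text / sizeof text[0];
            while (i < len)
                printf("U+%04X\n", (unsigned)next_code_point(text, len, &i));
            return 0;   /* prints U+0041 then U+10400 */
        }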

    Concerning "bait & switch", I was always amused by the contrast between the Introduction to Unicode 1.0, which celebrated the return to a fixed width character set, and the General Principles in the next section, which taught me that I shouldn't care.

    John
    Microsoft

    -----Original Message-----
    From: Doug Ewell [mailto:dewell@adelphia.net]
    Sent: Thursday, November 14, 2002 8:26 PM
    To: Unicode Mailing List
    Cc: Carl W. Brown
    Subject: Re: UTF-16 vs UTF-32 (was IBM AIX 5 and GB18030)

    Carl W. Brown <cbrown at xnetinc dot com> wrote:

    > Converting from UCS-2 to UTF-16 is just like converting from SBCS to
    > DBCS. For folks who think DBCS, it is no problem. Those who went from
    > DBCS to Unicode to simplify their lives are, I am sure, not happy.

    Ken made me laugh last March by referring to this as

        "... a bait and switch tactic, whereby implementers were lulled
        into thinking they had a simple, fixed-width 16-bit system, only
        to discover belatedly that they had bought into yet another
        mixed-width character encoding after all."

    At least with surrogate pairs, we don't have to deal with overlapping
    ranges for lead bytes and trail bytes, or for trail bytes and
    single-byte characters, and we don't have to go through crazy gymnastics
    to "find the last lead byte" if we ever get lost in the middle of a
    UTF-16 string.
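
    To make that concrete, here is a small illustrative sketch (the helper
    names are my own, not from any particular library): because the high
    surrogate range (0xD800..0xDBFF), the low surrogate range
    (0xDC00..0xDFFF), and everything else never overlap, a single code
    unit identifies its own role, and resynchronizing from an arbitrary
    index means backing up at most one unit.

        #include <stdint.h>
        #include <stdio.h>

        static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
        static int is_low (uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

        /* Index of the code point containing position i: step back at most
           once, if we landed on the trail unit of a pair. */
        static size_t code_point_start(const uint16_t *s, size_t i)
        {
            return (i > 0 && is_low(s[i]) && is_high(s[i - 1])) ? i - 1 : i;
        }

        int main(void)
        {
            /* "A", U+10400 (one surrogate pair), "B" */
            const uint16_t s[] = { 0x0041, 0xD801, 0xDC00, 0x0042 };
            printf("%zu %zu %zu\n", code_point_start(s, 1),
                   code_point_start(s, 2), code_point_start(s, 3));
            return 0;   /* prints 1 1 3 */
        }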

    > I think the worst problem is that many systems still sort in binary,
    > not code point order. Then you get Oracle and the like wanting to set
    > up a UTF-8 variant that encodes each surrogate rather than the
    > character.

    As Michka noted, the mechanism for surrogates has existed for almost a
    decade now. Individuals and companies that ignored surrogates because
    "there aren't any characters there anyway, and when they do add some
    they'll be extremely rare," and are now behind in supporting UTF-16,
    really have nobody else to blame.
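
    (The variant Carl mentions, encoding each surrogate code unit as its
    own three-byte sequence, is essentially what is documented as CESU-8.)
    As for binary versus code point order: raw 16-bit comparison puts
    supplementary characters, whose UTF-16 units start at 0xD800, below
    BMP characters in the range 0xE000..0xFFFF, even though their code
    points are higher. A well-known fix-up restores code point order
    without decoding; a rough sketch, with names of my own choosing:

        #include <stdint.h>
        #include <stdio.h>

        /* Remap code units so that unsigned comparison of the remapped
           values matches code point order: BMP below the surrogates stays
           put, BMP above them moves down, surrogates move to the top. */
        static uint16_t fixup(uint16_t u)
        {
            if (u >= 0xE000) return u - 0x800;
            if (u >= 0xD800) return u + 0x2000;
            return u;
        }

        static int cmp_code_point_order(const uint16_t *a, size_t alen,
                                        const uint16_t *b, size_t blen)
        {
            size_t i, n = alen < blen ? alen : blen;
            for (i = 0; i < n; i++) {
                if (a[i] != b[i]) {
                    uint16_t fa = fixup(a[i]), fb = fixup(b[i]);
                    return fa < fb ? -1 : 1;
                }
            }
            return (alen > blen) - (alen < blen);
        }

        int main(void)
        {
            /* U+FF21 vs U+10400: binary unit order sorts the second one
               first (0xD801 < 0xFF21); code point order sorts it last. */
            const uint16_t bmp[]  = { 0xFF21 };
            const uint16_t supp[] = { 0xD801, 0xDC00 };
            printf("%d\n", cmp_code_point_order(bmp, 1, supp, 2));
            return 0;   /* prints -1: U+FF21 < U+10400 in code point order */
        }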

    > However, 16 bit characters were a hard enough sell in the good old
    > days. If we had started out withug 2bit characters we would still be
    > dreaming about Unicode.

    I think Carl meant "with 32-bit characters." I don't know what kind of
    word "withug" is (Old English?), but I like it.

    -Doug Ewell
     Fullerton, California



    This archive was generated by hypermail 2.1.5 : Fri Nov 15 2002 - 13:09:55 EST