RE: UTF-16 vs UTF-32 (was IBM AIX 5 and GB18030)

From: John McConnell (johnmcco@windows.microsoft.com)
Date: Fri Nov 15 2002 - 12:28:48 EST


    My experience is that the UCS-2 to UTF-16 conversion can be much easier than the SBCS to DBCS conversion, depending on how your original code is organized.

    In the case of Windows, much of the text processing was already done by modules (e.g. Uniscribe, NLS) that processed text elements rather than individual characters. This was necessary even with UCS-2 because of combining characters, and because globalized text operations must always work on multiple characters, e.g. sorting Traditional Spanish or displaying Hindi. In Windows XP, we've achieved significant UTF-16 support with a modest effort to upgrade these modules and haven't had to touch most of the OS code.
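
    A minimal sketch of what "processing text elements rather than individual characters" looks like at the lowest level, not the actual Uniscribe/NLS code: walk a UTF-16 buffer one code point at a time, pairing surrogates instead of assuming one 16-bit unit per character as UCS-2 code could. The function name and the sample string are mine, for illustration only.

        #include <stdint.h>
        #include <stdio.h>

        /* Return the code point starting at index *i, advancing *i past it.
           An unpaired surrogate is returned as-is. */
        static uint32_t next_code_point(const uint16_t *s, size_t len, size_t *i)
        {
            uint16_t hi = s[(*i)++];
            if (hi >= 0xD800 && hi <= 0xDBFF && *i < len &&
                s[*i] >= 0xDC00 && s[*i] <= 0xDFFF) {
                uint16_t lo = s[(*i)++];
                return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (lo - 0xDC00));
            }
            return hi;
        }

        int main(void)
        {
            /* "A" followed by U+10400 DESERET CAPITAL LETTER LONG I */
            const uint16_t text[] = { 0x0041, 0xD801, 0xDC00 };
            size_t i = 0, len = sizeof text / sizeof text[0];
            while (i < len)
                printf("U+%04X\n", (unsigned)next_code_point(text, len, &i));
            return 0;   /* prints U+0041 then U+10400 */
        }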

    Concerning "bait & switch", I was always amused by the contrast between the Introduction to Unicode 1.0, which celebrated the return to a fixed width character set, and the General Principles in the next section, which taught me that I shouldn't care.

    John
    Microsoft

    -----Original Message-----
    From: Doug Ewell [mailto:dewell@adelphia.net]
    Sent: Thursday, November 14, 2002 8:26 PM
    To: Unicode Mailing List
    Cc: Carl W. Brown
    Subject: Re: UTF-16 vs UTF-32 (was IBM AIX 5 and GB18030)

    Carl W. Brown <cbrown at xnetinc dot com> wrote:

    > Converting from UCS-2 to UTF-16 is just like converting from SBCS to
    > DBCS. For folks who think DBCS, it is no problem. Those who went from
    > DBCS to Unicode to simplify their lives are, I am sure, not happy.

    Ken made me laugh last March by referring to this as

        "... a bait and switch tactic, whereby implementers were lulled
        into thinking they had a simple, fixed-width 16-bit system, only
        to discover belatedly that they had bought into yet another
        mixed-width character encoding after all."

    At least with surrogate pairs, we don't have to deal with overlapping
    ranges for lead bytes and trail bytes, or for trail bytes and
    single-byte characters, and we don't have to go through crazy gymnastics
    to "find the last lead byte" if we ever get lost in the middle of a
    UTF-16 string.
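
    To make that concrete, here is a small illustrative sketch (the helper
    names are my own, not from any particular library): because the high
    surrogate range (0xD800..0xDBFF), the low surrogate range
    (0xDC00..0xDFFF), and everything else never overlap, a single code
    unit identifies its own role, and resynchronizing from an arbitrary
    index means backing up at most one unit.

        #include <stdint.h>
        #include <stdio.h>

        static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
        static int is_low (uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

        /* Index of the code point containing position i: step back at most
           once, if we landed on the trail unit of a pair. */
        static size_t code_point_start(const uint16_t *s, size_t i)
        {
            return (i > 0 && is_low(s[i]) && is_high(s[i - 1])) ? i - 1 : i;
        }

        int main(void)
        {
            /* "A", U+10400 (one surrogate pair), "B" */
            const uint16_t s[] = { 0x0041, 0xD801, 0xDC00, 0x0042 };
            printf("%zu %zu %zu\n", code_point_start(s, 1),
                   code_point_start(s, 2), code_point_start(s, 3));
            return 0;   /* prints 1 1 3 */
        }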

    > I think the worst problem is that many systems still sort in binary,
    > not code point order. Then you get Oracle and the like wanting to set
    > up a UTF-8 variant that encodes each surrogate rather than the
    > character.

    As Michka noted, the mechanism for surrogates has existed for almost a
    decade now. Individuals and companies that ignored surrogates because
    "there aren't any characters there anyway, and when they do add some
    they'll be extremely rare," and are now behind in supporting UTF-16,
    really have nobody else to blame.
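
    (The variant Carl mentions, encoding each surrogate code unit as its
    own three-byte sequence, is essentially what is documented as CESU-8.)
    As for binary versus code point order: raw 16-bit comparison puts
    supplementary characters, whose UTF-16 units start at 0xD800, below
    BMP characters in the range 0xE000..0xFFFF, even though their code
    points are higher. A well-known fix-up restores code point order
    without decoding; a rough sketch, with names of my own choosing:

        #include <stdint.h>
        #include <stdio.h>

        /* Remap code units so that unsigned comparison of the remapped
           values matches code point order: BMP below the surrogates stays
           put, BMP above them moves down, surrogates move to the top. */
        static uint16_t fixup(uint16_t u)
        {
            if (u >= 0xE000) return u - 0x800;
            if (u >= 0xD800) return u + 0x2000;
            return u;
        }

        static int cmp_code_point_order(const uint16_t *a, size_t alen,
                                        const uint16_t *b, size_t blen)
        {
            size_t i, n = alen < blen ? alen : blen;
            for (i = 0; i < n; i++) {
                if (a[i] != b[i]) {
                    uint16_t fa = fixup(a[i]), fb = fixup(b[i]);
                    return fa < fb ? -1 : 1;
                }
            }
            return (alen > blen) - (alen < blen);
        }

        int main(void)
        {
            /* U+FF21 vs U+10400: binary unit order sorts the second one
               first (0xD801 < 0xFF21); code point order sorts it last. */
            const uint16_t bmp[]  = { 0xFF21 };
            const uint16_t supp[] = { 0xD801, 0xDC00 };
            printf("%d\n", cmp_code_point_order(bmp, 1, supp, 2));
            return 0;   /* prints -1: U+FF21 < U+10400 in code point order */
        }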

    > However, 16 bit characters were a hard enough sell in the good old
    > days. If we had started out withug 2bit characters we would still be
    > dreaming about Unicode.

    I think Carl meant "with 32-bit characters." I don't know what kind of
    word "withug" is (Old English?), but I like it.

    -Doug Ewell
     Fullerton, California



    This archive was generated by hypermail 2.1.5 : Fri Nov 15 2002 - 13:09:55 EST