Re: Korean [Was: 28th IUC paper - Tamil Unicode New]

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Aug 23 2005 - 17:50:07 CDT

    Antoine noted:

    > Ken (probably purposely) is dropping a fifth (!) encoding scheme for Hangul:
                    ^^^^^^^^^
    > U+3400..U+4DB5 (removed in 1996). Which is probably a forgotten thing now
    > (for the best), but certainly was a headache some years ago.

    Yes. It wasn't really a distinct way of encoding Hangul syllables --
    just a different encoding of a subset of the ones that ended up
    being moved to AC00..D7A3 by Amendment 5 to 10646-1:1993. They are
    a mess for anyone who has to interoperate with very early
    implementations of Unicode 1.1, but fortunately are ancient history
    now for most of us.

    > BTW, diacriticked Latin is encoded at least thrice, and the same algorithms
    > used for reduction of the latter could be used for the former, couldn't
    > they?

    Well, the Unicode Collation Algorithm makes use of equivalences
    built into the standard (canonical and compatibility decompositions)
    to produce meaningful weight foldings for Latin letters in the
    Default Unicode Collation Element Table, if that's what you mean.
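
    As a rough illustration of the equivalences mentioned above, here
    is a minimal Python sketch using the standard unicodedata module.
    It is not the UCA itself, only the kind of decomposition data the
    UCA draws on; the helper name code_points is made up for this
    example.

```python
import unicodedata

def code_points(text: str) -> str:
    """Hypothetical helper: render a string as U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Precomposed Latin letter -> base letter + combining acute (canonical).
latin_nfd = unicodedata.normalize("NFD", "\u00E9")

# Precomposed Hangul syllable -> conjoining jamo sequence (canonical).
hangul_nfd = unicodedata.normalize("NFD", "\uAC01")

# Hangul compatibility jamo -> conjoining jamo (compatibility mapping).
jamo_nfkd = unicodedata.normalize("NFKD", "\u3131")

print(code_points(latin_nfd))   # U+0065 U+0301
print(code_points(hangul_nfd))  # U+1100 U+1161 U+11A8
print(code_points(jamo_nfkd))   # U+1100
```

    In each case the alternative encodings reduce to one sequence, and
    that is what lets a collation table assign them comparable weights.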

    The same thing *could* be done for a double encoding of
    the Tamil script. But of course that raises the question: Why do such
    a thing in the first place?

    The monstrous messes in the standard for the Hangul and Latin
    scripts are not a *desirable* state to be wished for in other
    scripts. :-)

    > > But sorting *Korean* in Unicode
    >
    > Doesn't it mean collating Hanja characters as well? (so intermixing them
    > with their Hangul reading, etc.)

    There are at least *two* additional issues when you get to the
    Hanja characters.

    First, you have to ensure the syllabic integrity of Hangul when
    doing the weighting, so that the injection of a Hanja character
    doesn't cause a wrong ordering by being compared incorrectly with
    a part of a Hangul syllable. This is handled in part in the
    UCA by giving CJK characters very high weights at the end of
    the table. But you have to worry about the weighting of final
    jamos, as well.
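
    The syllabic integrity problem can be seen even without real
    collation weights. The toy sketch below is an assumption, not the
    UCA: it compares raw code points instead of table weights, and the
    names nfd_key, ga_plus_hanja and gag are invented for the example.
    It only shows how a Hanja can end up being compared against a
    final jamo once syllables are decomposed.

```python
import unicodedata

ga_plus_hanja = "\uAC00\u6F22"  # syllable GA followed by a Hanja
gag = "\uAC01"                  # syllable GAG (GA plus final KIYEOK)

def nfd_key(text: str) -> str:
    """Sort key that exposes the individual conjoining jamos."""
    return unicodedata.normalize("NFD", text)

# Whole syllables compared: GA < GAG, so the string with the Hanja is first.
syllable_order = sorted([gag, ga_plus_hanja])

# Jamos compared: after L and V match, the Hanja (U+6F22) is compared
# with the final jamo (U+11A8) and loses, so the order flips.
jamo_order = sorted([gag, ga_plus_hanja], key=nfd_key)

print(syllable_order == [ga_plus_hanja, gag])  # True
print(jamo_order == [gag, ga_plus_hanja])      # True
```

    Any weighting scheme that works at the jamo level has to avoid
    that flip, which is why the weights of the final jamos matter.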

    Second, if doing a phonetic order sort of Korean, *including*
    Hanja, then you need to have the correct reading of the Hanja.
    The legacy character sets for Korean had hacks in place for
    this, involving separate encoding of duplicate Hanja for
    pronunciation variants, so that strings could sort correctly
    depending on *which* duplicate was used to encode the Hanja
    character. Those duplicates ended up in Unicode as compatibility
    mapping characters: U+F900..U+FA0B. Ordinarily you wouldn't
    see those in de novo Korean data on a Unicode system, but
    if you are interoperating with a legacy system, you might --
    and if you have to sort that data, you need to track separate
    weights for the duplicates and *not* normalize them away.
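
    To make the last point concrete, here is a small Python sketch,
    again using only the standard unicodedata module. The function
    names are hypothetical, and the range check simply mirrors the
    U+F900..U+FA0B range given above. It shows that normalization
    collapses the duplicates, so any per-duplicate weight has to be
    looked up before (or instead of) normalizing.

```python
import unicodedata

# Three compatibility ideographs that all decompose to U+6A02.
DUPLICATES = ["\uF914", "\uF95C", "\uF9BF"]

def is_legacy_duplicate(ch: str) -> bool:
    """True if ch lies in the range U+F900..U+FA0B."""
    return 0xF900 <= ord(ch) <= 0xFA0B

def has_legacy_duplicates(text: str) -> bool:
    """True if text contains a character that normalization would fold."""
    return any(is_legacy_duplicate(ch) for ch in text)

# After NFC (or NFD) the three code points are indistinguishable.
collapsed = {unicodedata.normalize("NFC", ch) for ch in DUPLICATES}

print(len(set(DUPLICATES)))  # 3
print(collapsed == {"\u6A02"})  # True
print(has_legacy_duplicates("\uAC00\uF914"))  # True
```

    A sorting routine for legacy data could use a check like
    has_legacy_duplicates to decide whether it is safe to normalize
    a string before computing its sort key.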

    --Ken


