From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Aug 23 2005 - 17:50:07 CDT
Antoine noted:
> Ken (probably purposely) is dropping a fifth (!) encoding scheme for Hangul:
> U+3400..U+4DB5 (removed in 1996). Which is probably a forgotten thing now
> (for the best), but certainly was a headache some years ago.
Yes. It wasn't really a distinct way of encoding Hangul syllables --
just a different encoding of a subset of the ones that ended up
being moved to AC00..D7A3 by Amendment 5 to 10646-1:1993. They are
a mess for anyone who has to interoperate with very early
implementations of Unicode 1.1, but fortunately they are ancient
history now for most of us.
> BTW, diacriticked Latin is encoded at least thrice, and the same algorithms
> used for reduction of the latter could be used for the former, couldn't
> they?
Well, the Unicode Collation Algorithm makes use of equivalences
built into the standard (canonical and compatibility decompositions)
to produce meaningful weight foldings for Latin letters in the
Default Unicode Collation Element Table, if that's what you mean.
The same thing *could* be done for a double encoding of
the Tamil script. But of course that raises the question: Why do
such a thing in the first place?
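As a rough sketch of that kind of weight folding -- a toy in Python
using only canonical decomposition from the standard unicodedata
module, not the real DUCET weights -- you can approximate a
primary-level comparison by decomposing and discarding combining
marks:

```python
import unicodedata

def primary_key(s):
    # Decompose (NFD), then keep only non-mark characters. This
    # approximates a primary-level weight fold: base letters compare
    # equal regardless of diacritics. (Illustration only; the real
    # UCA also weights the accents at secondary level.)
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).lower()

words = ["côte", "coter", "cote", "coté"]
print(sorted(words, key=primary_key))
# -> ['côte', 'cote', 'coté', 'coter']
```

The accented forms all fold to the same primary key "cote", so they
group together ahead of "coter", which is the general shape of what
the DUCET does for Latin (with proper secondary weighting on top).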
The monstrous messes in the standard for the Hangul and Latin
scripts are not a *desirable* state to be wished for in other
scripts. :-)
> > But sorting *Korean* in Unicode
>
> Doesn't it mean collating Hanja characters as well? (so intermixing them
> with their Hangul reading, etc.)
There are at least *two* additional issues when you get to the
Hanja characters.
First, you have to ensure the syllabic integrity of Hangul when
doing the weighting, so that the injection of a Hanja character
doesn't cause a wrong ordering by being compared incorrectly with
a part of a Hangul syllable. This is handled in part in the
UCA by giving CJK characters very high weights at the end of
the table. But you have to worry about the weighting of final
jamos, as well.
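The syllabic-integrity point is easier to see if you recall that
every precomposed syllable in AC00..D7A3 decomposes arithmetically
into jamo, per the standard algorithm in the Unicode core
specification -- a collator has to weight that jamo sequence as one
syllable, not as independent letters a Hanja weight could land
between. A minimal sketch of the decomposition:

```python
# Arithmetic decomposition of a precomposed Hangul syllable,
# following the standard algorithm in the Unicode core spec.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant

def decompose_syllable(ch):
    s_index = ord(ch) - S_BASE
    l = L_BASE + s_index // N_COUNT               # leading consonant
    v = V_BASE + (s_index % N_COUNT) // T_COUNT   # vowel
    t = T_BASE + s_index % T_COUNT                # optional trailing
    jamo = [chr(l), chr(v)]
    if t != T_BASE:
        jamo.append(chr(t))
    return jamo

# U+D55C (the syllable HAN) = leading HIEUH + vowel A + trailing NIEUN
print(decompose_syllable('\uD55C'))
```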
Second, if you are doing a phonetic-order sort of Korean, *including*
Hanja, then you need to have the correct reading of each Hanja.
The legacy character sets for Korean had hacks in place for
this, involving separate encoding of duplicate Hanja for
pronunciation variants, so that strings could sort correctly
depending on *which* duplicate was used to encode the Hanja
character. Those duplicates ended up in Unicode as compatibility
mapping characters: U+F900..U+FA0B. Ordinarily you wouldn't
see those in de novo Korean data on a Unicode system, but
if you are interoperating with a legacy system, you might --
and if you have to sort that data, you need to track separate
weights for the duplicates and *not* normalize them away.
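You can see why normalization is fatal here with a one-liner in
Python (using the stock unicodedata module): the duplicates in
U+F900..U+FA0B carry *canonical* singleton decompositions, so even
NFC collapses them onto the unified ideograph and the reading
distinction is gone:

```python
import unicodedata

# U+F900 is one of the duplicate Hanja inherited from the Korean
# legacy standard; it canonically decomposes to the unified
# ideograph U+8C48, so NFC (not just NFKC) erases the duplicate.
dup = '\uF900'
print(unicodedata.name(dup))                # CJK COMPATIBILITY IDEOGRAPH-F900
print(unicodedata.normalize('NFC', dup))    # same character as U+8C48
print(unicodedata.normalize('NFC', dup) == '\u8C48')  # True
```

So a tailored Korean collation that honors the legacy readings has
to assign those code points their own weights *before* any
normalization step touches the data.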
--Ken
This archive was generated by hypermail 2.1.5 : Tue Aug 23 2005 - 17:51:13 CDT