From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 07 2003 - 20:18:13 EDT
Asmus wrote:
> Both our external environment and our practical experience with the use and
> effect of decompositions has expanded since they were designed nearly 10
> years ago. It's time to take the consequences. If the existing
> decompositions are essentially frozen (and I agree that they must be), that
> means adding additional properties, so implementers can get back a clear
> set of mappings that are graduated by their effect and suitable context of
> applicability.
Amen, brother! Testi-fie!
Seriously, the existing decompositions, which have a long history,
and which were originally created (starting in 1993) as a kind
of set of helpful annotations, before they morphed into the
basis for the formal normalization framework they now serve, are
often getting in the way of people understanding the Unicode Standard,
rather than helping them.
Only people who have had long, continued experience with the
twists and turns of the last decade, or who make the effort
to lay out the Unicode 1.0, 1.1, 2.0, and 3.0 documentation
side-by-side and to fire up the greps and diffs on all the
versions of UnicodeData.txt over the years can really follow
what has gone on or why many of these mappings ended up the
way they are now.
I agree that it is probably time to start on the process of
creating a new set of more nuanced (and documented) equivalence
mappings for the Unicode Standard -- ones that are not
encumbered by the immutability implied by the Normalization
algorithm.
Who knows, it could even become a fun group project, where
one person gets to track down all the instances of characters
that are equivalent to a base letter + accent sequence,
another gets to track down all instances of characters
that might evaluate to 6 (including, for instance, U+03DD
GREEK SMALL LETTER DIGAMMA), another gets to track down
all the glottal stops (including U+02BB, U+02BC, U+02BE,
U+02C0, and -- trivia question, not yet in Unicode 4.0 --
U+097D DEVANAGARI GLOTTAL STOP), and another gets to track
down all the characters whose glyphs look like a dot, ...
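Just to illustrate how a couple of those hunts might start, here is a
rough sketch in Python 3 -- run against whatever UCD version the
interpreter's unicodedata module happens to carry, not Unicode 4.0
specifically, and with the caveat that the formal properties won't
catch everything: letters like digamma that were merely *used* as the
numeral 6 generally carry no Numeric_Value and still have to be
collected by hand.

import sys
import unicodedata

base_plus_accent = []   # canonically equivalent to base letter + combining mark
evaluates_to_six = []   # formally carry numeric value 6 in the UCD

for cp in range(sys.maxunicode + 1):
    ch = chr(cp)

    # Canonical (untagged) decomposition mappings of exactly two code
    # points, where the second one is a combining mark.
    decomp = unicodedata.decomposition(ch)
    if decomp and not decomp.startswith('<'):
        parts = [chr(int(p, 16)) for p in decomp.split()]
        if len(parts) == 2 and unicodedata.combining(parts[1]):
            base_plus_accent.append(ch)

    # Characters whose formal numeric value is 6; letters used as
    # numerals in alphabetic systems mostly won't show up here.
    if unicodedata.numeric(ch, None) == 6:
        evaluates_to_six.append(ch)

print(len(base_plus_accent), "characters decompose to base + accent")
print(len(evaluates_to_six), "characters formally evaluate to 6")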
Another consideration to keep in mind is that the compatibility
decompositions have always been implicated in an oft-suggested,
never-completed project for "Cleanicode" -- Unicode
as she ought to have been, if legacy compatibility hadn't
been an issue for the encoding. I think there may still be
some value in someone trying to sift out all the legacy
compatibility hacks in Unicode to express how the various
scripts (and symbol sets) could have been encoded right (and
in some cases, still can be implemented). The Braille
symbol set is a *good* example: it is a complete,
rationalized set, and it is hard to imagine, now, doing
it any differently. Korean represents the opposite extreme,
with three different legacy representations encoded (Hangul
syllables, compatibility jamos, half-width jamos) in
addition to the recommended conjoining jamos. And even the
conjoining jamos still have some issues when applied to
Old Korean syllables.
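To make the Korean situation concrete, here is a small Python 3 sketch
(again using unicodedata; the sample characters -- the syllable GA and
its KIYEOK + A pieces -- are just my own illustrative picks, and the
Old Korean complications are not touched). It shows that the
precomposed syllables are *canonically* equivalent to conjoining jamo
sequences, while the compatibility and half-width jamos reach them
only through *compatibility* decomposition.

import unicodedata

samples = [
    ('precomposed syllable', '\uAC00'),        # HANGUL SYLLABLE GA
    ('compatibility jamos',  '\u3131\u314F'),  # HANGUL LETTER KIYEOK + HANGUL LETTER A
    ('half-width jamos',     '\uFFA1\uFFC2'),  # HALFWIDTH HANGUL LETTER KIYEOK + A
    ('conjoining jamos',     '\u1100\u1161'),  # HANGUL CHOSEONG KIYEOK + JUNGSEONG A
]

for label, s in samples:
    nfd = unicodedata.normalize('NFD', s)
    nfkd = unicodedata.normalize('NFKD', s)
    print('%-21s NFD=%s NFKD=%s' % (
        label,
        ' '.join('U+%04X' % ord(c) for c in nfd),
        ' '.join('U+%04X' % ord(c) for c in nfkd)))

# The syllable already yields U+1100 U+1161 under NFD (canonical
# equivalence); the compatibility and half-width jamos only get there
# under NFKD, which is part of why they count as legacy hacks.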
That Cleanicode project is (or ought to be) distinct from
the kind of project Asmus has in mind: providing more
precise, graduated equivalence mappings that implementations
can use to actually produce the results people expect, but
which they may not get today just from the normalization
forms.
--Ken