From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed May 07 2003 - 17:30:41 EDT
At 03:17 PM 5/7/03 -0400, John Cowan wrote:
>Q: What's the difference between canonical and compatibility decomposition?
>
>A: Replacing a character by its canonical decomposition, which is either
>one or two characters long, does not destroy information, and makes no
>practical difference for most purposes.
>
>Replacing a character by its compatibility decomposition, which may be
>of any length, does destroy information, but typically transforms the
>character into better-known characters that may be easier to process.
Actually, that describes the ideal - in the historic process of creating
and maintaining these decompositions, that ideal has been compromised.
The canonical decompositions were applied to CJK compatibility characters,
essentially negating their purpose, and causing big practical problems in
all environments where they are used. It's arguable that they should have
been made compatibility decompositions.
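To make the CJK case concrete, here is a minimal sketch using Python's standard unicodedata module (my choice of tool for illustration, not something used in the discussion itself). U+F900 is picked only as an example of a CJK compatibility ideograph.

```python
import unicodedata

compat_ideograph = "\uf900"  # CJK COMPATIBILITY IDEOGRAPH-F900

# A mapping without a <tag> prefix is a canonical decomposition.
mapping = unicodedata.decomposition(compat_ideograph)
print(mapping)

# Because the mapping is canonical, every normalization form, NFC included,
# replaces the compatibility ideograph with the unified ideograph.
results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    results[form] = unicodedata.normalize(form, compat_ideograph)
    print(form, f"U+{ord(results[form]):04X}")
```

Since even NFC performs the replacement, a round trip through any normalizing process loses the distinction the compatibility character was encoded to preserve.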
The compatibility decompositions of positional Arabic forms in principle
don't destroy any information - applying their compatibility decompositions
makes little practical difference. As far as compatibility decompositions
go, they are as close to canonical as they come.
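As an illustration, the following sketch (again assuming Python's unicodedata module, with the four presentation forms of ARABIC LETTER BEH chosen as an arbitrary example) shows that these mappings carry a positional tag and are applied only by the compatibility forms.

```python
import unicodedata

# Isolated, final, initial and medial forms of ARABIC LETTER BEH (U+0628).
beh_forms = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

for ch in beh_forms:
    # The <isolated>/<final>/<initial>/<medial> tag marks a compatibility mapping.
    print(f"U+{ord(ch):04X}", unicodedata.decomposition(ch))

# NFC leaves the positional forms alone; NFKC folds all four to the base letter.
nfc = [unicodedata.normalize("NFC", ch) for ch in beh_forms]
nfkc = {unicodedata.normalize("NFKC", ch) for ch in beh_forms}
print(nfc == beh_forms, nfkc == {"\u0628"})
```

The positional shape can normally be recomputed from the neighbouring letters by a rendering engine, which is why so little is lost here.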
Finally, there are surprisingly many contexts in which applying
compatibility decompositions doesn't merely destroy some information about
the character, but can radically alter or destroy the meaning of the text.
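Two small cases show the kind of damage meant here. This is only a sketch, assuming Python's unicodedata module; the two sample strings are my own examples, not taken from the discussion.

```python
import unicodedata

exponent = "2\u2075"  # "2" followed by SUPERSCRIPT FIVE: two to the fifth power
fraction = "1\u00bd"  # "1" followed by VULGAR FRACTION ONE HALF: one and a half

folded_exponent = unicodedata.normalize("NFKC", exponent)
folded_fraction = unicodedata.normalize("NFKC", fraction)

# The exponent becomes the plain number twenty-five.
print(folded_exponent)
# The mixed number becomes "11", FRACTION SLASH (U+2044), "2".
print(folded_fraction)
```

In both cases the result is still well-formed text, but it now says something different from the original.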
We would be better off with a different classification: (*)
- informationally equivalent
- semantically equivalent (or semantically neutral)
- simplifying (or fuzzy equivalent)
The first would be limited to a core of current canonical decompositions.
The second would contain the CJK compatibility (canonical) decompositions, the
Arabic positional forms (compatibility), etc.
The third would contain the remainder, but would be augmented by other
types of fuzzy equivalence not currently in compatibility mappings.
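A rough sketch of how such a three-way split might be approximated from existing data follows. It is purely hypothetical: the classify function, the tag list and the name test are assumptions made for illustration, it uses Python's unicodedata module, and it encodes only the examples named above. Anything else with a compatibility mapping is lumped into the third class.

```python
import unicodedata

# Assumption: these decomposition tags identify the Arabic positional forms.
POSITIONAL_TAGS = ("<isolated>", "<initial>", "<medial>", "<final>")

def classify(ch):
    """Return a hypothetical equivalence class for one character, or None."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return None  # the character has no decomposition mapping
    if not mapping.startswith("<"):
        # Canonical mapping: single out the CJK compatibility ideographs.
        if unicodedata.name(ch, "").startswith("CJK COMPATIBILITY IDEOGRAPH"):
            return "semantically equivalent"
        return "informationally equivalent"
    if mapping.startswith(POSITIONAL_TAGS):
        return "semantically equivalent"
    return "simplifying"

for sample in ("\u00e9", "\uf900", "\ufe91", "\u00b2", "a"):
    print(f"U+{ord(sample):04X}", classify(sample))
```

A real proposal would need an explicit property rather than heuristics on names and tags, since "a core of" the canonical decompositions cannot be derived mechanically.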
Mappings (foldings) like HalfWidth/FullWidth folding either go into the
semantically neutral category or they may need a category of their own.
They are fairly semantically neutral, but unlike the other two I gave as
examples, they are fairly visible.
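A width folding can be separated out from the other compatibility mappings by looking at the decomposition tag. The sketch below assumes Python's unicodedata module; fold_width is a hypothetical helper written for this illustration, not an existing API, and it does not recompose halfwidth voiced sound marks.

```python
import unicodedata

def fold_width(text):
    """Apply only the <wide> and <narrow> compatibility mappings."""
    out = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        if mapping.startswith(("<wide>", "<narrow>")):
            # The mapping is the tag followed by hexadecimal code points.
            code_points = mapping.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            out.append(ch)  # every other character is left untouched
    return "".join(out)

# Fullwidth "ABC", "x" with SUPERSCRIPT TWO, and a halfwidth katakana KA.
sample = "\uff21\uff22\uff23 x\u00b2 \uff76"
folded = fold_width(sample)
print(folded)
```

The superscript survives because its mapping carries a different tag, so this folding changes the visible width without touching the other distinctions.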
(See for example http://www.unicode.org/reports/tr30
which contains an earlier draft of a discussion of character folding, and
which I plan to update soon).
This all fits, by the way, into the ongoing discussion of making
Normalization tailorable, primarily in order to remove the deficiencies of
having included some merely semantically equivalent mappings with the pure
informational equivalences. Primarily this affects the CJK compatibility
characters, but nobody, having learned from our previous experience, feels
inclined to settle this once and for all; hence the more general concept of
'tailoring', which would allow for better adjustments in the future.
A./
(*) as we can't take away the existing decompositions, and their
definitions, any such proposal would have to be considered as adding