From: Jeff Senn (senn@maya.com)
Date: Tue Sep 22 2009 - 14:36:45 CDT
Can someone sort out an ambiguity for me regarding composition during
normalization? Either I misunderstand, or a couple of widely deployed
implementations have bugs, and/or the standards docs imply an
inconsistency.
Here are two test cases. The question is: can the characters be
*canonically* combined during normalization?
case 1: 1B11, 1B35
case 2: 0CCA, 0CD5
Both sequences are the targets of canonical (non-compatibility)
decomposition mappings:
1B12 --> 1B11, 1B35
0CCB --> 0CCA, 0CD5
All of these characters have combining class 0. Can they be canonically
combined? Even though the 2nd characters are NOT "combining"?
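The properties cited above can be checked against whatever Unicode data a
local Python ships with. This is only a sketch using the standard
unicodedata module; the results depend on the Unicode version bundled with
the interpreter, and the helper name describe() is made up for illustration.

```python
import unicodedata

def describe(cp):
    """Return (decomposition mapping, canonical combining class) for a code point."""
    ch = chr(cp)
    # decomposition() returns the raw mapping as hex code points, "" if none;
    # a compatibility mapping would start with a tag such as "<compat>".
    return unicodedata.decomposition(ch), unicodedata.combining(ch)

# The two composites, then the four characters from cases 1 and 2.
for cp in (0x1B12, 0x0CCB, 0x1B11, 0x1B35, 0x0CCA, 0x0CD5):
    mapping, ccc = describe(cp)
    print(f"U+{cp:04X}  decomposition={mapping!r}  ccc={ccc}")
```

A mapping without a leading "<...>" tag is a canonical one, which is what
matters for the question here.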
My read of the current UAX15 implies "yes".
UAX15 currently says: "D2. In any character sequence beginning with a
starter S, a character C is blocked from S if and only if there is
some character B between S and C, and either B is a starter or it has
the same or higher combining class as C."
Since there is no character between S and C, I assume C is not
"blocked".
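To make my reading of D2 concrete, here is a minimal sketch of the
"blocked" test as quoted above. The function name is_blocked() is
hypothetical, and the code assumes (without checking) that the first
character of the sequence is the starter S; it is an illustration of the
definition, not how any particular implementation does it.

```python
import unicodedata

def is_blocked(seq, c_index):
    """D2 as quoted: seq[0] is the starter S, C is seq[c_index].

    C is blocked from S iff some character B strictly between them is a
    starter (ccc == 0) or has ccc(B) >= ccc(C).
    """
    ccc_c = unicodedata.combining(seq[c_index])
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0 or ccc_b >= ccc_c:
            return True
    return False

# Nothing lies between S and C in cases 1 and 2, so the loop never runs.
print(is_blocked("\u1b11\u1b35", 1))
print(is_blocked("\u0cca\u0cd5", 1))
```

Note that "not blocked" only says composition is permitted by D2; whether a
primary composite exists for the pair is a separate lookup.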
However, a previous draft of UAX15 uses the phrase "A combining
character C can be canonically combined with a base character B...",
which implies "No".
http://unicode.org/reports/tr15/
http://unicode.org/reports/tr15/pdtr15.html
At least two implementations do the combination in case 2: Python and
the ICU library (e.g. http://minaret.info/test/normalize.msp ).
However, ICU seems inconsistent in that it does NOT combine case 1!
So, if the answer is indeed "YES", one might add
case 3: 0CCA, 0300, 0CD5 (admittedly unusual)
which clearly should not compose: 0300 sits between 0CCA and 0CD5, and
ccc(0300) = 230 >= ccc(0CD5) = 0, so 0CD5 is blocked from 0CCA
(http://www.unicode.org/review/pr-29.html).
Python, however, (incorrectly) yields: 0CCB, 0300
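For anyone who wants to reproduce this, the three cases can be run through
Python's standard unicodedata.normalize() roughly as follows. The helper
name nfc_hex() is made up for illustration, and what gets printed will
depend on the Python version and the Unicode data it bundles, so I
deliberately give no expected output here.

```python
import unicodedata

CASES = {
    "case 1": "\u1b11\u1b35",
    "case 2": "\u0cca\u0cd5",
    "case 3": "\u0cca\u0300\u0cd5",
}

def nfc_hex(s):
    """NFC-normalize s and return the result as a list of hex code points."""
    return [f"{ord(ch):04X}" for ch in unicodedata.normalize("NFC", s)]

# Report which Unicode data version this interpreter was built with.
print("Unicode data version:", unicodedata.unidata_version)
for name, s in CASES.items():
    print(name, [f"{ord(ch):04X}" for ch in s], "->", nfc_hex(s))
```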
Help!