NFC/NFKC Normalization Edge Case

From: Jeff Senn (senn@maya.com)
Date: Tue Sep 22 2009 - 14:36:45 CDT

Next message: Bjoern Hoehrmann: "Re: NFC/NFKC Normalization Edge Case"

Previous message: Asmus Freytag: "Re: Run-time checking of fonts for Sinhala support"
Next in thread: Bjoern Hoehrmann: "Re: NFC/NFKC Normalization Edge Case"
Reply: Bjoern Hoehrmann: "Re: NFC/NFKC Normalization Edge Case"
Maybe reply: Kenneth Whistler: "Re: NFC/NFKC Normalization Edge Case"
Maybe reply: Jeff Senn: "Re: NFC/NFKC Normalization Edge Case"

Can someone sort out an ambiguity for me in composition during
normalization? Either I misunderstand, or a couple of widely deployed
implementations have bugs, and/or the standards docs imply an
inconsistency.

Here are two test cases. The question is: can the characters be
*canonically* combined during normalization?

case 1: 1B11, 1B35
case 2: 0CCA, 0CD5

There are (non-compatibility) decompositions for both of these
sequences:

1B12 --> 1B11, 1B35
0CCB --> 0CCA, 0CD5

All of these characters have combining class 0. Can they be canonically
combined? Even though the 2nd characters are NOT "combining"?
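
For anyone who wants to check the properties quoted above, here is a
minimal sketch using Python's standard unicodedata module. The helper
name describe is only for illustration, and the values reported depend
on the Unicode data version bundled with your Python build.

```python
import unicodedata

def describe(hex_cp):
    """Return (name, canonical combining class, decomposition mapping)
    for a code point given as a hex string (hypothetical helper)."""
    ch = chr(int(hex_cp, 16))
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.combining(ch),       # canonical combining class (ccc)
        unicodedata.decomposition(ch),   # empty string if no mapping
    )

for cp in ("1B11", "1B35", "1B12", "0CCA", "0CD5", "0CCB"):
    name, ccc, decomp = describe(cp)
    print(f"U+{cp}  ccc={ccc}  decomposition={decomp!r}  {name}")
```

A canonical mapping appears with no <tag> prefix in the decomposition
field; a compatibility mapping would carry one.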

My read of the current UAX15 implies "yes".

UAX15 currently says: "D2. In any character sequence beginning with a
starter S, a character C is blocked from S if and only if there is
some character B between S and C, and either B is a starter or it has
the same or higher combining class as C."

Since there is no character between S and C, I assume C is not "blocked".
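
The D2 wording quoted above can be turned into a small predicate. This
is only a sketch of my literal reading of that one sentence, not of the
full composition algorithm; the function name is_blocked is an
assumption, and the sequence is assumed to begin with the starter S.

```python
import unicodedata

def is_blocked(seq, c_index):
    """Literal reading of D2: seq[0] is the starter S, seq[c_index] is C.
    C is blocked iff some B strictly between them is a starter (ccc 0)
    or has the same or higher combining class as C."""
    ccc_c = unicodedata.combining(seq[c_index])
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0 or ccc_b >= ccc_c:
            return True
    return False

case1 = "\u1b11\u1b35"            # nothing between S and C
with_grave = "\u0cca\u0300\u0cd5"  # U+0300 (ccc 230) sits between S and C

print(is_blocked(case1, 1))       # no intervening character
print(is_blocked(with_grave, 2))  # intervening mark with ccc >= ccc(C)
```

Under this reading, the first call reports "not blocked" simply because
the loop over intervening characters is empty.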

However, a previous draft of UAX15 uses the phrase "A combining
character C can be canonically combined with a base character B...",
which implies "No".

http://unicode.org/reports/tr15/
http://unicode.org/reports/tr15/pdtr15.html

At least 2 implementations do the combination in case 2: Python and
the ICU library (e.g. http://minaret.info/test/normalize.msp ).
However, ICU seems inconsistent in that it does NOT combine case 1!

So, if the answer is indeed "YES", one might add

case 3: 0CCA, 0300, 0CD5 (admittedly unusual)

which clearly should not compose since ccc(0300) >= ccc(0CD5)
(http://www.unicode.org/review/pr-29.html)

Python, however, (incorrectly) yields: 0CCB, 0300
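
To reproduce the three cases on your own installation, something like
the following should do. It is a sketch: the helper nfc_codepoints is
hypothetical, and the output depends on the Python version and its
bundled Unicode data, so no expected results are listed here.

```python
import unicodedata

def nfc_codepoints(hex_cps):
    """NFC-normalize a sequence of hex code points and return the
    result as a list of 4-digit uppercase hex strings."""
    text = "".join(chr(int(cp, 16)) for cp in hex_cps)
    return [f"{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]

cases = {
    "case 1": ["1B11", "1B35"],
    "case 2": ["0CCA", "0CD5"],
    "case 3": ["0CCA", "0300", "0CD5"],
}

print("Unicode data version:", unicodedata.unidata_version)
for label, cps in cases.items():
    print(label, " ".join(cps), "->", " ".join(nfc_codepoints(cps)))
```

For case 3, an output of 0CCB 0300 would be the behaviour reported
above; 0CCA 0300 0CD5 would mean the starter was treated as blocked.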

Help!


This archive was generated by hypermail 2.1.5 : Tue Sep 22 2009 - 15:36:12 CDT