Public Review Issues

Accumulated Feedback on PRI #292

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Mon Mar 9 12:46:53 CDT 2015
Name: Roozbeh Pournader
Report Type: Public Review Issue
Opt Subject: UTS #39 has limitations that make it hard to use for some common confusables

Presently, the data model used in UTS #39 has various limitations, mostly
around recursive references and assumptions of transitivity.

For example, it assumes that a character cannot be confusable with another
string that includes the character, and it assumes that if the character X is
confusable with Y and Y is confusable with Z, then X is also confusable with
Z. It also assumes that if X is confusable with Y, then XZ is confusable with
YZ.
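
A minimal sketch of how a prototype-mapping model bakes in the last two
assumptions. It assumes a simplified skeleton function (NFD, per-character
mapping, NFD again) and a toy table; PROTOTYPES and the helper names are
illustrative only and are not the actual UTS #39 data or API:

```python
import unicodedata

# Toy prototype table (illustrative assumption, NOT the real confusables data).
PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
    "\u03B1": "a",  # GREEK SMALL LETTER ALPHA -> LATIN SMALL LETTER A
}


def skeleton(s: str) -> str:
    """Simplified skeleton: NFD, map each character to its prototype, NFD."""
    decomposed = unicodedata.normalize("NFD", s)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(x: str, y: str) -> bool:
    """Two strings count as confusable iff their skeletons are equal."""
    return skeleton(x) == skeleton(y)


# Transitivity is forced: both letters map to "a", so they are confusable
# with each other even though the table never lists them as a pair.
cyrillic_vs_greek = confusable("\u0430", "\u03B1")

# Concatenation is forced too: appending the same suffix to both sides
# can never separate two strings that were already confusable.
with_suffix = confusable("\u0430" + "b", "\u03B1" + "b")
```

Because confusability here is just equality of skeletons, it is always an
equivalence relation that survives concatenation; the table has no way to
say "X with Y, Y with Z, but not X with Z".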

None of these assumptions really holds in practice, and this creates
problems both in using the standard and in maintaining the data. For example:

1. It appears that the most frequent Latin grapheme cluster on the web with
more than one character in NFC is <0069, 0307> <small letter i, combining dot
above>. (The sequence has no real use, but is still very frequent, most
probably due to faulty Turkish and Azerbaijani input methods or case
conversion algorithms.) According to the standard, this sequence should be
rendered exactly the same as <0069>. Most good fonts and rendering engines do
exactly that. All this makes the pair a very important case for confusability,
but the current model in UTS #39 doesn't allow representing this.
(Interestingly, if you add another 0307 to both sequences, the resulting
sequences are no longer necessarily confusable: <0069, 0307> is not
necessarily confusable with <0069, 0307, 0307>, as the latter has two dots
above it.)

2. U+066C ARABIC THOUSANDS SEPARATOR appears in various shapes, including a
high-6 quote, a high-9 quote, a European comma, and an Arabic comma, and
should thus be considered confusable with all of them. But this shouldn't
make the European comma confusable with a single quote.

3. Due to its contextual forms, the Unicode 9.0 Arabic letter U+08BC AFRICAN
QAF is confusable with U+066F DOTLESS QAF and U+0641 FEH. DOTLESS QAF is
confusable with U+06A1 DOTLESS FEH, while FEH is confusable with U+06A7 QAF
WITH DOT ABOVE. DOTLESS FEH is in turn confusable with U+08BB AFRICAN FEH,
which is confusable with U+06A2 FEH WITH DOT MOVED BELOW. This basically puts
all one-dotted-above, one-dotted-below, and dotless FEH and QAF Arabic letters
in one class, many of which are not confusable with each other at all.
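
The three cases above can be checked with a short Python sketch. The pair
list for item 3 is taken directly from the text. For item 2, representing
the high-6 and high-9 quotes as U+2018 and U+2019 and the Arabic comma as
U+060C is an assumption made for illustration. The helper names
(transitive_classes, directly_confusable) are hypothetical and not part of
any UTS #39 API:

```python
import unicodedata

# Item 1: <0069, 0307> has no precomposed form, so NFC keeps two code points.
I_DOT = "\u0069\u0307"
nfc_i_dot = unicodedata.normalize("NFC", I_DOT)
# Full lowercasing of U+0130 is one way such sequences get produced.
lowered_capital_i_dot = "\u0130".lower()

# Item 2: shapes of U+066C (quote and comma code points are assumptions).
COMMA_PAIRS = [
    (0x066C, 0x2018),  # high-6 quote (assumed LEFT SINGLE QUOTATION MARK)
    (0x066C, 0x2019),  # high-9 quote (assumed RIGHT SINGLE QUOTATION MARK)
    (0x066C, 0x002C),  # European comma
    (0x066C, 0x060C),  # Arabic comma (assumed ARABIC COMMA)
]

# Item 3: pairwise statements exactly as listed in the text.
QAF_FEH_PAIRS = [
    (0x08BC, 0x066F),  # AFRICAN QAF ~ DOTLESS QAF
    (0x08BC, 0x0641),  # AFRICAN QAF ~ FEH
    (0x066F, 0x06A1),  # DOTLESS QAF ~ DOTLESS FEH
    (0x0641, 0x06A7),  # FEH ~ QAF WITH DOT ABOVE
    (0x06A1, 0x08BB),  # DOTLESS FEH ~ AFRICAN FEH
    (0x08BB, 0x06A2),  # AFRICAN FEH ~ FEH WITH DOT MOVED BELOW
]


def transitive_classes(pairs):
    """Merge pairs into classes with union-find, i.e. assume transitivity."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())


def directly_confusable(a, b, pairs):
    """True only if the pair was stated explicitly, in either order."""
    return (a, b) in pairs or (b, a) in pairs


comma_classes = transitive_classes(COMMA_PAIRS)
qaf_feh_classes = transitive_classes(QAF_FEH_PAIRS)
```

Under the transitive model, all seven letters of item 3 fall into a single
class, so QAF WITH DOT ABOVE and FEH WITH DOT MOVED BELOW end up together
even though no pair in the list relates them directly. The same happens to
the European comma and the single quotes in item 2. A model that stores
pairs, as directly_confusable does, avoids this merging.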

Mark Davis has mentioned some ideas about how to fix some of these issues, but
I think we need to track this at the UTC level.