This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Mon Mar 9 12:46:53 CDT 2015
Name: Roozbeh Pournader
Report Type: Public Review Issue
Opt Subject: UTS #39 has limitations that make it hard to use for some common confusables
Presently, the data model used in UTS #39 has various limitations, mostly around recursive references and assumptions of transitivity. For example, it assumes that a character cannot be confusable with a string that includes that character, and it assumes that if X is confusable with Y and Y is confusable with Z, then X is also confusable with Z. It also assumes that if X is confusable with Y, then XZ is confusable with YZ. None of these assumptions really holds in practice, and this creates problems in using the standard and maintaining the data. For example:

1. It appears that the most frequent Latin grapheme cluster on the web with more than one character in NFC is <0069, 0307> (small letter i, combining dot above). (The sequence has no real use, but is still very frequent, most probably due to faulty Turkish and Azerbaijani input methods or case conversion algorithms.) According to the standard, this sequence should be rendered exactly the same as <0069>. Most good fonts and rendering engines do exactly that. All this makes the pair a very important case for confusability, but the current model in UTS #39 doesn't allow representing it. (Interestingly, if you add another 0307 to both sequences, the resulting sequences are no longer necessarily confusable: <0069, 0307> is not necessarily confusable with <0069, 0307, 0307>, as the latter has two dots above it.)

2. U+066C ARABIC THOUSANDS SEPARATOR appears in various shapes, including a high-6 quote, a high-9 quote, a European comma, and an Arabic comma, and should thus be considered confusable with all of them. But this shouldn't make the European comma confusable with a single quote.

3. Due to its contextual forms, the Unicode 9.0 Arabic letter U+08BC AFRICAN QAF is confusable with U+066F DOTLESS QAF and U+0641 FEH. DOTLESS QAF is confusable with U+06A1 DOTLESS FEH, while FEH is confusable with U+06A7 QAF WITH DOT ABOVE. DOTLESS FEH is in turn confusable with U+08BB AFRICAN FEH, which is confusable with U+06A2 FEH WITH DOT MOVED BELOW. This basically puts all one-dotted-above, one-dotted-below, and dotless FEH and QAF Arabic letters in one class, many of which are not confusable with each other at all.

Mark Davis has mentioned some ideas about how to fix some of these issues, but I think we need to track this at the UTC level.
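
As a rough illustration of the class-merging effect in example 3, the following Python sketch compares a pairwise confusability relation with the equivalence classes that result when the same pairs are treated as transitive. The pair list contains only the pairs named in example 3 and is not taken from the actual UTS #39 data files; the function names (directly_confusable, transitive_classes, implied_only_pairs) are hypothetical and are not part of any Unicode API.

```python
from itertools import combinations

# Pairs named in example 3 of this report (code points as hex strings).
# This is illustrative data only, not the contents of confusables.txt.
PAIRS = [
    ("08BC", "066F"),  # AFRICAN QAF ~ DOTLESS QAF
    ("08BC", "0641"),  # AFRICAN QAF ~ FEH
    ("066F", "06A1"),  # DOTLESS QAF ~ DOTLESS FEH
    ("0641", "06A7"),  # FEH ~ QAF WITH DOT ABOVE
    ("06A1", "08BB"),  # DOTLESS FEH ~ AFRICAN FEH
    ("08BB", "06A2"),  # AFRICAN FEH ~ FEH WITH DOT MOVED BELOW
]


def directly_confusable(a, b, pairs=PAIRS):
    """Symmetric, reflexive, but NOT transitive: only listed pairs count."""
    return a == b or (a, b) in pairs or (b, a) in pairs


def transitive_classes(pairs):
    """Group code points into classes as if confusability were transitive.

    Uses a small union-find; every listed pair merges two classes.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())


def implied_only_pairs(pairs):
    """Pairs that share a transitive class but were never listed directly."""
    result = []
    for cls in transitive_classes(pairs):
        for a, b in combinations(sorted(cls), 2):
            if not directly_confusable(a, b, pairs):
                result.append((a, b))
    return result


if __name__ == "__main__":
    for cls in transitive_classes(PAIRS):
        print(sorted(cls))
    for a, b in implied_only_pairs(PAIRS):
        print(f"U+{a} and U+{b} merged only by transitivity")
```

Under this sketch, all seven letters end up in a single class, even though, for instance, U+0641 FEH and U+06A2 FEH WITH DOT MOVED BELOW are never listed as a direct pair. A model that stored the relation pairwise would avoid the merge, at the cost of no longer having a single skeleton per class.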