This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Thu May 22 17:20:22 CDT 2014
Name: Laurentiu Iancu
Report Type: Public Review Issue
Opt Subject: PRI #273 - Suggested edits and additions to UTS #39 data and documentation
[Ported from BugTrack ticket #52, http://www.unicode.org/edcom/bugtrack/ticket/52]

UTR #39 text (Revision 8 Draft 2):

1. The examples in Section 4 of UTS #39 currently use Latin characters for both “paypal” words and both “scope” words. The point of those examples appears to be to illustrate actual confusability, so I think the latter word of each pair should be “pаypal” (with the first ‘а’ in Cyrillic, or some variation of it) and “ѕсоре” (all Cyrillic), respectively. The strings here use Cyrillic characters, so they can be copied and pasted directly.

confusables.txt:

2. Some of the comment fields in confusables.txt start with #, others with #*. I could not infer the distinction between the two. It would help if it were documented in UTS #39.

3. [No longer applicable]

4. The set of confusables is open-ended, but following the pattern of m vs. r-followed-by-n, which was expanded somewhat in the 6.3.0 release, several other pairs can be constructed, such as:
- m with hook vs. r followed by eng (0271 vs. 0072 014B); this is more confusable than the existing entry with n and comma below;
- m with palatal hook vs. r followed by n with palatal hook (1D86 vs. 0072 1D87);
- others can be constructed from pieces with various protrusions like hooks and legs, e.g., heng with hook vs. long-s followed by dotless j (0267 vs. 017F 0237).
I wonder what approach is taken to identify such cases and how systematic that approach is.

5. Some precomposed overlaid letters like t with stroke (0167) are listed as confusable with character sequences that use combining overlays. I infer that this is because the former do not have decompositions, but if so, more candidates are missing, such as the Sencoten letters starting at 023A and other precomposed characters with strokes.

6. Is it the case that the SL confusables form a proper subset of the SA confusables, and likewise SA of ML, and ML of MA? If yes, the duplication in confusables.txt would be reduced considerably if each set listed only what it contains in addition to the previous set and inherited everything else from the previous set. It may be by design that the sets are listed in full, but from a different perspective it would be clearer to list only what each set adds. (Of course, if the sets only partially overlap, that argument would not apply.)

confusablesSummary.txt:

7. Is the order of the entries significant? Some entries were reordered in the 6.3.0 release (e.g., 2010 was moved above 02D7, around line 810), and in the 7.0.0 release entries were inserted in an order that is not code point order (e.g., 144A and 16CC inserted between 07F5 and 0374, around line 93). If certain criteria are applied (some measure of confusability?), it would help to document them.
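To make item 1 above concrete, the following minimal sketch lists the code points behind the visually similar strings. The helper name `describe` and the variable names are hypothetical, chosen only for illustration; the strings are written with escapes so the Cyrillic letters cannot be lost in copying.

```python
import unicodedata

def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

latin = "paypal"                               # all Latin letters
mixed = "p\u0430ypal"                          # second letter is Cyrillic U+0430
cyrillic = "\u0455\u0441\u043e\u0440\u0435"    # looks like "scope", all Cyrillic

for label, s in [("latin", latin), ("mixed", mixed), ("cyrillic", cyrillic)]:
    print(label)
    for cp, name in describe(s):
        print(" ", cp, name)
```

The Latin and mixed strings compare unequal even though they may render identically, which is the confusability the examples in Section 4 are meant to show.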
Date/Time: Tue Jun 3 00:02:22 CDT 2014
Name: Roozbeh Pournader
Report Type: Public Review Issue
Opt Subject: PRI 273 issues
The living Arabic, Cyrillic, and Myanmar characters in Unicode 7.0 should be moved to "recommended" instead of "limited-use". Here is a list:

0528..0529
052E..052F
08A1
08B2
A9E7..A9FE
AA7C..AA7F

They are no rarer than several other characters already in "allowed" for these scripts. By putting them into "limited-use", we would be drawing the allowed/limited-use line arbitrarily, by what was encoded in Unicode 6.3 versus what is encoded in 7.0.

For example, U+08A9, which is "recommended", and U+08A1, which would be "limited-use" if we go with the current data, are both used in the same language, Fulfulde (and only in Fulfulde, as far as we know). We simply postponed encoding U+08A1 to 7.0 for architectural reasons, and that should not leave Fulfulde users able to use every letter of their alphabet in identifiers except one.

Alternatively, we can try to define a line for what to push out of "recommended" for existing major scripts, then do the research to find out which characters fall into each bucket. But before we do that homework, I believe the ranges I gave should all go to "recommended".

Also, Mro should be moved from historic to limited-use. Here are the ranges:

16A40..16A5E
16A60..16A69
Date/Time: Wed Jun 25 09:04:33 CDT 2014
Name: Yahyaoui
Report Type: Other Question, Problem, or Feedback
Opt Subject: UTS #39 idempotency of skeleton transform
Hello,

In UTS #39, section 4 Confusable Detection, I read:

> Implementations do not have to recursively apply the mappings, because the transforms are idempotent. That is,
> skeleton(skeleton(X)) = skeleton(X)

However, for the table MA in confusables.txt, I find the following mappings:

0049 ; 006C ; MA # ( I → l ) LATIN CAPITAL LETTER I → LATIN SMALL LETTER L #
042E ; 0049 004F ; MA # ( Ю → IO ) CYRILLIC CAPITAL LETTER YU → LATIN CAPITAL LETTER I, LATIN CAPITAL LETTER O #

Thus, for a string X composed only of the character U+042E, we have:

X = 042E
skeleton(X) = 0049 004F
skeleton(skeleton(X)) = 006C 004F

Thus for this code point, skeleton(skeleton(X)) is not equal to skeleton(X). This problem exists with a dozen other mappings, where a destination code point in one mapping is also a source code point in another mapping.

Can you please confirm whether my understanding is correct? Or should I have recursively applied mappings in skeleton(X)?

Best regards,
Waïl Yahyaoui
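The observation above can be reproduced with a minimal sketch. It assumes a simplified transform: a single non-recursive pass over a table holding only the two MA mappings quoted in the report, with the NFD steps of the full skeleton definition left out (they do not change these particular characters). The names `MA_SAMPLE`, `map_once`, and `code_points` are hypothetical.

```python
# Only the two MA mappings quoted in the report, not the full confusables.txt table.
MA_SAMPLE = {
    "\u0049": "\u006C",        # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER L
    "\u042E": "\u0049\u004F",  # CYRILLIC CAPITAL LETTER YU -> LATIN "IO"
}

def map_once(text, table):
    """Apply one non-recursive pass of the mappings, character by character."""
    return "".join(table.get(ch, ch) for ch in text)

def code_points(text):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

x = "\u042E"
once = map_once(x, MA_SAMPLE)
twice = map_once(once, MA_SAMPLE)

print(code_points(once))   # 0049 004F
print(code_points(twice))  # 006C 004F
```

With this sample table, the second pass changes the result of the first, so the single-pass transform is not idempotent for U+042E.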
Date/Time: Thu Jul 31 11:23:21 CDT 2014
Name: Michael Bobeck
Report Type: Error Report
Opt Subject: Different treatment of Yot cases in xidmodifications.txt
I noticed that in http://unicode.org/Public/security/7.0.0/xidmodifications.txt, and by extension in other related Unicode files, the two cases of Yot are in different categories, one historic and the other technical, as follows:

037F ; restricted ; historic # (J) GREEK CAPITAL LETTER YOT
03F3 ; restricted ; technical # (j) GREEK LETTER YOT

I think both cases of Yot should be in the same historic category. Can you get Unicode to correct this, so that both cases of Yot are in the historic category, just as all other archaic Greek letters (digamma, stigma, heta, san, koppa, sampi, sho) are in the historic category in both http://www.unicode.org/Public/security/revision-03/xidmodifications.txt and http://unicode.org/Public/security/7.0.0/xidmodifications.txt?

Michael Bobeck
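A check for this kind of inconsistency could be sketched as follows. The sketch assumes the semicolon-separated layout of the two lines quoted above (code point or `first..last` range, status, reason, then a `#` comment) and takes the case pairs to compare as an explicit argument. The names `SAMPLE`, `parse_entries`, and `mismatched_pairs` are hypothetical.

```python
# The two lines quoted in the report, not the full xidmodifications.txt file.
SAMPLE = """\
037F ; restricted ; historic # (J) GREEK CAPITAL LETTER YOT
03F3 ; restricted ; technical # (j) GREEK LETTER YOT
"""

def parse_entries(text):
    """Map each code point to its (status, reason) pair."""
    entries = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, reason = (field.strip() for field in data.split(";"))
        first, _, last = code.partition("..")  # single code point or range
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            entries[cp] = (status, reason)
    return entries

def mismatched_pairs(entries, pairs):
    """Return the case pairs whose (status, reason) values differ."""
    result = []
    for upper, lower in pairs:
        if upper in entries and lower in entries and entries[upper] != entries[lower]:
            result.append((upper, entries[upper][1], lower, entries[lower][1]))
    return result

for upper, r1, lower, r2 in mismatched_pairs(parse_entries(SAMPLE), [(0x037F, 0x03F3)]):
    print(f"U+{upper:04X} is {r1}, but U+{lower:04X} is {r2}")
```

Passing the pairs explicitly keeps the check independent of the Unicode version of the case-mapping data available at run time.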
Date/Time: Fri Aug 1 03:25:01 CDT 2014
Name: Mark Davis
Report Type: Public Review Issue
Opt Subject: pri273: add short note
There are two kinds of confusability that we should separate in #39 and #36. One is where the goal is to "fool the user", such as the "paypal" case. The other is where the goal is to "slip by the gatekeeper", such as the "Ⓥ*ⓘ*ⓐ*ⓖ*ⓡ*ⓐ" case. In this latter case, the end user isn't fooled by the characters; instead, the goal is just to be recognizable to the user. The real goal is to fool mechanical gatekeepers, such as spam detectors. In this, it is related to CAPTCHA examples. We should include a short note documenting the "gatekeeper" case, and note that it has not been a goal for the current data.