Accumulated Feedback on PRI #273

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Thu May 22 17:20:22 CDT 2014
Name: Laurentiu Iancu
Report Type: Public Review Issue
Opt Subject: PRI #273 - Suggested edits and additions to UTS #39 data and documentation

[Ported from BugTrack ticket #52, http://www.unicode.org/edcom/bugtrack/ticket/52]

UTS #39 text (Revision 8 Draft 2):

1. The examples in Section 4 of UTS #39 currently use Latin characters for both “paypal” 
words and both “scope” words. It seems that the point of giving those examples is to 
illustrate the actual confusability. So I think that the latter words of the pairs should 
be “pаypal” (with first ‘а’ in Cyrillic, or some variation of it) and, respectively, 
“ѕсоре” (all Cyrillic) – the strings here use Cyrillic characters, so they can be 
copied/pasted directly.
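
As an illustration of the point in item 1, the sketch below lists the code points behind each string, so the Latin and Cyrillic look-alikes can be told apart. The helper name `describe` and the variable names are hypothetical and chosen only for this example; the strings are written with escapes so the script of each letter is unambiguous.

```python
import unicodedata

def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

latin_paypal = "paypal"                              # all Latin
mixed_paypal = "p\u0430ypal"                         # second letter is Cyrillic
cyrillic_scope = "\u0455\u0441\u043e\u0440\u0435"    # all five letters Cyrillic

for label, s in [("latin paypal", latin_paypal),
                 ("mixed paypal", mixed_paypal),
                 ("cyrillic scope", cyrillic_scope)]:
    print(label, repr(s))
    for code_point, name in describe(s):
        print("   ", code_point, name)

# The two "paypal" strings render alike but compare unequal.
print(latin_paypal == mixed_paypal)
```

Printing the names this way makes it easy to confirm that an example in the documentation really contains the intended Cyrillic letters rather than Latin ones.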

confusables.txt:

2. Some of the comment fields in confusables.txt start with #, others with #*. I could 
not infer what the distinction is between the two. It would help if that were documented in UTS #39.

3. [N/A anymore]

4. The set of confusables is open ended, but on the pattern of m vs. r-followed-by-n, which 
had some expansion in the 6.3.0 release, several other pairs can be constructed, such as:
    - m with hook vs. r followed by eng (0271 vs. 0072 014B) – this is more confusable than 
the existing entry with n and comma below;
    - m with palatal hook vs. r followed by n with palatal hook (1D86 vs. 0072 1D87);
    - others can be constructed from pieces with various protrusions like hooks and legs – 
e.g., heng with hook vs. long-s followed by dotless j (0267 vs. 017F 0237); I’m wondering 
what approach is taken to identify such cases and how systematic that approach is.

5. Some precomposed overlaid letters like t with stroke (0167) are listed as confusable 
with character sequences with combining overlays. I would infer that is because the former 
do not have decompositions, but if that is the case, then there are more candidates missing 
such as the Sencoten letters starting at 023A and other precomposed characters with strokes.

6. Is it the case that the SL confusables form a proper subset of the SA confusables, and 
so on compared to ML and then to MA confusables? If yes, the duplication in confusables.txt 
would be reduced quite a bit if each set only listed what that set contains in addition to 
the previous set, and inherited everything else from the previous set. It may be by design 
to list the sets fully, but from a different perspective it would be clearer to list only 
what each set adds. (Of course, if there is only partial overlap, then that 
argument would not apply.)
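
The question in item 6 could be checked mechanically. The following is a minimal sketch, assuming the `source ; target ; table # comment` line layout seen in confusables.txt and assuming that "subset" means the same source maps to the same target in the larger table. The function names `parse_confusables` and `is_nested` are hypothetical, and the sample lines are made up for illustration; they are not real data file content.

```python
from collections import defaultdict

def parse_confusables(lines):
    """Group mappings by table name: {table: {source string: target string}}."""
    tables = defaultdict(dict)
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 3:
            continue
        source, target, table = fields[:3]
        src = "".join(chr(int(cp, 16)) for cp in source.split())
        tgt = "".join(chr(int(cp, 16)) for cp in target.split())
        tables[table][src] = tgt
    return tables

def is_nested(tables, order=("SL", "SA", "ML", "MA")):
    """True if every mapping in each table also appears in the next table."""
    for smaller, larger in zip(order, order[1:]):
        if not tables.get(smaller, {}).items() <= tables.get(larger, {}).items():
            return False
    return True

# Made-up sample lines, for illustration only.
nested_sample = [
    "0441 ; 0063 ; SL # invented sample line",
    "0441 ; 0063 ; SA",
    "0441 ; 0063 ; ML",
    "0441 ; 0063 ; MA",
    "042E ; 0049 004F ; MA",
]
print(is_nested(parse_confusables(nested_sample)))
```

Running the same check over the real file would answer whether the sets are strictly nested or only partially overlapping.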

confusablesSummary.txt:

7. Is the order of the entries significant? Some entries were reordered in the 6.3.0 release 
(e.g., 2010 was moved above 02D7, around line 810) and entries were inserted in an order 
which is not code point order in the 7.0.0 release (e.g., 144A and 16CC inserted between 
07F5 and 0374, around line 93). If certain criteria are applied (some measure of confusability?), 
it would help to document them.

Date/Time: Tue Jun 3 00:02:22 CDT 2014
Name: Roozbeh Pournader
Report Type: Public Review Issue
Opt Subject: PRI 273 issues

The living Arabic, Cyrillic, and Myanmar characters in Unicode 7.0 should be 
moved to "recommended" instead of "limited-use". Here is a list:

0528..0529
052E..052F
08A1
08B2
A9E7..A9FE
AA7C..AA7F

They are not rarer than several other characters already in "allowed" for these
scripts.

By putting them into "limited-use", we would be arbitrarily drawing the line
of allowed/limited-use by what is encoded in Unicode 6.3 vs what is encoded in
7.0. For example, U+08A9, which is "recommended" and U+08A1, which would be
"limited-use" if we go with current data, are actually both used in the same
language, Fulfulde (and just in Fulfulde, as far as we know). We had simply
postponed encoding U+08A1 to 7.0 for some architectural reasons, and that
shouldn't leave Fulfulde users able to use every letter of their alphabet
in identifiers except one.

Alternatively, we can try to define a line about what to push out from
recommended for existing major scripts, then do the research and find out
which characters fall into each bucket. But before we do that homework, I
believe the ranges I gave should all go to recommended.

Also, Mro should be moved from historic to limited-use. Here are the ranges:

16A40..16A5E
16A60..16A69

Date/Time: Wed Jun 25 09:04:33 CDT 2014
Name: Yahyaoui
Report Type: Other Question, Problem, or Feedback
Opt Subject: UTS #39 idempotency of skeleton transform

Hello,

In UTS #39, section 4 Confusable Detection, I read:
> > Implementations do not have to recursively apply the mappings, because the transforms 
> > are idempotent. That is,
> > skeleton(skeleton(X)) = skeleton(X)
However, for the table MA in confusables.txt, I find the following mappings:
0049 ;    006C ;    MA    # ( I → l ) LATIN CAPITAL LETTER I → LATIN SMALL LETTER L    #
042E ;    0049 004F ;    MA    # ( Ю → IO ) CYRILLIC CAPITAL LETTER YU → LATIN CAPITAL LETTER I, LATIN CAPITAL LETTER O    #

Thus, for a string X composed only of the character U+042E, we have:
X = 042E
skeleton(X) = 0049 004F
skeleton(skeleton(X)) = 006C 004F

Thus for this code point, skeleton(skeleton(X)) is not equal to skeleton(X). This problem 
exists with a dozen other mappings, where a destination code point in a mapping is also 
a source code point in another mapping.

Can you please confirm if my understanding is correct? Or should I have recursively 
applied mappings in skeleton(X)?

Best regards,

Waïl Yahyaoui
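
The behavior described in the report above can be reproduced with a small sketch. It assumes the skeleton transform is "NFD, map each character once, NFD again", which is one reading of the UTS #39 definition, and it loads only the two MA mappings quoted in the report. The names `MA`, `skeleton`, and `skeleton_closed` are hypothetical; `skeleton_closed` is just one possible workaround (reapplying until nothing changes), not something the specification prescribes.

```python
import unicodedata

# Only the two MA mappings quoted in the report.
MA = {
    "\u0049": "\u006C",          # I -> l
    "\u042E": "\u0049\u004F",    # Cyrillic YU -> I O
}

def skeleton(text, mapping=MA):
    """Single pass: NFD, map each character once, NFD again."""
    nfd = unicodedata.normalize("NFD", text)
    mapped = "".join(mapping.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", mapped)

def skeleton_closed(text, mapping=MA, max_rounds=10):
    """Reapply skeleton() until the result stops changing."""
    current = text
    for _ in range(max_rounds):
        following = skeleton(current, mapping)
        if following == current:
            return current
        current = following
    raise ValueError("mappings did not converge")

x = "\u042E"
once = skeleton(x)
twice = skeleton(once)
print([f"{ord(c):04X}" for c in once])
print([f"{ord(c):04X}" for c in twice])
print(once == twice)
```

With these two mappings the single-pass result and the twice-applied result differ, which matches the sequences 0049 004F and 006C 004F given in the report.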

Date/Time: Thu Jul 31 11:23:21 CDT 2014
Name: Michael Bobeck
Report Type: Error Report
Opt Subject: Different treatment of Yot cases in xidmodifications.txt

I noticed that in
http://unicode.org/Public/security/7.0.0/xidmodifications.txt, and by
extension in other related Unicode files, the two cases of Yot are in
different categories, one in historic and the other in technical, as follows:

037F          ; restricted ; historic          #      (J)  GREEK CAPITAL LETTER YOT

03F3          ; restricted ; technical        #      (j)  GREEK LETTER YOT

I think that both Yot cases should be in the same historic category. Can you
get Unicode to correct this separate treatment, so that both Yot cases
are in the historic category, like all the other archaic Greek letters
(digamma, stigma, heta, san, koppa, sampi, sho), which are in the historic
category in both
http://www.unicode.org/Public/security/revision-03/xidmodifications.txt
and http://unicode.org/Public/security/7.0.0/xidmodifications.txt?

Michael Bobeck
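
A check for this kind of inconsistency could look like the sketch below. It assumes the `code point(s) ; status ; reason # comment` layout shown in the two lines quoted in the report, and it relies on the case mappings in the Unicode data bundled with the Python interpreter, so results depend on that version. The function names `parse_xidmod` and `case_pair_mismatches` are hypothetical.

```python
def parse_xidmod(lines):
    """Map each code point to its (status, reason) pair."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 3:
            continue
        first, _, last = fields[0].partition("..")   # single value or range
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            table[cp] = (fields[1], fields[2])
    return table

def case_pair_mismatches(table):
    """List (upper, lower, upper entry, lower entry) where the entries differ."""
    mismatches = []
    for cp, entry in sorted(table.items()):
        lower = chr(cp).lower()
        if len(lower) == 1 and ord(lower) != cp:
            other = table.get(ord(lower))
            if other is not None and other != entry:
                mismatches.append((cp, ord(lower), entry, other))
    return mismatches

# The two lines quoted in the report.
sample = [
    "037F          ; restricted ; historic          #      (J)  GREEK CAPITAL LETTER YOT",
    "03F3          ; restricted ; technical        #      (j)  GREEK LETTER YOT",
]
table = parse_xidmod(sample)
for upper, lower, a, b in case_pair_mismatches(table):
    print(f"U+{upper:04X} {a} vs U+{lower:04X} {b}")
```

Run over the full data file, the same function would list any other upper/lower pairs whose categories disagree.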

Date/Time: Fri Aug 1 03:25:01 CDT 2014
Name: Mark Davis
Report Type: Public Review Issue
Opt Subject: pri273: add short note


There are two kinds of confusability that we should separate in #39 and #36. 
One is where the goal is to "fool the user", such as the "paypal" case. The 
other is where the goal is to "slip by the gatekeeper", such as the 
"Ⓥ*ⓘ*ⓐ*ⓖ*ⓡ*ⓐ" case. In this latter case, the end user isn't fooled by 
the characters; instead, the goal is just to be recognizable to the user. 
The real goal is to fool mechanical gatekeepers, such as spam detectors. 
In this, it is related to CAPTCHA examples.

We should include a short note documenting the "gatekeeper" case, and note 
that it has not been a goal for the current data.