L2/13-050

re: Fixes to UAX#31, UTS#39

from: Mark Davis

date: 2013-03-26

Here are a few fixes that are needed, resulting from an investigation of a report of problems from Mozilla (Firefox).

1. We removed colon ( : ) from MidLetter in #29:

http://www.unicode.org/reports/tr29/proposed.html#Table_Word_Break_Property_Values

However, the reason it is in the inclusion table in #31 is that it was in MidLetter, so we should remove it from that table in #31 as well.

That is, remove “003A (:) COLON” from:

http://www.unicode.org/reports/tr31/proposed.html#Table_Candidate_Characters_for_Inclusion_in_Identifiers 

2. Being in that table in #31 is the basis for the ‘inclusion’ value in #39:

http://www.unicode.org/reports/tr39/index.html#Identifier_Modification_Key 

However, the data in http://www.unicode.org/Public/security/revision-05/xidmodifications.txt is not aligned with the values in #31. In particular, KATAKANA MIDDLE DOT, which is part of IDNA2008, is missing.

 30FB; CONTEXTO   # KATAKANA MIDDLE DOT

That is, add the following lines to xidmodifications.txt (and remove the corresponding entries from the other values):

0027 ; allowed ; inclusion # ( ' ) APOSTROPHE

058A ; allowed ; inclusion # ( ֊ ) ARMENIAN HYPHEN

2010 ; allowed ; inclusion # ( ‐ ) HYPHEN

2027 ; allowed ; inclusion # ( ‧ ) HYPHENATION POINT

30A0 ; allowed ; inclusion # ( ゠ ) KATAKANA-HIRAGANA...

30FB ; allowed ; inclusion # ( ・ ) KATAKANA MIDDLE DOT

However, we should also add text that makes it clear that:

Target applications may need to filter these characters. In particular, IDNs have specific requirements on characters that would exclude some of these; some other characters may be restricted on confusability grounds, notably the hyphen.
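As a rough illustration of the data-file change and the filtering note above, the following Python sketch parses lines in the "code ; status ; reason # comment" layout used in this item and then applies an application-specific exclusion. The function names, the "unlisted" default, and the APP_EXCLUDED set are hypothetical and chosen only for illustration; they are not part of #39 or of any IDN profile.

```python
# Sketch only: parse "code ; status ; reason # comment" lines as proposed above.
PROPOSED_LINES = """\
0027 ; allowed ; inclusion # ( ' ) APOSTROPHE
058A ; allowed ; inclusion # ( ֊ ) ARMENIAN HYPHEN
2010 ; allowed ; inclusion # ( ‐ ) HYPHEN
2027 ; allowed ; inclusion # ( ‧ ) HYPHENATION POINT
30FB ; allowed ; inclusion # ( ・ ) KATAKANA MIDDLE DOT
"""


def parse_modifications(text):
    """Map each code point to its (status, reason) pair."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, reason = (field.strip() for field in data.split(";"))
        table[int(code, 16)] = (status, reason)
    return table


# Hypothetical application policy (an assumption, not from #39): this
# application rejects hyphens on confusability grounds even though the
# data file marks them as allowed.
APP_EXCLUDED = {0x058A, 0x2010}


def allowed_by_app(cp, table):
    """True if the data allows cp and the application does not exclude it."""
    status, _reason = table.get(cp, ("unlisted", None))
    return status == "allowed" and cp not in APP_EXCLUDED


table = parse_modifications(PROPOSED_LINES)
for cp in sorted(table):
    print(f"U+{cp:04X} {table[cp]} app_allows={allowed_by_app(cp, table)}")
```

Keeping the application filter separate from the parsed data mirrors the intent of the proposed text: the data file records the general 'inclusion' status, and each target application layers its own restrictions on top.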

3. There are 4 other characters that are in IDNA2008, but not in the inclusion list.

3007; PVALID     # IDEOGRAPHIC NUMBER ZERO

Of these, U+3007 is already in the recommended list (it is in XID_Continue). The three others are listed below.

06FD; PVALID     # ARABIC SIGN SINDHI AMPERSAND

06FE; PVALID     # ARABIC SIGN SINDHI POSTPOSITION MEN

0375; CONTEXTO   # GREEK LOWER NUMERAL SIGN (KERAIA)

These are allowed in #46, but not in #39, because they are not XID_Continue (they are General_Category=Other_Symbol and General_Category=Modifier_Symbol). These are bizarre additions to IDNA2008, but for consistency I propose that we broaden the definition of the ‘inclusion’ value in #39 to add these three characters, and document the reason: compatibility with IDNA2008 and consistency with #46. That would mean adding the following to the data file:

06FD ; allowed ; inclusion # ARABIC SIGN SINDHI AMPERSAND

06FE ; allowed ; inclusion # ARABIC SIGN SINDHI POSTPOSI...

0375 ; allowed ; inclusion # GREEK LOWER NUMERAL SIGN...
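The property claims in this item can be spot-checked with Python's standard library, as a sketch. `unicodedata.category` reports General_Category, and `("a" + ch).isidentifier()` is used here as a stand-in for XID_Continue, on the assumption that Python's identifier rules follow XID_Continue for non-initial characters. The helper name `describe` is hypothetical, and results reflect the Unicode version bundled with the interpreter, which may differ from the version discussed here.

```python
import unicodedata


def describe(cp):
    """Return (General_Category, approximate XID_Continue) for a code point."""
    ch = chr(cp)
    # Assumption: str.isidentifier() tests XID_Continue for non-initial characters.
    return unicodedata.category(ch), ("a" + ch).isidentifier()


# U+3007 plus the three characters proposed for the 'inclusion' value.
for cp in (0x3007, 0x06FD, 0x06FE, 0x0375):
    gc, xid_continue = describe(cp)
    print(f"U+{cp:04X} gc={gc} xid_continue={xid_continue}")
```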

4. We should have a special review of ASCII non-alphanumerics for confusables. We have focused on alphanumerics, but the non-alphanumerics are often used as syntax characters, so their confusables are especially interesting. For example, possibilities to review for # and + are:

U+0023 ( # ) NUMBER SIGN

U+FE5F ( ﹟ ) SMALL NUMBER SIGN

U+FF03 ( ＃ ) FULLWIDTH NUMBER SIGN

U+266F ( ♯ ) MUSIC SHARP SIGN

U+002B ( + ) PLUS SIGN

U+1429 ( ᐩ ) CANADIAN SYLLABICS FINAL PLUS

U+207A ( ⁺ ) SUPERSCRIPT PLUS SIGN

U+208A ( ₊ ) SUBSCRIPT PLUS SIGN

U+FE62 ( ﹢ ) SMALL PLUS SIGN

U+FF0B ( ＋ ) FULLWIDTH PLUS SIGN

and a bit further afield:

U+2795 ( ➕ ) HEAVY PLUS SIGN

U+2629 ( ☩ ) CROSS OF JERUSALEM

U+16ED ( ᛭ ) RUNIC CROSS PUNCTUATION

U+2719 ( ✙ ) OUTLINED GREEK CROSS

U+271A ( ✚ ) HEAVY GREEK CROSS

U+271B ( ✛ ) OPEN CENTRE CROSS

U+1F542 ( 🕂 ) CROSS POMMEE
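One way to start such a review, sketched here in Python, is to separate candidates that fold to the ASCII character under NFKC from those that are only visually similar and would therefore need explicit confusable data. The candidate code points are copied from the examples above; the helper name `folds_to` and the two labels are hypothetical.

```python
import unicodedata

# Candidates taken from the examples above, keyed by the ASCII character.
CANDIDATES = {
    "#": [0xFE5F, 0xFF03, 0x266F],
    "+": [0x1429, 0x207A, 0x208A, 0xFE62, 0xFF0B, 0x2795],
}


def folds_to(cp, target):
    """True if NFKC normalization maps the code point to the target string."""
    return unicodedata.normalize("NFKC", chr(cp)) == target


for target, cps in CANDIDATES.items():
    for cp in cps:
        kind = "NFKC-equivalent" if folds_to(cp, target) else "no NFKC mapping"
        print(f"U+{cp:04X} vs {target!r}: {kind}")
```

Characters in the second group are not caught by normalization, so they are the ones where reviewed confusable mappings matter most.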