L2/13-203

Re: #36 & #39 Recommendations

To: UTC

From: Mark Davis

Live: http://goo.gl/NKeRVB

While doing the latest version of UTR #36 & UTS #39 and associated data I uncovered a few issues, for which I have the following recommendations. Some of these could be implemented in the upcoming release of #36 & #39, while others are probably more appropriate for the version built on UCD 7.0.

1. Additional section in #36
2. Version
3. Format
4. Be specific about NFC in IdMod
5. Mechanical derivation of IdMod
   A. Additions
   B. Subtractions

See also http://unicode-inc.blogspot.ch/2013/10/proposed-updates-for-utr-36-and-uts-39.html 

1. Additional section in #36

Idempotence of a normalization/canonicalization function is a crucial invariant to maintain. It would be worth adding a section about that, and about transitivity.

There is a very nicely written article at http://labs.spotify.com/2013/06/18/creative-usernames/ that details how a lack of idempotence caused a security breach. We could just have a short section that points to that detailed explanation.
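The invariant is easy to test mechanically. The following is a minimal Python sketch, not the code discussed in the article: the function names and the sample string are chosen here for illustration. It shows a canonicalization that lowercases and then applies NFKC failing the idempotence check, because NFKC can introduce uppercase letters after lowercasing has already run.

```python
import unicodedata

# Illustrative sample: seven MODIFIER LETTER CAPITAL characters.
# Each has a compatibility decomposition to an ASCII capital letter.
SAMPLE = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"


def naive_canonical(name):
    """Lowercase first, then NFKC. This ordering is NOT idempotent."""
    return unicodedata.normalize("NFKC", name.lower())


def fixed_point_canonical(name, max_rounds=10):
    """Reapply naive_canonical until the result stops changing."""
    current = name
    for _ in range(max_rounds):
        following = naive_canonical(current)
        if following == current:
            return current
        current = following
    raise ValueError("canonicalization did not converge")


def is_idempotent(func, samples):
    """True if func(func(s)) == func(s) for every sample string."""
    return all(func(func(s)) == func(s) for s in samples)
```

Here `naive_canonical(SAMPLE)` yields `"BIGBIRD"`, while applying it a second time yields `"bigbird"`, so two code paths that canonicalize a different number of times disagree about which account is meant. Iterating to a fixed point is only one possible repair; a single well-defined folding such as NFKC_Casefold is another.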

Another useful discussion is on the transitivity of comparison functions for Unicode. These are easy to get wrong, and sorting behaves bizarrely when faced with non-transitive comparisons, and can also cause unexpected breaches.
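As an illustration of how a comparison can go wrong, here is a small Python sketch under stated assumptions: `mixed_compare` is a made-up comparator that ignores case only when both strings are ASCII, and `transitivity_violations` is a hypothetical brute-force checker. Neither comes from #36 or #39.

```python
from itertools import permutations


def mixed_compare(a, b):
    """Case-insensitive for ASCII pairs, raw code point order otherwise."""
    if a.isascii() and b.isascii():
        key_a, key_b = a.lower(), b.lower()
    else:
        key_a, key_b = a, b
    return (key_a > key_b) - (key_a < key_b)


def consistent_compare(a, b):
    """Compare every pair by the same key, which yields a total preorder."""
    key_a, key_b = a.casefold(), b.casefold()
    return (key_a > key_b) - (key_a < key_b)


def transitivity_violations(compare, items):
    """Return triples (x, y, z) with x < y and y < z but not x < z."""
    return [
        (x, y, z)
        for x, y, z in permutations(items, 3)
        if compare(x, y) < 0 and compare(y, z) < 0 and not compare(x, z) < 0
    ]
```

With the items `"a"`, `"B"`, and `"Z\u00e9"`, `mixed_compare` orders `"a"` before `"B"` and `"B"` before `"Z\u00e9"`, yet orders `"Z\u00e9"` before `"a"`, so the checker reports a violation. `consistent_compare` reports none for the same items.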

In each PRI for #36 we should also solicit pointers to other useful articles.

2. Version

Each of the versions of the data files for UTS #39 is for a specific version of the UCD. However, the versioning of the data files for UTS #39 does not reflect that. I recommend that we switch the versioning to use the version of the UCD that it was built for.

This is definitely easier for users to understand and manage in their implementations. For comparison, here are the versions so far:

Server folder   Release date   Version       Unicode version   Date
revision-02/    2006-08-15     Version 1     5.1.0             2008-04
revision-03/    2006-08-11     draft only!   n/a               n/a
revision-04/    2010-08-05     Version 2     6.0.0             2010-10
revision-05/    2012-07-23     Version 3     6.1.0             2012-01
revision-06/    2013-xx-x?     Version 4*    6.3.0             2013-09

If for some reason we had to issue a modified version between releases, we could use the third field, e.g., 6.3.1.
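Under this scheme an implementation could check that its data files and its UCD agree. The sketch below is only an illustration: `parse_version` and `built_for_ucd` are hypothetical helpers, and treating the first two fields as the UCD release (with the third field free for interim data updates) is an assumption drawn from the 6.3.1 example above.

```python
def parse_version(text):
    """Turn a dotted version such as '6.3.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def built_for_ucd(data_version, ucd_version):
    """True if the data files and the UCD share major and minor fields.

    Assumption: the third field is reserved for interim updates of the
    data files, so it is ignored when matching against the UCD.
    """
    return parse_version(data_version)[:2] == parse_version(ucd_version)[:2]
```

In Python, the UCD version of the running interpreter is available as `unicodedata.unidata_version`, which could be passed as `ucd_version`.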

Now, there are a few ways we could manage using the same version number.

  1. Do not necessarily produce a new version for every version of the UCD, nor at the same time as the release of the UCD.
  2. Produce an update for each UCD, but do not necessarily release at the same time as the UCD.
     a. That is, like UCA and TR46, the release may lag the UCD by a month or so.
     b. However, beta data files should be available while the UCD beta is released.
  3. Produce an update for each UCD, and release at the same time as that UCD.
     a. I don’t recommend this option, simply because it is difficult to manage the production.
     b. However, like UCA, we can try to produce it during the month after the UCD.

3. Format

The data files are gratuitously different from the UCD data files, yet many people will want to use the same parser for both. We could make the following changes to make them more consistent.

Use values for IdMod Status and Type that are more like those in the UCD, e.g., Limited_Use instead of limited-use.

Change the format for confusables.txt to be more easily parsed in the same way as the UCD files.

01C3        ;        0021                ;        SL        #...

=>

01C3        ;        Confusable_SL        ;        0021        #...
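To show why the reordered fields help, here is a minimal Python sketch of a generic parser for semicolon-delimited, UCD-style lines. `parse_ucd_line` is a hypothetical helper written for this note, not part of any released tool, and it assumes the first field is a single code point or a `..` range.

```python
def parse_ucd_line(line):
    """Parse 'code ; field ; field # comment' into (start, end, fields).

    Returns None for blank lines and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return start, end, fields[1:]


current_line = "01C3 ; 0021 ; SL #..."
proposed_line = "01C3 ; Confusable_SL ; 0021 #..."
```

The same function handles both lines, but only the proposed line puts a property-like name in the second field, which is where a UCD-style consumer would look for it: `parse_ucd_line(proposed_line)` gives the fields `["Confusable_SL", "0021"]`, whereas the current format gives `["0021", "SL"]`.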

4. Be specific about NFC in IdMod

In the current algorithm using idmod data, it doesn’t say which normalization form the text is presumed to be in, if any. However, an underlying presumption is that the text is in NFC, and we should make that clear in the text.

In theory, there are complications because of canonical equivalence. I mention them below for completeness, although I don’t think that they present a problem in practice at this point.

a. Consider the case of x + under_dot (as on http://www.unicode.org/standard/where/). Because we base our data on single code points, not sequences, to allow <U+0078 U+0323> requires that we mark U+0323 as recommended (thus with any base). To be more fine-grained, we’d have to extend the data file to also allow a sequence of code points, like the following. However, that doesn’t seem worth the effort for the few cases where this arises.

0078 0323    ; allowed ; recommended # LATIN SMALL LETTER X + COMBINING DOT BELOW

b. Consider the NFC combining character sequence <a-dot-below, umlaut>, where:

  1. a-umlaut is recommended
  2. dot-below is recommended (in all combinations)

Because this sequence is canonically equivalent to <a-umlaut, dot-below>, it should also be allowed. Again, however, complicating an algorithm to handle this case doesn’t seem worth the effort for the few cases (if any!) where this would arise.

If in the future we want to handle (a) and (b), we could extend the algorithm and data at that time.
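Cases (a) and (b) can be reproduced with a small Python sketch. It assumes a simplified check (text must be in NFC and every code point must be in an allowed set); `is_nfc`, `passes_idmod`, and the tiny allowed sets are hypothetical stand-ins for the real algorithm and data.

```python
import unicodedata


def is_nfc(text):
    """True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def passes_idmod(text, allowed):
    """Simplified check: NFC text whose code points are all allowed."""
    return is_nfc(text) and all(ch in allowed for ch in text)


# Case (a): x + COMBINING DOT BELOW has no precomposed form, so the
# sequence passes only if U+0323 itself is in the allowed set.
CASE_A = "x\u0323"

# Case (b): <a-umlaut, dot-below> is not NFC; its NFC form is
# <a-dot-below, umlaut>, which uses two different code points.
CASE_B_INPUT = "\u00e4\u0323"
CASE_B_NFC = unicodedata.normalize("NFC", CASE_B_INPUT)
ALLOWED_B = {"\u00e4", "\u0323"}
```

With `ALLOWED_B` containing only a-umlaut and dot-below, `CASE_B_NFC` is rejected even though it is canonically equivalent to a sequence built entirely from allowed code points, which is the gap described in (b).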

5. Mechanical derivation of IdMod

Replace the historic and technical values in the curated list with values that match UAX 31 or are derived from CLDR: Aspirational, Limited_Use, Exclusion, Not_Cldr.

1. We currently derive most of historic and limited-use from UAX 31, but the terms are not the same. Aligning them would make the tables and the derivation clearer.

2. For technical and some of historic and limited-use, the original data was put together before CLDR was as well-developed, and depended on review of information in the UTC plus curated contributions. We now have more resources at our disposal, however, and can base the majority of the data on the CLDR exemplar characters. There will still be a need for an exception file, but it should be much smaller and easier to maintain.

My suggested algorithm is to collect all the characters for locales in the latest release of CLDR that meet the conditions listed below. We’d modify Step 7 in http://unicode.org/draft/reports/tr39/tr39.html#IDMOD_Data_Collection so that any character outside of the collected set would be marked as ‘Not_Cldr’.

  A. The exemplars have approved status,
  B. The language is Living according to ISO 639-3 (or is Esperanto), and
  C. Either
     i. the language has a literate population of over 1M, or
     ii. the language has 50% modern coverage in CLDR

Clause C is a heuristic, because the languages that are below that level don’t get as much attention. As we improve the coverage and quality, this clause can be refined. Esperanto is added to B simply because it is the only “non-living” language with sizable usage (however, its addition doesn’t really make a difference).
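The selection conditions can be sketched in Python as follows. This is only an illustration: the `LocaleInfo` record and its field names are hypothetical and do not correspond to an actual CLDR API, and reading "50% modern coverage" as "at least 0.5" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocaleInfo:
    """Hypothetical summary of one CLDR locale (not a real CLDR type)."""
    language: str
    exemplars: frozenset
    exemplars_approved: bool
    living: bool               # Living according to ISO 639-3
    literate_population: int
    modern_coverage: float     # fraction between 0.0 and 1.0


def qualifies(info):
    """Apply the three conditions to one locale."""
    if not info.exemplars_approved:
        return False
    if not (info.living or info.language == "eo"):
        return False
    return info.literate_population > 1_000_000 or info.modern_coverage >= 0.5


def cldr_characters(locales):
    """Union of exemplar characters over all qualifying locales."""
    collected = set()
    for info in locales:
        if qualifies(info):
            collected |= info.exemplars
    return collected


def is_not_cldr(ch, collected):
    """True if the modified Step 7 would mark ch as 'Not_Cldr'."""
    return ch not in collected
```

Keeping the predicate separate from the collection step makes it straightforward to refine the heuristic clause later without touching the rest of the derivation.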

A. Additions

If we applied the CLDR-based methodology, we’d add the following characters (compared to the previous version). Note that the uppercase equivalents would also be included; they are just omitted from this list for brevity.

[ƒƙƴɓɔɖɗɛɣɲʋạảấầẩẫậắ ằẳẵặẹẻẽềểễệỉịọỏốồổỗộớ ờởỡợụủứừửữựỳỵỷỹ]

Most of these are Vietnamese characters, perhaps previously excluded by being covered by NFD (see above).

B. Subtractions

If we applied the CLDR-based methodology, we’d remove the following characters (compared to the previous version). Note that the uppercase equivalents would also be removed; they are just omitted from this list for brevity.

NOTE: because this would be a more significant change, I’d suggest that we just add the additions above in the version for UCD 6.3. For the beta files for the version built on UCD 7.0, we could switch over to using CLDR; that would allow time for people to review the Subtractions to make sure that we can make adjustments where necessary.

Latin+Greek+Cyrillic: 324 Code Points

[ǻǟȧǡȁȃḁ ǽǣ ḃḅḇ ĉċḉ ꞓ ḋḑḓḏ ĕȩḝḗḕȅȇḙḛ ḟ ǵĝ ġḡꞡ ĥȟḧḣḩḫẖħ ĭḯȉȋḭ ĵǰ ḱǩꞣḳḵ ḷḹḽḻ ꞎ ḿṁṃ ǹꞥṇṋ ṉ ꞑ ŏȫṍṏȭȯȱǿǫǭṓṑȍȏ ṕṗ ĸ ṙŗꞧȑȓṝṟ ṥŝṧṡꞩṩ ẗṫţṱ ṯ ŧ ŭǔǘǜǚǖṹṻȕȗṳṷṵ ꟺ ṽ ṿ ẘẇẉ ẍẋ ẙẏȳ ẑẕ ǯ ἀἄᾄ ἂᾂἆᾆᾀἁἅᾅἃᾃἇᾇᾁᾴὰᾲᾰᾶᾷᾱᾳ ἐἔἒἑἕἓὲ ἠἤᾔἢᾒἦᾖᾐἡἥᾕἣᾓ ἧᾗᾑῄὴῂῆῇῃ ἰἴἲἶἱἵἳἷὶῐῖ ῒῗῑ ὀὄὂὁὅὃὸ ῤ ῥ ϼ ͼ ͻ ͽ ὐὔὒὖὑὕὓὗὺῠῦῢῧῡ ὠὤᾤὢ ᾢὦᾦᾠὡὥᾥὣᾣὧᾧᾡῴὼῲῶῷῳ ӑ ӓ ӛ ӕ ӻ ҕ ӷ ԁ ԃ ҙ ѐ ӗ ꙴ ӂ ӝ җ ԅ ԑ ӟ ӡ ԇ ꙵѝ ҋ ӥ ꙶ ӄ ҡ ҟ ԟ ԛ ӆ ԓ ԡ ԉ ԕ ӎ ӈ ԣ ҥ ԋ ӧ ӫ ԥ ҧ ҏ ԗ ԍ ҫ ԏ ҭ ꙷ ӱ ӳ ӽ ӿ ԧ ꙻ ꙡ ҵ ӵ ӌ ҽ ҿ ꙿ ꙸ ꙹ ӹ ꙺ ҍ ӭ ԙ ꚟ ҩ ԝ ӏ]

Arabic + Ethiopic: 294 Code Points

[፞ ፝ ٴ ٱ ݳ ݴ ࢨ ࢩ ࢬ ٻ ڀ ݐ-ݕ ࢠ ݖ ٺ ٽ ٿ ڃ ڄ ڿ ڇ ࢢ ڂ ݗ ݘ ݮ ݯ ݲ ݼ ڊ-ڍ ڏ ڐ ۮ ݙ ݚ ڒ ڔ ڕ ڗ ڙ ۯ ݛ ݫ ݬ ݱ ࢪ ڛ ڜ ۺ ݜ ݭ ݰ ݽ ݾ ڝ ڞ ۻ ڟ ࢣ ڠ ۼ ݝ-ݟ ڡ-ڤ ࢤ ڥ ڦ ݠ ݡ ڧ ڨ ࢥ ڪ ڬ ݿ ڭ ڮ ڰ-ڴ ݢ ػ ؼ ݣ ݤ ڵ-ڸ ݪ ࢦ ݥ ݦ ࢧ ڻ ڽ ڹ ݧ-ݩ ۃ ۿ ەۀ ۥ ۅ ۆ ۈ ۊ ۋ ۏ ݸ ݹ ࢫ ۦ ێ ۑ ؽ-ؿ ؠ ݵ-ݷ ۓ ݺ ݻ ሇ ⶀ ᎀ-ᎃ ⶁ-ⶃ ꬁ-ꬆ ⶄ ቇ ᎄ-ᎇ ⶅ-ⶇ ኇ ⶈ-ⶊ ኯ ዏ ⶋ ꬑ-ꬖ ዯ ⶌ ꬉ-ꬎ ዸ-ዿ ⶍ ⶎ ጏ ጘ-ጟ ⶓ-ⶖ ⶏ ⶐ ꬠ-ꬦ ⶑ ꬨ-ꬮ ᎈ-ᎏ ⶒ ፘ-ፚ ⶠ-ⶦ ⶨ-ⶮ ⶰ-ⶶ ⶸ-ⶾ ⷀ-ⷆ ⷈ-ⷎ ⷐ-ⷖ ⷘ-ⷞ]

Remainder: 458 Code Points

[༵ ༷༾༿ ྂ ྃ ྆ ྇ ࿆ ̓ ̔ ̆ ̊ ͂ ̋ ̇ ̸ ̧ ̨ ̄ ̉ ̏-̑ ̛ ̣-̦ ̭ ̮ ̰ ̱ ̵ ̹ ִ ٓ ਁ ਂ ஂਃ ႍ ゙ ゚ ˬ ꜗ-ꜟ ꞈ ॱ ៗ ꧏ ꩰ ૦୦௦౦൦๐໐႐០ ૧୧௧ ౧൧๑໑႑១ ૨୨௨౨൨๒໒႒២ ૩୩௩౩ ൩๓໓႓៣ ૪୪௪౪൪๔໔႔៤ ૫୫௫౫൫ ๕໕႕៥ ૬୬௬౬൬๖໖႖៦ ૭୭௭౭൭๗ ໗႗៧ ૮୮௮౮൮๘໘႘៨ ૯୯௯౯൯๙໙ ႙៩ ჷ ⴧჇ ჸ-ჺ ჽ ⴭჍ ჾ ჿ ՙ װ-ײ ހ ޙ ޚ ށ-ރ ޜ ބ-އ ޢ ޣ ވ ޥ މ-ދ ޛ ތ ޘ ޠ ޡ ލ ގ ޤ ޏ ސ ޝ-ޟ ޑ-ޗ ޱ ަ-ް ॲ ऄ ॳ-ॷ ॠ ॡ ऎ ऒ ॻ ॹ ॼ ॾ ऩ ॿ ॺ ऱ ऴ ॽ ऺ ऻ ॏ ॖ ॗ ॢ ॣ ॆ ॊ ઌ ૡ-ૣ ୠ ଌ ୡ ଽ ୖ ୗ ௐ ஶ ௗ ఽ ೱ ೲ ೢ ೣ ഩ ൎ ഺ ഽ ඎ ໞ ໟ ཫ ཬ ༀ ྈ ྍ ྉ ྎ ྌ ྏ ྊ ྋ ၵ-ၷ ꩠ ၚ ၸ ꩡ-ꩣ ၹ ꩲ ၛ ꩤ ၡ ၺ ꩥ-ꩩ ၮ ၻ ꩪ ၼ ꩫ ၞ ၽ ၾ ꩯ ႎ ၿ ၟ ꩳ ꩺ ၠ ႂ ႀ ၐ ၑ ၥ ꩬ ႁ ꩭ ꩮ ꩱ ၜ ၝ ၯ ၰ ၦ ဢ ၒ-ၕ ဨ ႃ ၲ ႜ ၱ ဳ ၳ ၴ ၖ-ၙ ႄ ဵ ႅ ႝ ဴ ၢ ၧ ၨ ႆ ၣ ၤ ၩ-ၭ ႇ ႋ ႈ ႌ ႉ ႊ ႏ ႚ ႛ ꩻ ꩴ-ꩶ ឝ ឞ ៜ ゔ &#x1b000; ゕ ゖ &#x1b001; ヷ-ヺ ㄅ ㆠ ㄆㆴ ㄇ ㄈ ㄪ ㄉ ㄊㆵ ㄋ-ㄍㆣ ㄎㆶ ㄫ ㆭ ㄏㆷ ㄐㆢ ㄑ ㄒ ㄬ ㄓ-ㄗ ㆡ ㄘ ㄙ ㆸ-ㆺ ㄚㆩ ㄛㆧ ㆦ ㄜ ㄝ ㆤ ㆥ ㄞㆮ ㄟ ㄠㆯ ㄡ-ㄤ ㆲ ㄥ ㆰ ㆱ ㆬ ㄦ ㄧㆪㆳ ㄨㆫㆨ ㄩ ㄭ 〆 〇]

In addition, we’d remove the CJK ideographs that are not in CLDR. Note that they would still be a superset of the IDNA characters accepted by the CJK NICs.
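For reviewers who want to regenerate lists like those in A and B, the comparison amounts to two set differences. The Python sketch below is a hypothetical helper, not the tool used to produce the lists above; it omits an uppercase character whenever its lowercase form is in the same difference, mirroring the brevity convention used in this document.

```python
def listing_diff(previous, current):
    """Return (additions, subtractions) as sorted lists of characters.

    A character is dropped from a list when it differs from its own
    lowercase form and that lowercase form is already in the same list.
    """
    def brief(chars):
        return sorted(
            ch for ch in chars
            if ch == ch.lower() or ch.lower() not in chars
        )

    return brief(current - previous), brief(previous - current)
```

For example, with a previous set `{a, b, A, B}` and a current set `{a, c, A, C}`, the additions are listed as `c` and the subtractions as `b`.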