
CLDR Ticket #10497 (closed data: fixed)

Opened 11 months ago

Last modified 7 weeks ago

Serious CLDR 30/31 regression in zh stroke collation

Reported by: pedberg
Owned by: pedberg
Component: collation
Data Locale: zh
Phase: dvet
Review: markus
Weeks:
Data Xpath:
Xref: ticket:9414, ticket:9765, ticket:10055, ticket:10642

Description

In CLDR 30/31, there was a serious regression in the zh stroke collation. For example, the following common characters/radicals were missing in the stroke collation:

乛   \u4E5B
冂   \u5182
卜   \u535C
又   \u53C8
小   \u5C0F
日   \u65E5
月   \u6708
牛   \u725B

This is related to changes for tickets listed under Xref above.

There is likely some problem in the tooling or the way it was run.

I am checking whether the problem is still present in CLDR 32 data, updated per cldrbug 10055: Unicode 10.

Change History

comment:1 Changed 11 months ago by pedberg

  • Keywords Apple-31575085,33066940 added

comment:2 Changed 11 months ago by pedberg

  • Data Locale set to zh
  • Owner changed from anybody to pedberg
  • Status changed from new to accepted
  • Phase changed from dsub to dvet
  • Milestone changed from UNSCH to 32

Peter to check whether the problem still exists in current CLDR data, then reassign as necessary

comment:3 Changed 10 months ago by pedberg

This is still true in current CLDR trunk. Here for example is the data for <collation type='stroke' alt='short'> for index 2 in versions since CLDR 29:

CLDR 29, index 2 has 冂 \u5182, 卜 \u535C, 又 \u53C8:

  <collation type='stroke' alt='short'>
    <cr><![CDATA[
       ...
       <'\uFDD0\u2802' # INDEX 2
       <*丁丂七丄丅丆丩丷乂乃乄𠂆𠂇𠂊乜九了𠄎二亠人亻儿入八⺆冂冖冫⺇几凵⺈刀刁刂力勹匕匚匸十⺊卜卩厂厶⺀又巜讠⻏⻖𨸏〢〤〦 # 2

CLDR 30 through current trunk, they are missing:

  <collation type='stroke' alt='short'>
    <cr><![CDATA[
      ...
      <'\uFDD0\u2802' # INDEX 2
      <*丁丂七丄丅丆丩丷乂乃乄𠂆𠂇𠂊乜九了𠄎二亠人亻儿入八⺆冖冫几⺇凵刀刁⺈力勹匕匚匸十⺊卩厂厶⺀巜讠⻏𨸏〢〤〦 # 2
                                                  ^ 冂 used to be here

The odd thing is that we have later rules still referring to the position of e.g. 冂 \u5182 which no longer has a position:

    &冂<<<⼌
    i.e. &\u5182<<<\u2F0C

comment:4 Changed 9 months ago by pedberg

  • Cc dongyuan added; dongyuan_liu@… removed
  • Xref changed from 9414, 9765, 10055 to 9414, 9765, 10055, 10642
  • Milestone changed from 32 to 33

This fix is for tooling, and is too much for CLDR 32 at this point. In CLDR 32 we will have to revert to the CLDR 29 stroke collation to address the problem, per cldrbug 10642. The real fix in tooling is per this bug, which is moving to CLDR 33.

comment:5 Changed 9 months ago by pedberg

The problem was introduced in r12662

comment:6 Changed 4 months ago by pedberg

  • Milestone changed from 33 to 34

comment:7 Changed 7 weeks ago by pedberg

  • Milestone changed from 34 to 33.1

comment:8 Changed 7 weeks ago by pedberg

The problem is due to change r 1047 in the unicodetools project, part of the changes for cldrbug 9414: "UCA 9", but of course the unicodetools portion of the changes does not show in the review link for that ticket. The relevant change is in unicodetools/trunk/unicodetools/org/unicode/draft/GenerateUnihanCollators.java, in RSComparator.compare (line 1429); this is used by StrokeComparator.compare to provide a result when two characters have the same stroke count. In the version used for CLDR 29 it looked like this:

  public int compare(String o1, String o2) {
    final int s1 = RsInfo.getSortOrder(o1);
    final int s2 = RsInfo.getSortOrder(o2);
    final int result = s1 - s2;
    if (result != 0) {
      return result;
    }
    return codepointComparator.compare(o1, o2);
  }

That is, if RsInfo.getSortOrder returned the same value for both characters, the method used the codepoint to distinguish them. In the version used for CLDR 30 and later it looks like this:

  public int compare(String s1, String s2) {
    int c1 = s1.codePointAt(0);
    assert Character.charCount(c1) == s1.length();
    int c2 = s2.codePointAt(0);
    assert Character.charCount(c2) == s2.length();
    long order1 = getRSLongOrder(c1);
    long order2 = getRSLongOrder(c2);
    return order1 < order2 ? -1 : order1 > order2 ? 1 : 0;
  }

If getRSLongOrder returns the same result for both characters, there is no longer a fallback to a difference based on code point order.

In showSorting, in the first loop over Strings from unicodeMap, at line 875, rsSorted.add(s) is called to add strings. However, a string is not added if the comparator indicates it is equal to something already in rsSorted, and this is what I now see happening for the characters that have gone missing. For example:

  • rsSorted.add is called for 2-stroke \u2E86 ⺆, adds it successfully, then
  • rsSorted.add is called for 2-stroke \u5182 冂 (the more important char), does NOT add because the comparator treats it as equal to \u2E86
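The drop described above can be reproduced in isolation. This is only a minimal sketch, assuming rsSorted behaves like a java.util.TreeSet built on the comparator; the class name, the TOY_ORDER map and its key values are hypothetical stand-ins for getRSLongOrder, not the real unicodetools data.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeSet;

public class DroppedBySortedSet {
    // Hypothetical sort keys standing in for getRSLongOrder:
    // the first two characters deliberately share one key.
    static final Map<String, Long> TOY_ORDER = Map.of(
            "\u2E86", 13_002L,   // ⺆
            "\u5182", 13_002L,   // 冂 (same toy key as ⺆)
            "\u5196", 14_002L);  // 冖

    // No tie-breaker: equal keys compare as 0, as in the CLDR 30 version.
    static final Comparator<String> NO_FALLBACK =
            (a, b) -> Long.compare(TOY_ORDER.get(a), TOY_ORDER.get(b));

    static TreeSet<String> build() {
        TreeSet<String> rsSorted = new TreeSet<>(NO_FALLBACK);
        rsSorted.add("\u2E86");  // added
        rsSorted.add("\u5182");  // add() returns false: compares equal to ⺆
        rsSorted.add("\u5196");  // added
        return rsSorted;
    }

    public static void main(String[] args) {
        // Only two of the three strings survive.
        System.out.println(build().size());
    }
}
```

Note that rsSorted.contains("\u5182") would still return true in this sketch, because a sorted set answers membership with the same comparator; the loss only becomes visible when the set is iterated or written out.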

And in fact, in the GenerateUnihanCollators console log, after it writes out the strokeT files, I see output like the following. I am not sure what this is supposed to indicate, but the middle character (at least) in each of these groups is not getting added to rsSorted, and thus is not written out as part of the collation:

unihan: 9 ⺄ [2E84/1]	rules: 9 ⺄ [2E84/1]
unihan: 10 乛 [4E5B/1]
unihan: 11 𠃊 [200CA/1]	 rules: 10 𠃊 [200CA/1]
-----
unihan: 48 ⺆ [2E86/2]	rules: 47 ⺆ [2E86/2]
unihan: 49 冂 [5182/2]
unihan: 50 冖 [5196/2]	 rules: 48 冖 [5196/2]
-----
unihan: 57 ⺈ [2E88/2]	rules: 55 ⺈ [2E88/2]
unihan: 58 刂 [5202/2]
unihan: 59 力 [529B/2]	 rules: 56 力 [529B/2]
-----
unihan: 65 ⺊ [2E8A/2]	rules: 62 ⺊ [2E8A/2]
unihan: 66 卜 [535C/2]
unihan: 67 卩 [5369/2]	 rules: 63 卩 [5369/2]
-----
unihan: 70 ⺀ [2E80/2]	rules: 66 ⺀ [2E80/2]
unihan: 71 又 [53C8/2]
unihan: 72 巜 [5DDC/2]	 rules: 67 巜 [5DDC/2]
-----
unihan: 74 ⻏ [2ECF/2]	rules: 69 ⻏ [2ECF/2]
unihan: 75 ⻖ [2ED6/2]
unihan: 76 𨸏 [28E0F/2]	 rules: 70 𨸏 [28E0F/2]
-----
...

So: It seems like a reasonable fix is to restore the code point comparison as a fallback in RSComparator.compare, and THEN run the tools to generate the updated Unicode 11 collation and transform data for CLDR 33.1.
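As a rough illustration of why a code point fallback is enough, the sketch below chains a tie-breaker onto a key comparator and puts two characters with the same key into a sorted set. It assumes a java.util.TreeSet; TOY_ORDER, CODE_POINT and the class name are hypothetical stand-ins (for getRSLongOrder and codepointComparator), not the unicodetools code.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeSet;

public class FallbackComparatorSketch {
    // Hypothetical sort keys standing in for getRSLongOrder: both share one key.
    static final Map<String, Long> TOY_ORDER = Map.of(
            "\u2E86", 13_002L,   // ⺆
            "\u5182", 13_002L);  // 冂

    // Assumed stand-in for codepointComparator: order by first code point.
    static final Comparator<String> CODE_POINT =
            Comparator.comparingInt((String s) -> s.codePointAt(0));

    // Key order first, then code point order when the keys are equal.
    static final Comparator<String> WITH_FALLBACK =
            Comparator.comparingLong((String s) -> TOY_ORDER.get(s))
                      .thenComparing(CODE_POINT);

    static TreeSet<String> build() {
        TreeSet<String> rsSorted = new TreeSet<>(WITH_FALLBACK);
        rsSorted.add("\u2E86");
        rsSorted.add("\u5182");  // kept: the tie is broken by code point
        return rsSorted;
    }

    public static void main(String[] args) {
        System.out.println(build().size());
    }
}
```

With the tie-breaker in place the comparator returns 0 only for identical strings, so distinct characters can no longer collapse into one set entry.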

I ran this by Markus, who suggested that I make the change, verify that it fixes the problem, and commit the unicodetools change, but not the updated data (replacing the current stroke data which is from CLDR 29 and has the characters that were missing in CLDR 30). He will run all of the tools for the Unicode 11 data update.

comment:9 Changed 7 weeks ago by pedberg

  • Status changed from accepted to reviewing
  • Review set to markus

OK, I made the change as r 1468 in the unicodetools project. The diff is:

Index: unicodetools/org/unicode/draft/GenerateUnihanCollators.java
===================================================================
--- unicodetools/org/unicode/draft/GenerateUnihanCollators.java	(revision 1467)
+++ unicodetools/org/unicode/draft/GenerateUnihanCollators.java	(revision 1468)
@@ -1433,7 +1433,10 @@
             assert Character.charCount(c2) == s2.length();
             long order1 = getRSLongOrder(c1);
             long order2 = getRSLongOrder(c2);
-            return order1 < order2 ? -1 : order1 > order2 ? 1 : 0;
+            if (order1 != order2) {
+                return order1 < order2 ? -1 : 1;
+            }
+            return codepointComparator.compare(s1, s2);
         }
     };
 

And running GenerateUnihanCollators with this change:

  • It now generates the zh stroke collations (the strokeT files) with the characters that were formerly missing
  • It also adds characters to the pinyin and unihan collations, and to the generated kMandarin.txt and kTotalStrokes.txt files
  • And it eliminates all of the GenerateUnihanCollators console log output that had lines prefixed with "unihan: ", e.g.
    unihan: 9 ⺄ [2E84/1]	rules: 9 ⺄ [2E84/1]
    unihan: 10 乛 [4E5B/1]
    unihan: 11 𠃊 [200CA/1]	 rules: 10 𠃊 [200CA/1]
    -----
    unihan: 48 ⺆ [2E86/2]	rules: 47 ⺆ [2E86/2]
    unihan: 49 冂 [5182/2]
    unihan: 50 冖 [5196/2]	 rules: 48 冖 [5196/2]
    -----
    unihan: 57 ⺈ [2E88/2]	rules: 55 ⺈ [2E88/2]
    unihan: 58 刂 [5202/2]
    unihan: 59 力 [529B/2]	 rules: 56 力 [529B/2]
    -----
    

comment:10 Changed 7 weeks ago by markus

  • Status changed from reviewing to closed
  • Resolution set to fixed

https://www.unicode.org/utility/trac/changeset/1468 LGTM

I will update the data files for Unicode 11 ticket:10978.
