CLDR Ticket #8290 (accepted data)

Opened 2 years ago

Last modified 21 months ago

3 collators fail (follow-up)

Reported by: emmons
Owned by: markus
Component: collation
Data Locale:
Phase: rc
Review:
Weeks:
Data Xpath:
Xref: ticket:8288, ticket:8289, ticket:8345

Description

This is the follow-up to Mark's ticket Cldrbug:8288, whose fixes we had to back out late in the 27 release cycle. This ticket is for 28.

After finding a problem in one collator, I put together a quick test and found that the following 3 collators in CLDR fail to build. In view of the time difference, I'm going to go ahead and check in the fixes and the test, in case John wants to pick them up for the build.

Test Failures

TestCollators {
  TestBuildable {
    Error: (TestCollators.java:32) java.lang.IllegalArgumentException: en_US_POSIX, standard, java.text.ParseException: range start greater than end in starred-relation string at index 3 near "
&A!<*'\u0020'-'/'<"
    Error: (TestCollators.java:32) java.lang.IllegalArgumentException: ko, searchjl, java.text.ParseException: not a valid setting/option at index 0 near "![เ-ไ ເ-ໄ ꪵ ꪶ ꪹ "
    Error: (TestCollators.java:32) java.lang.IllegalArgumentException: root, emoji, java.text.ParseException: missing relation string at index 1533 near "📱📳📴📲📵☎📞
!<#⃣
<*⃣
<0⃣
<1⃣"
  } (7,815s) FAILED (3 failure(s))
} (7,816s) FAILED (3 failure(s))

Fixes

en_US_POSIX, standard: change the ranges to list all the characters:

&A<*'\u0020'-'/'<*0-'@'<*ABCDEFGHIJKLMNOPQRSTUVWXYZ<*'['-'`'<*abcdefghijklmnopqrstuvwxyz<*'{'-'\u007F'
=>
&A<*' !"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
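The expanded string in this fix is U+0020 through U+007E written out in code point order, with the apostrophe doubled because it sits inside a quoted literal. Note that the listed form stops at '~' (U+007E), while the original last range ended at '\u007F'. A minimal Java sketch that regenerates the string, as a way of double-checking the hand-written list; the class and method names are hypothetical and not part of CLDR or ICU:

```java
// Hypothetical helper, not CLDR or ICU code.
final class AsciiRuleSketch {
    private AsciiRuleSketch() {}

    /** Builds "&A<*'...'" listing U+0020..U+007E in code point order. */
    static String listedRule() {
        StringBuilder sb = new StringBuilder("&A<*'");
        for (char c = 0x20; c <= 0x7E; c++) {
            if (c == '\'') {
                // Inside a quoted literal, an apostrophe is written twice.
                sb.append("''");
            } else {
                sb.append(c);
            }
        }
        return sb.append('\'').toString();
    }
}
```

Comparing the result of `AsciiRuleSketch.listedRule()` with the rule checked into en_US_POSIX would catch a dropped or transposed character in the hand-typed list.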

ko, searchjl: move the command to after <cr><![CDATA[, and use ICU format:

			<suppress_contractions>[เ-ไ ເ-ໄ ꪵ ꪶ ꪹ ꪻ ꪼ]</suppress_contractions>
=>
				[suppressContractions [เ-ไ ເ-ໄ ꪵ ꪶ ꪹ ꪻ ꪼ]]

root, emoji: surround each of the characters on the first two lines by '...' (this may look funny in your browser):

<'#⃣'
<'*⃣'

The ko issue was strange. The exact same line worked in a root search collator, but didn't in ko.xml. And I couldn't figure out why it was breaking, so I did the simple thing of moving the command inside the CDATA using the ICU formulation.

Change History

comment:1 Changed 2 years ago by markus

  • Cc mark, pedberg, yoshito, emmons, markus added
  • Component changed from unknown to data-collation

I can look into these for 28. If there was really a problem, then building these collators in the ICU4C data build should fail. Or maybe it's only a problem in Java?

comment:2 Changed 2 years ago by markus

Looking at the ticket description again, I suspect that the problems with en_US_POSIX and ko/searchjl come from bad processing of the collation data in the CLDR test code: \u0020 is probably not unescaped, and the text content of the suppress_contractions element is probably dumped as-is without being converted to ICU syntax. When the CLDR test code reads collation data, it needs to process it like the combination of the LDML-to-ICU converter (which turns XML settings into ICU syntax) and ICU's genrb (which unescapes) before feeding rule strings into the RuleBasedCollator constructor. There should be some common code shared by the converter and the test. Note that the test needs to do both less than the converter (no quoting of lines with "") and more (unescaping).

I suspect that the root/emoji tailoring does need the escaping.
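The preprocessing described in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the unescaping covers just the four-hex-digit \u form seen in the en_US_POSIX rule (ICU's real unescaping handles more escape forms), and the settings conversion covers only the suppress_contractions case from the description.

```java
// Hypothetical sketch, not the actual CLDR test or converter code.
final class RuleUnescapeSketch {
    private RuleUnescapeSketch() {}

    /** Replaces each backslash-u escape (four hex digits) with the UTF-16 code unit it names. */
    static String unescape(String rules) {
        StringBuilder out = new StringBuilder(rules.length());
        int i = 0;
        while (i < rules.length()) {
            char c = rules.charAt(i);
            if (c == '\\' && i + 6 <= rules.length()
                    && rules.charAt(i + 1) == 'u' && isHex4(rules, i + 2)) {
                out.append((char) Integer.parseInt(rules.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                // Anything that is not a complete escape is copied unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /** Wraps the text of a suppress_contractions element in ICU rule syntax. */
    static String suppressContractionsSetting(String unicodeSet) {
        return "[suppressContractions " + unicodeSet + "]";
    }

    // True if the four characters starting at index 'start' are all hex digits.
    private static boolean isHex4(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

With this, the escaped range start in the en_US_POSIX rule becomes a literal space before the string reaches the collation builder, so the starred range would run from U+0020 to '/' rather than starting at a backslash.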

comment:3 Changed 2 years ago by emmons

  • Owner changed from anybody to markus
  • Phase changed from dsub to rc
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 28

comment:4 Changed 2 years ago by markus

  • Xref changed from 8288 to 8288 8289 8345

en_US_POSIX: I believe that this will work once the test code is fixed to unescape the rule string before handing it to the collation builder.

ko/searchjl: This should be handled via ticket:8289

root/emoji: I submitted separate ticket:8345

Please use this ticket to fix the test so that it unescapes rule strings.

comment:5 Changed 2 years ago by markus

  • Type set to data

comment:6 Changed 2 years ago by srl

  • Status changed from assigned to accepted

comment:7 Changed 23 months ago by markus

  • Milestone changed from 28 to 29

comment:8 Changed 21 months ago by emmons

  • Milestone changed from 29 to upcoming

Auto move of all 29 -> upcoming
