[Unicode] Common Locale Data Repository: Bug Tracking

CLDR Ticket #7039 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

Fix lt casing transforms to match SpecialCasing

Reported by: mark
Owned by: pedberg
Component: translit
Data Locale:
Phase: rc
Review: mark
Weeks:
Data Xpath:





The lt transforms should match SpecialCasing for all normal text. It turns out that there are differences.

TestTransforms {
  TestCasing {
    Error: File TestTransforms.java, Line 283: lt-Title Vs SpecialCasing: expected "I Ï J J̈ Į Į̈ Ì Í Ĩ Xi̇̈ Xj̇̈ Xį̇̈ Xi̇̀ Xi̇́ Xi̇̃ Xi Xï Xj Xj̇̈ Xį Xį̇̈", got "I Ï J J̈ Į Į̈ Ì Í Ĩ Xi̇̈ Xj̇̈ Xį̇̈ Xi̇̀ Xi̇́ Xi̇̃ Xi Xi̇̈ Xj Xj̇̈ Xį Xį̇̈"
    Error: 	I Ï J J̈ Į Į̈ Ì Í Ĩ Xi̇̈ Xj̇̈ Xį̇̈ Xi̇̀ Xi̇́ Xi̇̃ Xi X
		Special: 	00EF,0020,0058,006A,0020,0058,006A,0307,0308,0020,0058,012F,0020,0058,012F,0307,0308
		Transform: 	0069,0307,0308,0020,0058,006A,0020,0058,006A,0307,0308,0020,0058,012F,0020,0058,012F,0307,0308
    Error: File TestTransforms.java, Line 283: lt-Lower Vs SpecialCasing: expected "i ï j j̇̈ į į̇̈ i̇̀ i̇́ i̇̃ xi̇̈ xj̇̈ xį̇̈ xi̇̀ xi̇́ xi̇̃ xi xï xj xj̇̈ xį xį̇̈", got "i i̇̈ j j̇̈ į į̇̈ i̇̀ i̇́ i̇̃ xi̇̈ xj̇̈ xį̇̈ xi̇̀ xi̇́ xi̇̃ xi xi̇̈ xj xj̇̈ xį xį̇̈"
    Error: 	i 
		Special: 	00EF,0020,006A,0020,006A,0307,0308,0020,012F,0020,012F,0307,0308,0020,0069,0307,0300,0020,0069,0307,0301,0020,0069,0307,0303,0020,0078,0069,0307,0308,0020,0078,006A,0307,0308,0020,0078,012F,0307,0308,0020,0078,0069,0307,0300,0020,0078,0069,0307,0301,0020,0078,0069,0307,0303,0020,0078,0069,0020,0078,00EF,0020,0078,006A,0020,0078,006A,0307,0308,0020,0078,012F,0020,0078,012F,0307,0308
		Transform: 	0069,0307,0308,0020,006A,0020,006A,0307,0308,0020,012F,0020,012F,0307,0308,0020,0069,0307,0300,0020,0069,0307,0301,0020,0069,0307,0303,0020,0078,0069,0307,0308,0020,0078,006A,0307,0308,0020,0078,012F,0307,0308,0020,0078,0069,0307,0300,0020,0078,0069,0307,0301,0020,0078,0069,0307,0303,0020,0078,0069,0020,0078,0069,0307,0308,0020,0078,006A,0020,0078,006A,0307,0308,0020,0078,012F,0020,0078,012F,0307,0308
    Error: File TestTransforms.java, Line 283: lt-Upper Vs SpecialCasing: expected "I Ï J J̈ Į Į̈ Ì Í Ĩ XÏ XJ̈ XĮ̈ XÌ XÍ XĨ XI XÏ XJ XJ̈ XĮ XĮ̈", got "I Ï J J̈ Į Į̈ Ì Í Ĩ XÏ XJ̈ XĮ̈ XÌ XÍ XĨ XI XÏ XJ XJ̈ XĮ XĮ̈"
    Error: 	I Ï J J̈ Į Į̈ Ì Í Ĩ X
		Special: 	0049,0308,0020,0058,004A,0308,0020,0058,012E,0308,0020,0058,0049,0300,0020,0058,0049,0301,0020,0058,0049,0303,0020,0058,0049,0020,0058,00CF,0020,0058,004A,0020,0058,004A,0308,0020,0058,012E,0020,0058,012E,0308
		Transform: 	00CF,0020,0058,004A,0308,0020,0058,012E,0308,0020,0058,00CC,0020,0058,00CD,0020,0058,0128,0020,0058,0049,0020,0058,00CF,0020,0058,004A,0020,0058,004A,0308,0020,0058,012E,0020,0058,012E,0308

These differences concern the handling of ï and Ï; see cldrbug 7010 for more information. The lt casing transforms were mostly added per cldrbug 4779.
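To read the "Special:" and "Transform:" code point lists in the log above, it can help to decode them and find the first position where they diverge. The following is a minimal sketch, not part of the CLDR test suite; the helper names `decode` and `first_difference` are hypothetical, and the two inputs are the leading code points copied from the lt-Title lists above.

```python
import unicodedata


def decode(hex_list):
    """Turn a comma-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in hex_list.split(","))


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Leading code points of the lt-Title "Special:" and "Transform:" lists.
special = decode("00EF,0020,0058,006A")
transform = decode("0069,0307,0308,0020,0058,006A")

# Compare in NFD so that precomposed U+00EF becomes U+0069 U+0308.
nfd_special = unicodedata.normalize("NFD", special)
idx = first_difference(nfd_special, transform)

print(idx)                                # 1
print(f"U+{ord(nfd_special[idx]):04X}")   # U+0308
print(f"U+{ord(transform[idx]):04X}")     # U+0307
```

Under these assumptions, the transform output carries an extra U+0307 (combining dot above) after the i, while the SpecialCasing-based result is a plain ï.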

This ticket is split out of cldrbug 6921, which only added the tests for this.


Change History

comment:1 Changed 3 years ago by tomzhang

This is a duplicate of cldrbug 7010. Based on the ticket description, it seems that the transform is wrong (while UCharacter.toXXXXXCase is right). More specifically, this rule is not honored:

# Introduce an explicit dot above when lowercasing capital I's and J's
# whenever there are more accents above.
# (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)

It seems that "an explicit dot" (\u0307) is added regardless of whether there are "more accents" or not.
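The rule quoted above can be illustrated with a small sketch. This is not the ICU or CLDR implementation. It assumes the Unicode definition of the More_Above context (the character is followed by a combining mark of canonical combining class 230, with no intervening character of class 0 or 230), and the names `more_above` and `lt_lower_sketch` are hypothetical.

```python
import unicodedata

# Capital letters that the rule covers, mapped to their lowercase forms.
CAPITALS = {"I": "i", "J": "j", "\u012e": "\u012f"}  # U+012E is I with ogonek
DOT_ABOVE = "\u0307"


def more_above(text, index):
    """True if text[index] is followed by a class-230 mark before any
    character of combining class 0."""
    for ch in text[index + 1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 230:
            return True
        if ccc == 0:
            return False
    return False


def lt_lower_sketch(text):
    """Lowercase text, adding U+0307 after I, J, or I-ogonek only when
    more accents above follow."""
    out = []
    for i, ch in enumerate(text):
        lower = CAPITALS.get(ch)
        if lower is None:
            out.append(ch.lower())
        elif more_above(text, i):
            out.append(lower + DOT_ABOVE)
        else:
            out.append(lower)
    return "".join(out)


print(lt_lower_sketch("XI") == "xi")                        # True
print(lt_lower_sketch("I\u0300") == "i\u0307\u0300")        # True
print(lt_lower_sketch("\u00cf") == "\u00ef")                # True
print(lt_lower_sketch("I\u0308") == "i\u0307\u0308")        # True
```

Note that in this sketch a precomposed Ï (U+00CF) lowercases to ï (U+00EF), while the decomposed sequence I + U+0308 gains the explicit dot. That appears to mirror the kind of difference shown in the log, where one side yields 00EF and the other 0069,0307,0308.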

Here are the files involved (both the transform rules and the test file):

  1. lt-Lower.xml / lt-Title.xml / lt-Upper.xml
  2. TestTransforms.java

I tried to fix the toLowerCase case: if I delete "\u0307" in lt-Lower.xml and change the corresponding test data in TestTransforms.java, it seems to work. I doubt this is the right way to change the data, but I wanted to offer some ideas here.

Please correct me if I have misunderstood anything, and let me know whether it is all right for me to change the data directly, or whether I should wait for a transform expert to look at this.

comment:2 Changed 3 years ago by pedberg

There were problems with:

  1. The CLDR transforms: the rules needed to be expressed in NFD.
  2. cldr-unittest TestTransforms.java: it needed to normalize the results of UCharacter.toXxxCase to NFC, since the transforms normalize but UCharacter.toXxxCase does not. It should also specify the test strings using Java escapes for non-ASCII characters.
  3. UCharacter.toLowerCase/toTitleCase handling of "I\u0308": filed http://bugs.icu-project.org/trac/ticket/11094 about that, and updated the logKnownIssue in TestTransforms.java to refer to that ticket instead of this bug.
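The second item in the list above (normalize both results before comparing, and write test strings with escapes) can be sketched as follows. This is an illustration in Python, not the actual change made to TestTransforms.java; the helper names `same_after_nfc` and `to_java_escapes` are hypothetical. The sample code points are taken from the lt-Upper lists in the ticket description.

```python
import unicodedata


def same_after_nfc(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def to_java_escapes(text):
    """Render non-ASCII characters as Java-style \\uXXXX escapes
    (one escape per UTF-16 code unit)."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            units = ch.encode("utf-16-be")
            for k in range(0, len(units), 2):
                out.append("\\u%04X" % int.from_bytes(units[k:k + 2], "big"))
    return "".join(out)


# From the lt-Upper lists: 0058,0049,0300 versus 0058,00CC.
special = "X" + "I\u0300"
transform = "X\u00cc"

print(special == transform)                 # False
print(same_after_nfc(special, transform))   # True
print(to_java_escapes("X\u012e\u0308"))     # X\u012E\u0308
```

Normalizing removes differences that are only a matter of composed versus decomposed form. A real difference, such as ï versus i followed by U+0307 and U+0308, still compares unequal after NFC.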

comment:3 Changed 3 years ago by pedberg

  • Status changed from new to reviewing
  • Review set to mark

comment:4 Changed 3 years ago by mark

  • Status changed from reviewing to closed
  • Resolution set to fixed

good catches, thanks

comment:5 Changed 3 years ago by markus

  • Phase set to rc
  • Milestone changed from 26rc to 26
