[Unicode] Common Locale Data Repository: Bug Tracking

CLDR Ticket #6691(accepted unittest)

Opened 5 years ago

Last modified 14 months ago

Fix mismatch in transliterators vs NFD/NFC

Reported by: mark Owned by: sascha
Component: translit Data Locale:
Phase: rc Review:
Weeks: Data Xpath:


A transliterator written like the following will fail: by the time the ü rule is reached, the source text has already been converted to NFD, so the precomposed ü no longer occurs in it.

:: NFD;
ü > x;

The ü rule would have to be:

u \u0308 > x;
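The failure can be reproduced outside of ICU. The following is a minimal Python sketch, assuming that the standard unicodedata module stands in for the :: NFD; step and that str.replace stands in for a single conversion rule. It is an illustration only, not how ICU applies rules.

```python
import unicodedata

# Stand-in for the ":: NFD;" step: decompose the input text.
text = unicodedata.normalize("NFD", "f\u00fcr")  # "für" with precomposed ü

# Stand-in for the rule written with the precomposed character ü.
precomposed_rule = text.replace("\u00fc", "x")

# Stand-in for the same rule written in decomposed form
# (u followed by U+0308 COMBINING DIAERESIS).
decomposed_rule = text.replace("u\u0308", "x")

print([f"U+{ord(c):04X}" for c in text])  # code points after NFD
print(precomposed_rule == text)           # True: the rule never matched
print(decomposed_rule)                    # fxr
```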

To fix this, either

  1. add tests to verify that this doesn't happen (e.g. that the rules match the operand normalization form), or
  2. add a 'normalizing' tool to ensure that the rules are correct.
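Option 2 could look roughly like the sketch below. It assumes two hypothetical helpers, show and normalize_rule_source, and handles only simple one-line forward rules whose source consists of literal characters. Real rule files also contain escapes, quoting, contexts, variables and sets, none of which are handled here.

```python
import unicodedata

def show(ch: str) -> str:
    # Write combining marks as escapes so they stay visible in the rule file.
    if unicodedata.combining(ch):
        cp = ord(ch)
        return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"
    return ch

def normalize_rule_source(rule: str, form: str = "NFD") -> str:
    """Rewrite the source (left) side of a simple 'a > b;' rule into `form`.

    Hypothetical helper for illustration; anything that is not a plain
    forward rule is returned unchanged.
    """
    stripped = rule.strip()
    if stripped.startswith("::") or stripped.startswith("#"):
        return rule
    source, sep, target = rule.partition(">")
    if not sep or "<" in source:
        return rule
    # Unquoted whitespace in the source is not significant, so drop it,
    # normalize, and put single spaces between the resulting code points.
    literal = "".join(source.split())
    normalized = unicodedata.normalize(form, literal)
    return " ".join(show(c) for c in normalized) + " >" + target

print(normalize_rule_source("\u00fc > x;"))
```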


Change History

comment:1 Changed 5 years ago by emmons

  • Status changed from new to assigned
  • Component changed from unknown to data-supplemental
  • Priority changed from assess to medium
  • Milestone changed from UNSCH to 25rc
  • Owner changed from anybody to pedberg
  • Type changed from unknown to enhancement

comment:2 Changed 4 years ago by pedberg

  • Milestone changed from 25rc to 26rc

comment:3 Changed 4 years ago by pedberg

  • Component changed from data-supplemental to test

comment:4 Changed 4 years ago by pedberg

  • Cc mark added
  • Milestone changed from 26rc to 27rc

There is a third option for how to fix this, which is to eliminate the initial :: NFD; or :: NFD (NFC); rule.

This might be the best approach for several of the Cyrillic script/language -> Latin transforms, which often have :: NFD (NFC); at the beginning but then have rules for Й and й, which in NFD are e.g. И + \u0306. It is not clear that the NFD is needed for anything in the transform.
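To see which rules could be affected, one can list the Cyrillic letters that NFD decomposes. The sketch below uses Python's unicodedata module rather than ICU, so its results depend on the Unicode version bundled with the interpreter. The range U+0400..U+04FF is an assumption covering only the basic Cyrillic block.

```python
import unicodedata

# Collect every code point in the basic Cyrillic block (U+0400..U+04FF)
# whose NFD form differs from the character itself.
decomposing = {
    chr(cp): unicodedata.normalize("NFD", chr(cp))
    for cp in range(0x0400, 0x0500)
    if unicodedata.normalize("NFD", chr(cp)) != chr(cp)
}

for ch, nfd in decomposing.items():
    parts = " ".join(f"U+{ord(c):04X}" for c in nfd)
    print(f"U+{ord(ch):04X} -> {parts}")
```

Any precomposed letter printed here cannot be matched by a rule that runs after the text has been converted to NFD.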

Need some discussion on this.

comment:5 Changed 4 years ago by mark

The advantage of doing NFD is that if you have an odd accent, it gets pulled out, the base character gets converted, and then the accent applies to the new base in the new script.
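A small Python sketch of that advantage, assuming a hypothetical two-letter mapping (BASE_MAP) in place of real transform rules and unicodedata in place of ICU normalization:

```python
import unicodedata

# Hypothetical base-letter mapping, for illustration only.
BASE_MAP = {"\u0430": "a", "\u043e": "o"}  # Cyrillic а, о -> Latin a, o

def transliterate(text: str) -> str:
    # Pull accents out of precomposed letters.
    decomposed = unicodedata.normalize("NFD", text)
    # Convert base letters; combining marks pass through unchanged.
    converted = "".join(BASE_MAP.get(c, c) for c in decomposed)
    # Let the accent recompose on the new base.
    return unicodedata.normalize("NFC", converted)

# U+04D3 CYRILLIC SMALL LETTER A WITH DIAERESIS has no entry in BASE_MAP,
# yet it still converts because NFD exposes its base letter.
print(transliterate("\u04d3"))  # ä (U+00E4)
```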

We should have a test that checks:

If :: NFD occurs at the top, then all the left sides of > or <> rules are in NFD.

If :: .. (NFD) occurs at the bottom, then all the right sides of < or <> rules are in NFD.

And the same for the other forms: NFC, NFKC, NFKD.

The API gives a way to walk through the rules, so the files don't have to be parsed by hand to do this.
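The forward half of such a test might be sketched as follows. This does not use the ICU rule-walking API; it parses rule text naively, line by line, purely for illustration. It assumes that the side to check is the one matched against the already-normalized text, i.e. the left side of > and <> rules, and that the form is named by the first :: NFx rule. The helper names are hypothetical.

```python
import re
import unicodedata

def decode_escapes(s: str) -> str:
    # Turn \uXXXX escapes into the characters they denote.
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), s)

def unnormalized_sources(rules: str) -> list:
    """Return forward rules whose source is not in the form named by the
    first ':: NFx' rule. Quoting, contexts and variables are ignored."""
    form = None
    bad = []
    for line in rules.splitlines():
        line = line.strip().rstrip(";").strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("::"):
            m = re.match(r"::\s*(NFK?[CD])\b", line)
            if m and form is None:
                form = m.group(1)
            continue
        m = re.match(r"(.*?)(<>|>)", line)
        if form and m:
            # Unquoted whitespace is not significant in the source.
            source = decode_escapes("".join(m.group(1).split()))
            if unicodedata.normalize(form, source) != source:
                bad.append(line)
    return bad

sample = ":: NFD;\n\u00fc > x;\nu \\u0308 > y;\n"
print(unnormalized_sources(sample))
```

The reverse direction would need the same check on the right sides of < and <> rules, against the form given in parentheses.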

comment:6 Changed 4 years ago by markus

  • Phase set to rc
  • Milestone changed from 27rc to 27

comment:7 Changed 3 years ago by pedberg

  • Milestone changed from 27 to 28

comment:8 Changed 3 years ago by markus

  • Type changed from enhancement to unittest
  • Component changed from test to unknown

comment:9 Changed 3 years ago by srl

  • Status changed from assigned to accepted

comment:10 Changed 3 years ago by emmons

  • Component changed from unknown to translit

comment:11 Changed 3 years ago by pedberg

  • Milestone changed from 28 to 29

Out of time; look at this early in 29 if possible.

comment:12 Changed 3 years ago by emmons

  • Milestone changed from 29 to upcoming

Auto move of all 29 -> upcoming

comment:13 Changed 16 months ago by robert@…

Whichever option you choose for the rules, either NFC or NFD, the global filter at the top should accept both types of characters. Otherwise the transliterator will function correctly only for the one form the filter was written for.
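The filter problem can be illustrated with a rough Python stand-in. It assumes the global filter behaves like a set of individual code points tested one at a time; filter_mask and both sets are hypothetical and only model that one aspect.

```python
import unicodedata

def filter_mask(text: str, allowed: set) -> list:
    # Per-code-point membership test, as a rough stand-in for a global filter.
    return [c in allowed for c in text]

nfc_only = {"\u0439"}                        # precomposed й only
both_forms = {"\u0439", "\u0438", "\u0306"}  # й, plus и and combining breve

nfd_text = unicodedata.normalize("NFD", "\u0439")  # и + U+0306

print(filter_mask(nfd_text, nfc_only))    # [False, False]
print(filter_mask(nfd_text, both_forms))  # [True, True]
```

Note that both_forms also admits a bare и and a breve on any other letter, which is the kind of side effect raised in the next paragraph.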

If the filter contains diacritics so that NFD text can pass, I wonder whether that causes problems for combinations of letters and diacritics that are not meant to be acted on. At least if the rules are in NFD, the order of the rules becomes critical to make sure the correct action occurs (the rules must follow the same order as the diacritics). Note that additional diacritics applied to the letter could interfere with a rule if they fall between the letter and the diacritic(s) the rule is designed to act upon.

Due to this level of complexity (and I am not sure there is a solution to it), I would not want to have to design and use NFD rules just so that extra diacritics can be accommodated. Besides, transliteration standards such as BGN do not specify letters with additional diacritics. I would therefore feel safer using NFC rules, which should be sufficient from a standards perspective.

There is still the problem of how to pass NFD text into an NFC transliterator. Of course the NFC converter could be called before the transliterator is even invoked, but that makes the (second) NFC conversion inside the transliterator a waste of CPU time. One approach would be a global filter that allows strings as well as characters. That way both the NFC characters and the NFD sequences (expressed as strings in curly braces) could be filtered. Unfortunately ICU currently does not support this in the global filter. Another approach, which is easier to imagine, would be for the NFC conversion to occur before the global filter. But ICU does not support this either.

For your interest, my personal solution has been to rewrite (and fix) the transliterators I need with NFC-only rules and filtering, and no internal NFC conversion. All input text is converted to NFC before the transliterator is invoked. This works well; luckily I do not need reverse transliteration, so the output can remain in NFC.
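That workflow can be sketched in Python as follows. NFC_RULES is a hypothetical stand-in for an NFC-only rule set (it is not taken from BGN or CLDR), and unicodedata.normalize stands in for whatever NFC converter is called before the transliterator.

```python
import unicodedata

# Hypothetical NFC-only rule table, for illustration only.
NFC_RULES = {"\u0439": "y", "\u0438": "i"}  # й -> y, и -> i

def transliterate_nfc(text: str) -> str:
    # Normalize once, outside the rule set, so the rules only ever see NFC.
    composed = unicodedata.normalize("NFC", text)
    return "".join(NFC_RULES.get(c, c) for c in composed)

# Decomposed input (и + U+0306) is composed to й before the rules run.
print(transliterate_nfc("\u0438\u0306"))  # y
```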

I hope you will find my comments useful to spur on further ideas and progress with the project!

comment:14 Changed 16 months ago by pedberg

  • Cc pedberg, sascha added
  • Owner changed from pedberg to sascha
  • Milestone changed from upcoming to 32

Sascha, didn't we already fix this problem under a different ticket? Perhaps we fixed the data but did not add a test.

comment:15 Changed 14 months ago by sascha

  • Milestone changed from 32 to upcoming

Partially, but afaik we don’t have systematic tests and a few transliterators might still be broken. Sadly (or not), I’ll be offline until mid August, so I probably won’t be able to do this in time for CLDR v32. Sorry about that. I’ll look into this when I’m back, unless of course somebody else does it in the meantime.

