
CLDR Ticket #5551 (closed enhancement: fixed)

Opened 2 years ago

Last modified 23 months ago

deprecate XML syntax for collation tailorings

Reported by: markus Owned by: markus
Component: collation Version: svn
Load: Data Locale:
Phase: Review: emmons
Weeks: 0.5 Data Xpath:




We have two kinds of Collation Tailorings syntax: Basic Syntax (ICU syntax) and XML Syntax.

The XML syntax is harder to read, and it needs to be converted into ICU syntax for building ICU and for custom tailorings when using ICU. When we add syntax items, we need to add them to each syntax, and adjust the XML-to-basic syntax converter.

The CLDR Collation Guidelines say for filing a request: "For readability, the rules should be supplied in the core syntax" (where core==basic). Once such a request has been reviewed, it needs to be converted into XML syntax for the current set of collation data files.

This seems like unnecessary work.

For the CLDR root collation, we have already decided to stop publishing the UCA_Rules in XML syntax.

Does anyone outside CLDR use the XML syntax?

The simplest change might be to replace <rules> with <basic_rules> or similar, which would contain the tailoring data in basic syntax. The rest of the collation .xml files would be unchanged (validSubLocales, type, settings, etc.).

Issue: In an .xml file, the basic syntax like &a<<x becomes even more unreadable: &amp;a&lt;&lt;x


  • We could put the tailoring rules into .txt files and refer to them from the .xml files which continue to carry all of the other data. Naming could be implied, using <locale ID>_<collation type>.txt, for example common/collation/rules/de_search.txt. The .xml file would get each <rules> element replaced with something like <import_rules_txt/>. Again, the rest of the collation .xml files would be unchanged.
  • We could put the tailoring rules into attribute values, for example: <rules><basic reset="a" tailor="<<x"/>... -- but non-distinguishing attribute values are generally not desirable.
  • We could add alternate, XML-friendly syntax characters to the basic syntax, like the `,` and `;` that ICU used before ICU 1.8 and that appear to still work in ICU 50 (so we would only need to come up with alternates for & and <). However, that would seem to diminish the benefit of deprecating the XML syntax.
  • We could replace the entire collation/*.xml files with the corresponding ICU .txt files. However, if there are CLDR tools that parse the collation data, then they would have to parse at least some of the ICU resource bundle format.
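
The escaping issue noted above is easy to reproduce with any XML serializer. A minimal sketch using Python's standard library; the sample rule `&a<<x` is the one from this ticket, and the variable names are illustrative only:

```python
from xml.sax.saxutils import escape, unescape

rule = "&a<<x"          # basic (ICU) syntax as a human would write it
in_xml = escape(rule)   # form required inside plain XML element content

print(in_xml)           # &amp;a&lt;&lt;x

# An XML parser restores the original string, so only readability suffers.
assert unescape(in_xml) == rule
```

The data round-trips without loss; the cost is purely in what a person has to read and type in the .xml file.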

It should be possible to find something that is simpler, more readable, and more easily extended than what we have now.


Change History

comment:1 Changed 2 years ago by markus

  • Keywords collation added
  • Weeks set to 0.5

comment:2 Changed 2 years ago by markus

  • Owner changed from anybody to markus
  • Priority changed from assess to medium
  • Status changed from new to assigned
  • Component changed from spec to design
  • Milestone changed from UNSCH to 23

Investigate, look for users of the XML collation syntax (e.g., ask on CLDR & ICU mailing lists), prototype data file changes.

comment:3 Changed 2 years ago by markus

Consider using XML CDATA sections with basic syntax:

Consider adding alternate, XML-friendly syntax characters to the basic syntax. Latin-1 and related characters, and/or Unicode 1.1 characters, are likely best supported via fonts & keyboards. It would also be nice to stay within Pattern_Syntax. For example:
<basic>§a¬x«y~Y</basic> (all ASCII/Latin-1 Pattern_Syntax) or
<basic>§a‹x«y⁖Y</basic> or
<basic>§a‹x≪y⋘Y</basic> or
<basic>§a≺x⪻y≋Y</basic> or
<basic>§a←x⇐y⇚Y</basic> or
<basic>§a→x⇒y⇛Y</basic> (not sure whether left-pointing or right-pointing arrows would be better) or
<basic>§a~x≈y≋Y</basic> or
<basic>§a①x②y③Y</basic> (not Pattern_Syntax)

With a '&' reset, only the old syntax would be valid. With a '§' reset, both old and new syntax would be allowed, and all Pattern_Syntax characters would be reserved (that is, they would have to be escaped or quoted).

comment:4 Changed 2 years ago by markus

Apparently MySQL supports XML collation syntax for custom tailorings. Curiously, they mix the XML syntax with \uhhhh escapes from the basic syntax. See MySQL 5.6 Reference Manual: LDML Syntax Supported in MySQL

SIL Fieldworks uses ICU collation but stores its tailorings in XML collation syntax.

Unknown if anyone would have a problem with CLDR changing the format in its own data files.

comment:5 Changed 2 years ago by markus

Actually, I think we can do this with just ASCII Pattern_Syntax.

JDK and ICU already support ; for secondary differences and , for tertiary. We could just un-deprecate them to minimize new syntax. The logical extension would be . for primary and @ for the reset. If we do not already require all Pattern_Syntax to be reserved, then we could do that where the @ reset is used.

If and when we add quaternary tailorings we could use something like ~ ("almost same"). We already also use ASCII = for "identical".

Mnemonic 1: Stronger text separators for stronger differences. . ends a sentence, ; and then , are weaker separators.

Mnemonic 2: . is a period and starts with 'p' like "primary". ; is a semicolon and starts with 's' like "secondary".

A side benefit would be that the rules would get slightly shorter.

XML syntax (excerpt):
    <reset before="primary">ǀ</reset>

Basic syntax (ICU 1.8+):
&[before 1]ǀ<æ<<<Æ<<ä<<<Ä<<ę<<<Ę<ø<<<Ø<<ö<<<Ö<<ő<<<Ő<<œ<<<Œ

Basic syntax (old + proposed):
@[before 1]ǀ.æ,Æ;ä,Ä;ę,Ę.ø,Ø;ö,Ö;ő,Ő;œ,Œ
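
The correspondence between the two strings above is mechanical, which the following sketch illustrates. It is not part of any CLDR or ICU tool; the function name `to_proposed_syntax` is hypothetical, and the sketch assumes a rule string with no quoted or escaped operators and no literal `.`, `;`, `,` or `@`.

```python
import re

# ICU 1.8+ operators mapped to the old + proposed ASCII operators
# discussed in this comment (illustrative only).
_OPERATORS = {"&": "@", "<": ".", "<<": ";", "<<<": ","}

def to_proposed_syntax(rules: str) -> str:
    """Rewrite reset and strength operators; leave everything else alone.

    Assumption: no quoting, no escapes, no quaternary '<<<<' operators.
    """
    # Alternation order matters: the longest operator must be tried first.
    return re.sub(r"<<<|<<|<|&", lambda m: _OPERATORS[m.group()], rules)

icu = "&[before 1]ǀ<æ<<<Æ<<ä<<<Ä<<ę<<<Ę<ø<<<Ø<<ö<<<Ö<<ő<<<Ő<<œ<<<Œ"
print(to_proposed_syntax(icu))
```

A real converter would need a proper tokenizer, since a naive substitution like this would also rewrite operators that appear inside quoted literals.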

comment:6 Changed 2 years ago by markus

  • Xref set to 5549 5565
  • Milestone changed from 23 to 24

In CLDR collation .xml files, wrap the old+proposed syntax into <basic> elements that can be mixed with <reset> and <import>.

This will be useful for

  1. adding syntax elements like in ticket:5549 so that we need not add them in two syntaxes
  2. the new LDML2ICUConverter which would need only a simpler conversion of the collation tailoring rules (see ticket:5565)

Neither will be done for CLDR 23, so moving this ticket to 24.

comment:7 Changed 2 years ago by markus

Looking at the ICU4C rules parser, it supports undocumented syntax where '@' turns on backward-secondary sorting.

I wonder if we can ignore this in defining the new syntax.


        /* '@' is french only if the strength is not currently set */
        /* if it is, it's just a regular character in collation rules */
    case 0x0040/*'@'*/:
        if (newStrength == UCOL_TOK_UNSET) {
            src->opts->frenchCollation = UCOL_ON;

Note that this code has a bug -- it falls through to the '|' parsing code if newStrength != UCOL_TOK_UNSET.

comment:8 Changed 2 years ago by markus

  • Status changed from assigned to accepted

comment:9 Changed 2 years ago by markus

  • Component changed from design to data-collation

comment:10 Changed 2 years ago by markus

2013-04-10 CLDR TC agreed to the following.

I propose that we put ICU syntax into XML element contents, using CDATA sections.

I propose that we add a <cr> element for the ICU-syntax Collation Rules. It will have no attributes.
The ICU-syntax collation rule string will be in the <cr> element's text contents.
In CLDR files, we will use CDATA sections, to keep the rules readable.
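
As a rough illustration of why CDATA keeps the rules readable, the sketch below parses a hand-written `<cr>` element with Python's standard XML parser. The sample rules are made up for this example and are not taken from any CLDR data file.

```python
import xml.etree.ElementTree as ET

# Hypothetical <cr> element: no attributes, ICU-syntax rules in a CDATA section.
SAMPLE = """<cr><![CDATA[
  &a<<x
  &b<y
]]></cr>"""

cr = ET.fromstring(SAMPLE)
rules = cr.text  # CDATA content arrives verbatim, with '&' and '<' intact

print(rules)
```

The rule text in the file is identical to what the parser hands to the application, so no `&amp;` or `&lt;` ever appears in the source.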


I propose that we use the (currently undocumented) ICU collation comment syntax in the rules: Shell-script-style # starts a comment which continues up to the end of the line.

I propose that the LDML2ICUConverter will

  • split the text contents into lines (split at \n)
  • remove from each line any characters starting with the first # (saves space in compiled resource bundles)
  • trim leading and trailing Pattern_White_Space
  • wrap each line in "" for ICU *.txt files (which are parsed by genrb which then hands them to the ICU collation rule parser)
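
The four steps above can be sketched as follows. This is not the LDML2ICUConverter code; the function name `cr_to_icu_lines` is hypothetical, dropping lines that end up empty is an assumption the ticket does not state, and any escaping of `"` or `\` that genrb may require is deliberately left out.

```python
# Pattern_White_Space code points as defined by Unicode (UAX #31).
PATTERN_WHITE_SPACE = "\t\n\x0b\x0c\r \x85\u200e\u200f\u2028\u2029"

def cr_to_icu_lines(cr_text: str) -> list:
    """Turn the text of a <cr> element into quoted lines for an ICU .txt file."""
    lines = []
    for line in cr_text.split("\n"):              # 1. split at \n
        line = line.split("#", 1)[0]              # 2. drop from the first '#'
        line = line.strip(PATTERN_WHITE_SPACE)    # 3. trim Pattern_White_Space
        if line:                                  # assumption: skip empty lines
            lines.append('"' + line + '"')        # 4. wrap in double quotes
    return lines

sample = "# header comment\n&a<b  # trailing comment\n  <<c\n"
for out in cr_to_icu_lines(sample):
    print(out)
```

Because step 2 is unconditional, a `#` inside a quoted literal would be cut as well, which is why a literal `#` has to be written as an escape.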

I propose that <cr> be a child of <collation>. We can allow multiple <cr> elements but normally we will use only one.

I propose that the existing <import> element be allowed on the same level, before the <settings> element.

Since the import pulls in not just the source's <cr> but also the settings etc., I think the import should be first, and then local settings override the imported ones. (ICU rule strings contain all of the settings, and [import] includes the entire source rule string with all its settings.)
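
The intended precedence (imported settings first, local settings override) can be modeled with plain dictionaries. This is only a sketch of the ordering; the function name `effective_settings` is hypothetical, the setting values are illustrative, and a real converter works on rule strings rather than dictionaries.

```python
def effective_settings(imported: dict, local: dict) -> dict:
    """Apply imported settings first, then let local settings override them."""
    merged = dict(imported)  # start from the import source's settings
    merged.update(local)     # local <settings> win on conflict
    return merged

imported = {"strength": "tertiary", "backwards": "on"}
local = {"backwards": "off"}
print(effective_settings(imported, local))
```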

This puts <import> and <cr> at the same level as <rules>, <settings>, and <optimize> etc.

<!ELEMENT collation (alias | (base?, import*, settings?, suppress_contractions?, optimize?, (rules? | cr*), special*)) >

I propose that we document that \uhhhh etc. is used for escaping hard-to-read and hard-to-type characters. For example, \uFDD0.

I propose that we document the ASCII apostrophe for quoting literal white space and reserved characters (with existing ICU syntax details). For example, ' ' (U+0020 space), '\u0022' (U+0022 double quote), and '' (a pair of U+0027 apostrophes encodes a single apostrophe).

As a consequence of the LDML2ICUConverter's unconditional stripping of #, a literal # must be written as '\u0023'.

The plan is:

  1. Deprecate the XML collation syntax (mark the relevant elements as deprecated in the DTD).
  2. Remove the description of the XML collation syntax from the CLDR 24 spec; mention it only briefly and refer to the CLDR 23 spec.
  3. Change the old LDML2ICUConverter to support <cr>.
  4. Change the CLDR collation data from <rules> to <cr>.
  5. Support collation with <cr> in the new LDML2ICUConverter.
  6. Do not support the old <rules> in the new LDML2ICUConverter.

comment:11 Changed 23 months ago by markus

  • Review set to emmons

comment:12 Changed 23 months ago by emmons

  • Status changed from accepted to closed
  • Resolution set to fixed
