CLDR Ticket #5551 (closed enhancement: fixed)
deprecate XML syntax for collation tailorings
Reported by: markus | Owned by: markus
We have two kinds of Collation Tailorings syntax: Basic Syntax (ICU syntax) and XML Syntax.
The XML syntax is harder to read, and it needs to be converted into ICU syntax both for building ICU and for custom tailorings when using ICU. When we add syntax items, we need to add them to both syntaxes and adjust the XML-to-basic syntax converter.
The CLDR Collation Guidelines say for filing a request: "For readability, the rules should be supplied in the core syntax" (where core==basic). Once such a request has been reviewed, it needs to be converted into XML syntax for the current set of collation data files.
This seems like unnecessary work.
For the CLDR root collation, we have already decided to stop publishing the UCA_Rules in XML syntax.
Does anyone outside CLDR use the XML syntax?
The simplest change might be to replace <rules> with <basic_rules> or similar, which would contain the tailoring data in basic syntax. The rest of the collation .xml files would be unchanged (validSubLocales, type, settings, etc.).
Issue: In an .xml file, basic syntax like &a<<x becomes even more unreadable: &amp;a&lt;&lt;x
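To make the escaping problem concrete, here is a minimal sketch using Python's standard xml.sax.saxutils module. It is only an illustration of what XML character data requires, not part of any CLDR tool; the rule string is the example from this ticket.

```python
from xml.sax.saxutils import escape, unescape

# Basic-syntax (ICU) tailoring rule used as the example in this ticket.
rule = "&a<<x"

# '&' and '<' are markup characters in XML, so the rule has to be
# escaped before it can appear as element content in an .xml file.
escaped = escape(rule)
print(escaped)

# A consumer must reverse the escaping to get the original rule back.
round_tripped = unescape(escaped)
print(round_tripped == rule)
```

Every reset (&) and every relation (<, <<, <<<) would be affected in the same way, which is why the options below look at keeping the rules out of XML character data.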
- We could put the tailoring rules into .txt files and refer to them from the .xml files which continue to carry all of the other data. Naming could be implied, using <locale ID>_<collation type>.txt, for example common/collation/rules/de_search.txt. The .xml file would get each <rules> element replaced with something like <import_rules_txt/>. Again, the rest of the collation .xml files would be unchanged.
- We could put the tailoring rules into attribute values, for example: <rules><basic reset="a" tailor="<<x"/>... -- but non-distinguishing attribute values are generally not desirable.
- We could add alternate, XML-friendly syntax characters to the basic syntax, like the `,` and `;` that ICU used before ICU 1.8 and that appear to still work in ICU 50 (so we would only need to come up with alternates for & and <). However, that would seem to diminish the benefit of deprecating the XML syntax.
- We could replace the entire collation/*.xml files with the corresponding ICU .txt files. However, if there are CLDR tools that parse the collation data, then they would have to parse at least some of the ICU resource bundle format.
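The implied naming in the .txt file option above could be sketched as follows. The helper name rules_txt_path and its base parameter are hypothetical, introduced only for illustration; the directory and the <locale ID>_<collation type>.txt pattern come from the example in this ticket.

```python
def rules_txt_path(locale_id: str, collation_type: str,
                   base: str = "common/collation/rules") -> str:
    """Hypothetical helper: derive the rules file name for one tailoring.

    Follows the pattern suggested in the ticket:
    <locale ID>_<collation type>.txt under a rules directory.
    """
    return f"{base}/{locale_id}_{collation_type}.txt"


# The German search collation from the ticket's example.
path = rules_txt_path("de", "search")
print(path)
```

With naming implied like this, the .xml file would not need to carry a file name at all; the locale ID and collation type already present there would be enough to locate the rules.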
It should be possible to find something that is simpler, more readable, and more easily extended than what we have now.
- Owner changed from anybody to markus
- Priority changed from assess to medium
- Status changed from new to assigned
- Component changed from spec to design
- Milestone changed from UNSCH to 23