CLDR Ticket #10128 (new spec)

Opened 2 weeks ago

Last modified 2 weeks ago

LDML collation document ICU escapes in tools & demos but not in library

Reported by: Richard Wordingham <richard.wordingham@…>
Owned by: anybody
Component: collation
Data Locale:
Phase: spec-beta
Review:
Weeks: 0.05
Data Xpath:


(Markus submitting on behalf of Richard.)

Richard points out that the ICU Collator-from-rules constructor does not understand \uhhhh escapes, although he expected it to, based on a reading of the LDML collation spec. Clarify that only ICU's tools (genrb) and demos handle these escapes: they unescape the rule strings before passing them into the library code.
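
To illustrate the pre-processing step described above, here is a minimal sketch in Python of unescaping a rule string before it reaches a collator constructor. It is not ICU's UnicodeString::unescape(): the function name unescape_rules is hypothetical, only the \uhhhh and \Uhhhhhhhh forms are handled, and leaving every other backslash sequence untouched is an assumption made for the sketch.

```python
import re

# An escaped backslash is matched first so that the "u0061" in r"\\u0061"
# is not mistaken for an escape; then \uhhhh (4 hex digits) and
# \Uhhhhhhhh (8 hex digits).
_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_rules(rules: str) -> str:
    r"""Replace \uhhhh and \Uhhhhhhhh escapes with the characters they denote.

    Hypothetical helper: a simplified stand-in for what a tool might do
    before handing the string to the library. Surrogate pairs written as
    two \uhhhh escapes are NOT combined here.
    """
    def repl(match):
        digits = match.group(1) or match.group(2)
        if digits is None:
            # Escaped backslash: keep it exactly as written.
            return match.group(0)
        code_point = int(digits, 16)
        if code_point > 0x10FFFF:
            # Not a valid code point: leave the text unchanged.
            return match.group(0)
        return chr(code_point)

    return _ESCAPE.sub(repl, rules)


raw_rules = r"&a < \u00E4"
print(unescape_rules(raw_rules))  # prints "&a < ä"
```

A caller would pass the result of unescape_rules, rather than the raw string, to whatever builds the collator from rules. Passing the raw string is the situation this ticket describes.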


Change History

comment:1 Changed 2 weeks ago by richard.wordingham@…

More generally, with the original title of "Ill-defined Collation Rule Syntax":

Neither the LDML (UTS#35 Version 30 Part 5 Section 3.5) nor the ICU collation rule syntax appears to be defined, though fortunately I have not yet found a problem with the common meanings they are intended to support.

The LDML specification misstates, "The CLDR rule syntax is a subset of the [ICUCollation] syntax". This means that anything that is valid as CLDR rule syntax is also valid as ICU collation rule syntax. However, I have just spent a good deal of effort tracking down a problem that arises because the \uhhhh syntax of LDML is not supported in ICU, despite being recorded in the documentation of UnicodeString::unescape().

Neither LDML nor ICU documents what appears to be an end-of-line comment syntax introduced by '#'. I therefore do not know what constitutes an 'end of line'. Conceivably the examples given in LDML and the CLDR files employing such comments are simply in error! From their appearing in XML 1.0 files I can deduce that the characters CR and LF each force an end of line, but I have no idea about the handling of:

U+0085 (a.k.a. NEXT LINE)


