[Unicode] Common Locale Data Repository : Bug Tracking

CLDR Ticket #10128 (accepted spec)

Opened 13 months ago

Last modified 2 months ago

Ill-defined Collation Rule Syntax

Reported by: Richard Wordingham <richard.wordingham@…> Owned by: markus
Component: collation Data Locale:
Phase: spec-beta Review:
Weeks: 0.05 Data Xpath:

Description (last modified by mark)

(Markus submitting on behalf of Richard.)

Richard points out that the ICU Collator-from-rules constructor does not understand \uhhhh escapes, although he expected it to, based on his reading of the LDML collation spec. Clarify that only ICU's tools (genrb) and demos handle these escapes: they unescape the rule string before passing it into the library code.
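To illustrate the kind of preprocessing described above, here is a minimal Python sketch of unescaping a rule string before handing it to a collator. It is not ICU's implementation; the function name `unescape_rules` is hypothetical, and it handles only the \uhhhh and \Uhhhhhhhh forms, leaving a doubled backslash untouched. Other escape forms documented for UnicodeString::unescape() are deliberately omitted.

```python
import re

# Matches a doubled backslash, \uhhhh (4 hex digits), or \Uhhhhhhhh (8 hex digits).
# The doubled backslash is matched first so that "\\u0041" is not misread as an escape.
_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_rules(rules: str) -> str:
    """Hypothetical helper: expand \\uhhhh and \\Uhhhhhhhh escapes in a rule string.

    Assumptions (not taken from the LDML spec or ICU):
    - a doubled backslash is left exactly as written;
    - surrogate pairs written as two \\uhhhh escapes are NOT combined;
    - values above 10FFFF raise ValueError.
    """
    def _replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            return match.group(0)  # doubled backslash: keep unchanged
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(_replace, rules)


escaped = r"&a < \u00E4 < b"
print(unescape_rules(escaped))  # prints "&a < ä < b"
```

A caller would run this step once on the escaped text and pass only the result to the rule parser, which matches the division of labour the description attributes to the tools and the library.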

(Mark: restored Richard's broader title. I suggest, like we plan to do for UnicodeSet, we supply an EBNF definition of the collation rules.)


Change History

comment:1 Changed 13 months ago by richard.wordingham@…

More generally, with the original title of "Ill-defined Collation Rule Syntax":

Neither the LDML (UTS#35 Version 30 Part 5 Section 3.5) nor the ICU collation rule syntax appears to be defined, though fortunately I have not yet found a problem with the common meanings they are intended to support.

The LDML specification states, "The CLDR rule syntax is a subset of the [ICUCollation] syntax". This means that anything that is valid as CLDR rule syntax is also valid as ICU collation rule syntax. However, I have just spent a good deal of effort tracking down a problem that arises because the \uhhhh syntax of LDML is not supported in ICU, despite being recorded in the documentation of UnicodeString::unescape().

Neither LDML nor ICU documents what appears to be an end-of-line comment syntax introduced by '#'. I therefore do not know what constitutes an 'end of line'. Conceivably the examples given in LDML and the CLDR files employing such comments are simply in error! Because these comments appear in XML 1.0 files, I can deduce that the characters CR and LF each force an end of line, but I have no idea about the handling of:

U+0085 (a.k.a. NEXT LINE)
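The practical effect of this ambiguity can be shown with a small Python sketch. The helper `strip_comments` is hypothetical and ignores any quoting or escaping of '#'; it only shows how the result depends on which characters are taken to end a line. It makes no claim about what ICU or the CLDR tools actually do.

```python
def strip_comments(rules: str, line_breaks: str) -> str:
    """Hypothetical helper: drop text from '#' up to the next line-break character.

    `line_breaks` is the set of characters assumed to end a comment.
    Quoting and escaping of '#' are ignored in this sketch.
    """
    out = []
    in_comment = False
    for ch in rules:
        if in_comment:
            if ch in line_breaks:
                in_comment = False
                out.append(ch)  # keep the line break itself
        elif ch == "#":
            in_comment = True
        else:
            out.append(ch)
    return "".join(out)


rules = "&a < b # comment\u0085&c < d"

# Reading 1: only CR and LF end a line, so the comment swallows "&c < d".
narrow = strip_comments(rules, "\r\n")

# Reading 2: U+0085 also ends a line, so the second rule survives.
wide = strip_comments(rules, "\r\n\u0085")
```

Under the first reading the second relation disappears silently; under the second it is kept, so the two readings produce different tailorings from the same text.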

comment:2 Changed 7 months ago by mark

  • Status changed from new to accepted
  • Description modified
  • Summary changed from "LDML collation document ICU escapes in tools & demos but not in library" to "Ill-defined Collation Rule Syntax"
  • Priority changed from minor to medium
  • Milestone changed from UNSCH to 33
  • Owner changed from anybody to markus

comment:3 Changed 2 months ago by markus

  • Keywords punt33 added
  • Milestone changed from 33 to 34
