
CLDR Ticket #10226 (closed data: fixed)

Opened 12 months ago

Last modified 7 months ago

Fix segmentation rules

Reported by: mark
Owned by: mark
Component: segmentation
Data Locale:
Phase: dvet
Review: andy
Weeks:
Data Xpath:


See email chain on cldr-users, titled "Word break question". (Richard, Cameron, Philippe, thanks for tracking this down...)

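The [...] syntax must only be used for valid UnicodeSets, so the last two lines below are invalid: by that point $ALetter, $Hebrew_Letter, $MidNumLet, and $Single_Quote have been redefined as parenthesized expressions, which are no longer UnicodeSets.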

<variable id="$Hebrew_Letter">($Hebrew_Letter $FEZ*)</variable>
<variable id="$ALetter">($ALetter $FEZ*)</variable>
<variable id="$MidNumLet">($MidNumLet $FEZ*)</variable>
<variable id="$Single_Quote">($Single_Quote $FEZ*)</variable>
<variable id="$AHLetter">[$ALetter $Hebrew_Letter]</variable>
<variable id="$MidNumLetQ">[$MidNumLet $Single_Quote]</variable>

As for the fix, I think the cleanest approach is to move those two lines above the definition of "$FEZ", then add the following rewrite lines at the end of the <variables> section. That way everything is parallel:

<variable id="$AHLetter">($AHLetter $FEZ*)</variable>
<variable id="$MidNumLetQ">($MidNumLetQ $FEZ*)</variable>

This should be posted as a known issue, and fixed soon, since it may have an impact on some of the Unicode 10.0 work.

The CLDR test code is too lenient about this syntax, and also needs to be fixed to check for it.
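
As an illustration of the kind of check that is missing, here is a minimal sketch in Python. It is not the CLDR test code. It assumes that variables are processed in document order, that a value counts as a UnicodeSet only if it starts with "[" and ends with "]", and that the snippet is wrapped in a single <variables> element. The function name find_invalid_set_references is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

VARIABLE_REF = re.compile(r"\$\w+")


def find_invalid_set_references(variables_xml):
    """Return (variable id, offending reference) pairs, in document order.

    Assumption for this sketch: a value is a UnicodeSet only if it is
    wrapped in [...]; anything else (e.g. "($X $FEZ*)") is not a set.
    """
    is_set = {}    # variable id -> True if its latest value is a [...] set
    problems = []
    for var in ET.fromstring(variables_xml).iter("variable"):
        var_id = var.get("id")
        value = (var.text or "").strip()
        bracketed = value.startswith("[") and value.endswith("]")
        if bracketed:
            for ref in VARIABLE_REF.findall(value):
                # Only flag references already known to be non-sets;
                # variables not seen yet are left alone.
                if is_set.get(ref) is False:
                    problems.append((var_id, ref))
        is_set[var_id] = bracketed
    return problems


# The six lines quoted in this ticket, wrapped in a <variables> element.
TICKET_LINES = """<variables>
<variable id="$Hebrew_Letter">($Hebrew_Letter $FEZ*)</variable>
<variable id="$ALetter">($ALetter $FEZ*)</variable>
<variable id="$MidNumLet">($MidNumLet $FEZ*)</variable>
<variable id="$Single_Quote">($Single_Quote $FEZ*)</variable>
<variable id="$AHLetter">[$ALetter $Hebrew_Letter]</variable>
<variable id="$MidNumLetQ">[$MidNumLet $Single_Quote]</variable>
</variables>"""

# The proposed ordering; earlier definitions, including $FEZ, are elided.
FIXED_ORDER = """<variables>
<variable id="$AHLetter">[$ALetter $Hebrew_Letter]</variable>
<variable id="$MidNumLetQ">[$MidNumLet $Single_Quote]</variable>
<variable id="$Hebrew_Letter">($Hebrew_Letter $FEZ*)</variable>
<variable id="$ALetter">($ALetter $FEZ*)</variable>
<variable id="$MidNumLet">($MidNumLet $FEZ*)</variable>
<variable id="$Single_Quote">($Single_Quote $FEZ*)</variable>
<variable id="$AHLetter">($AHLetter $FEZ*)</variable>
<variable id="$MidNumLetQ">($MidNumLetQ $FEZ*)</variable>
</variables>"""

if __name__ == "__main__":
    print(find_invalid_set_references(TICKET_LINES))
    print(find_invalid_set_references(FIXED_ORDER))
```

With the lines as quoted in the ticket, both bracketed definitions are flagged; with the proposed ordering, nothing is flagged. A real check would need a proper UnicodeSet parser rather than the bracket heuristic used here.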


rootAddon.xml (23.9 KB) - added by mark 7 months ago.
Comparison file generated from Segment.java

Change History

comment:1 Changed 12 months ago by mark

  • Priority changed from assess to critical
  • Type changed from unknown to data
  • Milestone changed from UNSCH to 32

comment:2 Changed 11 months ago by mark

  • Owner changed from anybody to mark
  • Phase changed from dsub to dvet
  • Status changed from new to accepted

comment:3 Changed 8 months ago by mark

  • Status changed from accepted to reviewing
  • Review set to markus

Fixed in Unicode tools, since Segmenter.java has moved there, under this ticket.

comment:4 Changed 7 months ago by markus

  • Cc andy added
  • Status changed from reviewing to reviewfeedback
  • Component changed from unknown to segmentation

With some treasure hunting, I found the change in http://www.unicode.org/utility/trac/changeset/1333

However, Andy says that there should be resulting data changes in CLDR source:trunk/common/segments or in the Unicode segmentation test files, or both. Please explain why there are no such changes.

Changed 7 months ago by mark

Comparison file generated from Segment.java

comment:5 Changed 7 months ago by mark

  • Status changed from reviewfeedback to reviewing
  • Review changed from markus to andy

Changed reviewer to Andy.

The Segment.java code always produced XML rules internally, so I added a main to print them out into a file. That file (rootAddon.xml) is attached. I then compared the two files line by line.

Unfortunately, the files had drifted quite far out of alignment, so the first step was to rearrange the variable lines in root to match the order in rootAddon. I copied over all the updated comments from rootAddon. Then I changed the variable names to be the same where they differed, e.g. $RI and $FEZ => $FE.

Once I did that, I isolated the real differences that needed to remain in root.xml, which had to do with the changes that disallow breaks not only between emoji but also between emoji-like characters. Those were left in the root.xml file.

So for review, I suggest that you first review the changes between root and rootAddon. Then look at the changes between the old root and the new one to check that I didn't make a mistake in fixing the alignment (as described above).
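
For reviewers who want to double-check the alignment independently of line order, a comparison keyed by variable id is one option. The following Python sketch is not part of the CLDR or Unicode tools. It assumes each input is a single <variables> element, and the function names and the optional renames mapping are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def load_definitions(xml_text, renames=None):
    """Map each variable id to its list of values, in document order.

    renames is an optional {old_name: new_name} mapping applied to both
    ids and values, so renamed variables do not show up as differences.
    """
    renames = renames or {}

    def rename(text):
        # Match whole variable names only, so $FE is not replaced inside $FEZ.
        return re.sub(r"\$\w+", lambda m: renames.get(m.group(0), m.group(0)), text)

    definitions = defaultdict(list)
    for var in ET.fromstring(xml_text).iter("variable"):
        value = " ".join((var.text or "").split())  # normalize whitespace
        definitions[rename(var.get("id"))].append(rename(value))
    return dict(definitions)


def compare_definitions(old, new):
    """Report ids missing on either side and ids whose value lists differ."""
    return {
        "only_in_old": sorted(old.keys() - new.keys()),
        "only_in_new": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Because a variable can be defined more than once, each id maps to a list of values. Reordering different variables relative to one another is ignored, but a change in the sequence of redefinitions of the same variable is reported under "changed".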

I will fix the BRS item A20 to split into the comparison and update compared to Segment.java, and the sync over to ICU.

comment:7 Changed 7 months ago by andy

  • Status changed from reviewing to closed
  • Resolution set to fixed

> I will fix the BRS item A20 to split into the comparison and update compared to Segment.java, and the sync over to ICU.

ICU does not sync from CLDR; ICU rules are updated directly from UAX 14/29 and other published Unicode documents. The published CLDR segmentation rules don't actually drive anything right now. I'd like to change that, and have them be the master for ICU reference testing and for CLDR tooling.

