CLDR Ticket #5220 (closed: fixed)

Opened 6 years ago

Last modified 6 years ago

Non-BMP characters in XML attributes aren't parsed correctly

Reported by: emmons
Owned by: jchye
Component: xxx-tools
Data Locale:
Phase:
Review: emmons
Weeks:
Data Xpath:


When converting numbering systems data, the converter hangs if the digit string contains digits outside the BMP; it just sits and spins. I did a bunch of debugging on this but couldn't narrow the problem down beyond two observations: the digit string is much longer than you would expect (i.e. more than 10 characters), and all the code seems to be working on code units rather than code points (use UCharacterIterator...?).
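
To illustrate the code unit versus code point distinction mentioned above, here is a minimal Java sketch. The class and method names are hypothetical and are not taken from the converter; the only assumption about the data is that the "osma" digits are the ten Osmanya digits U+104A0 through U+104A9.

```java
// Hypothetical illustration; not code from LDML2ICUConverter.
final class OsmanyaDigitLength {

    /** Builds the ten Osmanya digits U+104A0..U+104A9 (all outside the BMP). */
    static String osmanyaDigits() {
        StringBuilder sb = new StringBuilder();
        for (int cp = 0x104A0; cp <= 0x104A9; cp++) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    /** Length in code points, which is what "10 digits" means to a reader. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String digits = osmanyaDigits();
        // String.length() counts UTF-16 code units, so each digit counts twice.
        System.out.println("code units:  " + digits.length());
        System.out.println("code points: " + codePointLength(digits));

        // Walk the string one code point at a time instead of one char at a time.
        for (int i = 0; i < digits.length(); ) {
            int cp = digits.codePointAt(i);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
            i += Character.charCount(cp);
        }
    }
}
```

Code that indexes or splits such a string by `char` can cut a surrogate pair in half, which is one reason to iterate by code point when the input may contain non-BMP digits.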

This started failing after Yoshito's r7607.

I'll let Jennifer chew on this one. In the meantime I will change numberingSystems.xml to make the "osma" digits provisional (which temporarily dodges the problem) until we have a solution.


Change History

comment:1 Changed 6 years ago by jchye

  • Status changed from new to accepted

comment:2 Changed 6 years ago by jchye

The problem isn't in the converter. The Java XMLParser correctly processes high/low surrogate characters if they're part of the element data, but if they're in an attribute value the parser returns rubbish. In this case, parsing digits="𐒠𐒡𐒢𐒣𐒤𐒥𐒦𐒧𐒨𐒩" results in a 510-character-long string being passed to the converter to line-break, resulting in an endless loop.
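
One way to check whether a particular JDK shows this behavior is to parse a one-element document and compare the attribute value with the input. The sketch below uses the JDK's default SAX parser through `javax.xml.parsers`; the class name, the method names, and the sample element are assumptions for illustration and are not the converter's own code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check; not code from the CLDR tools.
final class AttributeSurrogateCheck {

    /** Builds the ten Osmanya digits U+104A0..U+104A9. */
    static String osmanyaDigits() {
        StringBuilder sb = new StringBuilder();
        for (int cp = 0x104A0; cp <= 0x104A9; cp++) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    /** Parses the XML and returns the value of the first "digits" attribute seen. */
    static String parseDigitsAttribute(String xml) {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                String value = atts.getValue("digits");
                if (value != null && result[0] == null) {
                    result[0] = value;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String digits = osmanyaDigits();
        String xml = "<numberingSystem id=\"osma\" digits=\"" + digits + "\"/>";
        String parsed = parseDigitsAttribute(xml);
        System.out.println("input code units:  " + digits.length());
        System.out.println("parsed code units: " + parsed.length());
        System.out.println("round trip intact: " + digits.equals(parsed));
    }
}
```

On an affected parser the parsed value would be expected to come back longer than the input, as with the 510-character string described above; on an unaffected parser the two strings should be equal.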

This bug occurs in JDK version 1.6.0. I'm going to download the latest version to see if the bug occurs there as well, but there are two possible workarounds in the meantime:
1) Change the numberingSystems DTD to put the digits in the element data instead of in an attribute.
2) Unicode-escape all characters outside the BMP that appear in attributes.
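
As a rough sketch of what option 2 could look like, the helper below rewrites every supplementary (non-BMP) code point in a string as an XML numeric character reference such as `&#x104A0;`. Reading "Unicode-escape" as numeric character references is an assumption, and the class and method names are hypothetical; nothing here is taken from the CLDR tools.

```java
import java.util.Locale;

// Hypothetical helper; one possible reading of "Unicode-escape" for XML attributes.
final class NonBmpEscaper {

    /** Replaces each code point above U+FFFF with a hexadecimal character reference. */
    static String escapeNonBmp(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            } else {
                // BMP characters (and any unpaired surrogates) are copied unchanged.
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String firstTwoDigits = new StringBuilder()
                .appendCodePoint(0x104A0)
                .appendCodePoint(0x104A1)
                .toString();
        System.out.println(escapeNonBmp("digits=" + firstTwoDigits));
    }
}
```

The escaped file contains only BMP characters in its attribute values, so the raw surrogate pairs never reach the attribute scanner. Whether character references avoid the parser problem on the affected JDK is not confirmed in this ticket.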

comment:3 Changed 6 years ago by jchye

  • Milestone changed from UNSCH to 22.1

The Xerces parser doesn't have this problem, so if we add the Xerces library as a dependency to java/tools the bug will be fixed.

comment:4 Changed 6 years ago by jchye

  • Summary changed from "New LDML2ICUConverter: converting digits outside the BMP" to "Non-BMP characters in XML attributes aren't parsed correctly"

comment:5 Changed 6 years ago by jchye

  • Review set to emmons

comment:6 Changed 6 years ago by pedberg

  • Milestone changed from 22.1 to 22

comment:7 Changed 6 years ago by emmons

  • Status changed from accepted to closed
  • Resolution set to fixed

comment:8 Changed 6 years ago by pedberg

  • Priority changed from assess to critical
  • type changed from unknown to defect
  • Component changed from unknown to tools
