
CLDR Ticket #11251(accepted)

Opened 5 months ago

Last modified 8 weeks ago

Survey tool voting on Alphabetic info for Tajik doesn't work

Reported by: kristi
Owned by: tbishop
Component: survey-backend
Data Locale:
Phase: dvet
Review:
Weeks:
Data Xpath:
Xref: ticket:11244

Description

Try voting on the Approved data in Tajik: http://st.unicode.org/cldr-apps/v#/tg/Alphabetic_Information/7ed88347aa1b55ed

Once you add your vote, the Data changes. (see attached image)

Attachments

Tajik alphabet.JPG (60.9 KB) - added by kristi 5 months ago.

Change History

Changed 5 months ago by kristi

comment:1 Changed 4 months ago by mark

  • Status changed from new to closed
  • Resolution set to duplicate

comment:2 Changed 4 months ago by kristi

  • Status changed from closed to new

Maori

comment:3 Changed 4 months ago by kristi

  • Owner changed from anybody to kristi
  • Status changed from new to accepted
  • Resolution duplicate deleted

Add the Ref ticket and verify and vote on the languages we have gotten reports from.

  1. Maori
  2. Tajik

comment:4 Changed 4 months ago by kristi

  • Phase changed from dsub to dvet
  • Milestone changed from UNSCH to 34

comment:5 Changed 4 months ago by kristi

  • Xref set to 11244

comment:6 Changed 4 months ago by kristi

  • Owner changed from kristi to tbishop

Both tickets were resolved as dupe.
Adding in the Xref of the dupe and assigning this to Tom for investigation.

comment:7 Changed 8 weeks ago by tbishop

  • type changed from unknown to surveytool
  • Component changed from unknown to survey
  • Milestone changed from 34 to 35

comment:8 Changed 8 weeks ago by tbishop

I can reproduce this as follows: click the Add button, then paste in "а б в г ғ д е ё ж з и ӣ й к қ л м н о п р с т у ӯ ф х ҳ ч ҷ ш ъ э ю" -- which is the same as the old value, except that the last letter я is missing. The Winning column then contains an item for "а-хчшъэюёғқҳҷӣӯ", which I didn't expect. I hadn't read this documentation yet:

http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters

It says: Any range of characters, such as “a b c d e” can be represented compactly as “a-e”.

Maybe at least part of what happens is correct, as an automatic conversion to the compact representation. In fact, entering "a b c d e" does result in "a-e".
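To make the compaction rule concrete, here is a minimal sketch in plain Java. It is not the ICU or CLDR code: the class name RangeCompactor and its method are hypothetical, and it only handles whitespace-separated single characters. It assumes, based on the "а-хчшъэюёғқҳҷӣӯ" result above, that a hyphen is used only for runs of three or more consecutive code points and that characters are ordered by code point.

```java
import java.util.TreeSet;

// Hypothetical illustration only; not the ICU UnicodeSet implementation.
final class RangeCompactor {
    /** Compacts whitespace-separated single characters, e.g. "a b c d e" becomes "a-e". */
    static String compact(String spaced) {
        // Collect code points in sorted order, dropping whitespace and duplicates.
        TreeSet<Integer> points = new TreeSet<>();
        spaced.codePoints()
              .filter(cp -> !Character.isWhitespace(cp))
              .forEach(points::add);

        StringBuilder out = new StringBuilder();
        Integer start = null;
        int end = -1;
        for (int cp : points) {
            if (start != null && cp == end + 1) {
                end = cp; // extend the current run of consecutive code points
                continue;
            }
            if (start != null) {
                appendRun(out, start, end);
            }
            start = cp;
            end = cp;
        }
        if (start != null) {
            appendRun(out, start, end);
        }
        return out.toString();
    }

    private static void appendRun(StringBuilder out, int start, int end) {
        out.appendCodePoint(start);
        if (end > start) {
            if (end > start + 1) {
                out.append('-'); // hyphen only for runs of three or more
            }
            out.appendCodePoint(end);
        }
    }
}
```

Under these assumptions, compact("a b c d e") gives "a-e", and the pasted Tajik string should come out as the value seen in the Winning column, because ц, щ, ы and ь are absent and break the basic Cyrillic block into short runs.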

I don't see any evidence that the conversion itself is buggy, other than it being a surprising feature that's only half documented. The documentation doesn't mention automatic conversion to the compact representation. Also, the English examples don't use compact representation; if they did, the conversion might be less surprising.

Furthermore, the winning values from the previous release are NOT displayed in compact representation. If you copy one of those values and paste it in, it gets converted to compact form and is then treated as a distinct item. I'd say THAT is a bug. Whatever automatic conversion gets applied, to or from compact representation, it should be done consistently for all values including examples, previous release, inherited, etc.
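One way to picture the consistency requirement is to compare candidates as sets of characters rather than as strings. The following is a rough sketch under stated assumptions: ExemplarNormalizer is a hypothetical name, and the parser is deliberately simplified (single characters and "x-y" ranges only, no brackets, escapes, or multi-character strings), so it is not a substitute for the real UnicodeSet parser.

```java
import java.util.TreeSet;

// Hypothetical illustration only; the real parsing is done by ICU's UnicodeSet.
final class ExemplarNormalizer {
    /** Expands "a-c", "a b c", or a mix of both into a sorted set of code points. */
    static TreeSet<Integer> toCodePoints(String value) {
        int[] cps = value.codePoints()
                         .filter(cp -> !Character.isWhitespace(cp))
                         .toArray();
        TreeSet<Integer> set = new TreeSet<>();
        for (int i = 0; i < cps.length; i++) {
            // A hyphen between two characters denotes an inclusive range.
            boolean isRange = i + 2 < cps.length && cps[i + 1] == '-';
            if (isRange) {
                for (int cp = cps[i]; cp <= cps[i + 2]; cp++) {
                    set.add(cp);
                }
                i += 2; // skip the hyphen and the range end
            } else {
                set.add(cps[i]);
            }
        }
        return set;
    }

    /** Two candidate values are the same item if they denote the same set. */
    static boolean sameSet(String a, String b) {
        return toCodePoints(a).equals(toCodePoints(b));
    }
}
```

With a comparison like this, sameSet("a b c d e", "a-e") is true, so a pasted previous-release value and its compact form would count as one item instead of two.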

Last edited 8 weeks ago by tbishop

comment:9 Changed 8 weeks ago by tbishop

The conversion to compact representation is done by inputUnicodeSet and getCleanedUnicodeSet in DisplayAndInputProcessor.java. Where is the reverse conversion done?

In cldr/common/main/tg.xml we have:

<exemplarCharacters>[а б в г ғ д е ё ж з и ӣ й к қ л м н о п р с т у ӯ ф х ҳ ч ҷ ш ъ э ю я]</exemplarCharacters>

Compact representation isn't used in the xml.

This setting for UnicodeSetPrettyPrinter determines whether to get abcde (false) or a-e (true):

    /**
     * @param compressRanges if you want abcde instead of a-e, make this false
     * @return
     */
    public UnicodeSetPrettyPrinter setCompressRanges(boolean compressRanges) {
        this.compressRanges = compressRanges;
        return this;
    }

compressRanges is true by default, and the function is always called with true, with one exception:

    public static String getCleanedUnicodeSet(UnicodeSet exemplar, UnicodeSetPrettyPrinter prettyPrinter,
        ExemplarType exemplarType) {
        if (prettyPrinter == null) {
            return exemplar.toPattern(false);
        }
        String value;
        prettyPrinter.setCompressRanges(exemplar.size() > 300);
   ...

For locales with fewer than 300 "main letters", compressRanges will be set to false there if prettyPrinter isn't null. When I enter a new value like "а б в г ғ д е ё ж", however, inputUnicodeSet has this.pp == null when it calls getCleanedUnicodeSet, so the early return is taken and the value comes back in compact form.
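The two paths can be modelled in a few lines. This is a simplified stand-in, not the CLDR code: TwoPathsDemo, Printer, compactForm and spacedForm are hypothetical names, the "set" is reduced to one run of consecutive characters, and the assumption that the non-null path yields a space-separated form is taken from the tg.xml value quoted above.

```java
// Hypothetical model of the branch in getCleanedUnicodeSet; not the CLDR source.
final class TwoPathsDemo {
    /** Stand-in for UnicodeSetPrettyPrinter; only the flag matters here. */
    static final class Printer {
        boolean compressRanges = true; // true by default, as described above
    }

    /** Formats the consecutive characters first..last (at least three of them). */
    static String format(char first, char last, Printer printer) {
        int size = last - first + 1;
        if (printer == null) {
            // Path taken for newly entered values: always compact.
            return compactForm(first, last);
        }
        // Mirrors prettyPrinter.setCompressRanges(exemplar.size() > 300).
        printer.compressRanges = size > 300;
        return printer.compressRanges ? compactForm(first, last) : spacedForm(first, last);
    }

    static String compactForm(char first, char last) {
        return first + "-" + last;
    }

    static String spacedForm(char first, char last) {
        StringBuilder out = new StringBuilder();
        for (char c = first; c <= last; c++) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

In this model the same five letters come out as "a-e" when the printer is null and as "a b c d e" when it is not, which is the kind of mismatch that makes one set show up as two candidate items.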

The solution must be to enforce a consistent representation for all candidate items. I'll consult with cldr-dev on whether that representation should be compact or something else, or whether it should depend on (exemplar.size() > 300).

Last edited 8 weeks ago by tbishop

comment:10 Changed 8 weeks ago by mark

> The conversion to compact representation is done by inputUnicodeSet and getCleanedUnicodeSet in DisplayAndInputProcessor.java. Where is the reverse conversion done?

The pretty printer will always generate a format that is readable by UnicodeSet. E.g., [a-c] and [a b c] are both read by the standard parser because they conform to the input specification. So there doesn't need to be an explicit reverse conversion.

There is one complication for the pretty printing. Formatting nicely for the user requires spinning up a collator with the CLDR rules for that language, and then using that collator to sort the set. That is why there is a limit. We can certainly set the limit higher, but we don't want to include the really big sets like the one for Japanese.

It also means that we want to have some kind of limited cache, otherwise we end up spinning up multiple copies of collators (there might be code to do this already).
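A limited cache of this kind could look roughly like the sketch below. Everything in it is an assumption for illustration: BoundedCache and forCollators are hypothetical names, the eviction policy (least recently used) is just one reasonable choice, and java.text.Collator stands in for the ICU collator built from CLDR rules that the Survey Tool would actually need.

```java
import java.text.Collator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; CLDR may already have its own cache.
final class BoundedCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> factory;

    BoundedCache(int capacity, Function<K, V> factory) {
        this.factory = factory;
        // accessOrder = true keeps the least recently used entry first,
        // and removeEldestEntry drops it once the capacity is exceeded.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value for key, building it once on a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = factory.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    synchronized int entryCount() {
        return entries.size();
    }

    /** Stand-in usage: one JDK collator per locale, at most capacity kept alive. */
    static BoundedCache<Locale, Collator> forCollators(int capacity) {
        return new BoundedCache<>(capacity, Collator::getInstance);
    }
}
```

A caller would then ask for `BoundedCache.forCollators(8).get(locale)` style lookups on a shared instance, so repeated formatting for the same locale reuses one collator while rarely used locales get evicted.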
