
CLDR Ticket #6765 (accepted data)

Opened 4 years ago

Last modified 19 months ago

Fix Arabic Index Exemplars, add tests

Reported by: mark
Owned by: markus
Component: collation
Phase: rc

Description

I wrote the following sample code for someone. The goal is to get the initial letter that someone's name would be sorted under. The code uses AlphabeticIndex, which depends on the Index Exemplars.

Running it with country names as samples reveals a problem: in some languages a country name lands in an underflow or overflow bucket, and that should never happen!

The two cases that show up are:

Arabic
آ آروبا

Japanese
英 英領インド洋地域

Japanese is a special case, since AlphabeticIndex assumes yomi (readings) rather than kanji. Perhaps we should supply a better fallback? But Arabic is definitely a failure.

In any event, we should write a test that checks that every character of the locale's script is in one of the alphabetic buckets (i.e., is covered by an index exemplar).

import java.util.Set;
import java.util.TreeSet;

import com.ibm.icu.text.AlphabeticIndex;
import com.ibm.icu.text.AlphabeticIndex.Bucket;
import com.ibm.icu.text.AlphabeticIndex.Bucket.LabelType;
import com.ibm.icu.text.AlphabeticIndex.ImmutableIndex;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.Collator;
import com.ibm.icu.util.Output;
import com.ibm.icu.util.ULocale;


public class InitialTest {
  public static void main(String[] args) {
    Output<LabelType> labelType = new Output<LabelType>();
    for (ULocale locale : ULocale.getAvailableLocales()) {
      if (!locale.getCountry().isEmpty()) {
        continue; // for a quick check, just do language locales
      }
      InitialMaker maker = new InitialMaker(locale);
      
      // get samples, in sorted order
      Set<String> samples = new TreeSet<String>(Collator.getInstance(locale));
      for (String countryCode : ULocale.getISOCountries()) {
        String displayName = ULocale.getDisplayCountry("und-" + countryCode, locale);
        if (displayName.equals(countryCode)) {
          continue; // skip fallback code
        }
        samples.add(displayName);
      }
      // add some odd cases
      samples.add("U\u0308berall"); // decomposed u-umlaut
      
      // now display
      System.out.printf("\n%s\t%s\n", locale.getDisplayName(ULocale.ENGLISH), locale.getDisplayName(locale));
      for (String displayName : samples) {
        String initial = maker.getInitial(displayName, labelType);
        System.out.printf("%s\t%s\t%s\n", labelType.value.equals(LabelType.NORMAL) ? "N" : "-", initial, displayName);
      }
    }
  }
  
  public static class InitialMaker {
    private ImmutableIndex<String> immutableIndex;
    private BreakIterator breakIterator; // mutable, so synchronize
    
    public InitialMaker(ULocale locale) {
      immutableIndex = new AlphabeticIndex<String>(locale).buildImmutableIndex();
      breakIterator = BreakIterator.getCharacterInstance(locale);
    }
    
    public String getInitial(String item, Output<LabelType> labelType) {
      int index = immutableIndex.getBucketIndex(item);
      Bucket<String> bucket = immutableIndex.getBucket(index);
      labelType.value = bucket.getLabelType();
      if (labelType.value == LabelType.NORMAL) {
        return bucket.getLabel();
      } else {
        // fall back to the first grapheme cluster
        synchronized (breakIterator) {
          breakIterator.setText(item);
          int end = breakIterator.next();
          return item.substring(0,end);
        }
      }
    }
  }
}
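
The coverage test suggested in the description could look roughly like the sketch below. It uses only the JDK: it walks every letter of a script via Character.UnicodeScript and reports the ones that a caller-supplied predicate does not place in a normal bucket. The class name ScriptCoverageCheck, the method findUnbucketed, and the code-point-based predicate in main are assumptions made for illustration; a real test would presumably wrap AlphabeticIndex (as InitialMaker does above) and check for LabelType.NORMAL instead.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ScriptCoverageCheck {
  /**
   * Returns the code points that are letters of the given script and that
   * the predicate rejects. The predicate is a stand-in for "this string
   * lands in a NORMAL bucket of the index under test".
   */
  public static List<Integer> findUnbucketed(UnicodeScript script,
                                             Predicate<String> inNormalBucket) {
    List<Integer> missing = new ArrayList<>();
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      if (!Character.isLetter(cp) || UnicodeScript.of(cp) != script) {
        continue; // not a letter of the script we are checking
      }
      String s = new String(Character.toChars(cp));
      if (!inNormalBucket.test(s)) {
        missing.add(cp);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Hypothetical index whose first bucket starts at U+0627 ALEF.
    // Code point order is used here only as a stand-in for collation order.
    Predicate<String> firstBucketAtAlef = s -> s.codePointAt(0) >= 0x0627;
    for (int cp : findUnbucketed(UnicodeScript.ARABIC, firstBucketAtAlef)) {
      System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
    }
  }
}
```

With the stand-in predicate, the reported list includes U+0621 HAMZA, which mirrors the kind of gap this ticket is about; the actual set of failures depends on the real index and collator.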

Change History

comment:1 Changed 3 years ago by emmons

  • Status changed from new to assigned
  • Component changed from unknown to data
  • Priority changed from assess to medium
  • Milestone changed from UNSCH to 25rc
  • Owner changed from anybody to markus
  • Type changed from unknown to defect

comment:2 Changed 3 years ago by markus

  • Cc roozbeh, mark, srl added
  • Keywords collation added

Arabic: I see that U+0627 Alef is the first index character, but in the root collator many Arabic letters (starting with Hamza) sort before Alef, and the tailoring does not change that.

If the sort order is right, then we either need to add U+0621 Hamza as an index character, or we need to find a way to say that the first Arabic bucket has the Hamza as the lower boundary but Alef (or something) as the label string.
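
To make the second option concrete, here is a minimal sketch of an index whose buckets carry a lower boundary that is separate from the label shown to the user. It is not how AlphabeticIndex is implemented; the class BoundaryLabelIndex and its methods are hypothetical, and plain code point order stands in for the root collator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BoundaryLabelIndex {
  /** A bucket holds strings >= lowerBoundary, displayed under label. */
  private static final class Bucket {
    final String lowerBoundary;
    final String label;

    Bucket(String lowerBoundary, String label) {
      this.lowerBoundary = lowerBoundary;
      this.label = label;
    }
  }

  private final List<Bucket> buckets = new ArrayList<>();
  private final Comparator<String> order;

  public BoundaryLabelIndex(Comparator<String> order) {
    this.order = order;
  }

  /** Buckets must be added in ascending boundary order. */
  public BoundaryLabelIndex add(String lowerBoundary, String label) {
    buckets.add(new Bucket(lowerBoundary, label));
    return this;
  }

  /** Returns the label of the last bucket whose boundary is <= item, or null for underflow. */
  public String labelFor(String item) {
    String label = null;
    for (Bucket b : buckets) {
      if (order.compare(item, b.lowerBoundary) < 0) {
        break;
      }
      label = b.label;
    }
    return label;
  }

  public static void main(String[] args) {
    String hamza = "\u0621", alef = "\u0627", beh = "\u0628";
    // Assumption: code point order as a stand-in for the real collation order.
    Comparator<String> order = Comparator.naturalOrder();

    // Boundary equals label: a hamza-initial string falls into underflow.
    BoundaryLabelIndex labelIsBoundary =
        new BoundaryLabelIndex(order).add(alef, alef).add(beh, beh);
    // Boundary is hamza, label is still alef: the same string gets a label.
    BoundaryLabelIndex hamzaBoundary =
        new BoundaryLabelIndex(order).add(hamza, alef).add(beh, beh);

    String sample = hamza + beh;
    System.out.println(labelIsBoundary.labelFor(sample)); // null (underflow)
    System.out.println(alef.equals(hamzaBoundary.labelFor(sample))); // true
  }
}
```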

comment:3 Changed 3 years ago by markus

Note: The FractionalUCA.txt file for UCA 6.3 includes mappings for the first primaries for each script. Once we have an ICU Collator implementation that supports that ("collv2"), we could modify AlphabeticIndex to detect that a string sorts between the script start and the script's first index character. We might then choose to move the string into the first index bucket. It would have the same outcome as the second option I mentioned in comment:2, but without needing any CLDR change.

comment:4 Changed 3 years ago by roozbeh

Fallback issues aside, I think it's a good idea to add a new bucket for Hamza (U+0621) for Arabic. Many, many words start with a hamza form.

comment:5 Changed 3 years ago by markus

Ok, so the plan is:

  1. We will add U+0621 Hamza as the first Arabic (main/ar.xml) index character.
  2. IcuBug:9858 "AlphabeticIndex: adjust for first-in-script mappings" will implement what I said in comment:3. That is, any character that sorts before the first index character will go into that first bucket for the script anyway.

comment:6 Changed 3 years ago by markus

  • Milestone changed from 25rc to 26rc

comment:7 Changed 3 years ago by markus

  • Milestone changed from 26rc to 27rc

comment:8 Changed 3 years ago by markus

  • Phase set to rc
  • Milestone changed from 27rc to 27

comment:9 Changed 2 years ago by markus

  • Milestone changed from 27 to 28

Re "the plan" item 1 in comment:5 -- CLDR 26 made hamza and alef primary-equal for Arabic, which means that we can only have one or the other in AlphabeticIndex. I think this means that if I were to add hamza (U+0621), it would replace the alef in an actual index, which is probably not what we want.

Pushing to 28 to reconsider, but I think we should close this ticket.
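
The constraint described in this comment (two primary-equal characters cannot both be index labels) can be illustrated with a small sketch. It uses the JDK's java.text.Collator rather than ICU, and "a"/"A" stand in for the hamza/alef pair because the JDK's collation data is not the CLDR 26 Arabic tailoring. The class PrimaryEqualLabels and the keep-the-first-in-sort-order rule are assumptions for illustration, not a description of what AlphabeticIndex actually does.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PrimaryEqualLabels {
  /**
   * Sorts the candidate labels and keeps only one label per group of
   * primary-equal strings (here: the first one in full sort order).
   */
  public static List<String> dedupe(List<String> candidates, Locale locale) {
    Collator full = Collator.getInstance(locale);
    full.setStrength(Collator.TERTIARY);
    Collator primaryOnly = Collator.getInstance(locale);
    primaryOnly.setStrength(Collator.PRIMARY);

    List<String> sorted = new ArrayList<>(candidates);
    sorted.sort(full);

    List<String> kept = new ArrayList<>();
    for (String label : sorted) {
      boolean sameAsPrevious = !kept.isEmpty()
          && primaryOnly.compare(label, kept.get(kept.size() - 1)) == 0;
      if (!sameAsPrevious) {
        kept.add(label); // first label of a new primary group
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // "a" and "A" differ only at the tertiary level, so only one survives.
    System.out.println(dedupe(Arrays.asList("b", "A", "a", "c"), Locale.ENGLISH));
  }
}
```

Under such a rule, adding a second label that is primary-equal to an existing one does not create a new bucket; it can only change which of the two is displayed.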

Last edited 2 years ago by markus

comment:10 Changed 2 years ago by markus

  • Type changed from defect to data
  • Component changed from data-main to collation

Moving to component=collation because this is really collation data, although it lives in the "main" data tree. We are cleaning up Trac so that we no longer need queries like (component=data-collation|uca OR keywords.contains(collation)).

comment:11 Changed 2 years ago by srl

  • Status changed from assigned to accepted

comment:12 Changed 21 months ago by markus

  • Milestone changed from 28 to 29

comment:13 Changed 19 months ago by emmons

  • Milestone changed from 29 to upcoming

Auto move of all 29 -> upcoming
