CLDR Ticket #6765 (accepted data)
Fix Arabic Index Exemplars, add tests
Reported by: | mark | Owned by: | markus |
---|---|---|---|
Component: | collation | Data Locale: | |
Phase: | rc | Review: | |
Weeks: | | Data Xpath: | |
Xref: | | | |
Description
I wrote the following sample code for someone. The goal is to get the initial letter that someone's name would be sorted under. The code uses AlphabeticIndex, which depends on the Index Exemplars.
It reveals a few things. I'm using the country names as samples. In some languages the country name goes into an under/overflow bucket; that should never happen!
The two cases that show up are:
Arabic
آ آروبا
Japanese
英 英領インド洋地域
Japanese is a special case, since the Alphabetic index assumes yomi. Perhaps we should supply a better backup? But Arabic is definitely a failure.
In any event, we should write a test that checks that every character of the locale's script lands in one of the alphabetic buckets (i.e., sorts at or after some index exemplar) rather than in the underflow or overflow bucket.
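A rough sketch of what such a check could look like is below. It is deliberately self-contained and does not call ICU: `labels` and `letters` stand in for the locale's index exemplars and the letters of its script, and `String` natural order (UTF-16 code unit order) stands in for the locale's collator. The class and method names are hypothetical, and a real test would take its data from CLDR and compare with the locale's Collator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the proposed check; not ICU or CLDR test code. */
final class IndexCoverageCheck {
    /**
     * Returns the letters that sort before the first index label, i.e. the
     * ones that would land in the underflow bucket.
     * Assumption: labels are non-empty and already sorted by the same comparator.
     */
    static List<String> underflowLetters(List<String> labels, List<String> letters,
                                         Comparator<String> order) {
        List<String> failures = new ArrayList<>();
        String first = labels.get(0);
        for (String letter : letters) {
            if (order.compare(letter, first) < 0) {
                failures.add(letter);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Assumption: code unit order stands in for the real collation order.
        Comparator<String> order = Comparator.naturalOrder();
        List<String> labels = Arrays.asList("\u0627", "\u0628", "\u062A");  // alef, beh, teh
        List<String> letters = Arrays.asList("\u0621", "\u0627", "\u0628"); // hamza, alef, beh
        List<String> failures = underflowLetters(labels, letters, order);
        System.out.println(failures.size() + " letter(s) without a bucket");
    }
}
```

With this toy data the hamza is reported, because it sorts before the first label.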
    import java.util.Set;
    import java.util.TreeSet;

    import com.ibm.icu.text.AlphabeticIndex;
    import com.ibm.icu.text.AlphabeticIndex.Bucket;
    import com.ibm.icu.text.AlphabeticIndex.Bucket.LabelType;
    import com.ibm.icu.text.AlphabeticIndex.ImmutableIndex;
    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.text.Collator;
    import com.ibm.icu.util.Output;
    import com.ibm.icu.util.ULocale;

    public class InitialTest {
        public static void main(String[] args) {
            Output<LabelType> labelType = new Output<LabelType>();
            for (ULocale locale : ULocale.getAvailableLocales()) {
                if (!locale.getCountry().isEmpty()) {
                    continue; // for a quick check, just do language locales
                }
                InitialMaker maker = new InitialMaker(locale);

                // get samples, in sorted order
                Set<String> samples = new TreeSet<String>(Collator.getInstance(locale));
                for (String countryCode : ULocale.getISOCountries()) {
                    String displayName = ULocale.getDisplayCountry("und-" + countryCode, locale);
                    if (displayName.equals(countryCode)) {
                        continue; // skip fallback code
                    }
                    samples.add(displayName);
                }

                // add some odd cases
                samples.add("U\u0308berall"); // decomposed u-umlaut

                // now display: "N" marks a normal bucket, "-" an under/overflow bucket
                System.out.printf("\n%s\t%s\n",
                        locale.getDisplayName(ULocale.ENGLISH), locale.getDisplayName(locale));
                for (String displayName : samples) {
                    String initial = maker.getInitial(displayName, labelType);
                    System.out.printf("%s\t%s\t%s\n",
                            labelType.value == LabelType.NORMAL ? "N" : "-", initial, displayName);
                }
            }
        }

        public static class InitialMaker {
            private ImmutableIndex<String> immutableIndex;
            private BreakIterator breakIterator; // mutable, so synchronize

            public InitialMaker(ULocale locale) {
                immutableIndex = new AlphabeticIndex<String>(locale).buildImmutableIndex();
                breakIterator = BreakIterator.getCharacterInstance(locale);
            }

            public String getInitial(String item, Output<LabelType> labelType) {
                int index = immutableIndex.getBucketIndex(item);
                Bucket<String> bucket = immutableIndex.getBucket(index);
                labelType.value = bucket.getLabelType();
                if (labelType.value == LabelType.NORMAL) {
                    return bucket.getLabel();
                } else {
                    // fall back to the first grapheme cluster
                    synchronized (breakIterator) {
                        breakIterator.setText(item);
                        int end = breakIterator.next();
                        return item.substring(0, end);
                    }
                }
            }
        }
    }
Change History
comment:1 Changed 4 years ago by emmons
- Status changed from new to assigned
- Component changed from unknown to data
- Priority changed from assess to medium
- Milestone changed from UNSCH to 25rc
- Owner changed from anybody to markus
- Type changed from unknown to defect
comment:2 Changed 4 years ago by markus
- Cc roozbeh, mark, srl added
- Keywords collation added
Arabic: I see that U+0627 Alef is the first index character, but in the root collator many Arabic letters (starting with Hamza) sort before Alef, and the tailoring does not change that.
If the sort order is right, then we either need to add U+0621 Hamza as an index character, or we need to find a way to say that the first Arabic bucket has the Hamza as the lower boundary but Alef (or something) as the label string.
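To illustrate the second option (a bucket whose lower boundary differs from its label), here is a minimal model. It is not how AlphabeticIndex is implemented; `BoundaryIndex` and its methods are hypothetical, and `String` natural order is assumed in place of the real collator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical model of an index whose buckets keep label and lower boundary apart. */
final class BoundaryIndex {
    private final List<String> boundaries = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();
    private final Comparator<String> order;

    BoundaryIndex(Comparator<String> order) {
        this.order = order;
    }

    /** Adds a bucket; buckets must be added in ascending boundary order. */
    BoundaryIndex add(String lowerBoundary, String label) {
        boundaries.add(lowerBoundary);
        labels.add(label);
        return this;
    }

    /** Returns the label of the last bucket whose boundary is <= item, or null for underflow. */
    String labelFor(String item) {
        String result = null;
        for (int i = 0; i < boundaries.size(); i++) {
            if (order.compare(boundaries.get(i), item) > 0) {
                break;
            }
            result = labels.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: code unit order stands in for the real collation order.
        BoundaryIndex index = new BoundaryIndex(Comparator.<String>naturalOrder())
                .add("\u0621", "\u0627")   // boundary hamza, label alef
                .add("\u0628", "\u0628");  // boundary beh, label beh
        System.out.println(index.labelFor("\u0622\u0631\u0648\u0628\u0627"));
    }
}
```

In this model a name starting with U+0622 is filed under the alef label even though it sorts before alef itself.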
comment:3 Changed 4 years ago by markus
Note: The FractionalUCA.txt file for UCA 6.3 includes mappings for the first primaries for each script. Once we have an ICU Collator implementation that supports that ("collv2"), we could modify AlphabeticIndex to detect that a string sorts between the script start and the script's first index character. We might then choose to move the string into the first index bucket. It would have the same outcome as the second option I mentioned in comment:2, but without needing any CLDR change.
comment:4 Changed 4 years ago by roozbeh
Fallback issues aside, I think it's a good idea to add a new bucket for Hamza (U+0621) for Arabic. Many, many words do start with a hamza form.
comment:5 Changed 4 years ago by markus
Ok, so the plan is:
- We will add U+0621 Hamza as the first Arabic (main/ar.xml) index character.
- IcuBug:9858 "AlphabeticIndex: adjust for first-in-script mappings" will implement what I said in comment:3. That is, any character that sorts before the first index character will go into that first bucket for the script anyway.
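The second item can be pictured with the toy lookup below. It is only a sketch under assumptions: `FirstInScript` is a hypothetical name, `scriptStart` stands in for the script's first primary from FractionalUCA.txt, and `String` natural order replaces the real collator. It says nothing about how IcuBug:9858 is actually implemented.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the first-in-script adjustment; not ICU code. */
final class FirstInScript {
    /**
     * Returns the index of the bucket for item, or -1 for underflow.
     * An item that sorts between scriptStart and the first label is
     * moved into the first bucket instead of underflow.
     * Assumption: labels are sorted by the same comparator.
     */
    static int bucketIndex(List<String> labels, String scriptStart, String item,
                           Comparator<String> order) {
        int index = -1;
        for (int i = 0; i < labels.size(); i++) {
            if (order.compare(labels.get(i), item) > 0) {
                break;
            }
            index = i;
        }
        if (index < 0 && order.compare(scriptStart, item) <= 0) {
            index = 0; // sorts inside the script but before its first label
        }
        return index;
    }

    public static void main(String[] args) {
        Comparator<String> order = Comparator.naturalOrder();
        List<String> labels = Arrays.asList("\u0627", "\u0628"); // alef, beh
        // Assumption: U+0600 is used here as a stand-in for the start of the Arabic script.
        System.out.println(bucketIndex(labels, "\u0600", "\u0622\u0631\u0648\u0628\u0627", order));
        System.out.println(bucketIndex(labels, "\u0600", "Zambia", order));
    }
}
```

The outcome matches the boundary-versus-label idea from comment:2, but the index data itself stays unchanged.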
comment:9 Changed 3 years ago by markus
- Milestone changed from 27 to 28
Re "the plan" item 1 in comment:5 -- CLDR 26 made hamza and alef primary-equal for Arabic, which means that we can only have one or the other in AlphabeticIndex. I think this means that if I were to add hamza (U+0621), it would replace the alef in an actual index, which is probably not what we want.
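The effect of primary-equal labels can be shown with a toy model. Everything here is an assumption made for illustration: `primaryKey` is a hypothetical stand-in for a primary-strength collation key that maps hamza and alef to the same value, and which of the two labels survives in this sketch (the first one added) is an artifact of `TreeSet`, not a statement about AlphabeticIndex.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical sketch: labels that are primary-equal collapse into one bucket. */
final class PrimaryEqualLabels {
    /** Toy primary key. Assumption: hamza (U+0621) and alef (U+0627) share one primary weight. */
    static String primaryKey(String s) {
        return s.replace('\u0621', '\u0627');
    }

    /** Keeps one label per primary key; a later primary-equal candidate is ignored. */
    static TreeSet<String> distinctLabels(Iterable<String> candidates) {
        Comparator<String> byPrimary = Comparator.comparing(PrimaryEqualLabels::primaryKey);
        TreeSet<String> labels = new TreeSet<String>(byPrimary);
        for (String candidate : candidates) {
            labels.add(candidate);
        }
        return labels;
    }

    public static void main(String[] args) {
        // hamza, alef, beh as candidate index characters
        TreeSet<String> labels = distinctLabels(Arrays.asList("\u0621", "\u0627", "\u0628"));
        System.out.println(labels.size() + " labels survive");
    }
}
```

Three candidates go in, but only two labels come out, since hamza and alef cannot both act as bucket boundaries.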
Pushing to 28 to reconsider, but I think we should close this ticket.
comment:10 Changed 3 years ago by markus
- Type changed from defect to data
- Component changed from data-main to collation
Moving to component=collation because this is really collation data, although it lives in the "main" data tree. We are cleaning up Trac so that we do not need queries any more like (component=data-collation|uca OR keywords.contains(collation)).
comment:13 Changed 3 years ago by emmons
- Milestone changed from 29 to upcoming
Auto move of all 29 -> upcoming