Common Locale Data Repository: Bug Tracking

CLDR Ticket #11112 (accepted data)

Opened 2 months ago

Last modified 4 weeks ago

Strings starting with the letter alif (ا) should not be indexed under hamza (ء) in locales ur and ar

Reported by: vichang@…
Owned by: markus
Component: collation
Data Locale: ur, ar
Phase: dsub
Review:
Weeks:
Data Xpath:
Xref:

Description

Hamza (ء) is used in Arabic and Urdu, but a string that starts with the letter alif (ا) should not be indexed under hamza (ء). It should be indexed under alif (ا).

This may be an ICU bug, but it sounds more like a locale data issue than an ICU issue, so I am reporting it here.

Disclaimer: I am not a native speaker of Arabic or Urdu, but alif (ا) is apparently commonly used in Arabic.

The Arabic collator in ICU puts alif (ا) and hamza (ء) into the same bucket, but the Urdu collator in ICU does not. If hamza (ء) should be in a different index bucket, this could be a collation bug in Arabic. Here is code to reproduce the issue.

===================================
import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

public class HamzaAlifRepro {
    public static void main(String[] args) {
        // PRIMARY is the strength level used for AlphabeticIndex.
        Collator collator = Collator.getInstance(new ULocale("ar"));
        collator.setStrength(Collator.PRIMARY);
        // Reported output: 0 (hamza and alif share an AlphabeticIndex bucket).
        System.out.println(collator.compare("\u0621", "\u0627"));

        collator = Collator.getInstance(new ULocale("ur"));
        collator.setStrength(Collator.PRIMARY);
        // Reported output: 1 (hamza and alif fall in different buckets).
        System.out.println(collator.compare("\u0621", "\u0627"));
    }
}
===================================
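
The check in this report boils down to one question: do two strings tie at primary strength in a given locale? The ticket treats such a tie as meaning both strings land in the same AlphabeticIndex bucket. A minimal sketch of that check follows. It uses the JDK's java.text.Collator as a stand-in so that it runs without ICU on the classpath, so its results for ar and ur may differ from ICU's. The class and method names are hypothetical and not part of ICU or CLDR.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper for illustration; not part of ICU or CLDR.
final class PrimaryBucketCheck {
    /** Returns true when a and b have no primary-level difference in the given locale. */
    static boolean samePrimary(Locale locale, String a, String b) {
        // Assumption: JDK collator used as a stand-in for ICU's collator.
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String hamza = "\u0621";
        String alif = "\u0627";
        for (String tag : new String[] {"ar", "ur"}) {
            Locale locale = Locale.forLanguageTag(tag);
            System.out.println(tag + ": " + samePrimary(locale, hamza, alif));
        }
    }
}
```

To run the same check against ICU, change the import to com.ibm.icu.text.Collator; the getInstance(Locale), setStrength and compare calls used here exist there as well.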

GoogleIssue:31034811

Change History

comment:1 Changed 4 weeks ago by mark

  • Keywords google added

comment:2 Changed 4 weeks ago by mark

  • Cc pedberg added
  • Owner changed from anybody to markus
  • Priority changed from assess to critical
  • Status changed from new to accepted
  • Milestone changed from UNSCH to 34

comment:3 Changed 4 weeks ago by pedberg

  • Cc mark added

I looked at Apple overrides. We don't currently have any specific to the Arabic collation. We do have some overrides in the root collations, most of them specific to the search collator, but none specific to Arabic. For the root search collator the changes are:

// instead of the following
                "&[last primary ignorable]<<׳"
                "<<״"
                "<<ـ"
                "<<ฺ"
// we do this (further makes ignorable some Thai/Lao vowels + emoji modifiers):
                "&[last primary ignorable ]<<׳<<״<<ـ"
                "<<ะ<<ั<<า<<ำ<<ิ<<ี<<ึ<<ื<<ุ<<ู<<ฺ<<ๅ"
                "<<ະ<<ັ<<າ<<ຳ<<ິ<<ີ<<ຶ<<ື<<ຸ<<ູ<<ົ<<ຼ<<ຽ"
                "<<\U0001F3FB<<\U0001F3FC<<\U0001F3FD<<\U0001F3FE<<\U0001F3FF"
// and we also add the following for Japanese (later in the search collator data):
                "&う < ゔ <<< ヴ <<< ヴ"
                "&か < が <<< ガ <<< ガ"
                "&き < ぎ <<< ギ <<< ギ"
                "&く < ぐ <<< グ <<< グ"
                "&け < げ <<< ゲ <<< ゲ"
                "&こ < ご <<< ゴ <<< ゴ"
                "&さ < ざ <<< ザ <<< ザ"
                "&し < じ <<< ジ <<< ジ"
                "&す < ず <<< ズ <<< ズ"
                "&せ < ぜ <<< ゼ <<< ゼ"
                "&そ < ぞ <<< ゾ <<< ゾ"
                "&た < だ <<< ダ <<< ダ"
                "&ち < ぢ <<< ヂ <<< ヂ"
                "&つ < づ <<< ヅ <<< ヅ"
                "&て < で <<< デ <<< デ"
                "&と < ど <<< ド <<< ド"
                "&は < ば <<< バ <<< バ < ぱ <<< パ <<< パ"
                "&ひ < び <<< ビ <<< ビ < ぴ <<< ピ <<< ピ"
                "&ふ < ぶ <<< ブ <<< ブ < ぷ <<< プ <<< プ"
                "&へ < べ <<< ベ <<< ベ < ぺ <<< ペ <<< ペ"
                "&ほ < ぼ <<< ボ <<< ボ < ぽ <<< ポ <<< ポ"
                "&わ < ヷ <<< ヷ"
                "&ゐ < ヸ"
                "&ゑ < ヹ"
                "&を < ヺ"
                "&ゝ < ゞ"
                "&ヽ < ヾ"

Also, for all of the root collators that do not already do this, we add specific collations for the England/Scotland/Wales emoji flags:

                "&🇿"
                "< 🏴󠁧󠁢󠁥󠁮󠁧󠁿"
                "< 🏴󠁧󠁢󠁳󠁣󠁴󠁿"
                "< 🏴󠁧󠁢󠁷󠁬󠁳󠁿"
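
The three flag strings above are hard to read because most of their characters are invisible. Assuming they are the standard emoji tag sequences for the subdivision codes gbeng, gbsct and gbwls, each one is U+1F3F4 followed by one tag character per ASCII letter and a closing cancel tag. The sketch below builds the sequences and prints their code points; the class name is hypothetical.

```java
// Hypothetical helper for illustration only.
final class SubdivisionFlag {
    /** Builds an emoji tag sequence: black flag, one tag character per ASCII letter, cancel tag. */
    static String of(String subdivision) {
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(0x1F3F4); // WAVING BLACK FLAG
        for (int i = 0; i < subdivision.length(); i++) {
            // Tag characters mirror ASCII at an offset of U+E0000.
            sb.appendCodePoint(0xE0000 + subdivision.charAt(i));
        }
        sb.appendCodePoint(0xE007F); // CANCEL TAG
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String code : new String[] {"gbeng", "gbsct", "gbwls"}) {
            StringBuilder hex = new StringBuilder();
            of(code).codePoints().forEach(cp -> hex.append(String.format("U+%04X ", cp)));
            System.out.println(code + ": " + hex.toString().trim());
        }
    }
}
```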