CLDR Ticket #4207(closed enhancement: fixed)

Opened 5 years ago

Last modified 2 years ago

Update to Arabic Ordering Specification

Reported by: srl Owned by: srl
Component: collation Data Locale: ar
Phase: rc Review: roozbeh
Weeks: 0.2 Data Xpath:
Xref:

ticket:7766

Description

Attachments

ArabicRules-UTF8.txt (323 bytes) - added by ramys@… 4 years ago.
Original and new Arabic collation rules, file encoding is UTF-8.
Arabic Collation Rules.doc (56.0 KB) - added by ramys@… 2 years ago.
Arabic Collation Rules
root_updated.xml (15.8 KB) - added by ramys@… 2 years ago.
root tailoring
ar_updated.xml (875 bytes) - added by ramys@… 2 years ago.
Arabic tailoring
Arabic Collation Rules_v2.doc (17.5 KB) - added by ramys@… 2 years ago.
Updated Arabic collation rules
ar_update_20140721.txt (2.1 KB) - added by srl 2 years ago.
as per meeting
pres_form.txt (23.0 KB) - added by srl 2 years ago.
presentation form rules (as per Markus)
arabic4207.zip (152.7 KB) - added by srl 2 years ago.
sample words with comparisons. Just a few differences.
ar.txt (16.1 KB) - added by srl 2 years ago.
updated ICU ar.txt (informative)
ar.xml (24.5 KB) - added by srl 2 years ago.
Updated ar.xml (for committing to CLDR)!

Change History

comment:1 Changed 5 years ago by mark

  • Status changed from new to closed
  • Resolution set to needs-more-info

Awaiting new proposal.

comment:2 Changed 4 years ago by markus

I wonder if a significant change of the Arabic ordering should be turned into a Unicode Public Review Issue (PRI). http://www.unicode.org/review/

comment:3 Changed 4 years ago by ramys@…

  • Status changed from closed to reopened
  • Resolution needs-more-info deleted

The collation rules differentiate between Alef group characters:

  • Arabic letter Hamza
  • Arabic letter Alef with Madda above
  • Arabic letter Alef with Hamza above
  • Arabic letter Waw with Hamza above
  • Arabic letter Alef with Hamza below
  • Arabic letter Yeh with Hamza Above
  • Arabic letter Alef

We need to sort Alef group characters based on the levels below.

Level 1:

  • Arabic letter Alef with Madda above will be treated like (Arabic letter Hamza + Arabic letter Alef).
  • Alef group characters other than Alef with Madda above will be considered the same character.
  • Tashkeel characters are ignored.
  • Tatweel characters are ignored.

Level 2:

  • Alef group characters will be sorted in the following order:
    • Arabic letter Hamza
    • Arabic letter Alef with Hamza above
    • Arabic letter Waw with Hamza above
    • Arabic letter Alef with Hamza below
    • Arabic letter Yeh with Hamza Above
    • Arabic letter Alef
  • Tashkeel characters are ignored.
  • Tatweel characters are ignored.

Level 3:

  • Tashkeel characters are considered.
  • Tatweel characters are ignored.
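The three levels above can be sketched as a sort key. This is a minimal Python illustration, not the CLDR tailoring itself. The names `sort_key`, `ALEF_GROUP` and `GROUP_PRIMARY` are hypothetical, letters outside the Alef group are simply weighted by code point, and Tashkeel are ordered by code point (U+064B to U+0652), which is an assumption because no order is stated here.

```python
# Illustrative three-level sort key for the Alef group (not the CLDR rules).
HAMZA, ALEF_MADDA, ALEF_HAMZA_ABOVE, WAW_HAMZA, ALEF_HAMZA_BELOW, YEH_HAMZA, ALEF = (
    "\u0621", "\u0622", "\u0623", "\u0624", "\u0625", "\u0626", "\u0627")
TATWEEL = "\u0640"
# Fathatan .. Sukun; code point order is an assumption of this sketch.
TASHKEEL = [chr(cp) for cp in range(0x064B, 0x0653)]

# Level 2 order within the Alef group, as listed in the proposal.
ALEF_GROUP = [HAMZA, ALEF_HAMZA_ABOVE, WAW_HAMZA, ALEF_HAMZA_BELOW, YEH_HAMZA, ALEF]
GROUP_PRIMARY = ord(HAMZA)  # one shared primary weight for the whole group


def sort_key(text):
    """Return (primary, secondary, tertiary) weight tuples for text."""
    # Level 1: Alef with Madda above is treated like Hamza + Alef.
    text = text.replace(ALEF_MADDA, HAMZA + ALEF)
    primary, secondary, tertiary = [], [], []
    for ch in text:
        if ch == TATWEEL:
            continue  # ignored on every level
        if ch in TASHKEEL:
            tertiary.append(TASHKEEL.index(ch) + 1)  # only counts on level 3
            continue
        if ch in ALEF_GROUP:
            primary.append(GROUP_PRIMARY)
            secondary.append(ALEF_GROUP.index(ch) + 1)
        else:
            primary.append(ord(ch))  # placeholder weight for other letters
            secondary.append(0)
        tertiary.append(0)
    return (tuple(primary), tuple(secondary), tuple(tertiary))
```

With this key, `sorted(words, key=sort_key)` treats two spellings that differ only in their Hamza carrier as equal on level 1 and separates them on level 2.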

Changed 4 years ago by ramys@…

Original and new Arabic collation rules, file encoding is UTF-8.

comment:4 Changed 4 years ago by markus

  • Cc markus.icu@… added
  • Component changed from unknown to data-collation

Would it be useful for Arabic if we could make a small number of distinctions on level 4? Would four distinctions (up to three differences from "equal down to level 3", or up to three "<<<<" operators in a row in the tailoring) be useful/sufficient?

I have a partial collation prototype in an ICU branch that supports limited level 4 tailoring.

comment:5 Changed 3 years ago by ramys@…

We can consider the Tatweel character on level 4. If you are OK with that, we can discuss and confirm it, and then update the proposal.

comment:6 Changed 3 years ago by srl

  • Status changed from reopened to new

comment:7 Changed 3 years ago by markus

  • Cc markus added; markus.icu@… removed
  • Data Locale set to ar
  • Weeks set to 0.2

comment:8 Changed 3 years ago by emmons

  • Owner changed from somebody to anybody
  • Status changed from new to assigned

comment:9 Changed 3 years ago by emmons

  • Owner anybody deleted
  • Status changed from assigned to new

comment:10 Changed 2 years ago by ramys@…

The final proposal we agreed on is the three-level proposal added under comment #3.

comment:11 Changed 2 years ago by emmons

  • Status changed from new to assigned
  • Component changed from data-collation to design
  • Priority changed from assess to critical
  • Milestone changed from UNSCH to 26rc
  • Owner set to srl
  • Type changed from defect to enhancement

comment:12 Changed 2 years ago by emmons

  • Status changed from assigned to design

comment:13 Changed 2 years ago by mark

Produce doc with summary from comment 3, plus readable rules.
Distribute to cldr-users, unicode, unicore, Ake for comment

comment:14 Changed 2 years ago by markus

  • Cc roozbeh added
  • Component changed from design to data-collation

comment:15 Changed 2 years ago by ramys@…

The collation rules differentiate between the Alef group of characters:

Arabic letter Hamza
Arabic letter Alef with Madda above
Arabic letter Alef with Hamza above
Arabic letter Waw with Hamza above
Arabic letter Alef with Hamza below
Arabic letter Yeh with Hamza Above
Arabic letter Alef

We need to sort Alef group characters based on the levels below, and to ensure different representations of the same character (Alef, Yeh, Alef Maksura and Tashkeel) in different ranges are considered the same character.

Level 1:

Arabic letter Alef with Madda above will be treated like (Arabic letter Hamza + Arabic letter Alef).
Alef group characters other than Alef with Madda above will be considered the same character.
Similar Arabic Alef character representations in different ranges should be considered the same character, as per the rules below:

o Arabic letter Hamza and Arabic letter Hamza isolated form should be considered the same character.
o Arabic letter Alef with Madda Above, Arabic letter Alef with Madda Above final form and Arabic letter Alef with Madda Above isolated form should be considered the same character.
o Arabic letter Alef with Hamza Above, Arabic letter Alef with Hamza Above final form and Arabic letter Alef with Hamza Above isolated form should be considered the same character.
o Arabic letter Waw with Hamza Above, Arabic letter Waw with Hamza Above final form and Arabic letter Waw with Hamza Above isolated form should be considered the same character.
o Arabic letter Alef with Hamza Below, Arabic letter Alef with Hamza Below final form and Arabic letter Alef with Hamza Below isolated form should be considered the same character.
o Arabic letter Yeh with Hamza Above, Arabic letter Yeh with Hamza Above final form, Arabic letter Yeh with Hamza Above isolated form, Arabic letter Yeh with Hamza Above medial form and Arabic letter Yeh with Hamza Above initial form should be considered the same character.
o Arabic letter Alef, Arabic letter Alef final form and Arabic letter Alef isolated form should be considered the same character.

Similar Arabic Alef Maksura character representations in different ranges should be considered the same character. Arabic letter Alef Maksura, Arabic letter Alef Maksura final form and Arabic letter Alef Maksura isolated form should be considered the same character.
Similar Arabic Yeh character representations in different ranges should be considered the same character. Arabic letter Yeh, Arabic letter Yeh final form, Arabic letter Yeh isolated form, Arabic letter Yeh medial form and Arabic letter Yeh initial form should be considered the same character.
Similar Arabic Tashkeel character representations in different ranges should be considered the same character, as per the rules below:

o Arabic Fathatan, Arabic Fathatan isolated form and Arabic Tatweel with Fathatan above should be considered the same character.
o Arabic Dammatan and Arabic Dammatan isolated form should be considered the same character.
o Arabic Kasratan and Arabic Kasratan isolated form should be considered the same character.
o Arabic Fatha, Arabic Fatha isolated form and Arabic Fatha medial form should be considered the same character.
o Arabic Damma, Arabic Damma isolated form and Arabic Damma medial form should be considered the same character.
o Arabic Kasra, Arabic Kasra isolated form and Arabic Kasra medial form should be considered the same character.
o Arabic Shadda, Arabic Shadda medial form and Arabic Shadda isolated form should be considered the same character.
o Arabic Sukun, Arabic Sukun isolated form and Arabic Sukun medial form should be considered the same character.

Tashkeel characters are ignored.
Tatweel characters are ignored.

Level 2:

Alef group characters will be sorted in the following order:

Arabic letter Hamza
Arabic letter Alef with Hamza above
Arabic letter Waw with Hamza above
Arabic letter Alef with Hamza below
Arabic letter Yeh with Hamza Above
Arabic letter Alef

Tashkeel characters are ignored.
Tatweel characters are ignored.

Level 3:

Tashkeel characters are considered and will be sorted in the following order:

o Arabic Fathatan
o Arabic Dammatan
o Arabic Kasratan
o Arabic Fatha
o Arabic Damma
o Arabic Kasra
o Arabic Shadda
o Arabic Sukun

Tatweel characters are ignored.
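The presentation-form equivalences listed under Level 1 can be approximated by folding each presentation form back to its base character. The sketch below is one possible way to do that, assuming the Unicode compatibility decompositions tagged `<isolated>`, `<final>`, `<initial>` and `<medial>` are the right source of truth. The function names are hypothetical and this is not how the attached rules are generated.

```python
import unicodedata

FORM_TAGS = {"<isolated>", "<final>", "<initial>", "<medial>"}
# Isolated and medial mark forms decompose to a space or Tatweel carrier
# plus the mark; the carrier is dropped so only the mark remains.
CARRIERS = {" ", "\u0640"}


def fold_presentation_form(ch):
    """Map one presentation form to its base character(s); others pass through."""
    parts = unicodedata.decomposition(ch).split()
    if not parts or parts[0] not in FORM_TAGS:
        return ch
    chars = [chr(int(cp, 16)) for cp in parts[1:]]
    return "".join(c for c in chars if c not in CARRIERS)


def fold(text):
    """Fold every presentation form in text."""
    return "".join(fold_presentation_form(ch) for ch in text)
```

For example, `fold_presentation_form("\uFE71")` (Arabic Tatweel with Fathatan above) yields plain Fathatan, which matches the first Tashkeel rule in the list.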

Changed 2 years ago by ramys@…

Arabic Collation Rules

comment:17 follow-up: ↓ 24 Changed 2 years ago by mark

We've been looking this over, but this is hard for non-Arabic speakers to assess. What we need is a comparison table for:

Character | DUCET | CURRENT CLDR | PROPOSAL
XX | 3.2 | 5.1 | 1.1

(I'm using 3.2 to mean primary order = 3 (among these characters!) and secondary = 2.)
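One way to produce the "primary.secondary" values for such a table is to walk a single rule chain and count relations. The helper below is only a sketch with a hypothetical name. It handles one reset followed by `<`, `<<`, `<<<` and `=` and ignores expansions, contractions and everything else in the real rule syntax.

```python
import re


def primary_secondary_ranks(rule):
    """Return {item: 'p.s'} for a single-chain rule such as '&a<<b<c'."""
    # Longest operators first so '<<<' is not split into '<<' and '<'.
    tokens = re.split(r"(<<<|<<|<|=)", rule.lstrip("&"))
    primary, secondary = 1, 1
    ranks = {tokens[0]: f"{primary}.{secondary}"}
    for op, item in zip(tokens[1::2], tokens[2::2]):
        if op == "<":
            primary, secondary = primary + 1, 1  # new primary, reset secondary
        elif op == "<<":
            secondary += 1
        # '<<<' and '=' leave both primary and secondary unchanged
        ranks[item] = f"{primary}.{secondary}"
    return ranks
```

Applied to a chain of six items joined by `<<`, it returns 1.1 through 1.6, i.e. one primary with six secondary ranks.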

comment:18 Changed 2 years ago by markus

I would like to understand this on a somewhat higher level.

To start with, Wikipedia says: "Modern dictionaries and other reference books do not use the abjadī order to sort alphabetically; instead, the newer hijā’ī order (with letters partially grouped together by similarity of shape) is used"

Which of these does the DUCET provide, or is it none of them?

We have an existing Arabic-language tailoring. What does it achieve, and why is the DUCET not good enough?

We have a root "search" tailoring which is much more extensive. Why does it need to do more than both the DUCET and the Arabic tailoring?

What does the proposed tailoring achieve that the other ones don't? Does it switch between abjadī and hijā’ī orders?

In email I had asked:

I would like to know why the DUCET order for Arabic is the way it is if it's not good for Arabic. Was there a rationale that we have to argue against? Is the DUCET Arabic order maybe a multi-language compromise?

Should we change Persian, Urdu, ... tailorings in similar ways?
What CLDR collation tailorings do we have that involve the Arabic script?

comment:19 Changed 2 years ago by markus

I spent a little time with the code charts. The biggest change from the DUCET to the proposed order is to move many differences from level x to level x+1 (primary to secondary, sec to ter, ter to identical), mostly without changing the relative order.

In particular, the vowel marks change from secondary CEs to tertiary CEs. http://www.unicode.org/charts/collation/chart_Secondary.html

The search rules also change some primary differences to secondary ones, but also change the relative order (except that search does not care about order, only about the level of difference).

The current Arabic tailoring has &teh<<teh-marbuta and &yeh<<alef-maksura.
Compared with root which has teh-marbuta<teh and alef-maksura<yeh (opposite order, different level).
Search also has &yeh<<alef-maksura (like ar.xml) but has &heh<<teh-marbuta (similar shape).

comment:20 Changed 2 years ago by markus

Here are the proposed rules with added LRMs and line breaks. I hope I did not mess them up.

&ت<<‎ة<<<‎ﺔ<<<‎ﺓ
&‎ي<<‎ى<<<‎ﯨ<<<‎ﯩ<<<‎ﻰ<<<‎ﻯ<<<‎ﲐ<<<‎ﱝ
&‎ء=‎آ‎/ا
&‎ء<<‎أ<<‎ؤ<<‎إ<<‎ئ<<‎ا
&‎ء=‎ﺀ
&‎آ=ﺁ=ﺂ
&‎أ=ﺃ=ﺄ
&‎ؤ=ﺅ=ﺆ
&‎إ=ﺇ=ﺈ
&‎ئ=‎ﺉ=‎ﺊ=‎ﺋ=‎ﺌ
&‎ا=‎ﺍ=‎ﺎ
&‎ى=‎ﻰ=‎ﻯ
&‎ي=‎ﻳ=‎ﻴ=‎ﻲ=‎ﻱ
&ً<<<ٌ<<<ٍ<<<َ<<<ُ<<<ِ<<<ّ<<<ْ‎
‎&ً=ﹰ=‎ﹱ
&ٌ‎=ﹲ
&ٍ‎=ﹴ
&َ‎=ﹶ=‎ﹷ
&ُ‎=ﹸ=‎ﹹ
&ِ‎=ﹺ=‎ﹻ
&ّ‎=ﹼ=‎ﹽ
&ْ‎=ﹾ=‎ﹿ

Here are the current tailorings with added LRMs.

ar.xml

‎&ت<<‎ة<<<‎ﺔ<<<‎ﺓ
‎&ي<<‎ى<<<‎ﯨ<<<‎ﯩ<<<‎ﻰ<<<‎ﻯ<<<‎ﲐ<<<‎ﱝ

root.xml type=search

# root search rules for Arabic, Hebrew
‎&ا	# 0627 ARABIC LETTER ALEF
‎<<<ﺎ<<<‎ﺍ	# FE8E, FE8D: FINAL FORM, ISOLATED FORM
‎<<آ‎		# 0622 ARABIC LETTER ALEF WITH MADDA ABOVE
‎<<<ﺂ<<<‎ﺁ	# FE82, FE81: FINAL FORM, ISOLATED FORM
‎<<أ‎		# 0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
‎<<<ﺄ<<<‎ﺃ	# FE84, FE83: FINAL FORM, ISOLATED FORM
‎<<إ‎		# 0625 ARABIC LETTER ALEF WITH HAMZA BELOW
‎<<<ﺈ<<<‎ﺇ	# FE88, FE87: FINAL FORM, ISOLATED FORM
‎&و‎	# 0648 ARABIC LETTER WAW
‎<<<ۥ‎	# 06E5: SMALL WAW
‎<<<ﻮ<<<‎ﻭ	# FEEE, FEED: FINAL FORM, ISOLATED FORM
‎<<ؤ‎		# 0624 ARABIC LETTER WAW WITH HAMZA ABOVE
‎<<<ﺆ<<<‎ﺅ	# FE86, FE85: FINAL FORM, ISOLATED FORM
‎&ي‎	# 064A ARABIC LETTER YEH
‎<<<ۦ‎	# 06E6: ARABIC SMALL YEH
‎<<<ﻳ<<<‎ﻴ<<<‎ﻲ<<<‎ﻱ	# FEF3, FEF4, FEF2, FEF1: INITIAL FORM, MEDIAL FORM, FINAL FORM, ISOLATED FORM
‎<<ئ‎		# 0626 ARABIC LETTER YEH WITH HAMZA ABOVE
‎<<<ﺋ<<<‎ﺌ<<<‎ﺊ<<<‎ﺉ	# FE8B, FE8C, FE8A, FE89: INITIAL FORM, MEDIAL FORM. FINAL FORM, ISOLATED FORM
‎<<ى‎		# 0649 ARABIC LETTER ALEF MAKSURA
‎<<<ﯨ<<<‎ﯩ	# FBE8, FBE9: UIGHUR KAZAKH KIRGHIZ ALEF MAKSURA INITIAL FORM, MEDIAL FORM
‎<<<ﻰ<<<‎ﻯ	# FEF0, FEEF: FINAL FORM, ISOLATED FORM
‎&ه‎	# 0647 ARABIC LETTER HEH
‎<<<ﻫ<<<‎ﻬ<<<‎ﻪ<<<‎ﻩ	# FEEB, FEEC, FEEA, FEE9: INITIAL FORM, MEDIAL FORM, FINAL FORM;, ISOLATED FORM
‎<<ة‎		# 0629 ARABIC LETTER TEH MARBUTA
‎<<<ﺔ<<<‎ﺓ	# FE94, FE93: FINAL FORM, ISOLATED FORM
&[last primary ignorable]<<׳‎	# 05F3 HEBREW PUNCTUATION GERESH

‎<<״‎	# 05F4 HEBREW PUNCTUATION GERSHAYIM
‎<<ـ‎	# 0640 ARABIC TATWEEL
# Don't need explicit entries for 064B - 0652 ARABIC FATHATAN - ARABIC SUKUN;
# these are already ignorable at level 1, and are not involved in contractions
<<ฺ	# 0E3A THAI CHARACTER PHINTHU

comment:21 Changed 2 years ago by markus

  • Cc mark, emmons, pedberg, yoshito, srl, mow added

Roozbeh and I reviewed the proposed Arabic tailoring. We agree with the goals of the proposal, but there are some defects and issues of consistency, and we recommend minor changes, in particular with an eye towards eventually moving some of the changes to the root table. We can help with modifying the Arabic tailoring, including fixes, formatting, and comments.

The proposal retains the current tailoring's ordering Teh Marbuta secondary-after Teh (except see later comments about presentation forms).
&ت<<‎ة<<<‎ﺔ<<<‎ﺓ

By comparison, the DUCET orders Teh Marbuta primary-before Teh.
We recommend to order Teh Marbuta secondary-before Teh, to keep the relative order from the DUCET. This tailoring could be promoted to the root table.

The proposal retains the current tailoring's ordering Alef Maksura secondary-after Yeh (except see later comments about presentation forms).
&‎ي<<‎ى<<<‎ﯨ<<<‎ﯩ<<<‎ﻰ<<<‎ﻯ<<<‎ﲐ<<<‎ﱝ

By comparison, the DUCET orders Alef Maksura primary-before Yeh.
We recommend to order Alef Maksura secondary-before Yeh, to keep the relative order from the DUCET.
This tailoring could be promoted to the root table, with the addition of Farsi Yeh, also with a secondary difference: Alef Maksura << Farsi Yeh << Yeh

The last two characters in the above rule are ligatures of Alef Maksura with superscript Alef. The DUCET maps them to expansions, and we should change the rule to also make them equivalent expansions after tailoring the components.

The proposal makes Alef with Madda above equal to Hamza + Alef.
&‎ء=‎آ‎/ا

This should be a secondary difference, not an identical mapping. This rule is specific to Arabic, not appropriate for the root table.
For clarity, this rule should use the simpler syntax of resetting on the expansion sequence.

The proposal orders various Hamzas with secondary differences rather than the DUCET's primary ones (but otherwise in the same order).
&‎ء<<‎أ<<‎ؤ<<‎إ<<‎ئ<<‎ا

This tailoring could be promoted to the root table, except for the last letter in this rule.
The last letter is Alef. In the root table, it should remain primary-after Hamzas.

The proposal attempts to change Tashkil (which are mostly vowel marks) from secondary CEs to tertiary CEs, presumably to change their contribution to a lower level than the proposed-secondary differences among the Hamza.
&ً<<<ٌ<<<ٍ<<<َ<<<ُ<<<ِ<<<ّ<<<ْ‎

However, by resetting on the first vowel, it achieves tertiary differences only *among* the Tashkil, but they still contribute secondary differences to the string. For the intended effect, this rule chain would have to start with &[last secondary ignorable]<<<Fathatan<<<... In addition, we should review other mappings that include Tashkil secondary weights, especially in expansions, to keep the behavior consistent. It may or may not be easier to make this change in the root table.

Changing Tashkil to tertiary CEs could be promoted to the root table.
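The difference between the rule as written and the intended effect can be illustrated with toy collation elements. In this sketch every weight is made up for illustration and the names are hypothetical. A mark that keeps a non-zero secondary weight still changes the secondary-level key, while a secondary-ignorable mark only shows up at tertiary strength.

```python
def sort_key(ces, strength):
    """Build a UCA-style key: per level, the non-zero weights of all CEs."""
    return tuple(
        tuple(ce[level] for ce in ces if ce[level] != 0)
        for level in range(strength)
    )


# (primary, secondary, tertiary); all weights are hypothetical.
BEH = (10, 1, 1)
FATHA_AS_WRITTEN = (0, 5, 2)  # reset on a secondary CE: secondary weight remains
FATHA_INTENDED = (0, 0, 2)    # after [last secondary ignorable]: tertiary only

# At secondary strength the as-written mark still distinguishes the strings.
differs_as_written = sort_key([BEH, FATHA_AS_WRITTEN], 2) != sort_key([BEH], 2)
# The intended mark is invisible at secondary strength and visible at tertiary.
equal_intended = sort_key([BEH, FATHA_INTENDED], 2) == sort_key([BEH], 2)
differs_at_tertiary = sort_key([BEH, FATHA_INTENDED], 3) != sort_key([BEH], 3)
```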

The proposal then changes the presentation forms of every otherwise-tailored letter from tertiary differences to identical mappings, presumably to make them less distinguishing than the Tashkil.

&‎ء=‎ﺀ
&‎آ=ﺁ=ﺂ
&‎أ=ﺃ=ﺄ
&‎ؤ=ﺅ=ﺆ
&‎إ=ﺇ=ﺈ
&‎ئ=‎ﺉ=‎ﺊ=‎ﺋ=‎ﺌ
&‎ا=‎ﺍ=‎ﺎ
&‎ى=‎ﻰ=‎ﻯ
&‎ي=‎ﻳ=‎ﻴ=‎ﻲ=‎ﻱ
‎&ً=ﹰ=‎ﹱ
&ٌ‎=ﹲ
&ٍ‎=ﹴ
&َ‎=ﹶ=‎ﹷ
&ُ‎=ﹸ=‎ﹹ
&ِ‎=ﹺ=‎ﹻ
&ّ‎=ﹼ=‎ﹽ
&ْ‎=ﹾ=‎ﹿ

This makes sense, except that it becomes inconsistent with the tertiary differences for presentation forms of all other letters. An alternative would be to make all presentation forms of all letters compare equal to their normal forms, but that would be a massive tailoring. We should consider retaining tertiary differences for presentation forms, unless we can make this change for all Arabic-script letters in the root table.

The removal of differences between presentation forms and normal letters could be promoted to the root table.

Roozbeh says: That the string is using a presentation form instead of the real character shouldn't matter at all in collation. They are the same thing to the user.

My note: With strength=identical, we already get presentation forms ordered after normal forms because they got allocated high in the BMP. They are also distinguished from each other on that level. This seems like the right treatment.

Roozbeh says: But that's only for languages that don't really use ZWNJ (especially Arabic). It does matter to users of languages that use ZWNJ in their orthography, including Persian, Urdu, and Pashto. ZWNJ is not completely ignorable in these languages (they typically want a secondary or tertiary difference for it). Now, presentation forms provide a mechanism to insert a ZWNJ or ZWJ without actually putting the U+200C/200D codepoint in. In such languages, you want an expansion from, say, U+FECA ARABIC LETTER AIN FINAL FORM to <ZWJ, AIN, ZWNJ>.

My note: Sounds definitely like root should not distinguish the presentation forms, and in Persian/Urdu/Pashto we would then tailor the exceptions. ZWNJ/ZWJ are completely ignorable in the DUCET anyway, which means that you need a tailoring to give them some weight.
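The expansion Roozbeh describes could be derived from the presentation-form tags. Only the final-form case, <ZWJ, AIN, ZWNJ>, comes from the discussion above. The joiner pairs for the isolated, initial and medial forms are an assumption made by analogy, and the function name is hypothetical.

```python
import unicodedata

ZWNJ, ZWJ = "\u200C", "\u200D"
# (joiner before, joiner after) implied by each form.
# Only "<final>" is taken from the example in the discussion.
JOINERS = {
    "<isolated>": (ZWNJ, ZWNJ),
    "<initial>": (ZWNJ, ZWJ),
    "<medial>": (ZWJ, ZWJ),
    "<final>": (ZWJ, ZWNJ),
}


def expand_presentation_form(ch):
    """Expand a single-letter presentation form to joiner + letter + joiner."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) != 2 or parts[0] not in JOINERS:
        return ch  # not a single-letter presentation form
    before, after = JOINERS[parts[0]]
    return before + chr(int(parts[1], 16)) + after
```

A Persian, Urdu or Pashto tailoring would then still need to give ZWNJ and ZWJ a weight, since they are completely ignorable in the DUCET.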

Generally: The rules need to be formatted with line breaks and with LRMs so that they are readable. Comments should be added, similar to the comments in the root search tailoring. We might want to use \u escapes for combining marks.
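As a rough sketch of that formatting step, a small helper could escape combining marks and append a comment naming each non-ASCII character, similar in spirit to the comments in the root search tailoring. The helper name and the exact comment layout are assumptions. It handles BMP characters only and leaves existing format characters such as LRM untouched.

```python
import unicodedata


def annotate_rule(rule):
    """Escape combining marks as \\uXXXX and append a '# XXXX NAME' comment."""
    escaped, notes = [], []
    for ch in rule:
        category = unicodedata.category(ch)
        if ord(ch) < 0x80 or category == "Cf":
            escaped.append(ch)  # ASCII syntax and format characters stay as-is
            continue
        if category == "Mn":
            escaped.append(f"\\u{ord(ch):04X}")  # combining mark: use an escape
        else:
            escaped.append(ch)
        notes.append(f"{ord(ch):04X} {unicodedata.name(ch)}")
    return "".join(escaped) + "\t# " + ", ".join(notes)
```

Letters stay literal so the rule remains readable to Arabic speakers, while the marks, which are hard to see on their own, become explicit.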

Generally: Remember to define expansions after their components have been tailored; possibly collect all expansions at the end.

To answer some of my questions in comment:18

  • The DUCET provides the hijā’ī order, probably based on a pre-existing Arabic codepage.
  • The proposal essentially moves some differences from level x to level x+1.
  • The search tailoring makes a couple of different choices because (a) order does not matter and (b) it is a little more based on letter shape than pronunciation.

comment:22 follow-up: ↓ 23 Changed 2 years ago by ramys@…

Point #1:
The proposal retains the current tailoring's ordering Teh Marbuta secondary-after Teh (except see later comments about presentation forms).
&ت<<‎ة<<<‎ﺔ<<<‎ﺓ
By comparison, the DUCET orders Teh Marbuta primary-before Teh.
We recommend to order Teh Marbuta secondary-before Teh, to keep the relative order from the DUCET. This tailoring could be promoted to the root table.

Ramy: Yes, it is better to order Teh Marbuta secondary before Teh.

Point #2:
The proposal retains the current tailoring's ordering Alef Maksura secondary-after Yeh (except see later comments about presentation forms).
&‎ي<<‎ى<<<‎ﯨ<<<‎ﯩ<<<‎ﻰ<<<‎ﻯ<<<‎ﲐ<<<‎ﱝ
By comparison, the DUCET orders Alef Maksura primary-before Yeh.
We recommend to order Alef Maksura secondary-before Yeh, to keep the relative order from the DUCET.
This tailoring could be promoted to the root table, with the addition of Farsi Yeh, also with a secondary difference: Alef Maksura << Farsi Yeh << Yeh

Ramy: Yes, it is better to order Alef Maksura secondary-before Yeh.

Point #3:

The proposal makes Alef with Madda above equal to Hamza + Alef.
&‎ء=‎آ‎/ا
This should be a secondary difference, not an identical mapping. This rule is specific to Arabic, not appropriate for the root table.
For clarity, this rule should use the simpler syntax of resetting on the expansion sequence.

Ramy: This rule is specific to Arabic; please explain what is required to handle it.

Point #4:

The proposal orders various Hamzas with secondary differences rather than the DUCET's primary ones (but otherwise in the same order).
&‎ء<<‎أ<<‎ؤ<<‎إ<<‎ئ<<‎ا
This tailoring could be promoted to the root table, except for the last letter in this rule.
The last letter is Alef. In the root table, it should remain primary-after Hamzas.

Ramy: Yes, the last letter (Alef) should be primary after Hamzas.

Point #5:

The proposal attempts to change Tashkil (which are mostly vowel marks) from secondary CEs to tertiary CEs, presumably to change their contribution to a lower level than the proposed-secondary differences among the Hamza.
&ً<<<ٌ<<<ٍ<<<َ<<<ُ<<<ِ<<<ّ<<<ْ‎
However, by resetting on the first vowel, it achieves tertiary differences only *among* the Tashkil, but they still contribute secondary differences to the string. For the intended effect, this rule chain would have to start with &[last secondary ignorable]<<<Fathatan<<<... In addition, we should review other mappings that include Tashkil secondary weights, especially in expansions, to keep the behavior consistent. It may or may not be easier to make this change in the root table.
Changing Tashkil to tertiary CEs could be promoted to the root table.

Ramy: Please clarify exactly what is required. If handling Tashkeel characters is very complex, we can leave them out of the current proposal to speed it up, as the Hamza sorting is the most important part and we have a project pending on it.

Point #6:

The proposal then changes the presentation forms of every otherwise-tailored letter from tertiary differences to identical mappings, presumably to make them less distinguishing than the Tashkil.
&‎ء=‎ﺀ
&‎آ=ﺁ=ﺂ
&‎أ=ﺃ=ﺄ
&‎ؤ=ﺅ=ﺆ
&‎إ=ﺇ=ﺈ
&‎ئ=‎ﺉ=‎ﺊ=‎ﺋ=‎ﺌ
&‎ا=‎ﺍ=‎ﺎ
&‎ى=‎ﻰ=‎ﻯ
&‎ي=‎ﻳ=‎ﻴ=‎ﻲ=‎ﻱ
‎&ً=ﹰ=‎ﹱ
&ٌ‎=ﹲ
&ٍ‎=ﹴ
&َ‎=ﹶ=‎ﹷ
&ُ‎=ﹸ=‎ﹹ
&ِ‎=ﹺ=‎ﹻ
&ّ‎=ﹼ=‎ﹽ
&ْ‎=ﹾ=‎ﹿ
This makes sense, except that it becomes inconsistent with the tertiary differences for presentation forms of all other letters. An alternative would be to make all presentation forms of all letters compare equal to their normal forms, but that would be a massive tailoring. We should consider retaining tertiary differences for presentation forms, unless we can make this change for all Arabic-script letters in the root table.
The removal of differences between presentation forms and normal letters could be promoted to the root table.
Roozbeh says: That the string is using a presentation form instead of the real character shouldn't matter at all in collation. They are the same thing to the user.
My note: With strength=identical, we already get presentation forms ordered after normal forms because they got allocated high in the BMP. They are also distinguished from each other on that level. This seems like the right treatment.
Roozbeh says: But that's only for languages that don't really use ZWNJ (especially Arabic). It does matter to users of languages that use ZWNJ in their orthography, including Persian, Urdu, and Pashto. ZWNJ is not completely ignorable in these languages (they typically want a secondary or tertiary difference for it). Now, presentation forms provide a mechanism to insert a ZWNJ or ZWJ without actually putting the U+200C/200D codepoint in. In such languages, you want an expansion from, say, U+FECA ARABIC LETTER AIN FINAL FORM to <ZWJ, AIN, ZWNJ>.
My note: Sounds definitely like root should not distinguish the presentation forms, and in Persian/Urdu/Pashto we would then tailor the exceptions. ZWNJ/ZWJ are completely ignorable in the DUCET anyway, which means that you need a tailoring to give them some weight.

Ramy: We think it is better to leave this out of the current proposal, as covering all other letters will require checking the effect on different languages (Persian, Urdu, etc.).

Point #7:

The rules need to be formatted with line breaks and with LRMs so that they are readable. Comments should be added, similar to the comments in the root search tailoring. We might want to use \u escapes for combining marks.

Ramy: Already handled earlier by Steven. We will check and update the rules if anything else is required.

Point #8:

Remember to define expansions after their components have been tailored; possibly collect all expansions at the end.

Ramy: Please clarify what is required here; our intention is to perform the Alef with Madda Above expansion before handling the tailoring of its components.

Ramy: Please let us confirm the items that need clarification and confirm other items to create the final proposal for final review.

comment:23 in reply to: ↑ 22 ; follow-up: ↓ 28 Changed 2 years ago by srl

OK everyone, let me see if I can summarize where we are. I'm going to use Ramy's numbering in comment:22 for reference.

Ramy, it seems that some of the proposed items are recommended to be in the Root tailoring rather than Arabic. Can you separate out the tailorings?

# | What | Waiting on | Status
- | Separate out root and Arabic tailorings | Ramy |
1 | Teh Marbuta: secondary-before Teh. Root. | Ramy |
2 | Alef Maksura secondary-before Yeh. Root. (w/ Farsi Yeh) | Ramy |
3 | Alef with Madda (Arabic only) | Markus / Ramy | need explanation?
4 | Alef primary after Hamzas. Root. | Ramy |
5 | Tashkil | Ramy | Remove?
6 | other mappings | Ramy | Remove?
7 | Formatting | Ramy | Already handled?
8 | Put expansions after tailorings, expansions at end | Markus / Ramy | Need clarification?

It seems like the Tashkil items 5 and 6 may be best removed from this phase of the proposal (i.e. targeting CLDR 26).

Thanks, all.

comment:24 in reply to: ↑ 17 ; follow-up: ↓ 27 Changed 2 years ago by srl

Replying to mark:

We've been looking this over, but this is hard for non-Arabic speakers to assess. What we need is a comparison table for:

Character | DUCET | CURRENT CLDR | PROPOSAL
XX | 3.2 | 5.1 | 1.1

(I'm using 3.2 to mean primary order = 3 (among these characters!) and secondary = 2.)

Mark, Markus, do you still need this table? Perhaps it is important to separate out root and Arabic here also?

comment:25 Changed 2 years ago by srl

John Emmons said that there are about 6 weeks to finish this. We will get a more detailed timeframe tomorrow.

comment:26 Changed 2 years ago by emmons

Needs to be in "reviewable" form (ar.xml and root.xml), either in a shared doc or attached to this ticket, for a "go or no go" decision by 7/30, in order to make the CLDR 26 release.

comment:27 in reply to: ↑ 24 Changed 2 years ago by srl

Please comment as to whether this looks like maybe it will help:

http://unicode.org/~srloomis/tmp/Arabic4207.html

.. also let me know what other strings to test. Thanks!

(learning a LOT about collation element iterators..)

comment:28 in reply to: ↑ 23 Changed 2 years ago by ramys@…

Kindly find the updated status below

# | What | Waiting on | Status | Comment
- | Separate out root and Arabic tailorings | N/A | Not needed. Need to re-combine ar and root | Markus confirmed this does NOT need to be done for 26.
1 | Teh Marbuta: secondary-before Teh. Root. | | done | The rule is updated to set Teh Marbuta secondary-before Teh.
2 | Alef Maksura secondary-before Yeh. Root. (w/ Farsi Yeh) | | done | The rule is updated to set Alef Maksura secondary-before Yeh.
3 | Alef with Madda (Arabic only) | | done | The Arabic letter Alef with Madda has two representations in Arabic: it can be represented by the character (U+0622) (آ) or by the Hamza character (U+0621) (ء) followed by Alef (U+0627) (ا). Arabic users consider them the same, so this rule is added to treat them as equal characters when sorting.
4 | Alef primary after Hamzas. Root. | srl / Markus | Any concern? | Alef must be ordered secondary (not primary) after the Hamzas, because an Arabic user may ignore the hamzas and type Alef instead. For example, the user may type أحمد as احمد and mean the same word, so we need to handle them as the same character when sorting. Also, Arabic letter Alef with Hamza above (U+0623) (أ), Arabic letter Waw with Hamza above (U+0624) (ؤ), Arabic letter Alef with Hamza below (U+0625) (إ) and Arabic letter Yeh with Hamza Above (U+0626) (ئ) are all considered accents of Arabic letter Hamza (U+0621) (ء). So they are updated as secondary order in the root table and removed from the secondary order of (U+0627) (ا), (U+0648) (و) and (U+064A) (ي), as they are considered different characters. Arabic letter Alef is updated in the root table to be primary after Arabic letter Hamza (kindly check root_updated.xml).
5 | Tashkil | | done | Removed
6 | other mappings | | done | Removed
7 | Formatting | srl | To Do | formatting and chart
8 | Put expansions after tailorings, expansions at end | | done | updated in ar_updated.xml

Changed 2 years ago by ramys@…

root tailoring

Changed 2 years ago by ramys@…

Arabic tailoring

Changed 2 years ago by ramys@…

Updated Arabic collation rules

comment:29 Changed 2 years ago by srl

OK, I have updated my chart at http://unicode.org/~srloomis/tmp/Arabic4207.html - it matches the 'v2' proposal. Please review it, especially Mark (done for your input).

What I've done is take the "ar_updated.xml" attached here, and added the following rule (which is in the doc).

#The following Hamza characters are considered the same character with
#different accent:
&ء<<أ<<ؤ<<إ<<ئ<<ا

# &U+0621<<U+0623<<U+0624<<U+0625<<U+0626<<U+0627

Ramy, I think you updated the "search collator" in the root rule. So I added above to ar.xml.

The above don't have LRM/RLMs properly put in yet. Will refresh again Monday morning.

Changed 2 years ago by srl

as per meeting

Changed 2 years ago by srl

presentation form rules (as per Markus)

comment:30 Changed 2 years ago by srl

todo's:

  • IBM-EG to provide samples
  • Steven to provide presentation forms and final xml
  • retain/rename original rules as "compat"
    • file new ticket to remove "compat" in the future

comment:31 Changed 2 years ago by markus

Initially, keep the current tailoring as type="compat". Add "compat" to bcp47/collation.xml.

Mention in LDML Migration section, send email about new keyword value to cldr-users.
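Since bcp47/collation.xml defines the values of the `co` key, adding "compat" there would let a locale identifier request the old tailoring through the `-u-co-` extension, for example `ar-u-co-compat`. A minimal sketch of pulling that value out of a tag follows; `collation_type` is a hypothetical helper written for this illustration, not a CLDR or ICU API, and it handles only simple, well-formed tags.

```python
def collation_type(tag):
    """Return the value of the 'co' key in the -u- extension, or None.

    Simplified sketch: assumes a well-formed tag and does not validate
    subtags against the BCP 47 grammar.
    """
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return None
    extension = subtags[subtags.index("u") + 1:]
    for i, subtag in enumerate(extension):
        if len(subtag) == 1:
            # A new singleton starts a different extension; stop looking.
            break
        if subtag == "co" and i + 1 < len(extension):
            return extension[i + 1]
    return None


print(collation_type("ar-u-co-compat"))
print(collation_type("ar"))
```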

comment:32 Changed 2 years ago by srl

This one is problematic because of the space (U+0020). Thoughts?

                "‎&جل جلاله‎=ﷻ"

(NB sorry, but trac is closed for comments without an account. Send me email if you have trouble.)

comment:33 Changed 2 years ago by srl

  • Status changed from design to accepted

Changed 2 years ago by srl

sample words with comparisons. Just a few differences.

comment:34 Changed 2 years ago by srl

I attached some sample runs, with words from our Egypt colleagues. The file 'run.txt' summarizes the differences and is explained here:

897 OldNew_Level1_srl_diff.txt   # OLD vs NEW collation, strength 1
4 NewEGvsNewsrl_Level1_diff.txt  # IBM-EG expected vs SRL's actual - 4 lines of diff

955 OldNew_Level2_srl_diff.txt   # OLD vs NEW collation, strength 2
44 NewEGvsNewsrl_Level2_diff.txt # IBM-EG vs SRL - 44 lines of diff

911 OldNew_Level3_srl_diff.txt   # OLD vs NEW, str 3
8 NewEGvsNewsrl_Level3_diff.txt  # IBM-EG vs SRL - 8 lines of diff

Ramy, Heba -- can you look at the NewEGvsNewsrl* files and also the New*_srl.txt files to see why they are different and if there are any problems?

comment:35 follow-up: ↓ 36 Changed 2 years ago by roozbeh

For U+FDFB ﷻ, you can just use the word without space: جلجلاله.

comment:36 in reply to: ↑ 35 Changed 2 years ago by srl

Replying to roozbeh:

For U+FDFB ﷻ, you can just use the word without space: جلجلاله.

OK, what about this one? Removing the spaces seems like it would be bad. Or maybe use ZWNJ?

//                "‎&صلى الله عليه وسلم‎=ﷺ"

comment:37 Changed 2 years ago by roozbeh

Same there. Just remove the spaces: صلىاللهعليهوسلم. ZWNJ may create complications, especially since some may not know about the character.
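The suggestion in this exchange (drop U+0020 from the expansion string before it goes into the rule) can be sketched as below. The helper name `equal_rule` is hypothetical, the phrases are spelled out as code point escapes to avoid invisible bidi marks, and the output is only a rule string in the `&expansion=ligature` shape quoted in comment:32; nothing here is taken from the committed ar.xml.

```python
SPACE = "\u0020"

# Phrases as quoted in the ticket, written as escapes so that no LRM/RLM
# characters sneak in from copy and paste.
JALLAJALALOUHOU = "\u062C\u0644 \u062C\u0644\u0627\u0644\u0647"
SALLALLAHOU = (
    "\u0635\u0644\u0649 \u0627\u0644\u0644\u0647 "
    "\u0639\u0644\u064A\u0647 \u0648\u0633\u0644\u0645"
)


def equal_rule(phrase, ligature):
    """Build '&<phrase without spaces>=<ligature>' (illustrative only)."""
    return "&" + phrase.replace(SPACE, "") + "=" + ligature


for phrase, ligature in ((JALLAJALALOUHOU, "\uFDFB"), (SALLALLAHOU, "\uFDFA")):
    rule = equal_rule(phrase, ligature)
    # Show the rule as code points so the missing U+0020 is easy to verify.
    print(" ".join(f"U+{ord(ch):04X}" for ch in rule))
```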

Changed 2 years ago by srl

updated ICU ar.txt (informative)

Changed 2 years ago by srl

Updated ar.xml ( for committing to CLDR) !

comment:38 Changed 2 years ago by srl

OK - please review the ar.xml (and ICU ar.txt if you wish) enclosed. The sorting of the test files did not change with the last two updates. I removed spaces as per Roozbeh.

Markus, Roozbeh, I will commit this on your approval.

comment:39 Changed 2 years ago by srl

OK, the only file to review now is:

branches/srl/ar4207/common/collation/ar.xml

All other files above are more or less obsolete, but may be useful background.

I still have a TODO to update the documentation in that file; I will do so.

comment:40 Changed 2 years ago by srl

  • Xref set to 7766

comment:41 Changed 2 years ago by srl

Also, the chart at http://unicode.org/~srloomis/tmp/Arabic4207.html is updated periodically.

comment:42 Changed 2 years ago by srl

  • Summary changed from "Add shared weight and Arabic tashkeel to collation" to "Update to Arabic Ordering Specification"

comment:43 Changed 2 years ago by srl

A question came up: why not have a tertiary difference for this one?

أكبر
ﷳ
(U+FDF3) ARABIC LIGATURE AKBAR ISOLATED FORM

comment:44 Changed 2 years ago by srl

  • Heba is wondering about just making the following tertiary:

[\uFDF0\uFDF1\uFDF5\uFDFA\uFDFB\uFDFD\uFDF4\uFDF2]

(U+FDF0) ARABIC LIGATURE SALLA USED AS KORANIC STOP SIGN ISOLATED FORM
(U+FDF1) ARABIC LIGATURE QALA USED AS KORANIC STOP SIGN ISOLATED FORM
(U+FDF2) ARABIC LIGATURE ALLAH ISOLATED FORM
(U+FDF4) ARABIC LIGATURE MOHAMMAD ISOLATED FORM
(U+FDF5) ARABIC LIGATURE SALAM ISOLATED FORM
(U+FDFA) ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
(U+FDFB) ARABIC LIGATURE JALLAJALALOUHOU
(U+FDFD) ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM

ﷰﷱﷲﷴﷵﷺﷻ ﷽

  • which would mean these would be identical:

[\uFDF2-\uFDF4\uFDF6-\uFDF9\uFDFC]
(U+FDF2) ARABIC LIGATURE ALLAH ISOLATED FORM
(U+FDF3) ARABIC LIGATURE AKBAR ISOLATED FORM
(U+FDF4) ARABIC LIGATURE MOHAMMAD ISOLATED FORM
(U+FDF6) ARABIC LIGATURE RASOUL ISOLATED FORM
(U+FDF7) ARABIC LIGATURE ALAYHE ISOLATED FORM
(U+FDF8) ARABIC LIGATURE WASALLAM ISOLATED FORM
(U+FDF9) ARABIC LIGATURE SALLA ISOLATED FORM
(U+FDFC) RIAL SIGN

ﷲﷳﷴﷶﷷﷸﷹ﷼
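The two bracketed sets in this comment can be expanded and cross-checked with a short script. This is a minimal sketch: `expand_simple_set` is a hypothetical helper that understands only `\uXXXX` escapes and `\uXXXX-\uXXXX` ranges, not the full UnicodeSet syntax. Character names come from Python's `unicodedata` module.

```python
import re
import unicodedata


def expand_simple_set(pattern):
    """Expand a bracketed list of \\uXXXX escapes and \\uXXXX-\\uXXXX ranges."""
    body = pattern.strip()[1:-1]
    code_points = set()
    for start, end in re.findall(
        r"\\u([0-9A-Fa-f]{4})(?:-\\u([0-9A-Fa-f]{4}))?", body
    ):
        first = int(start, 16)
        last = int(end, 16) if end else first
        code_points.update(range(first, last + 1))
    return code_points


tertiary = expand_simple_set(r"[\uFDF0\uFDF1\uFDF5\uFDFA\uFDFB\uFDFD\uFDF4\uFDF2]")
identical = expand_simple_set(r"[\uFDF2-\uFDF4\uFDF6-\uFDF9\uFDFC]")

# Code points that appear in both sets as written in the comment.
for cp in sorted(tertiary & identical):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

As written in the comment, U+FDF2 and U+FDF4 fall into both sets, which the intersection makes visible.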

comment:46 in reply to: ↑ 45 Changed 2 years ago by srl

Replying to srl:

Please review

branches/srl/ar4207/common/collation/ar.xml

and

http://unicode.org/~srloomis/tmp/Arabic4207.html

updated for r10711 and later

Merging this to trunk before tomorrow's meeting (Wed Aug 6 8am PT) if there are no objections.

comment:47 Changed 2 years ago by srl

It is in trunk. Thanks all! We'll make any changes needed, of course.

comment:48 Changed 2 years ago by srl

  • Status changed from accepted to reviewing
  • Review set to roozbeh

comment:49 Changed 2 years ago by markus

  • Phase set to rc
  • Milestone changed from 26rc to 26

comment:50 Changed 2 years ago by roozbeh

  • Status changed from reviewing to closed
  • Resolution set to fixed

No time to review the Java code, but the output data and other stuff in ar.xml appear sane.
