Re: Arabic Normalization chart

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 09 2008 - 19:43:08 CDT

Next message: Maha Hassan: "Re: Arabic Normalization chart"

Previous message: Maha Hassan: "Re: Arabic Normalization chart"
Maybe in reply to: Maha Hassan: "Arabic Normalization chart"
Next in thread: Maha Hassan: "Re: Arabic Normalization chart"

> Thanks for the references.
> But, why U+06C7 has no decomposition? I can enter from Arabic
> keyboard U+0648\U+0619 and get the exact glyph in U+06C7.
> How come u+0623 has a decomposition and not U+06C7?
> What the criteria?

It is an interaction of the requirements for normalization
stability with the timing of the addition of various characters
for the Arabic script.

U+06C7 was already an encoded character in Unicode as of Version 1.1,
dating back to 1993.

The "composition version" for Unicode normalization stability
is defined to be Version 3.1, dating back to 2001. See
http://www.unicode.org/reports/tr15/#Versioning
for details. Among other things, that means that no character
that was either decomposed or *not* decomposed as of
Version 3.1 can ever have its decomposition status
changed by a later version of the standard.
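
A rough way to see that stability in practice, assuming Python's
standard unicodedata module: it ships a frozen Unicode 3.2 database
(ucd_3_2_0) next to the current one. Version 3.2 is not the
composition version 3.1 itself, so treat this only as a sketch of
the idea, not as a check against the normative data.

```python
import unicodedata

# Frozen Unicode 3.2 database bundled with Python (an approximation:
# the composition version proper is 3.1, which Python does not ship).
old = unicodedata.ucd_3_2_0

# The decomposed alef/waw/yeh forms plus the undecomposed U+06C7.
code_points = (0x0622, 0x0623, 0x0624, 0x0625, 0x0626, 0x06C7)

status = {}
for cp in code_points:
    ch = chr(cp)
    then = old.decomposition(ch)          # mapping in the 3.2 data
    now = unicodedata.decomposition(ch)   # mapping in the current data
    status[cp] = (then, now)
    print(f"U+{cp:04X}  3.2: {then or '(none)'}  current: {now or '(none)'}")

# Decomposition status is the same in both databases.
unchanged = all(then == now for then, now in status.values())
```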

Those few Arabic letters that *do* have decompositions,
such as U+0622..U+0626, were *already* decomposed as of
Version 3.1, based on U+0653..U+0655 (madda and/or hamza
above or below), which were also already encoded as
of Version 3.1.
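
For anyone who wants to see those mappings directly, here is a small
sketch using Python's unicodedata module (my choice of tool, not
anything the standard prescribes). It lists the canonical
decomposition of each letter in that range and checks that NFD
followed by NFC gives back the original character.

```python
import unicodedata

decomps = {}
for cp in range(0x0622, 0x0627):          # U+0622..U+0626 inclusive
    ch = chr(cp)
    # Space-separated hex code points of the canonical decomposition.
    decomps[cp] = unicodedata.decomposition(ch)
    nfd = unicodedata.normalize("NFD", ch)
    # Recomposing the decomposed form returns the original letter.
    assert unicodedata.normalize("NFC", nfd) == ch
    print(f"U+{cp:04X} {unicodedata.name(ch)} -> {decomps[cp]}")
```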

But combining marks added *after* Version 3.1 cannot be
used in decompositions of Arabic characters encoded
*before* Version 3.1 (or indeed those added in any
version earlier than when the combining marks themselves
were added).

U+0619 ARABIC SMALL DAMMA was just added in Unicode Version 5.1,
so it cannot be used to decompose any Arabic character from
earlier versions. To do so would destabilize the normalization
of Unicode data.
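
The practical consequence can be checked with a short sketch, again
assuming Python's unicodedata module: the typed sequence
U+0648 U+0619 and the precomposed U+06C7 stay distinct under both
NFC and NFD, whereas U+0623 and its sequence U+0627 U+0654 normalize
to the same string.

```python
import unicodedata

precomposed = "\u06C7"        # the single encoded letter U+06C7
typed = "\u0648\u0619"        # waw followed by U+0619 ARABIC SMALL DAMMA

def same_under(form, a, b):
    """True if a and b are identical after normalizing to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# U+06C7 has no canonical decomposition at all.
has_decomposition = unicodedata.decomposition(precomposed) != ""

nfc_equal = same_under("NFC", typed, precomposed)
nfd_equal = same_under("NFD", typed, precomposed)

# Contrast: alef with hamza above does have a decomposition.
alef_hamza_equal = same_under("NFC", "\u0627\u0654", "\u0623")

print(has_decomposition, nfc_equal, nfd_equal, alef_hamza_equal)
```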

See:

http://www.unicode.org/policies/stability_policy.html#Normalization

for the formal statement of this requirement for stability.

Also, it should be noted that U+0619 (and similar characters
in the range U+0610..U+0618) are really intended for honorifics and
Koranic annotation -- they are not nuqtas used as diacritics
to create new Arabic characters.

So, for example, U+0615 ARABIC SMALL HIGH TAH is an annotation
mark, and cannot be used to decompose U+0679 ARABIC LETTER TTEH
(which looks like a dotless beh with a small high tah diacritic)
or U+06BB ARABIC LETTER RNOON (which looks like a noon ghunna
with a small high tah diacritic). So even though you could
type such combinations and have them appear like those letters,
they would not be canonical equivalents, nor would applications
consider them to compare equal to each other.
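
A sketch of that comparison, assuming Python's unicodedata module.
The base letters are my own picks for illustration: U+066E ARABIC
LETTER DOTLESS BEH for the "dotless beh" and U+06BA ARABIC LETTER
NOON GHUNNA for the "noon ghunna".

```python
import unicodedata

SMALL_HIGH_TAH = "\u0615"     # annotation mark, not a nuqta

# Encoded letter -> look-alike sequence (base letters are assumptions).
lookalikes = {
    "\u0679": "\u066E" + SMALL_HIGH_TAH,   # TTEH vs dotless beh + tah
    "\u06BB": "\u06BA" + SMALL_HIGH_TAH,   # RNOON vs noon ghunna + tah
}

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

results = {}
for letter, sequence in lookalikes.items():
    results[letter] = canonically_equivalent(letter, sequence)
    print(f"{unicodedata.name(letter)}: equivalent to sequence? {results[letter]}")
```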

I realize that this is complicated and not at all self-evident
from just using an Arabic keyboard and looking at the
Unicode charts. But the constraints are in place because
of the overriding requirement to keep Unicode normalization
stable, not only for Arabic, but for all Unicode characters.

--Ken

