Re: Arabic Normalization chart

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 09 2008 - 19:43:08 CDT

    > Thanks for the references.
    > But, why U+06C7 has no decomposition? I can enter from Arabic
    > keyboard U+0648\U+0619 and get the exact glyph in U+06C7. 
    > How come u+0623 has a decomposition and not U+06C7?
    > What the criteria?

    It is an interaction of the requirements for normalization
    stability with the timing of the addition of various characters
    for the Arabic script.

    U+06C7 was already an encoded character in Unicode as of Version 1.1,
    dating back to 1993.

    The "composition version" for Unicode normalization stability
    is defined to be Version 3.1, dating back to 2001. See
    http://www.unicode.org/reports/tr15/#Versioning
    for details. Among other things, that means that no character
    that was either decomposed or *not* decomposed as of
    Version 3.1 can ever have its decomposition status
    changed by a later version of the standard.

    Those few Arabic letters that *do* have decompositions,
    such as U+0622..U+0626, were *already* decomposed as of
    Version 3.1, based on U+0653..U+0655 (madda and/or hamza
    above or below), which were also already encoded as
    of Version 3.1.
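
    As a rough illustration, the decompositions of U+0622..U+0626 can be
    listed with Python's standard unicodedata module. This is only a
    sketch: the helper name canonical_decomposition is made up for the
    example, and the output reflects whatever Unicode version your
    Python build ships with.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as a list of code points.

    Returns an empty list when the character has no decomposition or
    only a compatibility decomposition (those start with a "<tag>").
    """
    field = unicodedata.decomposition(ch)
    if not field or field.startswith("<"):
        return []
    return [int(cp, 16) for cp in field.split()]


# List the decompositions of the Arabic letters mentioned above.
for code_point in range(0x0622, 0x0627):
    ch = chr(code_point)
    parts = " + ".join(f"U+{p:04X}" for p in canonical_decomposition(ch))
    print(f"U+{code_point:04X} {unicodedata.name(ch)} -> {parts}")
```

    Each of the five letters should come out as a base letter followed
    by one of U+0653..U+0655.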

    But combining marks added *after* Version 3.1 cannot be
    used in decompositions of Arabic characters encoded
    *before* Version 3.1 (or indeed those added in any
    version earlier than when the combining marks themselves
    were added).

    U+0619 ARABIC SMALL DAMMA was just added in Unicode Version 5.1,
    so it cannot be used to decompose any Arabic character from
    earlier versions. To do so would destabilize the normalization
    of Unicode data.
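
    The same kind of check can be made for U+06C7 and the sequence
    <U+0648, U+0619>. The sketch below assumes Python's unicodedata
    module; the helper names are illustrative only. Python also ships a
    frozen Unicode 3.2.0 database (unicodedata.ucd_3_2_0), which is not
    Version 3.1 itself but is old enough to show that U+0619 did not yet
    exist at that point.

```python
import unicodedata

U_LETTER = "\u06C7"                 # ARABIC LETTER U
WAW_SMALL_DAMMA = "\u0648\u0619"    # ARABIC LETTER WAW + ARABIC SMALL DAMMA


def code_points(text):
    """Format a string as a list of U+XXXX labels (ASCII-safe output)."""
    return [f"U+{ord(ch):04X}" for ch in text]


def normalization_forms(text):
    """Return the NFC and NFD forms of text, keyed by form name."""
    return {form: unicodedata.normalize(form, text) for form in ("NFC", "NFD")}


# U+06C7 has an empty decomposition field, so NFC and NFD leave it alone.
print("decomposition of U+06C7:", repr(unicodedata.decomposition(U_LETTER)))
for form, value in normalization_forms(U_LETTER).items():
    print(form, "of U+06C7:", code_points(value))

# The typed sequence is likewise untouched: it never composes to U+06C7.
for form, value in normalization_forms(WAW_SMALL_DAMMA).items():
    print(form, "of U+0648 U+0619:", code_points(value))

# In the 3.2.0 snapshot, U+0619 is still unassigned (general category Cn),
# while U+06C7 is already an encoded letter.
old = unicodedata.ucd_3_2_0
print(old.unidata_version, old.category("\u0619"), old.category(U_LETTER))
```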

    See:

    http://www.unicode.org/policies/stability_policy.html#Normalization

    for the formal statement of this requirement for stability.

    Also, it should be noted that U+0619 (and similar characters
    in the range U+0610..U+0618) are really intended for honorifics and
    Koranic annotation -- they are not nuqtas used as diacritics
    to create new Arabic characters.

    So, for example, U+0615 ARABIC SMALL HIGH TAH is an annotation
    mark, and cannot be used to decompose U+0679 ARABIC LETTER TTEH
    (which looks like a dotless beh with a small high tah diacritic)
    or U+06BB ARABIC LETTER RNOON (which looks like a noon ghunna
    with a small high tah diacritic). So even though you could
    type such combinations and have them appear like those letters,
    they would not be canonical equivalents, nor would applications
    consider them to compare equal to each other.
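
    A normalizing comparison makes the difference visible. The sketch
    below again assumes Python's unicodedata module, and the function
    name canonically_equivalent is just for the example. It treats two
    strings as canonically equivalent when their NFD forms are
    identical, and uses U+0623 as a contrasting case that *does* have a
    decomposition.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if a and b have identical NFD (canonical decomposed) forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# (label, precomposed letter, look-alike sequence)
CASES = [
    ("TTEH vs DOTLESS BEH + SMALL HIGH TAH", "\u0679", "\u066E\u0615"),
    ("RNOON vs NOON GHUNNA + SMALL HIGH TAH", "\u06BB", "\u06BA\u0615"),
    ("ALEF WITH HAMZA ABOVE vs ALEF + HAMZA ABOVE", "\u0623", "\u0627\u0654"),
]

for label, letter, sequence in CASES:
    print(f"{label}: {canonically_equivalent(letter, sequence)}")
```

    Only the last pair should be reported as equivalent; the first two
    merely look alike when rendered.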

    I realize that this is complicated and not at all self-evident
    from just using an Arabic keyboard and looking at the
    Unicode charts. But the constraints are in place because
    of the overriding requirement to keep Unicode normalization
    stable, not only for Arabic, but for all Unicode characters.

    --Ken


