Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

From: Mark Davis ☕ (mark@macchiato.com)
Date: Thu Jul 29 2010 - 09:51:59 CDT


    Mark

    *— Il meglio è l’inimico del bene —*

    On Thu, Jul 29, 2010 at 05:57, Philippe Verdy <verdy_p@wanadoo.fr> wrote:

    > "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
    > >
    > > On 2010/07/29 13:33, karl williamson wrote:
    > > > Asmus Freytag wrote:
    > > >> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
    > >
    > > >>> Well, there actually is such a script, namely Han. The digits (一、
    > > >>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
    > > >>> decimal place-value digits, and they are scattered widely, and of
    > > >>> course there is a lot of modern living practice.
    > >
    > > >> The situation is worse than you indicate, because the same characters
    > > >> are also used as elements in a system that doesn't use place-value,
    > > >> but uses special characters to show powers of 10.
    > >
    > > No. Sequences of numeric Kanji are also used in names and word-plays,
    > > and as sequences of individual small numbers.
    >
    > (1) Existing exception :
    >
    > There's one example of a digit which has a numeric type = decimal, AND
    > is encoded in a "scattered" way:
    >
    > 19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N
    >
    > The other nine decimal digits for the Tham variant of the New Tai Lue
    > digits are borrowed from another sequence of decimal digits, starting
    > at U+19D0 (for digit zero) with the exception of U+19D1 which is
    > replaced (for digit one). Both sets are assigned in the same
    > "New_Tai_Lue" script property value.
    >
    > So the additional stability proposal will not be enforceable.
    >

    On the contrary. Were we to want such a policy, the implication would be
    either to:
    (a) change the type of 19DA from Nd to No (what I think would be the right
    thing to do)
    (b) grandfather in the character.
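
The contiguity condition under discussion can be checked mechanically. The following is a minimal Python sketch, not part of the original message: `find_irregular_digit_runs` and `decimal_digits_in_this_python` are hypothetical helper names, and a run is assumed to be "regular" when it is exactly ten consecutive code points carrying the values 0 through 9 in order. The sample data encodes the situation as the quoted text describes it; it is not read from the UCD.

```python
import sys
import unicodedata


def find_irregular_digit_runs(pairs):
    """Group (code point, decimal value) pairs into runs and return the
    runs that are not exactly the values 0..9 in order.

    `pairs` must be sorted by code point. A run continues only while both
    the code point and the value increase by one.
    """
    runs, current = [], []
    for cp, value in pairs:
        if current and (cp != current[-1][0] + 1 or value != current[-1][1] + 1):
            runs.append(current)
            current = []
        current.append((cp, value))
    if current:
        runs.append(current)
    # A regular run has ten members whose values are 0, 1, ..., 9.
    return [run for run in runs if [v for _, v in run] != list(range(10))]


def decimal_digits_in_this_python():
    """Yield (code point, value) for every character that this
    interpreter's Unicode data classifies as general category Nd."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd":
            yield cp, unicodedata.decimal(ch)


# Assumption taken from the quoted message: U+19D0..U+19D9 carry 0..9 and
# U+19DA is an additional digit one, all typed as decimal.
as_described = [(0x19D0 + i, i) for i in range(10)] + [(0x19DA, 1)]
irregular = find_irregular_digit_runs(as_described)
```

Under that assumption, `irregular` holds the single stray run for U+19DA. Passing `decimal_digits_in_this_python()` to the same function reports whatever irregular runs exist in the Unicode data bundled with the interpreter, which depends on the Unicode version in use.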

    >
    > (2) Arabic digits :
    >
    > Such a case was avoided for the Eastern/Extended variant of Arabo-Indic
    > digits in U+06F0..U+06F9, without borrowing the common forms for the
    > Standard variant in U+0660..U+0669: they were reencoded separately to
    > create a complete sequence of 10 digits, even if most of them (all
    > except 4 to 6) are exactly similar and belong to the same unified
    > "script".
    >
    > But what is even more "strange" is that the Standard Arabic digits are
    > assigned to the "Common" script, when the Eastern/Extended variant is
    > assigned to the "Arabic" script (look at the Unicode script property
    > value, from the file "Scripts-5.2.0.txt" in the UCD).
    >
    > If you just look at this property, you may think that the
    > Extended/Eastern digits are the standard ones for the Arabic script:
    > this is a side-effect of unification of Western and Eastern variants
    > of the Arabic script.
    >

    It is not so strange. Read
    http://www.unicode.org/reports/tr24/proposed.html#Multiple_Script_Values,
    and other parts of #24 describing Common.
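
To see the assignments being discussed, the Script property can be read from a file in the `Scripts.txt` format mentioned above. A minimal Python sketch follows. The helper names `parse_scripts` and `script_of` are hypothetical, and the two sample lines are written in the `Scripts.txt` line format to mirror the assignments described in the quoted message; they are illustrative rather than copied from a particular release.

```python
def parse_scripts(lines):
    """Parse lines in the UCD Scripts.txt format into
    (first, last, script) tuples of inclusive code point ranges."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        span, script = (field.strip() for field in data.split(";"))
        first, _, last = span.partition("..")
        # A line may give a single code point instead of a range.
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges


def script_of(cp, ranges, default="Unknown"):
    """Return the script of code point `cp`, or `default` if unlisted."""
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return default


# Illustrative lines following the assignments described in the message.
SAMPLE = [
    "# sample data, not a verbatim excerpt",
    "0660..0669    ; Common # Nd  [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE",
    "06F0..06F9    ; Arabic # Nd  [10] EXTENDED ARABIC-INDIC DIGIT ZERO..EXTENDED ARABIC-INDIC DIGIT NINE",
]
ranges = parse_scripts(SAMPLE)
standard_four = script_of(0x0664, ranges)
extended_four = script_of(0x06F4, ranges)
```

With the sample data, the two digit sets come back with different script values even though both are used with Arabic letters, which is the asymmetry the quoted text points out.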

    >
    >
    > (3) Unification of the Arabic script:
    >
    > Ideally, there should be two additional separate ISO 15924 script
    > codes for the Western and Eastern variants of the Arabic script (possibly
    > [Arbs] for Standard/Western, and [Arbx] for Extended/Eastern), and the
    > Unicode "script" property value alias for the Western and Eastern
    > digits or letters should be segregated, using a separate Script
    > property value (splitting the Arabic script, where it is significant,
    > just like it occurred for Georgian and Greek/Coptic alphabets).
    >

    There is no likelihood of that happening, simply for the sake of these
    digits.

    The original characters were just font variants; they were really split to a
    large extent because of the UBA (which I think in retrospect was a mistake,
    but c'est la vie, n'est-ce pas?).

    > Nothing will be changed for the existing Arabic script, but the
    > "Extended/Eastern Arabic" script (assigned with a new ISO 15924 code
    > and mapped with a new property alias in Unicode), will still borrow
    > most of its letters from the standard script without reencoding them.
    >
    > No character or block will be renamed (and I DO NOT propose
    > disunifying existing common Arabic letters, or assigning them to the
    > "Common" script), it should just be a better sub-classification, where
    > the characters are clearly distinguished between the two variants.
    >
    > Most Arabic characters should remain in the common "Arabic" script,
    > and those that are differentiated should be assigned in a
    > "Standard_Arabic" or "Extended_Arabic" script. But this may cause some
    > complication for the script inheritance in spans of texts (because the
    > "Arabic" script property value would behave a bit like what the
    > "Common" does for alphabetic scripts, i.e. like a group of scripts).
    >
    > Such change for the assigned script property value (if it's not
    > already stabilized) would require documentation, and changes in a few
    > other core or derived datafiles:
    >
    > - PropertyValueAliases.txt (adding two new property values for "sc"):
    > sc ; Arab ; Arabic # All forms, includes "sc=Arbc", "sc=Arbs" and
    > "sc=Arbx" in regexps)
    > sc ; Arbc ; Common_Arabic
    > sc ; Arbs ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
    > sc ; Arbx ; Extended_Arabic # (also includes "sc=Arbc" in regexps)
    >
    > - Scripts.txt (assigning the two new property values to remap existing
    > "Arabic")
    > - ArabicShaping.txt (possibly adding comments at end of lines where
    > this is not the Common Arabic)
    > - Joining-Groups.txt (same remark)
    > - BidiMirroring.txt (same remark)
    >
    > And in the description of some standard script identification and
    > segmentation algorithms. I don't know if IDNA should continue to use
    > "Arab" (all forms) or if it should segregate "Arbs" and "Arbx" (to
    > avoid mixing digits that are visually confusable), as it uses such
    > segmentation (note that these characters are canonically different,
    > for normalization purposes).
    >
    > Philippe.
    >
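
The concern about mixing visually confusable digits can be illustrated with Python's standard `unicodedata` module. This is a rough sketch only: `digit_sets_used` and `mixes_digit_sets` are hypothetical names, the approach assumes each decimal digit set is contiguous so that a set can be identified by the code point of its zero, and it is not the check defined by IDNA or by any Unicode specification.

```python
import unicodedata


def digit_sets_used(text):
    """Return the code points of the zero digits of every decimal digit
    set that occurs in `text`.

    Assumes each set is ten contiguous code points, so the zero of a
    digit's set is ord(ch) minus the digit's decimal value.
    """
    zeros = set()
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-decimal chars
        if value is not None:
            zeros.add(ord(ch) - value)
    return zeros


def mixes_digit_sets(text):
    """True if `text` contains decimal digits from more than one set."""
    return len(digit_sets_used(text)) > 1


same_set = mixes_digit_sets("\u0661\u0662")  # two Arabic-Indic digits
mixed = mixes_digit_sets("\u0661\u06F2")     # Arabic-Indic plus Extended

# Normalization does not fold one set into the other, so the two "four"
# digits stay distinct under both NFC and NFKC.
unchanged = unicodedata.normalize("NFKC", "\u06F4") == "\u06F4"
```

Because normalization leaves the two sets distinct, any policy on mixing them has to be applied as a separate check of this kind.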
    >
    >


