Re: Jumping Cursor. Was: Right-to-Left Punctuation Problem

From: Gregg Reynolds (unicode@arabink.com)
Date: Tue Aug 02 2005 - 19:33:19 CDT

    John Hudson wrote:
    > Gregg Reynolds wrote:
    >
    >> Adding to the already existing - what, 5? 6? - different ways of
    >> encoding each digit. Let's count the ways:
    >>
    >> 0030-0039 DIGIT ZERO etc
    >> 0660-0669 ARABIC-INDIC
    >> 06F0-06F9 EXTENDED ARABIC-INDIC
    >> 0966-096F DEVANAGARI
    >> 09E6-09EF BENGALI
    >> 0A66-0A6F GURMUKHI
    >> 0AE6-0AEF GUJARATI
    >> Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan,
    >> Myanmar, Ethiopic, Khmer, Mongolian, Limbu, Osmanya, various
    >> mathematical digit characters, Japanese full-width, etc. etc. Twenty
    >> one and counting.
    >
    >
    > Most of which look different, some of which function differently (i.e.
    > use different counting systems that do not correspond to our decimal
    > digit system). I don't think there is any expectation that one would be
    > able to perform cross-script arithmetic using Mongolian and Ethiopic
    > numeral characters. What you are proposing is something quite other: two
    > ways of encoding the *same* numerals. Your new numerals would look the
    > same, represent the same numbers, need to be considered the same for
    > searches, sorts and mathematical functions. They would be, in fact, the
    > same characters encoded twice.
    >

    Ok. I agree that is a valid observation. I think, anyway. I have to
    ponder it a bit more. I think it depends on what the meaning of "same"
    is. Aren't 0030-9 and 0660-9 really the "same"? My understanding of
    Unicode is that it doesn't address these semantics - 0-9 are just
    characters, not mathematical signs. (The fact that they have "number"
    property only means they all have the same formal category, not that
    they denote mathematical values; it could just as easily have been
    called the "fdsaflkh" property. It's up to a higher level protocol to
    interpret "fdsaflkh" characters as mathematical signs.) Mathematically,
    any characters that denote the mathematical values 0-9 may be considered
    "the same", regardless of graphical form. The latter is a mere matter
    of implementation (font) technology.
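
    To make the "same value, different character" point concrete, here is a
    minimal sketch, assuming Python's standard unicodedata module as the
    "higher level protocol". The names DIGIT_ZEROS and digit_values are
    made up for illustration; the ranges are the ones listed above.

```python
import unicodedata

# (label, code point of that script's digit zero), taken from the list above.
DIGIT_ZEROS = [
    ("ASCII", 0x0030),
    ("ARABIC-INDIC", 0x0660),
    ("EXTENDED ARABIC-INDIC", 0x06F0),
    ("DEVANAGARI", 0x0966),
]

def digit_values(zero):
    """Return the decimal values the character database records for zero..zero+9."""
    return [unicodedata.digit(chr(zero + i)) for i in range(10)]

# Every script's ten digits map onto the same values 0..9.
values_by_script = {label: digit_values(zero) for label, zero in DIGIT_ZEROS}

# int() accepts a run of decimal digits from any one of these scripts.
ascii_number = int("2005")
arabic_indic_number = int("\u0662\u0660\u0660\u0665")
```

    The two strings share no code points, yet a routine that consults the
    digit value treats them as the same number.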

    > But this is the kicker, as already mentioned yesterday: *all* those
    > numeral characters you listed share the same directionality, and all
    > numbers in Unicode are encoded most-significant digit first. Maybe if

    Well, typographically they are all LTR, but that is completely
    orthogonal to encoding syntax (polarity). It occurs to me now that
    you've put your finger on the problem. Which is, that these
    "characters" should in fact be treated as characters, and not
    mathematical signs, in order to be consistent (ha!) with Unicode
    principles. Mathematical interpretation comes in at a higher
    level protocol. This is consistent with Unicode design principles, as I
    understand them. So assume that RTL 0-9 are just another set of
    characters, w/out mathematical semantics, that all happen to have a
    property called "number". They will be treated no differently than any
    other RTL character w/r/t typesetting; w/r/t math routines, they will
    be treated no differently than any other "number" characters (math
    routines must merely interpret polarity correctly.) In fact, there is
    no need to stipulate any graphical form. (I note that MSWord happily
    changes the form of numeric digit characters from European to
    Arabic-Indic based on user preferences. Does it change the underlying
    encoding? Dunno, never checked.)
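
    What "interpret polarity correctly" might amount to in a math routine
    can be sketched roughly as follows. This is only an illustration: the
    function names are hypothetical, and the least-significant-first
    storage order is the proposal under discussion, not anything Unicode
    defines.

```python
import unicodedata

def value_msd_first(digits):
    """Read digits stored most-significant digit first (how numbers are encoded today)."""
    total = 0
    for ch in digits:
        total = total * 10 + unicodedata.digit(ch)
    return total

def value_lsd_first(digits):
    """Read digits stored least-significant digit first (the hypothetical polarity)."""
    total = 0
    for place, ch in enumerate(digits):
        total += unicodedata.digit(ch) * 10 ** place
    return total

# The same number, 2005, stored under each polarity.
msd_result = value_msd_first("2005")
lsd_result = value_lsd_first("5002")
```

    Both calls yield 2005; the only thing the routine needs to know is which
    end of the stored string carries the least significant digit.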

    > computing had been invented in the Middle East it would be the other way
    > around, with the least significant digit encoded first, and the various
    > standards would oblige all LTR writing systems to function
    > bidirectionally with regard to numerals.

    But the point is that absolute directionality is not the only design
    choice. We would get along just fine with relative polarity (relative
    to writing direction, that is.)

    >
    > Now, when it comes to things like parentheses, the mirrored stuff does
    > my head in and I really don't see the point of it. I'm guessing that it
    > confuses application developers also, since it is implemented with so
    > little consistency.

    You can say that again. But in this respect Unicode is already
    obsolete. The only justification I can see for ambiguous
    directionality, mirroring, etc. is trying to save space (code space, I
    mean). Fifty years from now (or ten?) chars will be 64 bits, with an
    essentially infinite code space, so there will be no justification for
    either unification or directional ambiguity.
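
    For reference, mirroring is recorded as a per-character property rather
    than as separate code points. A small sketch, again assuming Python's
    unicodedata module (is_mirrored and report are illustrative names):

```python
import unicodedata

def is_mirrored(ch):
    """True if the character database flags ch as mirrored in bidirectional text."""
    return bool(unicodedata.mirrored(ch))

# Brackets carry the flag; letters and digits do not.
samples = ["(", ")", "[", "<", "a", "0"]
report = {ch: is_mirrored(ch) for ch in samples}
```

    A single code point for "(" serves both directions, and it is left to
    the application to pick the glyph, which is where the inconsistency
    complained about above can creep in.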

    -gregg


