Re: [indic] Re: 28th IUC paper - Tamil Unicode New

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 22 2005 - 20:11:20 CDT


    Richard responded:

    > > That translates to: "it can be displayed with a dumb rendering engine
    > > and a simple font".
    >
    > Largely, yes. I suspect the default Unicode collation would also produce
    > the correct results.

    Only for the TUNE set, and only if the default table were set up
    following its binary order -- which could of course be done.

    > > In fact adding TUNE to Unicode "without any awareness of Tamil
    > > as a distinct script" is a recipe for disaster.
    >
    > It junks data in the current encoding. How else is it a recipe for
    > disaster?

    /head_desk

    There seems to be a lot of reality denial going on, presuming,
    apparently, that if only TUNE were encoded, the old data and
    the old encoding (i.e. what is currently in the standard) would
    go away. They won't.

    You would simply end up with two encodings of the same script,
    whether or not one of them is in the PUA (which would be
    essentially useless, as others have been pointing out). And those
    two encodings *would* coexist. The complexity that results doesn't
    scale linearly: in an attempt to make Tamil *simpler*, this
    proposal is heading toward a disaster in which it makes the
    encoding of Tamil ineluctably *more* complex. *Much* more complex.

    I'll say it again: Korean Hangul.

    Korean *should* be simple and straightforward.

    It isn't.

    Why? Because it wasn't encoded once in the standard -- it was
    encoded *FOUR* times.

    Doubt me? Examine the standard:

    Encoding #1: U+1100..U+11F9, as combining jamos

    Encoding #2: U+AC00..U+D7A3, as preformed syllables

    Encoding #3: U+3131..U+318E, as compatibility jamos

    Encoding #4: U+FFA0..U+FFDC, as halfwidth jamos

    The *same* Korean text is represented distinctly in each of
    those encodings.
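    To make that concrete, here's a quick Python sketch, using only
    the standard library's unicodedata module. The specific code
    points are my own picks for illustration -- the syllable GAG,
    written once per encoding:

        import unicodedata

        # The same Korean syllable, written four ways:
        jamo        = "\u1100\u1161\u11A8"  # Encoding #1: combining jamos G + A + final G
        preformed   = "\uAC01"              # Encoding #2: preformed syllable GAG
        compat_jamo = "\u3131\u314F\u3131"  # Encoding #3: compatibility jamos
        halfwidth   = "\uFFA1\uFFC2\uFFA1"  # Encoding #4: halfwidth jamos

        # Canonical equivalence ties #1 and #2 together ...
        assert unicodedata.normalize("NFC", jamo) == preformed
        assert unicodedata.normalize("NFD", preformed) == jamo

        # ... but #3 and #4 are only *compatibility* equivalent, so
        # canonical normalization leaves them as distinct, unequal strings.
        assert unicodedata.normalize("NFC", compat_jamo) != preformed
        assert unicodedata.normalize("NFC", halfwidth) != preformed

        # Even NFKC can't repair #3: compatibility jamos carry no
        # initial-vs-final distinction, so the trailing KIYEOK comes
        # back as an *initial* jamo, not the intended final.
        assert unicodedata.normalize("NFKC", compat_jamo) == "\uAC00\u1100"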

    And hey, sorting Encoding #2 is easy, because all the syllables
    are laid out in the collation order, so binary works just fine.
    Sound familiar?

    But sorting *Korean* in Unicode is a bloody awful nightmare with
    edge cases galore, because the encoding is such a mess to begin
    with. If you are dealing with any data originating from encoding
    #3 or #4, you have to put transducers in place to convert the
    representation, or get only partially correct results. And even
    for encodings #1 and #2, which are meant to work with each other
    and which have canonical equivalence relations built in, you
    *still* have funky edge cases, because the combining jamos are
    more expressive than the preformed syllables (which don't cover
    ancient Hangul), and you can't depend just on the binary order of
    the preformed syllables -- which was one of the big reasons for
    creating them in the first place.
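    Here's that edge case in runnable form. It uses the archaic
    initial PANSIOS (U+1140), which -- as far as I know --
    traditionally alphabetizes between SIOS and IEUNG:

        import unicodedata

        def nfc(s):
            return unicodedata.normalize("NFC", s)

        # Modern jamo sequences compose to preformed syllables under NFC ...
        assert nfc("\u1100\u1161") == "\uAC00"        # GA

        # ... but ancient Hangul has no preformed syllables, so it
        # stays as raw combining jamos even after normalization.
        ancient = nfc("\u1140\u1161")                 # archaic PANSIOS + A
        assert ancient == "\u1140\u1161"

        # Binary code point order then throws the ancient syllable in
        # front of *every* modern preformed syllable (U+AC00..U+D7A3),
        # instead of between "sa" and "a", where it belongs.
        sa, a = nfc("\u1109\u1161"), nfc("\u110B\u1161")
        assert sorted([sa, ancient, a]) == [ancient, sa, a]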

    Making encodings *more* complex does not make them simpler to
    process.

    Adding a *second* encoding for Tamil, no matter that it be
    divinely inspired and self-evident, does *NOT* make Unicode
    processing of Tamil data simpler.

    > > you have to make the software
    > > *aware* of the Tamil script to establish the equivalences between the
    > > existing Tamil encoding and the TUNE encoding.
    >
    > Are such canonical equivalences now permitted?

    If the claim were one of identity of interpretation, as for
    combining jamos versus an equivalent preformed Hangul syllable,
    then you'd be committing yourself to canonical equivalences.
    As far as I can tell, there is nothing in the TUNE table that
    cannot already be represented with the existing Tamil characters.

    But if you commit to introducing characters with canonical
    equivalences, you might as well give up right there. Such
    additions accomplish nothing except to force everyone to apply
    the canonical mappings when normalizing data. And normalization
    wouldn't map *to* the TUNE representations, but away from them.
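    There's already a well-known parallel in the standard: U+212B
    ANGSTROM SIGN is canonically equivalent (a singleton) to U+00C5,
    and every normalization form maps *away* from it. A TUNE character
    with a canonical equivalence to existing Tamil would suffer the
    same fate:

        import unicodedata

        # Normalized data can never contain U+212B; it always folds
        # to the other, "preferred" representation.
        for form in ("NFC", "NFD", "NFKC", "NFKD"):
            assert "\u212B" not in unicodedata.normalize(form, "\u212B")

        assert unicodedata.normalize("NFC", "\u212B") == "\u00C5"
        assert unicodedata.normalize("NFD", "\u212B") == "A\u030A"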

    > I suppose they could be made
    > equivalent in the default Unicode collation algorithm.

    Yes, if you didn't claim actual interpretive equivalence, but
    simply a compatibility equivalence, then you could import the
    complexity of the mapping into the collation algorithm. But
    any process that was not using a full-blown collation tailoring
    for Tamil, but expected normalization to do the equivalencing,
    would end up with the wrong answers.
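    What "importing the mapping into the collation algorithm" looks
    like in practice is something along these lines -- a sketch only,
    with invented PUA placeholders (U+E001, U+E002) standing in for
    TUNE code points, since no such assignments exist:

        # Fold hypothetical TUNE code points to the existing Tamil
        # representation before comparing. Every process that sorts,
        # searches, or matches has to apply this fold, or it gets
        # different answers than processes that do.
        TUNE_TO_TAMIL = {
            0xE001: "\u0B95",        # placeholder "TUNE KA" -> TAMIL LETTER KA
            0xE002: "\u0B95\u0BBF",  # placeholder "TUNE KI" -> KA + VOWEL SIGN I
        }

        def tamil_fold(text):
            return text.translate(TUNE_TO_TAMIL)

        texts = ["\u0B95\u0BBF", "\uE001", "\u0B95", "\uE002"]
        # Equivalent strings sort together under the tailored key, but
        # a plain binary sort of the same list still disagrees.
        assert sorted(texts, key=tamil_fold) != sorted(texts)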

    > Another, nasty
    > issue, is that if they were canonically equivalent, conversion from TUNE
    > characters to NFD (thus current Tamil) would make text dependent on
    > sophisticated rendering, and defeat a large part of the point of TUNE.

    Precisely.

    Or more correctly, it would defeat the entire point of TUNE.

    And more -- because the resulting encoding would be more complex
    than if TUNE had never been considered in the first place.

    > > Encoding TUNE, whether in the PUA or elsewhere, *without any
    > > awareness of Tamil as a distinct script*, defeats the purpose
    > > of an encoding in the first place.
    >
    > Please enlighten me. What's fundamentally wrong with having LATIN LETTER
    > TAMIL K, LATIN LETTER TAMIL KA, etc?

    Huh? Other than the fact that TAMIL KA isn't a Latin letter?

    I suppose we could have CJK IDEOGRAPH TAMIL K, too, for that matter,
    but I don't see how that helps any. :-)

    What I was responding to was your claim (or perhaps your
    interpretation of the implicit claim behind the TUNE proposal)
    that New Tamil could be rendered in a dumb way "without any
    awareness of Tamil as a distinct script". That is of course
    true at a certain level, particularly if you consider the
    issue *only* for New Tamil, as if this were simply another
    8-bit font hack solution. It *isn't* true once you try to make
    New Tamil work *in* Unicode -- at that point the fact that these
    are another representation of Tamil characters becomes critical
    to proper behavior of everything, and you *CANNOT* treat the
    encoding as if it were a de novo simple script. It isn't: as
    proposed it is a *RE*-encoding of an existing encoded script
    with complex behavior. That is the difference.

    > I thought scripts were chiefly
    > relevant in Unicode because characters in the same script tend to have
    > similar properties and have to work together.

    They do have to work together, but not because they have "similar
    properties". Characters in a script often have very distinct
    properties -- e.g. base characters versus combining marks.

    Scripts are chiefly relevant because they delimit the
    identity of characters and because in implementations they
    trigger distinct rendering logic and font choices.

    --Ken


