RE: [OT] non-terrestrial writing systems

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jun 05 2007 - 09:25:13 CDT


    Doug Ewell wrote on Tuesday, June 5, 2007 05:32 to the Unicode Mailing List:
    >
    > Back in the day when ISO 10646 was still 31 bits wide and the proposal was
    > made to limit it to 17 planes, as Unicode already was, there were quite a
    > few, apparently serious, objections that this would be a regrettable,
    > Y2K-like limitation because of the eventual discovery of non-terrestrial
    > scripts that would need the extra coding space. I think some of us who
    > remember this being portrayed as a genuine technical flaw in Unicode still
    > tend to wince when the topic is brought up, even if the humorous intent is
    > clear to everyone else.

    If there was a flaw, it did not originate in Unicode itself but with the
    designers of the UTF-16 encoding, which was first released as an RFC and then
    adopted by ISO before being made part of the Unicode standard.

    Nothing was ever prepared to allow a possible extension of UTF-16 to more
    than 17 planes (and nothing has been done since then to allocate more
    surrogates in the BMP, which would make this possible using 3 surrogates).
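
    As a minimal sketch of where the 17-plane ceiling comes from, the standard
    surrogate-pair arithmetic can be written out as below. The function and
    constant names are illustrative assumptions, not taken from any library.

```python
# Standard UTF-16 surrogate ranges: each half of a pair carries 10 payload bits.
HIGH_START = 0xD800  # high (lead) surrogates: U+D800..U+DBFF
LOW_START = 0xDC00   # low (trail) surrogates: U+DC00..U+DFFF


def encode_pair(cp):
    """Split a supplementary codepoint into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary codepoint")
    v = cp - 0x10000                       # 20-bit offset above the BMP
    return HIGH_START + (v >> 10), LOW_START + (v & 0x3FF)


def decode_pair(high, low):
    """Recombine a (high, low) surrogate pair into one codepoint."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# The largest pair gives the highest codepoint UTF-16 can express.
MAX_CP = decode_pair(0xDBFF, 0xDFFF)
# The BMP plus 2**20 supplementary codepoints is exactly 17 planes of 65536.
PLANES = (MAX_CP + 1) // 0x10000
```

    With only 1024 high and 1024 low surrogates, a pair can address 2**20
    codepoints beyond the BMP, so nothing above plane 16 is reachable.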

    If one ever wants 31-bit codepoints, the only way is to allocate a new set
    of surrogates, either within the PUA block (but this may conflict with many
    current uses of the PUA block in the BMP) or within the special plane 14.
    The latter would require 2 supplementary codepoints, each one using 2
    normal surrogates (i.e. a total of 4 surrogates, coding 31-bit codepoints
    in 64 bits), and it would break parsers that expect a codepoint to end
    after the first 2 surrogates.
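
    The storage cost of the plane 14 option can be checked with an ordinary
    UTF-16 codec. This is only a sketch: the two codepoints below are arbitrary
    plane 14 values picked for illustration, not proposed or assigned
    "supplementary surrogates".

```python
# Two arbitrary plane-14 codepoints, hypothetical stand-ins for a pair of
# "supplementary surrogates"; no such assignment exists in Unicode.
FIRST, SECOND = 0xE1000, 0xE1001

text = chr(FIRST) + chr(SECOND)
encoded = text.encode("utf-16-be")

# Split the byte string into 16-bit code units.
code_units = [
    int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)
]
storage_bits = len(encoded) * 8

# A conforming UTF-16 decoder sees two separate codepoints, not one.
decoded = encoded.decode("utf-16-be")
```

    The sequence occupies four code units (64 bits), and a standard decoder
    returns two codepoints, each complete after its own surrogate pair.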

    In all cases, such an extension would require breaking the existing
    conformance rules for standard UTF-16 decoding and use, so it could not
    keep the same "UTF-16" identifier (it might be called "UTF-16X" instead).

    (The most efficient way to reach the 30-bit limit would be to have another
    10-bit-wide block in the BMP, but chances are now very low that this will
    ever be possible: the convenient 1024-codepoint space that remained between
    the Hangul syllables and the existing surrogates is now reserved for Hangul
    extensions.)
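
    The 30-bit figure is simple arithmetic on 10-bit blocks, sketched below
    under the assumption that every code unit in a sequence carries 10 payload
    bits. The helper name is hypothetical, and no three-surrogate scheme is
    defined by any standard.

```python
def capacity(units, payload_bits_per_unit=10):
    """Return (storage bits, payload bits, addressable codepoints)
    for a sequence of 16-bit code units that each carry a fixed payload."""
    storage = units * 16
    payload = units * payload_bits_per_unit
    return storage, payload, 2 ** payload


pair = capacity(2)    # today's surrogate pairs
triple = capacity(3)  # hypothetical sequence using a third 10-bit block
```

    Two units give 20 payload bits in 32 bits of storage, while three would
    give 30 payload bits in 48 bits, compared with 64 bits for four surrogates.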

    However, to make a "UTF-16X" extension compatible with strict
    implementations of UTF-16 (even if they would see several codepoints, each
    currently considered valid as a character, instead of just one), the only
    remaining solution is to allocate 2 blocks of supplementary surrogates in
    the special plane.

    But do we need such an extension? Note that there are now variation
    selectors to qualify existing characters, without having to encode many
    compatibility characters in the future.


