Codespace Anxiety Redux (was: Re: Level of Unicode support required ...)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Nov 01 2007 - 15:18:53 CST

    Once again, just in time for the holidays, the Unicode list
    has come around to one of its perennial favorite topics:
    how 17 planes isn't enough codespace, how software will
    break when we "inevitably" run out of codes for characters,
    and what a shame it is to be stuck with such a limited
    and architecturally flawed construct, given all the 30 bezillion
    unencoded characters waiting to be encoded.

    > <vunzndi at vfemail dot net> wrote:
    >
    > >>> There are advantages to utf-8
    > >>
    > >> And many many more advantages to not breaking working code.
    > >
    > > And even more to making code hard to break, Y2K, et al.
    >
    > The 17-plane limit was determined on the basis that the scope of
    > 10646/Unicode, to encode abstract text characters rather than specific
    > instances of glyphs, would safely fit within such a limit. To this
    > date, this has not been proven false.

    Doug has this right, in my opinion.

    Just yesterday, I posted the first full version of the Unicode
    names list for early review of Unicode 5.1. My tools report
    that as having 100,713 graphic and control characters -- including
    the unlisted but obviously massive numbers of Han characters in
    the standard.

    So that's where we stand after 18 *years* of concerted effort,
    by literally hundreds of people in the character encoding
    field, to encode every reasonable character that anyone could
    lay their hands on documentation for.

    That leaves 873,883 code points to go before the millennial
    catastrophe, when UTF-16 and all UTF-16 software breaks,
    and airplanes start falling from the sky.
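As a rough check on that arithmetic, here is a minimal Python sketch of one breakdown that reproduces the 873,883 figure. The post does not say how the number was derived, so the choice of what to subtract (surrogates and private-use code points, but not the 66 noncharacters) is an assumption, and the names used are purely illustrative.

```python
# Sketch only: one breakdown that happens to reproduce the figure in the post.
# Which ranges are excluded is an assumption, not something the post states.

CODESPACE = 17 * 0x10000                        # 17 planes of 65,536 code points
SURROGATES = 0xDFFF - 0xD800 + 1                # D800..DFFF, unusable for characters
BMP_PRIVATE_USE = 0xF8FF - 0xE000 + 1           # E000..F8FF
# Planes 15 and 16 are private use, except the last two code points of each
# plane (xFFFE, xFFFF), which are noncharacters.
SUPPLEMENTARY_PRIVATE_USE = 2 * (0x10000 - 2)
ENCODED = 100_713                               # count reported in the post


def remaining_code_points(encoded: int = ENCODED) -> int:
    """Code points left after removing surrogates, private use, and encoded characters.

    Noncharacters are deliberately NOT subtracted here (assumption), because
    that is what makes the result match the quoted figure.
    """
    reserved = SURROGATES + BMP_PRIVATE_USE + SUPPLEMENTARY_PRIVATE_USE
    return CODESPACE - reserved - encoded


if __name__ == "__main__":
    print(f"{remaining_code_points():,}")
```

Under these assumptions the encoded repertoire comes to roughly a tenth of the code points available for standard assignment.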

    Now, I'll grant that there are some big ticket scripts still
    to go, and more swaths of Han characters to plow before
    we are done. Just take a look at the green and yellow
    entries (post 5.1) noted in the Unicode character pipeline page:

    http://www.unicode.org/alloc/Pipeline.html

    18 years on, Egyptian hieroglyphs are in their last round
    of balloting and are close to getting into the standard.
    That's 1071 characters, accounting for the basic Gardiner
    set, some Gardiner extensions, and elements for numerals.
    Sure there are more Egyptian hieroglyphs out there, but
    at the rate the Egyptological community is going to move
    on this, we are unlikely to see more than small extensions
    of a few dozen here and there for some time to come. And
    talk of needing a whole plane for Egyptian hieroglyphics
    is basically Halloween harum-scarum talk.

    CJK Extension C is also in its last round of balloting.
    That now includes 4149 characters -- which *is* a lot of
    characters compared to most scripts. But the last big
    chunk of Han that went in was CJK Extension B, 42,711
    Han characters in March, 2001. What that means is that
    it has taken the IRG and WG2 7 years to prepare the
    next 4000 or so Han characters for encoding after
    Extension B -- which had picked all the low-hanging
    fruit from the big dictionaries. CJK Extension D will
    probably show up in less time than Extension C did,
    given IRG's use of better tools for cross-checking
    submissions now, but still we are dealing with the difficult
    long tail of CJK submissions, rather than lots and lots
    of obvious missing characters.

    Even after CJK Extension C is added to the
    standard, there are still 16,694 code points on Plane 2
    and the BMP reserved for CJK unified ideographs.
    (4DB6..4DBF, 9FC6..9FFF, 2A6D7..2A6FF, and the big
    chunk for new extensions: 2B735..2F7FF). I don't think
    I'm going too far out on a limb to suggest that prospective
    Extensions D and E will fit comfortably in the existing
    space. It won't be until somebody gets the submissions
    together for Zhuang sawndip that WG2 will need to crack
    open the until now unused Plane 3 for Han characters.
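The sizes of those reserved ranges are easy to tally. The sketch below, with hypothetical helper names, computes the length and plane of each range exactly as listed above. Summed as written, the four ranges come to 16,696 code points, two more than the 16,694 quoted; the post does not say how its figure was derived, so the small difference is left unexplained here.

```python
# Sketch only: tally the reserved CJK ranges exactly as listed in the post.

RESERVED_RANGES = [
    (0x4DB6, 0x4DBF),
    (0x9FC6, 0x9FFF),
    (0x2A6D7, 0x2A6FF),
    (0x2B735, 0x2F7FF),
]


def range_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


def plane_of(code_point: int) -> int:
    """Plane number: each plane holds 0x10000 code points."""
    return code_point >> 16


def summarize(ranges):
    """Return (label, plane, size) for each inclusive range."""
    return [
        (f"{first:04X}..{last:04X}", plane_of(first), range_size(first, last))
        for first, last in ranges
    ]


if __name__ == "__main__":
    rows = summarize(RESERVED_RANGES)
    for label, plane, size in rows:
        print(f"{label:>14}  plane {plane}  {size:>6,}")
    print(f"{'total':>14}           {sum(size for _, _, size in rows):>6,}")
```

Almost all of the space sits in the last range, which is where prospective later extensions would be expected to land.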

    The other big historic ideographic scripts (Tangut, Jurchen, Khitan)
    all fit comfortably within Plane 1, with plenty of room
    to spare. We don't have an accurate count yet for old Yi
    ideographs, but the unified character encoding for it
    is likely to run to a few thousand, not tens of thousands --
    the latter being the number associated with the paleographic
    glyph count, not the number of actually distinct characters.

    > Code that uses UTF-16, SCSU, or other encoding forms that assume the
    > 17-plane limit is not broken, or break-prone, in the same sense as code
    > written under the assumption it would be replaced or upgraded before the
    > turn of the century.

    Yep.

    --Ken


