Re: N2515: Request for Roadmap - plane 3

From: John H. Jenkins
Date: Tue Nov 12 2002 - 12:38:51 EST

    On Tuesday, November 12, 2002, at 09:03 AM, Andrew C. West wrote:

    > BTW, what is "CJK Unified Ideographs Extension C" intended to include?
    > Surely not any more ordinary Han ideographs - with over 70,000
    > ideographs already encoded, there can't be so many genuine ideographs
    > that still need encoding as to warrant a whole new plane. However there
    > is a real need to encode oracle bone characters and other ancient
    > epigraphic forms of Han ideographs. Is this (hopefully) what Extension C
    > is intended for?

    Nope. We're still doing modern stuff.

    It is unlikely in the extreme that we'll actually *need* a whole plane
    for new ideographs. Extension C is currently big enough, however, that
    if we were to accommodate it via separate encoding of everything we'd
    use up the rest of Plane 2. And there's still no end in sight.

    To some extent, we're having to deal with massive turtle--er, fecal
    matter being dumped uncritically into the bin consisting largely of
    things which are obviously variants of existing characters. This we
    will deal with to an extent by using variation selectors. (Many of
    Unicode's proposed additions are unofficial simplifications which will
    also be handled via variation selectors.)

    Beyond that, it is incredible just how many obscure characters there
    are once you start looking for them. The PRC's submission includes
    large numbers of place names, for example, and I dread to think how
    many more of *those* there may be. HKSAR has come up with more
    Cantonese- or Hong Kong-specific characters. The only non-Mandarin
    dialect to receive *any* attention at all is Cantonese, and despite the
    efforts of the HKSAR that's been rather unsystematic. Unicode's
    proposed characters include a few Cantonese-specific ones that we were
    able to dig up without much effort.

    And all this leaves out stuff like cute names for Hong Kong race
    horses, frogs-in-wells, and things like that.

    All in all, I wouldn't be surprised if there were as many as ten
    thousand or so genuinely distinct characters in modern use which have
    yet to be encoded. And there are a number of borderline cases from
    pre-modern texts where it looks like it's probably a variant but it may
    not be. (Of course, I also estimated the total number of genuine Han
    ideographs to be under eighty thousand, which just goes to show how
    much *I* know.)

    Oracle bone forms and other older versions of the Han ideographs are
    something we haven't even got a good model for how to handle yet.

    John H. Jenkins
