From: John H. Jenkins (jenkins@apple.com)
Date: Tue Nov 12 2002 - 12:38:51 EST
On Tuesday, November 12, 2002, at 09:03 AM, Andrew C. West wrote:
> BTW, what is "CJK Unified Ideographs Extension C" intended to include?
> Surely not any more ordinary Han ideographs - with over 70,000
> ideographs already encoded, there can't be so many genuine ideographs
> that still need encoding as to warrant a whole new plane. However there
> is a real need to encode oracle bone characters and other ancient
> epigraphic forms of Han ideographs. Is this (hopefully) what Extension C
> is intended for?
>
Nope. We're still doing modern stuff.
It is unlikely in the extreme that we'll actually *need* a whole plane
for new ideographs. Extension C is currently big enough, however, that
if we were to accommodate it via separate encoding of everything we'd
use up the rest of Plane 2. And there's still no end in sight.
To some extent, we're having to deal with massive turtle--er, fecal
matter being dumped uncritically into the bin consisting largely of
things which are obviously variants of existing characters. This we
will deal with to an extent by using variation selectors. (Many of
Unicode's proposed additions are unofficial simplifications which will
also be handled via variation selectors.)
Beyond that, it is incredible just how many obscure characters there
are once you start looking for them. The PRC's submission includes
large numbers of place names, for example, and I dread to think how
many more of *those* there may be. HKSAR has come up with more
Cantonese- or Hong Kong-specific characters. The only non-Mandarin
dialect to receive *any* attention at all is Cantonese, and despite the
efforts of the HKSAR that's been rather unsystematic. Unicode's
proposed characters include a few Cantonese-specific ones that we were
able to dig up without much effort.
And all this leaves out stuff like cute names for Hong Kong race
horses, frogs-in-wells, and things like that.
All in all, I wouldn't be surprised if there were as many as ten
thousand or so genuinely distinct characters in modern use which have
yet to be encoded. And there are a number of borderline cases from
pre-modern texts where it looks like it's probably a variant but it may
not be. (Of course, I also estimated the total number of genuine Han
ideographs to be under eighty thousand, which just goes to show how
much *I* know.)
Oracle bone forms and other older versions of the Han ideographs are
something we haven't even got a good model for how to handle yet.
==========
John H. Jenkins
jenkins@apple.com
jhjenkins@mac.com
http://www.tejat.net/