From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Nov 01 2007 - 15:18:53 CST
Once again, just in time for the holidays, the Unicode list
has come around to one of its perennial favorite topics:
how 17 planes isn't enough codespace, how software will
break when we "inevitably" run out of codes for characters,
and what a shame it is to be stuck with such a limited
and architecturally flawed construct, given all the 30 bezillion
unencoded characters waiting to be encoded.
> <vunzndi at vfemail dot net> wrote:
>
> >>> There are advantages to utf-8
> >>
> >> And many many more advantages to not breaking working code.
> >
> > And even more to making code hard to break, Y2K, et al.
>
> The 17-plane limit was determined on the basis that the scope of
> 10646/Unicode, to encode abstract text characters rather than specific
> instances of glyphs, would safely fit within such a limit. To this
> date, this has not been proven false.
Doug has this right, in my opinion.
Just yesterday, I posted the first full version of the Unicode
names list for early review of Unicode 5.1. My tools report
that as having 100,713 graphic and control characters -- including
the unlisted but obviously massive numbers of Han characters in
the standard.
So that's where we stand after 18 *years* of concerted effort,
by literally hundreds of people in the character encoding
field, to encode every reasonable character that anyone could
lay their hands on documentation for.
That leaves 873,883 code points to go before the millennial
catastrophe, when UTF-16 and all UTF-16 software breaks,
and airplanes start falling from the sky.
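The subtraction behind that figure can be checked with a short Python sketch. The breakdown is an assumption, since the message gives only the result: it takes the full 17-plane codespace, sets aside the surrogate code points and the three private use areas, and then subtracts the 100,713 characters reported for 5.1. The variable names are illustrative only.

```python
# Sketch of one accounting that reproduces the "code points to go" figure.
# ASSUMPTION: surrogates and private use are excluded before subtracting
# the encoded characters; the message does not spell this out.

PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000            # 65,536

codespace = PLANES * CODE_POINTS_PER_PLANE  # U+0000..U+10FFFF

surrogates = 0xDFFF - 0xD800 + 1            # U+D800..U+DFFF
bmp_private_use = 0xF8FF - 0xE000 + 1       # U+E000..U+F8FF
plane15_private_use = 0xFFFFD - 0xF0000 + 1     # U+F0000..U+FFFFD
plane16_private_use = 0x10FFFD - 0x100000 + 1   # U+100000..U+10FFFD

not_available = (surrogates + bmp_private_use
                 + plane15_private_use + plane16_private_use)

encoded_in_5_1 = 100_713                    # count quoted in the message

remaining = codespace - not_available - encoded_in_5_1

print(f"codespace:      {codespace:>9,}")
print(f"not available:  {not_available:>9,}")
print(f"encoded (5.1):  {encoded_in_5_1:>9,}")
print(f"remaining:      {remaining:>9,}")
```

Under those assumptions the remainder comes out to 873,883, which matches the number above.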
Now, I'll grant that there are some big ticket scripts still
to go, and more swaths of Han characters to plow before
we are done. Just take a look at the green and yellow
entries (post 5.1) noted in the Unicode character pipeline page:
http://www.unicode.org/alloc/Pipeline.html
18 years on, Egyptian hieroglyphs are in their last round
of balloting and are close to getting into the standard.
That's 1071 characters, accounting for the basic Gardiner
set, some Gardiner extensions, and elements for numerals.
Sure there are more Egyptian hieroglyphs out there, but
at the rate the Egyptological community is going to move
on this, we are unlikely to see more than small extensions
of a few dozen here and there for some time to come. And
talk of needing a whole plane for Egyptian hieroglyphics
is basically Halloween harum-scarum talk.
CJK Extension C is also in its last round of balloting.
That now includes 4149 characters -- which *is* a lot of
characters compared to most scripts. But the last big
chunk of Han that went in was CJK Extension B, 42,711
Han characters in March, 2001. What that means is that
it has taken the IRG and WG2 7 years to prepare the
next 4000 or so Han characters for encoding after
Extension B -- which had picked all the low-hanging
fruit from the big dictionaries. CJK Extension D will
probably show up in less time than Extension C did,
given IRG's use of better tools for cross-checking
submissions now, but still we are dealing with the difficult
long tail of CJK submissions, rather than lots and lots
of obvious missing characters.
Even after CJK Extension C is added to the
standard, there are still 16,694 code points on Plane 2
and the BMP reserved for CJK unified ideographs.
(4DB6..4DBF, 9FC6..9FFF, 2A6D7..2A6FF, and the big
chunk for new extensions: 2B735..2F7FF). I don't think
I'm going too far out on a limb to suggest that prospective
Extensions D and E will fit comfortably in the existing
space. It won't be until somebody gets the submissions
together for Zhuang sawndip that WG2 will need to crack
open the until now unused Plane 3 for Han characters.
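The sizes of the four reserved ranges listed above can be tallied with a small Python sketch. It assumes the endpoints are inclusive, as is conventional for code point ranges, and it derives the plane number by shifting the code point right by 16 bits. The helper names are illustrative.

```python
# Tally the reserved CJK ranges quoted in the message.
# ASSUMPTION: both endpoints of each range are inclusive.

RESERVED = [
    (0x4DB6, 0x4DBF),
    (0x9FC6, 0x9FFF),
    (0x2A6D7, 0x2A6FF),
    (0x2B735, 0x2F7FF),
]

def size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1

def plane(code_point):
    """Plane number: each plane holds 0x10000 code points."""
    return code_point >> 16

sizes = [size(first, last) for first, last in RESERVED]
total = sum(sizes)

for (first, last), n in zip(RESERVED, sizes):
    print(f"{first:04X}..{last:04X}  plane {plane(first)}  {n:>6,} code points")
print(f"total: {total:,}")
```

Counted this way the four ranges hold 10, 58, 41 and 16,587 code points, for a total of 16,696. That is two more than the figure quoted above, so treat the exact total as approximate; the conclusion about available room is unaffected.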
The other big historic ideographic scripts (Tangut, Jurchen, Khitan)
all fit comfortably within Plane 1, with plenty of room
to spare. We don't have an accurate count yet for old Yi
ideographs, but the unified character encoding for it
is likely to be a few thousand, not in the tens of thousands --
which is the number associated with the paleographic glyph
count, not actually distinct characters.
> Code that uses UTF-16, SCSU, or other encoding forms that assume the
> 17-plane limit is not broken, or break-prone, in the same sense as code
> written under the assumption it would be replaced or upgraded before the
> turn of the century.
Yep.
--Ken