From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 05 2007 - 20:09:22 CDT
A propos the discussion about 17 planes, UTF-16, and
extraterrestrial characters, I have gone ahead and done
the preliminary calculations on what we can expect in
terms of numbers of characters for Unicode 5.1, now
due sometime next spring, based on the current contents
of Amendments 3 and 4 to 10646:2003.
Comparing Unicode 5.0 and 5.1 for the main figures
of concern:
                        5.0       5.1
  BMP characters      52013     53439
  SMP+ characters     47007     47315
  Total characters    99020    100754
  Total designated   238667    240401
  Total reserved     875445    873711
"Characters" here refers to the sum of regular graphic
characters and Unicode format controls, the "traditional"
Unicode count.
"Designated" also includes ISO control codes, noncharacters,
private use characters, and the surrogate code points.
"Reserved" is everything else -- the totally unassigned
code points still available for encoding characters.
As you can see, we have hardly made a dent in that figure.
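
For anyone who wants to double-check the table, here is a small sketch in Python. The figures are copied from the table; the dictionary keys and the function name are just illustrative. It checks that BMP plus supplementary characters give the total, and that designated plus reserved code points fill the whole 17-plane codespace.

```python
# Sanity check of the figures in the table (numbers copied from the message).
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points in each plane
CODESPACE = PLANES * CODE_POINTS_PER_PLANE  # 1,114,112 code points in all

counts = {
    "5.0": {"bmp": 52013, "smp_plus": 47007, "total": 99020,
            "designated": 238667, "reserved": 875445},
    "5.1": {"bmp": 53439, "smp_plus": 47315, "total": 100754,
            "designated": 240401, "reserved": 873711},
}

def is_consistent(version):
    """True if the rows for this version add up as the definitions say."""
    c = counts[version]
    characters_ok = c["bmp"] + c["smp_plus"] == c["total"]
    codespace_ok = c["designated"] + c["reserved"] == CODESPACE
    return characters_ok and codespace_ok

for version, c in counts.items():
    share = c["reserved"] / CODESPACE
    print(version, is_consistent(version), f"{share:.1%} still reserved")
```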
Also, to give you a concrete idea of the current character
encoding "velocity", if you take the number of characters
added since the last big anomalous jump in content
(Extension B in 2001), and average it over the time from
2001 to the anticipated release of Unicode 5.1 in 2008,
the per annum character encoding rate for WG2 and the UTC
is 944 characters/year (and trending down).
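
The rate is a plain average, sketched below. Note the assumption: the message does not give the starting count, so the sketch uses 94,140, the published character count for Unicode 3.1 (the 2001 release that included Extension B). The function and constant names are illustrative only.

```python
def encoding_rate(count_start, count_end, year_start, year_end):
    """Average number of characters added per year over the interval."""
    return (count_end - count_start) / (year_end - year_start)

# ASSUMPTION: 94,140 is the Unicode 3.1 (2001) character count.
# That figure is not stated in this message.
UNICODE_3_1_COUNT = 94140
UNICODE_5_1_COUNT = 100754  # "Total characters" for 5.1 in the table

rate = encoding_rate(UNICODE_3_1_COUNT, UNICODE_5_1_COUNT, 2001, 2008)
print(int(rate))  # whole characters per year, fraction dropped
```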
Now we know that some large collections are still to
go, particularly for the various East Asian ideographic
collections. In addition to CJK Extensions C and D, there
is also Old Hanzi (seal script, etc.), Tangut, and Khitan.
And there are more Egyptian hieroglyphs and Sumerian
cuneiform to go. Let's take some worst case scenarios
and assume those all get done in 2008 and all come in
on the large side:
CJK Extension C: 4213
CJK Extension D: 8000
Old Hanzi: 8000
Tangut: 5910
Khitan: 5000
Yi ideographs: 7000
Egyptian basic: 1063
Egyptian ext: 8000
Cuneiform: 1000
O.k., that's another 48,186 characters. Let's assign all these
heavy hitters to allocations, and *then* assume that the
WG2 and UTC committees will still find enough left over to
keep plugging away at 1000 characters per year, indefinitely.
How long have we got?
(873,711 - 48,186) / 1000 = 825 years
Oh dear, it looks like I underestimated before when I said it
would take 800 years to fill the 17 planes.
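
The arithmetic, as a short Python sketch for anyone who wants to plug in their own worst-case guesses. The sizes are the ones listed above; the variable names are illustrative.

```python
# Worst-case sizes for the large collections, as listed in the message.
proposed = {
    "CJK Extension C": 4213,
    "CJK Extension D": 8000,
    "Old Hanzi": 8000,
    "Tangut": 5910,
    "Khitan": 5000,
    "Yi ideographs": 7000,
    "Egyptian basic": 1063,
    "Egyptian ext": 8000,
    "Cuneiform": 1000,
}

RESERVED_AFTER_5_1 = 873711  # "Total reserved" for Unicode 5.1
RATE_PER_YEAR = 1000         # assumed steady encoding rate after the big blocks

heavy_hitters = sum(proposed.values())          # 48,186 characters
remaining = RESERVED_AFTER_5_1 - heavy_hitters  # code points left afterwards
years_left = remaining // RATE_PER_YEAR         # whole years until exhaustion

print(heavy_hitters, remaining, years_left)
```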
Quick, someone get busy on contacting the Orionids!
--Ken