Re: Code pages and Unicode

From: Ken Whistler <kenw_at_sybase.com>
Date: Fri, 19 Aug 2011 17:03:41 -0700

On 8/19/2011 2:53 PM, Benjamin M Scarborough wrote:
> Whenever somebody talks about needing 31 bits for Unicode, I always think of the hypothetical situation of discovering some extraterrestrial civilization and trying to add all of their writing systems to Unicode. I imagine there would be little to unify outside of U+002E FULL STOP.

It is the *terrestrial* denizens of this discussion list that I worry more
about. Most of the proposals for filling up uncountable planes with numbers
representing -- well, who knows? -- originate here. ;-)

>
> The point I'm getting at is that somebody always claims that U+0000..U+10FFFF isn't enough, but I never see convincing evidence or rationale that an expansion is necessary—just speculation.
>

Well, it is a late Friday afternoon in August. A slow news day, I guess.

So it is time to trot out the periodically updated statistics that long ago
convinced the folks who think 21 bits is just fine and dandy -- with a
usefulness warranty that far exceeds our lifetimes -- but which, no matter
how often repeated, never convince the we-need-31-bits crowd.

Newly updated to include the Unicode 6.1 repertoire, in process for
publication very early next year, the figures are:

110,181 characters encoded (graphic, format, and control codes counted)

Now let's just peg that number to the year 2011, to make the math a
little simpler.

The first version of Unicode was published in 1991, so we've been at this
for 20 years, not counting start-up time. If you just divide 110,181 by 20
years, that is a rough average of 5,509 characters added per year.

But here is the interesting part: the rate of inclusion is declining rather
than holding steady. Again, to make the math simpler, just compare the
*first* decade of Unicode (1991 - 2001) and the *second* decade of Unicode
(2001 - 2011). Unicode 3.1 (2001) had 94,205 characters in it. So:

1st decade: 94,205 characters, or roughly 9,420 characters/year

2nd decade: 15,976 characters added, or roughly 1,598 characters/year
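
For anyone who wants to check that arithmetic, here is a quick
back-of-the-envelope sketch in Python, using just the repertoire totals
quoted above:

    # Repertoire totals quoted above (graphic, format, and control codes)
    total_2011 = 110181    # Unicode 6.1
    total_2001 = 94205     # Unicode 3.1

    print(total_2011 / 20.0)                   # ~5509 characters/year, 1991-2011
    print(total_2001 / 10.0)                   # ~9420 characters/year, 1st decade
    print((total_2011 - total_2001) / 10.0)    # ~1598 characters/year, 2nd decade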

Also keep in mind that the absolute numbers have always been completely
dominated by CJK. 75.46% of the characters encoded in Unicode 3.1
are CJK ideographs (unified and compatibility). The IRG has been working
mightily to keep adding to the total of encoded CJK ideographs, but they
are starting to scrape the bottom even of that deep barrel.

And look at the SMP Roadmap:

http://www.unicode.org/roadmaps/smp/

We know there are a few big historic ideographic scripts to go: Tangut is
the biggest and most advanced of the proposals, weighing in at something
over 7,000 characters. But even with East Asian heavyweights like Tangut,
Jurchen, and Khitan given tentative allocations on the SMP roadmap, there is
plenty of unassigned "air" on Plane 1 still. And frankly, a lot of very
serious people have been looking hard for good, encodable candidate scripts
to add to the roadmap, for a very long time.

The upshot is, based on 20 years "in the business", as it were, my best
estimate of what we can expect for the next decade runs as follows:

Two big chunks: roughly 10K more CJK ideographs nobody has ever heard of,
plus 7K+ Tangut ideographs. After that, the two committees (UTC and WG2)
will be hard-pressed to find and process many more than 1,000 characters per
year. Why? Because all the *easy* stuff was done long ago, during the
first decade of Unicode. Everything from here on out is very obscure, hard
to research, hard to document and review, hard to get consensus on, and
is often fragmentary or even undeciphered, or consists of sets of notations
that many folks won't even agree *are* characters.

So: 10K + 7K + 1K/year for 10 years = 27,000 *maximum* additions by 2021.

And that is to fill the gaping hole -- nay, gigantic chasm -- of 862,020
unassigned code points still left in the 21-bit space.
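
The same sort of back-of-the-envelope sketch, in Python again, for that
projection -- the 10K, 7K, and 1,000/year figures are just my estimates
above, nothing official:

    # Projected additions for 2011-2021, using the estimates above
    cjk = 10000                  # more CJK ideographs nobody has ever heard of
    tangut = 7000                # Tangut, roughly
    everything_else = 1000 * 10  # ~1,000 characters/year for ten years

    added_by_2021 = cjk + tangut + everything_else
    print(added_by_2021)            # 27000 maximum additions
    print(862020 - added_by_2021)   # 835020 code points still unassigned in 2021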

Past 2021, who knows? Many of us will no longer be participating by then,
but there are various possible scenarios:

1. The committees may creak to a halt, freeze the standards, and the
delta encoding rate will drop from 1,000/year to 0/year. This is actually
a scenario with a non-zero probability.

2. Somebody with non-character agendas may seize control and start using
numbers for, I don't know, perhaps localizable sentences, or something, just
because over 835,000 numbers will be available and nature abhors a
vacuum. I consider that a very low likelihood, because of the enormous
vested interest there will be by the entire worldwide IT industry in keeping
the character encoding standard stable.

3. Or, the committees may limp along more or less indefinitely, with more
and more obscure scripts being documented and standardized, with a trickle
of new ones always being invented, and new sets of symbols or notations
being invented and stuck in. So maybe they could keep up the pace
of 1,000 characters encoded per year for some time off into the future.
But at that rate, when do we have to start worrying? 835,000 divided by
1,000 characters/year is 835 years.
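
And one last sketch for that horizon -- it just assumes the 835,000-odd code
points left over after 2021 and a steady 1,000 characters/year from then on:

    # How long the remaining code points last at 1,000 characters/year
    remaining_after_2021 = 862020 - 27000   # about 835,000 unassigned code points
    years_left = remaining_after_2021 / 1000.0
    print(years_left)          # ~835 years
    print(2021 + years_left)   # takes us out to roughly the year 2856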

O.k., so apparently we have a while to go before we have to start worrying
about the Y2K or IPv4 problem for Unicode. Call me again in the
year 2851, and we'll still have 5 years left to design a new scheme and plan
for the transition. ;-)

--Ken