Re: Beyond 17 planes, was: Java char and Unicode 3.0+

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Oct 16 2003 - 11:45:07 CST


From: "Peter Kirk" <peterkirk@qaya.org>

> On 16/10/2003 06:33, Philippe Verdy wrote:
>
> >From: "John Cowan" <cowan@mercury.ccil.org>
> >
> >>Philippe Verdy scripsit:
> >>>
> >>>I am also doubting, but I would not bet on it. After all, when Unicode
> >>>started, a single plane was considered waaaaaay more than sufficient
> >>>too.
> >
> >>I not only would bet on it, I actually have a bet on it. Henry Thompson
> >>of the W3C's Schema WG bet me that we'd outrun the existing planes within
> >>five years; four left to go and no sign of it, even if Michael Everson
> >>were to achieve pluripresence and actually get everything accepted into
> >>the standard that he knows needs to be done.
> >
> >Just for the case it would be needed, are you keeping an unassigned range
> >in the BMP so that extension will remain possible to preserve an ascending
> >compatibility or support for UTF-16 which currently is the main reason why
> >there are for now 17 planes defined ?
> >(for example in the range between Hangul syllables and existing surrogates)
> ...
>
> I would guess not. I can think of much more useful things to do with any
> remaining space in the BMP. Anyway, the space you mention, if used for
> additional high-half or low-half surrogates, is only 80 characters wide
> and so would give just slightly more than one more plane, in fact 80 x
> 1024 characters. And it is the largest space on the BMP which is not
> already roadmapped.
>
> I suppose that, in the unlikely event that in the foreseeable future it
> looks as if more than 17 planes might become necessary, and anyone is
> still trying to use UTF-16 (although by that time memory and bandwidth
> will probably be so cheap that no one bothers any more with encodings
> that save them), it will be possible to reserve part of the 17 planes
> for surrogate pairs representing the remaining planes. So the UTF-16
> encoding would be two existing 16-bit surrogate pairs forming a higher
> level surrogate pair. UTF-32 would of course be more efficient (32 bits
> rather than 64), but I doubt if anyone will care.

This is another solution; however, the length of the encoding sequence
for the extended codepoints should be predictable just by testing the
value of the first high surrogate. Given that each surrogate encodes
10 bits, adding a third surrogate of the same size would encode 30 bits,
i.e. half the size of the whole UTF-32 space _as originally defined_
(31 bits).
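
To make the "predictable length" point concrete, here is a minimal sketch in Python. The hyper-surrogate range EC00..EFFF is purely an assumption chosen for illustration (it is currently part of the BMP PUA and is not reserved for this by Unicode or ISO10646); only the D800..DFFF surrogate ranges are real.

```python
# Hypothetical range for "hyper-surrogates": NOT part of Unicode or ISO 10646.
HYPER_START, HYPER_END = 0xEC00, 0xEFFF   # 1024 code units, assumed for illustration
HIGH_START, HIGH_END = 0xD800, 0xDBFF     # real UTF-16 high surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF       # real UTF-16 low surrogates


def sequence_length(first_unit: int) -> int:
    """Return how many 16-bit code units the sequence starting here occupies."""
    if HYPER_START <= first_unit <= HYPER_END:
        return 3  # hypothetical: hyper-surrogate + high + low
    if HIGH_START <= first_unit <= HIGH_END:
        return 2  # ordinary surrogate pair
    if LOW_START <= first_unit <= LOW_END:
        raise ValueError("unpaired low surrogate cannot start a sequence")
    return 1      # plain BMP code unit
```

A parser only ever has to look at one code unit to know how far to skip, which is the property I am asking for.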

> If two whole planes were reserved for such surrogates, this mechanism
> could cover the whole 32-bit hyperspace. Meanwhile UTF-8 can be extended
> to 6 bytes (byte 1 being 111110xx) to cover the same space. Plenty of
> room there to encode not just all the scripts of the Galactic Federation
> but even to squeeze in those of the Klingons and their allies!

Thanks for pointing out something that is obvious in the original RFC
describing the unrestricted UTF-8 encoding as defined by X/Open and first
adopted in the first releases of ISO10646.
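
For reference, a rough sketch of that unrestricted form, following the 1- to 6-byte layout of RFC 2279 (the function name is mine, and the sketch does not reject surrogate values). In that layout the five-byte form starts with 111110xx and the six-byte form with 1111110x.

```python
def utf8_rfc2279(value: int) -> bytes:
    """Encode a value below 2**31 using the original 1- to 6-byte UTF-8 forms."""
    if not 0 <= value < 2**31:
        raise ValueError("value outside the 31-bit UCS-4 range")
    if value < 0x80:
        return bytes([value])  # 0xxxxxxx
    # (exclusive upper bound, lead-byte marker, number of continuation bytes)
    forms = [
        (0x800,      0xC0, 1),  # 110xxxxx
        (0x10000,    0xE0, 2),  # 1110xxxx
        (0x200000,   0xF0, 3),  # 11110xxx
        (0x4000000,  0xF8, 4),  # 111110xx
        (0x80000000, 0xFC, 5),  # 1111110x
    ]
    for limit, marker, n in forms:
        if value < limit:
            lead = marker | (value >> (6 * n))
            # Continuation bytes carry 6 bits each, most significant first.
            tail = [0x80 | ((value >> (6 * i)) & 0x3F) for i in range(n - 1, -1, -1)]
            return bytes([lead] + tail)
    raise AssertionError("unreachable: range was checked above")
```

For values within the current 17 planes this gives the same bytes as today's restricted UTF-8; only the 5- and 6-byte forms go beyond it.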

> Or perhaps a way can be found to graciously retire UTF-16 in some
> distant future version of Unicode. That is likely to become viable long
> before the extra planes are needed.

I really doubt this: UTF-16 was the preferred encoding scheme for Unicode
and it is still (and will long remain) the preferred representation in
C/C++ environments that define wchar_t, and in OSes like Windows that
define and use it in the Win32 API...

I would really not like to imagine a situation where UTF-16 would become
deprecated by Unicode: this would be a big issue for the many systems that
rely on being able to encode Unicode characters with it, and that will
not want to shift to UTF-32 any time soon, as it would require defining
new APIs at the OS kernel level!

It's true that there is no plan in Unicode to encode anything other than
plain text for existing or future actual scripts. But the objectives of
ISO10646 are also to support and integrate almost all other related ISO
specifications that may need a unified codepoint space for encoding
either plain text or their own objects.

Yes, we have some clear indications that we won't need more than 17
planes for the scripts considered by Unicode. But keeping the space open
for non-Unicode applications (it would be up to ISO10646 to accept
and reference them, as Unicode.org will not attempt to define their
properties as actual text characters for general scripts) is still a
good safeguard for the long-term future of Unicode within a more open
architecture into which it could be fitted.

So I don't see anything wrong if Unicode just says now that only 17
planes will be allocated to encode plain text in accordance with
ISO10646, leaving other applications free to use and allocate codepoint
ranges that could be kept compatible with UTF-16 through new
kinds of surrogates (something like hyperplane selectors, used as a
prefix before the high and low surrogates).

The other solution, based on assigning new hyper-surrogates outside
the BMP, would require, for parsing predictability, that each of these
"hyper-surrogates" be encoded with a pair of UTF-16 surrogates. This
would create sequences of 4 code units, which may be quite wasteful
of memory space.
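
A minimal sketch of what such a 4-code-unit sequence could look like, assuming two whole planes are set aside as Peter suggests. The choice of planes 12 and 13 is entirely hypothetical (nothing of the kind is reserved in any standard); only the surrogate-pair arithmetic is the real UTF-16 one.

```python
# Hypothetical planes reserved for hyper-surrogates; nothing like this is standardized.
HYPER_HIGH_BASE = 0xC0000  # assumed plane 12: carries the upper 16 bits
HYPER_LOW_BASE = 0xD0000   # assumed plane 13: carries the lower 16 bits


def to_surrogate_pair(cp: int) -> list:
    """Standard UTF-16 encoding of a supplementary code point (U+10000..U+10FFFF)."""
    offset = cp - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]


def encode_nested(value: int) -> list:
    """Encode a 32-bit value as two hyper-surrogates, i.e. four UTF-16 code units."""
    if not 0 <= value < 2**32:
        raise ValueError("value outside the 32-bit space")
    hyper_high = HYPER_HIGH_BASE + (value >> 16)
    hyper_low = HYPER_LOW_BASE + (value & 0xFFFF)
    return to_surrogate_pair(hyper_high) + to_surrogate_pair(hyper_low)
```

Every extended value costs four code units (64 bits) here, against 32 bits in UTF-32, which is the waste I am referring to.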

A solution with 3 UTF-16 surrogates, however, would allow
extending the encoding space to a little more than 30 bits,
adding 2^14 planes to the existing 17.
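
As a sketch only, the 3-code-unit form could be computed as below. The selector range starting at EC00 and the choice to number the extended values from 110000 upward are my assumptions for illustration, not anything defined by Unicode or ISO10646.

```python
HYPER_BASE = 0xEC00        # hypothetical 1024-unit hyper-surrogate range (assumed)
FIRST_EXTENDED = 0x110000  # first value beyond the 17 standard planes


def encode_extended(value: int) -> list:
    """Encode a value beyond plane 16 as hyper-surrogate + high + low surrogate."""
    offset = value - FIRST_EXTENDED
    if not 0 <= offset < 2**30:
        raise ValueError("value not representable with a 30-bit payload")
    return [
        HYPER_BASE | (offset >> 20),        # top 10 bits
        0xD800 | ((offset >> 10) & 0x3FF),  # middle 10 bits
        0xDC00 | (offset & 0x3FF),          # low 10 bits
    ]


def decode_extended(units: list) -> int:
    """Inverse of encode_extended for a well-formed 3-unit sequence."""
    hyper, high, low = units
    offset = ((hyper - HYPER_BASE) << 20) | ((high - 0xD800) << 10) | (low - 0xDC00)
    return FIRST_EXTENDED + offset
```

Three code units (48 bits) per extended value, and the last two units remain an ordinary high/low pair.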

Suppose that a 10-bit range (1024 codepoints) is reserved in the BMP
for these hyper-surrogates. This would of course conflict with the
current roadmap, which does not leave that much space available: for
now, free space remains only in these rows:

A8xx (Syloti Nagri) ¿Pahawh Hmong? ??? (Varang Kshiti) (Sorang Sng.) ???
A9xx ¿Chakma? ??? ??? ¿Javanese? ??? ???
AAxx ¿Newari? ??? ??? ¿Siddham? ??? ???
ABxx ¿Saurashtra? ??? ??? (hPhags-pa) ??? ???

Since we are discussing here the roadmap for the possible future
integration of rare scripts that are not yet standardized, this is an
important issue on which a decision must be made: should we really fill
the whole BMP, so that it can no longer accommodate future efficient
UTF-16-compatible representations of a larger encoding space in which
Unicode will be only a small part?

If we still want to keep these scripts in the BMP in the roadmap, then
the only solution would be to deprecate some ranges of the BMP PUA
area, soon giving authors who currently use PUAs in the BMP an
opportunity to relocate them to one of the new PUA planes 15 and 16.
(Why not the space EBxx..EFxx, or even the space E8xx..EFxx if we
want to cover the whole 31-bit space of the original X/Open spec?)
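
To check the arithmetic behind these two ranges, a small sketch (the helper name is mine): each selector value prefixes an ordinary surrogate pair carrying 20 bits, i.e. 16 planes' worth of codepoints.

```python
def extra_planes(selector_units: int) -> int:
    """Planes reachable when a selector from a range of this size prefixes a surrogate pair."""
    # Each selector value selects 2**20 codepoints; a plane holds 2**16 of them.
    return selector_units * 2**20 // 2**16


ranges = {
    "EBxx..EFxx": 0xEFFF - 0xEB00 + 1,  # 1280 code units
    "E8xx..EFxx": 0xEFFF - 0xE800 + 1,  # 2048 code units
}
for name, size in ranges.items():
    print(name, size, extra_planes(size))
```

A 1024-unit (10-bit) selector range gives 2^14 extra planes, i.e. a 30-bit payload, while the 2048 units of E8xx..EFxx give 2^15 planes, i.e. the 31-bit space.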

The other solution would be to reserve these hyper-surrogates in
the "special" plane 14, as the allocation roadmap leaves this plane
nearly empty, with very few usages. There would still be issues with
applications that use the old parsing rules for combining sequences
and that would expect these to be independent characters.

