Beyond 17 planes, was: Java char and Unicode 3.0+

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Oct 16 2003 - 09:03:34 CST


On 16/10/2003 06:33, Philippe Verdy wrote:

>From: "John Cowan" <cowan@mercury.ccil.org>
>
>
>
>>Philippe Verdy scripsit:
>>
>>
>>
>>>I am also doubting, but I would not bet on it. After all, when Unicode
>>>started, a single plane was considered waaaaaay more than sufficient too.
>>
>>I not only would bet on it, I actually have a bet on it. Henry Thompson
>>of the W3C's Schema WG bet me that we'd outrun the existing planes within
>>five years; four left to go and no sign of it, even if Michael Everson
>>were to achieve pluripresence and actually get everything accepted into
>>the standard that he knows needs to be done.
>>
>>
>
>Just in case it is ever needed, are you keeping an unassigned range in
>the BMP so that extension remains possible while preserving upward
>compatibility with, or support for, UTF-16, which is currently the main
>reason why only 17 planes are defined?
>(For example, in the range between the Hangul syllables and the existing
>surrogates.)
>
>
>
...

I would guess not. I can think of much more useful things to do with any
remaining space in the BMP. Anyway, the space you mention is only 80
characters wide, so if it were used for additional high-half or low-half
surrogates it would give just slightly more than one more plane: 80 x
1024 = 81,920 characters, or 1.25 planes. And it is the largest space on
the BMP which is not already roadmapped.
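
As a rough check of the arithmetic above, here is a minimal Python sketch.
It takes the 80-character width from the message as given and assumes the
new code units would each be paired with the 1024 existing low-half
surrogates; the variable names are illustrative only.

```python
# Sketch: how much room would 80 additional surrogate code units buy?
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1   # 1024 existing low-half surrogates
PLANE_SIZE = 0x10000                   # 65,536 code points per plane
EXTRA_HIGH = 80                        # width of the gap, as stated in the message

extra_code_points = EXTRA_HIGH * LOW_SURROGATES
extra_planes = extra_code_points / PLANE_SIZE

print(extra_code_points)  # 81920
print(extra_planes)       # 1.25
```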

I suppose that, in the unlikely event that more than 17 planes look
likely to become necessary in the foreseeable future, and anyone is
still trying to use UTF-16, it will be possible to reserve part of the
17 planes for surrogate pairs representing the remaining planes. (By
that time memory and bandwidth will probably be so cheap that no one
bothers any more with encodings that save them.) So the UTF-16 encoding
would be two existing surrogate pairs, each made of two 16-bit code
units, forming a higher-level surrogate pair. UTF-32 would of course be
more efficient (32 bits rather than 64), but I doubt if anyone will
care.
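
The higher-level surrogate idea can be sketched in Python as follows. This
is purely hypothetical and not part of any Unicode specification: it assumes,
for illustration only, that planes 11 and 12 are the ones reserved, and the
names HYPER_HIGH_BASE, HYPER_LOW_BASE, encode_extended and decode_extended
are invented here.

```python
# Hypothetical scheme, NOT part of Unicode: two planes are set aside as
# "hyper-surrogates". Planes 11 and 12 are picked here only for illustration.
HYPER_HIGH_BASE = 0xB0000   # assumed plane of high-half hyper-surrogates
HYPER_LOW_BASE = 0xC0000    # assumed plane of low-half hyper-surrogates
FIRST_EXTENDED = 0x110000   # first value beyond the existing 17 planes


def utf16_units(cp):
    """Standard UTF-16 encoding of a supplementary code point (two units)."""
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def encode_extended(value):
    """Encode a value >= 0x110000 as two hyper-surrogates (four UTF-16 units)."""
    offset = value - FIRST_EXTENDED
    if not 0 <= offset < 2**32:
        raise ValueError("out of range for this sketch")
    high = HYPER_HIGH_BASE + (offset >> 16)      # upper 16 bits of the offset
    low = HYPER_LOW_BASE + (offset & 0xFFFF)     # lower 16 bits of the offset
    return utf16_units(high) + utf16_units(low)


def decode_extended(units):
    """Inverse of encode_extended for a list of four UTF-16 code units."""
    def pair(hi, lo):
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    high = pair(units[0], units[1]) - HYPER_HIGH_BASE
    low = pair(units[2], units[3]) - HYPER_LOW_BASE
    return FIRST_EXTENDED + (high << 16) + low
```

Each extended value costs four 16-bit code units, i.e. the 64 bits mentioned
above, and two full planes of 65,536 hyper-surrogates each give 65,536 x
65,536 = 2^32 combinations.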

If two whole planes were reserved for such surrogates, this mechanism
could cover the whole 32-bit hyperspace. Meanwhile UTF-8 can be extended
to 6 bytes (byte 1 being 1111110x) to cover 31 bits of it. Plenty of
room there to encode not just all the scripts of the Galactic Federation
but even to squeeze in those of the Klingons and their allies!
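
For the UTF-8 side, here is a minimal sketch of the 1- to 6-byte pattern as
it was laid out in the original UTF-8 definition (RFC 2279). In that layout
the lead byte 111110xx introduces a 5-byte sequence and 1111110x a 6-byte
one, which carries 31 bits. The function name utf8_extended is invented for
this example, and today's UTF-8 stops at 4 bytes (U+10FFFF).

```python
def utf8_extended(value):
    """Encode an integer below 2**31 using the 1- to 6-byte UTF-8 pattern."""
    if value < 0 or value >= 2**31:
        raise ValueError("the 6-byte form carries at most 31 bits")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, lead-byte marker) for sequences of 2..6 bytes
    forms = [(1 << 11, 0xC0), (1 << 16, 0xE0), (1 << 21, 0xF0),
             (1 << 26, 0xF8), (1 << 31, 0xFC)]
    for n_trailing, (limit, lead) in enumerate(forms, start=1):
        if value < limit:
            break
    # Peel off 6 bits at a time for the continuation bytes (10xxxxxx).
    tail = []
    for _ in range(n_trailing):
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever bits remain go into the lead byte.
    return bytes([lead | value] + tail[::-1])
```

For code points inside the current range this agrees with ordinary UTF-8,
for example utf8_extended(0x20AC) gives the same bytes as
"€".encode("utf-8"); values from 2^26 upward take the full 6 bytes.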

Or perhaps a way can be found to graciously retire UTF-16 in some
distant future version of Unicode. That is likely to become viable long
before the extra planes are needed.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

