From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jun 05 2007 - 09:25:13 CDT
Doug Ewell wrote on Tuesday, June 5, 2007 05:32 to Unicode Mailing List:
>
> Back in the day when ISO 10646 was still 31 bits wide and the proposal was
> made to limit it to 17 planes, as Unicode already was, there were quite a
> few, apparently serious, objections that this would be a regrettable,
> Y2K-like limitation because of the eventual discovery of non-terrestrial
> scripts that would need the extra coding space. I think some of us who
> remember this being portrayed as a genuine technical flaw in Unicode still
> tend to wince when the topic is brought up, even if the humorous intent is
> clear to everyone else.
If there was a flaw, it was not originally from Unicode, but from the
designers of the UTF-16 encoding, which was first released as an RFC and then
adopted by ISO, before being made part of the Unicode standard.
Nothing was ever prepared to allow a possible extension of UTF-16 to more
than 17 planes (and nothing has been done since then to allow more
surrogates in the BMP to make this possible using 3 surrogates).
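As a rough illustration of where the 17-plane ceiling comes from, here is a small Python sketch of the standard UTF-16 surrogate-pair arithmetic. The function name decode_pair is only illustrative; the ranges and the formula are those of standard UTF-16.

```python
# Standard UTF-16 surrogate ranges: 1024 high and 1024 low code units.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def decode_pair(high: int, low: int) -> int:
    """Combine one high and one low surrogate into a code point."""
    if not (HIGH_START <= high <= HIGH_END):
        raise ValueError("not a high surrogate")
    if not (LOW_START <= low <= LOW_END):
        raise ValueError("not a low surrogate")
    # Each surrogate carries 10 bits, so a pair addresses 2**20 code
    # points, placed just above the BMP.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# The largest value a pair can express, and the number of 64K planes
# (BMP included) that this covers.
max_cp = decode_pair(HIGH_END, LOW_END)
planes = (max_cp + 1) // 0x10000
print(hex(max_cp), planes)
```

With only two 10-bit blocks in the BMP, nothing above the value printed here can be reached without changing the decoding rules.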
If one ever wants to have 31-bit codepoints, the only way is to allocate a
new set of surrogates, either within the PUA block (but this may conflict
with many current uses of the PUA block in the BMP) or within the special
plane 14. The latter will require using 2 supplementary codepoints, each one
using 2 normal surrogates (i.e. a total of 4 surrogates, coding 31-bit
codepoints using... 64 bits), and it will break parsers that expect
codepoints to be terminated after the first 2 surrogates.
In all cases, such an extension will require breaking existing conformance
rules for the standard UTF-16 decoding and use, and so such an extension will
not be able to use the same "UTF-16" identifier (but possibly "UTF-16X").
(The most efficient way to reach the 30-bit limit would be to have another
10-bit-wide block in the BMP, but chances are now very low that this will
ever be possible, the convenient 1024-codepoint space that remained between
the Hangul syllables and existing surrogates being reserved now for Hangul
extensions).
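To make the 30-bit figure concrete, here is a purely hypothetical Python sketch of what three 10-bit units would carry. No such third block exists; the function combine_units is an invented name, and the sketch deliberately ignores where the block would sit and what offset would be added to the result.

```python
BITS_PER_UNIT = 10            # each surrogate-like unit carries 10 bits
UNIT_RANGE = 1 << BITS_PER_UNIT  # i.e. a 1024-codepoint block


def combine_units(indices):
    """Concatenate 10-bit indices (0..1023), most significant first.

    Hypothetical: models only the payload of a 3-unit sequence, not any
    real or proposed encoding form.
    """
    value = 0
    for idx in indices:
        if not 0 <= idx < UNIT_RANGE:
            raise ValueError("index does not fit in a 1024-codepoint block")
        value = (value << BITS_PER_UNIT) | idx
    return value


two_units = combine_units([1023, 1023])          # what a pair carries today
three_units = combine_units([1023, 1023, 1023])  # with a third block
print(two_units.bit_length(), three_units.bit_length())
```

A third block would thus add 10 bits to the 20 bits carried by a pair, which is where the 30-bit limit comes from.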
However, to make a "UTF-16X" extension compatible with strict implementations
of UTF-16 (even if they see multiple codepoints, which are currently
considered valid as characters, instead of just one), the only remaining
solution is to allocate 2 blocks of supplementary surrogates in the special
plane 14.
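A quick way to see both the cost and the compatibility of this approach is to encode two plane 14 codepoints with an ordinary UTF-16 codec. In the Python sketch below, U+E1000 and U+E1400 are arbitrary placeholders that I picked for illustration; they are not assigned or proposed as supplementary surrogates.

```python
# Two arbitrary, currently unassigned plane 14 codepoints standing in for
# a hypothetical "supplementary high" and "supplementary low" surrogate.
HYPOTHETICAL_HIGH = 0xE1000
HYPOTHETICAL_LOW = 0xE1400

text = chr(HYPOTHETICAL_HIGH) + chr(HYPOTHETICAL_LOW)

# A strict UTF-16 encoder emits one normal surrogate pair per codepoint.
encoded = text.encode("utf-16-le")
code_units = len(encoded) // 2   # 16-bit code units
total_bits = len(encoded) * 8

# A strict UTF-16 decoder accepts the sequence, but sees two codepoints
# rather than one extended codepoint.
decoded = encoded.decode("utf-16-le")
print(code_units, total_bits, len(decoded))
```

The existing decoder does not fail on such a sequence; it simply reports two separate codepoints, and combining them would be the job of the extended layer.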
But do we need such an extension? Note that there are now variation selectors
to qualify the existing characters, without having to encode many
compatibility characters in the future.