From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 05 2007 - 14:33:45 CDT
Ah well, another year has passed, so it must be time again to
worry about 17 planes not being enough. :-)
> Doug Ewell wrote on Tuesday, June 5, 2007 at 05:32 to the Unicode Mailing List:
> >
> > Back in the day when ISO 10646 was still 31 bits wide and the proposal was
> > made to limit it to 17 planes, as Unicode already was, there were quite a
> > few, apparently serious, objections that this would be a regrettable,
         ^^^^^^^^^^^^^^^^^^
The operative words.
> > Y2K-like limitation because of the eventual discovery of non-terrestrial
> > scripts that would need the extra coding space. I think some of us who
> > remember this being portrayed as a genuine technical flaw in Unicode still
> > tend to wince when the topic is brought up, even if the humorous intent is
> > clear to everyone else.
>
> If there was a flaw, it was not originally from Unicode, but from the
designers of the UTF-16 encoding which was first released as an RFC and then
> adopted by ISO, before being made part of the Unicode standard.
I think Philippe has his history mixed up here.
The RFC for UTF-16 was the last in this sequence. That is RFC 2781,
dated February, 2000, by Hoffman and Yergeau. It bases its
definition (as it should) on what was then Annex Q of 10646,
cited at the time as ISO/IEC 10646-1:1993 plus amendments.
RFC 2781 in turn refers to RFC 2277 (BCP 18), dated January, 1998,
by Alvestrand. That RFC refers to UTF-16, although not defining
it, and also cites ISO/IEC 10646-1:1993 plus amendments.
Amendment 1 (UTF-16) to ISO/IEC 10646-1:1993 was actually published
in early 1996.
UTF-16 and the use of surrogates was also published in 1996 in
Unicode 2.0 (deliberately, as part of the ongoing synchronization
with 10646).
Amendment 1 was actually drafted by Mark Davis, who was then WG2
project editor. And the first draft was WG2 N970, dated 7 February 1994.
UTF-16 was *first* presented to WG2 in WG2 N883, Proposal for Extended
UCS-2, by Joe Becker, dated 21 January 1993 (= X3L2/93-016). The
extension scheme was known as "UCS-2E" through most of 1993, until
it was rechristened "UTF-16" around January, 1994, with a revised
range of code points for the surrogates.
What people need to understand, also, is that as of 21 January 1993
the exact architecture of the merged Unicode and 10646 standards
was still in play. The concept of an East Asian character set
"swapping" area in the then "O-Zone" was still being advocated
as a way to extend the BMP. That was the ghost of ISO 2022, still
not yet vanquished as an approach to universal character encoding
at the time.
The Unicode script committee had submitted a paper, drafted in
late 1992, demonstrating that the already identified need for
encoding as-yet-unencoded scripts would exceed the available BMP
space by at least 10,000 code points if the O-Zone were kept
for an East Asian swapping scheme. It was quite apparent to
everyone at the time that the BMP simply wasn't going to be
enough -- particularly with the building pressure to encode more
Hangul syllables on the BMP (the eventual Amendment 5 to 10646).
So it was also apparent there had to be an extension mechanism.
The question was merely *which* extension mechanism would
be most acceptable and least disruptive going forward.
And as should be clear from the above, the historical direction
for UTF-16 was: UTC --> 10646 --> IETF, and not the reverse, as
implied by Philippe's summary.
>
> Nothing was ever prepared to allow a possible extension of UTF-16 to more
> than 17 planes (and nothing has been done since then to allow more
> surrogates in the BMP to make this possible using 3 surrogates).
That at least is correct -- largely because in the 13 years since
UTF-16 was proposed to WG2 for 10646, there has been no need to do so.
Nor *will* there be within the lifetimes of anyone reading this
email list.
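To make the 17-plane arithmetic concrete, here is a quick sketch
(Python, purely illustrative and of my own devising -- the constants
are just the standard surrogate ranges, nothing from the RFCs):

    # Standard UTF-16 surrogate-pair decoding. A high surrogate
    # (U+D800..U+DBFF) and a low surrogate (U+DC00..U+DFFF) each
    # carry 10 payload bits; 20 bits plus the 0x10000 offset gives
    # U+10000..U+10FFFF, i.e. 16 supplementary planes plus the BMP.
    def decode_surrogate_pair(high, low):
        assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
        assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    # The largest reachable code point is exactly U+10FFFF:
    assert decode_surrogate_pair(0xDBFF, 0xDFFF) == 0x10FFFF

With only one high and one low surrogate per pair, there is simply
nowhere past U+10FFFF to go.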
>
> If one ever wants to have 31-bit codepoints, the only way is to allocate a
                            ^^^^^^
False, of course, because UTF-32 (and UCS-4, its twin) are 32-bit
encoding forms, using 32-bit code units to represent code points.
But I presume what Philippe means to say is that the only way to
represent characters encoded past U+10FFFF using 16-bit Unicode
code units is to allocate a...
> net set of surrogates either within the PUA block (but this may conflict
  ^^^
  new
> with many current uses of the PUA block in the BMP), or within the special
> plane 14 (but this will require using 2 supplementary codepoints, each one
> using 2 normal surrogates (i.e. a total of 4 surrogates, i.e. coding 31 bit
> codepoints using...64 bits), and this will break parsers that expect
> codepoints to be terminated after the first 2 surrogates.)
Any allocation beyond U+10FFFF will break many things beyond such
parsers at this point.
But there is little to worry about, because:
A. Such an extension is not needed.
B. Such an extension is not going to happen.
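And point B is not just policy: the U+10FFFF ceiling is wired into
deployed software. In Python 3, for example (an illustration of my
choosing, not anything from the standard itself):

    # Code points stop at U+10FFFF; anything past it is rejected.
    last = chr(0x10FFFF)       # fine: the last valid code point
    try:
        chr(0x110000)          # one past the ceiling
    except ValueError as e:
        print(e)               # "chr() arg not in range(0x110000)"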
> (The most efficient way to reach the 30 bit limit would be to have another
> 10-bits wide block in the BMP, but chances are now very low that this will
> be ever possible, the convenient 1024-codepoints space that remained between
> the Hangul syllables and existing surrogates being reserved now for Hangul
> extensions).
Once again little deterred by the facts...
There has never been a "convenient 1024-codepoints space ... between
the Hangul syllables and existing surrogates." Hangul syllables
stop at U+D7A3. The first surrogate code point starts at U+D800.
I'm guessing that Philippe is bemoaning the loss of the contiguous
block of U+A800..U+ABFF *before* the start of Hangul syllables.
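The arithmetic is easy to check (again a throwaway Python snippet of
my own, nothing normative):

    # Gap between the last Hangul syllable (U+D7A3) and the first
    # surrogate (U+D800): only 92 code points, nowhere near 1024.
    assert 0xD800 - (0xD7A3 + 1) == 92

    # The block before the Hangul syllables, U+A800..U+ABFF, really
    # is 1024 code points wide.
    assert 0xABFF - 0xA800 + 1 == 1024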
But in my opinion, the current allocation of that area to the
encoding of Syloti Nagri, Phags-pa, Saurashtra, Kayah Li, Rejang,
Cham, Tai Viet, and Old Hangul jamo extensions -- all demonstrably
existent and in demonstrable need of encoding -- is a far wiser
use of BMP allocation than reserving code points for speculative
extension schemes for characters that don't exist.
> But do we need such extension?
In a word, no.
--Ken
> Note that there are now variation selectors
> to qualify the existing characters, without having to encode many
> compatibility characters in the future.