In a message dated 2001-06-25 2:24:53 Pacific Daylight Time,
marco.cimarosti@essetre.it writes:
> 6) UTF-16X (also named UTF-16S or UTF-16F) is definitely humor, although I
> am probably not the only one to think that it is technically more "serious"
> than UTF-8S.
We might want to be careful about naming a proposal "UTF-16x," since there is
(or was) already such a proposal:
http://www.ceres.dti.ne.jp/~maedera/UTF16X.TXT
This was proposed unofficially by Masahiko Maedera in 1999 as an extension to
UTF-16 that would allow access to the entire (at that time) UCS-4 space up to
U-7FFFFFFF. This was to be done by carving out a block of 4096 "super
surrogates" in the range from U+EE000 to U+EEFFF, and using three of these
(high, middle, low) to index into the 31-bit space. This is similar to the
way UTF-16 works, but without the additive offset (UTF-16 first subtracts
0x10000 from the code point before splitting it across the surrogate pair).
Six 16-bit words, or twelve octets, would have been required to access a
single character in the range beyond U+10FFFF: three "super surrogates,"
each itself expressed as a conventional UTF-16 surrogate pair. (Note that UTF-8
requires a maximum of six octets to access any character in the 31-bit 10646
code space.) This was not considered a problem, according to the proposal.
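The six-octet figure follows from the original 31-bit definition of UTF-8 (RFC 2279), which assigned one to six octets depending on the magnitude of the code point. A minimal sketch of that table (the function name is mine, not from the proposal):

```python
def utf8_len_rfc2279(cp):
    """Octets needed to encode a code point under the original
    31-bit UTF-8 definition (RFC 2279)."""
    if cp < 0:
        raise ValueError("negative code point")
    # Each row: (octet count, highest code point encodable at that length)
    for octets, limit in ((1, 0x7F), (2, 0x7FF), (3, 0xFFFF),
                          (4, 0x1FFFFF), (5, 0x3FFFFFF), (6, 0x7FFFFFFF)):
        if cp <= limit:
            return octets
    raise ValueError("beyond U-7FFFFFFF")
```

So the character U-12345678 from the example below would take six octets in (old-style) UTF-8, versus twelve octets in the proposed UTF-16x form.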
For example, the character U-12345678 would have been represented by the
"super surrogates"
U+EE122 U+EE8D1 U+EED67
which would in turn break down into the following conventional UTF-16
surrogates
U+DB78 U+DD22, U+DB7A U+DCD1, U+DB7B U+DD67
(No, I didn't do an implementation! These calculations were done by hand,
and may be wrong. You get the idea.)
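The first step (code point to super surrogates) depends on the proposal's exact bit layout, but the second step is ordinary UTF-16 and can be checked mechanically. A small sketch (the function name is mine) that expands each super surrogate into its surrogate pair, reproducing the six code units above:

```python
def to_surrogate_pair(cp):
    """Split a supplementary-plane code point (U+10000..U+10FFFF)
    into its UTF-16 high/low surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    v = cp - 0x10000                        # the additive offset
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

for ss in (0xEE122, 0xEE8D1, 0xEED67):
    hi, lo = to_surrogate_pair(ss)
    print(f"U+{ss:05X} -> U+{hi:04X} U+{lo:04X}")
# -> U+EE122 -> U+DB78 U+DD22
#    U+EE8D1 -> U+DB7A U+DCD1
#    U+EED67 -> U+DB7B U+DD67
```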
The UTF-16x proposal was phrased in terms of "protecting" the Unicode
Standard from the chaos that would ensue if it did not conform to ISO/IEC
10646; but in a thread on the Unicode list that began on 1999-04-12, the real
motivation turned out to be that a Japanese font consortium wanted a
mechanism to access specific (non-unified) CJK glyphs in a font. As most of
us know, Han unification is fundamental to the design of Unicode, and this is
a perfect example of non-Unicode data being shoehorned into the Unicode
framework. The debate prompted Mark Davis's well-known "noble effort"
passage, which is now in the FAQ:
> - As stated, the goal of Unicode is not to encode glyphs, but
> characters. Over a million possible codes is far more than enough for
> this goal. Unicode is *not* designed to encode arbitrary data. If you
> wanted, for example, to give each "instance of a character on paper
> throughout history" its own code, you might need trillions or
> quadrillions of such codes; noble as this effort might be, you would not
> use Unicode for such an encoding.
Today, ISO/IEC 10646 is limited, or is expected soon to be limited, to Planes
0 through 16 of Group 0, for the purpose of conforming with UTF-16 (not the
other way around, as Maedera predicted). Nevertheless, the UTF-16x proposal
is still available at the URL listed above.
-Doug Ewell
Fullerton, California
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:19 EDT