The "original" UTF-16x (was: Re: Playing with Unicode)

From: DougEwell2@cs.com
Date: Sat Jun 30 2001 - 18:57:43 EDT


In a message dated 2001-06-25 2:24:53 Pacific Daylight Time,
marco.cimarosti@essetre.it writes:

> 6) UTF-16X (also named UTF-16S or UTF-16F) is definitely humor, although I
> am probably not the only one to think that it is technically more "serious"
> than UTF-8S.

We might want to be careful about naming a proposal "UTF-16x," since there is
(or was) already such a proposal:

    http://www.ceres.dti.ne.jp/~maedera/UTF16X.TXT

This was proposed unofficially by Masahiko Maedera in 1999 as an extension to
UTF-16 that would allow access to the entire (at that time) UCS-4 space up to
U-7FFFFFFF. This was to be done by carving out a block of 4096 "super
surrogates" in the range from U+EE000 to U+EEFFF, and using three of these
(high, middle, low) to index into the 31-bit space. This is similar to the
way UTF-16 works, but without the additive offset.

Six 16-bit words, or twelve octets, would have been required to encode a
single character in the range beyond U+10FFFF: three "super surrogates,"
each itself encoded as a traditional UTF-16 surrogate pair. (Note that
UTF-8 requires a maximum of six octets to encode any character in the
31-bit 10646 code space.) This was not considered a problem, according to
the proposal.
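For reference, that six-octet UTF-8 figure comes from the original UTF-8
definition (RFC 2279), which covered the whole 31-bit UCS-4 space with up to
six octets. A minimal Python sketch of that old scheme, encoding U-12345678
(the example character used below) — my own illustration, not part of either
proposal:

```python
def utf8_rfc2279(cp):
    """Encode a code point using the original (RFC 2279) UTF-8 scheme,
    which covered the full 31-bit UCS-4 space with up to six octets."""
    if cp < 0x80:
        return bytes([cp])
    # (octet count, lead-byte bits, exclusive upper limit) for 2..6 octets
    forms = ((2, 0xC0, 0x800), (3, 0xE0, 0x10000), (4, 0xF0, 0x200000),
             (5, 0xF8, 0x4000000), (6, 0xFC, 0x80000000))
    for n, lead, limit in forms:
        if cp < limit:
            out = []
            for _ in range(n - 1):
                out.append(0x80 | (cp & 0x3F))  # continuation octet, 6 bits
                cp >>= 6
            out.append(lead | cp)               # lead octet gets the rest
            return bytes(reversed(out))
    raise ValueError("out of 31-bit range")

print(utf8_rfc2279(0x12345678).hex(" "))  # prints "fc 92 8d 85 99 b8"
```

So the pre-2003 UTF-8 reaches U-12345678 in six octets, half the twelve that
UTF-16x would have needed.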

For example, the character U-12345678 would have been represented by the
"super surrogates"

    U+EE122 U+EE8D1 U+EED67

which would in turn break down into the following conventional UTF-16
surrogates

    U+DB78 U+DD22 U+DB7A U+DCD1 U+DB7B U+DD67

(No, I didn't do an implementation! These calculations were done by hand,
and may be wrong. You get the idea.)
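The super-surrogate-to-surrogate-pair step, at least, is ordinary UTF-16 and
easy to check mechanically. A minimal Python sketch (mine, not from the
proposal):

```python
def utf16_surrogates(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000   # the additive offset that UTF-16 proper does use
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

for ss in (0xEE122, 0xEE8D1, 0xEED67):
    hi, lo = utf16_surrogates(ss)
    print(f"U+{ss:05X} -> U+{hi:04X} U+{lo:04X}")
```

This prints exactly the six surrogates listed above, so that half of the hand
calculation, at least, checks out.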

The UTF-16x proposal was phrased in terms of "protecting" the Unicode
Standard from the chaos that would ensue if it did not conform to ISO/IEC
10646; but in a thread on the Unicode list that began on 1999-04-12, the real
motivation turned out to be that a Japanese font consortium wanted a
mechanism to access specific (non-unified) CJK glyphs in a font. As most of
us know, Han unification is fundamental to the design of Unicode, and this is
a perfect example of non-Unicode data being shoehorned into the Unicode
framework. The debate prompted Mark Davis's well-known "noble effort"
passage, which is now in the FAQ:

> - As stated, the goal of Unicode is not to encode glyphs, but
> characters. Over a million possible codes is far more than enough for
> this goal. Unicode is *not* designed to encode arbitrary data. If you
> wanted, for example, to give each "instance of a character on paper
> throughout history" its own code, you might need trillions or
> quadrillions of such codes; noble as this effort might be, you would not
> use Unicode for such an encoding.

Today, ISO/IEC 10646 is (or is soon expected to be) limited to Planes 0
through 16 of Group 0, so that it conforms with UTF-16 -- not the other way
around, as Maedera predicted. Nevertheless, the UTF-16x proposal is still
available at the URL listed above.

-Doug Ewell
 Fullerton, California



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:19 EDT