And Visions of Sugar Plum UTF-8's Dance in Their Heads

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 12 2001 - 18:16:43 EDT


Case I. Code points U-0000D800..U-0000DFFF excluded
        from the UTF's. "The way God intended it to be"

   code point        UTF-8               UTF-16      UTF-32

a. 00000000   <=>    00                  0000        00000000
b. 0000D7FF   <=>    ED 9F BF            D7FF        0000D7FF
g. 0000E000   <=>    EE 80 80            E000        0000E000
h. 0000FFFF   <=>    EF BF BF            FFFF        0000FFFF
i. 00010000   <=>    F0 90 80 80         D800 DC00   00010000
j. 0010FFFF   <=>    F4 8F BF BF         DBFF DFFF   0010FFFF

[Commentary by Ken: UTF-16 does not define the same
 binary ordering as UTF-8 or UTF-32. Big whoop.]
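
The ordering difference can be checked with a minimal Python sketch. It
relies only on Python's built-in codecs; the helper name binary_order is
made up for illustration. Big-endian forms are used so that byte order
matches code unit order.

```python
# Compare the binary order of U+FFFF and U+10000 in each encoding form.
bmp_last = "\uFFFF"
supp_first = "\U00010000"

def binary_order(a, b, codec):
    """Return '<' or '>' for the encoded byte strings of a and b."""
    ea, eb = a.encode(codec), b.encode(codec)
    return "<" if ea < eb else ">"

for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    # UTF-8 and UTF-32 give '<'; UTF-16 gives '>' because D800 < FFFF.
    print(codec, binary_order(bmp_last, supp_first, codec))
```

In UTF-16 the supplementary character starts with the code unit D800,
which sorts below FFFF, so the two characters swap places relative to
their code point order.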

===========================================================

Case II. Code points U-0000D800..U-0000DFFF included
        in the UTF's. "Mark's hard look at the real
        world, where the angels have fallen."
        http://www.macchiato.com/utc/utf_comparison.htm

   code point        UTF-8               UTF-16      UTF-32

a. 00000000   <=>    00                  0000        00000000
b. 0000D7FF   <=>    ED 9F BF            D7FF        0000D7FF
g. 0000E000   <=>    EE 80 80            E000        0000E000
h. 0000FFFF   <=>    EF BF BF            FFFF        0000FFFF
i. 00010000   <=>    F0 90 80 80         D800 DC00   00010000
j. 0010FFFF   <=>    F4 8F BF BF         DBFF DFFF   0010FFFF

Round-tripping isolated surrogate code points (when not
appropriately paired):

c. 0000D800   <=>    ED A0 80            D800        0000D800
d. 0000DBFF   <=>    ED AF BF            DBFF        0000DBFF
e. 0000DC00   <=>    ED B0 80            DC00        0000DC00
f. 0000DFFF   <=>    ED BF BF            DFFF        0000DFFF

Code point sequences that do not round-trip from UTF code
unit sequences. [Could be termed "irregular code point
sequences" --Ken]:

k. 0000D800 0000DC00   =>    F0 90 80 80         D800 DC00   00010000
l. 0000DBFF 0000DFFF   =>    F4 8F BF BF         DBFF DFFF   0010FFFF

UTF code unit sequences that do not round-trip from code
points. (Irregular code unit sequences):

m. 00010000   <=     ED A0 80 ED B0 80   ----        0000D800 0000DC00
n. 0010FFFF   <=     ED AF BF ED BF BF   ----        0000DBFF 0000DFFF

[Commentary by Ken: k and l are a real problem here,
 since the conditional handling of "surrogate code points",
 where they convert to a single UTF-32 code unit when isolated,
 but *also* convert to a single UTF-32 code unit when paired,
 breaks the 1-to-1 relationship, character==>code unit, implicit
 for UTF-32. m and n have the same problem in reverse for UTF-32.
 I don't think either can be considered a correct specification
 for UTF-32.]
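
The Case II behaviour can be sketched in Python. This is an
illustration under stated assumptions, not a specification: the names
fuse_pairs, encode_case2 and decode_case2 are hypothetical, and
Python's "surrogatepass" error handler is used to read and write the
three-byte forms of isolated surrogates.

```python
def fuse_pairs(cps):
    """Combine each adjacent <high, low> surrogate pair into one
    supplementary code point; leave isolated surrogates alone."""
    out, i = [], 0
    while i < len(cps):
        if (i + 1 < len(cps)
                and 0xD800 <= cps[i] <= 0xDBFF
                and 0xDC00 <= cps[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((cps[i] - 0xD800) << 10)
                       + (cps[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(cps[i])
            i += 1
    return out

def encode_case2(cps):
    """Code points -> UTF-8 bytes, pairing surrogates when adjacent."""
    return b"".join(chr(cp).encode("utf-8", "surrogatepass")
                    for cp in fuse_pairs(cps))

def decode_case2(data):
    """UTF-8 bytes -> code points, again pairing adjacent surrogates."""
    text = data.decode("utf-8", "surrogatepass")
    return fuse_pairs([ord(ch) for ch in text])

# An isolated surrogate survives the round trip ...
print(decode_case2(encode_case2([0xD800])) == [0xD800])
# ... but a paired sequence comes back as a single code point.
print(decode_case2(encode_case2([0xD800, 0xDC00])) == [0x10000])
```

The second check is the problem with k and l: two code points go in
and one comes out, so the sequence does not round-trip.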

===========================================================

Case III. Code points U-0000D800..U-0000DFFF included
        in the UTF's, using UTF-8s "The vision provided
        by the Oracle."

   code point        UTF-8s              UTF-16      UTF-32

a. 00000000   <=>    00                  0000        00000000
b. 0000D7FF   <=>    ED 9F BF            D7FF        0000D7FF
g. 0000E000   <=>    EE 80 80            E000        0000E000
h. 0000FFFF   <=>    EF BF BF            FFFF        0000FFFF
i. 00010000   <=>    ED A0 80 ED B0 80   D800 DC00   00010000
j. 0010FFFF   <=>    ED AF BF ED BF BF   DBFF DFFF   0010FFFF

Round-tripping isolated surrogate code points:

c. 0000D800   <=>    ED A0 80            D800        0000D800
d. 0000DBFF   <=>    ED AF BF            DBFF        0000DBFF
e. 0000DC00   <=>    ED B0 80            DC00        0000DC00
f. 0000DFFF   <=>    ED BF BF            DFFF        0000DFFF

Code point sequences that do not round-trip from all UTF code
unit sequences. (Could be termed "irregular code point
sequences" --Ken):

k. 0000D800 0000DC00   =>    ED A0 80 ED B0 80   D800 DC00   0000D800 0000DC00
l. 0000DBFF 0000DFFF   =>    ED AF BF ED BF BF   DBFF DFFF   0000DBFF 0000DFFF

UTF code unit sequences that do not round-trip from code
points. (Irregular code unit sequences):

m. 00010000   <=     F0 90 80 80         ----        ???
n. 0010FFFF   <=     F4 8F BF BF         ----        ???

[Commentary by Ken: The UTF-8s proposal reverses the
 sense of the irregular UTF-8 code unit sequences, making
 them regular for UTF-8s and making the regular UTF-8
 code unit sequences for supplementary characters *irregular*
 for UTF-8s. The proposal suffers the same nagging problem
 about what to do for UTF-32 for the odd cases of k, l, m, n.
 The UTF-32 *does* round-trip for k and l, but the UTF-8
 and UTF-16 do not. This leads to a conversion conundrum
 for UTF-32:

 <0000D800 0000DC00> => <U+D800, U+DC00> ==>
      <ED A0 80 ED B0 80> => U+10000 != <U+D800, U+DC00>

 Further note: To think about this Case the way Oracle does,
 recast everything in terms of UTF-8s <==> UTF-16 conversions.
 This vision of UTF-8s is really the extrapolation of the
 original UTF-2, as a transform on UCS-2, seeking not to
 special-case the handling of surrogate code units that
 were introduced in UTF-16. ]
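
Read that way, UTF-8s amounts to "convert to UTF-16 first, then encode
each 16-bit code unit as if it were a BMP character". A minimal Python
sketch of that reading follows; the function name utf8s_encode is
hypothetical, and the "surrogatepass" error handler is used only to get
the three-byte form of each surrogate code unit.

```python
def utf8s_encode(cp):
    """Encode one code point as UTF-8s under the Case III reading:
    split into UTF-16 code units, then encode each unit on its own."""
    if cp >= 0x10000:
        v = cp - 0x10000
        units = [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
    else:
        units = [cp]
    return b"".join(chr(u).encode("utf-8", "surrogatepass")
                    for u in units)

print(utf8s_encode(0x10000).hex())    # six bytes, not four
print(utf8s_encode(0x10FFFF).hex())
# Binary order now follows UTF-16: U+FFFF sorts after U+10000.
print(utf8s_encode(0xFFFF) > utf8s_encode(0x10000))
```

BMP code points come out exactly as in UTF-8; only the supplementary
characters change, from one four-byte sequence to two three-byte ones.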

===========================================================

Case IV. Code points U-0000D800..U-0000DFFF included
        in the UTF's, using UTF-8s and adding UTF-32s.
        "Let them order UTF-16 cake."

   code point        UTF-8s              UTF-16      UTF-32s

a. 00000000   <=>    00                  0000        00000000
b. 0000D7FF   <=>    ED 9F BF            D7FF        0000D7FF
g. 0000E000   <=>    EE 80 80            E000        0011E000
h. 0000FFFF   <=>    EF BF BF            FFFF        0011FFFF
i. 00010000   <=>    ED A0 80 ED B0 80   D800 DC00   00010000
j. 0010FFFF   <=>    ED AF BF ED BF BF   DBFF DFFF   0010FFFF

(and everything else follows the Oracle Case III.)

[Commentary by Ken: This one is *too* weird. UTF-32s
 now has the same binary order as UTF-16 and UTF-8s, but
 it breaks the numeric relationship between code point
 and UTF-32 code unit value, which is sure to break lots
 of code. Use of code unit values greater than 0x10FFFF would
 also break code that assumed the UTF-32 structure. Otherwise
 this has the same imprecision regarding irregular UTF-32
 for surrogate pairs as Case III.]
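
The UTF-32s column can be reproduced with a short Python sketch. The
mapping is inferred only from rows g and h of the table (U+E000..U+FFFF
shifted up by 0x110000); the function name utf32s_unit is hypothetical
and nothing here comes from a written specification.

```python
def utf32s_unit(cp):
    """Map a code point to its UTF-32s code unit value, assuming the
    only change is that U+E000..U+FFFF move above 0x10FFFF."""
    if 0xE000 <= cp <= 0xFFFF:
        return cp + 0x110000
    return cp

sample = [0x0000, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF]

# Sort once by UTF-32s value and once by UTF-16 code units.
by_utf32s = sorted(sample, key=utf32s_unit)
by_utf16 = sorted(sample, key=lambda cp: chr(cp).encode("utf-16-be"))

print(by_utf32s == by_utf16)              # the two orders agree
print(utf32s_unit(0xE000) == 0xE000)      # but value != code point
```

The first check shows the binary order now matches UTF-16; the second
shows the cost, since the code unit value no longer equals the code
point and exceeds 0x10FFFF.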

===========================================================

Case V. Code points U-0000D800..U-0000DFFF included
        in the UTF's, using UTF-16x. "Huh?"

   code point        UTF-8               UTF-16x     UTF-32

a. 00000000   <=>    00                  0000        00000000
b. 0000D7FF   <=>    ED 9F BF            D7FF        0000D7FF
g. 0000E000   <=>    EE 80 80            D800        0000E000
h. 0000FFFF   <=>    EF BF BF            F7FF        0000FFFF
i. 00010000   <=>    F0 90 80 80         F800 FC00   00010000
j. 0010FFFF   <=>    F4 8F BF BF         FBFF FFFF   0010FFFF

(And it isn't clear what else to do with this, as I
 haven't seen a complete specification yet.)

[Commentary by Ken: This one is *even* weirder, if
 I have interpreted what people have in mind. Mark already
 ruled it "impossible". While obtaining the goal of
 binary order compatibility between the three UTF's, it
 would trash interoperability with existing UTF-16 data and
 API's.]
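
For what it is worth, the rows shown in the table are consistent with
the following Python sketch. This is a guess from six data points, not
a specification: the function name utf16x_units is hypothetical, the
offsets are inferred from rows g through j, and the treatment of
U+D800..U+DFFF is left out because the table does not show it.

```python
def utf16x_units(cp):
    """Return the UTF-16x code units for one code point, assuming
    U+E000..U+FFFF shift down by 0x800 and the surrogate range
    moves to F800..FFFF."""
    if 0x0000 <= cp < 0xD800:
        return [cp]
    if 0xE000 <= cp <= 0xFFFF:
        return [cp - 0x800]
    if 0x10000 <= cp <= 0x10FFFF:
        v = cp - 0x10000
        return [0xF800 + (v >> 10), 0xFC00 + (v & 0x3FF)]
    raise ValueError("mapping not shown in the table")

sample = [0x10FFFF, 0xE000, 0x0000, 0x10000, 0xFFFF, 0xD7FF]

# Sorting by code units now gives plain code point order.
print(sorted(sample, key=utf16x_units) == sorted(sample))
# Existing UTF-16 data would be misread: D800 now means U+E000.
print(utf16x_units(0xE000) == [0xD800])
```

The second check illustrates the interoperability problem: a code unit
that is a high surrogate in UTF-16 is an ordinary character here.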

===========================================================

Case VI. "Ken's Horrible Vision of the Future with
    UTF-8 *and* UTF-8s"

   code point        UTF-8/8s            UTF-16      UTF-32

a. 00000000   <=>    00                  0000        00000000
b. 0000D7FF   <=>    ED 9F BF            D7FF        0000D7FF
g. 0000E000   <=>    EE 80 80            E000        0000E000
h. 0000FFFF   <=>    EF BF BF            FFFF        0000FFFF

   code point        UTF-8               UTF-16      UTF-32

i. 00010000   <=>    F0 90 80 80         D800 DC00   00010000
j. 0010FFFF   <=>    F4 8F BF BF         DBFF DFFF   0010FFFF

   code point        UTF-8s              UTF-16      UTF-32

i. 00010000   <=>    ED A0 80 ED B0 80   D800 DC00   00010000
j. 0010FFFF   <=>    ED AF BF ED BF BF   DBFF DFFF   0010FFFF

Round-tripping isolated surrogate code points:

   code point        UTF-8/8s            UTF-16      UTF-32

c. 0000D800   <=>    ED A0 80            D800        0000D800
d. 0000DBFF   <=>    ED AF BF            DBFF        0000DBFF
e. 0000DC00   <=>    ED B0 80            DC00        0000DC00
f. 0000DFFF   <=>    ED BF BF            DFFF        0000DFFF

Code point sequences that do not round-trip from UTF code
unit sequences. [Commentary by Ken: These also have to
map from irregular UTF-32 code unit sequences, as currently
defined.]:

   code point                UTF-8               UTF-32

k. 0000D800 0000DC00   =>    F0 90 80 80         0000D800 0000DC00
l. 0000DBFF 0000DFFF   =>    F4 8F BF BF         0000DBFF 0000DFFF

   code point                UTF-8s              UTF-32

k. 0000D800 0000DC00   =>    ED A0 80 ED B0 80   0000D800 0000DC00
l. 0000DBFF 0000DFFF   =>    ED AF BF ED BF BF   0000DBFF 0000DFFF

UTF code unit sequences that do not round-trip from code
points. (Irregular UTF-8/8s code unit sequences):

   code point        UTF-8

m. 00010000   <=     ED A0 80 ED B0 80
n. 0010FFFF   <=     ED AF BF ED BF BF

   code point        UTF-8s

m. 00010000   <=     F0 90 80 80
n. 0010FFFF   <=     F4 8F BF BF

[Commentary by Ken: All generic UTF-8 handlers will have
to be armed with the expectation that they may run into
supplementary characters encoded either as UTF-8 or as UTF-8s.
All processing of UTF-8 will necessitate normalization
between the two forms, to avoid inconsistencies, round-trip
failures, and security issues. The actual API's that people
want to write: UTF8toUTF16, UTF16toUTF8, UTF8toUTF32,
UTF32toUTF8, etc., will be greatly complicated by this
situation, compared to the situation for Case I, "The way
God intended it to be."]
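
The normalization step that every handler would need can be sketched in
Python. This is one possible approach under stated assumptions, not a
recommended design: the function name utf8s_to_utf8 is hypothetical,
and the sketch leans on the "surrogatepass" error handler of Python's
UTF-8 and UTF-16 codecs to let surrogate code units through.

```python
def utf8s_to_utf8(data):
    """Rewrite six-byte surrogate-pair sequences as four-byte UTF-8.
    Input may mix both forms; ordinary UTF-8 passes through as is."""
    # Step 1: decode, admitting three-byte surrogate encodings
    # as lone surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Step 2: go through UTF-16 so adjacent surrogates fuse into
    # a single supplementary code point.
    utf16 = text.encode("utf-16-le", "surrogatepass")
    fused = utf16.decode("utf-16-le", "surrogatepass")
    # Step 3: re-encode; any surrogate that stayed unpaired is
    # written back in its three-byte form.
    return fused.encode("utf-8", "surrogatepass")

mixed = bytes.fromhex("41" "EDA080EDB080" "F48FBFBF")
print(utf8s_to_utf8(mixed).hex())
```

Note that the sketch cannot tell a UTF-8s supplementary character from
two isolated surrogates that happen to be adjacent; both become one
four-byte sequence, which is exactly the round-trip ambiguity above.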

--Ken


