And Visions of Sugar Plum UTF-8's Dance in Their Heads

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 12 2001 - 18:16:43 EDT

Case I. Code points U-0000D800..U-0000DFFF excluded
from the UTF's. "The way God intended it to be"

    code point              UTF-8              UTF-16     UTF-32

a.  00000000           <=>  00                 0000       00000000
b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
g.  0000E000           <=>  EE 80 80           E000       0000E000
h.  0000FFFF           <=>  EF BF BF           FFFF       0000FFFF
i.  00010000           <=>  F0 90 80 80        D800 DC00  00010000
j.  0010FFFF           <=>  F4 8F BF BF        DBFF DFFF  0010FFFF

[Commentary by Ken: UTF-16 does not define the same
binary ordering as UTF-8 or UTF-32. Big whoop.]
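The Case I rows and the ordering remark can be checked with Python's built-in codecs. This is only a minimal sketch: the helper name `encodings` and the choice of big-endian byte order are assumptions made for illustration, not something taken from this thread.

```python
def encodings(cp):
    """Encode one code point as (UTF-8, UTF-16BE, UTF-32BE) byte strings."""
    ch = chr(cp)
    return ch.encode("utf-8"), ch.encode("utf-16-be"), ch.encode("utf-32-be")

# Rows a, b, g, h, i, j of Case I.
for cp in (0x0000, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF):
    u8, u16, u32 = encodings(cp)
    print(f"{cp:08X}", u8.hex(" ").upper(), "|", u16.hex().upper(), "|", u32.hex().upper())

# Compare U+FFFF with U+10000 byte-wise in each encoding form.
low, high = encodings(0xFFFF), encodings(0x10000)
utf8_keeps_order = low[0] < high[0]
utf16_keeps_order = low[1] < high[1]   # FFFF sorts after D800 DC00
utf32_keeps_order = low[2] < high[2]
print(utf8_keeps_order, utf16_keeps_order, utf32_keeps_order)
```

Only the UTF-16 comparison comes out the other way, because the surrogate code units D800..DFFF sit below E000..FFFF.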

===========================================================

Case II. Code points U-0000D800..U-0000DFFF included
        in the UTF's. "Mark's hard look at the real
        world, where the angels have fallen."
        http://www.macchiato.com/utc/utf_comparison.htm

    code point              UTF-8              UTF-16     UTF-32

Round-tripping isolated surrogate code points (when not
appropriately paired):

c.  0000D800           <=>  ED A0 80           D800       0000D800
d.  0000DBFF           <=>  ED AF BF           DBFF       0000DBFF
e.  0000DC00           <=>  ED B0 80           DC00       0000DC00
f.  0000DFFF           <=>  ED BF BF           DFFF       0000DFFF

Code point sequences that do not round-trip from UTF code
unit sequences. [Could be termed "irregular code point
sequences" --Ken]:

k.  0000D800 0000DC00  =>   F0 90 80 80        D800 DC00  00010000
l.  0000DBFF 0000DFFF  =>   F4 8F BF BF        DBFF DFFF  0010FFFF

UTF code unit sequences that do not round-trip from code
points. (Irregular code unit sequences):

m.  00010000           <=   ED A0 80 ED B0 80  ----       0000D800 0000DC00
n.  0010FFFF           <=   ED AF BF ED BF BF  ----       0000DBFF 0000DFFF

[Commentary by Ken: k and l are a real problem here,
since the conditional handling of "surrogate code points",
where they convert to a single UTF-32 code unit when isolated,
but *also* convert to a single UTF-32 code unit when paired,
breaks the 1-to-1 relationship, character==>code unit, implicit
for UTF-32. m and n have the same problem in reverse for UTF-32.
I don't think either can be considered a correct specification
for UTF-32.]
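The asymmetry between rows c and k can be reproduced in Python, whose `surrogatepass` error handler lets the codecs carry surrogate code points much as Case II does. This is a sketch of the behaviour under that assumption, not a statement about how any particular converter in this discussion is implemented; the variable names are illustrative.

```python
# Row c: an isolated surrogate code point round-trips through 3-byte UTF-8.
lone = "\ud800"
lone_utf8 = lone.encode("utf-8", "surrogatepass")
lone_back = lone_utf8.decode("utf-8", "surrogatepass")

# Row k: two surrogate code points written out as UTF-16 code units ...
pair = "\ud800\udc00"
pair_utf16 = pair.encode("utf-16-be", "surrogatepass")
# ... are read back as one supplementary character, so k does not round-trip.
merged = pair_utf16.decode("utf-16-be")

# Python's default (strict) UTF-8 decoder follows Case I and rejects ED A0 80.
try:
    lone_utf8.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False
```

After the UTF-16 trip, `pair` (two code points) has become `merged` (one code point), which is exactly the break in the 1-to-1 relationship described above.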

===========================================================

Case III. Code points U-0000D800..U-0000DFFF included
in the UTF's, using UTF-8s. "The vision provided
by the Oracle."

    code point              UTF-8s             UTF-16     UTF-32

a.  00000000           <=>  00                 0000       00000000
b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
g.  0000E000           <=>  EE 80 80           E000       0000E000
h.  0000FFFF           <=>  EF BF BF           FFFF       0000FFFF
i.  00010000           <=>  ED A0 80 ED B0 80  D800 DC00  00010000
j.  0010FFFF           <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF

Round-tripping isolated surrogate code points:

c.  0000D800           <=>  ED A0 80           D800       0000D800
d.  0000DBFF           <=>  ED AF BF           DBFF       0000DBFF
e.  0000DC00           <=>  ED B0 80           DC00       0000DC00
f.  0000DFFF           <=>  ED BF BF           DFFF       0000DFFF

Code point sequences that do not round-trip from all UTF code
unit sequences. (Could be termed "irregular code point
sequences" --Ken):

k.  0000D800 0000DC00  =>   ED A0 80 ED B0 80  D800 DC00  0000D800 0000DC00
l.  0000DBFF 0000DFFF  =>   ED AF BF ED BF BF  DBFF DFFF  0000DBFF 0000DFFF

UTF code unit sequences that do not round-trip from code
points. (Irregular code unit sequences):

m.  00010000           <=   F0 90 80 80        ----       ???
n.  0010FFFF           <=   F4 8F BF BF        ----       ???

[Commentary by Ken: The UTF-8s proposal reverses the
sense of the irregular UTF-8 code unit sequences, making
them regular for UTF-8s and making the regular UTF-8
code unit sequences for supplementary characters *irregular*
for UTF-8s. The proposal suffers the same nagging problem
about what to do for UTF-32 for the odd cases of k, l, m, n.
The UTF-32 *does* round-trip for k and l, but the UTF-8s
and UTF-16 do not. This leads to a conversion conundrum
for UTF-32:

<0000D800 0000DC00> => <U+D800, U+DC00> ==>
<ED A0 80 ED B0 80> => U+10000 != <U+D800, U+DC00>

Further note: To think about this Case the way Oracle does,
recast everything in terms of UTF-8s <==> UTF-16 conversions.
This vision of UTF-8s is really the extrapolation of the
original UTF-2, as a transform on UCS-2, seeking not to
special-case the handling of surrogate code units that
were introduced in UTF-16. ]
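Reading UTF-8s as "the UTF-8 bit pattern applied to each UTF-16 code unit" gives a short sketch of the conversion. The function names `utf8s_encode` and `utf8s_decode` are hypothetical, and this is one reading of the proposal rather than its specification. In particular, rows m and n are left open ("???") above; this sketch does not settle them and simply passes 4-byte input through Python's decoder, which is an assumption.

```python
def utf8s_encode(text):
    """Apply the UTF-8 bit pattern to each UTF-16 code unit of text."""
    utf16 = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # Surrogate code units get the ordinary 3-byte form (ED A0 80 .. ED BF BF).
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

def utf8s_decode(data):
    """Read 3-byte surrogate forms as UTF-16 code units, then pair them up."""
    units = data.decode("utf-8", "surrogatepass")
    utf16 = units.encode("utf-16-be", "surrogatepass")
    return utf16.decode("utf-16-be", "surrogatepass")

supplementary = utf8s_encode("\U00010000")            # row i
pair_in = "\ud800\udc00"                              # row k
pair_out = utf8s_decode(utf8s_encode(pair_in))        # comes back as U+10000
```

The last line is the conversion conundrum in miniature: two surrogate code points go in, and one supplementary character comes out.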

===========================================================

Case IV. Code points U-0000D800..U-0000DFFF included
in the UTF's, using UTF-8s and adding UTF-32s.
"Let them order UTF-16 cake."

    code point              UTF-8s             UTF-16     UTF-32s

a.  00000000           <=>  00                 0000       00000000
b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
g.  0000E000           <=>  EE 80 80           E000       0011E000
h.  0000FFFF           <=>  EF BF BF           FFFF       0011FFFF
i.  00010000           <=>  ED A0 80 ED B0 80  D800 DC00  00010000
j.  0010FFFF           <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF

(and everything else follows the Oracle Case III.)

[Commentary by Ken: This one is *too* weird. UTF-32s
now has the same binary order as UTF-16 and UTF-8s, but
it breaks the numeric relationship between code point
and UTF-32 code unit value, which is sure to break lots
of code. Use of code unit values greater than 0x10FFFF would
also break code that assumed the UTF-32 structure. Otherwise
this has the same imprecision regarding irregular UTF-32
for surrogate pairs as Case III.]
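The trade-off can be seen in a few lines. The offset below is inferred from rows g and h of the table (0000E000 becomes 0011E000) and is an assumption about what a UTF-32s would do; `utf32s_unit` and `utf16_units` are hypothetical helper names.

```python
def utf32s_unit(cp):
    """Map a code point to a UTF-32s code unit, per rows g and h (assumed rule)."""
    if 0xE000 <= cp <= 0xFFFF:
        return cp + 0x110000      # pushed above 0x10FFFF so it sorts last
    return cp

def utf16_units(cp):
    """Return the UTF-16 code units for one code point."""
    data = chr(cp).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

points = [0x0000, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF]
by_utf16 = sorted(points, key=utf16_units)
by_utf32s = sorted(points, key=utf32s_unit)
same_order = by_utf16 == by_utf32s
breaks_identity = utf32s_unit(0xE000) != 0xE000
```

The two sort orders agree, but only because the code unit for U+E000 is no longer numerically equal to the code point and now exceeds 0x10FFFF.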

===========================================================

Case V. Code points U-0000D800..U-0000DFFF included
in the UTF's, using UTF-16x. "Huh?"

    code point              UTF-8              UTF-16x    UTF-32

a.  00000000           <=>  00                 0000       00000000
b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
g.  0000E000           <=>  EE 80 80           D800       0000E000
h.  0000FFFF           <=>  EF BF BF           F7FF       0000FFFF
i.  00010000           <=>  F0 90 80 80        F800 FC00  00010000
j.  0010FFFF           <=>  F4 8F BF BF        FBFF FFFF  0010FFFF

(And it isn't clear what else to do with this, as I
haven't seen a complete specification yet.)

[Commentary by Ken: This one is *even* weirder, if
I have interpreted what people have in mind. Mark already
ruled it "impossible". While obtaining the goal of
binary order compatibility between the three UTF's, it
would trash interoperability with existing UTF-16 data and
API's.]
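Since no complete specification is available, the following is purely a guess that reproduces the six table rows: E000..FFFF shifted down by 0x800, and the surrogate ranges moved to F800..FBFF and FC00..FFFF. Every rule in it, and the name `utf16x_units`, is an assumption made for illustration.

```python
def utf16x_units(cp):
    """Guess at UTF-16x code units, inferred only from the table rows above."""
    if cp < 0xD800:
        return [cp]
    if 0xE000 <= cp <= 0xFFFF:
        return [cp - 0x0800]                          # E000..FFFF -> D800..F7FF
    if 0x10000 <= cp <= 0x10FFFF:
        v = cp - 0x10000
        return [0xF800 + (v >> 10), 0xFC00 + (v & 0x3FF)]
    # The table says nothing about surrogate code points under UTF-16x.
    raise ValueError("not covered by the table")

points = [0x0000, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF]
code_point_order = sorted(points, key=utf16x_units) == sorted(points)
```

Under these assumed rules the code unit order matches code point order, at the price of every existing UTF-16 string above D7FF meaning something different.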

===========================================================

Case VI. "Ken's Horrible Vision of the Future with
UTF-8 *and* UTF-8s"

    code point              UTF-8/8s           UTF-16     UTF-32

a.  00000000           <=>  00                 0000       00000000
b.  0000D7FF           <=>  ED 9F BF           D7FF       0000D7FF
g.  0000E000           <=>  EE 80 80           E000       0000E000
h.  0000FFFF           <=>  EF BF BF           FFFF       0000FFFF

    code point              UTF-8              UTF-16     UTF-32

i.  00010000           <=>  F0 90 80 80        D800 DC00  00010000
j.  0010FFFF           <=>  F4 8F BF BF        DBFF DFFF  0010FFFF

    code point              UTF-8s             UTF-16     UTF-32

i.  00010000           <=>  ED A0 80 ED B0 80  D800 DC00  00010000
j.  0010FFFF           <=>  ED AF BF ED BF BF  DBFF DFFF  0010FFFF

Round-tripping isolated surrogate code points:

    code point              UTF-8/8s           UTF-16     UTF-32

c.  0000D800           <=>  ED A0 80           D800       0000D800
d.  0000DBFF           <=>  ED AF BF           DBFF       0000DBFF
e.  0000DC00           <=>  ED B0 80           DC00       0000DC00
f.  0000DFFF           <=>  ED BF BF           DFFF       0000DFFF

Code point sequences that do not round-trip from UTF code
unit sequences. [Commentary by Ken: These also have to
map from irregular UTF-32 code unit sequences, as currently
defined.]:

    code point              UTF-8                         UTF-32

k.  0000D800 0000DC00  =>   F0 90 80 80                   0000D800 0000DC00
l.  0000DBFF 0000DFFF  =>   F4 8F BF BF                   0000DBFF 0000DFFF

    code point              UTF-8s

k.  0000D800 0000DC00  =>   ED A0 80 ED B0 80             0000D800 0000DC00
l.  0000DBFF 0000DFFF  =>   ED AF BF ED BF BF             0000DBFF 0000DFFF

UTF code unit sequences that do not round-trip from code
points. (Irregular UTF-8/8s code unit sequences):

    code point              UTF-8

m.  00010000           <=   ED A0 80 ED B0 80
n.  0010FFFF           <=   ED AF BF ED BF BF

    code point              UTF-8s

m.  00010000           <=   F0 90 80 80
n.  0010FFFF           <=   F4 8F BF BF

[Commentary by Ken: All generic UTF-8 handlers will have
to be armed with the expectation that they may run into
supplementary characters encoded either as UTF-8 or as UTF-8s.
All processing of UTF-8 will necessitate normalization
between the two forms, to avoid inconsistencies, round-trip
failures, and security issues. The actual API's that people
want to write: UTF8toUTF16, UTF16toUTF8, UTF8toUTF32,
UTF32toUTF8, etc., will be greatly complicated by this
situation, compared to the situation for Case I, "The way
God intended it to be."]
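A sketch of the normalization step that a generic handler would need in this scenario, again leaning on Python's `surrogatepass` handler. The name `to_standard_utf8` is hypothetical, and the choice to normalize toward the 4-byte form is an assumption; a real converter would also have to decide what to do with unpaired surrogates, which this sketch simply passes through.

```python
def to_standard_utf8(data):
    """Accept UTF-8 or UTF-8s bytes and return the 4-byte UTF-8 form."""
    text = data.decode("utf-8", "surrogatepass")          # 6-byte forms become surrogates
    utf16 = text.encode("utf-16-be", "surrogatepass")
    paired = utf16.decode("utf-16-be", "surrogatepass")   # surrogate pairs are merged
    return paired.encode("utf-8", "surrogatepass")

as_utf8 = bytes.fromhex("F0908080")        # U+10000 in UTF-8
as_utf8s = bytes.fromhex("EDA080EDB080")   # U+10000 in UTF-8s
differ_as_bytes = as_utf8 != as_utf8s
agree_after_normalizing = to_standard_utf8(as_utf8) == to_standard_utf8(as_utf8s)
```

Without such a step, a byte-wise comparison or filter sees two different strings for the same character, which is where the round-trip and security concerns come from.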

--Ken
