Re: And Visions of Sugar Plum UTF-8's Dance in Their Heads

From: Jianping Yang (Jianping.Yang@oracle.com)
Date: Tue Jun 12 2001 - 19:31:29 EDT


One thing that needs to be clarified here is that there is no four-byte
encoding in the UTF-8S proposal, and that a four-byte encoding is illegal
there, not merely irregular. As everything in UTF-8S is a perfect match to
UTF-16, any blame directed at this proposal also applies to the UTF-16
encoding form.
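
A rough sketch of that correspondence, in Python. The function name
utf8s_encode is made up for illustration and does not come from the proposal
text. The sketch runs each UTF-16 code unit through the ordinary one-to-three
byte UTF-8 pattern, then checks that bytewise order follows UTF-16 code unit
order:

```python
def utf8s_encode(text):
    """Hypothetical UTF-8s encoder: UTF-8-style bytes for each UTF-16 code unit."""
    utf16 = text.encode("utf-16-be", "surrogatepass")  # big-endian, no BOM
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # 'surrogatepass' lets the stock codec emit ED A0..BF xx for surrogates.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


samples = ["\uE000", "\U00010000", "\uD7FF", "\U0010FFFF", "\uFFFF"]
by_utf8s = sorted(samples, key=utf8s_encode)
by_utf16 = sorted(samples, key=lambda s: s.encode("utf-16-be"))
```

For these samples both sort orders come out as D7FF, 10000, 10FFFF, E000,
FFFF, and U+10000 is written as the six bytes ED A0 80 ED B0 80.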

Regards,
Jianping.

Kenneth Whistler wrote:

> Case I. Code points U-0000D800..U-0000DFFF excluded
> from the UTF's. "The way God intended it to be"
>
>    code point   UTF-8        UTF-16     UTF-32
>
> a. 00000000 <=> 00           0000       00000000
> b. 0000D7FF <=> ED 9F BF     D7FF       0000D7FF
> g. 0000E000 <=> EE 80 80     E000       0000E000
> h. 0000FFFF <=> EF BF BF     FFFF       0000FFFF
> i. 00010000 <=> F0 90 80 80  D800 DC00  00010000
> j. 0010FFFF <=> F4 8F BF BF  DBFF DFFF  0010FFFF
>
> [Commentary by Ken: UTF-16 does not define the same
> binary ordering as UTF-8 or UTF-32. Big whoop.]
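
The ordering difference Ken mentions shows up with nothing more than the
stock Python codecs. This is only a two-sample sketch built on rows h and i
of the table; the helper name order is made up for the example:

```python
bmp_high = "\uFFFF"            # row h: UTF-16 FFFF, UTF-8 EF BF BF
supplementary = "\U00010000"   # row i: UTF-16 D800 DC00, UTF-8 F0 90 80 80


def order(codec):
    """Sort the two samples by their encoded bytes in the given codec."""
    return sorted([supplementary, bmp_high], key=lambda s: s.encode(codec))


utf8_order = order("utf-8")
utf16_order = order("utf-16-be")   # big-endian so bytes compare like code units
utf32_order = order("utf-32-be")
```

UTF-8 and UTF-32 both put U+FFFF before U+10000, while UTF-16 puts the
surrogate pair D800 DC00 first.
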
>
> ===========================================================
>
> Case II. Code points U-0000D800..U-0000DFFF included
> in the UTF's. "Mark's hard look at the real
> world, where the angels have fallen."
> http://www.macchiato.com/utc/utf_comparison.htm
>
>    code point   UTF-8        UTF-16     UTF-32
>
> a. 00000000 <=> 00           0000       00000000
> b. 0000D7FF <=> ED 9F BF     D7FF       0000D7FF
> g. 0000E000 <=> EE 80 80     E000       0000E000
> h. 0000FFFF <=> EF BF BF     FFFF       0000FFFF
> i. 00010000 <=> F0 90 80 80  D800 DC00  00010000
> j. 0010FFFF <=> F4 8F BF BF  DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points (when not
> appropriately paired):
>
> c. 0000D800 <=> ED A0 80     D800       0000D800
> d. 0000DBFF <=> ED AF BF     DBFF       0000DBFF
> e. 0000DC00 <=> ED B0 80     DC00       0000DC00
> f. 0000DFFF <=> ED BF BF     DFFF       0000DFFF
>
> Code point sequences that do not round-trip from UTF code
> unit sequences. [Could be termed "irregular code point
> sequences" --Ken]:
>
> k. 0000D800 0000DC00 => F0 90 80 80 D800 DC00 00010000
> l. 0000DBFF 0000DFFF => F4 8F BF BF DBFF DFFF 0010FFFF
>
> UTF code unit sequences that do not round-trip from code
> points. (Irregular code unit sequences):
>
> m. 00010000 <= ED A0 80 ED B0 80 ---- 0000D800 0000DC00
> n. 0010FFFF <= ED AF BF ED BF BF ---- 0000DBFF 0000DFFF
>
> [Commentary by Ken: k and l are a real problem here,
> since the conditional handling of "surrogate code points",
> where they convert to a single UTF-32 code unit when isolated,
> but *also* convert to a single UTF-32 code unit when paired,
> breaks the 1-to-1 relationship, character==>code unit, implicit
> for UTF-32. m and n have the same problem in reverse for UTF-32.
> I don't think either can be considered a correct specification
> for UTF-32.]
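
The k/l problem can be reproduced through UTF-16 with Python's
"surrogatepass" error handler, which lets surrogate code points through the
stock codec. This is a sketch of the effect, not of any particular converter;
the helper name via_utf16 is made up:

```python
def via_utf16(code_points):
    """Encode code points to UTF-16 and decode again, returning code points."""
    text = "".join(chr(cp) for cp in code_points)
    utf16 = text.encode("utf-16-be", "surrogatepass")
    return [ord(ch) for ch in utf16.decode("utf-16-be", "surrogatepass")]


isolated = via_utf16([0xD800])          # row c: survives the round trip
paired = via_utf16([0xD800, 0xDC00])    # row k: collapses to one code point
```

The isolated surrogate comes back as itself, but the pair comes back as the
single code point 0x10000, so the two-code-point input is lost.
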
>
> ===========================================================
>
> Case III. Code points U-0000D800..U-0000DFFF included
> in the UTF's, using UTF-8s "The vision provided
> by the Oracle."
>
>    code point   UTF-8s             UTF-16     UTF-32
>
> a. 00000000 <=> 00                 0000       00000000
> b. 0000D7FF <=> ED 9F BF           D7FF       0000D7FF
> g. 0000E000 <=> EE 80 80           E000       0000E000
> h. 0000FFFF <=> EF BF BF           FFFF       0000FFFF
> i. 00010000 <=> ED A0 80 ED B0 80  D800 DC00  00010000
> j. 0010FFFF <=> ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points:
>
> c. 0000D800 <=> ED A0 80     D800       0000D800
> d. 0000DBFF <=> ED AF BF     DBFF       0000DBFF
> e. 0000DC00 <=> ED B0 80     DC00       0000DC00
> f. 0000DFFF <=> ED BF BF     DFFF       0000DFFF
>
> Code point sequences that do not round-trip from all UTF code
> unit sequences. (Could be termed "irregular code point
> sequences" --Ken):
>
> k. 0000D800 0000DC00 => ED A0 80 ED B0 80 D800 DC00 0000D800 0000DC00
> l. 0000DBFF 0000DFFF => ED AF BF ED BF BF DBFF DFFF 0000DBFF 0000DFFF
>
> UTF code unit sequences that do not round-trip from code
> points. (Irregular code unit sequences):
>
> m. 00010000 <= F0 90 80 80 ---- ???
> n. 0010FFFF <= F4 8F BF BF ---- ???
>
> [Commentary by Ken: The UTF-8s proposal reverses the
> sense of the irregular UTF-8 code unit sequences, making
> them regular for UTF-8s and making the regular UTF-8
> code unit sequences for supplementary characters *irregular*
> for UTF-8s. The proposal suffers the same nagging problem
> about what to do for UTF-32 for the odd cases of k, l, m, n.
> The UTF-32 *does* round-trip for k and l, but the UTF-8
> and UTF-16 do not. This leads to a conversion conundrum
> for UTF-32:
>
> <0000D800 0000DC00> => <U+D800, U+DC00> ==>
> <ED A0 80 ED B0 80> => U+10000 != <U+D800, U+DC00>
>
> Further note: To think about this Case the way Oracle does,
> recast everything in terms of UTF-8s <==> UTF-16 conversions.
> This vision of UTF-8s is really the extrapolation of the
> original UTF-2, as a transform on UCS-2, seeking not to
> special-case the handling of surrogate code units that
> were introduced in UTF-16. ]
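
The conundrum can be made concrete with a small decoder that follows the
"UTF-8s <==> UTF-16" reading described above. The name utf8s_decode and the
choice to raise ValueError on four-byte sequences are assumptions made for
this sketch, not part of any published specification:

```python
def utf8s_decode(data):
    """Hypothetical UTF-8s decoder: bytes -> UTF-16 code units -> characters."""
    if any(b >= 0xF0 for b in data):
        raise ValueError("four-byte sequences are not part of UTF-8s")
    # Step 1: read each 1-3 byte sequence as one 16-bit unit (surrogates allowed).
    units = data.decode("utf-8", "surrogatepass")
    # Step 2: interpret those units as UTF-16, which pairs up surrogates.
    utf16 = units.encode("utf-16-be", "surrogatepass")
    return utf16.decode("utf-16-be", "surrogatepass")


pair_as_code_points = [0xD800, 0xDC00]              # row k, code point side
utf8s_bytes = bytes.fromhex("ED A0 80 ED B0 80")    # row k, UTF-8s side
decoded = [ord(ch) for ch in utf8s_decode(utf8s_bytes)]
```

Here decoded is [0x10000], which is not the two surrogate code points that
row k started from.
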
>
> ===========================================================
>
> Case IV. Code points U-0000D800..U-0000DFFF included
> in the UTF's, using UTF-8s and adding UTF-32s.
> "Let them order UTF-16 cake."
>
>    code point   UTF-8s             UTF-16     UTF-32s
>
> a. 00000000 <=> 00                 0000       00000000
> b. 0000D7FF <=> ED 9F BF           D7FF       0000D7FF
> g. 0000E000 <=> EE 80 80           E000       0011E000
> h. 0000FFFF <=> EF BF BF           FFFF       0011FFFF
> i. 00010000 <=> ED A0 80 ED B0 80  D800 DC00  00010000
> j. 0010FFFF <=> ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> (and everything else follows the Oracle Case III.)
>
> [Commentary by Ken: This one is *too* weird. UTF-32s
> now has the same binary order as UTF-16 and UTF-8s, but
> it breaks the numeric relationship between code point
> and UTF-32 code unit value, which is sure to break lots
> of code. Use of code unit values greater than 0x10FFFF would
> also break code that assumed the UTF-32 structure. Otherwise
> this has the same imprecision regarding irregular UTF-32
> for surrogate pairs as Case III.]
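
No rule for UTF-32s is spelled out above, so the following is only a guess
inferred from rows g and h: code points E000..FFFF are shifted up by
0x110000 and everything else is left alone. The function name utf32s_unit is
made up, and surrogate code points are passed through untouched because the
table does not say what happens to them:

```python
def utf32s_unit(cp):
    """Hypothetical UTF-32s code unit, inferred from rows g and h only."""
    if 0xE000 <= cp <= 0xFFFF:
        return cp + 0x110000      # e.g. E000 -> 0011E000, FFFF -> 0011FFFF
    return cp                     # rows a, b, i, j are unchanged


samples = [0xFFFF, 0x10000, 0xD7FF, 0x10FFFF, 0xE000]
by_utf32s = sorted(samples, key=utf32s_unit)
by_utf16 = sorted(samples, key=lambda cp: chr(cp).encode("utf-16-be"))
largest_unit = max(utf32s_unit(cp) for cp in samples)
```

Under that guess the UTF-32s order matches the UTF-16 order, and the largest
code unit is 0x11FFFF, above the 0x10FFFF ceiling that ordinary UTF-32 code
expects.
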
>
> ===========================================================
>
> Case V. Code points U-0000D800..U-0000DFFF included
> in the UTF's, using UTF-16x. "Huh?"
>
>    code point   UTF-8        UTF-16x    UTF-32
>
> a. 00000000 <=> 00           0000       00000000
> b. 0000D7FF <=> ED 9F BF     D7FF       0000D7FF
> g. 0000E000 <=> EE 80 80     D800       0000E000
> h. 0000FFFF <=> EF BF BF     F7FF       0000FFFF
> i. 00010000 <=> F0 90 80 80  F800 FC00  00010000
> j. 0010FFFF <=> F4 8F BF BF  FBFF FFFF  0010FFFF
>
> (And it is unclear what else to do with this, as I
> haven't seen a complete specification yet.)
>
> [Commentary by Ken: This one is *even* weirder, if
> I have interpreted what people have in mind. Mark already
> ruled it "impossible". While obtaining the goal of
> binary order compatibility between the three UTF's, it
> would trash interoperability with existing UTF-16 data and
> API's.]
>
> ===========================================================
>
> Case VI. "Ken's Horrible Vision of the Future with
> UTF-8 *and* UTF-8s"
>
>    code point   UTF-8/8s     UTF-16     UTF-32
>
> a. 00000000 <=> 00           0000       00000000
> b. 0000D7FF <=> ED 9F BF     D7FF       0000D7FF
> g. 0000E000 <=> EE 80 80     E000       0000E000
> h. 0000FFFF <=> EF BF BF     FFFF       0000FFFF
>
>    code point   UTF-8        UTF-16     UTF-32
>
> i. 00010000 <=> F0 90 80 80  D800 DC00  00010000
> j. 0010FFFF <=> F4 8F BF BF  DBFF DFFF  0010FFFF
>
>    code point   UTF-8s             UTF-16     UTF-32
>
> i. 00010000 <=> ED A0 80 ED B0 80  D800 DC00  00010000
> j. 0010FFFF <=> ED AF BF ED BF BF  DBFF DFFF  0010FFFF
>
> Round-tripping isolated surrogate code points:
>
>    code point   UTF-8/8s     UTF-16     UTF-32
>
> c. 0000D800 <=> ED A0 80     D800       0000D800
> d. 0000DBFF <=> ED AF BF     DBFF       0000DBFF
> e. 0000DC00 <=> ED B0 80     DC00       0000DC00
> f. 0000DFFF <=> ED BF BF     DFFF       0000DFFF
>
> Code point sequences that do not round-trip from UTF code
> unit sequences. [Commentary by Ken: These also have to
> map from irregular UTF-32 code unit sequences, as currently
> defined.]:
>
>    code point           UTF-8        UTF-32
>
> k. 0000D800 0000DC00 => F0 90 80 80  0000D800 0000DC00
> l. 0000DBFF 0000DFFF => F4 8F BF BF  0000DBFF 0000DFFF
>
>    code point           UTF-8s             UTF-32
>
> k. 0000D800 0000DC00 => ED A0 80 ED B0 80  0000D800 0000DC00
> l. 0000DBFF 0000DFFF => ED AF BF ED BF BF  0000DBFF 0000DFFF
>
> UTF code unit sequences that do not round-trip from code
> points. (Irregular UTF-8/8s code unit sequences):
>
>    code point  UTF-8
>
> m. 00010000 <= ED A0 80 ED B0 80
> n. 0010FFFF <= ED AF BF ED BF BF
>
>    code point  UTF-8s
>
> m. 00010000 <= F0 90 80 80
> n. 0010FFFF <= F4 8F BF BF
>
> [Commentary by Ken: All generic UTF-8 handlers will have
> to be armed with the expectation that they may run into
> supplementary characters encoded either as UTF-8 or as UTF-8s.
> All processing of UTF-8 will necessitate normalization
> between the two forms, to avoid inconsistencies, round-trip
> failures, and security issues. The actual API's that people
> want to write: UTF8toUTF16, UTF16toUTF8, UTF8toUTF32,
> UTF32toUTF8, etc., will be greatly complicated by this
> situation, compared to the situation for Case I, "The way
> God intended it to be."]
>
> --Ken
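
The normalization step Ken describes for Case VI might look roughly like the
following. This is a sketch under stated assumptions: the function name
normalize_to_utf8 is made up, and rejecting unpaired surrogates in the final
step is a choice made here, not something the thread settles:

```python
def normalize_to_utf8(data):
    """Accept UTF-8 or UTF-8s input and return standard UTF-8 bytes."""
    # Read the bytes leniently: surrogates encoded as ED A0..BF xx are let through.
    units = data.decode("utf-8", "surrogatepass")
    # Pass through UTF-16 so that any high/low surrogate pair becomes one character.
    utf16 = units.encode("utf-16-be", "surrogatepass")
    text = utf16.decode("utf-16-be", "surrogatepass")
    # Strict encode: a leftover unpaired surrogate raises UnicodeEncodeError.
    return text.encode("utf-8")


from_utf8 = normalize_to_utf8(bytes.fromhex("F0 90 80 80"))
from_utf8s = normalize_to_utf8(bytes.fromhex("ED A0 80 ED B0 80"))
```

Both inputs come out as F0 90 80 80. A comparison on the raw bytes would have
treated them as different strings, which is where the inconsistency and
security worries come from.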




