Re: UTF8 vs AL32UTF8

From: Mark Davis (mark@macchiato.com)
Date: Tue Jun 12 2001 - 11:29:26 EDT


That would be viewing history through the prism of present thought.

When applying UTF-8 -- as originally designed -- the sequence 0000D800
0000DC00 would transform into a 6-byte sequence, and transforming back would
yield the original sequence 0000D800 0000DC00. When applying this to
Unicode (16-bit only, at the time), it would take D800 DC00 to the 6-byte
sequence and back.
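A minimal Python sketch of that original transform (my own illustration, not code from any standard or library): each 16-bit value, surrogates included, was encoded independently, so D800 DC00 came out as six bytes:

```python
def encode_16bit_utf8(code_units):
    """Encode 16-bit code units the way the pre-UTF-16 UTF-8 did:
    each unit independently, with no special treatment of D800-DFFF."""
    out = bytearray()
    for u in code_units:
        if u < 0x80:
            out.append(u)                          # 1-byte form
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6),         # 2-byte form
                          0x80 | (u & 0x3F)])
        else:                                      # 3-byte form; surrogates
            out += bytes([0xE0 | (u >> 12),        # were just 16-bit values
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)

print(encode_16bit_utf8([0xD800, 0xDC00]).hex(' '))
# ed a0 80 ed b0 80  -- the 6-byte sequence
```

Transforming those six bytes back, code unit by code unit, reproduces D800 DC00 exactly.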

Only after UTF-16 was designed were the definitions changed so that the
16-bit sequence D800 DC00 would transform into a 4-byte sequence in UTF-8.
UTF-16 was designed to be as interoperable as possible with the past, but
this was definitely a change.
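The current behaviour is easy to check with any modern Unicode library. A sketch using Python's built-in codecs (the codec names are Python's, not part of the standard): the surrogate pair D800 DC00, read as UTF-16, now denotes U+10000, whose UTF-8 form is four bytes:

```python
# D800 DC00 as big-endian UTF-16: a surrogate pair for U+10000.
utf16_bytes = b'\xd8\x00\xdc\x00'
text = utf16_bytes.decode('utf-16-be')
assert ord(text) == 0x10000

# Under the post-UTF-16 definition, UTF-8 encodes the code point
# directly, not the two surrogate code units.
print(text.encode('utf-8').hex(' '))
# f0 90 80 80
```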

> The only sensible interpretation of
> the definitions of Unicode is that UTF-8 maps exactly one coded character
> to exactly one code unit sequence

This is not correct. The most obvious point is that UTFs also map unassigned
code points (such as U+0220) that are not coded characters. Yours is not the
only possible "sensible" interpretation.

The minimal formal requirement is that a UTF map each sequence of code
points in its domain to a unique sequence of bytes, and map any sequence of
bytes that it generates back to a sequence of code points. The definition
does allow a UTF to map other byte sequences back to code points, and there
is some dispute about the precise domain -- whether to exclude
surrogate/noncharacter code points or not.
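As an illustration of that round-trip requirement (again my own sketch, using Python's 'utf-8' codec, whose domain is the scalar values, i.e. surrogate code points excluded):

```python
# A UTF must map each code-point sequence in its domain to a unique
# byte sequence, and map any byte sequence it generates back to the
# original code points.
code_points = [0x0041, 0x0220, 0x10000]  # U+0220 was unassigned in 2001
s = ''.join(map(chr, code_points))

encoded = s.encode('utf-8')        # code points -> bytes
decoded = encoded.decode('utf-8')  # bytes -> code points

assert [ord(c) for c in decoded] == code_points  # round-trip holds
```

Note that U+0220, though unassigned at the time, is in the domain and round-trips like any other code point.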

Mark

----- Original Message -----
From: <Peter_Constable@sil.org>
To: <unicode@unicode.org>
Sent: Tuesday, June 12, 2001 00:39
Subject: Re: UTF8 vs AL32UTF8

>
> >In other words, Oracle has an alternate solution here for 9i -- they can
> >simply explain that the old product defined the old pre-surrogate UTF-8
> and
> >the new product is now surrogate aware and uses the current definition.
>
> There's a mistake being made here that has been made repeatedly throughout
> our discussion: that's to assume that there are two kinds of UTF-8: the
> original, in which the code unit sequence < ED A0 80 ED B0 80 > meant the
> coded character sequence < U-0000D800, U-0000DC00 >, and the new UTF-8 in
> which this sequence means U-00010000. The only sensible interpretation of
> the definitions of Unicode is that UTF-8 maps exactly one coded character
> to exactly one code unit sequence. As far as I know, the UTF-8 mapping
> hasn't changed; all that has changed are the range of USVs that are mapped
> into it, and the introduction of some terms like "irregular".
>
> The sequence < ED A0 80 ED B0 80 > is *only* achievable in conformant
> software [1] when a process translating between encoding forms
> (specifically from UTF-16 to UTF-8)
>
> (i) is doing a direct translation between the surface forms (as opposed to
> a meaning-based translation in which you map code units to characters and
> then characters to other code units),
> (ii) on encountering an unpaired high surrogate code unit at the end of a
> stream maps that to the irregular sequence < ED A0 80 >, or on
> encountering an unpaired low surrogate code unit at the start of a stream
> maps that to the irregular sequence < ED B0 80 >, and
> (iii) a subsequent process is reassembling portions of the total stream,
> and the concatenation results in < ED A0 80 ED B0 80 >.
>
> This is still an irregular sequence, and strictly speaking is not
> interpretable UTF-8. However, the Standard allows a process to "work out
> what was meant" and replace that with a well-formed UTF-8 sequence that
> corresponded to the same interpretation as the original UTF-16 sequence. It
> also allows the process to toss it.
>
> This is kind of like a two year old saying, "mfrxopebh" and the parent
> saying, "What she said was, 'Mommy hug baby'". The two year old was not
> pronouncing English in anything approaching a conformant way, and if my
> neighbor expressed that same utterance to me, I would not for an instant
> consider giving it any interpretation. But the mother was acting in a
> highly specialised and extraneous circumstance, as far as language usage
> goes, and was able to "work out what was intended". Nominally, though, that
> utterance is completely void of meaning in English. Likewise, the 6-byte
> sequence above is nominally meaningless with respect to the Unicode coded
> character set. Given special circumstances, it is considered acceptable for
> a process to infer a meaning to it and replace it with a well-formed
> utterance (sequence), but apart from those circumstances the process could
> ignore it with total impunity.
>
> What seems to have happened somewhere along the way is that baby talk
> counts as the real thing, or is as fully meaningful as the real thing. It
> isn't. That mother probably wouldn't understand similar utterances coming
> from *your* two-year-old; don't require her or me to make sense of your
> software if it's not speaking the real thing.
>
>
>
> - Peter
>
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>
>
>
> [1] The sequence may have been possible to generate at one time, but that
> difference has not resulted from a change in the algorithm that defines it;
> it is only due to the change of the set of input values.
>
>
>
>



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT