Re: UTF8 vs AL32UTF8

From: Peter_Constable@sil.org
Date: Tue Jun 12 2001 - 03:39:44 EDT


>In other words, Oracle has an alternate solution here for 9i -- they can
>simply explain that the old product defined the old pre-surrogate UTF-8
>and the new product is now surrogate aware and uses the current definition.

There's a mistake being made here that has been made repeatedly throughout
our discussion: that's to assume that there are two kinds of UTF-8: the
original, in which the code unit sequence < ED A0 80 ED B0 80 > meant the
coded character sequence < U-0000D800, U-0000DC00 >, and the new UTF-8 in
which this sequence means U-00010000. The only sensible interpretation of
the definitions of Unicode is that UTF-8 maps exactly one coded character
to exactly one code unit sequence. As far as I know, the UTF-8 mapping
hasn't changed; all that has changed are the range of USVs that are mapped
into it, and the introduction of some terms like "irregular".
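
To make the one-to-one point concrete, here is a minimal sketch in Python
(chosen only for illustration; the helper name `show` is hypothetical). It
assumes Python 3's built-in strict "utf-8" codec, which produces a single
sequence for U-00010000 and refuses the six-byte sequence outright:

```python
def show(data: bytes) -> str:
    """Format bytes as space-separated hex pairs (illustrative helper)."""
    return " ".join(f"{b:02X}" for b in data)

# One coded character maps to exactly one code unit sequence.
well_formed = "\U00010000".encode("utf-8")
print(show(well_formed))  # F0 90 80 80

# The six-byte sequence under discussion.
six_bytes = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

# A strict decoder does not read it as U-00010000, or as anything else.
try:
    six_bytes.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False
print(strict_accepts)  # False
```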

The sequence < ED A0 80 ED B0 80 > is *only* achievable in conformant
software [1] when a process that is translating between encoding forms
(specifically from UTF-16 to UTF-8)

(i) is doing a direct translation between the surface forms (as opposed to
a meaning-based translation in which you map code units to characters and
then characters to other code units),
(ii) on encountering an unpaired high surrogate code unit at the end of a
stream maps that to the irregular sequence < ED A0 80 >, or on
encountering an unpaired low surrogate code unit at the start of a stream
maps that to the irregular sequence < ED B0 80 >, and
(iii) a subsequent process is reassembling portions of the total stream,
and the concatenation results in < ED A0 80 ED B0 80 >.
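
The three conditions above can be sketched roughly as follows. This is an
illustration only, not any vendor's actual converter: the function name
`direct_utf16_to_utf8` and the two sample chunks are made up, and Python's
"surrogatepass" error handler stands in for a surface-form translation that
never pairs surrogates.

```python
def direct_utf16_to_utf8(units):
    """Translate UTF-16 code units to bytes one unit at a time.

    Hypothetical helper: it never pairs surrogates, so an unpaired
    surrogate comes out as an irregular three-byte sequence.
    """
    out = bytearray()
    for unit in units:
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

# A stream split between the two halves of a surrogate pair.
chunk_a = [0x0041, 0xD800]  # ends with an unpaired high surrogate
chunk_b = [0xDC00, 0x0042]  # starts with an unpaired low surrogate

part_a = direct_utf16_to_utf8(chunk_a)  # 41 ED A0 80
part_b = direct_utf16_to_utf8(chunk_b)  # ED B0 80 42

# A later process reassembles the portions by plain concatenation.
joined = part_a + part_b
print(" ".join(f"{b:02X}" for b in joined))  # 41 ED A0 80 ED B0 80 42
```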

This is still an irregular sequence, and strictly speaking is not
interpretable UTF-8. However, the Standard allows a process to "work out
what was meant" and replace that with a well-formed UTF-8 sequence that
corresponded to the same interpretation as the original UTF-16 sequence. It
also allows the process to toss it.
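
Both permitted behaviours can be sketched in a few lines of Python. The
function names `repair` and `toss` are hypothetical, and the UTF-16 round
trip is just one convenient way to re-pair surrogates, not a method the
Standard prescribes:

```python
def repair(data: bytes) -> bytes:
    """Work out what was meant and return well-formed UTF-8."""
    # Read the bytes leniently, letting surrogate code points through.
    text = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 so an adjacent high/low pair is re-read
    # as one supplementary character. A surrogate that is still unpaired
    # makes this strict decode raise UnicodeDecodeError.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")

def toss(data: bytes) -> bytes:
    """Drop anything that is not well-formed UTF-8."""
    return data.decode("utf-8", "ignore").encode("utf-8")

six_bytes = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

repaired = repair(six_bytes)
print(" ".join(f"{b:02X}" for b in repaired))  # F0 90 80 80

tossed = toss(b"A" + six_bytes + b"B")
print(tossed)  # b'AB'
```

Either result is acceptable under the reading above; what a process may not
do is treat the six bytes as if they were already well-formed.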

This is kind of like a two-year-old saying, "mfrxopebh" and the parent
saying, "What she said was, 'Mommy hug baby'". The two-year-old was not
pronouncing English in anything approaching a conformant way, and if my
neighbor expressed that same utterance to me, I would not for an instant
consider giving it any interpretation. But the mother was acting in a
highly specialised and extraneous circumstance, as far as language usage
goes, and was able to "work out what was intended". Nominally, though, that
utterance is completely void of meaning in English. Likewise, the 6-byte
sequence above is nominally meaningless with respect to the Unicode coded
character set. Given special circumstances, it is considered acceptable for
a process to infer a meaning for it and replace it with a well-formed
utterance (sequence), but apart from those circumstances the process could
ignore it with total impunity.

What seems to have happened somewhere along the way is that baby talk has
come to count as the real thing, or as being as fully meaningful as the
real thing. It isn't. That mother probably wouldn't understand similar
utterances coming from *your* two-year-old; don't require her or me to make
sense of your software if it's not speaking the real thing.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>

[1] The sequence may have been possible to generate at one time, but that
difference has not resulted from a change in the algorithm that defines it;
it is only due to the change of the set of input values.


