Re: UTF8 vs AL32UTF8

From: Peter_Constable@sil.org
Date: Tue Jun 12 2001 - 17:06:02 EDT


On 06/12/2001 01:13:48 PM Jianping Yang wrote:

>If you convert < ED A0 80 ED B0 80 > into UTF-16, what does it mean then? I
>think definitely it means U-00010000.

Please read the definitions and tell me how you support that.

The only way I can see to support that is to assume that the mapping from
code unit sequences in an encoding form to codepoints in the coded
character set can be one to many (one code unit sequence can map to many
different codepoints). That is a singularly bad way to define a formal
mapping for a character set encoding standard, or for any algorithmic
process for which you want deterministic behaviour, and I for one do not
buy that that is the intended design of the Standard. Granted, the
normative definitions are unfortunately quiet regarding CEF-to-CCS mappings,
D31 being the only statement (a code unit sequence is illegal if it doesn't
"map back", where the meaning of "map back" is only implied). Where it
fails to be explicit, I turn to common sense regarding the formal design of
processing systems: you want predictable results, and you don't get that by
adding indeterminate states.
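
To make the indeterminacy concrete: applying nothing but the UTF-8 byte
patterns to < ED A0 80 ED B0 80 > yields the two surrogate code points U+D800
U+DC00, not U-00010000. A quick sketch (Python 3 is used here purely as a
convenient way to poke at byte sequences; its 'surrogatepass' error handler
exposes the raw byte-pattern arithmetic and is not a conforming decode):

    seq = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

    # Byte-pattern arithmetic alone gives two code points, U+D800 and U+DC00 ...
    raw_reading = seq.decode("utf-8", "surrogatepass")
    print([hex(ord(c)) for c in raw_reading])   # ['0xd800', '0xdc00']

    # ... while the claimed reading is a different, single code point.
    claimed_reading = "\U00010000"
    print(hex(ord(claimed_reading)))            # '0x10000'

    # One code unit sequence, two candidate code point sequences: the
    # "map back" would be one-to-many, i.e. indeterminate.
    print(raw_reading == claimed_reading)       # False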

The mapping defined by UTF-8 for U-00010000 is < F0 90 80 80 >, and not <
ED A0 80 ED B0 80 >. If we want the "map back" to correspond, then < F0 90
80 80 > must map back to U-00010000. If we further want our mapping to be
deterministic, then the "map back" for < ED A0 80 ED B0 80 > must be
undefined.
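
And that is exactly how a strict decoder behaves (again a quick Python 3
sketch, standing in for any conforming UTF-8 codec):

    # The defined UTF-8 encoding of U-00010000 is < F0 90 80 80 > ...
    assert "\U00010000".encode("utf-8") == bytes([0xF0, 0x90, 0x80, 0x80])

    # ... so < F0 90 80 80 > must map back to U-00010000 ...
    assert bytes([0xF0, 0x90, 0x80, 0x80]).decode("utf-8") == "\U00010000"

    # ... and the map back for < ED A0 80 ED B0 80 > is simply undefined:
    # a strict decoder rejects the sequence outright.
    try:
        bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80]).decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected:", e)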

As I said in an earlier message, taking < ED A0 80 ED B0 80 > to mean
U-00010000 is like taking a toddler's near gibberish to mean "Momma hug
baby". Here's a better analogy: It's like saying that the 23-byte sequence

 xxx
x   x
x
x   x
 xxx

represents LATIN CAPITAL LETTER C in ASCII. It may be what was intended to
be understood by the process that generated it, but it is not ASCII, and
the process that generated it does not conform to ASCII if that is how it
is trying to represent that character. It happens to be an ASCII character
sequence that a higher level protocol (a human agent) can make sense of as
a Latin C, but that is irrelevant.
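
For the record, the ASCII side of the analogy is just as mechanical: ASCII
assigns CAPITAL LETTER C the single byte 0x43, and the 23-byte picture above
is, to ASCII, nothing but 'x', spaces and line feeds. A throwaway Python 3
check (assuming the picture is laid out exactly as above):

    picture = " xxx\nx   x\nx\nx   x\n xxx"
    print(len(picture.encode("ascii")))   # 23 -- all perfectly legal ASCII
    print("C".encode("ascii"))            # b'C' -- the one byte ASCII gives to C
    print(sorted(set(picture)))           # ['\n', ' ', 'x'] -- no C in sight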

Anyway, this is getting off the point. It started as a discussion of
Oracle's use of the label "UTF-8" and whether they could find support in
the original definition of UTF-8.

>If you convert < ED A0 80 ED B0 80 > into UTF-16, what does it mean then? I
>think definitely it means U-00010000.

The response this *should* be getting is to please read the original
definition of UTF-8 and tell me how you support that. You can't!

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


