Re: UTF8 vs AL32UTF8

From: Mark Davis (mark@macchiato.com)
Date: Sat Jun 16 2001 - 13:38:08 EDT


I agree with Jianping on this point. If you are interpreting < ED A0 80 ED
B0 80 > in UTF-8, which the Unicode Standard does allow you to do (though
not to generate it!), the only possible interpretation is as U-00010000.

The standard does recognize that it is not a round-trip mapping; that is why
they are distinguished as "irregulars", and that is why you can't generate
them (and be conformant).
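
To make the asymmetry concrete, here is a rough Python sketch (an illustration only, not text from the Standard). The helper names interpret_irregular and encode_utf8 are made up for this example. The first reads the six-byte form as a surrogate pair and combines the halves; the second only ever produces the shortest form, so the six-byte sequence never comes back out.

```python
def decode_3byte(b):
    """Extract the 16-bit value carried by a three-byte UTF-8 sequence."""
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def interpret_irregular(seq):
    """Lenient reading of a six-byte sequence as an encoded surrogate pair."""
    if len(seq) != 6:
        raise ValueError("expected exactly six bytes")
    high = decode_3byte(seq[0:3])
    low = decode_3byte(seq[3:6])
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Standard UTF-16 surrogate arithmetic.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_utf8(cp):
    """Generate the shortest UTF-8 form; never emits encoded surrogates."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not generated")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


irregular = bytes.fromhex("EDA080EDB080")
cp = interpret_irregular(irregular)   # 0x10000
regenerated = encode_utf8(cp)         # bytes F0 90 80 80, not the input
```

Interpreting and then regenerating turns < ED A0 80 ED B0 80 > into < F0 90 80 80 >, which is exactly the sense in which the mapping does not round-trip.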

Mark

----- Original Message -----
From: <Peter_Constable@sil.org>
To: <unicode@unicode.org>
Sent: Tuesday, June 12, 2001 14:06
Subject: Re: UTF8 vs AL32UTF8

>
> On 06/12/2001 01:13:48 PM Jianping Yang wrote:
>
> >If you convert < ED A0 80 ED B0 80 > into UTF-16, what does it mean then? I
> >think definitely it means U-00010000.
>
> Please read the definitions and tell me how you support that.
>
> The only way I can see to support that is to assume that the mapping from
> code unit sequences in an encoding form to codepoints in the coded
> character set can be one to many (one code unit sequence can map to many
> different codepoints). That is a singularly bad way to define a formal
> mapping for a character set encoding standard, or for any algorithmic
> process for which you want deterministic behaviour, and I for one do not
> buy that that is the intended design of the Standard. Granted, the
> normative definitions are unfortunately quiet regarding CEF > CCS mappings,
> D31 being the only statement (a code unit sequence is illegal if it doesn't
> "map back", where the meaning of "map back" is only implied). Where it
> fails to be explicit, I turn to common sense regarding the formal design of
> processing systems: you want predictable results, and you don't get that by
> adding indeterminate states.
>
> The mapping defined by UTF-8 for U-00010000 is < F0 90 80 80 >, and not <
> ED A0 80 ED B0 80 >. If we want the "map back" to correspond, then < F0 90
> 80 80 > must map back to U-00010000. If we further want our mapping to be
> deterministic, then the "map back" for < ED A0 80 ED B0 80 > must be
> undefined.
>
> As I said in an earlier message, taking < ED A0 80 ED B0 80 > to mean
> U-00010000 is like taking a toddler's near gibberish to mean "Momma hug
> baby". Here's a better analogy: It's like saying that the 23-byte sequence
>
>  xxx
> x   x
> x
> x   x
>  xxx
>
> represents LATIN CAPITAL LETTER C in ASCII. It may be what was intended to
> be understood by the process that generated it, but it is not ASCII, and
> the process that generated it does not conform to ASCII if that is how it
> is trying to represent that character. It happens to be an ASCII character
> sequence that a higher level protocol (a human agent) can make sense of as
> a Latin C, but that is irrelevant.
>
> Anyway, this is getting off the point. It started as a discussion of
> Oracle's use of the label "UTF-8" and whether they could find support in
> the original definition of UTF-8.
>
> >If you convert < ED A0 80 ED B0 80 > into UTF-16, what does it mean then? I
> >think definitely it means U-00010000.
>
> The response this *should* be getting is to please read the original
> definition of UTF-8 and tell me how you support that. You can't!
>
>
> - Peter
>
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>
>
>
>



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT