Re: UTF8 vs AL32UTF8

From: Peter_Constable@sil.org
Date: Tue Jun 12 2001 - 16:40:15 EDT


On 06/12/2001 02:05:38 PM Jianping Yang wrote:

>Peter_Constable@sil.org wrote:
>
>> On 06/12/2001 01:13:48 PM Jianping Yang wrote:
>>
>> >If you convert < ED A0 80 ED B0 80 > into UTF-16, what does it mean then?
>> >I think definitely it means U-00010000.
>>
>> I'd say not if that 6-byte sequence is interpreted in terms of *UTF-8*.
>
>So UTF-8 is not compatible with UTF-16 even in its repository, which is not
>guaranteed that you will have a round-trip conversion, which may be a *big*
>issue.

No. Every character in the Unicode character set can be equally represented
in any of the encoding forms. If we want to talk about encoding forms being
mappings of codepoints (or USVs - take your pick for now) to code units,
then the current definitions are clear that there is no well-formed UTF-8
code unit sequence to represent U-0000D800 to U-0000DFFF. A sequence of <
ED A0 80 > within a UTF-8 stream is not well-formed UTF-8. Period. The
current definitions are not entirely clear, however, regarding UTF-16 and
whether or not there is a well-formed UTF-16 code unit sequence to
represent U-0000D800 to U-0000DFFF. If you believe that the definitions say
there is, then there is a discrepancy between UTF-8 and UTF-16 according to
the current definitions. I believe that the intention is that there not be
any discrepancy, and so I think it is preferable to interpret the
definitions (until they get cleared up) as though there is no well-formed
UTF-16 code unit sequence for codepoints in that range.
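
As a minimal sketch of the well-formedness point above, assuming Python 3's
built-in strict UTF-8 codec as a stand-in for a conformant decoder (the helper
name is_well_formed_utf8 is hypothetical, not something from this thread):

```python
# The six bytes from the thread: each surrogate of the pair D800 DC00
# encoded separately as a three-byte sequence.
SIX_BYTES = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
        return True
    except UnicodeDecodeError:
        return False


# The UTF-8 form of U-00010000 is a single four-byte sequence.
four_bytes = "\U00010000".encode("utf-8")

print(is_well_formed_utf8(SIX_BYTES))   # surrogate sequences are rejected
print(four_bytes.hex(" "))              # f0 90 80 80
```

Under that assumption, neither < ED A0 80 > alone nor the six-byte pair is
accepted, while < F0 90 80 80 > is.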

(I may be contradicting some statement I made recently in saying this. If
so, that reflects the degree of problem in the definitions: people can
interpret one way one day and another another day to suit whatever argument
they're trying to make. Not that they necessarily do that maliciously.
They're trying to reason along a certain line, and a reading of the
definitions supports that so they naturally adopt that. It would take
someone some time of meditation on their philosophical convictions
regarding the current definitions for them to adopt a single position and
stick to it in all discussions. I'm not quite there yet.)

>> UTF-8 has no 6-byte sequences. It must be something else, like the thing
>> informally designated in our discussions as UTF-8S.
>
>UTF-8S proposal will keep round-trip conversion between UTF-16 and UTF-8S.
>Please don't confuse UTF-8S with UTF-8 as they are different encoding forms
>based on the proposal.

We don't need UTF-8S to achieve what we need in regard to UTF-16 <=> UTF-8.
It adds nothing of any benefit to round-trip conversion between those two.
As for round trip conversion between UTF-16 and UTF-8S, I certainly hope
the UTF-8S proposal will keep *that*.
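
To illustrate both round trips, here is a rough sketch in Python 3. It assumes
that UTF-8S encodes each UTF-16 code unit, surrogates included, as its own
three-byte (or shorter) sequence, which is how the six-byte example in this
thread reads. The function names are hypothetical, and the "surrogatepass"
error handler is only a convenient way to imitate that behaviour, not an
implementation of the proposal.

```python
def utf16_units(text: str) -> list[int]:
    """Split text into its UTF-16 code units (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def encode_utf8s_like(text: str) -> bytes:
    """Assumed UTF-8S behaviour: encode every UTF-16 code unit separately."""
    return b"".join(
        chr(unit).encode("utf-8", "surrogatepass") for unit in utf16_units(text)
    )


def decode_utf8s_like(data: bytes) -> str:
    """Decode, letting surrogates through, then re-pair them via UTF-16."""
    units = data.decode("utf-8", "surrogatepass")
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


def roundtrip_utf16_utf8(text: str) -> str:
    """UTF-16 -> UTF-8 -> UTF-16 using only the standard encoding forms."""
    utf16 = text.encode("utf-16-be")
    utf8 = utf16.decode("utf-16-be").encode("utf-8")
    return utf8.decode("utf-8").encode("utf-16-be").decode("utf-16-be")


sample = "A\U00010000"
print(encode_utf8s_like(sample).hex(" "))  # 41 ed a0 80 ed b0 80
print(roundtrip_utf16_utf8(sample) == sample)
print(decode_utf8s_like(encode_utf8s_like(sample)) == sample)
```

Both round trips preserve the text in this sketch; the only difference is the
byte sequence used for the supplementary character.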

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


