Re: UTF8 vs AL32UTF8

From: Peter_Constable@sil.org
Date: Tue Jun 12 2001 - 11:37:30 EDT


On 06/12/2001 10:29:26 AM "Mark Davis" wrote:

>When applying UTF-8 -- as originally designed -- the sequence 0000D800
>0000DC00 would transform into a 6-byte sequence. Transforming back would
>result with the original sequence 0000D800 0000DC00. When applying this to
>Unicode (16 bit only, at the time), it would take D800 DC00 to the 6-byte
>sequence and back.

Yes, but those sequences were not considered ambiguous or confused with
00010000.

>Only after UTF-16 was designed were the definitions changed so that the
>16-bit sequence D800 DC00 would transform into a 4-byte sequence in UTF-8.

But does UTF-8 map UTF-16 code unit sequences, or does it map code points
(or USVs, take your pick for now) in the CCS?
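
To make the two readings concrete, here is a minimal sketch in Python. The helper names encode_code_point and combine_surrogates are made up for illustration and come from no library or standard text. The first result applies the UTF-8 bit pattern to each 16-bit value separately (the 6-byte form discussed above); the second first combines the surrogate pair into one scalar value and then encodes that (the 4-byte form).

```python
def encode_code_point(cp):
    """Apply the generic UTF-8 bit layout to one integer (1 to 4 bytes).

    Illustrative helper: it deliberately does NOT reject surrogates,
    so that the per-code-unit reading can be shown.
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def combine_surrogates(hi, lo):
    """Map a UTF-16 surrogate pair to the scalar value it represents."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


# Reading 1: treat D800 and DC00 as two separate values to encode.
per_unit = encode_code_point(0xD800) + encode_code_point(0xDC00)

# Reading 2: treat D800 DC00 as one code point, U+10000, and encode that.
per_scalar = encode_code_point(combine_surrogates(0xD800, 0xDC00))

print(per_unit.hex(" "))    # ed a0 80 ed b0 80
print(per_scalar.hex(" "))  # f0 90 80 80
```

The two byte sequences differ, so which one counts as "the" UTF-8 form depends entirely on whether the input to the mapping is taken to be code units or code points.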

>> The only sensible interpretation of
>> the definitions of Unicode is that UTF-8 maps exactly one coded character
>> to exactly one code unit sequence
>
>This is not correct. The most obvious point is that UTFs also map unassigned
>code points (such as U+0220) that are not coded characters. Yours is not the
>only possible "sensible" interpretation.

OK. Clarification: UTF-8 maps exactly one codepoint to exactly one code
unit sequence.

>The minimal formal requirement is that a UTF map each sequence of code
>points in its domain to a unique sequence of bytes, and map any sequence of
>bytes that it generates back to a sequence of code points. The definition
>does allow a UTF to map other byte sequences back to code points, and there
>is some dispute about the precise domain -- whether to exclude
>surrogate/noncharacter code points or not.

But I don't think a sensible interpretation of D31 allows a code unit
sequence to map back to multiple codepoints. If that constraint isn't
assumed, then we're taking the already leaky definitions and poking new
holes into them.
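
As a rough illustration of that constraint, the sketch below uses Python's built-in UTF-8 codec. This is an assumption of convenience: the codec reflects Python's current behaviour, not the definitions as they stood when this was written, and decode_strict and decode_lenient are hypothetical helper names. The point it shows is only that, under either policy, a given byte sequence decodes to at most one sequence of codepoints.

```python
six_bytes = bytes.fromhex("eda080edb080")   # D800 and DC00 encoded separately
four_bytes = bytes.fromhex("f0908080")      # U+10000 encoded as one codepoint


def decode_strict(data):
    """Decode with Python's default UTF-8 codec; None if it is rejected."""
    try:
        return [ord(ch) for ch in data.decode("utf-8")]
    except UnicodeDecodeError:
        return None


def decode_lenient(data):
    """Decode while letting encoded surrogates through as lone codepoints."""
    return [ord(ch) for ch in data.decode("utf-8", errors="surrogatepass")]


# Strict policy: the 6-byte form is not in the range of the mapping at all.
print(decode_strict(six_bytes))    # None
print(decode_strict(four_bytes))   # [65536]

# Lenient policy: the 6-byte form maps back to D800 DC00, never to 10000.
print([hex(c) for c in decode_lenient(six_bytes)])   # ['0xd800', '0xdc00']
print([hex(c) for c in decode_lenient(four_bytes)])  # ['0x10000']
```

In neither case does one code unit sequence come back as two different codepoint sequences; the policies differ only in whether the 6-byte form is accepted.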

So, I'm not yet convinced that there hasn't always been just one UTF-8
(Unicode and ISO differences related to the size of the codespace aside).

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


