RE: UTF-8 syntax

From: Peter_Constable@sil.org
Date: Thu Jun 07 2001 - 09:28:33 EDT


On 06/07/2001 05:17:37 AM Marco Cimarosti wrote:

>So, <ED A0 80 ED B0 80> is NOT a six-byte sequence: it is two adjacent
>THREE-byte sequences: <ED A0 80> and <ED B0 80>, and the meaning of these
>sequences is already clear enough by the rules (Table 3.1B): the first one
>means U+D800 and the second one means U+DC00.

Yes, and those are codepoints in the CCS, not code units in UTF-16. There
is nothing in the Standard that allows codepoints in the CCS to be mapped
arithmetically in terms of their USVs to other codepoints in the CCS. The
only mappings from the CCS to the CCS are things like normalization, case
mapping, other character foldings, transliterations, etc.
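
To make the distinction concrete, here is a minimal Python sketch. It is not taken from the Standard, and the names decode_three_byte and utf16_pair_to_scalar are made up for illustration. It applies the three-byte bit distribution to each sequence on its own, which yields U+D800 and U+DC00. Getting from those two values to U+10000 takes a separate piece of arithmetic that belongs to UTF-16, not to UTF-8.

```python
def decode_three_byte(b0, b1, b2):
    """Apply the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx (hypothetical helper)."""
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a three-byte sequence")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def utf16_pair_to_scalar(high, low):
    """UTF-16 surrogate arithmetic; this step is not part of UTF-8 decoding."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Each three-byte sequence is decoded independently of the other.
high = decode_three_byte(0xED, 0xA0, 0x80)
low = decode_three_byte(0xED, 0xB0, 0x80)

# Combining the two results is an extra mapping layered on top.
combined = utf16_pair_to_scalar(high, low)
```

Here high is 0xD800 and low is 0xDC00; only the second function, which is UTF-16 logic, turns the pair into 0x10000.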

>That's why conformance rules say nothing explicit about "6-byte
>sequences".

Sure they do: they say they don't exist.

C1 A process shall interpret the Unicode code units in accordance with the
Unicode Transformation Format used.

The definition of UTF-8 does *not* generate 6-byte sequences and
*explicitly* restricts sequences to 4-byte sequences.

D36(a) UTF-8 is the Unicode Transformation Format that serializes a
Unicode code point as a sequence of one to four bytes, as specified in
Table 3.1, UTF-8 Bit Distribution.
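
As a rough illustration of that definition, the following Python sketch serializes a single code point. The name encode_utf8 is hypothetical, and the range boundaries are the usual ones for the one- to four-byte forms, assumed here rather than quoted from Table 3.1. No branch emits more than four bytes.

```python
def encode_utf8(cp):
    """Serialize one code point as 1 to 4 bytes (illustrative sketch only)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Surrogate code points are not treated specially here; whether they
    # may be encoded at all is the point under discussion in this thread.
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Byte counts for sample code points from each range.
lengths = [len(encode_utf8(cp)) for cp in (0x41, 0xE9, 0x20AC, 0x10000, 0x10FFFF)]
```

U+10000 comes out as the four bytes F0 90 80 80, not as a pair of three-byte sequences.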

End of story. Any process that emits *or interprets* 6-byte sequences is
non-conformant. The statements in the standard that suggest otherwise, such
as D36(c), are logically inconsistent with other conformance requirements
and definitions in the Standard. Either the definitions need to be reworked
to remove the inconsistency, or they should be applied.
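
One way to picture what applying the definitions looks like in practice is a decoder that refuses the sequence outright. The sketch below uses Python's built-in UTF-8 codec. That is an assumption about one particular implementation, shown only as an example, and says nothing about what the text of the Standard requires. The "surrogatepass" error handler is that codec's opt-in way of accepting such bytes.

```python
# The six bytes under discussion: <ED A0 80 ED B0 80>.
data = bytes.fromhex("EDA080EDB080")

# Strict decoding: record whether the codec accepts the bytes.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Opt-in lenient decoding: each three-byte sequence becomes its own
# code point; the two are not merged into a single character.
lenient = data.decode("utf-8", errors="surrogatepass")
code_points = [hex(ord(ch)) for ch in lenient]
```

With this codec the strict decode fails, and the lenient decode yields two separate code points, 0xd800 and 0xdc00, rather than one supplementary character.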

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


