RE: UTF-8 syntax

From: Mark Davis (mark_macchiato@yahoo.com)
Date: Fri Jun 08 2001 - 00:25:21 EDT


Peter,

I haven't been able to follow all of the postings
today,* but I disagree with the conclusion you draw.

The UTF-8 Corrigendum
(http://www.unicode.org/unicode/reports/tr27/#conformance)
is pretty explicit on that (D36), and is the
result of long debate within the UTC. After all,
if there were no difference between irregular and
illegal sequences, why define a difference
between them?
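
To make the byte sequences under discussion concrete, here is a minimal
sketch in Python. It is not text from the Standard or from TR27. It applies
the three-byte bit distribution (1110xxxx 10xxxxxx 10xxxxxx) to the two
halves of <ED A0 80 ED B0 80>, then does the surrogate arithmetic whose
legitimacy is being debated, and compares the result with the four-byte
form. The function names are made up for illustration, and the final strict
decode shows only how Python 3's own decoder behaves, not what the 2001
text requires.

```python
def decode_three_byte(seq: bytes) -> int:
    """Apply the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx to seq.

    Hypothetical helper: it checks the bit pattern only and does not
    decide whether the resulting value is legal, irregular, or illegal.
    """
    b0, b1, b2 = seq
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a three-byte pattern")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """UTF-16 surrogate-pair arithmetic, applied here to decoded values."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


data = bytes.fromhex("EDA080EDB080")

# Read as two adjacent three-byte sequences.
high = decode_three_byte(data[:3])
low = decode_three_byte(data[3:])

# Treating the pair as one supplementary code point is the disputed step.
scalar = combine_surrogates(high, low)

# The four-byte serialization of that same scalar value.
four_byte = chr(scalar).encode("utf-8")

# Python 3's strict UTF-8 decoder refuses the six-byte form outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(f"U+{high:04X} U+{low:04X} U+{scalar:04X} {four_byte.hex(' ')} {strict_ok}")
```

The two halves come out as U+D800 and U+DC00, and the pair corresponds to
U+10000, whose four-byte form is <F0 90 80 80>. Whether a process may
interpret the six-byte form that way is exactly the point in dispute.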

I do think the text and definitions need to be
clarified, and that will be part of the 4.0
process.

Mark

* I made the mistake of installing Microsoft's
latest service pack for Windows 2000. I am now
blue-screened, even in "safe" mode. (Am more than
a bit annoyed about it, but won't elaborate
here.)

--- Peter_Constable@sil.org wrote:
>
> On 06/07/2001 05:17:37 AM Marco Cimarosti
> wrote:
>
> >So, <ED A0 80 ED B0 80> is NOT a six-byte sequence: it is two adjacent
> >THREE-byte sequences: <ED A0 80> and <ED B0 80>, and the meaning of these
> >sequences is already clear enough by the rules (Table 3.1B): the first one
> >means U+D800 and the second one means U+DC00.
>
> Yes, and those are codepoints in the CCS, not
> code units in UTF-16. There
> is nothing in the Standard that allows
> codepoints in the CCS to be mapped
> arithmetically in terms of their USVs to other
> codepoints in the CCS. The
> only mappings from the CCS to the CCS are
> things like normalization, case
> mapping, other character foldings,
> transliterations, etc.
>
>
> >That's why conformance rules say nothing explicit about "6-byte
> >sequences".
>
> Sure they do: they say they don't exist.
>
> C1 A process shall interpret the Unicode code
> units in accordance with the
> Unicode Transformation Format used.
>
> The definition of UTF-8 does *not* generate
> 6-byte sequences and
> *explicitly* restricts sequences to 4-byte
> sequences.
>
> D36(a) UTF-8 is the Unicode Transformation
> Format that serializes a
> Unicode code point as a sequence of one to four
> bytes, as specified in
> Table 3.1, UTF-8 Bit Distribution.
>
>
> End of story. Any process that emits *or
> interprets* 6-byte sequences is
> non-conformant. The statements in the standard
> that suggest otherwise, such
> as D36(c), are logically inconsistent with
> other conformance requirements
> and definitions in the Standard. Either the
> definitions need to be reworked
> to remove the inconsistency, or they should be
> applied.
>
>
>
> - Peter
>
>
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>
>
>
>
>
>



