Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

From: Peter_Constable@sil.org
Date: Tue Jun 05 2001 - 13:31:09 EDT


>I am a little bit confused. I re-read conformance rules and the UTF-8
>Corrigendum, and I could find these two things:
>
>1) The difference between "lenient" vs. "strict" parsers.

That has to do with XML conformance, not Unicode. You were looking in the
wrong spec.

>2) The rule that an UTF-8 sequence like ED A0 80 ED B0 80 should be
>interpreted (by a lenient parser) as <U+10000> rather than <U+D800
>U+DC00>.

Note that U+D800 and U+DC00 are not interpretable as characters in their
own right. They only make sense as code units (a surrogate pair) in the
UTF-16 encoding form. Your question relates to the coded character set,
and at that level there is only one possibility: U+10000.
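
To make the arithmetic concrete, here is a rough Python sketch of how the
six bytes reduce to the single value U+10000. The helper names
(decode_3byte, combine_surrogates) are made up for illustration and do no
validation beyond what is shown; this is not taken from any spec or parser.

```python
def decode_3byte(seq):
    """Unpack one 3-byte UTF-8-style sequence into a 16-bit value."""
    assert len(seq) == 3 and (seq[0] & 0xF0) == 0xE0
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)

def combine_surrogates(high, low):
    """Map a UTF-16 high/low surrogate pair to the scalar value it represents."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

data = bytes.fromhex("EDA080EDB080")
high = decode_3byte(data[:3])   # 0xD800, a high surrogate code unit
low = decode_3byte(data[3:])    # 0xDC00, a low surrogate code unit
scalar = combine_surrogates(high, low)
print(f"U+{scalar:04X}")
```

The intermediate values D800 and DC00 appear only as code units on the way
to the result; the character that comes out is U+10000.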

>The fact that a "strict" UTF-8 parser rejects sequences (such as ED A0 80
>ED B0 80) explicitly mentioned as legal seems even against my idea of
>conformance.

In Unicode terms, that sequence is legal but irregular. In XML terms, that
sequence is illegal. Again, two different specs.
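
As a loose illustration of the two behaviours, the following Python sketch
uses Python's built-in UTF-8 codec as a stand-in for a "strict" consumer
and the "surrogatepass" error handler as a stand-in for a "lenient" one.
That mapping is my assumption for the sake of the example; it is not how
any XML parser is specified, and the function names are hypothetical.

```python
data = bytes.fromhex("EDA080EDB080")

def strict_decode(raw):
    """Python's default UTF-8 decoding, which refuses encoded surrogates."""
    return raw.decode("utf-8")

def lenient_decode(raw):
    """Accept encoded surrogates, then fuse any high/low pair into one character."""
    # "surrogatepass" lets the two surrogate values through as lone code points.
    text = raw.decode("utf-8", errors="surrogatepass")
    # Round-tripping through UTF-16 joins an adjacent pair into a single scalar.
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le", errors="surrogatepass")

try:
    strict_decode(data)
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

lenient = lenient_decode(data)
print(strict_ok, [f"U+{ord(ch):04X}" for ch in lenient])
```

The same six bytes are refused by one consumer and read as U+10000 by the
other, which is the "two different specs" situation in miniature.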

>Or, as a minimum, it seems to me a sort of higher-level
>protocol that imposes private syntactical constraints to otherwise legal
>Unicode text.

That's what it is. Note that there's no reason at all why the XML spec
can't be more restrictive. Some things may be reasonable in some contexts
but not in others. XML requires (recommends?) data to be normalised to
Normalization Form C. That imposes private (well, open actually, but
private in the sense of limited to that protocol) constraints on
otherwise legal Unicode character sequences.
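
For a sense of what such a constraint looks like in practice, here is a
small Python sketch using the standard unicodedata module. The is_nfc
helper is a name I made up for the example; the point is only that a
sequence can be perfectly legal Unicode and still fail a protocol's
normalisation requirement.

```python
import unicodedata

def is_nfc(text):
    """True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(is_nfc(decomposed), is_nfc(composed), f"U+{ord(composed):04X}")
```

Both strings are valid Unicode text; only the composed form would satisfy
a protocol that insists on form C.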

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


