UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Jun 05 2001 - 11:28:42 EDT


Mark Davis wrote:
> It is either one code point (lenient parser) or an error
> (strict parser). It is never two.

I am a little bit confused. I re-read the conformance rules and the UTF-8
Corrigendum, and I could not find these two things:

1) The difference between "lenient" vs. "strict" parsers.

2) The rule that a UTF-8 sequence like ED A0 80 ED B0 80 should be
interpreted (by a lenient parser) as <U+10000> rather than <U+D800 U+DC00>.
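
To make the byte sequence in point 2 concrete, here is a minimal Python
sketch of the two behaviours as I understand them from Mark's description.
The function names are hypothetical, and the code assumes the input is
exactly two three-byte sequences; it is an illustration, not a conformant
decoder.

```python
# Illustrative only: handles exactly two 3-byte UTF-8 sequences.
SEQ = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])


def decode_3byte(b):
    """Decode one 3-byte form (1110xxxx 10xxxxxx 10xxxxxx) to an integer."""
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF


def decode_lenient(data):
    """Hypothetical lenient parser: a surrogate pair becomes one code point."""
    hi = decode_3byte(data[0:3])
    lo = decode_3byte(data[3:6])
    if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
        return [0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)]
    return [hi, lo]


def decode_strict(data):
    """Hypothetical strict parser: any encoded surrogate is an error."""
    result = []
    for i in (0, 3):
        cp = decode_3byte(data[i:i + 3])
        if is_surrogate(cp):
            raise ValueError("surrogate U+%04X encoded in UTF-8" % cp)
        result.append(cp)
    return result


if __name__ == "__main__":
    print(["U+%04X" % cp for cp in decode_lenient(SEQ)])
    try:
        decode_strict(SEQ)
    except ValueError as exc:
        print("strict parser:", exc)
```

In neither function does the sequence come out as two code points, which
matches the "never two" in the quoted text.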

The fact that a "strict" UTF-8 parser rejects sequences (such as ED A0 80 ED
B0 80) that are explicitly mentioned as legal seems to go against my idea of
conformance. Or, at a minimum, it seems to me a sort of higher-level
protocol that imposes private syntactic constraints on otherwise legal
Unicode text.

Am I looking in the wrong places? Or do these rules implicitly come from
some other rule that I didn't consider?

_ Marco


