Mark Davis wrote:
> It is either one code point (lenient parser) or an error
> (strict parser). It is never two.
I am a little bit confused. I re-read the conformance rules and the UTF-8
Corrigendum, and I could not find these two things:
1) The difference between "lenient" vs. "strict" parsers.
2) The rule that a UTF-8 sequence like ED A0 80 ED B0 80 should be
interpreted (by a lenient parser) as <U+10000> rather than <U+D800 U+DC00>.
The fact that a "strict" UTF-8 parser rejects sequences (such as ED A0 80 ED
B0 80) that are explicitly mentioned as legal even seems to go against my
idea of conformance. Or, at a minimum, it seems to me a sort of higher-level
protocol that imposes private syntactic constraints on otherwise legal
Unicode text.
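
To make the byte arithmetic concrete, here is a minimal Python sketch of the
two behaviours described in the quoted reply. It is only an illustration of
my reading, not text from the standard: the names decode_3byte and
decode_pair and the strict flag are hypothetical, and the sketch handles only
the six-byte case under discussion.

```python
def decode_3byte(b):
    """Decode one 3-byte UTF-8 form (1110xxxx 10xxxxxx 10xxxxxx) to an integer."""
    if len(b) != 3 or (b[0] & 0xF0) != 0xE0:
        raise ValueError("not a 3-byte lead")
    if (b[1] & 0xC0) != 0x80 or (b[2] & 0xC0) != 0x80:
        raise ValueError("bad continuation byte")
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def decode_pair(data, strict):
    """Hypothetical helper: interpret six bytes that encode a surrogate pair.

    Lenient mode returns ONE code point; strict mode raises an error.
    Neither mode ever returns two code points.
    """
    hi = decode_3byte(data[0:3])   # ED A0 80 -> 0xD800
    lo = decode_3byte(data[3:6])   # ED B0 80 -> 0xDC00
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    if strict:
        raise ValueError("surrogate code points encoded in UTF-8")
    # Combine the pair exactly as UTF-16 would.
    return [0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)]


data = bytes.fromhex("EDA080EDB080")
print([hex(cp) for cp in decode_pair(data, strict=False)])
```

With strict=False the call yields the single code point U+10000; with
strict=True it raises ValueError. For comparison, Python 3's own
data.decode("utf-8") behaves like the strict case and raises
UnicodeDecodeError on these bytes.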
Am I looking in the wrong places? Or do these rules implicitly come from
some other rule that I didn't consider?
_ Marco
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT