RE: UTF-8 syntax

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Jun 06 2001 - 04:58:07 EDT


I (Marco Cimarosti) asked:
> >1) The difference between "lenient" vs. "strict" parsers.

Mark Davis replied:
> 1. By strict, I meant "excludes irregular sequences"

Peter Constable replied:
> That has to do with XML conformance, not Unicode. You were
> looking in the wrong spec.

I did not grasp that Mark was talking about XML: I thought that he was
talking about general-purpose UTF-8. So the concept of "strict parser" is
now clear, in the context of XML.

Just to be sure: is "lenient parser" also an XML-specific term, or does it
mean a regular (all-purpose) Unicode UTF-8 parser?

I (Marco Cimarosti) asked:
> >2) The rule that an UTF-8 sequence like ED A0 80 ED B0 80 should be
> >interpreted (by a lenient parser) as <U+10000> rather than
> ><U+D800 U+DC00>.

Mark Davis replied:
> 2. To be precise, U+D800 and U+DC00 are code points and do have
> interpretations. They are surrogates. They are *not* characters.

Peter Constable replied:
> Note that U+D800 and U+DC00 are not interpretable code
> points. They only make sense as code units in the UTF-16
> encoding form. Your question was relating to the coded
> character set, and on that level there is only one
> possibility: U+10000.

There are a few keywords (such as "interpretation", "interpretable" and
"coded character set") that seem quite important in these replies, but I am
unsure of their exact meaning in this context.

But my main problem is that now I don't know whether your replies referred
to XML or to general-purpose Unicode.

Moreover, my doubt has now extended to UTF-32, although I thought I knew the
answer.

So, sorry for restating the same question (I know you both answered B to
question 1 below; I just want to be sure that you and I have the same thing
in mind).

Please also excuse the rather formal tone of the questions. This is just because
understanding these points is very important to me, and I want to be sure
that my questions are unambiguous.

1) According to the Unicode Standard (with no higher-level protocols in
action), what code point(s) correspond(s) to the irregular sequence of UTF-8
octets <ED A0 80 ED B0 80>:
        A) <U+D800, U+DC00>?
        B) <U+10000>?

2) According to the Unicode Standard (with no higher-level protocols in
action), what code point(s) correspond(s) to the sequence of UTF-32BE octets
<00 00 D8 00 00 00 DC 00>:
        A) <U+D800, U+DC00>?
        B) <U+10000>?

3) Which passages in The Unicode Standard 3.0, UTRs, or addenda justify the
answers to questions 1 and 2 above?

4) If questions 1 and 2 have different answers (A,B or B,A), what is the
rationale for this difference between UTF-8 and UTF-32?
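
To make the two candidate readings in questions 1 and 2 concrete, here is a
minimal Python sketch of the arithmetic behind answers A and B. It is only an
illustration: the function names are hypothetical, the UTF-32BE input is
assumed to be the two 32-bit units 0000D800 and 0000DC00, and nothing in it
settles which reading the Unicode Standard actually requires.

```python
import struct

def decode_three_byte_utf8(b0, b1, b2):
    """Mechanically unpack the 3-octet pattern 1110xxxx 10xxxxxx 10xxxxxx.

    No validity checks are made, so surrogate values pass through.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

def combine_surrogates(high, low):
    """Apply the UTF-16 surrogate-pair formula to a high/low pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def fmt(code_points):
    """Render a list of integers in the <U+XXXX, ...> notation used above."""
    return "<" + ", ".join("U+%04X" % cp for cp in code_points) + ">"

# Question 1: the irregular UTF-8 octets <ED A0 80 ED B0 80>.
utf8_octets = bytes.fromhex("ED A0 80 ED B0 80")
utf8_reading_a = [decode_three_byte_utf8(*utf8_octets[0:3]),
                  decode_three_byte_utf8(*utf8_octets[3:6])]
utf8_reading_b = [combine_surrogates(*utf8_reading_a)]

# Question 2: UTF-32BE, assumed here to be two big-endian 32-bit units.
utf32_octets = bytes.fromhex("00 00 D8 00 00 00 DC 00")
utf32_reading_a = list(struct.unpack(">2I", utf32_octets))
utf32_reading_b = [combine_surrogates(*utf32_reading_a)]

print("UTF-8   A:", fmt(utf8_reading_a), " B:", fmt(utf8_reading_b))
print("UTF-32  A:", fmt(utf32_reading_a), " B:", fmt(utf32_reading_b))
```

For both inputs, reading A yields <U+D800, U+DC00> and reading B yields
<U+10000>; the sketch shows only that both are mechanically computable, which
is why the question of which one is sanctioned has to be answered by the text
of the Standard.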

Thank you.

_ Marco


