I (Marco Cimarosti) asked:
> >1) The difference between "lenient" vs. "strict" parsers.
Mark Davis replied:
> 1. By strict, I meant "excludes irregular sequences"
Peter Constable replied:
> That has to do with XML conformance, not Unicode. You were
> looking in the wrong spec.
I did not grasp that Mark was talking about XML: I thought that he was
talking about general purpose UTF-8. So now the concept of "strict parser"
is clear, in the context of XML.
Just to be sure: is "lenient parser" also an XML-specific term? Or does it
mean a regular (all-purpose) Unicode UTF-8 parser?
I (Marco Cimarosti) asked:
> >2) The rule that an UTF-8 sequence like ED A0 80 ED B0 80 should be
> >interpreted (by a lenient parser) as <U+10000> rather than <U+D800
> U+DC00>.
Mark Davis replied:
> 2. To be precise, U+D800 and U+DC00 are code points and do have
> interpretations. They are surrogates. They are *not* characters.
Peter Constable replied:
> Note that U+D800 and U+DC00 are not interpretable code
> points. They only make sense as code units in the UTF-16
> encoding form. Your question was relating to the coded
> character set, and on that level there is only one
> possibility: U+10000.
There are a few keywords (such as "interpretation", "interpretable" and
"coded character set") that seem quite important in these replies, but I am
unsure of their exact meaning in this context.
But my main problem is that now I don't know whether your replies referred
to XML or to general-purpose Unicode.
Moreover, my doubt now extends to UTF-32, although I thought I knew the
answer.
So, sorry for restating the same question again (I know you both replied B
to question 1 below, I just want to be sure that you and I had the same
thing in mind).
Also excuse the rather formal tone of the questions. This is just because
understanding these points is very important to me, and I want to be sure
that my questions are unambiguous.
1) According to the Unicode Standard (with no higher-level protocols in
action), what code point(s) correspond(s) to the irregular sequence of UTF-8
octets <ED A0 80 ED B0 80>:
A) <U+D800, U+DC00>?
B) <U+10000>?
2) According to the Unicode Standard (with no higher-level protocols in
action), what code point(s) correspond(s) to the sequence of UTF-32BE octets
<00 00 D8 00 00 00 DC 00>:
A) <U+D800, U+DC00>?
B) <U+10000>?
3) Which passages in The Unicode Standard 3.0, UTRs, or addenda justify the
replies to questions 1 and 2 above?
4) If questions 1 and 2 have different answers (A,B or B,A), what is the
rationale for this difference between UTF-8 and UTF-32?
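To make questions 1 and 2 concrete, here is a minimal Python sketch of the
arithmetic behind readings A and B. It is my own illustration, not something
taken from the Standard: the helper names are hypothetical, it assumes the
UTF-32BE input in question 2 is meant as two 32-bit units <00 00 D8 00>
<00 00 DC 00>, and it does not say which reading a conformant process must
choose.

```python
def decode_utf8_3octets(octets):
    """Mechanically decode one 1110xxxx 10xxxxxx 10xxxxxx sequence.

    Hypothetical helper: it does no check on whether the resulting
    value is a surrogate, which is exactly the point in question.
    """
    assert len(octets) == 3 and octets[0] & 0xF0 == 0xE0
    assert octets[1] & 0xC0 == 0x80 and octets[2] & 0xC0 == 0x80
    return ((octets[0] & 0x0F) << 12) | ((octets[1] & 0x3F) << 6) | (octets[2] & 0x3F)


def decode_utf32be(octets):
    """Split the input into 4-octet big-endian units, with no range checks."""
    assert len(octets) % 4 == 0
    return [int.from_bytes(octets[i:i + 4], "big") for i in range(0, len(octets), 4)]


def combine_surrogates(high, low):
    """UTF-16 arithmetic: map a high/low surrogate pair to one scalar value."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Question 1: the UTF-8 octets <ED A0 80 ED B0 80>
utf8_octets = bytes.fromhex("EDA080EDB080")
reading_a_utf8 = [decode_utf8_3octets(utf8_octets[0:3]),
                  decode_utf8_3octets(utf8_octets[3:6])]
reading_b_utf8 = [combine_surrogates(*reading_a_utf8)]

# Question 2: the UTF-32BE octets, assumed to be <00 00 D8 00 00 00 DC 00>
utf32_octets = bytes.fromhex("0000D8000000DC00")
reading_a_utf32 = decode_utf32be(utf32_octets)
reading_b_utf32 = [combine_surrogates(*reading_a_utf32)]

for name, reading in [("1A", reading_a_utf8), ("1B", reading_b_utf8),
                      ("2A", reading_a_utf32), ("2B", reading_b_utf32)]:
    print(name, ["U+%04X" % cp for cp in reading])
```

In both cases reading B is simply reading A followed by the UTF-16 surrogate
arithmetic, so the questions come down to whether the Standard requires,
permits, or forbids that extra step for each encoding.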
Thank you.
_ Marco