RE: UTF-8 syntax

From: Peter_Constable@sil.org
Date: Wed Jun 06 2001 - 12:20:50 EDT


>Peter Constable replied:
>> That has to do with XML conformance, not Unicode. You were
>> looking in the wrong spec.
>
>I did not grasp that Mark was talking about XML

I made a wrong assumption about what Mark meant. He used "strict" in
a way that I don't really see supported in the definitions and conformance
requirements of Unicode (which do not anywhere specify that 6-byte UTF-8
sequences for supplementary-plane characters constitute error conditions).
On the other hand, Misha Wolf had just pointed out that such sequences
would represent a fatal error to an XML decoder.

>Peter Constable replied:
>> Note that U+D800 and U+DC00 are not interpretable code
>> points. They only make sense as code units in the UTF-16
>> encoding form. Your question was relating to the coded
>> character set, and on that level there is only one
>> possibility: U+10000.
>
>There are a few keywords (such as "interpretation", "interpretable" and
>"coded character set") that seem quite important in these replies, but I am
>unsure of their exact meaning in this context.

Formally, interpretation is a mapping from the denotatum to a denotee (from
the encoded representation to the thing being represented). There are two
levels of interpretation: at the encoding form level, and at the coded
character set level. ("Coded character set" comes straight from UTR#17.)

At the encoding form level, interpretation is a mapping from code unit
sequences to codepoints in the CCS. For a UTF-16 surrogate pair, it's merely
a matter of feeding the integer values of the two code units (C_H, the high
surrogate, and C_L, the low surrogate) into a formula:

  U = (C_H - D800_16) * 400_16 + (C_L - DC00_16) + 10000_16
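
As a rough illustration, the formula can be written as a few lines of
Python. The function name and the range checks are my own additions for
the sketch; only the arithmetic comes from the formula itself.

```python
def decode_surrogate_pair(c_h: int, c_l: int) -> int:
    """Map a UTF-16 surrogate pair (two code units) to one code point."""
    # Hypothetical sanity checks: the formula only makes sense for a
    # high surrogate followed by a low surrogate.
    if not 0xD800 <= c_h <= 0xDBFF:
        raise ValueError("c_h is not a high surrogate code unit")
    if not 0xDC00 <= c_l <= 0xDFFF:
        raise ValueError("c_l is not a low surrogate code unit")
    # U = (C_H - D800) * 400 + (C_L - DC00) + 10000, all in hexadecimal
    return (c_h - 0xD800) * 0x400 + (c_l - 0xDC00) + 0x10000


print(hex(decode_surrogate_pair(0xD800, 0xDC00)))  # 0x10000
```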

At the CCS level, offhand, I'd say "interpret" means to map a codepoint to
particular (not default) normative or informative semantic properties.
Based on that, my use of "interpretable" meant "able to assume particular
semantics", which implies that those particular semantics are defined.

"DC00" can be understood in two senses:

1. as a code unit used in code sequences as defined by UTF-16
2. as a codepoint in the codespace of the Unicode coded character set

In the first, DC00 combines with D800 to map to the codepoint U+10000. At
the second level of interpretation, as codepoints -- which is what I was
referring to -- DC00 and D800 are permanently reserved and thus do not have
any particular semantics. They are, therefore, not interpretable in the
sense I described above.

>But my main problem is that now I don't know whether your replies referred
>to XML or to general-purpose Unicode.

Well, I shouldn't have put words into Mark's mouth in the first place, so
I'll refrain from doing so again. For my part, I will refer only to Unicode
in what follows.

>1) According to the Unicode Standard (with no higher-level protocols in
>action), what code point(s) correspond(s) to the irregular sequence of UTF-8
>octets <ED A0 80 ED B0 80>:
> A) <U+D800, U+DC00>?
> B) <U+10000>?

You have caused me to realise something: Unicode 3.1 gives clear statements
about generating UTF-8 sequences, but less clear statements about
interpreting them. Maybe this is your point. Unicode does not say anything
explicit about exactly this case, so we need to apply the defaults. In
doing so, we see that Unicode does not *exclude* D800 and DC00 from the
codespace for the CCS, and therefore it would seem that that UTF-8 sequence
would have to be interpreted (in the encoding form level of interpretation)
as the code points < D800 DC00 >, which have *no* meaning *as codepoints*!!
The only possible alternative is that the 6-byte sequence is ambiguous, but
I'd have a hard time supporting that from the current definitions. The only
way that can be considered is by allowing sloppiness, and it still leaves
ambiguity: unless D800 and DC00 are *excluded* from the Unicode CCS, then
interpretation A is possible.
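
To make interpretation A concrete, here is a minimal sketch of a UTF-8
decoder that works purely from the bit patterns and, by assumption, does
not treat D800..DFFF specially. It is a hypothetical helper for this
discussion, not a conformant decoder: it handles only 1- to 3-byte
sequences and does not reject overlong forms.

```python
def decode_utf8_naive(octets: bytes) -> list:
    """Decode 1- to 3-byte UTF-8 sequences by bit pattern alone."""
    code_points = []
    i = 0
    while i < len(octets):
        lead = octets[i]
        if lead < 0x80:                 # 0xxxxxxx: one octet
            value, extra = lead, 0
        elif 0xC0 <= lead < 0xE0:       # 110xxxxx: two octets
            value, extra = lead & 0x1F, 1
        elif 0xE0 <= lead < 0xF0:       # 1110xxxx: three octets
            value, extra = lead & 0x0F, 2
        else:
            raise ValueError("lead octet not handled by this sketch")
        trail = octets[i + 1:i + 1 + extra]
        if len(trail) != extra or any(b & 0xC0 != 0x80 for b in trail):
            raise ValueError("bad or missing trailing octet")
        for b in trail:                 # each trailing octet adds 6 bits
            value = (value << 6) | (b & 0x3F)
        code_points.append(value)
        i += 1 + extra
    return code_points


result = decode_utf8_naive(bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80]))
print([hex(cp) for cp in result])  # ['0xd800', '0xdc00']
```

Nothing in this procedure ever produces 10000 from those six octets; to
get there, a second, UTF-16-style step has to be bolted on afterwards.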

This whole business of allowing people to think of that 6-byte sequence as
meaning U+10000 is just plain sloppy. (I've griped about this before.)
Likewise the mentality that UTF-8 code unit sequences can be mapped to/from
UTF-16 code unit sequences. The only mappings that are or should be defined
are between UTF-8 and the CCS, between UTF-16 and the CCS, and between
UTF-32 and the CCS. Mappings like f: U8 -> U16 or g: U16 -> U8 (where U8
and U16 are the sets of UTF-8 and UTF-16 code unit sequences) should only
be definable as composite mappings: f = q ° p (p applied first, then q),
where p: U8 -> CCS and q: CCS -> U16.
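
A sketch of what I mean by a composite mapping, in Python. The function
names are made up for the illustration, and p simply leans on Python's
built-in UTF-8 decoder, which is an assumption of the sketch rather than
anything the standard prescribes. The point is only that f never looks at
UTF-16 code units until it has a sequence of CCS codepoints in hand.

```python
def utf8_to_code_points(octets: bytes) -> list:
    """p: U8 -> CCS (here via Python's built-in UTF-8 decoder)."""
    return [ord(ch) for ch in octets.decode("utf-8")]


def code_points_to_utf16(code_points) -> list:
    """q: CCS -> U16, producing a list of 16-bit code units."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)
        else:
            v = cp - 0x10000
            units.append(0xD800 + (v >> 10))    # high surrogate
            units.append(0xDC00 + (v & 0x3FF))  # low surrogate
    return units


def utf8_to_utf16(octets: bytes) -> list:
    """f: U8 -> U16, defined only as p followed by q."""
    return code_points_to_utf16(utf8_to_code_points(octets))


# The 4-byte UTF-8 form of U+10000 goes through the CCS to a surrogate pair.
print([hex(u) for u in utf8_to_utf16(b"\xf0\x90\x80\x80")])  # ['0xd800', '0xdc00']
```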

>2) According to the Unicode Standard (with no higher-level protocols in
>action), what code point(s) correspond(s) to the sequence of UTF-32BE octets
><00 00 D8 00 00 00 DC 00>:
> A) <U+D800, U+DC00>?
> B) <U+10000>?

If anybody says B, then we've got serious problems.
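
For what it's worth, here is a small Python sketch of the encoding form
level of interpretation for UTF-32BE, applied to the two 32-bit units
0000D800 and 0000DC00 that the answers refer to. The helper name is
hypothetical, and the sketch deliberately does no validation of the
resulting values.

```python
import struct


def decode_utf32be(octets: bytes) -> list:
    """Read each group of four octets as one big-endian 32-bit code unit."""
    if len(octets) % 4:
        raise ValueError("UTF-32 data must be a multiple of 4 octets")
    count = len(octets) // 4
    # In UTF-32 every code unit maps directly to one codepoint value.
    return list(struct.unpack(">%dI" % count, octets))


units = decode_utf32be(bytes.fromhex("0000D8000000DC00"))
print([hex(u) for u in units])  # ['0xd800', '0xdc00']
```

Each unit stands for itself; there is no step at which two units could
combine into 10000.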

>3) Which passages in The Unicode Standard 3.0, UTR's, or addenda justify the
>replies to question 1 and 2 above?

The relevant passages are the definitions and conformance requirements in
§3.8 and the revision to those in UTR#27.

>4) If question 1 and 2 had different answers (A,B or B,A), what is the
>rationale for this difference between UTF-8 and UTF-32?

I'd say that there should not be any. Only A should be possible in either
case. For UTF-8, B has been tolerated, and I have always thought that to be
a mistake. Your message is the first instance in which I've seen it
suggested for UTF-32. It's an absolutely horrific suggestion, but the logic
that makes it possible for UTF-8 should equally make it possible for
UTF-32.

For that matter, the same logic should allow a UTF-16 sequence < 00ED,
00A0, 0080, 00ED, 00B0, 0080 > to mean U+10000. Clearly nobody thinks that
way or should be allowed to think that way for UTF-16. I do not understand
why it is tolerated for UTF-8.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>

>
>Thank you.
>
>_ Marco
>


