RE: UTF-8 syntax

From: Misha Wolf (Misha.Wolf@reuters.com)
Date: Wed Jun 06 2001 - 13:18:58 EDT


On 06/06/2001 17:20:50 Peter Constable wrote:
> >Peter Constable replied:
> >> That has to do with XML conformance, not Unicode. You were
> >> looking in the wrong spec.
> >
> >I did not grasp that Mark was talking about XML
>
> I made a wrong assumption about what Mark was meaning. He used "strict" in
> a way that I don't really see supported in the definitions and conformance
> requirements of Unicode (which do not anywhere specify that 6-byte UTF-8
> sequences for supplementary-plane characters constitute error conditions).
> On the other hand, Misha Wolf had just pointed out that such sequences
> would represent a fatal error to an XML decoder.

Mark subsequently pointed out that nothing stops an XML parser from
understanding arbitrary character encoding schemes, including the one
being referred to as UTF-8S, provided that encoding is correctly
declared in the XML document. For the whole story, read on [1]:

<quote>

Each external parsed entity in an XML document may use a different
encoding for its characters. All XML processors must be able to read
entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and
"UTF-16" in this specification do not apply to character encodings with
any other labels, even if the encodings or labels are very similar to
UTF-8 or UTF-16.

Entities encoded in UTF-16 must begin with the Byte Order Mark described
by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section
2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK
SPACE character, #xFEFF). This is an encoding signature, not part of
either the markup or the character data of the XML document. XML
processors must be able to use this character to differentiate between
UTF-8 and UTF-16 encoded documents.

Although an XML processor is required to read only entities in the UTF-8
and UTF-16 encodings, it is recognized that other encodings are used
around the world, and it may be desired for XML processors to read
entities that use them. In the absence of external character encoding
information (such as MIME headers), parsed entities which are stored in
an encoding other than UTF-8 or UTF-16 must begin with a text
declaration (see 4.3.1 The Text Declaration) containing an encoding
declaration:

[...]

In an encoding declaration, the values "UTF-8", "UTF-16",
"ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various
encodings and transformations of Unicode / ISO/IEC 10646, the values
"ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part
number) should be used for the parts of ISO 8859, and the values
"ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various
encoded forms of JIS X-0208-1997. It is recommended that character
encodings registered (as charsets) with the Internet Assigned Numbers
Authority [IANA-CHARSETS], other than those just listed, be referred to
using their registered names; other encodings should use names starting
with an "x-" prefix. XML processors should match character encoding
names in a case-insensitive way and should either interpret an
IANA-registered name as the encoding registered at IANA for that name or
treat it as unknown (processors are, of course, not required to support
all IANA-registered encodings).

In the absence of information provided by an external transport protocol
(e.g. HTTP or MIME), it is an error for an entity including an encoding
declaration to be presented to the XML processor in an encoding other
than that named in the declaration, or for an entity which begins with
neither a Byte Order Mark nor an encoding declaration to use an encoding
other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary
ASCII entities do not strictly need an encoding declaration.

It is a fatal error for a TextDecl to occur other than at the beginning
of an external entity.

It is a fatal error when an XML processor encounters an entity with an
encoding that it is unable to process. It is a fatal error if an XML
entity is determined (via default, encoding declaration, or higher-level
protocol) to be in a certain encoding but contains octet sequences that
are not legal in that encoding. It is also a fatal error if an XML
entity contains no encoding declaration and its content is not legal
UTF-8 or UTF-16.

</quote>
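
As a rough illustration of the rules quoted above, here is a minimal
Python sketch. It is not taken from the XML spec or from any real
parser. The names decode_entity and FatalEncodingError are made up for
this example, and Python's built-in codecs merely stand in for whatever
encodings a processor chooses to support. It shows three things: a
UTF-16 Byte Order Mark selecting UTF-16 when nothing is declared,
case-insensitive matching of encoding names, and a hard failure on
octets that are not legal in the chosen encoding.

```python
import codecs
from typing import Optional


class FatalEncodingError(Exception):
    """Stands in for the spec's "fatal error" (hypothetical name)."""


def decode_entity(data: bytes, declared: Optional[str] = None) -> str:
    """Decode an entity's bytes, failing hard on illegal octet sequences.

    `declared` plays the role of the encoding declaration; None means the
    entity carries no declaration at all.
    """
    if declared is None:
        # No declaration: a UTF-16 BOM selects UTF-16, otherwise UTF-8.
        if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
            declared = "UTF-16"
        else:
            declared = "UTF-8"
    try:
        # codecs.lookup matches encoding names case-insensitively.
        codec = codecs.lookup(declared)
    except LookupError as exc:
        # An encoding this "processor" is unable to process.
        raise FatalEncodingError(f"unsupported encoding: {declared}") from exc
    try:
        text, _ = codec.decode(data, "strict")
    except UnicodeDecodeError as exc:
        raise FatalEncodingError(f"illegal octets for {declared}") from exc
    return text


# U+10000 as ordinary 4-byte UTF-8, and as the 6-byte form built from
# the surrogate pair D800 DC00 (the form under discussion as UTF-8S).
four_byte = b"\xf0\x90\x80\x80"
six_byte = b"\xed\xa0\x80\xed\xb0\x80"
```

Under these assumptions, decode_entity(four_byte) succeeds, while
decode_entity(six_byte) raises FatalEncodingError, because Python's
strict UTF-8 decoder rejects encoded surrogates. A label such as
"x-utf-8s" is rejected as unsupported here only because this sketch has
no such codec registered; a processor that did support it could decode
the 6-byte form under that label.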

[1] http://www.w3.org/TR/2000/REC-xml-20001006#charencoding

Misha
