Re: Is there Unicode mail out there?

From: Mark Davis (mark@macchiato.com)
Date: Wed Jul 18 2001 - 11:30:40 EDT


I believe that they are formally disallowed, if one traces through the right
path in the standards. Rather than do that myself, I believe that the XML
lawyers on this list can tell you the precise answer more quickly.
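
In the meantime, here is a rough sketch of the path I have in mind, offered as an illustration and not as a ruling. Section 2.2 of XML 1.0 defines the Char production quoted in the comment below, and as far as I can tell the "Legal Character" well-formedness constraint in section 4.1 requires character references to match Char too, so an NCR would not get around it. The function names and the sample envelope string are made up for the example.

```python
# Char production, XML 1.0 section 2.2:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def illegal_chars(text: str) -> list[tuple[int, str]]:
    """List (index, U+XXXX) for every character in text that falls outside Char."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml_char(ord(ch))
    ]

# Hypothetical archive record using SOH, STX and ETX as an envelope.
envelope = "\x01header\x02body\x03"
print(illegal_chars(envelope))
```

If that reading is right, SOH, STX and ETX all fall outside Char, while tab, line feed and carriage return are the only C0 controls that pass.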

Mark

—————

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

----- Original Message -----
From: "Bill Kurmey" <Bill.Kurmey@v-wave.com>
To: "Mark Davis" <mark@macchiato.com>
Sent: Wednesday, July 18, 2001 03:08
Subject: Is there Unicode mail out there?

> I can find no restriction on control codes in HTML 4.01. Nor on their
> representation as NCRs in either decimal or hexadecimal form.
>
> Section 2.2 of the XML-20001006 states
>
> "Legal characters are tab, carriage return, line feed, and the legal
> characters of Unicode and ISO/IEC 10646. The versions of these standards
> cited in A.1 Normative References were current at the time this document
> was prepared. New characters may be added to these standards by amendments
> or new editions. Consequently, XML processors must accept any character
> in the range specified for Char."
>
> I can't find any statement that an XML processor cannot accept a control
> character that is a "legal" character in Unicode and ISO/IEC 10646. The
> only prohibition I see applies when an ENCODING contains an octet sequence
> that is, presumably, not legal in Unicode and ISO/IEC 10646. I interpret
> 2.2 to mean that the XML processor MUST accept the characters specified
> in 2.2, but need not be limited to those characters.
>
> "It is a fatal error when an XML processor encounters an entity with an
> encoding that it is unable to process. It is a fatal error if an XML
> entity is determined (via default, encoding declaration, or higher-level
> protocol) to be in a certain encoding but contains octet sequences that
> are not legal in that encoding. It is also a fatal error if an XML entity
> contains no encoding declaration and its content is not legal UTF-8 or
> UTF-16."
>
> Am I missing something somewhere in the specifications on the W3C site?
> Where is there a reference forbidding an XML processor from handling ANY
> character that is defined in Unicode and ISO/IEC 10646?
>
> My concern stems from working with an email archive format which uses soh,
> stx and etx as an envelope.
>
> > Mark Davis wrote:
> >
> > > I had been told by the W3C people that the reason for forbidding
> > > control characters in XML and HTML was for compatibility with SGML.
> >
> >
> > More accurately, with the SGML default syntax, which is used in HTML
> > and (with a few modifications) in XML.
>
>
>
> Bill Kurmey, Edmonton, AB, Canada
>
>


