From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 12 2004 - 15:03:14 CDT
> This means that the rules of XML conflict with the rules of Unicode. If
> the string is a Unicode string, U+226F is canonically equivalent to
> <U+003E, U+0338> and therefore any higher level protocol should treat
> the two sequences as identical, rather than reject one of them as
> causing the document to be ill-formed.
There's no conflict here:
<tag1>&#x338;</tag1><tag2/>&#x338;
(here U+0338 is written as the NCR "&#x338;", which is plain ASCII)
will not be *canonically equivalent* (for Unicode) to:
<tag1>!</tag1><tag2/>!
(here an exclamation point stands for the raw combining solidus U+0338)
which is canonically equivalent (for Unicode) to:
<tag1#</tag1><tag2/#
(here a # stands for the precomposed U+226F <not greater-than> character)
Internally, in the parsed XML tree, the two syntaxes (the NCR "&#x338;" and
the raw combining U+0338 character) produce the same internal U+0338
character in the DOM tree. So the problem is purely a choice of syntax: the
first two fragments above are treated identically by any compliant XML parser.
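As a quick illustration (a sketch in Python, using the standard unicodedata
module): Unicode treats the precomposed U+226F and the sequence
<U+003E, U+0338> as canonically equivalent, while as raw strings (which is
what an XML parser actually compares) they remain distinct:

```python
import unicodedata

# U+226F NOT GREATER-THAN canonically decomposes to U+003E U+0338.
precomposed = "\u226f"   # the single <not greater-than> character
decomposed = ">\u0338"   # '>' followed by the combining solidus overlay

# Canonically equivalent for Unicode:
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# ...but distinct as raw strings, which is the level XML works at:
print(precomposed == decomposed)  # False
```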
When there's a conflict, use an NCR: it is completely equivalent for all
compliant XML parsers. An XML document generator can recognize this
exception and emit an NCR each time the U+0338 character must appear in the
first position of a text node (a text node necessarily follows the closing
'>' of a tag in any well-formed XML document). The resulting document
survives any Unicode normalization applied to the whole XML document...
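A minimal Python sketch of both the failure mode and the NCR fix (the tag
names and text are hypothetical):

```python
import unicodedata

# A text node beginning with U+0338 directly follows a closing tag:
raw = "<a>x</a>\u0338 rest of text"

# NFC fuses the tag's '>' with U+0338 into U+226F, destroying the tag:
normalized = unicodedata.normalize("NFC", raw)
print("\u226f" in normalized)  # True: the document is now ill-formed
print("</a>" in normalized)    # False: the closing tag is gone

# Writing the character as an NCR keeps the document pure ASCII,
# so any Unicode normalization leaves it untouched:
safe = "<a>x</a>&#x338; rest of text"
print(unicodedata.normalize("NFC", safe) == safe)  # True
```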
Note however that a Unicode normalization *modifies* the XML document: XML
ignores the Unicode canonical equivalences, so it will treat the precombined
character <e-acute> differently from the two-character sequence <e, acute>.
If a document is transcoded from Unicode to another charset with an
algorithm that does not apply a one-to-one mapping of encoded characters,
the new document will *not* be equivalent for XML. (For most legacy
charsets, the transcoding from that charset to Unicode is one-to-one, so
most parsers will parse a legacy XML document into a DOM tree containing
Unicode strings without loss.)
For XML generators that use an internal DOM representation before generating
the XML document syntax, any character that cannot be mapped one-to-one into
the target charset of the document MUST be written as an NCR; not doing so
creates a document that will later be parsed as different from the original
DOM tree.
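In Python, for example, the standard "xmlcharrefreplace" codec error handler
implements this policy when serializing to a legacy charset (the sample text
is hypothetical):

```python
# A DOM text value containing U+226F, which ISO-8859-1 cannot represent:
text = "a \u226f b"

# Encoding with xmlcharrefreplace emits an NCR instead of losing data:
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)  # b'a &#8815; b'   (8815 == 0x226F)

# A later parse of the NCR restores the exact original character,
# so the round trip back to the DOM is lossless.
```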
This is also true for all XML-related APIs (DOM, SAX, ...) when they are
used to get information from the parsed document tree, or when
authenticating XML document contents (the XML semantics of ignorable
whitespace apply here, and whitespace normalization is performed before the
signature is computed). Such APIs return either the exact Unicode string, or
an approximation of the actual DOM content when that information is
requested in another legacy charset (since that implies a lossy conversion),
unless the request to the API specifies that NCRs are allowed in the data
returned from the DOM tree.
As a consequence, a compliant XML parser MUST NOT apply any Unicode
normalization to the parsed entities (text elements, element names,
attribute names, attribute values, processing instructions...) without being
instructed to do so.
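This behavior can be checked with any compliant parser; a sketch with
Python's xml.etree (the element and attribute names are hypothetical):

```python
import unicodedata
import xml.etree.ElementTree as ET

# The same Unicode-equivalent content, precomposed in attribute "a"
# and decomposed (via NCRs) in attribute "b":
root = ET.fromstring('<r a="\u226f" b="&#x3E;&#x338;"/>')

# The parser preserves each form exactly as written, no normalization:
print(root.get("a") == "\u226f")       # True
print(root.get("b") == ">\u0338")      # True
print(root.get("a") == root.get("b"))  # False: distinct for XML

# Yet the two values are canonically equivalent for Unicode:
print(unicodedata.normalize("NFC", root.get("b")) == root.get("a"))  # True
```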
So there's NO conflict between XML document equivalence and Unicode
canonical equivalence: they are not the same, and they don't need to be the
same!
This archive was generated by hypermail 2.1.5 : Thu Aug 12 2004 - 15:55:26 CDT