From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Nov 26 2003 - 10:49:31 EST
On 26/11/2003 06:17, Philippe Verdy wrote:
>Peter Kirk [mailto:peterkirk@qaya.org] writes:
>
>>Why is this a problem? Quotes and ">" with combining marks are
>>presumably not legal HTML or XML;
>
>You're wrong: it is legal in both HTML and XML. What is not specified
>correctly is the behavior of HTML and XML parsers when faced with an XML or
>HTML document that claims to be encoded with a Unicode encoding scheme or any
>other Unicode-compatible CES (like GB18030, but not completely MacRoman, as
>it contains supplementary characters that are not part of the Unicode/ISO/IEC
>10646 repertoire).
>
OK, I used the wrong words here. A quote or ">" followed by combining
characters is legal HTML/XML when it is interpreted as a quote or ">"
introducing a quoted string or terminating a tag, followed by a
defective combining sequence which is part of the quoted string or of
the text following the tag. The question is, does such a sequence have
any other legal interpretation within the context of an HTML/XML tag?
If not, there is no ambiguity.
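
To make that interpretation concrete, here is a minimal sketch in Python of a scanner that works on code points only. The helper name split_at_tag_end and the sample string are made up for illustration; this is not how any particular HTML/XML parser is written.

```python
import unicodedata

def split_at_tag_end(s):
    """Split at the first '>' code point, ignoring whatever follows it."""
    i = s.index(">")
    return s[: i + 1], s[i + 1 :]

# A tag immediately followed by COMBINING ACUTE ACCENT (U+0301).
tag, rest = split_at_tag_end("<b>\u0301text")

# The tag ends at ">", and the text after it starts with a combining
# mark that has no base character: a defective combining sequence.
starts_with_combining_mark = unicodedata.combining(rest[0]) != 0
```

Under this reading the combining mark never affects where the tag ends; it simply belongs to the text that follows.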
> ...
>
>>There could of course be
>>problems if there were any precomposed combinations of quotes or ">"
>>with combining characters, but I don't think there are any, are there?
>
>There are such precomposed sequences in Unicode. Look in
>NormalizationTest.txt for the places where ">", single and double quotes are
>used as part of a combining sequence... Notably, look at sequences made with
>the combining solidus overlay; consider also enclosing combining
>characters, and mathematical operators that can be created with a
>combining sequence starting with ">" or "=" or single or double quotes and
>modified by diacritics.
>
According to John Cowan there is just one such precomposed character,
U+226F (NOT GREATER-THAN). As an HTML/XML document (the whole file, not
just the parts between tags) is a Unicode string, the Unicode
conformance rules would seem to require an HTML/XML parser to parse
U+226F exactly as if it were the sequence <">", U+0338>, i.e. as an end
of tag followed by a defective combining sequence. Normalisation
stability implies that this precomposed character will always be the
only such problem case, at least apart from composition exceptions, and
so it can be written into parsers as a special case. A bit messy, but
less messy than using numeric entities or named entities.
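
The claim can be checked against whatever version of the Unicode Character Database ships with a given Python, using the standard unicodedata module. This is only a rough sketch: the function name and the DELIMITERS set are my own choices for illustration, and the result depends on the database version in use.

```python
import sys
import unicodedata

# Characters whose role in markup syntax matters here (an assumption
# made for this sketch: ">" and the two ASCII quotes).
DELIMITERS = {">", '"', "'"}

def precomposed_with_delimiter_base():
    """Return characters whose canonical decomposition starts with a delimiter."""
    found = []
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Compatibility mappings are tagged like "<compat> ..."; skip those.
        if not decomp or decomp.startswith("<"):
            continue
        first = chr(int(decomp.split()[0], 16))
        if first in DELIMITERS:
            found.append(chr(cp))
    return found

found = precomposed_with_delimiter_base()

# U+226F decomposes to ">" followed by U+0338 ...
nfd = unicodedata.normalize("NFD", "\u226f")
# ... and a normalising process composes that pair back into U+226F.
nfc = unicodedata.normalize("NFC", ">\u0338")
```

The second normalisation is the awkward direction: text in which ">" closes a tag and is followed by U+0338 comes out of NFC with no ">" code point left at all.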
-- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/