From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 26 2003 - 09:17:55 EST
Peter Kirk [mailto:peterkirk@qaya.org] writes:
> Why is this a problem? Quotes and ">" with combining marks are
> presumably not legal HTML or XML;
You're wrong: it is legal in both HTML and XML. What is not specified
correctly is how HTML and XML parsers should behave when faced with an XML
or HTML document that declares a Unicode encoding scheme or any other
Unicode-compatible CES (like GB18030, but not completely MacRoman, as it
contains supplementary characters that are not part of the Unicode/ISO/IEC
10646 repertoire).
> and so the interpretation of a quotes
> or ">" followed by combining marks as a quote or ">" and a defective
> combining sequence is unambiguous, surely?
No, it is not: there's a problem of precedence between XML/HTML/SGML parsing
rules and Unicode parsing rules. Using character entities can solve this
problem, but I would really prefer that the W3C accept a modification of its
parsing rules so that any text element or attribute value starting with a
defective combining sequence MUST NOT be interpreted as such when it is
encoded directly with the encoding scheme. If an XML document is serialized
into a text file with an encoding scheme, the generated file should (I would
prefer "must") not encode these defective sequences with the encoding
scheme, but with character references only.
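
To make the serialization rule above concrete, here is a minimal sketch in
Python. The function names are hypothetical, "combining character" is
approximated as general category M* (Mn, Mc, Me), and the usual escaping of
"&" and "<" is left out to keep it short:

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: a "combining character" is any code point whose general
    # category is a Mark (Mn, Mc or Me), which includes enclosing marks.
    return unicodedata.category(ch).startswith("M")


def escape_leading_marks(text: str) -> str:
    """Write the combining marks that start `text` as character references.

    Only the leading run is rewritten; everything after the first
    non-combining character is returned unchanged.
    """
    i = 0
    while i < len(text) and is_combining(text[i]):
        i += 1
    refs = "".join(f"&#x{ord(ch):04X};" for ch in text[:i])
    return refs + text[i:]
```

A serializer would call escape_leading_marks() on each text node and
attribute value just before writing it, so that no ">" or quote delimiter
is ever immediately followed by a raw combining character in the file.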
This would allow using exactly the SAME text parser that is used for Unicode
as the input for the lexical and grammatical analysis of the XML/HTML/SGML
parser. Within that model, the sequence ">" + combining character would be
seen as a single combining sequence coding an abstract character, which
breaks the expected end-of-tag syntax. The same goes for the quotes
delimiting the start of attribute values, or for the square bracket
delimiting the start of a CDATA section.
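
The mismatch between the two views can be seen with a small Python sketch.
The segmentation below is a deliberately simplified stand-in for a Unicode
text parser (a base character plus any following category-M characters), not
the full grapheme cluster rules, and the helper name is hypothetical:

```python
import unicodedata
import xml.etree.ElementTree as ET


def combining_sequences(s: str) -> list[str]:
    # Simplified segmentation: attach every Mark (Mn/Mc/Me) to the
    # preceding character, whatever that character is.
    out: list[str] = []
    for ch in s:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


# Element content that starts with U+0301 COMBINING ACUTE ACCENT.
doc = "<a>\u0301b</a>"

# Unicode view: the ">" of the start tag and the accent form one unit.
units = combining_sequences(doc)

# XML view: ">" closes the tag, and the text node starts with the accent.
parsed_text = ET.fromstring(doc).text
```

Here units[2] is ">" followed by U+0301, so a lexer fed with combining
sequences never sees a bare ">", while the XML parser reports the text as
U+0301 followed by "b".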
> There could of course be
> problems if there were any precomposed combinations of quotes or ">"
> with combining characters, but I don't think there are any, are there?
There are such precomposed combinations in Unicode. Look in
NormalizationTest.txt for the places where ">" and single and double quotes
are used as part of a combining sequence... Notably, look at sequences made
with the combining solidus overlay; add also the case of enclosing combining
characters, and of mathematical operators that can be created with a
combining sequence starting with ">" or "=" or single or double quotes, and
modified by diacritics.
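
The solidus overlay case can be checked with Python's unicodedata module.
This sketch covers only "<", ">" and "=" followed by U+0338 COMBINING LONG
SOLIDUS OVERLAY, which compose to U+226E, U+226F and U+2260 under NFC; the
helper name is hypothetical:

```python
import unicodedata
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    # True if the string is accepted as a well-formed XML document.
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Element content consisting of a lone U+0338 (a defective sequence).
original = "<a>\u0338</a>"

# Normalizing the serialized file to NFC composes ">" + U+0338 into
# U+226F NOT GREATER-THAN, so the start tag loses its closing ">".
normalized = unicodedata.normalize("NFC", original)
```

The original string parses, but the normalized one no longer does, which is
exactly the kind of breakage a character reference for the leading
combining character avoids.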
> Your proposed solution to the problem is messy in requiring the use of
> numeric entities, and unnecessary.
This is not that messy. Also, I did not say that numeric entities must be
used. Any parsed named entity could be used as well. This is not a problem
of the Unicode standard, but a problem of the SGML, HTML 4.01, and XML
standards. For SGML and HTML up to 4.01, you also have problems with the
equal sign (because the quotes around an element's attribute values are not
mandatory, unlike in XML).
We don't have this problem for element names or attribute names, because
they must obey a stricter syntax and can't be any arbitrary Unicode string:
these names cannot contain defective combining sequences simply because
combining characters cannot be identifier starts.
This archive was generated by hypermail 2.1.5 : Wed Nov 26 2003 - 10:27:07 EST