Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 09 2004 - 09:01:24 CST

    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    > Ok, so it's the conversion from raw text to escaped character
    > references which should treat combining characters specially.
    >
    > What about < with combining acute, which doesn't have a precomposed
    > form? A broken opening tag or a valid text character?

    Also a broken opening tag for HTML/XML documents. These are NOT plain-text
    documents: they must first be parsed as HTML/XML, and only then are the many
    text sections they contain (text elements, element names, attribute names,
    attribute values, etc.) parsed as plain text, under the restrictions
    specified in the HTML or XML specifications (which restrict, for example,
    which characters are allowed in names).

    The XML/HTML core syntax is defined with fixed behavior of some individual
    characters like '&', '<', quotation marks, and with special behavior for
    spaces. This core structure is not plain text, and cannot be overridden, even
    by Unicode grapheme clusters.
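
    A minimal sketch of that behavior, assuming Python's standard
    xml.etree.ElementTree parser as a stand-in for "an XML parser" (the helper
    name is_well_formed is made up for the illustration). A raw '<' followed by
    U+0301 is still read as markup, while the escaped form carries the same
    grapheme cluster as ordinary character data:

```python
import xml.etree.ElementTree as ET

COMBINING_ACUTE = "\u0301"


def is_well_formed(document: str) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Raw '<' followed by a combining acute: the parser sees the start of a tag
# whose name would begin with U+0301, which is not a valid name start character.
raw = "<p>a <" + COMBINING_ACUTE + " b</p>"

# Same text with '<' escaped: the combining mark is now plain character data.
escaped = "<p>a &lt;" + COMBINING_ACUTE + " b</p>"

raw_ok = is_well_formed(raw)
escaped_ok = is_well_formed(escaped)
escaped_text = ET.fromstring(escaped).text
```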

    Note that HTML/XML do NOT mandate the use or even the support of Unicode,
    just the support of a character repertoire that contains some required
    characters, and the acceptance of at least the ISO/IEC 10646 repertoire
    under some conditions. The encoding to code points itself is not required
    for anything other than numeric character references, and even those are
    more symbolic, in a way similar to other named character entities in SGML,
    than absolute: they do not imply that the repertoire must be supported with
    a single code!

    So you can just as well create fully conforming HTML or XML documents using
    a character set which includes characters not even defined in Unicode/ISO/IEC
    10646, or characters defined only symbolically with just a name. Whether
    this name maps to one or more Unicode characters or not does not change the
    validity of the document itself.

    And all the XML/HTML behavior ignores almost all Unicode properties
    (including normalization properties, because XML and HTML treat different
    strings, which are still canonically equivalent, as completely distinct; an
    important feature for cases like XML Signatures, where normalization of
    documents should not be applied blindly as it would break the data
    signature).
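
    To illustrate, here is a small sketch using Python's standard library. The
    SHA-256 digest over the raw bytes is only a simplified stand-in for a real
    XML Signature (which digests a canonicalized form), and the element name is
    arbitrary:

```python
import hashlib
import unicodedata
import xml.etree.ElementTree as ET

composed = "<name>\u00e9</name>"     # U+00E9, precomposed e-acute
decomposed = "<name>e\u0301</name>"  # U+0065 U+0301, canonically equivalent

text_a = ET.fromstring(composed).text
text_b = ET.fromstring(decomposed).text

# The parser hands back the strings exactly as written: they differ.
same_after_parsing = text_a == text_b

# They only compare equal once an application chooses to normalize them.
canonically_equivalent = (
    unicodedata.normalize("NFC", text_a) == unicodedata.normalize("NFC", text_b)
)


def digest(document: str) -> str:
    """Digest of the serialized bytes (simplified stand-in for a signature)."""
    return hashlib.sha256(document.encode("utf-8")).hexdigest()


# Blindly normalizing the whole document changes the bytes being signed.
digest_changes = digest(decomposed) != digest(
    unicodedata.normalize("NFC", decomposed)
)
```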

    If you want to normalize XML documents, you should not do it with a
    normalizer working on the whole document as if it were plain text. Instead
    you must normalize the individual strings that are in the XML InfoSet, as
    accessible when browsing the nodes of its DOM tree, and then you can
    serialize the normalized tree to create a new document (using CDATA sections
    and/or character references, if needed, to escape syntactic characters
    reserved by XML that would be present in the string data of DOM tree nodes).
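
    The procedure can be sketched as follows, assuming Python's ElementTree
    tree as a stand-in for the DOM and NFC as the target form. The function
    name normalize_tree is hypothetical, and this sketch only touches
    character data and attribute values, not element or attribute names:

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_tree(root: ET.Element, form: str = "NFC") -> None:
    """Normalize each string held by the tree, one node at a time."""
    for element in root.iter():
        if element.text is not None:
            element.text = unicodedata.normalize(form, element.text)
        if element.tail is not None:
            element.tail = unicodedata.normalize(form, element.tail)
        for name, value in list(element.attrib.items()):
            element.set(name, unicodedata.normalize(form, value))


# Decomposed e-acute in an attribute and in text, plus an escaped '<'
# followed by a combining acute.
source = '<p title="e\u0301">cafe\u0301 &lt;\u0301 b</p>'

root = ET.fromstring(source)   # parse first: markup is resolved here
normalize_tree(root)           # then normalize the strings of the tree
result = ET.tostring(root, encoding="unicode")  # serializer re-escapes '<'

reparsed = ET.fromstring(result)
```

    Because the serializer escapes reserved characters again, the '<' that
    carries the combining mark stays character data in the new document.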

    Note also that an XML document containing references to Unicode
    non-characters would still be well-formed, because these characters may be
    part of a non-Unicode charset.
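
    A quick check, assuming Python's expat-based parser follows the XML 1.0
    Char production: a reference to the noncharacter U+FDD0 is accepted, but
    U+FFFE and U+FFFF fall outside that production, so this holds for most
    noncharacters rather than for all of them. The helper name parses is made
    up for the example:

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the document is accepted as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# U+FDD0 is a Unicode noncharacter, yet it lies inside the range that
# XML 1.0 allows for character references.
fdd0_ok = parses("<d>&#xFDD0;</d>")

# U+FFFE is also a noncharacter, but XML 1.0 excludes it explicitly.
fffe_ok = parses("<d>&#xFFFE;</d>")
```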

    XML document validation is a separate and optional problem from XML parsing,
    which checks well-formedness and builds a DOM tree: validation is only
    performed when matching the DOM tree according to a schema definition, DTD
    or XSD, in which additional restrictions on allowed characters may be
    checked, or in which additional symbolic-only "characters" may be defined
    and used in the XML document with parsable named entities similar to:
    "&gt;".
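
    As a rough illustration of such a symbolic "character", the sketch below
    declares a named entity in an internal DTD subset. The entity name
    companylogo and its replacement text are invented for the example, and
    Python's standard parser does not validate: it only reads the entity
    declarations needed to expand the reference.

```python
import xml.etree.ElementTree as ET

# The DTD gives the symbolic name a meaning; no Unicode code point is needed.
with_dtd = """<!DOCTYPE doc [
  <!ENTITY companylogo "[company logo]">
]>
<doc>Made by &companylogo;</doc>"""

# Same reference without any declaration.
without_dtd = "<doc>Made by &companylogo;</doc>"

expanded_text = ET.fromstring(with_dtd).text

try:
    ET.fromstring(without_dtd)
    undefined_rejected = False
except ET.ParseError:
    # An undeclared entity reference is reported as a parse error.
    undefined_rejected = True
```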

    (An example: the schema may contain a definition for a "character"
    representing a private company logo, mapped to a symbolic name; the XML
    document can contain such references, but the DTD may also define an
    encoding for it in a private charset, so that the XML document will directly
    use that code; the Apple logo in Macintosh charsets is an example, for which
    an internal mapping to Unicode PUAs is not sufficient to allow correct
    processing of multiple XML documents, where PUAs used in each XML document
    have no equivalence; the conversion of such documents to Unicode with these
    PUAs is a lossy conversion, not suitable for XML data processing).
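
    The loss can be made concrete with Python's mac_roman codec, which maps
    the Apple logo byte 0xF0 to the private-use code point U+F8FF. The second
    vendor below is purely hypothetical, as is its choice of the same PUA code
    point:

```python
# Byte 0xF0 in the Macintosh Roman charset is the Apple logo.
APPLE_LOGO_MAC_ROMAN = b"\xf0"
apple_logo = APPLE_LOGO_MAC_ROMAN.decode("mac_roman")

# Hypothetical: another vendor privately assigns U+F8FF to its own logo.
OTHER_VENDOR_LOGO = "\uf8ff"

doc_a = "<note>" + apple_logo + "</note>"         # converted from a Mac charset
doc_b = "<note>" + OTHER_VENDOR_LOGO + "</note>"  # from the other vendor

# After conversion nothing in the data tells the two logos apart.
indistinguishable = doc_a == doc_b
```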


