Re: character entities in UTF-8 files

From: Gregg Reynolds (unicode@arabink.com)
Date: Tue Jul 12 2005 - 18:52:00 CDT

    Chris Jacobs wrote:
    > ----- Original Message -----
    > From: "Peter Constable" <petercon@microsoft.com>
    > To: <unicode@unicode.org>
    > Sent: Tuesday, July 12, 2005 11:03 PM
    > Subject: RE: character entities in UTF-8 files
    >
    >
    >>>From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    >>>On Behalf Of Chris Jacobs
    >>
    >>>>We have an XML based application...
    >>
    >>>Only it does not stand for e acute, as far as unicode is involved it
    >>>just stands for itself, for &#233;.
    >>>
    >>>Of course you are allowed to have agreements with your users about
    >>>replacing &#233; by e acute or by whatever you want to replace it by.
    >>
    >>Since this is an XML application, then at the level of XML parsing,
    >>&#233 must be interpreted as e-acute; he is not allowed to have
    >>agreements with his users about replacing &#233 with anything else.
    >
    >
    > Except that not: specifies UTF-8 files as source, but: "specifies UTF-8
    > files as input".
    > So this &#233; is not in the XML source, but in the input which the XML
    > reads.

    Huh? When did XML start reading things?

    A putative XML file either conforms to XML grammar or it doesn't. XML
    is *almost* purely syntactic; but it defines a set of *entities* (a
    technical term) which have predefined (Unicode) character semantics.
    XML entities are defined syntactically; they start with "&" and end with
    ";". Unicode has no such concept. There are no *entities* in Unicode,
    so when you discuss XML and Unicode, you have to be careful about the
    terminology. From the Unicode perspective, a sequence of characters
    like &#233; is just a sequence of six characters with no further
    semantics. Interpreted in accordance with XML, however, such a sequence
    *must* (not "may") be interpreted as e acute. Note that (if I'm not
    mistaken) such interpretation logically precedes other parsing. That
    is, an XML parser will first interpret (i.e. substitute) character
    *entities*, and then parse the resulting text. So what gets passed from
    the XML parser to higher-level processors is e acute, not the
    six-character sequence &#233;. This means, among other things, that if you
    have an XSL stylesheet that munges text, it should look for the single
    character e acute, and not the sequence of six characters &#233;.
    (Unless, of course, you wish to write &#233; in your XSL stylesheet,
    which is itself XML. ;-)
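
    To make the distinction concrete, here is a minimal sketch. It assumes
    Python's standard xml.etree.ElementTree as the parser (my choice for
    illustration; nothing in this thread names a parser), and the <word>
    and <rule> elements are made up. It shows the same characters counted
    as plain Unicode text and then as seen after XML parsing:

```python
import xml.etree.ElementTree as ET

# As plain Unicode text the reference is just its literal characters: & # 2 3 3 ;
raw_reference = "&#233;"
raw_length = len(raw_reference)

# Hypothetical document whose text content uses the character reference.
document = ET.fromstring("<word>caf&#233;</word>")
parsed_text = document.text

# Hypothetical stylesheet-like element; it is XML too, so the reference
# written in its attribute is resolved by the parser in the same way.
rule = ET.fromstring('<rule match="&#233;"/>')
pattern = rule.get("match")

print(raw_length)                    # length of the unparsed reference
print(len(parsed_text))              # length of the text the parser hands on
print(pattern in parsed_text)        # the resolved pattern matches the resolved text
print(raw_reference in parsed_text)  # the literal reference is no longer there
```

    With this parser the application never sees the ampersand form at all,
    so a comparison against the literal string "&#233;" after parsing will
    not match.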

    Then again, I could be completely wrong.

    Hope that helps,

    -gregg


