From: Gregg Reynolds (unicode@arabink.com)
Date: Tue Jul 12 2005 - 18:52:00 CDT
Chris Jacobs wrote:
> ----- Original Message -----
> From: "Peter Constable" <petercon@microsoft.com>
> To: <unicode@unicode.org>
> Sent: Tuesday, July 12, 2005 11:03 PM
> Subject: RE: character entities in UTF-8 files
>
>
>>>From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
>>>On Behalf Of Chris Jacobs
>>
>>>>We have an XML based application...
>>
>>>Only it does not stand for e acute, as far as unicode is involved it
>>>just stands for itself, for é.
>>>
>>>Of course you are allowed to have agreements with your users about
>>>replacing é by e acute or by whatever you want to replace it by.
>>
>>Since this is an XML application, then at the level of XML parsing,
>>é must be interpreted as e-acute; he is not allowed to have
>>agreements with his users about replacing é with anything else.
>
>
> Except that not: specifies UTF-8 files as source, but: "specifies UTF-8
> files as input".
> So this é is not in the XML source, but in the input which the XML
> reads.
Huh? When did XML start reading things?
A putative XML file either conforms to XML grammar or it doesn't. XML
is *almost* purely syntactic; but it defines a set of *entities* (a
technical term) which have predefined (Unicode) character semantics.
XML entities are defined syntactically; they start with "&" and end with
";". Unicode has no such concept. There are no *entities* in Unicode,
so when you discuss XML and Unicode, you have to be careful about the
terminology. From the Unicode perspective, a sequence of characters
like é is just a sequence of 5 distinct characters with no further
semantics. Interpreted in accordance with XML, however, such a sequence
*must* (not "may") be interpreted as e acute. Note that (if I'm not
mistaken) such interpretation logically precedes other parsing. That
is, an XML parser will first interpret (i.e. substitute) character
*entities*, and then parse the resulting text. So what gets passed from
the XML parser to higher-level processors is e acute, not the five
character sequence é. This means, among other things, that if you
have an XSL stylesheet that munges text, it should look for the single
character e acute, and not the sequence of five characters é.
(Unless, of course, you wish to write é in your XSL stylesheet,
which is itself XML. ;-)
Then again, I could be completely wrong.
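For anyone who wants to check this rather than take my word for it, here is a minimal sketch using Python's standard xml.etree.ElementTree. It assumes the reference in question is a numeric character reference such as &#233; (a named reference would additionally need a declaration in a DTD, which is left out here). The sample document and the helper name parsed_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: "&#233;" is a numeric character
# reference for U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
SOURCE = "<word>caf&#233;</word>"


def parsed_text(xml_source: str) -> str:
    """Return the element text that the parser hands to the application."""
    return ET.fromstring(xml_source).text


text = parsed_text(SOURCE)

# The application sees the single character, not the reference itself.
print(text == "caf\u00e9")
print("&" in text)
```

If the point above holds, the text handed back by the parser contains the single character U+00E9 and no ampersand, so anything downstream (a stylesheet, say) should match on that character.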
Hope that helps,
-gregg