Re: character entities in UTF-8 files

From: Gregg Reynolds (unicode@arabink.com)
Date: Wed Jul 13 2005 - 10:37:22 CDT

Next message: Peter Constable: "RE: Regarding Correct Display of Extended Latin Devanagari"

Previous message: Eric Muller: "Re: character entities in UTF-8 files"
In reply to: Peter Kirk: "Re: character entities in UTF-8 files"
Next in thread: Andy Heninger: "Re: character entities in UTF-8 files"
Reply: Andy Heninger: "Re: character entities in UTF-8 files"

Peter Kirk wrote:
> On 13/07/2005 00:52, Gregg Reynolds wrote:
>
>> ... From the Unicode perspective, a sequence of characters like
>> é is just a sequence of 5 distinct characters with no further
>> semantics. Interpreted in accordance with XML, however, such a
>> sequence *must* (not "may") be interpreted as e acute. Note that (if
>> I'm not mistaken) such interpretation logically precedes other
>> parsing. That is, an XML parser will first interpret (i.e.
>> substitute) character *entities*, and then parse the resulting text.
>> So what gets passed from the XML parser to higher-level processors is
>> e acute, not the five character sequence é. ...
>
>
>
> I don't think you can be quite right, at least unless XML is quite
> different from HTML here. For surely in both HTML and XML character
> entities like &lt; can and should be used to replace the character "<"
> when this is not to be interpreted as the start of a tag. This implies
> that character entities are parsed not as the first stage of parsing,
> but only after "<" is recognised as the start of a tag.
>

I stand corrected. What I should have said is that an XML parser will
first *replace* character entities, before passing the data to the
consuming application. When that happens in relation to parsing (i.e.
checking for well-formedness) is implementation-dependent, if I'm not
mistaken. I find the XML spec a little fuzzy on that point (I can't
wait for the English translation); it talks about at least &lt; and some
other char entities being "escaped".
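
The replacement behaviour described above can be checked with a small sketch. It assumes Python's standard xml.etree.ElementTree as the parser (not something used in this thread), and the sample document in SOURCE is made up for illustration. It only shows what the consuming application receives; it says nothing about when the parser does the replacement internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: one numeric character reference (&#233;)
# and two predefined entity references (&lt; and &gt;).
SOURCE = "<p>caf&#233; &lt;b&gt;</p>"

root = ET.fromstring(SOURCE)

# The application receives the replaced characters, not the references.
text = root.text
print(ascii(text))   # 'caf\xe9 <b>'

# The escaped "<" was not treated as the start of a tag:
# the <p> element has no child elements.
print(len(root))     # 0
```

Here the application sees e acute (U+00E9) and a literal "<", while the text "<b>" stays character data rather than markup, which is consistent with both points made in this exchange.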

-gregg

This archive was generated by hypermail 2.1.5 : Wed Jul 13 2005 - 10:38:23 CDT