From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Sep 22 2006 - 21:41:51 CDT
From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
> On Fri, 22 Sep 2006, Mark Cilia Vincenti wrote:
>
>> I'm using SSI to include UTF-8 encoded files within a UTF-encoded
>> HTML page on IIS (Internet Information Services). The problem is that
>> the byte order mark is not being stripped by the SSI parser,
>> resulting in BOMs within the HTML body.
>
> Can't you just remove the BOM? It's not needed in UTF-8 encoded data.
I tend to agree: blindly embedding the UTF-8 text as is, without applying a special encapsulation filter, may produce HTML (or XML) that violates its own higher-level syntax.
As soon as you realize this, you need a filter. When writing it, it is quite simple to test for a leading BOM in the text to encapsulate (unreading the character if it is not a BOM) before applying the rest of the encapsulation, where you need to detect occurrences of "<" and "&" in the UTF-8 text. If you choose instead to encapsulate it using "/*<![CDATA[*/ ... /*]]>*/", you basically just need to detect "]]>", which is rarer. But don't forget that the UTF-8 text may also contain unwanted controls that are forbidden in HTML/XML data, and that HTML/XML treats several distinct encodings of newlines as if they were a single LF control, so extra filtering may be needed if you want to preserve the exact sequence of code points.
This is not specified in the Unicode standard; refer to the higher-level protocol for how to encapsulate arbitrary text in an HTML/XML text element.
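
To make the filter described above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not what IIS or its SSI parser does: the function names encapsulate() and wrap_cdata() are hypothetical, forbidden controls are simply dropped (replacing or rejecting them would be equally defensible), the CDATA variant uses a plain XML CDATA section without the comment wrappers, and newline sequences are left untouched.

```python
import codecs
import html
import re

# Code points that XML 1.0 does not allow in character data:
# C0 controls other than TAB, LF and CR, plus U+FFFE and U+FFFF.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def encapsulate(raw: bytes) -> str:
    """Return the UTF-8 text in `raw`, made safe for an HTML/XML text element."""
    # Test for a leading BOM (EF BB BF); if it is absent, keep the bytes as they are.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8")
    # Assumption: unwanted controls are dropped silently.
    text = _FORBIDDEN.sub("", text)
    # With quote=False, html.escape() rewrites only "&", "<" and ">".
    return html.escape(text, quote=False)


def wrap_cdata(text: str) -> str:
    """Alternative: wrap already-decoded text in an XML CDATA section."""
    # The only sequence to detect is "]]>", which is split across two sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"
```

Only a BOM at the very start is removed; a U+FEFF later in the text is passed through unchanged. Neither function attempts to preserve the distinction between CR, LF and CR LF, so that would need extra handling if the exact sequence of code points matters.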
This archive was generated by hypermail 2.1.5 : Fri Sep 22 2006 - 21:43:24 CDT