Re: Problem with SSI and BOM

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Sep 22 2006 - 21:41:51 CDT

Next message: vunzndi@vfemail.net: "Re: Unicode 5.0 success"

Previous message: John D. Burger: "Re: Unicode & space in programming & l10n"
In reply to: Jukka K. Korpela: "Re: Problem with SSI and BOM"
Next in thread: Mark Cilia Vincenti: "RE: Problem with SSI and BOM"

From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
> On Fri, 22 Sep 2006, Mark Cilia Vincenti wrote:
>
>> I'm using SSI to include UTF-8 encoded files within a UTF-encoded
>> HTML page on IIS (Internet Information Services). The problem is that
>> the byte order mark is not being stripped by the SSI parser,
>> resulting in BOMs within the HTML body.
>
> Can't you just remove the BOM? It's not needed in UTF-8 encoded data.

I tend to agree: blindly embedding the UTF-8 text as is, without applying a special encapsulation filter, may violate the higher-level syntax of HTML (or XML...).

Once you realize this, you need a filter. When writing it, it is quite simple to test for a leading BOM in the text to encapsulate (pushing the first character back if it is not a BOM) before applying the rest of the encapsulation. There you need to detect occurrences of "<" and "&" in the UTF-8 text; or, if you choose to encapsulate it using "/*<![CDATA[*/ ... /*]]>*/", you basically just need to detect "]]>", which is rarer. But don't forget that the UTF-8 text may also contain unwanted controls that are forbidden in HTML/XML data, and that HTML/XML treats several distinct encodings of newlines as if they were a single LF control, so extra filtering may be needed if you want to preserve the exact sequence of code points.
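To make this concrete, here is a minimal sketch of such a filter in Python. It is only an illustration under assumptions, not what IIS or any SSI parser does: the function names are hypothetical, the input is assumed to be valid UTF-8, forbidden controls are simply dropped (taking the XML 1.0 rule that TAB, LF and CR are the only C0 controls allowed), and the CDATA variant uses a plain "<![CDATA[ ... ]]>" wrapper rather than the commented form used inside scripts. Note that html.escape also escapes ">", which goes slightly beyond the "<" and "&" mentioned above.

```python
import codecs
from html import escape


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def is_forbidden_control(ch: str) -> bool:
    """Assumption: follow XML 1.0, where TAB, LF and CR are the only allowed C0 controls."""
    return ord(ch) < 0x20 and ch not in "\t\n\r"


def decode_and_clean(data: bytes) -> str:
    """Strip the BOM, decode as UTF-8, and drop forbidden control characters."""
    text = strip_utf8_bom(data).decode("utf-8")
    return "".join(ch for ch in text if not is_forbidden_control(ch))


def encapsulate_as_text(data: bytes) -> str:
    """Escape markup-significant characters so the text is safe as element content."""
    return escape(decode_and_clean(data), quote=False)


def encapsulate_as_cdata(data: bytes) -> str:
    """Wrap the text in a CDATA section, splitting any "]]>" so it cannot end the section early."""
    text = decode_and_clean(data)
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"
```

For example, encapsulate_as_text(b"\xef\xbb\xbfa<b&c") gives "a&lt;b&amp;c". The sketch does nothing about newlines, so CR LF and lone CR are left for the HTML/XML parser to normalize.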

This is not specified in the Unicode standard; refer to the higher-level protocol for how to encapsulate arbitrary text in an HTML/XML text element...


This archive was generated by hypermail 2.1.5 : Fri Sep 22 2006 - 21:43:24 CDT