Re: browsers and unicode surrogates

From: Lars Marius Garshol (larsga@garshol.priv.no)
Date: Mon Apr 22 2002 - 04:38:14 EDT


* Tex Texin
|
| In looking at the HTML 4.01 spec to quote the above, I noted an
| interesting sentence:
| "The META declaration must only be used when the character encoding
| is organized such that ASCII-valued bytes stand for ASCII characters
| (at least until the META element is parsed)."
|
| I am surprised by the "must only be used". It seems I am not
| conforming by including a meta statement in the utf-16 HTML page. I
| should either remove the statement or encode the HTML up to and
| including that statement as ascii. I'll check on this.

It doesn't make much sense to have the meta statement there, as I
would expect most browsers to assume ASCII compatibility, but I agree
that "must only be used" sounds too harsh.

At Opera we had a problem with a page that displayed as a random
jumble of Unicode characters with no layout at all, and seemed to bear
no relation to the actual contents of the page. Other browsers
displayed it just fine.

After looking at it for a while we realized that the page claimed to
be UTF-16, but was not, so it was decoded as UTF-16, which basically
turned every pair of bytes into a single unrelated character. We were
a bit discouraged by this and wondered what to do for about 5 minutes
before
it struck us: "if we can see that the page claims to be UTF-16, it
can't be, because our meta declaration scanning assumes ASCII
compatibility".

So that turned out to be an easy fix anyway.
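
The reasoning above can be sketched roughly as follows. This is only an
illustration, not Opera's actual code: the function name
sniff_meta_charset, the 1024-byte scan limit, and the simplified regular
expression are all assumptions made for the example. The point is that
the scan runs over raw bytes as if they were ASCII, so a declaration of
UTF-16 found this way contradicts itself and is ignored.

```python
import re

# Simplified, hypothetical META scanner. It matches raw bytes, so it only
# finds a declaration when the markup is ASCII-compatible.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(data, scan_limit=1024):
    """Return the charset declared in a META element, or None.

    A declared UTF-16 (or UCS-2) charset is rejected: if the byte-wise
    scan could read the declaration, the page cannot really be UTF-16.
    The scan_limit default is an arbitrary choice for this sketch.
    """
    match = _META_CHARSET.search(data[:scan_limit])
    if match is None:
        return None
    declared = match.group(1).decode("ascii").lower()
    if declared.startswith(("utf-16", "utf16", "ucs-2")):
        return None  # self-contradictory declaration, ignore it
    return declared
```

A caller that gets None back would fall through to whatever other
detection it has (HTTP headers, BOM, a default encoding).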

If the HTTP response does not say what the character encoding is, we
detect UTF-16 by looking for a BOM and (I think) looking for the byte
patterns peculiar to UTF-16.
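
A minimal sketch of that kind of detection, under stated assumptions:
the post does not say which byte patterns are checked, so the heuristic
here (mostly-ASCII markup encoded as UTF-16 has a NUL in every other
byte) is a guess at one plausible pattern. The function name
detect_utf16, the sample size, and both thresholds are invented for the
example.

```python
import codecs


def detect_utf16(data, sample_size=512, threshold=0.4):
    """Guess whether data is UTF-16; return a codec name or None."""
    # The UTF-32LE BOM starts with the same two bytes as the UTF-16LE
    # BOM, so rule it out before the UTF-16 checks.
    if data.startswith(codecs.BOM_UTF32_LE):
        return None
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"

    # No BOM: look at where the NUL bytes fall in the first few code units.
    pairs = len(data[:sample_size]) // 2
    if pairs < 2:
        return None
    sample = data[:pairs * 2]
    even_nuls = sample[0::2].count(0) / pairs  # high bytes if big-endian
    odd_nuls = sample[1::2].count(0) / pairs   # high bytes if little-endian

    # Assumed thresholds: many NULs on one side, almost none on the other.
    if odd_nuls >= threshold and even_nuls < 0.05:
        return "utf-16-le"
    if even_nuls >= threshold and odd_nuls < 0.05:
        return "utf-16-be"
    return None
```

This pattern only works for text that is mostly in the ASCII range,
which is usually true of the start of an HTML page; pages whose leading
content is mostly non-Latin text would need a different test.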

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >


