variations of UTF-16/UTF-32 and browsers' interpretation (was Re: browsers and unicode surrogates)

From: Jungshik Shin (jshin@mailaps.org)
Date: Wed Apr 24 2002 - 12:25:55 EDT


On Mon, 22 Apr 2002, Stefan Persson wrote:

 I haven't added plane 1 characters yet (Tex let me do that, thanks!).
However, my test pages can be used to test how various web browsers
interpret various forms of UTF-16 and UTF-32, with or without a BOM and
with or without external information (such as the MIME charset in the HTTP
Content-Type header). This is not of practical importance (UTF-8 is much
less ambiguous and better supported than UTF-16/32 by various web
browsers), but it's interesting nonetheless because the way various forms
of UTF-16/32 have to be interpreted has been discussed recently.
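
For anyone who wants to build a similar set of test files, here is a
minimal sketch in Python. It is not how my pages were generated; the
sample text and the make_variants name are made up for illustration. It
produces the eight encoded forms (UTF-16/32, LE/BE, with and without a BOM).
The MIME charset variations have to be set up separately on the HTTP server.

```python
import codecs

# Hypothetical sample content: ASCII markup plus two Hangul syllables.
SAMPLE = "<p>abc \ud55c\uae00</p>\n"

BOMS = {
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def make_variants(text):
    """Return {(codec, with_bom): bytes} for all eight encoded forms."""
    variants = {}
    for codec, bom in BOMS.items():
        # The explicit-endian codecs never emit a BOM on their own,
        # so the BOM is prepended by hand for the "with BOM" form.
        body = text.encode(codec)
        variants[(codec, False)] = body
        variants[(codec, True)] = bom + body
    return variants


variants = make_variants(SAMPLE)
```

Each value can then be written to its own file and served with whatever
charset parameter (or none) the test case calls for.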

> ----- Original Message -----
> From: <jshin@mailaps.org>
> Sent: den 22 april 2002 20:24

> > Thank you for this tip. I didn't know this and ended up
> > 'cluttering' my filenames with charset suffixes at
> > <http://jshin.net/i18n/utftest>.
>
> The following pages display Korean text:
>
> * All UTF-16 with BOM
> * All UTF-32LE with BOM
> * UTF-16LE without BOM, encoding specified as UTF-16
>
> The following pages are displayed as Latin-1 gibberish, with ASCII displayed
> properly:
> * UTF-16 without BOM, encoding specified as UTF-16LE, UTF-16BE, or not
> specified at all
> * All UTF-32BE
> * All UTF-32LE without BOM
>
> This page is misinterpreted as UTF-16LE without line breaking:
> * UTF-16BE without BOM, encoding specified as UTF-16
>
> I'm using IE 5.5 under Windows 98.

  Thank you for your test results. MS IE 5.5 seems to *ignore* the
MIME charset specified in the HTTP header. It appears to rely *solely* on
the presence of a BOM. If there is no BOM, it assumes the platform
byte order. Is this behavior compatible with what Mark and
Ken described last week, and again this week, as to how to interpret
the various forms of UTF-16 and UTF-32? It doesn't seem to be.
The way Mozilla interprets the various forms of UTF-16/32 appears
to be more in line with what Mark and Ken have written, although
there are some issues to be resolved there as well. It'll be interesting
to see how Opera does.
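
To make the "BOM only, otherwise platform byte order" reading concrete,
here is a rough sketch in Python. It is only a model of my guess above,
not IE's actual logic, and it does not explain every result in the quoted
report. The function name sniff_bom_only and the choice of UTF-16 as the
no-BOM fallback are assumptions made for illustration.

```python
import codecs
import sys


def sniff_bom_only(data, platform_order=sys.byteorder):
    """Guess a codec from the BOM alone, ignoring any declared charset."""
    # UTF-32 BOMs are checked first, because the UTF-32LE BOM
    # (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if data.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No BOM: fall back to the platform byte order (assumed UTF-16 here).
    return "utf-16-le" if platform_order == "little" else "utf-16-be"
```

Note that the MIME charset never enters into the decision, which is the
part that seems hard to reconcile with what Mark and Ken described.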

  Here are the test results with Mozilla 0.9.9 on ix86 Linux (that is,
the platform byte order is the same as in your case).

 * The following pages always get displayed as intended

   - All UTF-16 and UTF-32 pages whose MIME charset is specified in the
     HTTP header *with* the endianness suffix (i.e. UTF-32(LE|BE),
     UTF-16(LE|BE)), regardless of the endianness and the presence
     of a BOM
     (In UTF-32 pages, the BOM is NOT ignored and is rendered as
      'ZWNBS' enclosed by a dotted square) : 8 cases

   - UTF-16BE with BOM but without MIME charset specified
     : 1 case

   - UTF-16BE and UTF-32BE without BOM but MIME charset specified
     as UTF-16 and UTF-32 : 2 cases

   - UTF-16BE and UTF-32BE with BOM but MIME charset specified
     as UTF-16 and UTF-32 : 2 cases

 * For the following pages, auto-detection sometimes works but not
   always.

   - UTF-16LE and UTF-32LE with BOM but without MIME charset specified
     : 2 cases

   - UTF-32BE with BOM but without MIME charset specified
     : 1 case

  * The following pages are recognized as Latin-1. US-ASCII
    characters are rendered correctly, with one or three hollow
    boxes before or after each of them depending on the endianness
    (BE/LE) and the code unit size (16/32)

    - UTF-16LE and UTF-32LE without BOM and without MIME charset
      (2 cases)

    - UTF-16BE and UTF-32BE without BOM and without MIME charset
      (2 cases)

  * The following pages are recognized as UTF-16BE and UTF-32BE.

    - UTF-16LE and UTF-32LE without BOM but with MIME charset specified
      as UTF-16 and UTF-32 (2 cases)

    - UTF-16LE and UTF-32LE with BOM but with MIME charset specified
      as UTF-16 and UTF-32 (2 cases)
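
  The "one or three hollow boxes" pattern in the Latin-1 case above can be
reproduced with a few lines of Python. This is just an illustration; the
helper name as_latin1 is made up, and I'm assuming the hollow boxes are
how the browser draws the NUL bytes that pad each ASCII character.

```python
def as_latin1(text, codec):
    """Encode text with codec, then misread the resulting bytes as Latin-1."""
    return text.encode(codec).decode("latin-1")


# Show where the padding NULs land for a single ASCII letter.
for codec in ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(codec, repr(as_latin1("A", codec)))
```

For UTF-16 there is one NUL per ASCII character and for UTF-32 there are
three; in the LE forms they follow the letter and in the BE forms they
precede it.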

  Jungshik Shin


