From: Doug Ewell (dewell@adelphia.net)
Date: Thu Feb 13 2003 - 11:24:46 EST
John Cowan <cowan at mercury dot ccil dot org> wrote:
>> On top of that, you may wish to put a BOM at the very beginning of
>> your UTF-16LE HTML files, although that's not necessary
>> with the correct Content-Type HTTP header as above.
>
> No, no! In UTF-16LE, if the first two bytes are FF FE, that means an
> actual ZWNBSP character. (Analogously in UTF-16BE.) The whole point
> of the charsets "UTF-16LE" and "UTF-16BE" is that there is no BOM.
> In the charset "UTF-16", however, there may or may not be a BOM.
Thanks for the correction to my post comparing "UTF-16LE" and "UTF-16".
I had written that "UTF-16" implies the presence of a BOM. You are
correct that the BOM may or may not be present. Furthermore, if it is
not, big-endian is assumed (the source of Weiwu's original question
about big-endian being "preferred").
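
The rule for data labelled plain "UTF-16" (use the BOM if there is one, otherwise assume big-endian) can be sketched in Python as follows. This is only an illustration; the function name decode_utf16_label is made up for this example, and the endianness check is done by hand because Python's built-in "utf-16" codec falls back to the platform's native byte order, not to big-endian, when no BOM is present.

```python
def decode_utf16_label(data: bytes) -> str:
    """Decode bytes labelled plain "UTF-16" (hypothetical helper).

    A leading BOM selects the byte order and is removed;
    with no BOM, big-endian is assumed.
    """
    if data[:2] == b"\xff\xfe":          # BOM in little-endian order
        return data[2:].decode("utf-16-le")
    if data[:2] == b"\xfe\xff":          # BOM in big-endian order
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")      # no BOM: assume big-endian


print(decode_utf16_label(b"\xff\xfeA\x00"))  # A
print(decode_utf16_label(b"\xfe\xff\x00A"))  # A
print(decode_utf16_label(b"\x00A"))          # A
```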
But having said that...
Suppose Weiwu follows Jungshik's suggestion and inserts the character
U+FEFF at the beginning of his HTML. And suppose John is right, that
because the file is tagged "UTF-16LE" the character U+FEFF actually
represents a ZWNBSP instead of a BOM.
What harm has been done? It's a Web page, not a data file for which
absolute byte-for-byte fidelity is required. The ZWNBSP is totally
invisible to the user viewing the page. It has no behavior -- a ZWNBSP
is supposed to prevent a break between the preceding and following
characters, but in this case there is no preceding character, so what is
the ZWNBSP supposed to do?
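
To make the difference concrete, here is a small Python sketch of what a decoder does with the same bytes under the two labels. It uses only the standard "utf-16-le" and "utf-16" codecs; the sample bytes are made up for the illustration.

```python
# FF FE followed by "<" encoded in little-endian order.
raw = b"\xff\xfe<\x00"

# Labelled "UTF-16LE": FF FE is kept as an ordinary U+FEFF character.
as_le = raw.decode("utf-16-le")

# Labelled plain "UTF-16": FF FE is consumed as a BOM.
as_plain = raw.decode("utf-16")

print([hex(ord(c)) for c in as_le])     # ['0xfeff', '0x3c']
print([hex(ord(c)) for c in as_plain])  # ['0x3c']
```

The only difference in the decoded text is one leading invisible character.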
In some future version of Unicode, say 5.0 or above, I'd really like to
see a resolution to this nonsensical "initial ZWNBSP" case. U+FEFF at
the beginning of a file or stream (not fragment) could logically only be
a BOM. We have U+2060 WORD JOINER to handle the ZWNBSP semantic now.
Talk about something that needs to be deprecated.
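
The treatment suggested above (an initial U+FEFF is never content) might look like this in Python. The helper name strip_initial_feff is hypothetical, and this is a sketch of the proposal, not of anything the Unicode Standard currently requires for "UTF-16LE" data.

```python
import unicodedata


def strip_initial_feff(text: str) -> str:
    """Drop a single U+FEFF at the very start of decoded text (hypothetical helper).

    U+FEFF anywhere else in the text is left untouched.
    """
    return text[1:] if text.startswith("\ufeff") else text


print(unicodedata.name("\ufeff"))  # ZERO WIDTH NO-BREAK SPACE
print(unicodedata.name("\u2060"))  # WORD JOINER
print(strip_initial_feff("\ufeff<html>"))  # <html>
```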
-Doug Ewell
Fullerton, California