From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Thu Jan 20 2005 - 13:10:29 CST
> The BOM in UTF-8 is not the 0xFEFF UTF-8 encoded number, but 0xFEFF
> appearing as though in UTF-16. 0xFEFF is Unicode number, and could be still
> translated into UTF-8. So the BOM in UTF-8 is a really strange animal.
I hesitate to feed the thread, but what the heck.
This is confusingly written, but I believe it is wrong.
The Unicode scalar value (for the BOM character) is U+FEFF. In UTF-8 this is encoded as the byte sequence:
0xEF 0xBB 0xBF
This is the byte sequence that Notepad writes at the start of UTF-8 files saved from that editor.
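For anyone who wants to check the encoding rather than take it on faith, here is a minimal Python sketch. It uses only the standard library; the helper name strip_utf8_bom is my own illustration, not part of any standard or API.

```python
import codecs

# U+FEFF is the character used as the byte order mark.
BOM = "\ufeff"

# Encode the same scalar value in two encoding forms.
bom_utf8 = BOM.encode("utf-8")
bom_utf16_be = BOM.encode("utf-16-be")


def hex_bytes(data: bytes) -> str:
    """Format a byte string as space-separated 0xNN values."""
    return " ".join(f"0x{b:02X}" for b in data)


def strip_utf8_bom(data: bytes) -> bytes:
    """Hypothetical helper: drop a leading UTF-8 BOM if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(hex_bytes(bom_utf8))      # 0xEF 0xBB 0xBF
print(hex_bytes(bom_utf16_be))  # 0xFE 0xFF
```

The UTF-8 signature is simply U+FEFF run through the ordinary UTF-8 encoding rules; the bytes 0xFE 0xFF only appear when the same character is serialized as big-endian UTF-16.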
Given all the misinformation on this thread, I direct your attention to the FAQ:
http://www.unicode.org/faq/utf_bom.html#BOM
Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com
Chair, W3C Internationalization Working Group
http://www.w3.org/International
Internationalization is an architecture.
It is not a feature.
> -----Original Message-----
> From: unicode-bounce@unicode.org
> [mailto:unicode-bounce@unicode.org]On Behalf Of Hans Aberg
> Sent: 20 January 2005 10:17
> To: cfynn@gmx.net; Unicode List
> Subject: Re: UTF-8 'BOM'
>
>
> On 2005/01/20 14:14, Christopher Fynn at cfynn@gmx.net wrote:
>
> > Hans Aberg wrote:
> >
> >
> >> It is much better if the BOM is illegal in UTF-8. It does not prevent MS to
> >> use it, instead labelling it as a file format marker for MS text files. A
> >> program that then deals with MS text files must then know about the BOM and
> >> remove it when and if appropriate. At the same time, it does not cause any
> >> problems for programs that normally do not handle MS text files but only
> >> plain text: They are fine as they are. Everyone should be able to be happy.
> >
> > Since BOM is a valid Unicode & ISO 10646 character and UTF-8 is a
> > transformation format of Unicode & 10646, if BOM were illegal in UTF-8
> > it couldn't be used for *all* Unicode characters.
>
> The BOM in UTF-8 is not the 0xFEFF UTF-8 encoded number, but 0xFEFF
> appearing as though in UTF-16. 0xFEFF is Unicode number, and could be still
> translated into UTF-8. So the BOM in UTF-8 is a really strange animal.
>