From: Doug Ewell (dewell@adelphia.net)
Date: Wed Jan 14 2004 - 02:41:46 EST
Deepak Chand Rathore <deepakr at aztec dot soft dot net> wrote:
> But there is one concern. In some cases the UTF-8 byte stream starts
> with a BOM (for example, when we read bytes from a text file that
> was saved with Notepad using the UTF-8 option on Windows 2000); the
> actual text starts after the first few bytes (I suppose the first 3).
> So how do we detect whether the byte stream starts with a BOM or
> not? Do the first few bytes represent a BOM or the actual text?
What you are asking is, if a UTF-8 byte stream starts with the character
U+FEFF, should that character be treated as a signature (BOM) or as a
zero-width no-break space?
You'll probably get different responses to this, having to do with
tagging or streams broken in the middle. My view is that a zero-width
no-break space has *no business* appearing at the start of a text
stream. With no character to precede it, what would it prevent a break
between? U+FEFF, or specifically the bytes EF BB BF, at the true start
of a UTF-8 stream should always be interpreted as a signature.
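
As a rough sketch of that rule, here is how the check might look in
Python. The language choice and the helper name strip_utf8_signature
are assumptions made for illustration; they are not part of the
original question. The function only looks at the first three bytes of
the stream and treats EF BB BF there as a signature to be dropped.

```python
import codecs
from typing import Tuple


def strip_utf8_signature(data: bytes) -> Tuple[bytes, bool]:
    """Return (payload, had_signature) for a complete UTF-8 byte stream.

    Hypothetical helper: it assumes `data` begins at the true start of
    the stream, not at a chunk taken from the middle.
    """
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False


payload, had_signature = strip_utf8_signature(b"\xef\xbb\xbfhello")
text = payload.decode("utf-8")
```

A U+FEFF that appears anywhere after the first three bytes is left
untouched by this sketch. Python's built-in "utf-8-sig" codec applies
the same policy when decoding, so
b"\xef\xbb\xbfhello".decode("utf-8-sig") should give the same text.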
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
I don't speak for the Unicode Consortium.