Re: detecting encoding in plain text (related to utf8)

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Jan 14 2004 - 02:41:46 EST

Next message: Mustafa Jabbar: "RE: New MS Mac Office and Unicode?"

Previous message: D. Starner: "Re: Detecting encoding in Plain text"
In reply to: Deepak Chand Rathore: "RE: detecting encoding in plain text (related to utf8)"
Next in thread: Markus Scherer: "Re: detecting encoding in plain text (related to utf8)"

Deepak Chand Rathore <deepakr at aztec dot soft dot net> wrote:

> But there is one concern. In some cases the UTF-8 byte stream starts
> with a BOM (e.g., when we read bytes from a text file that was saved
> using Notepad with the UTF-8 option in Win2k, the actual text starts
> after the first few bytes, I suppose the first 3 bytes).
> So how do we detect whether the byte stream starts with a BOM or
> not? Do the first few bytes represent a BOM or the actual text?

What you are asking is, if a UTF-8 byte stream starts with the character
U+FEFF, should that character be treated as a signature (BOM) or as a
zero-width no-break space?

You'll probably get different responses to this, having to do with
tagging or streams broken in the middle. My view is that a zero-width
no-break space has *no business* appearing at the start of a text
stream. With no character to precede it, what would it prevent a break
between? U+FEFF, or specifically the bytes EF BB BF, at the true start
of a UTF-8 stream should always be interpreted as a signature.
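
As a rough illustration of that rule, here is a minimal Python sketch. The
function names split_utf8_signature and decode_utf8 are made up for this
example, and it assumes the bytes really do come from the true start of the
stream, not from a fragment cut out of the middle. Only a leading EF BB BF
is treated as a signature; any later U+FEFF is left in the text.

```python
import codecs
from typing import Tuple


def split_utf8_signature(data: bytes) -> Tuple[bool, bytes]:
    """Return (has_signature, payload) for bytes from the true start of a stream.

    Hypothetical helper: a leading EF BB BF is treated as a signature and
    removed; anything else is returned unchanged.
    """
    if data.startswith(codecs.BOM_UTF8):  # codecs.BOM_UTF8 == b"\xef\xbb\xbf"
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 text, dropping a leading signature if one is present."""
    _, payload = split_utf8_signature(data)
    # Only the leading signature is dropped; a U+FEFF that appears later
    # in the stream is decoded as an ordinary character.
    return payload.decode("utf-8")
```

Python's built-in "utf-8-sig" codec applies the same treatment when
decoding, so data.decode("utf-8-sig") could be used in place of the helper
when the flag saying whether a signature was present is not needed.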

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
I don't speak for the Unicode Consortium.

