Re: detecting encoding in plain text (related to utf8)

From: Markus Scherer ([email protected])
Date: Wed Jan 14 2004 - 12:17:48 EST

In reply to: Deepak Chand Rathore: "RE: detecting encoding in plain text (related to utf8)"

Deepak Chand Rathore wrote:
> Unicode range: UTF-8 encoded bytes
> U-00000000 - U-0000007F: 0xxxxxxx
> U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
> U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
> ...

This table is not correct. Please check Unicode 3.2 or Unicode 4.0 for the correct table:

Table 3.1B, "Legal UTF-8 Byte Sequences", in http://www.unicode.org/reports/tr28/#3_1_conformance
The Conformance chapter in http://www.unicode.org/versions/Unicode4.0.0/
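
As a rough illustration of why the simple bit-pattern table is not enough: taken alone, it accepts non-shortest forms such as C0 80 and encoded surrogates such as ED A0 80. The sketch below checks a byte string against the lead-byte and second-byte ranges as I understand them from the legal-sequence table. The function name is_well_formed_utf8 is made up for this example, and the ranges should be verified against the table itself before relying on them.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of legal UTF-8 byte sequences.

    Hypothetical helper: the ranges below are my reading of the
    legal-sequence table and should be checked against the standard.
    """
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:                      # single-byte (ASCII) sequence
            i += 1
            continue
        # For each lead byte: allowed range of the 2nd byte, and the
        # total number of trailing bytes that must follow.
        if 0xC2 <= b0 <= 0xDF:
            lo, hi, trail = 0x80, 0xBF, 1
        elif b0 == 0xE0:                    # A0..BF excludes non-shortest forms
            lo, hi, trail = 0xA0, 0xBF, 2
        elif b0 == 0xED:                    # 80..9F excludes surrogates
            lo, hi, trail = 0x80, 0x9F, 2
        elif 0xE1 <= b0 <= 0xEF:
            lo, hi, trail = 0x80, 0xBF, 2
        elif b0 == 0xF0:                    # 90..BF excludes non-shortest forms
            lo, hi, trail = 0x90, 0xBF, 3
        elif 0xF1 <= b0 <= 0xF3:
            lo, hi, trail = 0x80, 0xBF, 3
        elif b0 == 0xF4:                    # 80..8F stops at U+10FFFF
            lo, hi, trail = 0x80, 0x8F, 3
        else:                               # 80..C1 and F5..FF never start a sequence
            return False
        if i + trail >= n:                  # sequence is truncated
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + trail + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += trail + 1
    return True
```

For example, is_well_formed_utf8(b"\xc0\x80") and is_well_formed_utf8(b"\xed\xa0\x80") both return False, although each matches the bit patterns in the quoted table.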

> But there is one concern. In some cases the UTF-8 byte stream starts with a
> BOM (e.g. when we read bytes from a text file that was saved using Notepad
> with the UTF-8 option in Win2k); the actual text starts after the first few
> bytes (I suppose the first 3 bytes).
> So how do we detect whether the byte stream starts with a BOM or not?
> Do the first few bytes represent a BOM or the actual text?

There is a whole FAQ section on this topic at http://www.unicode.org/faq/utf_bom.html#BOM
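
For the practical side of the question, a minimal sketch of BOM sniffing might look like the following. It relies on the BOM constants in Python's standard codecs module; the helper names sniff_bom and decode_utf8_skipping_bom are invented for this example. The UTF-32 signatures are tested first because FF FE 00 00 also begins with the UTF-16LE signature FF FE, so the result for such input is a guess, not a proof.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM (FF FE 00 00) starts with
# the UTF-16LE BOM (FF FE), so order matters here.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found.

    Hypothetical helper. A missing BOM says nothing about the encoding;
    it only means the text starts at the first byte.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_utf8_skipping_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading EF BB BF if present."""
    name, length = sniff_bom(data)
    if name == "utf-8":
        data = data[length:]
    return data.decode("utf-8")
```

A file saved by Notepad as UTF-8 would then give sniff_bom(...) == ("utf-8", 3), and the text proper starts at the fourth byte. Python's built-in "utf-8-sig" codec performs the same stripping on decode.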

Best regards,
markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.

