From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Jan 14 2004 - 12:17:48 EST
Deepak Chand Rathore wrote:
> Unicode range: UTF-8 encoded bytes
> U-00000000 - U-0000007F: 0xxxxxxx
> U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
> U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
> ...
This table is not correct. Please check Unicode 3.2 or Unicode 4.0 for the correct table:
Unicode 3.2: Table 3.1B, "Legal UTF-8 Byte Sequences", in http://www.unicode.org/reports/tr28/#3_1_conformance
Unicode 4.0: the Conformance chapter in http://www.unicode.org/versions/Unicode4.0.0/
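
For readers who want to see what the stricter table implies in practice, here is a minimal Python sketch of a validity check. The byte ranges are my transcription of the legal-sequence table (lead byte first, then the allowed range for each following byte), so please verify them against the references above before relying on them. The names LEGAL_SEQUENCES and is_legal_utf8 are made up for this illustration. Compared with a plain bit-pattern table, these ranges reject non-shortest forms (for example C0 80), encoded surrogates (ED A0 80 through ED BF BF), and anything above U+10FFFF.

```python
# Allowed byte ranges per sequence: (lead-byte range, then one range per
# trailing byte). Transcribed by hand from the legal UTF-8 byte sequence
# table; check against the standard before relying on it.
LEGAL_SEQUENCES = [
    ((0x00, 0x7F),),                                            # U+0000..U+007F
    ((0xC2, 0xDF), (0x80, 0xBF)),                               # U+0080..U+07FF
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),                 # U+0800..U+0FFF
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),                 # U+1000..U+CFFF
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),                 # U+D000..U+D7FF
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),                 # U+E000..U+FFFF
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),   # U+10000..U+3FFFF
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),   # U+40000..U+FFFFF
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),   # U+100000..U+10FFFF
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if every byte of data belongs to a legal sequence."""
    i = 0
    while i < len(data):
        for ranges in LEGAL_SEQUENCES:
            chunk = data[i:i + len(ranges)]
            # The chunk must be complete and each byte must fall in its range.
            if len(chunk) == len(ranges) and all(
                lo <= b <= hi for b, (lo, hi) in zip(chunk, ranges)
            ):
                i += len(ranges)
                break
        else:
            # No row of the table matched at this position.
            return False
    return True
```

The lead-byte ranges do not overlap, so at most one row can match at any position and the order of the rows does not matter.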
> But there is one concern. In some cases the UTF-8 byte stream starts with a
> BOM (e.g. when we read bytes from a text file that was saved with Notepad,
> using the UTF-8 option, on Windows 2000): the actual text starts after the
> first few bytes (I suppose the first 3 bytes).
> So how do we detect whether the byte stream starts with a BOM or not?
> In other words, do the first few bytes represent a BOM or the actual text?
There is a whole FAQ section on this topic at http://www.unicode.org/faq/utf_bom.html#BOM
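
As a rough illustration of the detection step: U+FEFF encoded in UTF-8 is the three bytes EF BB BF, so a reader can check for that prefix before decoding. The sketch below uses Python's codecs.BOM_UTF8 constant; the helper name split_utf8_bom is hypothetical. It assumes that an initial EF BB BF should be treated as a signature and dropped, which suits plain text files such as those saved by Notepad, but whether that is appropriate depends on the protocol in use, as the FAQ discusses.

```python
import codecs


def split_utf8_bom(data: bytes):
    """Return (had_bom, rest), where rest is data without a leading UTF-8 BOM.

    Assumption: a leading EF BB BF is a signature, not part of the text.
    """
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# Usage: strip the signature, then decode the remaining bytes as UTF-8.
had_bom, body = split_utf8_bom(b"\xef\xbb\xbfhello")
text = body.decode("utf-8")
```

Data that does not begin with those three bytes is returned unchanged, so the same call works for files saved with or without the signature.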
Best regards,
markus
-- Opinions expressed here may not reflect my company's positions unless otherwise noted.