Re: detecting encoding in plain text (related to utf8)

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Jan 14 2004 - 12:17:48 EST


    Deepak Chand Rathore wrote:
    > Unicode range:           UTF-8 encoded bytes:
    > U-00000000 - U-0000007F: 0xxxxxxx
    > U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
    > U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    > ...

    This table is not correct: as written, it does not exclude non-shortest forms or surrogate code points. Please check Unicode 3.2 or Unicode 4.0 for the correct table.

    Unicode 3.2: Table 3.1B, "Legal UTF-8 Byte Sequences", in http://www.unicode.org/reports/tr28/#3_1_conformance
    Unicode 4.0: Conformance chapter in http://www.unicode.org/versions/Unicode4.0.0/
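
    As a rough illustration of how the corrected table differs from the bit-pattern table quoted above, here is a minimal Python sketch of a checker that restricts the second byte according to the lead byte. The function names is_legal_utf8 and _lead_info are hypothetical, and the byte ranges are transcribed from the legal-sequence table as I understand it; please verify them against the references above before relying on them.

```python
def _lead_info(lead):
    """Return (length, low, high) for a multi-byte lead, or None if illegal.

    low..high is the allowed range of the SECOND byte; any later
    trail bytes must be in 0x80..0xBF.
    """
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF          # C0 and C1 would be non-shortest forms
    if lead == 0xE0:
        return 3, 0xA0, 0xBF          # excludes non-shortest 3-byte forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xED:
        return 3, 0x80, 0x9F          # excludes surrogates D800..DFFF
    if lead == 0xF0:
        return 4, 0x90, 0xBF          # excludes non-shortest 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    if lead == 0xF4:
        return 4, 0x80, 0x8F          # excludes values above 10FFFF
    return None                       # 80..C1 and F5..FF never start a sequence


def is_legal_utf8(data: bytes) -> bool:
    """Check every sequence in data against the restricted byte ranges."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:              # single-byte (ASCII) sequence
            i += 1
            continue
        info = _lead_info(lead)
        if info is None:
            return False
        length, low, high = info
        if i + length > n:            # sequence truncated at end of input
            return False
        if not low <= data[i + 1] <= high:
            return False
        for trail in data[i + 2:i + length]:
            if not 0x80 <= trail <= 0xBF:
                return False
        i += length
    return True
```

    With this sketch, the bytes C0 80 and ED A0 80 are rejected, although both fit the quoted bit patterns (110xxxxx 10xxxxxx and 1110xxxx 10xxxxxx 10xxxxxx respectively).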

    > But there is one concern. In some cases the UTF-8 byte stream starts with a
    > BOM (e.g., when we read bytes from a text file that was saved with Notepad
    > on Windows 2000 using the UTF-8 option); the actual text starts after the
    > first few bytes (I suppose the first 3 bytes).
    > So how do we detect whether the byte stream starts with a BOM,
    > i.e., whether the first few bytes are a BOM or the actual text?

    There is a whole FAQ section on this topic at http://www.unicode.org/faq/utf_bom.html#BOM
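
    For the practical side of the question, a minimal Python sketch follows. It assumes the common convention that the three bytes EF BB BF (U+FEFF encoded in UTF-8) at the very start of a stream are a signature and not part of the text; the FAQ above covers when that assumption is appropriate. The function name strip_utf8_bom is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes):
    """Return (had_bom, payload) for a byte string.

    Assumption: a leading EF BB BF is a signature, not text.
    Only the start of the data is examined.
    """
    bom = codecs.BOM_UTF8             # b'\xef\xbb\xbf'
    if data.startswith(bom):
        return True, data[len(bom):]
    return False, data
```

    If only the decoded text is needed, Python's "utf-8-sig" codec, as in data.decode("utf-8-sig"), drops a leading signature when one is present and otherwise decodes as plain UTF-8.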

    Best regards,
    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

