Re: Detecting encoding in Plain text

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Jan 12 2004 - 00:48:12 EST

    Brijesh Sharma <bssharma at quark dot co dot in> wrote:

    > I'm writing a small tool to get text from a txt file into an edit
    > box. Now this txt file could be in any encoding, e.g. UTF-8,
    > UTF-16, Mac Roman, Windows ANSI, Western (ISO-8859-1), JIS,
    > Shift-JIS, etc.
    > My problem is that I can distinguish between UTF-8 and UTF-16
    > using the BOM.
    > But how do I auto-detect the others?
    > Any kind of help will be appreciated.

    This has always been an interesting topic to me, even before the Unicode
    era. The best information I have ever seen on this topic is Li and
    Momoi's paper. To reiterate the URL:

    http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

    If you are "writing a small tool," however, you may not have the space
    or time to implement everything Li and Momoi described.

    You probably need to divide the problem into (1) detection of Unicode
    encodings and (2) detection of non-Unicode encodings, because these are
    really different problems.

    Detecting Unicode encodings, of course, is trivial if the stream begins
    with a signature (BOM, U+FEFF), but as Jon Hanna pointed out, you can't
    always count on the signature being present. You need to rely primarily
    on what Li and Momoi call the "coding scheme method," searching for
    valid (and invalid) sequences in the various encoding schemes. This
    works well for UTF-8 in particular; most non-contrived text that
    contains at least one valid multibyte UTF-8 sequence and no invalid
    UTF-8 sequences is very likely to be UTF-8.
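
    A rough sketch of that coding-scheme test for UTF-8 might look like
    the following (Python here just for brevity; the byte-range checks
    are the point, and the finer rules about overlong and surrogate
    forms are deliberately left out):

        def looks_like_utf8(data):
            # True if the data contains at least one valid multibyte
            # UTF-8 sequence and no invalid sequences (simplified:
            # overlong and surrogate encodings are not rejected here).
            saw_multibyte = False
            i = 0
            while i < len(data):
                b = data[i]
                if b < 0x80:               # plain ASCII byte
                    i += 1
                    continue
                if 0xC2 <= b <= 0xDF:
                    trail = 1              # two-byte sequence
                elif 0xE0 <= b <= 0xEF:
                    trail = 2              # three-byte sequence
                elif 0xF0 <= b <= 0xF4:
                    trail = 3              # four-byte sequence
                else:
                    return False           # 80-C1, F5-FF can't lead
                if i + trail >= len(data):
                    return False           # sequence runs off the end
                for k in range(1, trail + 1):
                    if not 0x80 <= data[i + k] <= 0xBF:
                        return False       # trail byte out of range
                saw_multibyte = True
                i += trail + 1
            return saw_multibyte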

    In UTF-16 practically any sequence of bytes is valid, and since you
    can't assume you know the language, you can't employ distribution
    statistics. Twelve years ago, when most text was not Unicode and all
    Unicode text was UTF-16, Microsoft documentation suggested a heuristic
    of checking every other byte to see if it was zero, which of course
    would only work for Latin-1 text encoded in UTF-16. If you need to
    detect the encoding of non-Western-European text, you would have to be
    more sophisticated than this.
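
    For what it's worth, that old zero-byte heuristic takes only a few
    lines to sketch (the 40% threshold is an arbitrary choice of mine,
    not anything from the Microsoft documentation):

        def guess_utf16_by_zero_bytes(data, threshold=0.4):
            # Latin-1-range text in UTF-16 has 00 in every high-order
            # byte, so look for a column of zeros at even or odd offsets.
            if len(data) < 4 or len(data) % 2 != 0:
                return None
            evens, odds = data[0::2], data[1::2]
            if odds.count(0) / len(odds) >= threshold and evens.count(0) == 0:
                return 'UTF-16LE'          # zeros in the high bytes
            if evens.count(0) / len(evens) >= threshold and odds.count(0) == 0:
                return 'UTF-16BE'
            return None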

    Here are some notes I've taken on detecting a byte stream known to be in
    a Unicode encoding scheme (UTF-8, UTF-16, UTF-32, or SCSU). This is a
    work in progress and is not expected to be complete or perfect, so feel
    free to send corrections and enhancements but not flames:

    0A 00
    • inverse of U+000A LINE FEED
    • U+0A00 = unassigned Gurmukhi code point
    • may indicate little-endian UTF-16

    0A 0D
    • 8-bit line-feed + carriage return
    • U+0A0D = unassigned Gurmukhi code point
    • probably indicates 8-bit encoding

    0D 00
    • inverse of U+000D CARRIAGE RETURN
    • U+0D00 = unassigned Malayalam code point
    • may indicate little-endian UTF-16

    0D 0A
    • 8-bit carriage return + line feed
    • U+0D0A = MALAYALAM LETTER UU
      • text should include other Malayalam characters (U+0D00—U+0D7F)
    • otherwise, probably indicates 8-bit encoding

    20 00
    • inverse of U+0020 SPACE
    • U+2000 = EN QUAD (infrequent character)
    • may indicate UTF-16 (probably little-endian)

    28 20
    • inverse of U+2028 LINE SEPARATOR
    • U+2820 = BRAILLE PATTERN DOTS-6
      • text should include other Braille characters (U+2800—U+28FF)
    • may indicate little-endian UTF-16
    • but may also indicate 8-bit 20 28 or 28 20 (space + left parenthesis)

    E2 80 A8
    • UTF-8 representation of U+2028 LINE SEPARATOR
    • probably indicates UTF-8

    05 28
    • SCSU representation of U+2028 LINE SEPARATOR
    • U+0528 is unassigned
    • U+2805 is BRAILLE PATTERN DOTS-13
      • should be surrounded by other Braille characters
    • otherwise, probably indicates SCSU

    00 00 00
    • probably a Basic Latin character in UTF-32 (either byte order)
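
    A crude way to put a few of those indicators to work is to tally
    line-ending patterns, insisting that UTF-16 hits start on an even
    (code-unit) boundary so that stray matches straddling two code
    units don't count. Again, just a sketch:

        def score_line_endings(data):
            # Tally a few of the indicator sequences listed above; the
            # scheme with the most hits is the leading (not certain)
            # candidate.
            def aligned(pattern):
                # Count matches that begin on an even offset, i.e. on a
                # UTF-16 code-unit boundary.
                return sum(1 for i in range(0, len(data) - len(pattern) + 1, 2)
                           if data[i:i + len(pattern)] == pattern)
            return {
                'UTF-16LE':       aligned(b'\x0D\x00') + aligned(b'\x0A\x00'),
                'UTF-16BE':       aligned(b'\x00\x0D') + aligned(b'\x00\x0A'),
                '8-bit or UTF-8': data.count(b'\x0D\x0A'),
                'SCSU':           data.count(b'\x05\x28'),  # SQ4-quoted U+2028
            }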

    Detecting non-Unicode encodings is quite another matter, and here you
    really need to study the techniques described by Li and Momoi.
    Distinguishing straight ASCII from ISO 8859-1 from Windows-1252 is
    easy -- just check which subsets of Windows-1252 are present -- but
    throwing Mac Roman and East Asian double-byte sets into the mix is
    another matter.
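
    In code, the ASCII / 8859-1 / 1252 subset check can be as simple as
    this (assuming the text really is in one of those three; the C1
    range 80-9F is where Windows-1252 puts its curly quotes and dashes,
    and genuine ISO 8859-1 text almost never uses those positions):

        def classify_single_byte(data):
            # Subset check: plain ASCII uses only 00-7F; Windows-1252
            # also uses 80-9F (smart quotes, dashes, euro sign), which
            # ISO 8859-1 reserves for control codes.
            if all(b <= 0x7F for b in data):
                return 'US-ASCII'
            if any(0x80 <= b <= 0x9F for b in data):
                return 'windows-1252'
            return 'ISO-8859-1'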

    I once wrote a program to detect the encoding of a text sample known to
    be in one of the following Cyrillic encodings:

    • KOI8-R
    • Windows code page 1251
    • ISO 8859-5
    • MS-DOS code page 866
    • MS-DOS code page 855
    • Mac Cyrillic

    Given the Unicode scalar values corresponding to each byte value, the
    program calculates the proportion of Cyrillic characters (as opposed to
    punctuation and dingbats) when interpreted in each possible encoding,
    and picks the encoding with the highest proportion (confidence level).
    This is a dumbed-down version of Li and Momoi's character distribution
    method, but works surprisingly well so long as the text really is in one
    of these Cyrillic encodings. It fails spectacularly for text in
    Latin-1, Mac Roman, UTF-8, etc. It would probably also be unable to
    detect differences between almost-identical character sets, like KOI8-R
    and KOI8-U.
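
    The idea reduces to something like the sketch below: decode the
    bytes under each candidate encoding, see what fraction of the
    non-ASCII bytes come out as Cyrillic letters (U+0400..U+04FF), and
    take the winner. The codec names are Python's, not anything from
    the original program, and counting only the high bytes is a
    simplification of the proportion described above:

        CANDIDATES = ['koi8_r', 'cp1251', 'iso8859_5',
                      'cp866', 'cp855', 'mac_cyrillic']

        def guess_cyrillic_encoding(data):
            # Crude character-distribution check: the right encoding
            # should turn most high bytes into Cyrillic letters rather
            # than box-drawing characters, punctuation, or dingbats.
            def confidence(enc):
                text = data.decode(enc, errors='replace')
                high = [ch for byte, ch in zip(data, text) if byte >= 0x80]
                if not high:
                    return 0.0
                cyrillic = sum(1 for ch in high if '\u0400' <= ch <= '\u04FF')
                return cyrillic / len(high)
            return max(CANDIDATES, key=confidence)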

    The smaller your list of "possible" encodings, the easier your job of
    detecting one of them.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/
