Re: Detecting encoding in Plain text

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Jan 12 2004 - 11:11:20 EST

    One thing I have done in the past that was along similar lines:

    If you know that the text is in a UTF, and if you know that you support
    the latest version of Unicode, then you can walk through the bytes in 7
    parallel paths, each path fetching a code point in one of the 7 encoding
    schemes and testing it. If a path hits an illegal sequence or an
    unassigned code point, 'turn off' that path. If at any point only a
    single path remains, jump to a faster routine to do the rest of the
    conversion. (I actually had 8 paths, since I could also have Latin-1.)
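
    In rough Python terms the walk might look like this (a sketch only; the
    codec names, the priority order, and the byte-at-a-time loop are
    illustrative, not the original code):

        import codecs
        import unicodedata

        # The 7 Unicode encoding schemes, plus Latin-1 as an 8th path,
        # listed in a simple priority order for breaking ties.
        SCHEMES = ["utf-8", "utf-16", "utf-16-be", "utf-16-le",
                   "utf-32", "utf-32-be", "utf-32-le", "latin-1"]

        def surviving_paths(data):
            """Feed bytes to all paths in parallel; drop a path when it
            hits an ill-formed sequence or an unassigned code point."""
            paths = {name: codecs.getincrementaldecoder(name)("strict")
                     for name in SCHEMES}
            for i in range(len(data)):
                for name in list(paths):
                    try:
                        chunk = paths[name].decode(data[i:i+1], final=False)
                    except UnicodeDecodeError:
                        del paths[name]      # illegal sequence: turn off
                        continue
                    if any(unicodedata.category(c) == "Cn" for c in chunk):
                        del paths[name]      # unassigned code point
                if len(paths) == 1:
                    break   # a real implementation would switch to a
                            # fast bulk conversion for the rest here
            return [name for name in SCHEMES if name in paths]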

    I never put in anything to settle the cases where you end up with more
    than one path, except for a simple priority order. In the rare cases
    where it matters, I suspect something simple would go a long way:
    capture the frequency of some common characters, such as newlines,
    spaces, and certain punctuation, and of some uncommon characters (most
    controls).
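
    For instance, a tie-breaking score along those lines (a sketch; the
    character sets and the penalty weight are arbitrary guesses):

        import unicodedata

        def plausibility(text):
            """Reward common text characters; penalize control
            characters other than ordinary whitespace."""
            common = set(" \t\r\n.,;:!?'\"()-")
            controls = sum(unicodedata.category(ch) == "Cc"
                           and ch not in "\t\r\n" for ch in text)
            good = sum(ch in common for ch in text)
            return (good - 10 * controls) / max(len(text), 1)

    You would decode the ambiguous bytes once per surviving path and keep
    the path whose text scores highest.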

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄ (Sanskrit: "One should wish to be defeated
    by one's own disciple.")

    ----- Original Message -----
    From: "Doug Ewell" <dewell@adelphia.net>
    To: "Unicode Mailing List" <unicode@unicode.org>
    Cc: "Brijesh Sharma" <bssharma@quark.co.in>
    Sent: Sun, 2004 Jan 11 21:48
    Subject: Re: Detecting encoding in Plain text

    > Brijesh Sharma <bssharma at quark dot co dot in> wrote:
    >
    > > I am writing a small tool to get text from a txt file into an edit
    > > box. Now this txt file could be in any encoding, e.g. UTF-8,
    > > UTF-16, Mac Roman, Windows ANSI, Western (ISO-8859-1), JIS,
    > > Shift-JIS, etc.
    > > My problem is that I can distinguish between UTF-8 and UTF-16 using
    > > the BOM.
    > > But how do I auto-detect the others?
    > > Any kind of help will be appreciated.
    >
    > This has always been an interesting topic to me, even before the Unicode
    > era. The best information I have ever seen on this topic is Li and
    > Momoi's paper. To reiterate the URL:
    >
    > http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
    >
    > If you are "writing a small tool," however, you may not have the space
    > or time to implement everything Li and Momoi described.
    >
    > You probably need to divide the problem into (1) detection of Unicode
    > encodings and (2) detection of non-Unicode encodings, because these are
    > really different problems.
    >
    > Detecting Unicode encodings, of course, is trivial if the stream begins
    > with a signature (BOM, U+FEFF), but as Jon Hanna pointed out, you can't
    > always count on the signature being present. You need to rely primarily
    > on what Li and Momoi call the "coding scheme method," searching for
    > valid (and invalid) sequences in the various encoding schemes. This
    > works well for UTF-8 in particular; most non-contrived text that
    > contains at least one valid multibyte UTF-8 sequence and no invalid
    > UTF-8 sequences is very likely to be UTF-8.
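    >
    > In Python, a minimal form of that test (a sketch, not code from the
    > paper; note that Python's strict UTF-8 decoder already rejects
    > overlong forms, surrogates, and values above U+10FFFF):
    >
    >     def looks_like_utf8(data):
    >         """Any ill-formed sequence rules UTF-8 out; at least one
    >         valid multibyte sequence makes it a strong candidate."""
    >         try:
    >             text = data.decode("utf-8")
    >         except UnicodeDecodeError:
    >             return False
    >         return any(ord(ch) > 0x7F for ch in text)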
    >
    > In UTF-16 practically any sequence of bytes is valid, and since you
    > can't assume you know the language, you can't employ distribution
    > statistics. Twelve years ago, when most text was not Unicode and all
    > Unicode text was UTF-16, Microsoft documentation suggested a heuristic
    > of checking every other byte to see if it was zero, which of course
    > would only work for Latin-1 text encoded in UTF-16. If you need to
    > detect the encoding of non-Western-European text, you would have to be
    > more sophisticated than this.
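    >
    > That heuristic is easy to state (a sketch; the 50%/10% thresholds
    > are arbitrary):
    >
    >     def guess_utf16_by_zero_bytes(data):
    >         """In UTF-16 text limited to Latin-1 characters, every
    >         other byte is 00: 'A' is 41 00 in LE and 00 41 in BE."""
    >         even, odd = data[0::2], data[1::2]
    >         if not even or not odd:
    >             return None
    >         even_zero = even.count(0) / len(even)
    >         odd_zero = odd.count(0) / len(odd)
    >         if odd_zero > 0.5 and even_zero < 0.1:
    >             return "utf-16-le"
    >         if even_zero > 0.5 and odd_zero < 0.1:
    >             return "utf-16-be"
    >         return None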
    >
    > Here are some notes I've taken on detecting a byte stream known to be in
    > a Unicode encoding scheme (UTF-8, UTF-16, UTF-32, or SCSU). This is a
    > work in progress and is not expected to be complete or perfect, so feel
    > free to send corrections and enhancements but not flames:
    >
    > 0A 00
    > • inverse of U+000A LINE FEED
    > • U+0A00 = unassigned Gurmukhi code point
    > • may indicate little-endian UTF-16
    >
    > 0A 0D
    > • 8-bit line-feed + carriage return
    > • U+0A0D = unassigned Gurmukhi code point
    > • probably indicates 8-bit encoding
    >
    > 0D 00
    > • inverse of U+000D CARRIAGE RETURN
    > • U+0D00 = unassigned Malayalam code point
    > • may indicate little-endian UTF-16
    >
    > 0D 0A
    > • 8-bit carriage return + line feed
    > • U+0D0A = MALAYALAM LETTER UU
    > • text should include other Malayalam characters (U+0D00—U+0D7F)
    > • otherwise, probably indicates 8-bit encoding
    >
    > 20 00
    > • inverse of U+0020 SPACE
    > • U+2000 = EN QUAD (infrequent character)
    > • may indicate UTF-16 (probably little-endian)
    >
    > 28 20
    > • inverse of U+2028 LINE SEPARATOR
    > • U+2820 = BRAILLE PATTERN DOTS-6
    > • text should include other Braille characters (U+2800—U+28FF)
    > • may indicate little-endian UTF-16
    > • but may also indicate 8-bit 20 28 or 28 20 (space + left parenthesis)
    >
    > E2 80 A8
    > • UTF-8 representation of U+2028 LINE SEPARATOR
    > • probably indicates UTF-8
    >
    > 05 28
    > • SCSU representation of U+2028 LINE SEPARATOR
    > • U+0528 is unassigned
    > • U+2805 is BRAILLE PATTERN DOTS-13
    > • should be surrounded by other Braille characters
    > • otherwise, probably indicates SCSU
    >
    > 00 00 00
    > • probably a Basic Latin character in UTF-32 (either byte order)
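    >
    > A table-driven scan over hints like these could look as follows (a
    > sketch; treating each hit as one equal vote is a simplification):
    >
    >     HINTS = [
    >         (b"\x0a\x00", "utf-16-le"),
    >         (b"\x0d\x00", "utf-16-le"),
    >         (b"\x20\x00", "utf-16-le"),
    >         (b"\x0d\x0a", "8-bit"),
    >         (b"\x0a\x0d", "8-bit"),
    >         (b"\xe2\x80\xa8", "utf-8"),
    >         (b"\x05\x28", "scsu"),
    >         (b"\x00\x00\x00", "utf-32"),
    >     ]
    >
    >     def tally(data):
    >         """Count occurrences of each hint; the caveats above
    >         (Braille, Malayalam) still need checking before a vote
    >         is trusted."""
    >         votes = {}
    >         for seq, label in HINTS:
    >             n = data.count(seq)
    >             if n:
    >                 votes[label] = votes.get(label, 0) + n
    >         return votes   # e.g. {"utf-16-le": 12, "8-bit": 1}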
    >
    > Detecting non-Unicode encodings is quite another matter, and here you
    > really need to study the techniques described by Li and Momoi.
    > Distinguishing straight ASCII from ISO 8859-1 from Windows-1252 is
    > easy, since each is (nearly) a subset of the next -- just check which
    > byte ranges are actually in use -- but throwing Mac Roman and East
    > Asian double-byte sets into the mix is another matter.
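    >
    > In code, the easy Western case reduces to a range check (a sketch):
    >
    >     def classify_western(data):
    >         """ASCII uses only 00-7F; ISO 8859-1 adds A0-FF; and
    >         Windows-1252 also assigns printable characters in 80-9F,
    >         where ISO 8859-1 has only the rarely-used C1 controls."""
    >         if all(b < 0x80 for b in data):
    >             return "us-ascii"
    >         if any(0x80 <= b <= 0x9F for b in data):
    >             return "windows-1252"
    >         return "iso-8859-1"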
    >
    > I once wrote a program to detect the encoding of a text sample known to
    > be in one of the following Cyrillic encodings:
    >
    > • KOI8-R
    > • Windows code page 1251
    > • ISO 8859-5
    > • MS-DOS code page 866
    > • MS-DOS code page 855
    > • Mac Cyrillic
    >
    > Given the Unicode scalar values corresponding to each byte value, the
    > program calculates the proportion of Cyrillic characters (as opposed to
    > punctuation and dingbats) when interpreted in each possible encoding,
    > and picks the encoding with the highest proportion (confidence level).
    > This is a dumbed-down version of Li and Momoi's character distribution
    > method, but it works surprisingly well as long as the text really is in one
    > of these Cyrillic encodings. It fails spectacularly for text in
    > Latin-1, Mac Roman, UTF-8, etc. It would probably also be unable to
    > detect differences between almost-identical character sets, like KOI8-R
    > and KOI8-U.
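    >
    > The calculation amounts to something like this (a reconstruction in
    > Python, not the original program):
    >
    >     CANDIDATES = ["koi8-r", "cp1251", "iso-8859-5",
    >                   "cp866", "cp855", "mac-cyrillic"]
    >
    >     def best_cyrillic_encoding(data):
    >         """Decode under each candidate and pick the one yielding
    >         the highest proportion of Cyrillic letters (U+0400-U+04FF)
    >         among all letters."""
    >         def confidence(enc):
    >             text = data.decode(enc, errors="replace")
    >             letters = [ch for ch in text if ch.isalpha()]
    >             if not letters:
    >                 return 0.0
    >             cyr = sum(1 for ch in letters
    >                       if 0x0400 <= ord(ch) <= 0x04FF)
    >             return cyr / len(letters)
    >         return max(CANDIDATES, key=confidence)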
    >
    > The smaller your list of "possible" encodings, the easier your job of
    > detecting one of them.
    >
    > -Doug Ewell
    > Fullerton, California
    > http://users.adelphia.net/~dewell/