From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Jan 12 2004 - 11:11:20 EST
One thing I have done in the past that was along similar lines:
If you know that it is a UTF, and if you know that you support the latest
version of Unicode, then you can walk through the bytes in 7 parallel paths,
each path fetching and testing a code point in one of the 7 encoding schemes.
If you hit an illegal sequence or unassigned code point, then you 'turn off'
that path. If you are down to a single path at any point, then jump to a
faster routine to do the rest of the conversion. (I actually had 8 paths,
since I could also have Latin-1.)
I never put in anything to settle the cases where you end up with more than
one path, except for a simple priority order. In those rare cases, I suspect
something simple, such as capturing the frequency of some common characters
(new lines, spaces, certain punctuation) and some uncommon characters (most
controls), would go a long way.
Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄ ("one should wish for defeat at the hands of one's disciple")
----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: "Unicode Mailing List" <unicode@unicode.org>
Cc: "Brijesh Sharma" <bssharma@quark.co.in>
Sent: Sun, 2004 Jan 11 21:48
Subject: Re: Detecting encoding in Plain text
> Brijesh Sharma <bssharma at quark dot co dot in> wrote:
>
> > I am writing a small tool to get text from a txt file into an edit box.
> > Now this txt file could be in any encoding, e.g. UTF-8, UTF-16, Mac
> > Roman, Windows ANSI, Western (ISO 8859-1), JIS, Shift-JIS, etc.
> > My problem is that I can distinguish between UTF-8 and UTF-16 using
> > the BOM.
> > But how do I auto-detect the others?
> > Any kind of help will be appreciated.
>
> This has always been an interesting topic to me, even before the Unicode
> era. The best information I have ever seen on this topic is Li and
> Momoi's paper. To reiterate the URL:
>
> http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>
> If you are "writing a small tool," however, you may not have the space
> or time to implement everything Li and Momoi described.
>
> You probably need to divide the problem into (1) detection of Unicode
> encodings and (2) detection of non-Unicode encodings, because these are
> really different problems.
>
> Detecting Unicode encodings, of course, is trivial if the stream begins
> with a signature (BOM, U+FEFF), but as Jon Hanna pointed out, you can't
> always count on the signature being present. You need to rely primarily
> on what Li and Momoi call the "coding scheme method," searching for
> valid (and invalid) sequences in the various encoding schemes. This
> works well for UTF-8 in particular; most non-contrived text that
> contains at least one valid multibyte UTF-8 sequence and no invalid
> UTF-8 sequences is very likely to be UTF-8.
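> A sketch of that UTF-8 check (the function name is mine; pure ASCII
> decodes as UTF-8 trivially, so it is treated as inconclusive here):

```python
def looks_like_utf8(data: bytes) -> bool:
    # Valid UTF-8 containing at least one multibyte sequence is very
    # likely to really be UTF-8; pure ASCII proves nothing either way.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(ord(ch) > 0x7F for ch in text)
```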
>
> In UTF-16 practically any sequence of bytes is valid, and since you
> can't assume you know the language, you can't employ distribution
> statistics. Twelve years ago, when most text was not Unicode and all
> Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> of checking every other byte to see if it was zero, which of course
> would only work for Latin-1 text encoded in UTF-16. If you need to
> detect the encoding of non-Western-European text, you would have to be
> more sophisticated than this.
>
> Here are some notes I've taken on detecting a byte stream known to be in
> a Unicode encoding scheme (UTF-8, UTF-16, UTF-32, or SCSU). This is a
> work in progress and is not expected to be complete or perfect, so feel
> free to send corrections and enhancements but not flames:
>
> 0A 00
> • inverse of U+000A LINE FEED
> • U+0A00 = unassigned Gurmukhi code point
> • may indicate little-endian UTF-16
>
> 0A 0D
> • 8-bit line-feed + carriage return
> • U+0A0D = unassigned Gurmukhi code point
> • probably indicates 8-bit encoding
>
> 0D 00
> • inverse of U+000D CARRIAGE RETURN
> • U+0D00 = unassigned Malayalam code point
> • may indicate little-endian UTF-16
>
> 0D 0A
> • 8-bit carriage return + line feed
> • U+0D0A = MALAYALAM LETTER UU
> • text should include other Malayalam characters (U+0D00..U+0D7F)
> • otherwise, probably indicates 8-bit encoding
>
> 20 00
> • inverse of U+0020 SPACE
> • U+2000 = EN QUAD (infrequent character)
> • may indicate UTF-16 (probably little-endian)
>
> 28 20
> • inverse of U+2028 LINE SEPARATOR
> • U+2820 = BRAILLE PATTERN DOTS-6
> • text should include other Braille characters (U+2800..U+28FF)
> • may indicate little-endian UTF-16
> • but may also indicate 8-bit 20 28 or 28 20 (space + left parenthesis)
>
> E2 80 A8
> • UTF-8 representation of U+2028 LINE SEPARATOR
> • probably indicates UTF-8
>
> 05 28
> • SCSU representation of U+2028 LINE SEPARATOR
> • U+0528 is unassigned
> • U+2805 is BRAILLE PATTERN DOTS-13
> • should be surrounded by other Braille characters
> • otherwise, probably indicates SCSU
>
> 00 00 00
> • probably a Basic Latin character in UTF-32 (either byte order)
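> A few of these observations can be tallied mechanically. This sketch
> counts occurrences of a handful of the patterns above (the verdict
> labels are mine, and overlapping matches mean the hints can disagree,
> just as the notes themselves warn):

```python
# Byte patterns drawn from the notes above, with a rough verdict each.
HINTS = {
    b"\x0a\x00": "utf-16-le",     # inverse of U+000A LINE FEED
    b"\x00\x0a": "utf-16-be",
    b"\x0d\x0a": "8-bit",         # single-byte CR + LF
    b"\xe2\x80\xa8": "utf-8",     # UTF-8 for U+2028 LINE SEPARATOR
    b"\x05\x28": "scsu",          # SCSU for U+2028 LINE SEPARATOR
}

def tally_hints(data: bytes) -> dict[str, int]:
    # Count each pattern's occurrences and sum votes per verdict.
    votes: dict[str, int] = {}
    for pattern, verdict in HINTS.items():
        count = data.count(pattern)
        if count:
            votes[verdict] = votes.get(verdict, 0) + count
    return votes
```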
>
> Detecting non-Unicode encodings is quite another matter, and here you
> really need to study the techniques described by Li and Momoi.
> Distinguishing straight ASCII from ISO 8859-1 from Windows-1252 is
> easy -- just check which subsets of Windows-1252 are present -- but
> throwing Mac Roman and East Asian double-byte sets into the mix is
> another matter.
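> The easy Western split amounts to two range checks (a sketch, assuming
> the text really is one of these three; any byte in 0x80-0x9F is taken
> as evidence of Windows-1252, since that range holds rarely used C1
> controls in ISO 8859-1):

```python
def classify_western(data: bytes) -> str:
    # All bytes below 0x80: plain ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"
    # 0x80-0x9F is printable in Windows-1252 but C1 controls in 8859-1.
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"
    return "iso-8859-1"
```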
>
> I once wrote a program to detect the encoding of a text sample known to
> be in one of the following Cyrillic encodings:
>
> • KOI8-R
> • Windows code page 1251
> • ISO 8859-5
> • MS-DOS code page 866
> • MS-DOS code page 855
> • Mac Cyrillic
>
> Given the Unicode scalar values corresponding to each byte value, the
> program calculates the proportion of Cyrillic characters (as opposed to
> punctuation and dingbats) when interpreted in each possible encoding,
> and picks the encoding with the highest proportion (confidence level).
> This is a dumbed-down version of Li and Momoi's character distribution
> method, but works surprisingly well so long as the text really is in one
> of these Cyrillic encodings. It fails spectacularly for text in
> Latin-1, Mac Roman, UTF-8, etc. It would probably also be unable to
> detect differences between almost-identical character sets, like KOI8-R
> and KOI8-U.
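> That confidence calculation is simple enough to sketch (encoding names
> are Python's; testing only the U+0400..U+04FF block is rougher than
> what the original program did, and ties go to the earlier entry):

```python
CYRILLIC_ENCODINGS = ["koi8-r", "cp1251", "iso8859-5",
                      "cp866", "cp855", "mac-cyrillic"]

def cyrillic_confidence(data: bytes, encoding: str) -> float:
    # Fraction of high bytes that decode to Cyrillic letters.
    high = bytes(b for b in data if b >= 0x80)
    if not high:
        return 0.0
    text = high.decode(encoding, errors="replace")
    return sum(0x0400 <= ord(ch) <= 0x04FF for ch in text) / len(text)

def guess_cyrillic(data: bytes) -> str:
    # Pick the encoding with the highest proportion of Cyrillic letters.
    return max(CYRILLIC_ENCODINGS,
               key=lambda enc: cyrillic_confidence(data, enc))
```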
>
> The smaller your list of "possible" encodings, the easier your job of
> detecting one of them.
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
>
>
>
This archive was generated by hypermail 2.1.5 : Mon Jan 12 2004 - 11:59:49 EST