From: Doug Ewell (dewell@adelphia.net)
Date: Mon Jan 12 2004 - 00:48:12 EST
Brijesh Sharma <bssharma at quark dot co dot in> wrote:
> I'm writing a small tool to get text from a txt file into an edit box.
> Now this txt file could be in any encoding, e.g. UTF-8, UTF-16, Mac
> Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc.
> My problem is that I can distinguish between UTF-8 and UTF-16 using
> the BOM.
> But how do I auto-detect the others?
> Any kind of help will be appreciated.
This has always been an interesting topic to me, even before the Unicode
era. The best information I have ever seen on the subject is Li and
Momoi's paper. To reiterate the URL:
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
If you are "writing a small tool," however, you may not have the space
or time to implement everything Li and Momoi described.
You probably need to divide the problem into (1) detection of Unicode
encodings and (2) detection of non-Unicode encodings, because these are
really different problems.
Detecting Unicode encodings, of course, is trivial if the stream begins
with a signature (BOM, U+FEFF), but as Jon Hanna pointed out, you can't
always count on the signature being present. You need to rely primarily
on what Li and Momoi call the "coding scheme method," searching for
valid (and invalid) sequences in the various encoding schemes. This
works well for UTF-8 in particular; most non-contrived text that
contains at least one valid multibyte UTF-8 sequence and no invalid
UTF-8 sequences is very likely to be UTF-8.
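Here is a minimal sketch of that two-step idea in Python -- check for a
signature first, then fall back to the coding scheme test for UTF-8. The
function names, and the decision to demand at least one non-ASCII
character before calling it UTF-8, are my own, not Li and Momoi's:

    import codecs

    # Order matters: the UTF-32LE signature begins with the UTF-16LE one.
    SIGNATURES = [
        (codecs.BOM_UTF32_LE, 'UTF-32LE'),
        (codecs.BOM_UTF32_BE, 'UTF-32BE'),
        (codecs.BOM_UTF8,     'UTF-8'),
        (codecs.BOM_UTF16_LE, 'UTF-16LE'),
        (codecs.BOM_UTF16_BE, 'UTF-16BE'),
    ]

    def sniff_signature(data):
        """Return the encoding named by an initial signature (BOM), if any."""
        for bom, name in SIGNATURES:
            if data.startswith(bom):
                return name
        return None

    def looks_like_utf8(data):
        """Valid UTF-8 throughout, with at least one multibyte sequence."""
        try:
            text = data.decode('utf-8')
        except UnicodeDecodeError:
            return False        # one invalid sequence rules out UTF-8
        # Pure ASCII decodes as UTF-8 but proves nothing, so require at
        # least one non-ASCII character before claiming UTF-8.
        return any(ord(ch) > 0x7F for ch in text)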
In UTF-16 practically any sequence of bytes is valid, and since you
can't assume you know the language, you can't employ distribution
statistics. Twelve years ago, when most text was not Unicode and all
Unicode text was UTF-16, Microsoft documentation suggested a heuristic
of checking every other byte to see if it was zero, which of course
would only work for Latin-1 text encoded in UTF-16. If you need to
detect the encoding of non-Western-European text, you would have to be
more sophisticated than this.
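For what it is worth, a sketch of that old trick looks like this; the 90%
threshold is my own guess, and it still only catches text whose characters
fall in the Latin-1 range:

    def guess_utf16_byte_order(data):
        """Old zero-alternate-byte heuristic.  Returns 'UTF-16LE',
        'UTF-16BE', or None.  Only works for Latin-1-range text, whose
        high byte in UTF-16 is always zero."""
        if len(data) < 2:
            return None
        even = data[0::2]       # bytes at offsets 0, 2, 4, ...
        odd  = data[1::2]       # bytes at offsets 1, 3, 5, ...
        if odd.count(0) > 0.9 * len(odd):
            return 'UTF-16LE'   # zero high byte comes second
        if even.count(0) > 0.9 * len(even):
            return 'UTF-16BE'   # zero high byte comes first
        return None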
Here are some notes I've taken on detecting a byte stream known to be in
a Unicode encoding scheme (UTF-8, UTF-16, UTF-32, or SCSU). This is a
work in progress and is not expected to be complete or perfect, so feel
free to send corrections and enhancements but not flames:
0A 00
• inverse of U+000A LINE FEED
• U+0A00 = unassigned Gurmukhi code point
• may indicate little-endian UTF-16
0A 0D
• 8-bit line-feed + carriage return
• U+0A0D = unassigned Gurmukhi code point
• probably indicates 8-bit encoding
0D 00
• inverse of U+000D CARRIAGE RETURN
• U+0D00 = unassigned Malayalam code point
• may indicate little-endian UTF-16
0D 0A
• 8-bit carriage return + line feed
• U+0D0A = MALAYALAM LETTER UU
• text should include other Malayalam characters (U+0D00..U+0D7F)
• otherwise, probably indicates 8-bit encoding
20 00
• inverse of U+0020 SPACE
• U+2000 = EN QUAD (infrequent character)
• may indicate UTF-16 (probably little-endian)
28 20
• inverse of U+2028 LINE SEPARATOR
• U+2820 = BRAILLE PATTERN DOTS-6
• text should include other Braille characters (U+2800..U+28FF)
• may indicate little-endian UTF-16
• but may also be plain 8-bit text (28 20 = left parenthesis + space)
E2 80 A8
• UTF-8 representation of U+2028 LINE SEPARATOR
• probably indicates UTF-8
05 28
• SCSU representation of U+2028 LINE SEPARATOR
• U+0528 is unassigned
• U+2805 is BRAILLE PATTERN DOTS-13
• should be surrounded by other Braille characters
• otherwise, probably indicates SCSU
00 00 00
• probably a Basic Latin character in UTF-32 (either byte order)
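To act on notes like these mechanically, one crude approach is simply to
tally the telltale sequences. The table and scoring below are only an
illustration of that idea -- they ignore the Malayalam and Braille caveats
above and list only the little-endian pairs; a real detector would add the
big-endian mirror images and weigh everything more carefully:

    # Telltale sequences taken from the notes above.
    HINTS = {
        b'\x0A\x00': 'UTF-16LE',
        b'\x0D\x00': 'UTF-16LE',
        b'\x20\x00': 'UTF-16LE',
        b'\x0A\x0D': '8-bit',
        b'\x0D\x0A': '8-bit',
        b'\xE2\x80\xA8': 'UTF-8',
        b'\x05\x28': 'SCSU',
        b'\x00\x00\x00': 'UTF-32',
    }

    def tally_hints(data):
        """Count how often each telltale sequence occurs in the stream."""
        scores = {}
        for seq, enc in HINTS.items():
            n = data.count(seq)
            if n:
                scores[enc] = scores.get(enc, 0) + n
        return scores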
Detecting non-Unicode encodings is quite another matter, and here you
really need to study the techniques described by Li and Momoi.
Distinguishing straight ASCII from ISO 8859-1 from Windows-1252 is
easy -- just check which subsets of Windows-1252 are present -- but
throwing Mac Roman and East Asian double-byte sets into the mix is a
much harder problem.
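A sketch of that subset check: real ISO 8859-1 text almost never uses the
C1 control range 80-9F, which is exactly where Windows-1252 puts its extra
printable characters.

    def classify_western(data):
        """Separate ASCII, ISO 8859-1 and Windows-1252 by byte ranges."""
        if all(b < 0x80 for b in data):
            return 'ASCII'
        # 80-9F: C1 controls in ISO 8859-1, but printable characters
        # (curly quotes, dashes, etc.) in Windows-1252.
        if any(0x80 <= b <= 0x9F for b in data):
            return 'Windows-1252'
        return 'ISO 8859-1'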
I once wrote a program to detect the encoding of a text sample known to
be in one of the following Cyrillic encodings:
• KOI8-R
• Windows code page 1251
• ISO 8859-5
• MS-DOS code page 866
• MS-DOS code page 855
• Mac Cyrillic
Given the Unicode scalar values corresponding to each byte value, the
program calculates the proportion of Cyrillic characters (as opposed to
punctuation and dingbats) when interpreted in each possible encoding,
and picks the encoding with the highest proportion (confidence level).
This is a dumbed-down version of Li and Momoi's character distribution
method, but works surprisingly well so long as the text really is in one
of these Cyrillic encodings. It fails spectacularly for text in
Latin-1, Mac Roman, UTF-8, etc. It would probably also be unable to
detect differences between almost-identical character sets, like KOI8-R
and KOI8-U.
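A stripped-down sketch of the same idea (not the original program; Python's
codec names stand in for the byte-to-scalar tables, and the "Cyrillic
letter" test is just a range check on U+0400..U+04FF):

    CANDIDATES = ['koi8_r', 'cp1251', 'iso8859_5', 'cp866', 'cp855',
                  'mac_cyrillic']

    def detect_cyrillic(data):
        """Pick the candidate encoding whose interpretation yields the
        highest proportion of Cyrillic letters among non-ASCII characters."""
        best_enc, best_conf = None, 0.0
        for enc in CANDIDATES:
            text = data.decode(enc, errors='replace')
            high = [ch for ch in text if ord(ch) > 0x7F]
            if not high:
                continue                 # nothing but ASCII: no evidence
            cyrillic = sum(1 for ch in high
                           if 0x0400 <= ord(ch) <= 0x04FF)
            confidence = cyrillic / len(high)
            if confidence > best_conf:
                best_enc, best_conf = enc, confidence
        return best_enc, best_conf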
The smaller your list of "possible" encodings, the easier your job of
detecting one of them.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/