From: Doug Ewell (dewell@adelphia.net)
Date: Thu Aug 11 2005 - 09:24:06 CDT
Tom Emerson <tree at basistech dot com> wrote:
> Indeed: I wrote a detector for Arabic encodings that did exactly this,
> in that it could differentiate between ISO-8859-6, Windows CP1256,
> Unicode transformation formats, and ASMO-449. In this particular
> application it was known a priori that the text was Arabic, just not
> the encoding.
I wrote a similar program for Cyrillic encodings some years ago, which
could distinguish between CP855, CP866, CP1251, KOI8-R, ISO 8859-5, Mac
Cyrillic (thrown in more as an exercise than anything else), and UTF-8.
Other than the usual UTF-8 heuristic, however, my approach was simpler
and more brute-force than Tom's: I decoded each byte under all six
single-byte encodings and calculated the frequency of valid
combinations of upper- and lower-case letters (excluding reversed
titlecase) in each encoding.
Perhaps surprisingly, this approach was fairly accurate, as long as it
was known a priori that the text was Cyrillic (not necessarily Russian).
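
For anyone who wants to experiment, here is a rough Python sketch of the
idea, not my original program. The function names, the scoring rule
(fraction of adjacent character pairs that are both Cyrillic letters),
and the reading of "reversed titlecase" as a lower-case letter followed
by an upper-case one are all assumptions made for illustration. The
UTF-8 check is simply a strict decode.

```python
# Hypothetical sketch: names and scoring rule are illustrative assumptions.
SINGLE_BYTE = ["cp855", "cp866", "cp1251", "koi8_r", "iso8859_5", "mac_cyrillic"]


def is_cyrillic_letter(ch):
    """True for alphabetic characters in the basic Cyrillic block."""
    return "\u0400" <= ch <= "\u04ff" and ch.isalpha()


def valid_pair_fraction(text):
    """Fraction of adjacent character pairs that are both Cyrillic letters
    and are not lower-case followed by upper-case (assumed meaning of
    "reversed titlecase")."""
    if len(text) < 2:
        return 0.0
    valid = 0
    for a, b in zip(text, text[1:]):
        if is_cyrillic_letter(a) and is_cyrillic_letter(b):
            if not (a.islower() and b.isupper()):
                valid += 1
    return valid / (len(text) - 1)


def score_encodings(data):
    """Decode the bytes under each candidate encoding and score the result."""
    scores = {}
    for name in SINGLE_BYTE:
        # Undefined bytes become U+FFFD, which never counts as a letter.
        text = data.decode(name, errors="replace")
        scores[name] = valid_pair_fraction(text)
    return scores


def guess_encoding(data):
    """Return 'utf-8' if the bytes decode strictly as UTF-8, otherwise the
    best-scoring single-byte encoding (first one wins on a tie)."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    scores = score_encodings(data)
    return max(scores, key=scores.get)
```

Short or all-lower-case samples can tie between encodings under this
scoring, so it only becomes useful on longer mixed-case text, and a real
detector would want some additional tie-breaking.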
-- Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/