Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Aug 11 2005 - 09:24:06 CDT

Next message: Patrick Andries: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

Previous message: Doug Ewell: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
In reply to: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Next in thread: Patrick Andries: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

Tom Emerson <tree at basistech dot com> wrote:

> Indeed: I wrote a detector for Arabic encodings that did exactly this,
> in that it could differentiate between ISO-8859-6, Windows CP1256,
> Unicode transformation formats, and ASMO-449. In this particular
> application it was known a priori that the text was Arabic, just not
> the encoding.

I wrote a similar program for Cyrillic encodings some years ago, which
could distinguish between CP855, CP866, CP1251, KOI8-R, ISO 8859-5, Mac
Cyrillic (thrown in more as an exercise than anything else), and UTF-8.
Other than the usual UTF-8 heuristic, however, my approach was simpler
and more brute-force than Tom's: I interpreted each byte according to
each of the six single-byte encodings and calculated the frequency of
valid combinations of upper- and lowercase letters (excluding reversed
titlecase) in each encoding.
Perhaps surprisingly, this approach was fairly accurate, as long as it
was known a priori that the text was Cyrillic (not necessarily Russian).
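
The brute-force scoring described above might look roughly like the
following Python sketch. It is not the original program: the function
names, the scoring formula, and the reading of "reversed titlecase" as
a lowercase letter followed directly by an uppercase one are all
assumptions made for illustration. The codec names are those of
Python's standard library.

```python
# Hypothetical sketch of brute-force Cyrillic encoding detection.
# All names and the exact scoring rule are illustrative assumptions.

CANDIDATES = ["cp855", "cp866", "cp1251", "koi8_r", "iso8859_5", "mac_cyrillic"]


def is_cyrillic_letter(ch):
    """True for letters in the basic Cyrillic block (U+0400..U+04FF)."""
    return "\u0400" <= ch <= "\u04ff" and ch.isalpha()


def pair_is_valid(a, b):
    """Two adjacent Cyrillic letters, excluding lowercase followed by uppercase."""
    if not (is_cyrillic_letter(a) and is_cyrillic_letter(b)):
        return False
    return not (a.islower() and b.isupper())


def score(data, encoding):
    """Fraction of adjacent non-ASCII byte pairs that decode to a valid pair."""
    # Single-byte codecs map each byte to exactly one character, so the
    # decoded text and the byte string can share indices.
    text = data.decode(encoding, errors="replace")
    total = valid = 0
    for i in range(len(data) - 1):
        if data[i] >= 0x80 and data[i + 1] >= 0x80:
            total += 1
            if pair_is_valid(text[i], text[i + 1]):
                valid += 1
    return valid / total if total else 0.0


def detect(data):
    """Guess the encoding of bytes assumed a priori to hold Cyrillic text."""
    # UTF-8 heuristic: legacy 8-bit Cyrillic text almost never happens
    # to form well-formed UTF-8 sequences.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    return max(CANDIDATES, key=lambda enc: score(data, enc))


if __name__ == "__main__":
    sample = "Привет, Москва! Сегодня хорошая погода."
    for enc in ("cp1251", "koi8_r", "utf-8"):
        print(enc, "->", detect(sample.encode(enc)))
```

Short inputs, or text with no capitalized words, give the scorer little
to work with, so ties between candidates are possible in this sketch.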

--
Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

