Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Aug 11 2005 - 09:24:06 CDT

    Tom Emerson <tree at basistech dot com> wrote:

    > Indeed: I wrote a detector for Arabic encodings that did exactly this,
    > in that it could differentiate between ISO-8859-6, Windows CP1256,
    > Unicode transformation formats, and ASMO-449. In this particular
    > application it was known a priori that the text was Arabic, just not
    > the encoding.

    I wrote a similar program for Cyrillic encodings some years ago, which
    could distinguish between CP855, CP866, CP1251, KOI8-R, ISO 8859-5, Mac
    Cyrillic (thrown in more as an exercise than anything else), and UTF-8.
    Other than the usual UTF-8 heuristic, however, my approach was simpler
    and more brute-force than Tom's: I interpreted each byte under all six
    single-byte encodings and calculated the frequency of valid combinations
    of upper- and lower-case letters (excluding reversed titlecase) in each
    encoding.
    Perhaps surprisingly, this approach was fairly accurate, as long as it
    was known a priori that the text was Cyrillic (not necessarily Russian).
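
    A minimal Python sketch of the brute-force heuristic described above,
    not the original program. It assumes that "reversed titlecase" means a
    lower-case letter immediately followed by an upper-case one, and that
    the score is the fraction of adjacent non-ASCII character pairs that
    are both Cyrillic letters in a valid case order. The function names
    and the scoring formula are illustrative assumptions; the codec names
    are the ones shipped with Python's standard library.

```python
# Candidate single-byte encodings, using Python's standard codec names.
CANDIDATES = ["cp855", "cp866", "cp1251", "koi8_r", "iso8859_5", "mac_cyrillic"]


def is_cyrillic_letter(ch):
    """True if ch is a letter in the basic Cyrillic block (U+0400..U+04FF)."""
    return "\u0400" <= ch <= "\u04ff" and ch.isalpha()


def case_pair_score(text):
    """Fraction of adjacent non-ASCII pairs that look like plausible Cyrillic.

    A pair counts as valid when both characters are Cyrillic letters and the
    pair is not lower-case followed by upper-case (assumed meaning of
    "reversed titlecase").
    """
    total = 0
    valid = 0
    for a, b in zip(text, text[1:]):
        if ord(a) < 0x80 or ord(b) < 0x80:
            continue  # only pairs of high-byte characters carry information
        total += 1
        if (is_cyrillic_letter(a) and is_cyrillic_letter(b)
                and not (a.islower() and b.isupper())):
            valid += 1
    return valid / total if total else 0.0


def detect_cyrillic_encoding(data):
    """Guess the encoding of bytes already known to hold Cyrillic text."""
    # The usual UTF-8 heuristic: a strict decode rarely succeeds by accident.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Brute force: interpret the bytes under every candidate and score each.
    scores = {
        enc: case_pair_score(data.decode(enc, errors="replace"))
        for enc in CANDIDATES
    }
    return max(scores, key=scores.get)
```

    Calling detect_cyrillic_encoding(raw_bytes) returns one of the codec
    names above or "utf-8". The case-order rule matters because some of
    these encodings differ mainly in where the cases sit: KOI8-R lower-case
    bytes read as CP1251 come out as upper-case letters, so text without
    any capitalized words could tie between the two.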

    --
    Doug Ewell
    Fullerton, California
    http://users.adelphia.net/~dewell/
    

