Autodetection of CP437 vs. Latin-1

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Feb 10 2007 - 02:55:25 CST

    I'm looking for tips on automatically detecting text data in MS-DOS
    CP437 (or 850, etc.) versus Latin-1 or Windows CP1252. The solution
    doesn't have to be perfect, just pretty good.

    One problem is detecting text with the MS-DOS box-drawing characters,
    many of which occupy the same code points as Latin-1 accented letters.
    This means that simple range-checking often doesn't work.
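
    One possible way around plain range checks, offered only as a rough
    sketch: decode the bytes under each candidate encoding and score how
    plausible the result looks in context. In CP437 the box-drawing
    characters sit at 0xB3-0xDA, overlapping the Latin-1 letters that
    start at 0xC0, so the sketch looks at neighbors instead: an accented
    letter counts when it touches an ASCII letter, and a box-drawing
    character counts when it sits in a run or away from letters. The
    function names and scoring weights are hypothetical, Latin-1 stands
    in for CP1252, and real data would need tuning.

```python
import unicodedata

def _is_box(ch):
    # Unicode Box Drawing (U+2500-257F) and Block Elements (U+2580-259F).
    return "\u2500" <= ch <= "\u259f"

def _is_ascii_letter(ch):
    return ch.isascii() and ch.isalpha()

def _is_latin_letter(ch):
    # Excludes the Greek letters that CP437 keeps at 0xE0-0xEF.
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")

def plausibility(text):
    """Score decoded text; higher means the decoding looks more natural."""
    score = 0
    for i, ch in enumerate(text):
        if ch.isascii():
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        near_ascii_letter = _is_ascii_letter(prev) or _is_ascii_letter(nxt)
        if _is_latin_letter(ch):
            # Accented letters normally appear inside words.
            if near_ascii_letter:
                score += 1
        elif _is_box(ch):
            # Box characters normally appear in runs or next to spaces.
            if _is_box(prev) or _is_box(nxt) or not near_ascii_letter:
                score += 1
        elif unicodedata.category(ch) == "Cc":
            # C1 controls (what Latin-1 yields for 0x80-0x9F) are suspicious.
            score -= 1
    return score

def guess_encoding(data, candidates=("cp437", "latin-1")):
    """Return the candidate whose decoding scores highest (ties: first)."""
    scores = {enc: plausibility(data.decode(enc)) for enc in candidates}
    return max(candidates, key=lambda enc: scores[enc])
```

    For example, guess_encoding(b"\xc9\xcd\xcd\xbb") picks "cp437" (a
    box-drawing run), while guess_encoding(b"\xc4rger") picks "latin-1"
    because the high byte sits inside a word. Ties, including pure ASCII
    input, fall to the first candidate listed.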

    Please send replies off-list unless you feel they would interest the
    list. Please don't tell me this is anachronistic; I know it is. I'm
    trying to migrate a lot of that anachronistic data to UTF-8, as
    automatically as possible.

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

