Autodetection of CP437 vs. Latin-1

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Feb 10 2007 - 02:55:25 CST

Next message: Frank Ellermann: "Re: Autodetection of CP437 vs. Latin-1"

Previous message: Doug Ewell: "Current GB18030 mapping table?"
Next in thread: Frank Ellermann: "Re: Autodetection of CP437 vs. Latin-1"
Reply: Frank Ellermann: "Re: Autodetection of CP437 vs. Latin-1"
Reply: Addison Phillips: "Re: Autodetection of CP437 vs. Latin-1"

I'm looking for tips on automatically distinguishing text data in MS-DOS
CP437 (or 850, etc.) from Latin-1 or Windows CP1252. The solution doesn't
have to be perfect, just pretty good.

One problem is detecting text with the MS-DOS box-drawing characters,
many of which occupy the same code points as Latin-1 accented letters.
This means that simple range-checking often doesn't work.
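
To make the problem concrete, here is a minimal sketch of one context-based
heuristic, not a known or established solution. It decodes the bytes under
each candidate code page and scores how plausible the result looks. The
function names (guess_encoding, score) and the scoring weights are
assumptions made for illustration: an accented Latin letter is rewarded only
when it touches an ASCII letter, a box-drawing character only when it touches
whitespace or another box-drawing character, and a Greek letter inside a
Latin word is penalized.

```python
import unicodedata

# Candidate code pages; on a tie (e.g. pure ASCII) the first one wins.
CANDIDATES = ("cp1252", "cp437")

def _is_box(ch):
    # Unicode box-drawing and block-element ranges.
    return "\u2500" <= ch <= "\u259f"

def _is_ascii_letter(ch):
    return ch < "\x80" and ch.isalpha()

def _is_latin_letter(ch):
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")

def score(text):
    """Return a plausibility score for decoded text (higher is better)."""
    total = 0
    for i, ch in enumerate(text):
        if ch < "\x80":
            continue  # ASCII is identical in all candidates
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        neighbors = (prev, nxt)
        if ch == "\ufffd":
            total -= 2  # byte undefined in this code page
        elif _is_box(ch):
            ok = any(_is_box(n) or n.isspace() for n in neighbors)
            total += 1 if ok else -1
        elif _is_latin_letter(ch):
            # Accented letters normally sit inside words; a run such as
            # "ÄÄÄÄ" is more likely a misread horizontal line.
            ok = any(_is_ascii_letter(n) for n in neighbors)
            total += 1 if ok else -1
        elif ch.isalpha():
            # Non-Latin letter (e.g. CP437 Greek) glued to a Latin word.
            if any(_is_ascii_letter(n) for n in neighbors):
                total -= 1
    return total

def guess_encoding(data):
    """Pick the candidate whose decoding scores highest."""
    return max(CANDIDATES,
               key=lambda enc: score(data.decode(enc, errors="replace")))
```

Once a guess is made, the data can be converted with something like
data.decode(guess_encoding(data), errors="replace").encode("utf-8").
CP850 and other code pages would need their own entries in CANDIDATES, and
the weights would have to be tuned against real files.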

Please send replies off-list unless you feel they would interest the
list. Please don't tell me this is anachronistic; I know it is. I'm
trying to migrate a lot of that anachronistic data to UTF-8, as
automatically as possible.

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

This archive was generated by hypermail 2.1.5 : Sat Feb 10 2007 - 02:58:28 CST