Re: Detecting encoding in Plain text

From: D. Starner (shalesller@writeme.com)
Date: Thu Jan 08 2004 - 09:20:45 EST

Next message: Patrick Andries: "Re: Detecting encoding in Plain text"

Previous message: jon@hackcraft.net: "Re: Detecting encoding in Plain text"
Maybe in reply to: Brijesh Sharma: "Detecting encoding in Plain text"
Next in thread: Tex Texin: "Re: Detecting encoding in Plain text"
Reply: Tex Texin: "Re: Detecting encoding in Plain text"

> Given any sizeable chunk of text, it ought to be possible to estimate
> the statistical likelihood of its being in a certain
> encoding/[language] even if it's in an unspecified 8859-* encoding.
> It would be quite an interesting exercise, but I'd be surprised if
> someone hasn't done it before. Perhaps someone here knows.

http://www.let.rug.nl/~vannoord/TextCat/ has a paper on the subject
and an implementation in Perl. http://mnogosearch.org has an alternative
implementation in compiled code (called mguesser).
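
For anyone who wants a feel for the approach before reading the paper,
here is a rough sketch of rank-ordered n-gram profile matching, the
general technique I understand TextCat to be built on. It is not the
TextCat or mguesser code. The function names (ngram_profile,
out_of_place, guess), the choice to work on raw bytes, the cutoff of
300 n-grams, and the tiny training sample are all assumptions of mine
for illustration.

```python
from collections import Counter

def ngram_profile(data: bytes, max_n: int = 3, top: int = 300) -> dict:
    """Map the most frequent byte n-grams (n = 1..max_n) to their rank."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Most frequent first; ties broken by the n-gram bytes so the
    # ordering is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}

def out_of_place(doc: dict, ref: dict) -> int:
    """Sum of rank differences; n-grams missing from ref cost len(ref)."""
    penalty = len(ref)
    return sum(abs(rank - ref[gram]) if gram in ref else penalty
               for gram, rank in doc.items())

def guess(data: bytes, profiles: dict) -> str:
    """Return the label whose reference profile is closest to data."""
    doc = ngram_profile(data)
    return min(profiles, key=lambda label: out_of_place(doc, profiles[label]))

# Hypothetical training data: one short Russian sentence, encoded two ways.
# A usable detector would need far more text per encoding/language pair.
sample = "привет мир, это простой пример текста"
profiles = {enc: ngram_profile(sample.encode(enc))
            for enc in ("koi8-r", "iso8859-5")}

print(guess(sample.encode("koi8-r"), profiles))
```

Working on bytes rather than decoded characters means one profile per
encoding/language pair, which is what lets the same comparison pick
out the encoding as well as the language.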


This archive was generated by hypermail 2.1.5 : Thu Jan 08 2004 - 10:04:13 EST