RE: Detecting encoding in Plain text

From: Tom Emerson (Tree@basistech.com)
Date: Mon Jan 12 2004 - 11:57:47 EST

Next message: Markus Scherer: "Re: Confusion about composition"

Previous message: Doug Ewell: "Re: Detecting encoding in Plain text"
Maybe in reply to: Brijesh Sharma: "Detecting encoding in Plain text"
Next in thread: Curtis Clark: "Re: Detecting encoding in Plain text"
Reply: Curtis Clark: "Re: Detecting encoding in Plain text"

Perhaps a meta question is this: how often are you going to encounter
unBOMed UTF-32 or UTF-16 text? It's pretty rare --- certainly I've never
seen it during the development of our language/encoding identifier.
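
For the BOMed case, a rough sketch of what sniffing might look like in Python. The name sniff_bom and the returned labels are made up for illustration; only the BOM constants come from the standard codecs module. The UTF-32 marks are checked first because the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.

```python
import codecs

# Longest BOMs first: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None
```

A None result says nothing either way; it only means the text is unBOMed and some other heuristic has to take over.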

Sure, it's an interesting thought problem, but it doesn't happen.
And fortunately detecting UTF-8 is relatively easy.
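
One simple way to see why UTF-8 is the easy case: its byte structure is strict enough that text in a legacy 8-bit encoding almost never decodes cleanly. A minimal sketch, assuming a strict decode is an acceptable test (looks_like_utf8 is a hypothetical helper name, not part of any library):

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Note that pure ASCII passes too, so a True result on 7-bit data does not distinguish UTF-8 from any other ASCII-compatible encoding.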

The real problem is differentiating among the members of the ISO 8859-x
family, and between EUC-CN and EUC-KR. These are wonderfully ambiguous.
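
To illustrate the ambiguity, here is a small sketch that tries several candidate codecs on the same bytes and keeps every one that decodes without error. The function name valid_decodings and the candidate list are assumptions for the example; the codec names are the ones Python's standard library uses (gb2312 for EUC-CN).

```python
def valid_decodings(data: bytes, candidates):
    """Map each candidate codec to its decoded text, skipping codecs that fail."""
    results = {}
    for codec in candidates:
        try:
            results[codec] = data.decode(codec, errors="strict")
        except UnicodeDecodeError:
            continue
    return results


# The same two bytes are structurally valid in all four encodings.
raw = b"\xb0\xa1"
readings = valid_decodings(raw, ["gb2312", "euc_kr", "iso8859-1", "iso8859-2"])
```

All four codecs accept the input, and each yields different text, so byte-level validity alone cannot pick a winner. That is where statistics over real text come in.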

The key to doing this right is having _a_lot_ of valid training data.
You also have to deal with oddities of language: I tried one open
source implementation of the Cavnar and Trenkle algorithm THAT CLAIMED
THAT SHOUTED ENGLISH WAS ACTUALLY CZECH.
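
For reference, a simplified sketch of the rank-order n-gram idea behind that algorithm. This is not the implementation I tried and not a faithful copy of the published method: the function names, the n-gram lengths (1 to 3), the profile size, and the fold_case switch are all assumptions chosen to keep the example short. The fold_case flag shows one plausible way the uppercase problem can arise: a profile trained on lowercase text shares almost no n-grams with an all-caps document unless case is folded first.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300, fold_case=True):
    """Return the top_k character n-grams of text, most frequent first."""
    if fold_case:
        text = text.casefold()
    counts = Counter()
    for word in text.split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending count, then alphabetical so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile cost its length."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    missing = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else missing
        for r, gram in enumerate(doc_profile)
    )


def classify(text, profiles, fold_case=True):
    """Return the label whose profile is closest to the text's profile."""
    doc = ngram_profile(text, fold_case=fold_case)
    return min(profiles, key=lambda label: out_of_place(doc, profiles[label]))
```

With toy data like a single training sentence per language this proves nothing about accuracy; it only shows the mechanics. In practice the quality of the profiles depends entirely on the training data.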

It's difficult to separate the language detection from the encoding
detection when dealing with non-Unicode text.

-tree

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"

This archive was generated by hypermail 2.1.5 : Mon Jan 12 2004 - 12:43:50 EST