From: Tom Emerson (Tree@basistech.com)
Date: Mon Jan 12 2004 - 11:57:47 EST
Perhaps a meta question is this: how often are you going to encounter
unBOMed UTF-32 or UTF-16 text? It's pretty rare --- certainly I've never
seen it during the development of our language/encoding identifier.
Sure, it's an interesting thought problem, but it doesn't happen.
And fortunately detecting UTF-8 is relatively easy.
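A minimal sketch of why the UTF-8 check is easy: the encoding's lead-byte and continuation-byte rules are strict enough that text in a legacy 8-bit or double-byte encoding almost never passes a strict decode by accident. The function name `is_valid_utf8` is made up for this illustration, and treating pure ASCII as "no evidence either way" is an assumption of the sketch, not something stated above.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as strict UTF-8 AND contains
    at least one non-ASCII byte. Pure ASCII is valid UTF-8 too, but it
    is equally valid in most legacy encodings, so it proves nothing."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in data)
```

For example, `"naïve".encode("utf-8")` passes, while the same word encoded as ISO 8859-1 fails because the byte 0xEF announces a three-byte sequence that never arrives.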
The real problem is differentiating among the members of the ISO 8859-x
family, and between EUC-CN and EUC-KR. These are wonderfully ambiguous.
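To make the ambiguity concrete, here is a small sketch using Python's built-in codecs (`gb2312` for EUC-CN, `euc_kr`, `iso8859_1`, `iso8859_5`). The sample strings are chosen for illustration only. The same bytes decode without error under both candidates, so byte-level validity cannot tell them apart.

```python
# Two Chinese characters encoded as EUC-CN (GB 2312): bytes D6 D0 CE C4.
sample = "中文".encode("gb2312")

as_chinese = sample.decode("gb2312")   # the intended reading
as_korean = sample.decode("euc_kr")    # also decodes cleanly, as two hanja

# A single high byte is a valid letter in every ISO 8859-x part that
# assigns that position; only the meaning changes.
latin = b"\xe9"
as_latin1 = latin.decode("iso8859_1")    # Latin small e with acute
as_cyrillic = latin.decode("iso8859_5")  # a Cyrillic letter instead
```

Neither decode raises an error, which is why the decision has to come from statistics over the text rather than from structural checks.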
The key to doing this right is having _a_lot_ of valid training data.
You also have to deal with oddities of language: I tried one open
source implementation of the Cavnar and Trenkle algorithm THAT CLAIMED
THAT SHOUTED ENGLISH WAS ACTUALLY CZECH.
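A simplified sketch of the rank-order n-gram comparison that Cavnar and Trenkle describe, to show one way all-caps input can go wrong. This is not the implementation mentioned above, and the message does not say why that one failed. Missing case folding is only an assumed cause here. The function names, the n-gram length limit of 3, and the sample sentence are all made up for the illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Map each of the most frequent character n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for token in text.split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language
    profile get the maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

# Toy "training data": one lowercase English sentence.
english = ngram_profile("the quick brown fox jumps over the lazy dog")
shouted = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"

raw_distance = out_of_place(ngram_profile(shouted), english)
folded_distance = out_of_place(ngram_profile(shouted.lower()), english)
```

Without case folding, almost no n-gram of the shouted text appears in the lowercase profile, so `raw_distance` is large and whichever language profile happens to be least bad wins. With folding, `folded_distance` drops to zero for this toy input.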
It's difficult to separate the language detection from the encoding
detection when dealing with non-Unicode text.
-tree
--
Tom Emerson                                    Basis Technology Corp.
Software Architect                           http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"