From: Patrick Andries (patrick.andries@xcential.com)
Date: Thu Aug 11 2005 - 09:56:14 CDT
Tom Emerson wrote:
>Indeed: I wrote a detector for Arabic encodings that did exactly this,
>in that it could differentiate between ISO-8859-6, Windows CP1256,
>Unicode transformation formats, and ASMO-449. In this particular
>application it was known a priori that the text was Arabic, just not
>the encoding.
>
About 8 years ago we had a tool that did this for many languages and
character sets.
You could present a page in an unknown language and character set and it
would guess both for you.
The trick is simply to train a hidden Markov model on a large corpus
of content tagged for both variables (language and character set).
Incidentally, this probabilistic model, given enough documents, will
automatically detect the most common sequences of n consecutive bytes
(n = 2, 3, 4 as you wish) for a given pair <language, character set> as
one of its results (and one should thus find "die, der, das" having a
high probability for <de, latin-1> for instance, but "est, les, lui" for
<fr,latin-1>). Detecting the language and encoding is then "simply" a
matter of calculating the [compound] relative probability of a given
passage under each pair <language, character set> and choosing the pair
with the highest probability.
Used for <http://alis.isoc.org/palmares.html>
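
To make the scoring step concrete, here is a minimal sketch in Python. It is
not the tool described above and not a full hidden Markov model: it is a
simplified byte n-gram frequency scorer with add-one smoothing, which only
illustrates the idea of picking the <language, character set> pair with the
highest probability. The function names (ngrams, train, score, detect) and
the tiny training corpus are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def ngrams(data: bytes, n: int):
    """Return every run of n consecutive bytes in data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train(corpus, n=3):
    """Count byte n-grams per (language, charset) pair.

    corpus is an iterable of (language, charset, text) tuples; each text
    is encoded with its charset before counting, so the model sees bytes.
    """
    models = {}
    for lang, charset, text in corpus:
        data = text.encode(charset)
        models.setdefault((lang, charset), Counter()).update(ngrams(data, n))
    return models


def score(model, data: bytes, n=3):
    """Sum of log-probabilities of the passage's n-grams under one model.

    Add-one smoothing (an assumption of this sketch) keeps unseen n-grams
    from driving the probability to zero.
    """
    total = sum(model.values())
    buckets = len(model) + 1  # distinct n-grams seen, plus one for "unseen"
    return sum(math.log((model[g] + 1) / (total + buckets))
               for g in ngrams(data, n))


def detect(models, data: bytes, n=3):
    """Return the (language, charset) pair whose model scores data highest."""
    return max(models, key=lambda pair: score(models[pair], data, n))


# Toy training data, made up for this example; a real corpus would be far larger.
toy_corpus = [
    ("de", "latin-1", "die der das und ist nicht die der das ein eine"),
    ("fr", "latin-1", "est les lui des une dans est les lui pour que"),
]
models = train(toy_corpus)
guess = detect(models, "der die das".encode("latin-1"))
```

With a realistic corpus one would train one model per pair across many
encodings of the same language, so that the byte patterns themselves
separate, say, ISO-8859-6 from CP1256.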
P. A.