From: Andy Heninger (andy.heninger@gmail.com)
Date: Thu Aug 11 2005 - 13:04:12 CDT
It's more about language detection than charset detection, but there is
an interesting paper from IBM Research here:
ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf
Linguini: Language Identification for Multilingual Documents
John M. Prager
Some of the ideas from this ended up in the charset detection that was just
added to the Java version of the ICU library.
We are just now starting to look at doing a C version of that charset
detection. If anyone would like to weigh in with opinions on how the API
should look, the icu-design mail list is the place to do it.
http://lists.sourceforge.net/lists/listinfo/icu-design
-- Andy Heninger
On 8/11/05, Patrick Andries <patrick.andries@xcential.com> wrote:
>
>
> You could present a page in an unknown language and character set and it
> would guess both for you.
>
> The trick is simply to train a Hidden Markov Model on a large corpus of
> content tagged for both variables (language and character set).
> Incidentally, this probabilistic model, given enough documents, will
> automatically detect the most common sequences of n consecutive bytes
> (n = 2, 3, 4 as you wish) for a given pair <language, character set> as
> one of its results (and one should thus find "die, der, das" having a
> high probability for <de, latin-1>, for instance, but "est, les, lui"
> for <fr, latin-1>). Detecting the language and encoding is then "simply"
> a matter of calculating the [compound] relative probability of a given
> passage under each pair <language, character set> and choosing the pair
> with the highest probability.
>
> Used for <http://alis.isoc.org/palmares.html>
>
> P. A.
>
>
>
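For illustration only, here is a rough Python sketch of the scoring idea
quoted above. It is not a Hidden Markov Model and it is not the ICU or
Linguini implementation: it is a plain byte-trigram frequency model with
add-one smoothing, which is the simplest thing that shows the "highest
probability per <language, character set> pair" step. The toy corpus, the
function names (byte_ngrams, train, log_score, detect) and the choice of
n = 3 are all assumptions made up for this example; a real detector would
need far more training text.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (language, charset) -> sample texts.
# Real training data would be orders of magnitude larger.
GERMAN = ["die Straße und der Bär und das Mädchen über die Brücke"]
FRENCH = ["les élèves et lui, il est allé à l'école près de la forêt"]
CORPUS = {
    ("de", "latin-1"): GERMAN,
    ("de", "utf-8"): GERMAN,
    ("fr", "latin-1"): FRENCH,
}


def byte_ngrams(data, n=3):
    """Yield every run of n consecutive bytes in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def train(corpus, n=3):
    """Count byte n-grams separately for each (language, charset) pair."""
    models = {}
    for (lang, charset), texts in corpus.items():
        counts = Counter()
        for text in texts:
            counts.update(byte_ngrams(text.encode(charset), n))
        models[(lang, charset)] = counts
    return models


def log_score(counts, data, n=3):
    """Sum of log-probabilities of the n-grams in data under one model.

    Add-one smoothing keeps unseen n-grams from giving log(0).
    """
    total = sum(counts.values())
    vocab = 256 ** n  # number of possible byte n-grams
    return sum(
        math.log((counts[gram] + 1) / (total + vocab))
        for gram in byte_ngrams(data, n)
    )


def detect(models, data, n=3):
    """Return the (language, charset) pair whose model scores data highest."""
    return max(models, key=lambda pair: log_score(models[pair], data, n))


models = train(CORPUS)
print(detect(models, "der Bär und die Straße".encode("latin-1")))
print(detect(models, "il est allé à l'école".encode("latin-1")))
```

Scores are summed as logarithms rather than multiplied as raw
probabilities so that long passages do not underflow to zero. Note that
the input to detect() is raw bytes, so the same German text encoded as
UTF-8 and as Latin-1 produces different trigrams around the non-ASCII
letters, which is what lets the charset be guessed along with the
language.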