Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Andy Heninger (andy.heninger@gmail.com)
Date: Thu Aug 11 2005 - 13:04:12 CDT

Next message: Jony Rosenne: "RE: Charset determination (was: Cp1256 (Windows Arabic) Characters not supported by UTF8)"

Previous message: Patrick Andries: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
In reply to: Patrick Andries: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Next in thread: Jony Rosenne: "RE: Charset determination (was: Cp1256 (Windows Arabic) Characters not supported by UTF8)"

This is more about language detection than charset detection, but there is
an interesting paper from IBM Research here:

ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf

Linguini: Language Identification for Multilingual Documents
John M. Prager

Some of the ideas from this ended up in the charset detection that was just
added to the Java version of the ICU library.

We are just now starting to look at doing a C version of that charset
detection. If anyone would like to weigh in with opinions on how the API
should look, the icu-design mailing list is the place to do it:

http://lists.sourceforge.net/lists/listinfo/icu-design

-- Andy Heninger

On 8/11/05, Patrick Andries <patrick.andries@xcential.com> wrote:
>
>
> You could present a page in an unknown language and character set and it
> would guess both for you.
>
> The trick is simply to train a Hidden Markov Model (modèle markovien
> caché) on a large corpus of content tagged for both variables.
> Incidentally, given enough documents, this probabilistic model will
> automatically detect the most common sequences of n consecutive bytes
> (n = 2, 3, 4 as you wish) for a given pair <language, character set> as
> one of its results (and one should thus find "die, der, das" having a
> high probability for <de, latin-1>, for instance, but "est, les, lui"
> for <fr, latin-1>). Detecting the language and encoding is then "simply"
> a matter of calculating the [compound] relative probability of a given
> passage for each pair <language, character set> and choosing the pair
> with the highest probability.
>
> Used for <http://alis.isoc.org/palmares.html>
>
> P. A.
>
>
>
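
The scoring step described in the quoted message can be sketched in a few
lines of Python. This is only a rough illustration under stated assumptions,
not the model behind the palmares page or the ICU detector: it uses plain
byte-trigram counts with add-one smoothing instead of a real hidden Markov
model, the training strings in SAMPLES are tiny made-up examples rather than
a tagged corpus, and the names ngrams, train, score and guess are
hypothetical.

```python
import math
from collections import Counter

N = 3  # n-gram length in bytes (the quoted message suggests n = 2, 3 or 4)

def ngrams(data: bytes, n: int = N):
    """Yield every run of n consecutive bytes in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]

def train(samples):
    """Build one n-gram count table per (language, charset) pair.

    samples maps (language, charset) -> training text. Each text is
    encoded in its charset so that the model only ever sees raw bytes.
    """
    models = {}
    for (lang, charset), text in samples.items():
        counts = Counter(ngrams(text.encode(charset)))
        models[(lang, charset)] = (counts, sum(counts.values()))
    return models

def score(data: bytes, model):
    """Log-probability of data under one model, with add-one smoothing."""
    counts, total = model
    vocab = len(counts) + 1  # one extra bucket for unseen n-grams
    return sum(math.log((counts[g] + 1) / (total + vocab))
               for g in ngrams(data))

def guess(data: bytes, models):
    """Return the (language, charset) pair with the highest score."""
    return max(models, key=lambda pair: score(data, models[pair]))

# Toy training data (an assumption for illustration only).
GERMAN = "die katze und der hund und das haus sind für die kinder der stadt"
FRENCH = "les enfants et lui, il est là, elle est à côté des élèves"
SAMPLES = {
    ("de", "latin-1"): GERMAN,
    ("fr", "latin-1"): FRENCH,
    ("fr", "utf-8"): FRENCH,
}
MODELS = train(SAMPLES)

print(guess("der hund und die katze".encode("latin-1"), MODELS))
# -> ('de', 'latin-1')
print(guess("à côté des élèves".encode("utf-8"), MODELS))
# -> ('fr', 'utf-8')
```

Working on bytes rather than decoded characters is what lets one score
separate <fr, latin-1> from <fr, utf-8>: the same accented letters produce
different byte sequences in each encoding. With samples this small the
result is fragile; a usable model would need far more training text per
pair.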


This archive was generated by hypermail 2.1.5 : Thu Aug 11 2005 - 13:05:52 CDT