Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Patrick Andries (patrick.andries@xcential.com)
Date: Thu Aug 11 2005 - 09:56:14 CDT

Next message: Andy Heninger: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

Previous message: Doug Ewell: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
In reply to: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Next in thread: Andy Heninger: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Andy Heninger: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

Tom Emerson wrote:

>Indeed: I wrote a detector for Arabic encodings that did exactly this,
>in that it could differentiate between ISO-8859-6, Windows CP1256,
>Unicode transformation formats, and ASMO-449. In this particular
>application it was known a priori that the text was Arabic, just not
>the encoding.
>
About 8 years ago we had a tool that did this for many languages and
character sets.

You could present a page in an unknown language and character set and it
would guess both for you.

The trick is simply to train a hidden Markov model on a large corpus of
content tagged for both variables (language and character set).
Incidentally, given enough documents, this probabilistic model will
automatically detect the most common sequences of n consecutive bytes
(n = 2, 3, 4, as you wish) for a given pair <language, character set> as
one of its results. One should thus find "die, der, das" having a high
probability for <de, latin-1>, for instance, but "est, les, lui" for
<fr, latin-1>. Detecting the language and encoding is then "simply" a
matter of calculating the [compound] relative probability of a given
passage under each pair <language, character set> and choosing the pair
with the highest probability.
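
To make the scoring step concrete, here is a minimal sketch in Python. It
is not the tool described above and it is not a full hidden Markov model:
it assumes a much simpler byte-trigram model with add-one smoothing, one
model per pair <language, character set>. The toy corpus and the names
train, log_score and detect are all hypothetical, chosen only for
illustration.

```python
import math
from collections import Counter, defaultdict

N = 3  # n-gram length in bytes (2, 3 or 4 would all be reasonable)

def ngrams(data: bytes, n: int = N):
    """Return every run of n consecutive bytes in data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def train(samples):
    """Count byte n-grams per (language, charset) pair.

    samples is an iterable of (language, charset, text) tuples; each text
    is encoded in its charset so that the model sees raw bytes.
    """
    counts = defaultdict(Counter)
    for lang, charset, text in samples:
        counts[(lang, charset)].update(ngrams(text.encode(charset)))
    return counts

def log_score(data: bytes, counter: Counter, vocab_size: int) -> float:
    """Sum of add-one-smoothed log probabilities of the n-grams in data."""
    total = sum(counter.values())
    return sum(
        math.log((counter[g] + 1) / (total + vocab_size))
        for g in ngrams(data)
    )

def detect(data: bytes, counts):
    """Return the (language, charset) pair whose model scores data highest."""
    vocab = {g for counter in counts.values() for g in counter}
    vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams
    return max(counts, key=lambda pair: log_score(data, counts[pair], vocab_size))

# Toy training corpus (hypothetical; a real one would be far larger).
GERMAN = "die Katze und der Hund und das Haus, die Frau und der Mann"
FRENCH = "le chat est là, les chiens sont ici, elle est avec lui"

counts = train([
    ("de", "latin-1", GERMAN),
    ("fr", "latin-1", FRENCH),
    ("fr", "utf-8", FRENCH),
])

print(detect("der Hund und die Katze".encode("latin-1"), counts))
print(detect("elle est là avec lui".encode("latin-1"), counts))
print(detect("elle est là avec lui".encode("utf-8"), counts))
```

With a corpus this small, the two French models differ only in the bytes
used for "à", so a passage without accented letters cannot really be
assigned to one encoding or the other. A realistic corpus would contain
far more distinguishing byte sequences.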

It was used for <http://alis.isoc.org/palmares.html>

P. A.


This archive was generated by hypermail 2.1.5 : Thu Aug 11 2005 - 09:57:25 CDT