Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 11 2005 - 03:06:14 CDT

Next message: Theo Veenker: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

Previous message: Doug Ewell: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
In reply to: Doug Ewell: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Next in thread: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Doug Ewell: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: "Unicode Mailing List" <unicode@unicode.org>
Sent: Thursday, August 11, 2005 7:18 AM
Subject: Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
>> To detect a language, you could also try searching for very common
>> terms like "the", "is", "are", "have", "and" in English, "le", "un",
>> "a", "à", "est", "et" in French, "der", "das", "ist" in German.
>
> This is not a bad heuristic in general, but I don't think I'd suggest
> using "a" as an indication that the text is in French. That word has a
> tendency to occur in English now and then.

I know, but it counts positively toward both French and English (probably
more in English than in French, where it is just a common conjugated form of
an essential auxiliary verb). The idea is not to count single words, but to
compute a summary statistic for each candidate language, using lists of
words rated by occurrence probability. Such a list of words will be much
larger than the few examples I gave, and will include other common words and
contractions.
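
As a rough illustration of the word-list scoring described above, here is a
minimal Python sketch. The probabilities in WORD_PROBS and the UNSEEN floor
are invented placeholders, not measured values; a real list would be trained
on running text and would be far larger. Summing log-probabilities is one
possible choice of summary statistic, not the only one.

```python
import math
import re

# Hypothetical probabilities, chosen by hand for illustration only.
WORD_PROBS = {
    "english": {"the": 0.06, "is": 0.01, "are": 0.005,
                "have": 0.004, "and": 0.03, "a": 0.02},
    "french": {"le": 0.02, "un": 0.01, "a": 0.005,
               "à": 0.01, "est": 0.01, "et": 0.02},
    "german": {"der": 0.03, "das": 0.01, "ist": 0.01},
}
UNSEEN = 1e-4  # assumed floor probability for words missing from a list


def tokenize(text):
    """Split text into lowercase words made of Unicode letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def score_languages(text):
    """Return one summed log-probability score per candidate language."""
    words = tokenize(text)
    return {
        lang: sum(math.log(probs.get(word, UNSEEN)) for word in words)
        for lang, probs in WORD_PROBS.items()
    }


def guess_language(text):
    """Pick the candidate language with the highest score."""
    scores = score_languages(text)
    return max(scores, key=scores.get)
```

Note that "a" contributes to both the English and the French score here, so
a single ambiguous word does not decide the outcome on its own.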

Another idea is to compute the rates of letter occurrences in all words,
as their distribution is often very specific to each language, if the parsed
text is long enough. Some letters are quite rare in a language but much more
frequent in another (for example 'k' and 'y' are much more frequent in
English than in French, and 'é' is very frequent in French but extremely
rare in English; 'e' and 'a' are among the most frequent letters in both
languages, while 't' ranks much higher in English and 's' in French, with
only slight differences for 'r' and 'n').
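
A letter-rate profile of this kind could be computed as in the sketch below.
The function names are made up for the example, and cosine similarity is
only one assumed way of comparing a text's profile with a per-language
reference profile; the reference profiles themselves would have to come
from training texts, which are not included here.

```python
import math
from collections import Counter


def letter_frequencies(text):
    """Relative frequency of each letter, ignoring case and non-letters.

    Assumes precomposed (NFC) input, so that 'é' is a single character.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: count / total for ch, count in Counter(letters).items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)
```

With one reference profile per language, the text would be attributed to
the language whose profile is the most similar to its own.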

You can do the same with digraphs/trigraphs or contextual occurrences of
letters (for example, English 'sh' versus French 'ch'; French 'ou' versus
English 'ay'; French final 'e' or 'es'...).

The average length of words is also an indicator (short words of six letters
or fewer are much more frequent in English than in French).
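
Both figures are cheap to compute. A small sketch follows, where the
six-letter threshold is taken from the remark above and the function name
is hypothetical:

```python
import re


def word_length_stats(text, short_max=6):
    """Return (mean word length, share of words with at most short_max
    letters). Both values are 0.0 for a text without any word."""
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0, 0.0
    mean_length = sum(len(word) for word in words) / len(words)
    short_share = sum(1 for word in words if len(word) <= short_max) / len(words)
    return mean_length, short_share
```

These two numbers are weak indicators on their own and would normally be
combined with the word and letter statistics.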

These ideas can be applied after training the statistics on various running
texts made of phrases (not on dictionaries or word lists, as the statistics
would be skewed by the lack of many word variants and conjugated verbs, and
by an overly flat distribution of words).

For the case of Arabic, the first indicator is indeed the alphabet, but
I think that there are similar usage patterns that help distinguish
Arabic from Urdu. Anyway, the various encodings used for the Arabic
script can be easily determined from letter occurrence statistics.
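
To make the last point concrete, here is a hedged sketch of encoding
detection by letter statistics: the bytes are decoded with each candidate
encoding, and the decoding whose letter profile is closest to a reference
profile wins. The candidate list, the reference text and the use of cosine
similarity are all assumptions of this example; the codec names are the
ones used by Python's standard library.

```python
import math
from collections import Counter


def letter_profile(text):
    """Relative frequency of each letter in the text."""
    letters = [ch for ch in text if ch.isalpha()]
    total = len(letters)
    return {ch: n / total for ch, n in Counter(letters).items()} if total else {}


def similarity(p, q):
    """Cosine similarity between two sparse letter profiles."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * \
        math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def guess_encoding(data, candidates, reference):
    """Return the candidate encoding whose decoded text best matches the
    reference profile, or None if no candidate can decode the bytes."""
    best, best_score = None, -1.0
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # byte sequence is invalid in this encoding
        score = similarity(letter_profile(text), reference)
        if score > best_score:
            best, best_score = encoding, score
    return best
```

In practice the reference profile would be trained on a large body of
Arabic text, and a wrong decoding would stand out because it puts the
weight on letters that are rare in the language.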


This archive was generated by hypermail 2.1.5 : Thu Aug 11 2005 - 03:10:12 CDT