From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 11 2005 - 03:06:14 CDT
----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: "Unicode Mailing List" <unicode@unicode.org>
Sent: Thursday, August 11, 2005 7:18 AM
Subject: Re: Cp1256 (Windows Arabic) Characters not supported by UTF8
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
>> To detect a language, you could also try searching for very common
>> terms like "the", "is", "are", "have", "and" in English, "le", "un",
>> "a", "à", "est", "et" in French, "der", "das", "ist" in German.
>
> This is not a bad heuristic in general, but I don't think I'd suggest
> using "a" as an indication that the text is in French. That word has a
> tendency to occur in English now and then.
I know, but it counts positively toward both French and English (probably
more in English than in French, where it is just a common conjugated form of
an essential auxiliary verb). The idea is not to count single words, but to
compute a summary statistic for each candidate language, using lists of
words rated by probability of occurrence. Such a list will be much larger
than the few examples I gave, and will include other common words and
contractions.
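
A minimal sketch of that scoring in Python. The word rates below are
made-up placeholders (not measured values), the FLOOR constant and the
function names are assumptions of this sketch, and summing log-probabilities
is just one possible choice of summary statistic:

```python
import math
import re

# Hypothetical word rates, invented for illustration only. Real lists
# would be far larger and trained on actual text.
WORD_RATES = {
    "en": {"the": 0.06, "is": 0.01, "are": 0.005, "have": 0.004,
           "and": 0.03, "a": 0.02},
    "fr": {"le": 0.02, "un": 0.01, "a": 0.005, "à": 0.01,
           "est": 0.01, "et": 0.02},
    "de": {"der": 0.03, "das": 0.01, "ist": 0.01},
}
FLOOR = 1e-4  # assumed rate for any word missing from a language's list


def score_languages(text):
    """Return one log-likelihood-style score per candidate language."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters
    return {
        lang: sum(math.log(rates.get(w, FLOOR)) for w in words)
        for lang, rates in WORD_RATES.items()
    }


def guess_language(text):
    """Pick the candidate language with the highest score."""
    scores = score_languages(text)
    return max(scores, key=scores.get)
```

Note that "a" appears in both the English and the French table: it adds to
both scores, and the other words in the text decide the outcome.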
Another idea is to compute the rates of letter occurrences over all words,
as their distribution is often very specific to a language, provided the
parsed text is long enough. Some letters are quite rare in one language but
much more frequent in another (for example 'k' and 'y' are much more
frequent in English than in French, and 'é' is very frequent in French but
extremely rare in English; 'e' is the most frequent letter in both
languages, and 'a' and 's' are also among the most common in both, with a
slight difference for 'r' and 'n').
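
As a rough illustration, letter rates can be compared against per-language
profiles. In this sketch the two sample sentences are invented stand-ins for
real training text, and the sum of absolute differences is an assumed
distance measure, not the only reasonable one:

```python
from collections import Counter


def letter_rates(text):
    """Relative frequency of each letter (case-folded) in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    counts = Counter(letters)
    return {c: n / total for c, n in counts.items()} if total else {}


def distance(p, q):
    """Sum of absolute differences between two letter-rate tables."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Tiny invented samples; real profiles need much longer texts.
EN_SAMPLE = "they know that the sky is grey and they keep asking why every day"
FR_SAMPLE = ("il a été décidé que les élèves étaient prêts "
             "pour la rentrée de cette année")

PROFILES = {"en": letter_rates(EN_SAMPLE), "fr": letter_rates(FR_SAMPLE)}


def closest_language(text):
    """Return the language whose profile is nearest to the text's rates."""
    rates = letter_rates(text)
    return min(PROFILES, key=lambda lang: distance(rates, PROFILES[lang]))
```

With samples this short the profiles are only suggestive; the point is the
shape of the computation, not the numbers.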
You can do the same with digraphs/trigraphs or contextual occurrences of
letters (for example, English 'sh' versus French 'ch'; French 'ou' versus
English 'ay'; French final 'e' or 'es'...).
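
One possible way to count digraphs and trigraphs, including word-initial
and word-final context, is sketched below. The "^" and "$" boundary markers
and the function name are conventions assumed here for illustration:

```python
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count letter n-grams per word, with ^ and $ marking word edges."""
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"^{word}$"  # so that 'e$' means a final 'e'
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


digraphs = ngram_counts("les chats cherchent", n=2)
# digraphs["ch"] counts 'ch' anywhere; digraphs["s$"] counts final 's'
```

The resulting counters can be normalized and compared across languages in
the same way as single-letter rates.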
The average length of words is also an indicator (short words of 6 letters
or fewer are much more frequent in English than in French).
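
Both measures are cheap to compute. A small sketch, where the function
names are hypothetical and the 6-letter threshold is the one mentioned
above:

```python
import re


def words_of(text):
    """Split text into runs of letters."""
    return re.findall(r"[^\W\d_]+", text)


def mean_word_length(text):
    """Average number of letters per word (0.0 for empty input)."""
    words = words_of(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


def short_word_share(text, max_len=6):
    """Fraction of words with at most max_len letters."""
    words = words_of(text)
    if not words:
        return 0.0
    return sum(len(w) <= max_len for w in words) / len(words)
```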
These ideas can be used after training the statistics on various texts made
of real sentences (not on dictionaries or word lists, as those statistics
would be skewed by the absence of many word variants and conjugated verbs,
and by an overly flat distribution of words).
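
Training on running text could look like the sketch below. The three-line
corpus is an invented toy example and train_word_rates is a hypothetical
name; a usable table needs far more text:

```python
import re
from collections import Counter


def train_word_rates(sentences):
    """Estimate word rates from running text, inflected forms included."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[^\W\d_]+", sentence.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


corpus = ["The cat is here.", "The dogs are here.", "She has the keys."]
rates = train_word_rates(corpus)
# "the" gets a much higher rate than "keys", and forms such as "has" and
# "dogs" appear as they are actually used, unlike in a plain word list
# where every entry would count once.
```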
For the case of Arabic, the first indicator is indeed the alphabet, but I
think there are similar usage patterns that help distinguish Arabic from
Urdu. In any case, the various encodings used for the Arabic script can
easily be determined from letter occurrence statistics.
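
A hedged sketch of that last point: decode the bytes under each candidate
encoding and keep the one whose letter rates are closest to a reference
profile. The candidate list, the very short reference text and the distance
measure are all assumptions of this example; a real profile would be
trained on a large Arabic corpus:

```python
from collections import Counter

# Codec names as registered in Python's standard library.
CANDIDATES = ["utf-8", "cp1256", "iso8859_6"]

# Toy reference text, far too short for real use.
REFERENCE_TEXT = "السلام عليكم ورحمة الله"


def letter_rates(text):
    """Relative frequency of each letter in the text."""
    letters = [c for c in text if c.isalpha()]
    total = len(letters)
    counts = Counter(letters)
    return {c: n / total for c, n in counts.items()} if total else {}


def distance(p, q):
    """Sum of absolute differences between two letter-rate tables."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


REFERENCE = letter_rates(REFERENCE_TEXT)


def guess_encoding(data):
    """Return the candidate whose decoding best matches the profile."""
    best, best_dist = None, None
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # bytes are invalid in this encoding, rule it out
        d = distance(letter_rates(text), REFERENCE)
        if best_dist is None or d < best_dist:
            best, best_dist = enc, d
    return best
```

A wrong decoding either fails outright or maps frequent letters onto
different ones, which moves the rates away from the profile.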