Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 11 2005 - 03:06:14 CDT


    ----- Original Message -----
    From: "Doug Ewell" <dewell@adelphia.net>
    To: "Unicode Mailing List" <unicode@unicode.org>
    Sent: Thursday, August 11, 2005 7:18 AM
    Subject: Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    >
    >> To detect a language, you could also try searching for very common
    >> terms like "the", "is", "are", "have", "and" in English, "le", "un",
    >> "a", "à", "est", "et" in French, "der", "das", "ist" in German.
    >
    > This is not a bad heuristic in general, but I don't think I'd suggest
    > using "a" as an indication that the text is in French. That word has a
    > tendency to occur in English now and then.

    I know, but it counts positively for both French and English (probably more
    in English than in French, where it is just a common conjugated form of an
    essential auxiliary verb). The idea is not to count single words, but to
    compute a summary statistic for each candidate language, using a list of
    words rated by occurrence probability. Such a list will be much larger than
    the few examples I gave, and will include other common words and
    contractions.
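
    As a minimal sketch of such a summary statistic (not a tuned
    implementation): each candidate language gets the average log-probability
    of the words in the text. The WORD_PROBS values and the FLOOR constant are
    made-up numbers for illustration; real ones would be estimated from
    running text.

```python
import math
import re

# Hypothetical, hand-picked probabilities for illustration only; real values
# would be estimated from running text in each language.
WORD_PROBS = {
    "English": {"the": 0.06, "is": 0.01, "are": 0.005, "have": 0.004,
                "and": 0.03, "a": 0.02},
    "French": {"le": 0.02, "un": 0.01, "a": 0.005, "à": 0.01,
               "est": 0.01, "et": 0.02},
    "German": {"der": 0.03, "das": 0.01, "ist": 0.01},
}
FLOOR = 1e-4  # assumed probability for any word missing from a list


def score_languages(text):
    """Return the average log-probability per word for each candidate language."""
    words = re.findall(r"\w+", text.lower())
    scores = {}
    for lang, probs in WORD_PROBS.items():
        total = sum(math.log(probs.get(w, FLOOR)) for w in words)
        scores[lang] = total / max(len(words), 1)
    return scores


def guess_language(text):
    """Pick the candidate language with the highest average score."""
    scores = score_languages(text)
    return max(scores, key=scores.get)
```

    Note that "a" appears in both the English and the French list, so it adds
    to both scores instead of deciding anything on its own.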

    Another idea is to compute the rates of letter occurrences across all
    words, as their distribution is often very specific to each language,
    provided the parsed text is long enough. Some letters are quite rare in one
    language but much more frequent in another (for example 'k' and 'y' are much
    more frequent in English than in French, and 'é' is very frequent in French
    but extremely rare in English; 'e', 's' and 'a' are among the most frequent
    letters in both languages, with a slight difference for 'r' and 'n').
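
    A rough sketch of the letter-rate comparison, assuming the reference
    profiles have already been computed from sample texts. The function names
    and the choice of distance (sum of absolute differences) are my own for
    illustration; other distance measures would work as well.

```python
from collections import Counter


def letter_frequencies(text):
    """Relative frequency of each letter, ignoring case, digits and punctuation."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: n / total for ch, n in counts.items()} if total else {}


def profile_distance(p, q):
    """Sum of absolute differences between two profiles (0 means identical)."""
    return sum(abs(p.get(ch, 0.0) - q.get(ch, 0.0)) for ch in set(p) | set(q))


def closest_language(text, profiles):
    """Return the key of the reference profile closest to the text's profile."""
    observed = letter_frequencies(text)
    return min(profiles, key=lambda lang: profile_distance(observed, profiles[lang]))
```

    The profiles argument maps a language name to a dictionary produced by
    letter_frequencies() on a training text for that language.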

    You can do the same with digraphs/trigraphs or contextual occurrences of
    letters (for example, English 'sh' versus French 'ch'; French 'ou' versus
    English 'ay'; French final 'e' or 'es'...).
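
    One possible way to count digraphs and trigraphs, including contextual
    ones such as a final 'e' or 'es': pad each word with boundary markers. The
    '^' and '$' markers are a convention assumed for this sketch only.

```python
from collections import Counter
import re


def ngram_counts(text, n=2):
    """Count letter n-grams inside words; '^' and '$' mark word start and end."""
    counts = Counter()
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "^" + word + "$"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts
```

    With this convention, ngram_counts(text)["e$"] counts words ending in 'e',
    and ngram_counts(text, 3)["es$"] counts words ending in 'es'.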

    The average length of words is also an indicator (short words of six
    letters or fewer are much more frequent in English than in French).
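
    A small sketch of this indicator, returning both the mean word length and
    the share of short words. The function name and the short_max parameter
    are assumptions made for the example; the threshold of six letters is the
    one mentioned above.

```python
import re


def word_length_stats(text, short_max=6):
    """Return (mean word length, share of words with at most short_max letters)."""
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0, 0.0
    mean = sum(len(w) for w in words) / len(words)
    short = sum(1 for w in words if len(w) <= short_max) / len(words)
    return mean, short
```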

    These ideas can be used after training the statistics on various texts made
    of phrases (not on dictionaries or word lists, as the statistics would be
    skewed by the lack of many word variants and conjugated verbs, and by an
    overly flat distribution of words).
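
    To illustrate the training step, here is a hedged sketch that estimates
    word probabilities from a corpus (train_word_probs is a hypothetical
    helper name). Fed running text, it gives common words a high rate; fed a
    word list in which every entry appears once, it can only return a uniform
    distribution, which is the flattening described above.

```python
from collections import Counter
import re


def train_word_probs(corpus):
    """Estimate each word's probability as its share of all words in the corpus."""
    words = re.findall(r"[^\W\d_]+", corpus.lower())
    total = len(words)
    return {w: n / total for w, n in Counter(words).items()} if total else {}
```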

    In the case of Arabic, the first indicator is indeed the alphabet, but I
    think that there are similar usage patterns that help distinguish Arabic
    from Urdu. In any case, the various encodings used for the Arabic script can
    easily be determined from letter occurrence statistics.
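
    A sketch of how that could work for encodings: decode the bytes under each
    candidate encoding and keep the one whose result looks most like Arabic
    text. The weights in ARABIC_PROFILE are invented for illustration (a real
    profile would be trained on running Arabic text), and the candidate list
    is just an assumption; the codec names are the ones Python provides.

```python
# Hypothetical weights for a few very common Arabic letters (illustrative only).
ARABIC_PROFILE = {
    "\u0627": 0.13,  # alef
    "\u0644": 0.12,  # lam
    "\u064a": 0.07,  # yeh
    "\u0645": 0.06,  # meem
    "\u0646": 0.06,  # noon
    "\u0648": 0.06,  # waw
}
CANDIDATES = ("utf-8", "cp1256", "iso8859_6")


def encoding_score(data, encoding):
    """Average profile weight per non-space character, or None if decoding fails."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return None
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ARABIC_PROFILE.get(ch, 0.0) for ch in chars) / len(chars)


def guess_encoding(data, candidates=CANDIDATES):
    """Return the candidate encoding with the best score, or None if none decodes."""
    scores = {enc: encoding_score(data, enc) for enc in candidates}
    valid = {enc: s for enc, s in scores.items() if s is not None}
    return max(valid, key=valid.get) if valid else None
```

    A candidate that cannot decode the bytes at all is simply dropped; among
    the rest, a wrong encoding tends to turn frequent letters into rare ones
    or into non-letters, which lowers its score.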


