From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 25 2005 - 05:04:54 CDT
From: <jarkko.hietaniemi@nokia.com>
> I believe a relatively simple exercise in statistics, playing with the
> typical n-gram frequencies,
> shows that you need to have dozens of letters to get any reasonably
> reliable results.
My intent was not to dismiss the idea of n-gram analysis, but to show that it
fails to identify languages at the claimed success rates. That's why I tested
it starting with short phrases, lengthening them until it returned a positive
result.
And I've found that in MANY cases, this method will NOT detect the correct
language until given MUCH longer sentences than what has been claimed in
various places.
But additionally, some other implementations give much better results using
n-gram analysis only. This suggests that the statistics used in those
implementations are better tuned, probably built from a less constrained
corpus of text and on realistic text samples.
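To make concrete what I mean by fixed-length n-gram analysis, here is a
minimal sketch in Python of the classic rank-order ("out-of-place") technique.
It is only an illustration of the general method, not the code of any of the
implementations discussed here, and the tiny training snippets are
placeholders for what would have to be a much larger per-language corpus:

# Minimal sketch of fixed-length character n-gram language identification
# using a rank-order ("out-of-place") distance between n-gram profiles.
# The training snippets are placeholders; real profiles need large corpora.
from collections import Counter

def ngrams(text, n=3):
    """Yield character n-grams of a padded, lowercased text."""
    text = " " + text.lower() + " "
    return (text[i:i + n] for i in range(len(text) - n + 1))

def profile(text, n=3, top=300):
    """Map the most frequent n-grams of a text to their rank."""
    counts = Counter(ngrams(text, n))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(sample_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams unknown to the language profile
    get the maximum penalty."""
    return sum(abs(rank - lang_profile.get(g, max_penalty + rank))
               for g, rank in sample_profile.items())

# Placeholder training data (far too small to be reliable in practice).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "fr": "le renard brun saute par dessus le chien paresseux et s'enfuit",
}
PROFILES = {lang: profile(text) for lang, text in TRAINING.items()}

def identify(sample):
    """Return the language whose profile is closest to the sample's."""
    sp = profile(sample)
    return min(PROFILES, key=lambda lang: out_of_place(sp, PROFILES[lang]))

print(identify("the dog runs away"))    # most likely "en"
print(identify("le chien paresseux"))   # most likely "fr"

Even this toy shows the problem I describe: with only a few words of input,
the sample profile contains too few n-grams for the distance to be meaningful.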
For example, I found that XEROX's language identifier performs much
better, on MUCH shorter texts. I suspect that it uses a distinct
mathematical model, and that it combines several analyses instead of just
one heuristic with badly tuned statistical models. Notably, XEROX seems to use
variable-length n-gram analysis instead of fixed-length n-grams, and it also
uses short-word analysis (there have been reports where n-gram and
short-word analysis each identified languages at the same or a similar success
rate, but combining the two orthogonal approaches gives much more significant
results; a generalized method would be to combine the two using
variable-length n-gram analysis, and that's apparently what XEROX has done --
variable-length analysis does not attempt to identify arbitrary fixed-length
n-grams, but instead attempts to approximate the syllabic level well).
An extension of this variable-length n-gram analysis would be to build quite
reliable syllable breakers without using huge dictionaries...
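To illustrate the kind of combination described above, here is a rough sketch
that mixes the n-gram distance with a short-word (function-word) score. It
reuses the profile(), out_of_place() and PROFILES names from the previous
sketch; the word lists and the mixing weight are invented for illustration
and do not correspond to XEROX's (or anyone's) actual tuning:

# Sketch of combining two orthogonal signals: the n-gram distance from the
# previous sketch plus a short-word match score. The word lists and the
# weight below are illustrative assumptions only.
SHORT_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "it"},
    "fr": {"le", "la", "et", "de", "un", "une", "est"},
}

def short_word_score(sample, lang):
    """Fraction of the sample's tokens that are known short words of lang."""
    tokens = sample.lower().split()
    if not tokens:
        return 0.0
    return sum(t in SHORT_WORDS[lang] for t in tokens) / len(tokens)

def identify_combined(sample, weight=0.5):
    """Pick the language with the lowest combined score: normalized n-gram
    distance, reduced by a bonus for matched short words."""
    sp = profile(sample)
    def score(lang):
        dist = out_of_place(sp, PROFILES[lang]) / max(len(sp), 1)
        return dist - weight * 300 * short_word_score(sample, lang)
    return min(PROFILES, key=score)

print(identify_combined("le chien et le renard"))   # most likely "fr"

On very short inputs the short-word evidence often decides the result, which
is consistent with the reports mentioned above that the two methods perform
similarly on their own but complement each other when combined.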
For now, I can just conclude that basic fixed-length n-gram analysis fails
for too many practical cases where language identification is needed. It
will only succeed when parsing sufficiently long monolingual non-technical
texts (for example, articles about history in Wikipedia). There are tons of
other texts for which one would need automatic language identification,
notably in plain-text search engines and indexers.
I am not speaking about what Google does, because Google already has a huge
database of dictionaries available, which is constantly augmented by the
very large corpus of web sites it can index. Google can then identify
languages not by n-grams and short words only, and most probably not by
syllabic structure only, but directly at the word and phrase level, and also
most probably at the semantic level (using the semantic relations created by
matching occurrences of terms in the same paragraphs, from a large corpus of
texts written by different sources). (For example, Google could correlate the
various conjugated verb forms using such an approach, and discover relations
between singular/plural or feminine/masculine forms, or declined forms, simply
because of the relations that exist between words within phrases found in lots
of documents.) And for this reason, the heuristic used to identify languages
is certainly MUCH more complex (I don't know if it has been implemented in
Google Desktop Search, or if that just uses a heuristic tweaked in favor of
the desktop user's locale; in fact I don't see a language selection in Google
Desktop Search, so I doubt that it is implemented).
Anyway, it seems that all the work related to n-gram analysis (and short-word
analysis) was finished by early 1996, and no significant results have been
published since then. Nearly 10 years have elapsed, and I'm sure that there
now exist other approaches that could be combined to offer better
identification results.