From: Tom Emerson (tree@basistech.com)
Date: Wed Aug 24 2005 - 20:59:26 CDT
Philippe Verdy writes:
> From: "Tom Emerson" <tree@basistech.com>
> > http://odur.let.rug.nl/~vannoord/TextCat/
> >
> > Indeed, all of the data van Noord uses is included in his distribution.
>
> I tried his demo page just with French, and the conclusions are not good.
Oh, his system is not very good... I didn't mean (if I did at all)
that it was. It's just one that is raised repeatedly when people
evaluate the language/encoding identifier my company sells. His
training corpora are ridiculously small for building any useful
model. What's more, as soon as you feed it unclean text with weird
capitalization (for example), it gives up the ghost completely.
> I fear that it bases its results only on digrams, but does not use trigrams.
The Cavnar and Trenkle algorithm (and van Noord's implementation of
it) generates n-grams, 1 <= n <= 5, and keeps the 300 most
frequent. These are usually the unigrams of the language, as well as
some bigrams. Only when you train on a *lot* of data do you see
n-grams in the top 300 with n > 3. I've successfully used their
algorithm for dialect identification, for example, because it is so
trivially implemented.
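
Roughly, in Python (a quick sketch from memory, not van Noord's code;
the tokenization, underscore padding, and function names are my own
simplifications):

    from collections import Counter

    def ngram_profile(text, max_n=5, top_k=300):
        # Count character n-grams, 1 <= n <= max_n, padding each token
        # with underscores in the Cavnar & Trenkle style.
        counts = Counter()
        for token in text.lower().split():
            padded = "_%s_" % token
            for n in range(1, max_n + 1):
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1
        # Keep the top_k most frequent, recording each one's rank.
        return {g: rank for rank, (g, _) in enumerate(counts.most_common(top_k))}

    def out_of_place(doc, lang):
        # Cavnar & Trenkle's "out-of-place" distance: sum of rank
        # displacements, with the maximum penalty for missing n-grams.
        worst = len(lang)
        return sum(abs(r - lang.get(g, worst)) for g, r in doc.items())

    # Classify by minimum distance against per-language training profiles:
    # profiles = {"en": ngram_profile(en_text), "fr": ngram_profile(fr_text)}
    # guess = min(profiles, key=lambda l: out_of_place(ngram_profile(sample), profiles[l]))

The whole thing fits in a couple dozen lines, which is why it gets
reimplemented so often.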
> Now the quoted references are quite old (about 1996). There are certainly
> better techniques today than just n-grams...
Just n-grams gets you a long way, actually. However, there are other
techniques that are used in larger and more accurate systems: I will
dig up references to more recent work that utilizes hidden Markov
models and other probabilistic methods to good effect.
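
I won't reconstruct the HMM work here, but even the simplest
probabilistic step up from rank matching, a smoothed character-bigram
model scored by log-likelihood, looks something like this (the names
and the add-one smoothing are just illustrative choices, not any
particular published system):

    import math
    from collections import Counter

    def train_bigrams(text, vocab=256):
        # Add-one smoothed transition log-probabilities P(b|a),
        # estimated from adjacent character pairs.
        pairs = Counter(zip(text, text[1:]))
        ctx = Counter(text[:-1])
        logp = {p: math.log((c + 1.0) / (ctx[p[0]] + vocab)) for p, c in pairs.items()}
        return logp, ctx, vocab

    def loglik(text, model):
        # Higher log-likelihood means the text fits the language better;
        # unseen pairs fall back to the smoothed floor probability.
        logp, ctx, vocab = model
        return sum(
            logp.get((a, b), math.log(1.0 / (ctx.get(a, 0) + vocab)))
            for a, b in zip(text, text[1:])
        )

    # models = {"en": train_bigrams(en_text), "fr": train_bigrams(fr_text)}
    # guess = max(models, key=lambda l: loglik(sample, models[l]))

Extending the context from a single character to a hidden state
sequence is essentially where the HMM approaches pick up.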
-tree
--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "You can't fake quality any more than you can fake a good meal." (W.S.B.)