From: John Burger (john@mitre.org)
Date: Wed Jan 14 2004 - 10:16:41 EST
Mark E. Shoulson wrote:
> If it's a heuristic we're after, then why split hairs and try to make
> all the rules ourselves? Get a big ol' mess of training data in as
> many languages as you can and hand it over to a class full of CS
> graduate students studying Machine Learning.
Absolutely my reaction. All of these suggested heuristics are great,
but would almost certainly simply fall out of a more rigorous approach
using a generative probabilistic model, or some other classification
technique. Useful features would include n-graph frequencies, as Mark
suggests, as well as lots of other things. For particular
applications, you could use a cache model, e.g., using statistics from
other documents from the same web site, or other messages from the same
email address, or even generalizing across country-of-origin.
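
As a rough sketch of the kind of generative model described above, here
is a naive Bayes classifier over byte bigrams. Everything in it is an
assumption made for illustration: the NgramClassifier name, the
"language/encoding" label strings, the add-one smoothing, and the tiny
training strings. The optional prior argument stands in for the cache
idea (label frequencies previously seen for the same site or sender).

```python
import math
from collections import Counter, defaultdict


def ngrams(data: bytes, n: int = 2):
    """Return the overlapping byte n-grams of data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


class NgramClassifier:
    """Hypothetical naive Bayes model over byte n-grams (illustration only)."""

    def __init__(self, n: int = 2, alpha: float = 1.0):
        self.n = n
        self.alpha = alpha                  # add-alpha smoothing constant
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.vocab = set()                  # every n-gram seen in training

    def train(self, data: bytes, label: str) -> None:
        grams = ngrams(data, self.n)
        self.counts[label].update(grams)
        self.vocab.update(grams)

    def log_likelihood(self, data: bytes, label: str) -> float:
        counts = self.counts[label]
        # One extra vocabulary slot stands in for all unseen n-grams.
        denom = sum(counts.values()) + self.alpha * (len(self.vocab) + 1)
        return sum(math.log((counts[g] + self.alpha) / denom)
                   for g in ngrams(data, self.n))

    def classify(self, data: bytes, prior=None) -> str:
        """Pick the best label. prior is an optional label -> probability
        dict, e.g. label frequencies cached for the same site or sender."""
        def score(label):
            s = self.log_likelihood(data, label)
            if prior is not None:
                s += math.log(prior.get(label, 1e-6))
            return s
        return max(list(self.counts), key=score)


clf = NgramClassifier()
clf.train("the quick brown fox jumps over the lazy dog".encode("ascii"),
          "en/ascii")
clf.train("le général a été très étonné par la réponse".encode("latin-1"),
          "fr/latin-1")
print(clf.classify("the dog and the fox".encode("ascii")))
print(clf.classify("été très".encode("latin-1")))
```

When the input is too short to carry any n-gram evidence, the prior
alone decides, which is where cached per-site or per-sender statistics
would plausibly help most.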
Additionally, I'm pretty sure that you could get some mileage out of
unlabeled data, that is, not every document in the training set needs
to be labeled with language/encoding. And one thing we have a lot of on
the web is unlabeled data.
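
One simple way to exploit documents without labels is self-training:
fit a model on the few labeled documents, guess labels for the rest,
and refit on everything. The sketch below is only an illustration under
stated assumptions; the function names, the bigram model, the number of
rounds, and the toy documents are all hypothetical, and a real system
might use EM with soft counts instead of hard guesses.

```python
import math
from collections import Counter


def bigrams(data: bytes):
    """Return the overlapping byte bigrams of data."""
    return [data[i:i + 2] for i in range(len(data) - 1)]


def fit(docs):
    """Build per-label bigram counts from a list of (bytes, label) pairs."""
    model = {}
    for data, label in docs:
        model.setdefault(label, Counter()).update(bigrams(data))
    return model


def score(model, data, label, alpha=1.0):
    """Smoothed log-likelihood of data under the given label."""
    counts = model[label]
    vocab = len(set().union(*model.values())) + 1  # +1 for unseen bigrams
    denom = sum(counts.values()) + alpha * vocab
    return sum(math.log((counts[g] + alpha) / denom) for g in bigrams(data))


def predict(model, data):
    return max(model, key=lambda label: score(model, data, label))


def self_train(labeled, unlabeled, rounds=3):
    """Refit repeatedly, treating the model's own guesses as labels."""
    model = fit(labeled)
    for _ in range(rounds):
        guessed = [(doc, predict(model, doc)) for doc in unlabeled]
        model = fit(labeled + guessed)
    return model


labeled = [(b"the cat sat on the mat", "en"),
           (b"le chat est sur la table", "fr")]
unlabeled = [b"the dog and the other dog sat on the other mat",
             b"le chien est sur la chaise et le chat est la"]

model = self_train(labeled, unlabeled)
print(predict(model, b"the dog"))
print(predict(model, b"le chien"))
```

After self-training, the model holds bigram counts for words such as
"dog" and "chien" that never appeared in a labeled document.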
I would be extremely surprised if such an approach couldn't achieve 99%
accuracy - and I really do mean 99%, or better.
By the way, I still don't quite understand what's special about Thai.
Could someone elaborate?
- John Burger
MITRE