From: Ken Krugler (ken@transpac.com)
Date: Wed Aug 24 2005 - 11:51:33 CDT
Hi all,
Kevin Burton has created an open source language detector written in
Java (see
<http://www.feedblog.org/2005/08/ngram_language_.html>)
and he's asking for contributions of sample data for additional
languages.
Any suggestions for a multi-lingual corpus that could be used as
training data? I believe he used some Wikipedia entries, but I'm
hoping there are larger and more complete public data sets available.
Thanks,
-- Ken
-- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200