From: Tom Emerson (tree@basistech.com)
Date: Wed Aug 24 2005 - 13:26:03 CDT
Philippe Verdy writes:
> I wonder if it's a good idea to provide him with such data, if he
> does not want to publish anything in fact (there may be legal issues
> with his source, notably if he used copyrighted materials such as
> the paper he is citing).
Well, the Cavnar and Trenkle paper has been around for a long time:
it's a trivial algorithm to implement, and has served as the
foundation for many of the open sourced or freely available
language/encoding ID systems that are out there. Most notable is van
Noord's Perl "TextCat" program, which has profiles for 77
language/encoding pairs:
http://odur.let.rug.nl/~vannoord/TextCat/
Indeed, all of the data van Noord uses is included in his distribution.
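To give a sense of how small the algorithm is, here is a minimal Python sketch of the n-gram rank-order idea from the Cavnar and Trenkle paper. It is not TextCat's code and not the paper's exact procedure: the function names (ngram_profile, out_of_place, classify), the whitespace tokenization, the underscore padding, and the defaults of n-grams up to length 5 with the top 300 kept are assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first.

    Assumption: tokens are split on whitespace and padded with "_" on both
    sides; the paper's own tokenization is stricter than this.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles.

    An n-gram missing from lang_profile gets a fixed maximum penalty
    (here, the length of lang_profile).
    """
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the key of the profile with the smallest out-of-place distance."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))
```

In use, one would build one profile per language/encoding pair from training text, store them in a dict keyed by name, and call classify(sample, profiles). Real systems train on far more text than a toy example and usually handle punctuation and digits more carefully.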
The copyright issue is a real one, and he'll need to be careful if he
decides to re-release the data.
-tree
--
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"You can't fake quality any more than you can fake a good meal." (W.S.B.)