RE: Multi-lingual corpus?

From: jarkko.hietaniemi@nokia.com
Date: Thu Aug 25 2005 - 03:33:23 CDT

Next message: Adam Twardoch: "Re: Windows Glyph Handling"

Previous message: Michael Everson: "Re: Windows Glyph Handling"
Maybe in reply to: Ken Krugler: "Multi-lingual corpus?"
Next in thread: Philippe Verdy: "Re: Multi-lingual corpus?"
Reply: Philippe Verdy: "Re: Multi-lingual corpus?"

> I tried his demo page just with French, and the conclusions
> are not good.
> - starting with "essai", it replied finnish
> - extending it to "un essai", it replied romanian
> - extending it to "un essai long", "un essai plus long", or
>   "un essai encore plus long", it replied "rumantsh"
> - extending it to "ceci est un essai long", "ceci est un essai
>   trop long", "ceci est un essai encore trop long", "ceci est un
>   essai suffisant", it replied again romanian...

I think you are being much too harsh in your judgment. It would do well to sit
down and think for a moment about what it does, based on what input, and what
it outputs. Or you could just have some fun and see what it does.

a irish
au welsh
auk malay
auke german
aukea basque
aukeam malay
aukeama swahili
aukeamaa sanskrit
aukeamaan finnish

('aukeamaan' is a valid Finnish word.) My main point, I guess, is this: take
a look at the replies. 'a' is a valid word in MANY languages, but it replies with only
one. Ditto for 'au', 'auk', and 'auke'. 'aukea', 'aukeama', and 'aukeamaa' are valid
Finnish words, but apparently they could be Basque, Swahili, and Sanskrit.

I believe a relatively simple exercise in statistics, playing with the typical n-gram frequencies,
shows that you need to have dozens of letters to get any reasonably reliable results.
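
To make that concrete, here is a rough sketch of the kind of character trigram
guesser I have in mind. It is NOT the demo page's actual code (I have not seen
it), and everything in it is an assumption for illustration: the function names,
the add-one smoothing, and above all the two toy "training corpora", which are
just phrases taken from this thread. The thing to watch is the margin between the
best and second-best language, which is tiny for a one-letter input and grows as
more trigrams become available.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams of text, space-padded at both ends."""
    pad = " " * (n - 1)
    padded = pad + text.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Map each language name to a Counter of its n-gram counts."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}

def rank(text, profiles, n=3):
    """Return (language, log-probability) pairs, best first.

    Uses add-one smoothing so unseen n-grams get a small nonzero probability.
    """
    vocab_size = len(set().union(*profiles.values()))
    scores = {}
    for lang, profile in profiles.items():
        total = sum(profile.values())
        scores[lang] = sum(
            math.log((profile[g] + 1) / (total + vocab_size))
            for g in char_ngrams(text, n)
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpora (hypothetical, far too small for real use).
SAMPLES = {
    "finnish": "aukea aukeama aukeamaa aukeamaan",
    "french": "ceci est un essai encore trop long",
}
PROFILES = train(SAMPLES)

for text in ("a", "aukeamaan", "ceci est un essai"):
    (best, s1), (_, s2) = rank(text, PROFILES)
    print(f"{text!r}: {best} (margin {s1 - s2:.2f})")
```

With only two candidate languages this looks far easier than it really is. Add
dozens of languages that share an alphabet and the margins for short inputs
shrink to noise, which is about what the replies above show.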


This archive was generated by hypermail 2.1.5 : Thu Aug 25 2005 - 03:36:23 CDT