At 01:40 23/03/99 -0800, Joerg Knappen wrote:
>Alain LaBonte wrote:
>> Try to filter non-ASCII from French and messages are unreadable at best,
>> even if about 97% of characters are indeed ASCII (statistics on some
>> corpuses I have)... but the 3% remaining is highly relevant, do not forget
>> to mention, and essential.
>
>Provocative thought: How much of the filtered information can be restored by
>a decent spell-checker
I don't think a spell-checker would do a good job, as there are often
multiple accentuation hypotheses for an unaccented word. The software
described at:
http://www.alis.com/castil/reacc/index.html
generates all those hypotheses for a sentence, analyses the sentence and
then picks the most likely forms using a statistical language model. It
achieves an error rate of about 1 in 130 (better than 99% accuracy).
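
To make the idea concrete, here is a minimal Python sketch of that kind of
approach: generate every accentuation hypothesis for each word, then keep
the sequence a language model scores highest. It is not the code of the
software mentioned above. The vocabulary, the bigram scores and the names
(VOCAB, BIGRAM, reaccent) are all made up for illustration.

```python
import itertools
import unicodedata


def strip_accents(word):
    """Remove diacritics but keep base letters: 'côté' -> 'cote'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy vocabulary (assumption: a real system would use a full lexicon).
VOCAB = ["cote", "côte", "côté", "coté", "le", "la", "droit"]

# Map each unaccented form to every vocabulary word that reduces to it.
CANDIDATES = {}
for w in VOCAB:
    CANDIDATES.setdefault(strip_accents(w), []).append(w)

# Toy bigram scores (made-up numbers standing in for a trained model).
BIGRAM = {
    ("le", "côté"): 5,
    ("la", "côte"): 5,
    ("côté", "droit"): 4,
}


def score(words):
    """Sum the bigram scores of adjacent word pairs; unseen pairs score 0."""
    return sum(BIGRAM.get(pair, 0) for pair in zip(words, words[1:]))


def reaccent(sentence):
    """Try every combination of candidates and return the best-scoring one."""
    tokens = sentence.split()
    options = [CANDIDATES.get(t, [t]) for t in tokens]
    best = max(itertools.product(*options), key=score)
    return " ".join(best)


print(reaccent("le cote droit"))
print(reaccent("la cote"))
```

The exhaustive product grows exponentially with sentence length, so a real
system would presumably search the hypotheses more cleverly (for instance
with dynamic programming) instead of enumerating them all.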
> (in the worst case: Non-ASCII-characters are just
>discarded and not replaced by something more or less insensible) ?
Ah! That's harder. The above starts from unaccented but otherwise
correct words (correct base letters). Including the possibility of adding
accented letters to the words would multiply the number of accentuation
hypotheses and consequently increase the error rate, but I don't really
know to what extent.
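
A small Python sketch of why deletion is the harder case, under stated
assumptions: it indexes a tiny made-up word list (VOCAB, hypothetical)
under both kinds of damage. When accents are merely stripped, the base
letter survives and keeps the words apart. When the accented characters
are discarded, words that differ in that letter collapse onto the same
damaged form. It only shows the mechanism and says nothing about how
large the effect is on real text.

```python
import unicodedata
from collections import defaultdict

# Toy word list (assumption, chosen only to illustrate the collision).
VOCAB = ["pâle", "pôle", "pèle", "mère", "mûre"]


def strip_accents(word):
    """Damage model 1: accents removed, base letters kept ('pâle' -> 'pale')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def drop_non_ascii(word):
    """Damage model 2: non-ASCII characters discarded ('pâle' -> 'ple')."""
    composed = unicodedata.normalize("NFC", word)
    return "".join(ch for ch in composed if ord(ch) < 128)


def index_by(damage):
    """Group vocabulary words by the damaged form they would turn into."""
    groups = defaultdict(list)
    for w in VOCAB:
        groups[damage(w)].append(w)
    return dict(groups)


stripped = index_by(strip_accents)
dropped = index_by(drop_non_ascii)

for key, words in stripped.items():
    print("stripped", key, words)
for key, words in dropped.items():
    print("dropped ", key, words)
```

With this list, each stripped form maps back to a single word, while the
dropped form "ple" has three candidates and "mre" has two.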
-- François Yergeau