From: Doug Ewell (dewell@adelphia.net)
Date: Thu Sep 21 2006 - 22:39:21 CDT
Hans Aberg <haberg at math dot su dot se> wrote:
> So then, why not (if this is not what you are already doing) just take
> a large English text body, and compute the statistics of the words in
> it. Then sort the list, putting the more frequent words first, and
> give the words the number they have in this list. Then apply UTF-8...
This would be intended as a general-purpose scheme, of course, not for
the specific purpose I cited of character names, which are nowhere near
representative of English word frequency.
You bring up some interesting points, some of which I've already thought
of -- particularly the ability to fall back to character-by-character
spelling of rarer words, just as sign languages include a fallback to
fingerspelling. One possible pitfall is the number of "common" words in
English; the more words are assigned tokens, the greater the average (or
longest) token size. You have to decide where to draw the line.
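To make the trade-off concrete, here is a rough sketch of the scheme as I read it: rank words by frequency, give each word its rank as a number, and serialize that number the way UTF-8 serializes a code point. Everything named here is my own assumption for illustration, not part of any existing scheme: the ESCAPE and END control values that frame a spelled-out word, the whitespace-only word splitting, and the max_tokens cutoff.

```python
from collections import Counter

ESCAPE, END = 0, 1   # hypothetical control values framing a spelled-out word
FIRST_RANK = 2       # token values for ranked words start here


def build_ranks(corpus, max_tokens):
    """Map the max_tokens most frequent words to ranks (0 = most frequent)."""
    counts = Counter(corpus.lower().split())
    # Descending count, then alphabetical so that ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered[:max_tokens])}


def to_utf8(value):
    """Serialize a token value as if it were a code point in UTF-8."""
    if value >= 0xD800:
        value += 0x800   # skip the surrogate range, which UTF-8 cannot encode
    return chr(value).encode("utf-8")


def encode(text, ranks):
    """Emit a token for each ranked word; spell out every other word."""
    out = bytearray()
    for word in text.lower().split():
        if word in ranks:
            out += to_utf8(ranks[word] + FIRST_RANK)
        else:
            # Fallback: ESCAPE, the word's own characters, then END.
            out += to_utf8(ESCAPE) + word.encode("utf-8") + to_utf8(END)
    return bytes(out)


ranks = build_ranks("the cat and the dog and the bird", max_tokens=3)
encoded = encode("the dog", ranks)   # "the" is a token, "dog" is spelled out
```

Under these assumptions only the 126 most frequent words get a one-byte token and the next 1,920 get two bytes; everything after that costs three bytes or more, which is exactly the "where to draw the line" question.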
This is really becoming OT for the Unicode list, but I'll be happy to
discuss it further in private mail.
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645 * UTN #14