Re: Unicode & space in programming & l10n

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Sep 21 2006 - 22:39:21 CDT

Next message: Doug Ewell: "Re: Unicode & space in programming & l10n"

Previous message: Steve Summit: "Re: Unicode & space in programming & l10n"
In reply to: Hans Aberg: "Re: Unicode & space in programming & l10n"
Next in thread: Hans Aberg: "Re: Unicode & space in programming & l10n"
Reply: Hans Aberg: "Re: Unicode & space in programming & l10n"

Hans Aberg <haberg at math dot su dot se> wrote:

> So then, why not (if this is not what you are already doing) just take
> a large English text body, and compute the statistics of the words in
> it. Then sort the list, putting the more frequent words first, and
> give the words the number they have in this list. Then apply UTF-8...

That would be a general-purpose scheme, of course, not one for the
specific purpose I cited: character names, which are nowhere near
representative of English word frequency.
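
As a rough illustration of the scheme described in the quoted text, here
is a minimal Python sketch. The function names (rank_words,
utf8_style_bytes) and the whitespace-based word splitting are my own
assumptions, not part of anyone's actual implementation. The encoder
applies the UTF-8 bit layout to arbitrary integers (word ranks), so its
output is not meant to be decoded as real Unicode text.

```python
from collections import Counter


def rank_words(text):
    """Map each word to its frequency rank (0 = most frequent)."""
    counts = Counter(text.lower().split())
    # Sort by descending count; break ties alphabetically so the
    # numbering is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered)}


def utf8_style_bytes(n):
    """Encode a non-negative integer using the UTF-8 bit layout."""
    if n < 0:
        raise ValueError("rank must be non-negative")
    if n < 0x80:
        return bytes([n])
    if n < 0x800:
        return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
    if n < 0x10000:
        return bytes([0xE0 | (n >> 12),
                      0x80 | ((n >> 6) & 0x3F),
                      0x80 | (n & 0x3F)])
    if n < 0x200000:
        return bytes([0xF0 | (n >> 18),
                      0x80 | ((n >> 12) & 0x3F),
                      0x80 | ((n >> 6) & 0x3F),
                      0x80 | (n & 0x3F)])
    raise ValueError("rank too large for a 4-byte sequence")


ranks = rank_words("the cat and the dog and the bird")
encoded = b"".join(utf8_style_bytes(ranks[w]) for w in "the dog".split())
```

With this layout the 128 most frequent words take one byte each, the
next 1,920 take two bytes, and so on.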

You bring up some interesting points, some of which I've already thought
of -- particularly the ability to fall back to character-by-character
spelling of rarer words, just as sign languages include a fallback to
fingerspelling. One possible pitfall is the number of "common" words in
English; the more words that are assigned tokens, the greater the average
(or maximum) token size. You have to decide where to draw the line.
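
To make the trade-off concrete, here is a small Python cost model. It is
only a sketch under assumptions of my own: tokens are sized like UTF-8
sequences, a spelled-out word costs its UTF-8 length plus a hypothetical
one-byte escape marker, and the names token_length and average_cost are
invented for illustration. It estimates size only and does not define an
actual byte format.

```python
def token_length(rank):
    """Bytes needed for a rank under a UTF-8-style variable-length layout."""
    if rank < 0x80:
        return 1
    if rank < 0x800:
        return 2
    if rank < 0x10000:
        return 3
    return 4


def average_cost(ranked, cutoff):
    """Average bytes per word occurrence.

    ranked: list of (word, count) pairs, most frequent first.
    cutoff: how many of the top-ranked words get a token; the rest
            fall back to character-by-character spelling.
    """
    total_bytes = 0
    total_words = 0
    for rank, (word, count) in enumerate(ranked):
        if rank < cutoff:
            cost = token_length(rank)
        else:
            # Assumed fallback: spelled-out word plus a 1-byte escape.
            cost = len(word.encode("utf-8")) + 1
        total_bytes += cost * count
        total_words += count
    return total_bytes / total_words


sample = [("the", 3), ("and", 2), ("bird", 1)]
with_tokens = average_cost(sample, cutoff=2)
spelled_only = average_cost(sample, cutoff=0)
```

Running this over a real frequency list with several cutoff values
would show where adding more tokens stops paying for itself.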

This is really becoming OT for the Unicode list, but I'll be happy to
discuss it further in private mail.

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645  *  UTN #14


This archive was generated by hypermail 2.1.5 : Thu Sep 21 2006 - 22:44:01 CDT