From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 10:21:57 CST
From: "Ray Mullan" <ray@mullan.net>
>I don't see how the one million available codepoints in the Unicode
>Standard could possibly accommodate a grammatically accurate vocabulary of
>all the world's languages.
You have misread the message from Tim: he wanted to use "code points" above
U+10FFFF within the full 32-bit space (meaning more than 4 billion
codepoints, when Unicode and ISO-10646 only allow about 1.1 million...)
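For reference, here is a minimal sketch of that code-space arithmetic (Python
is used purely for illustration; the constants are the standard Unicode and
ISO 10646 limits):

    # Code-space arithmetic for the limits mentioned above.
    MAX_CODE_POINT = 0x10FFFF        # upper limit shared by Unicode and ISO/IEC 10646
    print(MAX_CODE_POINT + 1)        # 1114112 code points, i.e. 17 planes of 65536
    print(2 ** 32)                   # 4294967296 values in a full 32-bit space
    # Conforming tools reject anything beyond U+10FFFF:
    try:
        chr(0x110000)
    except ValueError as err:
        print(err)                   # "chr() arg not in range(0x110000)"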
He wanted to use that to encode whole words as single code points, as a
possible compression scheme. But he forgets that words can have their
component letters affected by style or during rendering.
Also, a "font" or renderer would be unable to draw the text without having
the equivalent of an indexed dictionary of all words on the planet!
If compression is the goal, he forgets that the space gain offered by such a
scheme would be very modest compared to more generic data compressors like
deflate or bzip2, which can compress the represented texts more efficiently
without even needing such a large dictionary (one that is in perpetual
evolution by every speaker of every language, without any prior standard
agreement anywhere!).
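As a rough illustration (a Python sketch, assuming ordinary UTF-8 text as
input), general-purpose compressors already exploit the repetition of whole
words without needing any word registry:

    import bz2, zlib

    # Repetitive English text stands in for typical prose.
    text = ("the quick brown fox jumps over the lazy dog " * 200).encode("utf-8")
    print(len(text))                    # raw size in bytes
    print(len(zlib.compress(text, 9)))  # deflate, maximum compression level
    print(len(bz2.compress(text, 9)))   # bzip2, maximum compression level

Both shrink the sample to a small fraction of its original size, with no
dictionary shipped alongside the data.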
Forget his idea: it is technically impossible to do. At best you could
create some protocols that compact some widely used words (this is what WAP
does for widely used HTML elements or attributes), but this is still not a
standard outside of that limited context.
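To make the analogy concrete, here is a sketch in the spirit of WBXML-style
tokenization (the token values and tag list below are purely illustrative,
not the real WAP assignments, and a real scheme would keep token bytes
separate from literal text bytes):

    # Replace a small agreed-upon set of markup strings with one-byte tokens.
    TOKENS = {"<html>": 0x01, "</html>": 0x02, "<p>": 0x03, "</p>": 0x04}

    def compact(markup: str) -> bytes:
        out = bytearray()
        i = 0
        while i < len(markup):
            for literal, token in TOKENS.items():
                if markup.startswith(literal, i):
                    out.append(token)       # one byte instead of the whole tag
                    i += len(literal)
                    break
            else:
                out.extend(markup[i].encode("utf-8"))  # other text passes through
                i += 1
        return bytes(out)

    print(compact("<html><p>hello</p></html>"))  # b'\x01\x03hello\x04\x02'

The point is that such a table only works because both ends agreed on it in
advance, within one narrow protocol; nothing of the sort exists, or could
exist, for all the words of all languages.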
Suppose that Unicode encodes the common English words "the", "an", "is",
etc.; then a protocol could decide that these words are not important and
will filter them out. What will happen if these "words" also appear in
non-English languages where they are semantically significant? These words
would go missing. To mitigate this problem the codepoints would have to
designate only the words used in one language and not the others, so "an"
would get different codes depending on whether it is used in English or in
another language.
The last problem is that too many languages do not have well-established and
computerized lexical dictionaries, and the grammatical rules that allow
composing words are not always known. Nor can the number of words in a
single language be bounded by a known maximum (a good example is German,
where compound words are virtually unlimited!).
So forget this idea: Unicode will not create a standard to encode words.
Words will be represented after modeling them in a script system made of
simpler sets of "letters" or "ideographs" or punctuation and diacritics. The
representation of words with those letters is an orthographic system,
specific to each language, which Unicode will not standardize.