From: Doug Ewell (dewell@adelphia.net)
Date: Sun Dec 05 2004 - 21:02:15 CST
Hohberger, Clive <CHohberger at zebra dot com> wrote:
> When I went back and recoded those same words with leading or trailing
> spaces (denoted here by "_") as: "_the", "the_", "_and", "and_", etc.
> as single bytes, I found a huge gain in efficiency in terms of the
> number of bytes to encode the same English text. Next, when you look
> at the most common word starting letters and encode them as "_s" and
> "_t", etc., and the most common word terminator letters and encode
> them as "r_", "d_", etc., you gain additional efficiency in a 256-
> codeword alphabet/word encoding for English.
>
> What it said to me is that, from a coding efficiency viewpoint, we
> need to think of words in an alphabetic language as a sequence of
> letters with the space as either a prefix or terminator character,
> rather than the space as a separator character between words
> represented as alphabetic strings.
A word-based encoding for English could automatically assume spaces
where they are appropriate. The sentence:
"What means this, my lord?"
would have seven encodable elements: the five words, the comma, and the
question mark. Spaces would be automatically filled in as needed, not
explicitly encoded. This implies "standard" English punctuation and
spacing conventions, however that is defined. For French conventions,
there would probably be a space before the question mark as well.
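As a rough illustration of the idea, here is a minimal Python sketch that
splits the sentence into its encodable elements and then rebuilds the
text with spaces filled in automatically. The names split_elements,
join_elements, CLOSING_PUNCT and the space_before_punct parameter are
invented for this example, and the spacing rule is a deliberately crude
assumption, not a full model of English or French conventions.

```python
import re

# Assumed rule for this sketch: these marks attach to the preceding
# element with no space in front of them.
CLOSING_PUNCT = {",", ".", "?", "!", ";", ":"}

def split_elements(text):
    """Split text into encodable elements: words and single punctuation marks."""
    return re.findall(r"[A-Za-z']+|[^\sA-Za-z']", text)

def join_elements(elements, space_before_punct=frozenset()):
    """Rebuild text from elements, inserting spaces that were never stored.

    space_before_punct lists marks that do take a leading space,
    for example {"?"} to approximate the French convention.
    """
    out = []
    for i, el in enumerate(elements):
        needs_space = el not in CLOSING_PUNCT or el in space_before_punct
        if i > 0 and needs_space:
            out.append(" ")
        out.append(el)
    return "".join(out)

elements = split_elements("What means this, my lord?")
print(len(elements))                                     # 7
print(join_elements(elements))                           # What means this, my lord?
print(join_elements(elements, space_before_punct={"?"})) # What means this, my lord ?
```

No space is stored anywhere in the element list; the decoder alone
decides where spaces go, which is why the conventions have to be agreed
on in advance.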
Such an encoding would probably also include logic to capitalize the
first word of each sentence, plus the ability to override this logic for
proper names and non-capitalized sentences. There might also be
unification of conjugations and declensions (and similar for other
languages) to conserve space. "Boy" and "boys" might be encoded with
the same code point, with contextual clues elsewhere in the sentence to
disambiguate the two.
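The capitalization logic, at least, is easy to sketch. The following
Python fragment is only an assumption about how it might work: elements
are stored in lowercase, the first word after a sentence-ending mark is
capitalized on output, and a hypothetical KEEP marker overrides that for
the next element. Proper names are assumed to be stored already
capitalized. Unifying "boy" and "boys" is a much harder problem and is
not attempted here.

```python
SENTENCE_END = {".", "?", "!"}

# Hypothetical override marker: emit the next element exactly as stored.
KEEP = object()

def apply_capitalization(elements):
    """Capitalize the first word of each sentence unless overridden by KEEP."""
    out = []
    at_start = True      # the next word begins a sentence
    keep_next = False    # a KEEP marker precedes the next element
    for el in elements:
        if el is KEEP:
            keep_next = True
            continue
        if not keep_next and at_start and el[:1].isalpha():
            out.append(el[:1].upper() + el[1:])
        else:
            out.append(el)
        keep_next = False
        at_start = el in SENTENCE_END
    return out

print(apply_capitalization(
    ["what", "means", "this", ",", "my", "lord", "?", "speak", "."]))
# ['What', 'means', 'this', ',', 'my', 'lord', '?', 'Speak', '.']
```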
And, of course, there would have to be an escape mechanism to ordinary
character-based encoding, because such a system will never contain every
word one might wish to encode, even just for English (think proper names
again), and because "standard" punctuation and spacing rules don't
always apply. This is similar to the situation with sign languages,
which are word- and phrase-based but allow a fallback to fingerspelling.
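A minimal sketch of such an escape mechanism, in Python, might look like
the following. Everything here is an assumption made for illustration:
the seven-entry WORDS table, the choice of 0xFF as the ESCAPE code, and
the one-byte length field that precedes the spelled-out characters.

```python
ESCAPE = 0xFF  # hypothetical code reserved to introduce a spelled-out element

# Toy dictionary for this sketch; a real one would be far larger.
WORDS = ["what", "means", "this", ",", "my", "lord", "?"]
CODE_OF = {w: i for i, w in enumerate(WORDS)}

def encode(elements):
    """Encode known elements as one byte each; spell out the rest in UTF-8."""
    out = bytearray()
    for el in elements:
        code = CODE_OF.get(el)
        if code is not None:
            out.append(code)
        else:
            raw = el.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("element too long for a one-byte length field")
            out += bytes([ESCAPE, len(raw)]) + raw
    return bytes(out)

def decode(data):
    """Invert encode(): look up one-byte codes, read escaped spans literally."""
    elements = []
    i = 0
    while i < len(data):
        if data[i] == ESCAPE:
            n = data[i + 1]
            elements.append(data[i + 2:i + 2 + n].decode("utf-8"))
            i += 2 + n
        else:
            elements.append(WORDS[data[i]])
            i += 1
    return elements

msg = ["what", "means", "this", ",", "my", "lord", "Fullerton", "?"]
packed = encode(msg)
print(len(packed))            # 18
print(decode(packed) == msg)  # True
```

The seven dictionary elements cost one byte each, while the escaped
"Fullerton" costs eleven: the escape code, the length, and nine letters.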
None of this, however interesting it may be, has anything to do with
Unicode. Unicode is a system for encoding characters, not words or
pictures or ideas.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/