From: Jon Hanna (firstname.lastname@example.org)
Date: Mon Jan 18 2010 - 10:05:48 CST
William_J_G Overington wrote:
> Hopefully one day some localizable sentences, not necessarily those used by me in the experiments, will be encoded into regular Unicode.
Hopefully indeed! But let us not rest there.
Obviously such an approach under-reaches and lacks ambition. Why settle
for some sentences when we can have all sentences?
Now, the number of sentences in practical use is bounded; eventually
even the worst culprit of over-long sentences (mea culpa, mea maxima
culpa) at some point runs out of breath and feels the need to add a
full-stop (or sentence break indicator of whatever script you're using
today) and then begin on another sentence, or perhaps a new paragraph,
or maybe they've finished the whole thing. Phew, knew that full-stop was
going to happen eventually.
Theoretically, though, the length of a sentence is unbounded. We know, of
course, that we can create an English sentence with any number (n > 0) of
occurrences of the word "buffalo" and get a grammatically correct
sentence, albeit an ambiguous one for higher values of n.
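To make the construction concrete, here is a minimal sketch in Python. The function name `buffalo_sentence` is my own invention for illustration, and the output is left in lower case with no terminator.

```python
def buffalo_sentence(n):
    """Return the sentence made of n occurrences of "buffalo" (n > 0)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # One sentence per positive integer, hence infinitely many sentences.
    return " ".join(["buffalo"] * n)
```

Calling `buffalo_sentence(3)` gives `buffalo buffalo buffalo`; whether a reader can parse the result for large n is a separate matter.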
This alone gives us an infinite number of sentences where no word is
used other than "buffalo". In languages where there are no homophones
meaning "bison", "intimidate" and "the second most populous city in the
state of New York", then this doesn't hold, but we can still use this
feature of English to deal with recording this particular infinite
subset of the infinite set of grammatically valid sentences (itself a
subset of the infinite set of sentences) and then use the resulting
encoding as a key to localised resources for other languages.
Since the set is infinite, its elements cannot be represented by a
fixed-size unit, only by variable-sized strings, which requires us to do
one of the following:
1. Prefix the entire sentence with an indicator of length.
2. Prefix the entire sentence with an indicator of length which also
does the job of containing some of the following data.
3. End the sentence with an indicator that the end has been reached.
The third of these is self-correcting (if we miss it we do not
mis-interpret data as length and vice-versa, and at the end of the next
sentence we have two sentences corrupted into one over-long sentence,
followed by correct data rather than having corrupted the entire
I propose we use U+002E.
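The difference between the first and third options can be sketched roughly as follows. This is a toy illustration in Python, not a specification: the function names and the two-digit decimal length prefix are assumptions of mine, chosen only to keep the example short (so it only handles sentences shorter than 100 characters).

```python
TERMINATOR = "\u002e"  # U+002E FULL STOP, as proposed above


def frame_terminated(sentences):
    """Option 3: end each sentence with the terminator."""
    return "".join(s + TERMINATOR for s in sentences)


def parse_terminated(stream):
    """Split the stream at each terminator, dropping the empty tail."""
    return [part for part in stream.split(TERMINATOR) if part]


def frame_length_prefixed(sentences):
    """Option 1 (toy version): prefix each sentence with a two-digit length."""
    return "".join(f"{len(s):02d}{s}" for s in sentences)


def parse_length_prefixed(stream):
    """Read length, then that many characters, until the stream ends or breaks."""
    out, i = [], 0
    while i < len(stream):
        header = stream[i:i + 2]
        if not header.isdigit():
            break  # data was mistaken for a length; the rest is unrecoverable
        n = int(header)
        out.append(stream[i + 2:i + 2 + n])
        i += 2 + n
    return out


sentences = ["buffalo", "buffalo buffalo", "buffalo buffalo buffalo"]

# Lose the first terminator: two sentences merge, the third survives intact.
lost_terminator = frame_terminated(sentences).replace(TERMINATOR, "", 1)
recovered_terminated = parse_terminated(lost_terminator)

# Corrupt the first length (07 becomes 05): everything after it is lost.
bad_length = "05" + frame_length_prefixed(sentences)[2:]
recovered_prefixed = parse_length_prefixed(bad_length)
```

With the terminator, the damage stays local: `recovered_terminated` holds one over-long sentence followed by the correct third one. With the corrupted length prefix, `recovered_prefixed` holds only a fragment of the first sentence.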
At this point we have a system adequately capable of recording and
reproducing any of the infinite set of sentences that consist entirely
of the word "buffalo".
Now we want to extend this to other sentences; let us start with those
which are similarly simple. /people( people)+/ and /police( police)+/
both describe infinite sets of grammatically-valid sentences, and can be
easily added in like manner.
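For anyone who wants to try the two patterns, a small sketch using Python's standard `re` module follows. The helper names `repeated_word_pattern` and `matching_word` are hypothetical. Note that the patterns as written require at least two occurrences of the word, unlike the "buffalo" case, where n > 0 suffices.

```python
import re


def repeated_word_pattern(word):
    """Build the equivalent of /word( word)+/ for the given word."""
    escaped = re.escape(word)
    return re.compile(rf"{escaped}(?: {escaped})+")


# One compiled pattern per word mentioned in the text.
PATTERNS = {w: repeated_word_pattern(w) for w in ("people", "police")}


def matching_word(sentence):
    """Return the word whose pattern matches the whole sentence, or None."""
    for word, pattern in PATTERNS.items():
        if pattern.fullmatch(sentence):
            return word
    return None
```

For example, `matching_word("police police police")` returns `"police"`, while a mixed sentence such as `"people police"` matches neither pattern and returns `None`.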
At this point, it becomes clear that the number of words for which we
are doing this is itself large. Fish and smelt both work, and who knows
how many others we will find? We need some way to reduce this set in a
Notably, there is a certain similarity between the first sound of both
"police" and "people". This sound is also repeated later in "people". If
we pick a token to encode this, we can reduce the set of tokens we need.
I propose U+0070.
Taken in like manner we can then work on other sounds to produce a way
of representing them in non-audible formats.
I'm sure this can't be done perfectly, but we can probably agree to some
conventions and live with any persisting disagreements or legacies that
will result from changes in language.
Now for the final task: internationalising all of these strings. Our
encoding is based on just one language, but that's okay; we can take
these language-dependent data and use each such datum as a key, along
with an indicator of the language we are interested in, to retrieve a
similar variable-length piece of data.
Extended in this way, every feasible sentence can be encoded in a
language-dependent way, and then used as a key to versions in other
languages.
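The lookup described here could be sketched as a table keyed by the encoded sentence and a language indicator. Everything in this Python example is an assumption for illustration: the name `CATALOGUE`, the use of short language tags, and the placeholder values, which are deliberately not real translations.

```python
# Hypothetical catalogue: (key sentence, language tag) -> localised text.
# The values are placeholders, not real translations.
CATALOGUE = {
    ("police police police.", "fr"): "<French text for this key>",
    ("police police police.", "de"): "<German text for this key>",
}


def localise(key_sentence, language):
    """Return the localised text, falling back to the key sentence itself."""
    return CATALOGUE.get((key_sentence, language), key_sentence)
```

Falling back to the key when no entry exists is a design choice of this sketch: the reader at least gets the language-dependent original rather than nothing.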
See, by just extending your idea sensibly, we can move forward to
something that's of a level of technology we should expect in the 27th