Re: Localizable sentences are in a PUA for development purposes (from Re: off-topic discussions)

From: Jon Hanna (jon@hackcraft.net)
Date: Mon Jan 18 2010 - 10:05:48 CST


    William_J_G Overington wrote:
    > Hopefully one day some localizable sentences, not necessarily those used by me in the experiments, will be encoded into regular Unicode.

    Hopefully indeed! But let us not rest there.

    Obviously such an approach under-reaches and lacks ambition. Why settle
    for some sentences when we can have all sentences?

    Now, the number of sentences in practical use is bounded; eventually
    even the worst culprit of over-long sentences (mea culpa, mea maxima
    culpa) at some point runs out of breath and feels the need to add a
    full-stop (or sentence break indicator of whatever script you're using
    today) and then begin on another sentence, or perhaps a new paragraph,
    or maybe they've finished the whole thing. Phew, knew that full-stop was
    going to happen eventually.

    Theoretically though, the length of a sentence is boundless. We know of
    course that we can create an English sentence with any number (n > 0) of
    occurrences of the word "buffalo" and end up with a grammatically correct
    sentence, albeit an ambiguous one for higher values of n.

    This alone gives us an infinite number of sentences where no word is
    used other than "buffalo". In languages where there are no homophones
    meaning "bison", "intimidate" and "the second most populous city in the
    state of New York", then this doesn't hold, but we can still use this
    feature of English to deal with recording this particular infinite
    subset of the infinite set of grammatically valid sentences (itself a
    subset of the infinite set of sentences) and then use the resulting
    encoding as a key to localised resources for other languages.

    Being infinite, elements of this set cannot be represented by a
    fixed-size unit, but by variable-sized strings, which requires us to do
    one of the following:

    1. Prefix the entire sentence with an indicator of length.
    2. Prefix the entire sentence with an indicator of length which also
    does the job of containing some of the following data.
    3. End the sentence with an indicator that the end has been reached.

    The third of these is self-correcting: if we miss a terminator we do
    not misinterpret data as length or vice versa. At worst, two sentences
    are merged into one over-long sentence, and everything after the next
    terminator is read correctly, rather than the entire data-stream being
    corrupted.

    I propose we use U+002E.
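
    A minimal sketch of that terminator scheme, assuming Python and a
    helper name (split_sentences) that is purely illustrative. It shows the
    self-correcting property: losing one terminator merges two sentences
    but leaves the rest of the stream readable.

```python
from typing import List

TERMINATOR = "\u002e"  # U+002E FULL STOP, the proposed end-of-sentence marker


def split_sentences(stream: str) -> List[str]:
    """Split a stream into sentences, each one ended by TERMINATOR.

    Hypothetical helper for illustration; trailing data with no
    terminator is treated as incomplete and dropped.
    """
    sentences = []
    current = []
    for ch in stream:
        if ch == TERMINATOR:
            sentences.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    return sentences


intact = "Buffalo buffalo. Buffalo buffalo buffalo. Buffalo."
# Simulate losing the first terminator in transit.
damaged = intact.replace(TERMINATOR, "", 1)

print(split_sentences(intact))
print(split_sentences(damaged))
```

    With the damaged stream, the first two sentences fuse into one
    over-long sentence, while the final sentence is still recovered intact.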

    At this point we have a system adequately capable of recording and
    reproducing any of the infinite set of sentences that consist entirely
    of the word "buffalo".

    Now we want to extend this to other sentences; let us start with those
    which are similarly simple. /people( people)+/ and /police( police)+/
    both describe infinite sets of grammatically-valid sentences, and can be
    easily added in like manner.
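
    The patterns above can be sketched with Python's re module. The
    "people" and "police" expressions are taken from the text; the
    "buffalo" expression is my reading of "any number (n > 0)", and the
    classify helper is a made-up name for illustration only.

```python
import re
from typing import Optional

# One pattern per word. "buffalo" allows a single occurrence (n > 0);
# "people" and "police" follow the expressions given in the text, which
# require at least two occurrences.
PATTERNS = {
    "buffalo": re.compile(r"buffalo( buffalo)*"),
    "people": re.compile(r"people( people)+"),
    "police": re.compile(r"police( police)+"),
}


def classify(sentence: str) -> Optional[str]:
    """Return the repeated word if the sentence belongs to one of the sets."""
    # Drop the U+002E terminator and ignore case before matching.
    body = sentence.rstrip(".").lower()
    for word, pattern in PATTERNS.items():
        if pattern.fullmatch(body):
            return word
    return None


print(classify("Police police police."))
print(classify("Buffalo."))
print(classify("Fish fish."))
```

    Adding "fish" or "smelt" means one more dictionary entry each, which
    is exactly the growth problem the next paragraph complains about.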

    At this point, it becomes clear that the number of words for which we
    are doing this is itself large. Fish and smelt both work, and who knows
    how many others we will find? We need some way to reduce this set in a
    manageable way.

    Notably, there is a certain similarity between the first sound of both
    "police" and "people". This sound is also repeated later in "people". If
    we pick a token to encode this, we can reduce the set of tokens we need.
    I propose U+0070.

    Taken in like manner we can then work on other sounds to produce a way
    of representing them in non-audible formats.

    I'm sure this can't be done perfectly, but we can probably agree to some
    conventions and live with any persisting disagreements or legacies that
    will result from changes in language.

    Now for the final task: internationalising all of these strings. Our
    encoding is based on just one language, but that's okay; we can take
    these language-dependent data and use each such datum as a key, along
    with an indicator of the language we are interested in, to retrieve a
    similar variable-length piece of data.
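
    As a rough sketch of that lookup, assuming Python: the catalogue
    contents below are placeholders, not real translations, and both the
    localise name and the fall-back-to-the-key behaviour are my
    assumptions rather than anything specified here.

```python
from typing import Dict, Tuple

# (sentence in the key language, language tag) -> localised sentence.
# The values are placeholders standing in for real translations.
CATALOGUE: Dict[Tuple[str, str], str] = {
    ("Buffalo buffalo.", "fr"): "<French text for 'Buffalo buffalo.'>",
    ("Buffalo buffalo.", "de"): "<German text for 'Buffalo buffalo.'>",
    ("Police police police.", "fr"): "<French text for 'Police police police.'>",
}


def localise(sentence: str, language: str) -> str:
    """Look up a sentence by (key, language).

    Assumption: when no entry exists, the language-dependent key itself
    is returned unchanged.
    """
    return CATALOGUE.get((sentence, language), sentence)


print(localise("Buffalo buffalo.", "fr"))
print(localise("Buffalo buffalo.", "ga"))
```

    Both the key and the value are ordinary variable-length strings, so
    nothing beyond the existing character encoding is needed.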

    Extended, every sentence feasible can be encoded in a language-dependent
    way, and then used as a key to versions in other languages.

    See, by just extending your idea sensibly, we can move forward to
    something that's of a level of technology we should expect in the 27th
    Century (BCE).


