Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 15:51:01 CST

    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    > Now consider scanning forwards. We want to strip a beginning of a
    > string. For example the string is an IRC message prefixed with a
    > command and we want to take the message only for further processing.
    > We have found the end of the prefix and we want to produce a string
    > from this position to the end (a copy, since strings are immutable).

    None of those examples demonstrates the point: decoding IRC commands and
    similar things does not establish a need to encode large sets of texts.
    Your examples show applications that locally handle strings designed for
    computer languages.

    Texts in human languages, or even collections of person names or place
    names, are not like this: they show a much wider variety, yet offer huge
    potential for data compression (inherent in the phonology of human
    languages and their overall structure, but also due to the repetitive
    conventions spread throughout a text to make it easier to read and
    understand).

    Scanning a person name or human-language text backward may occasionally
    be needed locally, but such text has a strong forward directionality
    without which it does not make sense. The same goes for scanning such
    text from random positions: extracting random fragments that way invites
    many false interpretations of the text.
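
    There is also an encoding-level side to random positions: in a
    variable-width form like UTF-8, an arbitrary byte offset may fall inside
    a multibyte sequence, so any fragment extractor must first resynchronize
    to a character boundary. A minimal sketch (it fixes only the encoding
    boundary, of course, not the semantic problem just described):

        #include <stddef.h>

        /* Move a byte offset backward to the start of the UTF-8
         * character it falls within. Continuation bytes have the bit
         * pattern 10xxxxxx (0x80..0xBF), so step over them until a lead
         * byte (or the start of the buffer) is reached. */
        size_t utf8_resync(const unsigned char *buf, size_t pos)
        {
            while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
                pos--;
            return pos;
        }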

    Anyway, if you have a large database of texts to process or even to
    index, you will ultimately need to scan that text linearly from
    beginning to end at least once, if only to create an index for accessing
    it randomly later. You will still need to store the indexed text
    somewhere, and to maximize the performance and responsiveness of your
    application you will need to minimize its storage: that is where
    compression takes place. It does not change the semantics of the text;
    it is simply an optimization, and it does not prevent later access
    through a more easily parsable representation as stateless streams of
    characters, via surjective (sometimes bijective) converters between the
    compressed and uncompressed forms.
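
    As a sketch of such a converter pair, this is what a minimal version
    could look like with zlib's one-shot compress()/uncompress() calls; the
    function names and the simplified error handling are assumptions made
    for illustration. The compressed form goes into long-term storage, and
    decompression hands back the plain, stateless stream of characters for
    parsing:

        #include <stdlib.h>
        #include <string.h>
        #include <zlib.h>

        /* Compress a UTF-8 text for storage. On success, *out points to
         * a malloc'ed buffer and *outLen holds the compressed size. */
        int store_text(const char *utf8, unsigned char **out,
                       uLongf *outLen)
        {
            uLong srcLen = (uLong)strlen(utf8);
            *outLen = compressBound(srcLen);   /* worst-case output size */
            *out = malloc(*outLen);
            if (*out == NULL)
                return Z_MEM_ERROR;
            return compress(*out, outLen, (const Bytef *)utf8, srcLen);
        }

        /* Expand a stored text into dst; *dstLen must arrive holding the
         * original length (recorded at indexing time). The round trip
         * recovers the original text exactly. */
        int load_text(const unsigned char *stored, uLong storedLen,
                      char *dst, uLongf *dstLen)
        {
            return uncompress((Bytef *)dst, dstLen, stored, storedLen);
        }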

    My conclusion: there is no "best" representation that fits all needs.
    Each representation has its merits in its own domain. The Unicode UTFs
    are excellent for local processing of texts of limited size, but they
    are not necessarily the best for long-term storage or for large text
    sets.

    And even for texts that will be accessed frequently, compressed schemes
    can still be worthwhile optimizations, even if those texts must be
    decompressed each time they are needed. I am clearly against "one scheme
    fits all needs" arguments, even if you think that UTF-32 is the only
    viable long-term solution.
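
    For instance, reusing the hypothetical load_text() from the sketch
    above, an application can keep only the compressed form and pay one
    decompression per access; a small cache of recently expanded texts would
    amortize that cost without changing what is stored:

        #include <stdio.h>
        #include <stdlib.h>
        #include <zlib.h>

        /* Declared in the storage sketch above (hypothetical name). */
        int load_text(const unsigned char *stored, uLong storedLen,
                      char *dst, uLongf *dstLen);

        /* Decompress-on-access: the text lives only in compressed form,
         * and each consultation pays one decompression. */
        int process_stored(const unsigned char *stored, uLong storedLen,
                           uLong originalLen)
        {
            char *buf = malloc(originalLen + 1);
            if (buf == NULL)
                return -1;
            uLongf len = originalLen;
            if (load_text(stored, storedLen, buf, &len) != Z_OK) {
                free(buf);
                return -1;
            }
            buf[len] = '\0';
            puts(buf);         /* hand the expanded stream to the parser */
            free(buf);
            return 0;
        }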


