From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 15:51:01 CST
From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
> Now consider scanning forwards. We want to strip a beginning of a
> string. For example the string is an irc message prefixed with a
> command and we want to take the message only for further processing.
> We have found the end of the prefix and we want to produce a string
> from this position to the end (a copy, since strings are immutable).
None of these examples is a demonstration: decoding IRC commands and similar
tasks does not establish a need to encode large sets of text. Your examples
show applications that handle, locally, short strings produced for computer
languages.
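For illustration, the kind of local, forward-scanning operation you describe
amounts to something like this minimal Python sketch (assuming, purely as an
example, a simplified line format "COMMAND :message"):

    def strip_command(line: str) -> str:
        # Find the separator between the command prefix and the message body.
        sep = line.find(" :")
        if sep == -1:
            return line
        # Slicing from this position to the end produces a new string
        # (a copy, since Python strings are immutable).
        return line[sep + 2:]

    print(strip_command("PRIVMSG #test :hello"))  # prints "hello"
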
Texts in human languages, or even collections of person or place names, are
not like this: they show far greater variety, yet they also offer huge
opportunities for data compression (inherent in the phonology and overall
structure of human languages, and also in the repetitive conventions spread
throughout a text to make it easier to read and understand).
Scanning a person name or human-language text backward may occasionally be
needed locally, but such text has a strong forward directionality without
which it does not make sense. The same applies to scanning such text from
random positions: extracting arbitrary fragments in this way invites many
false interpretations of the text.
In any case, if you have a large database of texts to process or even to
index, you will ultimately need to scan each text linearly from beginning to
end, if only to build an index for later random access. You will still need
to store the indexed text somewhere, and to maximize the performance and
responsiveness of your application you will want to minimize that storage:
this is where compression comes in. Compression does not change or remove the
semantics of the text; it is an optimization, and it does not prevent later
access through a more easily parsable representation as a stateless stream of
characters, via surjective (sometimes bijective) converters between the
compressed and uncompressed forms.
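As a minimal sketch of that workflow (assuming zlib as the compressed storage
form and a trivial word index, both chosen here only for illustration):

    import zlib
    from collections import defaultdict

    index = defaultdict(set)   # word -> ids of documents containing it
    storage = {}               # document id -> compressed text

    def add_document(doc_id, text):
        # One linear forward scan over the text to build the index.
        for word in text.split():
            index[word].add(doc_id)
        # The text itself is stored compressed to minimize storage.
        storage[doc_id] = zlib.compress(text.encode("utf-8"))

    def get_document(doc_id):
        # Decompression is a lossless (bijective) converter back to a plain,
        # stateless stream of characters for later random access.
        return zlib.decompress(storage[doc_id]).decode("utf-8")
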
My conclusion: there is no "best" representation to fit all needs. Each
representation has its merits in its own domain. The Unicode UTFs are
excellent only for local processing of limited texts; they are not
necessarily the best choice for long-term storage or for large text sets.
Even for texts that will be accessed frequently, compressed schemes can still
be worthwhile optimizations, even if those texts must be decompressed each
time they are needed. I am clearly against arguments that "one scheme fits
all needs", even if you think UTF-32 is the only viable long-term solution.