Re: unicode string representation in PL

From: spir (denis.spir@free.fr)
Date: Wed Jan 20 2010 - 02:42:37 CST

    On Tue, 19 Jan 2010 10:01:07 -0800
    Mark Davis ☕ <mark@macchiato.com> wrote:

    > Most programming languages represent a string as a Unicode String (see
    > http://unicode.org/glossary/#Unicode_String), that is, not guaranteed to be
    > either valid UTF (such as UTF-8) nor in a valid normalization form.
    >
    > While it is certainly possible to write a string class that has both UTF and
    > normalization as invariants, it is typically less expensive to do a
    > verification/conversion in the circumstances that demand it. For a
    > high-level scripting language, where the overhead of always maintaining
    > those invariants is lost in the noise, it might be a reasonable choice (I
    > don't know enough about Lua to say).

    Maybe. I'll try to find time to measure the overhead of a custom character type and of systematic normalisation.
    (Lua: I would say it falls into the general category of dynamic languages, indeed, but it is much more efficient and allows lower-level customization.)
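
For illustration, a string type that keeps both invariants could look roughly like the sketch below. It is written in Python rather than Lua, the class name NormalisedText is hypothetical, and NFC is only an assumed choice of normalisation form. Validation here covers only byte input decoded as UTF-8.

```python
import unicodedata


class NormalisedText:
    """Hypothetical wrapper whose content is always held in NFC."""

    __slots__ = ("_s",)

    def __init__(self, raw):
        if isinstance(raw, bytes):
            # Raises UnicodeDecodeError if the bytes are not well-formed UTF-8.
            raw = raw.decode("utf-8")
        # Normalise once, on construction, so every later operation sees NFC.
        self._s = unicodedata.normalize("NFC", raw)

    def __add__(self, other):
        # Concatenation can bring a base and a combining mark together,
        # so the result goes through the constructor and is renormalised.
        return NormalisedText(self._s + str(other))

    def __len__(self):
        return len(self._s)

    def __str__(self):
        return self._s


joined = NormalisedText("e") + "\u0301"   # 'e' + COMBINING ACUTE ACCENT
print(len(joined))                        # 1: composed into U+00E9
```

The cost to measure is the normalize call on every construction and concatenation, which is exactly the overhead discussed above.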

    > The key issue for any of these questions is indexing:
    >
    > - What do I get when I ask for the nth character?
    > - What do I get when I ask for a substring from index Start to index
    > Limit?
    > - What do I get when I ask for the next character (iteration)? (The last
    > is particularly important, since in typical programs most character access
    > is sequential.)

    size --> in characters
    indexing --> character # n
    slicing --> characters # n to m
    iteration --> on characters

    This is precisely my goal: that basic string methods operate at the character level (not the byte or code unit or whatever, which makes no sense to me). The same goes for more sophisticated operations like finding, replacing, etc.
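
A minimal sketch of what character-level size, indexing, slicing and iteration could mean, in Python. It assumes a deliberately simplified notion of "character": a base code point plus any combining marks that follow it. This is not full grapheme cluster segmentation as defined in UAX #29, and the function name characters is hypothetical.

```python
import unicodedata


def characters(text):
    """Split text into simplified characters (base + combining marks)."""
    clusters = []
    for cp in text:
        if clusters and unicodedata.combining(cp) != 0:
            # A combining mark attaches to the preceding character.
            clusters[-1] += cp
        else:
            clusters.append(cp)
    return clusters


decomposed = "cafe\u0301s"       # 'e' followed by U+0301 COMBINING ACUTE ACCENT
chars = characters(decomposed)

print(len(decomposed))           # 6: size in code points
print(len(chars))                # 5: size in characters
print(chars[3] == "e\u0301")     # True: indexing returns the whole character
print("".join(chars[1:4]))       # slicing keeps 'e' and its accent together
```

Iteration over chars yields one item per character, so finding and replacing can be built on the same list without ever splitting a base from its marks.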

    > What Java and ICU do is typical: have all of the indexing be by code unit;
    > so asking for the substring from 3 to 6 gets the code units 3, 4, and 5. If
    > you want to guarantee that you are getting complete code points, you call
    > a routine to find the boundaries, or iterate; if you want to guarantee that
    > you are getting complete other segments (like grapheme cluster, word, line,
    > etc), then you use different routines for finding boundaries or iterating.
    > (BreakIterator) as substrings.
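
The code unit indexing described in the quote can be imitated in Python as a sketch. The helpers utf16_units and is_boundary are hypothetical names, not Java or ICU APIs; they only show why a boundary check is needed when indices count UTF-16 code units.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_boundary(units, i):
    """True unless index i falls between a high and a low surrogate."""
    if i <= 0 or i >= len(units):
        return True
    high = 0xD800 <= units[i - 1] <= 0xDBFF
    low = 0xDC00 <= units[i] <= 0xDFFF
    return not (high and low)


text = "a\U0001D11Eb"            # U+1D11E needs two UTF-16 code units
units = utf16_units(text)

print(len(units))                # 4 code units for 3 code points
print(is_boundary(units, 1))     # True: index 1 starts the surrogate pair
print(is_boundary(units, 2))     # False: index 2 is inside the pair
```

A substring taken from code unit 0 to 2 would therefore end with a lone high surrogate, which is the case the boundary routines exist to avoid.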

    Higher-level units like words, lines, etc. are indeed meaningful. But I don't even understand the sense of lower-level ones (byte or code unit) for text processing, from the programmer's point of view.

    > If you try to 'normalize on the fly', which seems to be your third option,
    > you probably don't want the answer for the nth question to be different
    > before and after the (hidden) normalization.
    > So the normalization would need
    > to be triggered by almost any method call on the string class.

    Precisely: in the 3rd option, this would be a programmer choice. If the source is known to be "secure" (no annoying characters such as decomposed ones), then the basic code-character mapping applies. There is no need for a more sophisticated character representation, nor for normalisation. String methods apply to sequences of simple items (codes), more or less as they do to ASCII byte-strings. What we need is codecs and a format for literals.
    This is where my current implementation stops ;-)
    In this case of simpler strings, normalisation should not alter anything. Conversely, if normalisation does change something (e.g. the result of indexing), then it is needed, and the programmer should know that and request it.
    I wonder whether this is sensible: whether the performance gain (from not normalising when it is not needed) is large enough to justify such complications (both in the implementation and for the programmer).
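
The opt-in idea can be sketched in Python as follows. The names needs_normalisation and nth are hypothetical, and NFC is an assumed default form; the point is only to show that indexing gives the same answer on "simple" strings and a different one on decomposed input.

```python
import unicodedata


def needs_normalisation(text, form="NFC"):
    """True if normalising text to the given form would change it."""
    return unicodedata.normalize(form, text) != text


def nth(text, n, normalise=False, form="NFC"):
    """Return item n, normalising first only if the caller asks for it."""
    if normalise:
        text = unicodedata.normalize(form, text)
    return text[n]


simple = "caf\u00e9s"            # precomposed U+00E9
decomposed = "cafe\u0301s"       # 'e' + U+0301 COMBINING ACUTE ACCENT

print(needs_normalisation(simple))       # False: plain indexing is safe
print(needs_normalisation(decomposed))   # True: the programmer should opt in

# Without normalisation, index 4 is the bare combining accent;
# with it, index 4 is the 's' that a reader would expect.
print(nth(decomposed, 4) == "\u0301")
print(nth(decomposed, 4, normalise=True))
```

The check itself costs a full normalisation pass here, so in practice it would only pay off if run once per source rather than once per call.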

    > > why is the de facto standard ICU implementation utf-8 based?
    >
    > It isn't. ICU is UTF-16 based, although it has an increasing number of
    > methods that are optimized to handle both.

    Sorry for the noise; I need to watch that more closely.
     
    > Mark

    Denis


