Re: unicode string representation in PL

From: Mark Davis ☕ (mark@macchiato.com)
Date: Tue Jan 19 2010 - 12:01:07 CST


    Most programming languages represent a string as a Unicode String (see
    http://unicode.org/glossary/#Unicode_String), that is, a sequence of code
    units that is neither guaranteed to be valid UTF (such as well-formed UTF-8
    or UTF-16) nor guaranteed to be in any particular normalization form.
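
    For instance, a minimal Java sketch (the sample strings are purely
    illustrative) of what this means in practice: a Java String happily holds
    an ill-formed sequence and an unnormalized sequence without any error.

        public class UnicodeStringDemo {
            public static void main(String[] args) {
                // A lone high surrogate is legal in a Java String,
                // but the result is not well-formed UTF-16.
                String illFormed = "abc\uD800def";
                System.out.println(illFormed.length()); // 7, no error

                // 'e' followed by COMBINING ACUTE ACCENT: valid, but not NFC.
                String unnormalized = "e\u0301";
                System.out.println(unnormalized.length()); // 2, still no error
            }
        }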

    While it is certainly possible to write a string class that maintains both
    UTF validity and a normalization form as invariants, it is typically less
    expensive to do the verification/conversion in the circumstances that
    demand it. For a high-level scripting language, where the overhead of
    always maintaining those invariants is lost in the noise, it might be a
    reasonable choice (I don't know enough about Lua to say).
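
    As a rough sketch of that "verify/convert only where needed" approach (the
    helper names below are my own, not from any particular library), one might
    check well-formedness or normalize only at the point where it matters, for
    example at comparison time:

        import java.text.Normalizer;

        public class OnDemandChecks {
            // Verify UTF-16 well-formedness on demand rather than as an invariant.
            static boolean isWellFormedUtf16(String s) {
                for (int i = 0; i < s.length(); i++) {
                    char c = s.charAt(i);
                    if (Character.isHighSurrogate(c)) {
                        if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                            return false; // unpaired high surrogate
                        }
                        i++; // skip the paired low surrogate
                    } else if (Character.isLowSurrogate(c)) {
                        return false; // low surrogate with no preceding high surrogate
                    }
                }
                return true;
            }

            // Normalize only at the point of comparison.
            static boolean equalsNormalized(String a, String b) {
                return Normalizer.normalize(a, Normalizer.Form.NFC)
                        .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
            }

            public static void main(String[] args) {
                System.out.println(isWellFormedUtf16("abc\uD800def"));     // false
                System.out.println(equalsNormalized("e\u0301", "\u00E9")); // true
            }
        }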

    The key issue for any of these questions is indexing:

       - What do I get when I ask for the nth character?
       - What do I get when I ask for a substring from index Start to index
       Limit?
       - What do I get when I ask for the next character (iteration)?

    The last is particularly important, since in typical programs most
    character access is sequential.

    What Java and ICU do is typical: all of the indexing is by code unit, so
    asking for the substring from 3 to 6 gets the code units 3, 4, and 5. If
    you want to guarantee that you are getting complete code points, you call
    a routine to find the boundaries, or iterate over code points; if you want
    to guarantee that you are getting complete segments of other kinds (like
    grapheme clusters, words, lines, etc.), then you use different routines
    (BreakIterator) to find the boundaries or to iterate, and take substrings
    at those boundaries.
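
    For example, here is a small Java sketch of those three levels of access
    (the sample string, mixing a combining accent with a supplementary-plane
    character, is purely illustrative):

        import java.text.BreakIterator;

        public class IndexingDemo {
            public static void main(String[] args) {
                // "a", then 'e' + COMBINING ACUTE ACCENT, then U+1D11E (MUSICAL SYMBOL G CLEF)
                String s = "a" + "e\u0301" + new String(Character.toChars(0x1D11E));

                // Code-unit indexing: lengths and substrings count UTF-16 code units,
                // so a substring can split the surrogate pair of U+1D11E.
                System.out.println(s.length());                      // 5 code units
                System.out.println(s.codePointCount(0, s.length())); // 4 code points

                // Complete code points: advance by Character.charCount(cp).
                for (int i = 0; i < s.length(); ) {
                    int cp = s.codePointAt(i);
                    System.out.printf("U+%04X%n", cp);
                    i += Character.charCount(cp);
                }

                // Complete grapheme clusters: use BreakIterator boundaries as substrings.
                BreakIterator it = BreakIterator.getCharacterInstance();
                it.setText(s);
                int start = it.first();
                for (int end = it.next(); end != BreakIterator.DONE;
                        start = end, end = it.next()) {
                    System.out.println("cluster: " + s.substring(start, end));
                }
            }
        }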

    If you try to 'normalize on the fly', which seems to be your third option,
    you probably don't want the answer to the nth-character question to be
    different before and after the (hidden) normalization. So the normalization
    would need to be triggered by almost any method call on the string class.
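
    To illustrate with the JDK's java.text.Normalizer (the example string is my
    own): NFC composes 'e' + U+0301 into a single code point, so both the length
    and the character found at a given index change across normalization.

        import java.text.Normalizer;

        public class NormalizationDemo {
            public static void main(String[] args) {
                String decomposed = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT
                String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

                System.out.println(decomposed.length()); // 2
                System.out.println(composed.length());   // 1
                System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
            }
        }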

    > why is the de facto standard ICU implementation utf-8 based?

    It isn't. ICU is UTF-16 based, although it has an increasing number of
    methods that are optimized to handle both UTF-16 and UTF-8.
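
    For instance, assuming a current ICU4J (the com.ibm.icu packages) is on the
    classpath, its API takes and returns ordinary Java strings, i.e. UTF-16,
    with no conversion to UTF-8 involved; a minimal sketch:

        import com.ibm.icu.text.Normalizer2;

        public class IcuUtf16Demo {
            public static void main(String[] args) {
                // ICU4J operates directly on Java's UTF-16 String values.
                Normalizer2 nfc = Normalizer2.getNFCInstance();
                System.out.println(nfc.normalize("e\u0301").length()); // 1
            }
        }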

    Mark

    On Tue, Jan 19, 2010 at 06:20, spir <denis.spir@free.fr> wrote:

    > Hello,
    >
    >
    >
    > New to the list, half new to the world of unicode; an amateur programmer
    > and a lover of language design. I hope the following is not too trivial or
    > even stupid for you --anyway, I take the risk ;-).
    > I'm currently trying to figure out what could/should be a
    > programmer-friendly library for unicode strings in a PL. I'm aware that this
    > question is rather difficult in the case of unicode, and partly subjective
    > too. Anyway, for me this means a programmer could basically manipulate
    > strings just like legacy ASCII strings, with complications only coming from
    > the additional power of unicode and the complications of the standard(s).
    > My aim is rather to explore the topic, play with it, and better understand
    > unicode than to build a production tool. Performance is not the main choice
    > criterion.
    >
    >
    >
    > I have an implementation of a basic unicode string type (in Lua) in which
    > strings are represented as sequences of characters, themselves represented
    > as codes. Works fine, with utf8 decode/encode for test cases.
    > Now, the obvious limitation is due to multiple representations of a single
    > abstract character in the standard. This makes simple character/substring
    > comparison inaccurate in the general case, along with all methods that
    > depend on compare routines: find, replace, sort...
    > In short, this set of issues is addressed by normalisation. But
    > normalisation in turn does not allow such a simple representation of
    > strings. So, my first topic of discussion is the proper representation of
    > unicode strings. Below some possible options:
    >
    > -1- code sequence
    > This option considers that the issues addressed by normal forms do not
    > belong to a general-purpose library. This point of view could be held
    > because:
    > (1) a majority of applications may not need normalisation, for every
    > character would be represented by a single code (eg legacy apps ported to
    > unicode)
    > (2) applications that need normalisation also have to cope with additional
    > issues anyway, such as proper rendering (eg an app for IPA)
    > Indeed, this is not a unicode-compliant option, but some aspects of the
    > unicode standard are arguably debatable.
    >
    > -2- character sequence
    > The opposite option may be to design a kind of character type able to
    > represent any possible character, represented by a single or multiple
    > codes, possibly normalised; whatever the "concrete" (sic!) form it takes on
    > the implementation side (conceptually, a character may be represented as a
    > nested sequence of codes, that would often be a singleton).
    > The advantage is simplicity: a string remains a sequence of characters. The
    > obvious drawback is a possibly big overhead at string creation(*); and undue
    > overhead for every string processing routine in use cases where simple codes
    > would have done the job.
    >
    > -3- both
    > Have two forms, and use the second one only when the user requires it. This
    > switch may be triggered when the user asks for normalisation, possibly via a
    > config parameter. In this way the actual representation of a given string
    > can be more or less transparent, because on the user side everything works
    > like a character sequence.
    > This is possibly the most efficient choice, but I dislike it for reasons I
    > can hardly explain. Mainly, everything is more complicated, including double
    > implementation of numerous methods.
    >
    >
    >
    > All three options are based on the idea that a string is a sequence of
    > items that *are supposed to* (unambiguously) represent abstract characters.
    > This would be true in numerous cases, maybe the majority. With the first
    > option, applications are required to cope with the fact that this assertion
    > does not hold in the general case. The library may help eg by providing a
    > normalization routine, possibly a consistent compare func; not a general
    > string representation coping with this issue.
    >
    >
    >
    > -4- utf-8 based
    > This is a format I have never read about. The idea is to use the fact that a
    > utf-8 formatted abstract character is already a sequence of octets. So,
    > characters that are represented with multiple codes only require longer
    > utf-8 sequences. Conceptually, this means utf-8 source strings must first be
    > analysed to build nested sequences representing characters; but need not be
    > decoded; conversely, encoding to utf-8 only requires flattening back the
    > string. Obviously, this form only is advantageous when most sources are
    > utf-8 formatted. Normalisation issues indeed remain.
    >
    >
    >
    > This leads to a first side-question: why is the de facto standard ICU
    > implementation utf-8 based? From my point of view, there are 3 relevant
    > distinct concepts:
    > * character set, mapping abstract characters to codes
    > * character (encoding) format for saving/transfer
    > * character representation, in a programming language
    > Legacy character sets such as ASCII and numerous single-byte ones allow a
    > simple equivalence between these 3 forms. This indeed is not possible for
    > unicode, so that a choice must be made for PL representation:
    > a- Make it equivalent to an encoding format, namely utf-8.
    > b- Make it equivalent to the unicode character set.
    > c- Choose a better-suited representation for string processing.
    >
    > IMO, option a as used by ICU does not make sense. A format is designed for
    > practical/efficient saving or transfer of text, not for string processing in
    > programs. Also, utf-8 is only one of numerous formats. Choosing utf-8 as PL
    > string representation is imo like eg processing images directly in a
    > particular image file format, or sound in any of the numerous sound-saving
    > ones.
    > From the programming point of view, utf-8 is a kind of string
    > *serialisation* form: to be used for persistence, and more generally output,
    > of processed strings. But ICU designers certainly are more clever guys than
    > me: so I would like to know what is wrong in my pov -- and why.
    >
    > The choice between options b and c depends on the suitability of the unicode
    > character set as is for string processing in programs: this is more or less
    > what I currently try to figure out.
    >
    >
    >
    > Another side-issue is whether it makes sense to systematically normalise
    > strings. Like for the representation of strings as character sequences
    > (option 2), the obvious advantage is simplicity, and the drawback a possibly
    > big overhead.
    >
    >
    >
    > Denis
    >
    >
    > (*) I intend to do some timing measures to evaluate the overhead of
    > creating such characters compared to simple (integer) codes (in Lua). And
    > especially compare this to the machine-time required for basic decoding from
    > utf-8, and to normalization. Possibly the overhead is not that relevant. But
    > this also depends on actual implementation choice and on concrete PL
    > features.
    > ________________________________
    >
    > la vita e estrany
    >
    > http://spir.wikidot.com/
    >
    >
    >


