unicode string representation in PL

From: spir (denis.spir@free.fr)
Date: Tue Jan 19 2010 - 08:20:50 CST


    New to the list, half new to the world of unicode; an amateur programmer and a lover of language design. I hope the following is not too trivial or even stupid for you --anyway, I take the risk ;-).
    I'm currently trying to figure out what could/should be a programmer-friendly library for unicode strings in a PL. I'm aware that this question is rather difficult in the case of unicode, and partly subjective too. Anyway, for me this means a programmer could basically manipulate strings just like legacy ASCII strings, with complications only coming from the additional power of unicode and the complications of the standard(s).
    My aim is to explore the topic, play with it, and better understand unicode, rather than to build a production tool. Performance is not the main choice criterion.

    I have an implementation of a basic unicode string type (in Lua) in which strings are represented as sequences of characters, themselves represented as codes. Works fine, with utf8 decode/encode for test cases.
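    A minimal sketch of this representation, written in Python rather than Lua (the actual Lua implementation is not shown here, so the function names below are purely illustrative assumptions): a string is a plain list of integer codes, and utf-8 is only touched at the input/output boundary.

```python
from typing import List

def decode_utf8(data: bytes) -> List[int]:
    """Decode utf-8 bytes into a list of codes (integers)."""
    return [ord(ch) for ch in data.decode("utf-8")]

def encode_utf8(codes: List[int]) -> bytes:
    """Encode a list of codes back into utf-8 bytes."""
    return "".join(chr(code) for code in codes).encode("utf-8")

# "déjà" written with precomposed letters (U+00E9 and U+00E0).
source = "d\u00e9j\u00e0".encode("utf-8")
codes = decode_utf8(source)

# Round trip: decoding then encoding gives back the original bytes.
assert encode_utf8(codes) == source
```

    Here `codes` holds four integers while `source` holds six octets, since each accented letter takes two octets in utf-8.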
    Now, the obvious limitation is due to multiple representations of a single abstract character in the standards. This makes simple character/substring comparison inaccurate in the general case, along with all methods that depend on compare routines: find, replace, sort...
    In short, this set of issues is addressed by normalisation. But normalisation in turn does not allow such a simple representation of strings. So, my first topic of discussion is the proper representation of unicode strings. Below some possible options:
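    To make the comparison problem concrete, here is a small sketch in Python using the standard `unicodedata` module (the helper names `normalise` and `equal_chars` are illustrative assumptions, not part of any existing library). The same abstract character "é" is written once as a single code and once as a base letter plus a combining accent.

```python
import unicodedata

precomposed = [0x00E9]          # é as a single code
decomposed = [0x0065, 0x0301]   # e followed by COMBINING ACUTE ACCENT

def normalise(codes, form="NFC"):
    """Return the code sequence in the given normal form."""
    text = "".join(chr(code) for code in codes)
    return [ord(ch) for ch in unicodedata.normalize(form, text)]

def equal_chars(a, b):
    """Compare two code sequences after bringing both to the same normal form."""
    return normalise(a) == normalise(b)

# Naive comparison of the raw code sequences fails...
assert precomposed != decomposed
# ...while comparison after normalisation succeeds.
assert equal_chars(precomposed, decomposed)
```

    Any find, replace or sort routine built on the naive comparison would inherit the same failure.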

    -1- code sequence
    This option considers that the issues addressed by normal forms do not belong to a general-purpose library. This point of view could be held because:
    (1) a majority of applications may not need normalisation, for every character would be represented by a single code (eg legacy apps ported to unicode)
    (2) applications that need normalisation also have to cope with additional issues anyway, such as proper rendering (eg an app for IPA)
    Admittedly, this is not a unicode-compliant option, but some aspects of the unicode standard are arguably debatable.

    -2- character sequence
    The opposite option may be to design a kind of character type able to represent any possible character, represented by a single or multiple codes, possibly normalised; whatever the "concrete" (sic!) form it takes on the implementation side (conceptually, a character may be represented as a nested sequence of codes, that would often be a singleton).
    The advantage is simplicity: a string remains a sequence of characters. The obvious drawback is a possibly big overhead at string creation(*); and undue overhead for every string processing routine in use cases where simple codes would have done the job.
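    A rough sketch of such a nested representation in Python, assuming a deliberately simplified rule: a character is one code followed by every code whose canonical combining class is non-zero. Real character boundaries are more subtle than this (some marks have combining class 0), so the hypothetical `to_characters` below only illustrates the data structure.

```python
import unicodedata

def to_characters(codes):
    """Group codes into characters: a base code plus its combining marks.

    Each character is a tuple of codes; most of them are singletons.
    """
    chars = []
    for code in codes:
        is_mark = unicodedata.combining(chr(code)) != 0
        if chars and is_mark:
            # Attach the combining mark to the previous character.
            chars[-1] = chars[-1] + (code,)
        else:
            chars.append((code,))
    return chars

# "e" + combining acute accent, followed by "a": three codes, two characters.
chars = to_characters([0x0065, 0x0301, 0x0061])
assert chars == [(0x0065, 0x0301), (0x0061,)]
```

    The extra pass over every code at creation time, plus one tuple per character, is where the overhead mentioned above comes from.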

    -3- both
    Have two forms, and use the second one only when the user requires it. This switch may be triggered when the user asks for normalisation, possibly via a config parameter. The actual representation of a given string can then be more or less transparent, because on the user side everything works like a character sequence.
    This is possibly the most efficient choice, but I dislike it for reasons I can hardly explain. Mainly, everything is more complicated, including double implementation of numerous methods.
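    One possible shape for this hybrid, sketched in Python under stated assumptions: the class name `HybridString`, its methods, and the choice of NFD as the normal form are all hypothetical. The string starts as a plain code sequence and only builds the nested character form when asked.

```python
import unicodedata

class HybridString:
    """Code sequence by default; character sequence once required."""

    def __init__(self, codes):
        self.codes = list(codes)
        self.chars = None  # nested form, built on demand

    def require_characters(self):
        """Switch to the character form (decomposed, then grouped)."""
        if self.chars is None:
            text = "".join(chr(code) for code in self.codes)
            chars = []
            for ch in unicodedata.normalize("NFD", text):
                if chars and unicodedata.combining(ch) != 0:
                    chars[-1] += (ord(ch),)
                else:
                    chars.append((ord(ch),))
            self.chars = chars
        return self

    def items(self):
        """Always present the string as a sequence of characters."""
        if self.chars is not None:
            return self.chars
        return [(code,) for code in self.codes]

    def __len__(self):
        return len(self.items())

s = HybridString([0x00E9, 0x0065, 0x0301])   # é, then e + combining acute
assert len(s) == 3                           # counted as raw codes
s.require_characters()
assert len(s) == 2                           # counted as characters
```

    Even in this tiny sketch, every method has to know about both forms, which hints at the duplication cost.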

    All three options are based on the idea that a string is a sequence of items that *are supposed to* (unambiguously) represent abstract characters. Which would be true in numerous cases, maybe the majority. With the first option, applications are required to cope with the fact that this assertion does not hold in the general case. The library may help eg by providing a normalization routine, possibly a consistent compare func; not a general string representation coping with this issue.

    -4- utf-8 based
    This is a format I have never read about. The idea is to use the fact that a utf-8 formatted abstract character is already a sequence of octets. So, characters that are represented with multiple codes only require longer utf-8 sequences. Conceptually, this means utf-8 source strings must first be analysed to build nested sequences representing characters; but need not be decoded; conversely, encoding to utf-8 only requires flattening back the string. Obviously, this form is only advantageous when most sources are utf-8 formatted. Normalisation issues of course remain.
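    A sketch of this idea in Python, assuming well-formed utf-8 input and the same simplified "base plus combining marks" rule for character boundaries. All function names are hypothetical. Note one compromise: each chunk is decoded transiently just to look up its combining class, but only the original octets are stored.

```python
import unicodedata

def split_codes(data):
    """Cut utf-8 octets into one chunk per code, using only the bit patterns."""
    chunks = []
    for byte in data:
        if chunks and byte & 0xC0 == 0x80:   # continuation octet 10xxxxxx
            chunks[-1] += bytes([byte])
        else:                                # lead octet starts a new code
            chunks.append(bytes([byte]))
    return chunks

def to_utf8_characters(data):
    """Group the per-code chunks into per-character octet sequences."""
    chars = []
    for chunk in split_codes(data):
        # Transient decode of a single code, only to query its combining class.
        is_mark = unicodedata.combining(chunk.decode("utf-8")) != 0
        if chars and is_mark:
            chars[-1] += chunk
        else:
            chars.append(chunk)
    return chars

def flatten(chars):
    """Encoding back to utf-8 is plain concatenation."""
    return b"".join(chars)

data = "e\u0301a".encode("utf-8")      # e + combining acute accent, then a
chars = to_utf8_characters(data)
assert chars == [b"e\xcc\x81", b"a"]
assert flatten(chars) == data
```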

    This leads to a first side-question: why is the de facto standard ICU implementation utf-8 based? From my point of view, there are 3 relevant distinct concepts:
    * character set, mapping abstract characters to codes
    * character (encoding) format for saving/transfer
    * character representation, in a programming language
    Legacy character sets such as ASCII and numerous single-byte ones allow a simple equivalence between these 3 forms. This indeed is not possible for unicode, so that a choice must be made for PL representation:
    a- Make it equivalent to an encoding format, namely utf-8.
    b- Make it equivalent to the unicode character set.
    c- Choose a better-suited representation for string processing.

    IMO, option a as used by ICU does not make sense. A format is designed for practical/efficient saving or transfer of text, not for string processing in programs. Also, utf-8 is only one of numerous formats. Choosing utf-8 as PL string representation is imo like eg processing images directly in a particular image file format, or sound in any of the numerous sound-saving ones.
    From the programming point of view, utf-8 is a kind of string *serialisation* form: to be used for persistence, and more generally output, of processed strings. But ICU designers certainly are more clever guys than me: so I would like to know what is wrong in my pov -- and why.

    The choice between options b and c depends on the suitability of the unicode character set as is for string processing in programs: this is more or less what I currently try to figure out.

    Another side-issue is whether it makes sense to systematically normalise strings. As with the representation of strings as character sequences (option 2), the obvious advantage is simplicity, and the drawback a possibly big overhead.


    (*) I intend to do some timing measurements to evaluate the overhead of creating such characters compared to simple (integer) codes (in Lua). And especially compare this to the machine-time required for basic decoding from utf-8, and to normalization. Possibly the overhead is not that relevant. But this also depends on actual implementation choice and on concrete PL features.
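    Such a measurement could be set up along the following lines. This is only a sketch in Python with the standard `timeit` module, so the figures it prints say nothing about Lua; the sample text and the three function names are assumptions chosen for illustration.

```python
import timeit
import unicodedata

# Hypothetical sample: 1000 copies of "e" + combining acute accent, as utf-8.
sample = ("e\u0301" * 1000).encode("utf-8")

def decode_only(data):
    """Baseline: utf-8 octets to a flat list of integer codes."""
    return [ord(ch) for ch in data.decode("utf-8")]

def decode_and_group(data):
    """Decode, then build nested characters (base + combining marks)."""
    chars = []
    for code in decode_only(data):
        if chars and unicodedata.combining(chr(code)) != 0:
            chars[-1] += (code,)
        else:
            chars.append((code,))
    return chars

def decode_and_normalise(data):
    """Decode, then normalise to NFC."""
    text = unicodedata.normalize("NFC", data.decode("utf-8"))
    return [ord(ch) for ch in text]

timings = {}
for func in (decode_only, decode_and_group, decode_and_normalise):
    timings[func.__name__] = timeit.timeit(lambda: func(sample), number=100)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 100 runs")
```

    The ratios between the three lines, rather than the absolute times, are what would indicate whether the character form is affordable.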

    la vita e estrany

