Re: unicode string representation in PL

From: spir
Date: Wed Jan 20 2010 - 02:25:32 CST


    On Tue, 19 Jan 2010 20:15:23 -0000
    "Richard Wordingham" <> wrote:

    > Spir wrote on Tuesday, January 19, 2010 2:20 PM
    > > -2- character sequence
    > > The opposite option may be to design a kind of character type able to
    > > represent any possible character, represented by a single or multiple
    > > codes, possibly normalised; whatever the "concrete" (sic!) form it takes
    > > on the implementation side (conceptually, a character may be represented
    > > as a nested sequence of codes, that would often be a singleton).
    > Note that the simplest form of a 'character type' in this sense will be a
    > string. Logically, the simplest form will be a 'default grapheme cluster',
    > though there is an argument for including items joined by conjoiners. (In
    > the latter case, have fun with ZWJ and ZWNJ following the conjoiners.) You
    > may have noticed that the set of one codepoint strings is not closed under
    > full casing operations.

    Hmm, I _may_ understand what you mean. The representation I evoked above cannot be implemented with (ordinary or byte) strings, because strings cannot be nested the way sequences can; unless we invent delimiters, but then comes the issue of escaping. I was thinking of real sequences (like Python lists) to represent unicode strings (this is already the case). Then, individual chars can indeed be whatever kind of sequence, including byte-strings. But conceptually, they act as nested sequences, or maybe that is what you call a "cluster":
    string : ((a) (b c) (d) (e f g) (h))
    where each letter is a code (point)
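
A minimal Python sketch of this nested form, under a simplifying assumption: a "character" is taken to be a base code point plus the combining marks that follow it, after NFD. This is not the full default grapheme cluster algorithm, and `to_clusters` is a hypothetical helper name.

```python
import unicodedata

def to_clusters(text):
    """Return a list of tuples of code points, one tuple per "character".

    Assumption: a cluster is a base code point plus the combining marks
    that follow it (canonical combining class != 0). Real grapheme
    clusters follow more rules than this.
    """
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += (ord(ch),)   # attach the mark to the previous base
        else:
            clusters.append((ord(ch),))  # start a new cluster
    return clusters

s = to_clusters("a\u01edb")  # a, o with ogonek and macron, b
print(s)        # nested sequence: most clusters are singletons
print(len(s))   # size in characters
print(s[1])     # the second character, as a tuple of codes
print(s[1:3])   # a slice of characters
```

Because the outer container is an ordinary list, size, indexing and slicing work per character without any extra machinery.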

    > At this point it may be helpful to contemplate whether the string consisting
    > of U+01ED LATIN SMALL LETTER O WITH OGONEK AND MACRON contains the string
    > consisting of <U+01EB LATIN SMALL LETTER O WITH OGONEK> or the string
    > consisting of <U+014D LATIN SMALL LETTER O WITH MACRON>.

    For me, the answer is no. Nothing that represents a character can contain another one. The only meaningful logical comparisons are for equality (--> find) and order (--> sort).
    In the format above, two different characters reside in two different clusters, so searching cannot return a false positive (if that is the issue you have in mind).
    size --> in characters
    indexing --> n-th character
    slicing --> characters n to m
    iteration --> on characters
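
To make the search point concrete, here is a hedged Python sketch comparing a substring test on raw code points with a search on whole characters. The helpers `clusters` and `find` are hypothetical names, and clustering is simplified to "base plus following combining marks after NFD", not full grapheme cluster segmentation.

```python
import unicodedata

def nfd(text):
    return unicodedata.normalize("NFD", text)

def clusters(text):
    """Split text into characters: a base plus its combining marks (simplified)."""
    out = []
    for ch in nfd(text):
        if out and unicodedata.combining(ch):
            out[-1] += ch      # combining mark: extend the current character
        else:
            out.append(ch)     # anything else starts a new character
    return out

def find(haystack, needle):
    """Index, counted in characters, of the first match of needle, or -1."""
    h, n = clusters(haystack), clusters(needle)
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:   # compare whole characters only
            return i
    return -1

text = "\u01ed"     # o with ogonek and macron
ogonek = "\u01eb"   # o with ogonek
macron = "\u014d"   # o with macron

# On decomposed code points, "o + ogonek" is a prefix of "o + ogonek + macron".
print(nfd(ogonek) in nfd(text))
# On whole characters, neither shorter character is found inside the longer one.
print(find(text, ogonek), find(text, macron))
# Iteration runs over characters, not codes.
for c in clusters("x\u01edy"):
    print([hex(ord(code)) for code in c])
```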

    > > Legacy character sets such as ASCII and numerous single-byte ones allow a
    > > simple equivalence between these 3 forms. This indeed is not possible for
    > > unicode, so that a choice must be made for PL representation:
    > > a- Make it equivalent to an encoding format, namely utf-8.
    > > b- Make it equivalent to the unicode character set.
    > I presume you mean UTF-32 (or 3-bytes per character if speed of computation
    > doesn't matter to you.)

    Yes. I could not find the proper expression: I meant a PL string representation equivalent to UTF-32.
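
As an illustration of options a and b, the following Python 3 sketch measures one string in several encoding forms. Python 3's `str` is indexed by code point, so it behaves like option b at the language level; the sample string is only an assumption chosen to include a code point outside the BMP.

```python
s = "a\u01ed\U0001D11E"  # 'a', U+01ED, and U+1D11E (outside the BMP)

print(len(s))  # number of code points

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    data = s.encode(enc)
    print(enc, len(data), "bytes")

# In UTF-32 every code point takes exactly 4 bytes, so the n-th code point
# sits at byte offset 4 * n; UTF-8 and UTF-16 have no such fixed relation.
utf32 = s.encode("utf-32-le")
n = 2
print(utf32[4 * n:4 * n + 4].decode("utf-32-le") == s[n])
```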

    > > c- Choose a better-suited representation for string processing.
    > An independent issue, more related to the choice of normal form (ignoring
    > composition exceptions). NFD has its advantages:
    > 1) I find it more likely to render, as my rendering system does not explore
    > decompositions when it can't find glyphs. (I do use fonts with lots of
    > combining characters.) On the other hand, when both render, NFC may render
    > better than NFD. Perhaps it's just time to upgrade the software I use.
    > 2) Collation is defined in terms of NFD.

    Yes, I tend to think the same (and intended to introduce this question of proper normal form as a separate thread, but since it comes now...)
    The primary advantage over NFC (I don't even consider NFK*, since information is potentially lost) is, if I understand correctly, that characters are always represented in their most decomposed form. So, at least, we know that!
    While characters in NFC-normalised text may be:
    * combined
    * decomposed
    * and even half-composed
    (Hope I use terms in a meaningful way ;-)
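
The three cases can be observed with Python's `unicodedata` module. This is only a sketch: the sample sequences are my own choice, and the labels follow the informal terms used above rather than any official Unicode vocabulary.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def nfd(s):
    return unicodedata.normalize("NFD", s)

def codes(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)

samples = {
    # o + ogonek + macron: a precomposed character exists (U+01ED)
    "combined": "o\u0328\u0304",
    # q + macron: assumed to have no precomposed form, so NFC leaves it as is
    "decomposed": "q\u0304",
    # o + ogonek + acute: only o + ogonek composes (U+01EB), the acute stays
    "half-composed": "o\u0328\u0301",
}

for label, s in samples.items():
    print(f"{label:14} NFC: {codes(nfc(s)):20} NFD: {codes(nfd(s))}")
```

In NFD all three samples stay as a base followed by separate marks, which is the predictability argued for above.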

    > > IMO, option a as used by ICU does not make sense. A format is designed for
    > > practical/efficient saving or transfer of text, not for string processing
    > > in programs. Also, utf-8 is only one of numerous formats. Choosing utf-8 as
    > > PL string representation is imo like eg processing images directly in a
    > > particular image file format, or sound in any of the numerous
    > > sound-saving ones.
    > > From the programming point of view, utf-8 is a kind of string
    > > *serialisation* form: to be used for persistence, and more generally
    > > output, of processed strings. But ICU designers certainly are more clever
    > > guys than me: so I would like to know what is wrong in my pov -- and why.
    > One of the biggest appeals of UTF-8 is that many of the C routines for
    > (narrow) character I/O continue to work for UTF-8 - provided you can ignore
    > Unicode properties and don't have to truncate strings.

    Yes, I'm aware of this, and it made me wonder for a while. But this advantage is rather secondary compared to the enormous comfort of a real character sequence.
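
The truncation caveat in the quoted text can be shown in a few lines of Python. This is a sketch only; `truncate_utf8` is a hypothetical helper, not a standard library function.

```python
def truncate_utf8(data, max_bytes):
    """Cut UTF-8 bytes to at most max_bytes without splitting a code point."""
    if max_bytes >= len(data):
        return data
    end = max_bytes
    # If the first excluded byte is a continuation byte (10xxxxxx), the cut
    # falls inside a multi-byte sequence: back up to the start of that sequence.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

data = "na\u00efve".encode("utf-8")  # the i with diaeresis takes 2 bytes

try:
    data[:3].decode("utf-8")         # naive byte cut splits the 2-byte sequence
except UnicodeDecodeError as err:
    print("naive cut failed:", err.reason)

print(truncate_utf8(data, 3).decode("utf-8"))
print(truncate_utf8(data, 4).decode("utf-8"))
```

Note that this only protects code points; a cut can still separate a base from its combining marks, which is the character-level problem discussed in this thread.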

    A deeper distinction may be made for a string representation in a PL: namely between the implementation and the interface on the programmer side. It's indeed possible to have eg utf8 strings under the hood which act like character strings at the language level. (Meaning the interface abstracts the underlying complication, mainly by offering only character-level methods.)
    [I thought ICU was more or less implemented that way (but not hiding complexity at all), but Mark's answer shows I was wrong on that.]
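
A rough Python sketch of that split between implementation and interface: UTF-8 bytes inside, code-point operations outside. The class name `Utf8Text` and its layout are assumptions made for illustration; it does not describe how ICU or any real library works, and it exposes code points rather than full characters.

```python
class Utf8Text:
    """Hypothetical string type: stored as UTF-8, used per code point."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        # Byte offsets where a code point starts, i.e. every byte that is
        # not a continuation byte (10xxxxxx).
        self._starts = [i for i, b in enumerate(self._data)
                        if (b & 0xC0) != 0x80]

    def __len__(self):
        return len(self._starts)          # size in code points, not bytes

    def __getitem__(self, n):
        # Integer indexes only in this sketch; slices are not handled.
        if n < 0:
            n += len(self)
        if not 0 <= n < len(self):
            raise IndexError("Utf8Text index out of range")
        start = self._starts[n]
        end = (self._starts[n + 1] if n + 1 < len(self._starts)
               else len(self._data))
        return self._data[start:end].decode("utf-8")

    def __iter__(self):
        return (self[i] for i in range(len(self)))

t = Utf8Text("a\u01ed\U0001D11E")
print(len(t), "code points stored in", len(t._data), "bytes")
print([f"U+{ord(c):04X}" for c in t])
```

The offset table trades memory for constant-time indexing; without it, each index operation would have to scan the bytes from the start.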

    > (It also helps that
    > purely ASCII data is already in UTF-8). The compactness of UTF-8 data
    > (especially when data has a lot of ASCII-based mark-up) is just a minor
    > convenience.

    I think so.

    > A disadvantage is that it is incompatible with the definition
    > of a multibyte encoding if wide characters are only 16 bits wide.

    I don't understand what you mean here, sorry.

    > Richard.



    la vita e estrany
