Re: unicode string representation in PL

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Tue Jan 19 2010 - 14:15:23 CST

    Spir wrote on Tuesday, January 19, 2010 2:20 PM

    > -2- character sequence
    > The opposite option may be to design a kind of character type able to
    > represent any possible character, represented by a single or multiple
    > codes, possibly normalised; whatever the "concrete" (sic!) form it takes
    > on the implementation side (conceptually, a character may be represented
    > as a nested sequence of codes, that would often be a singleton).

    Note that the simplest form of a 'character type' in this sense will be a
    string. Logically, the simplest form will be a 'default grapheme cluster',
    though there is an argument for including items joined by conjoiners. (In
    the latter case, have fun with ZWJ and ZWNJ following the conjoiners.) You
    may have noticed that the set of one codepoint strings is not closed under
    full casing operations.
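
    As a small illustration of that last point, here is a sketch in Python,
    assuming that its built-in str.upper() applies the full (one-to-many) case
    mappings. The sample characters are my own choice; each is a single code
    point whose uppercase form is longer than one code point.

```python
# Single code points whose full uppercase mapping is more than one code point.
samples = ["\u00df", "\u0149", "\ufb01"]  # sharp s, n preceded by apostrophe, fi ligature


def code_points(s):
    """Return the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


# Map each one-code-point string to its uppercased form.
uppercased = {s: s.upper() for s in samples}

for source, upper in uppercased.items():
    print(code_points(source), "->", code_points(upper))
```

    A 'character type' that can hold only one code point cannot hold these
    results, which is one reason the simplest such type ends up being a string.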

    At this point it may be helpful to contemplate whether the string consisting
    of <U+01ED LATIN SMALL LETTER O WITH OGONEK AND MACRON> contains the string
    consisting of <U+01EB LATIN SMALL LETTER O WITH OGONEK> or the string
    consisting of <U+014D LATIN SMALL LETTER O WITH MACRON>.
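
    To make the question concrete, a sketch in Python using the standard
    unicodedata module; the variable names are only illustrative. It applies a
    naive code-point substring test, first to the strings as given and then to
    their NFD forms, and the answers differ.

```python
import unicodedata


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


haystack = "\u01ed"   # o with ogonek and macron
o_ogonek = "\u01eb"   # o with ogonek
o_macron = "\u014d"   # o with macron

# Naive substring search on the precomposed code points: neither is found.
raw_contains = (o_ogonek in haystack, o_macron in haystack)

# After NFD the haystack is <o, ogonek, macron>. <o, ogonek> is a prefix of
# it, but <o, macron> is not a contiguous subsequence.
nfd_contains = (nfd(o_ogonek) in nfd(haystack), nfd(o_macron) in nfd(haystack))

print(raw_contains, nfd_contains)
```

    Neither answer is obviously the right one: the two marks have different
    combining classes, so <o, macron, ogonek> is canonically equivalent to the
    haystack, and a search that respects canonical equivalence could
    reasonably report both or neither.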

    > Legacy character sets such as ASCII and numerous single-byte ones allow a
    > simple equivalence between these 3 forms. This indeed is not possible for
    > unicode, so that a choice must be made for PL representation:
    > a- Make it equivalent to an encoding format, namely utf-8.
    > b- Make it equivalent to the unicode character set.

    I presume you mean UTF-32 (or 3 bytes per character, if speed of computation
    doesn't matter to you).

    > c- Choose a better-suited representation for string processing.

    That is an independent issue, related more to the choice of normal form
    (ignoring composition exceptions). NFD has its advantages:
    1) I find it more likely to render, as my rendering system does not explore
    decompositions when it can't find glyphs. (I do use fonts with lots of
    combining characters.) On the other hand, when both render, NFC may render
    better than NFD. Perhaps it's just time to upgrade the software I use.
    2) Collation is defined in terms of NFD.
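
    For what the choice of normal form means in practice, a minimal sketch in
    Python with the standard unicodedata module. The helper canonically_equal
    is a hypothetical name of mine, and this shows only comparison by
    normalising to NFD, not collation itself.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after normalising both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u01ed"        # o with ogonek and macron, one code point
reordered = "o\u0304\u0328"   # o, combining macron, combining ogonek

nfc = unicodedata.normalize("NFC", reordered)
nfd = unicodedata.normalize("NFD", reordered)

print(precomposed == reordered)                   # code point comparison
print(canonically_equal(precomposed, reordered))  # comparison via NFD
print(len(nfc), len(nfd))                         # lengths in code points
```

    Whichever form is chosen internally, the code point count of a string
    depends on it, so 'length' is not a property of the text alone.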

    > IMO, option a as used by ICU does not make sense. A format is designed for
    > practical/efficient saving or transfer of text, not for string processing
    > in programs. Also, utf-8 is only one of numerous formats. Choosing utf-8 as
    > PL string representation is imo like eg processing images directly in a
    > particular image file format, or sound in any of the numerous
    > sound-saving ones.
    > From the programming point of view, utf-8 is a kind of string
    > *serialisation* form: to be used for persistence, and more generally
    > output, of processed strings. But ICU designers certainly are more clever
    > guys than me: so I would like to know what is wrong in my pov -- and why.

    One of the biggest appeals of UTF-8 is that many of the C routines for
    (narrow) character I/O continue to work for UTF-8 - provided you can ignore
    Unicode properties and don't have to truncate strings. (It also helps that
    purely ASCII data is already in UTF-8.) The compactness of UTF-8 data
    (especially when data has a lot of ASCII-based mark-up) is just a minor
    convenience. A disadvantage is that it is incompatible with the definition
    of a multibyte encoding if wide characters are only 16 bits wide.
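
    The ASCII and truncation points can be sketched as follows, in Python
    rather than C for brevity. The helper truncate_utf8 is a hypothetical
    name; it assumes its input is already valid UTF-8 and a non-negative
    limit, and it respects only code point boundaries, not grapheme clusters.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Bytes of the form 10xxxxxx are continuation bytes; if the cut would land
    # on one, back up to the lead byte so the whole character is dropped.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# Purely ASCII data has the same bytes in ASCII and in UTF-8.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# 'cafe' with e-acute: the last character takes two bytes in UTF-8.
data = "caf\u00e9".encode("utf-8")

# Truncating at an arbitrary byte offset can leave an invalid sequence.
try:
    data[:4].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

safe = truncate_utf8(data, 4)
print(same_bytes, naive_ok, safe)
```

    Byte-oriented routines that never cut or inspect the data in this way are
    the ones that carry over to UTF-8 unchanged.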

    Richard.


