Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 08 2004 - 07:33:30 CST


    "D. Starner" <shalesller@writeme.com> writes:

    > You could hide combining characters, which would be extremely useful if
    > we were just using Latin and Cyrillic scripts.

    It would need a separate API for examining the contents of a combined
    character. You can't avoid exposing the sequence of code points completely.

    It would lead to surprising semantics: for example, if you concatenate
    a string with N+1 possible positions of an iterator with a string with
    M+1 positions, you don't necessarily get a string with N+M+1 positions
    because there can be combining characters at the border.
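
    A minimal sketch of the concatenation issue, in Python. The helper
    cluster_count is hypothetical and only approximates grouping: it treats
    any code point with a nonzero canonical combining class as attaching to
    the previous one, which is far simpler than real grapheme cluster
    segmentation.

```python
import unicodedata

def cluster_count(s: str) -> int:
    """Count groups of one base code point plus trailing combining marks.

    Assumption: a code point starts a new group if it is first in the
    string or has canonical combining class 0. This is a simplification.
    """
    return sum(
        1
        for i, ch in enumerate(s)
        if i == 0 or unicodedata.combining(ch) == 0
    )

a = "e"          # one group, so N = 1 and N+1 = 2 iterator positions
b = "\u0301x"    # starts with COMBINING ACUTE ACCENT; M = 2, M+1 = 3 positions

# Grouped view: the accent attaches to "e" across the border,
# so the result has 2 groups (3 positions), not N+M = 3 groups (4 positions).
groups_joined = cluster_count(a + b)

# Code point view: lengths always add up.
code_points_joined = len(a + b)
```

    With code points as the unit, len(a + b) == len(a) + len(b) holds for
    every pair of strings; with the grouped view it does not.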

    It's simpler to overlay various grouping styles on top of a sequence
    of code points than to start with automatically combined combining
    characters and process inwards and outwards from there (sometimes
    looking inside characters, sometimes grouping them even more).
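
    One way to picture "overlaying a grouping style" is a function that
    takes a plain sequence of code points and returns groups on demand,
    leaving the string itself untouched. The sketch below is in Python;
    combining_groups is a hypothetical name, and the rule it uses (nonzero
    canonical combining class attaches to the previous code point) is an
    assumed simplification, not a full segmentation algorithm.

```python
import unicodedata

def combining_groups(s: str) -> list[str]:
    """Split s into groups: a code point plus any combining marks after it.

    The input stays an ordinary string of code points; the grouping is
    just one view computed over it.
    """
    groups: list[str] = []
    for ch in s:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch      # attach the mark to the previous group
        else:
            groups.append(ch)     # start a new group
    return groups

text = "e\u0301x"                  # e, COMBINING ACUTE ACCENT, x
by_group = combining_groups(text)  # one overlay: base + marks
by_word = text.split()             # another overlay on the same code points
```

    Other groupings (words, lines, tokens) can be layered over the same
    string in the same way, and code that does not care about any of them
    pays nothing for them.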

    It would impose complexity in cases where it's not needed. Most of the
    time you don't care which code points are combining and which are not,
    for example when you compose a text file from many pieces (constants
    and parts filled by users) or when parsing (if a string is specified
    as ending with a double quote, then programs will in general treat a
    double quote followed by a combining character as an end marker).
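
    The parsing case can be sketched as follows, in Python. The function
    read_quoted is a hypothetical, deliberately naive parser (no escape
    handling) that works purely on code points, so a double quote ends the
    string even when a combining mark follows it.

```python
def read_quoted(text: str, start: int) -> tuple[str, int]:
    """Read a double-quoted string beginning at text[start].

    Returns the content between the quotes and the index just past the
    closing quote. Raises ValueError if there is no closing quote.
    """
    if text[start] != '"':
        raise ValueError("expected opening double quote")
    end = text.index('"', start + 1)   # first later double quote code point
    return text[start + 1:end], end + 1

# The closing quote is followed by COMBINING ACUTE ACCENT (U+0301).
src = '"abc"\u0301 rest'
content, next_index = read_quoted(src, 0)
```

    Here content is "abc" and the combining mark is simply the next code
    point after the terminator; the parser never needs to know it is
    combining.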

    I believe code points are the appropriate general-purpose unit of
    string processing.

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

