Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 08 2004 - 07:33:30 CST

Next message: John Cowan: "Re: OpenType not for Open Communication?"

Previous message: Lars Kristan: "RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
In reply to: D. Starner: "Re: Nicest UTF"
Next in thread: D. Starner: "Re: Nicest UTF"

"D. Starner" <shalesller@writeme.com> writes:

> You could hide combining characters, which would be extremely useful if
> we were just using Latin and Cyrillic scripts.

It would need a separate API for examining the contents of a combining
character sequence. You can't avoid the sequence of code points completely.

It would lead to surprising semantics: for example, if you concatenate
a string with N+1 possible iterator positions and a string with
M+1 positions, you don't necessarily get a string with N+M+1 positions,
because there can be combining characters at the border.
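
A minimal sketch of that effect in Python, under a loud assumption: a
"character" here is a base code point plus any following code points with
a non-zero canonical combining class. That is only an approximation of
real grapheme clustering, and the helper names clusters() and positions()
are made up for illustration.

```python
import unicodedata

def clusters(s):
    """Group each code point with the combining marks that follow it.

    Simplified assumption: a mark is any code point with a non-zero
    canonical combining class. A mark with nothing before it starts
    its own group.
    """
    groups = []
    for ch in s:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch      # attach the mark to the previous group
        else:
            groups.append(ch)     # start a new group
    return groups

def positions(s):
    """Number of iterator positions when iterating by combined character."""
    return len(clusters(s)) + 1

left = "e"             # N = 1, so N+1 = 2 positions
right = "\u0301x"      # COMBINING ACUTE ACCENT then "x": M = 2, M+1 = 3 positions
joined = left + right  # the accent now attaches to the "e"

print(positions(left), positions(right), positions(joined))
print(len(left), len(right), len(joined))
```

By combined characters the joined string has 3 positions rather than
N+M+1 = 4, while the code point counts simply add up (1 + 2 = 3).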

It's simpler to overlay various grouping styles on top of a sequence
of code points than to start with automatically combined combining
characters and process inwards and outwards from there (sometimes
looking inside characters, sometimes grouping them even more).

It would impose complexity in cases where it's not needed. Most of the
time you don't care which code points are combining and which are not,
for example when you compose a text file from many pieces (constants
and parts filled by users) or when parsing (if a string is specified
as ending with a double quote, then programs will in general treat a
double quote followed by a combining character as an end marker).
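
The parsing case can be sketched the same way. This assumes a made-up
format where a string literal simply runs from one double quote to the
next; read_quoted() is a hypothetical helper, not any particular parser.

```python
def read_quoted(text):
    """Split text that starts with a double-quoted literal into (body, rest).

    The scan works on code points: the first double quote after the
    opening one ends the literal, whatever code point comes after it.
    """
    if not text.startswith('"'):
        raise ValueError("expected an opening double quote")
    end = text.index('"', 1)  # raises ValueError if there is no closing quote
    return text[1:end], text[end + 1:]

# The closing quote is followed by a combining acute accent (U+0301).
body, rest = read_quoted('"abc"\u0301 tail')
print(repr(body), repr(rest))
```

The literal ends at the quote, and the combining character is left at the
start of the remaining input. A parser that compared whole combined
characters would not see a plain double quote there at all.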

I believe code points are the appropriate general-purpose unit of
string processing.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


This archive was generated by hypermail 2.1.5 : Wed Dec 08 2004 - 07:42:15 CST