From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Tue Jan 19 2010 - 14:15:23 CST
Spir wrote on Tuesday, January 19, 2010 2:20 PM
> -2- character sequence
> The opposite option may be to design a kind of character type able to
> represent any possible character, whether encoded as a single code or as
> multiple codes, possibly normalised; whatever the "concrete" (sic!) form it
> takes on the implementation side (conceptually, a character may be
> represented as a nested sequence of codes, which would often be a
> singleton).
Note that the simplest implementation of a 'character type' in this sense
will be a string. Logically, the simplest unit will be a 'default grapheme
cluster', though there is an argument for also including items joined by
conjoiners. (In the latter case, have fun with ZWJ and ZWNJ following the
conjoiners.) You may have noticed that the set of one-codepoint strings is
not closed under full casing operations.
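To make that last point concrete, here is a minimal Python sketch (Python is
just a convenient illustration here, not anything from the thread):

    # One code point in, two code points out: the set of
    # one-codepoint strings is not closed under full casing.
    s = "\u00df"                   # U+00DF LATIN SMALL LETTER SHARP S
    print(s.upper())               # 'SS' -- two code points
    print(len(s), len(s.upper()))  # 1 2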
At this point it may be helpful to contemplate whether the string consisting
of <U+01ED LATIN SMALL LETTER O WITH OGONEK AND MACRON> contains the string
consisting of <U+01EB LATIN SMALL LETTER O WITH OGONEK> or the string
consisting of <U+014D LATIN SMALL LETTER O WITH MACRON>.
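A small Python sketch with unicodedata makes the puzzle concrete: naive
substring search gives a different answer for each candidate, and a
different answer again depending on normalization.

    import unicodedata as ud

    full = ud.normalize("NFD", "\u01ed")  # 'o' + U+0328 + U+0304
    oq   = ud.normalize("NFD", "\u01eb")  # 'o' + U+0328 (ogonek)
    om   = ud.normalize("NFD", "\u014d")  # 'o' + U+0304 (macron)

    print(oq in full)            # True  -- the ogonek form is a contiguous prefix
    print(om in full)            # False -- the ogonek sits between 'o' and macron
    print("\u01eb" in "\u01ed")  # False -- no match at the precomposed level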
> Legacy character sets such as ASCII and numerous single-byte ones allow a
> simple equivalence between these three forms. This is indeed not possible
> for Unicode, so a choice must be made for the PL representation:
> a- Make it equivalent to an encoding format, namely UTF-8.
> b- Make it equivalent to the Unicode character set.
I presume you mean UTF-32 (or 3 bytes per character, if speed of computation
doesn't matter to you).
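(For contrast, a quick Python sketch: UTF-32 spends a fixed four bytes per
code point, while UTF-8 is variable-width.)

    s = "\u01ed"                       # one code point
    print(len(s.encode("utf-32-le")))  # 4 -- always 4 bytes per code point
    print(len(s.encode("utf-8")))      # 2 -- anywhere from 1 to 4 bytes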
> c- Choose a better-suited representation for string processing.
An independent issue, one more closely tied to the choice of normal form
(ignoring composition exceptions; see the sketch after this list). NFD has
its advantages:
1) I find it more likely to render, as my rendering system does not explore
decompositions when it can't find glyphs. (I do use fonts with lots of
combining characters.) On the other hand, when both render, NFC may render
better than NFD. Perhaps it's just time to upgrade the software I use.
2) Collation is defined in terms of NFD.
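On those composition exceptions, a minimal Python sketch: U+0958 decomposes
under NFD but, being a composition exclusion, is not restored by NFC.

    import unicodedata as ud

    qa  = "\u0958"                 # U+0958 DEVANAGARI LETTER QA
    nfd = ud.normalize("NFD", qa)  # U+0915 + U+093C (KA + NUKTA)
    print([hex(ord(c)) for c in nfd])      # ['0x915', '0x93c']
    print(ud.normalize("NFC", nfd) == qa)  # False -- excluded from recomposition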
> IMO, option a as used by ICU does not make sense. A format is designed for
> practical/efficient saving or transfer of text, not for string processing
> in programs. Also, UTF-8 is only one of numerous formats. Choosing UTF-8 as
> the PL string representation is, IMO, like e.g. processing images directly
> in a particular image file format, or sound in any of the numerous
> sound-saving ones.
> From the programming point of view, UTF-8 is a kind of string
> *serialisation* form: to be used for persistence, and more generally
> output, of processed strings. But the ICU designers are certainly cleverer
> guys than me, so I would like to know what is wrong in my point of view --
> and why.
One of the biggest appeals of UTF-8 is that many of the C routines for
(narrow) character I/O continue to work for UTF-8, provided you can ignore
Unicode properties and don't have to truncate strings. (It also helps that
purely ASCII data is already valid UTF-8.) The compactness of UTF-8 data
(especially when the data has a lot of ASCII-based mark-up) is just a minor
convenience. A disadvantage is that UTF-8 is incompatible with C's
definition of a multibyte encoding if wide characters (wchar_t) are only
16 bits wide.
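Both halves of that trade-off can be sketched in a few lines of Python
operating on raw UTF-8 bytes (illustrative only):

    data = "caf\u00e9, th\u00e9".encode("utf-8")

    # Byte-oriented search on ASCII delimiters is safe: no UTF-8
    # continuation or lead byte ever coincides with an ASCII byte.
    print(data.split(b","))  # [b'caf\xc3\xa9', b' th\xc3\xa9']

    # Blind truncation is not: it can split a multibyte sequence.
    chopped = data[:4]       # cuts U+00E9 (0xC3 0xA9) in half
    print(chopped.decode("utf-8", errors="replace"))  # 'caf\ufffd'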
Richard.