RE: Non-ascii string processing?

From: jon@spin.ie
Date: Tue Oct 07 2003 - 05:45:05 CST


> Now - a count of DEFAULT GRAPHEME CLUSTERs might be useful (for example,
> for display on a console which uses fixed-width fonts). Indeed, a whole
> class of DEFAULT GRAPHEME CLUSTER handling functions might come in very
> handy indeed. Bytes are useful. Default grapheme clusters are useful.
> But a "character"? What's the point?

Because characters are a useful intermediary between bytes and grapheme clusters. Such an intermediary may be entirely wrapped by code that steps from octets to grapheme clusters, or it may be exposed by an API that higher-level code then uses to produce the grapheme clusters. Characters are the lowest level an API can expose while remaining encoding-neutral, which is why XML APIs expose CDATA, element names, etc. at that level. The alternative would be a direct mapping between octets and grapheme clusters...
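
A rough sketch of that layering in Python may make the point concrete. The function names here are hypothetical, and the clustering rule is a deliberate simplification (one character followed by any combining marks), not the full default grapheme cluster rules:

```python
import unicodedata
from typing import List

def decode_octets(octets: bytes, encoding: str) -> str:
    """Octets -> characters. This is the only layer that knows the encoding."""
    return octets.decode(encoding)

def simple_clusters(text: str) -> List[str]:
    """Characters -> approximate grapheme clusters.

    Assumption for illustration only: a cluster is one character followed
    by any characters whose general category is a mark (Mn, Mc or Me).
    """
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

text = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
for encoding in ("utf-8", "utf-16-le"):
    octets = text.encode(encoding)
    chars = decode_octets(octets, encoding)
    # The octet count depends on the encoding; the character and
    # cluster counts do not.
    print(encoding, len(octets), len(chars), len(simple_clusters(chars)))
```

Only decode_octets changes when the encoding changes; simple_clusters works on characters and never sees an octet, which is the sense in which the character level is encoding-neutral.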

> But then, a default grapheme cluster might theoretically require up to
> 16 Unicode characters. (Maybe more, I don't know). Even bit-packed to 21
> bits per character, that still gives us 336 bits. So I conclude that our
> string processing functions could go a lot faster if only we'd all use
> UTF-336. Er....?

Certainly, if we allow for linguistic improbabilities (such as a C with two graves, an acute, and a couple of Hebrew vowel points), there would be no limit on the number of characters in a cluster. Allowing for such improbabilities has the advantage of making it more likely that we also cover genuine linguistic edge cases.
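
To put numbers on both points, here is a small Python sketch. The choice of the two Hebrew points (HIRIQ and QAMATS) is mine, purely for illustration, and the sketch assumes that a base character followed by nonspacing marks counts as a single cluster:

```python
import unicodedata

BITS_PER_CODE_POINT = 21  # enough to cover U+0000..U+10FFFF

def bits_needed(cluster: str) -> int:
    """Bits required to bit-pack every character of one cluster."""
    return len(cluster) * BITS_PER_CODE_POINT

# The figure from the quoted message: 16 characters at 21 bits each.
print(16 * BITS_PER_CODE_POINT)  # 336

# "C with two graves, an acute and a couple of Hebrew vowel points".
improbable = "C" + "\u0300\u0300" + "\u0301" + "\u05B4\u05B8"

# Every character after the base is a nonspacing mark (category Mn).
all_marks = all(unicodedata.category(ch) == "Mn" for ch in improbable[1:])
print(all_marks, len(improbable), bits_needed(improbable))

# Nothing stops the marks from piling up, so any fixed width is
# eventually exceeded.
for n in (16, 1000):
    cluster = "C" + "\u0300" * n
    print(n, bits_needed(cluster))
```

Whatever width is chosen, one more combining mark produces a cluster that no longer fits, which is why a fixed-width cluster encoding cannot cover the improbable cases.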



This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST