From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Dec 12 2003 - 07:49:52 EST
On 12/12/2003 04:13, jon@hackcraft.net wrote:
>>Thank you. I was supposing that isolated combining marks were considered
>>in some way defective,
>
><blockquote cite="http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf">
>D17a: Defective combining character sequence: A combining character sequence
>that does not start with a base character.
>
>[Explanatory Note] Defective combining character sequences occur when a
>sequence of combining characters appears at the start of a string or follows
>a control or format character.
>Such sequences are defective from the point of view of handling of combining
>marks, but are not ill-formed.
></blockquote>
>
>"in some way defective" is actually a good way to put it methinks, they aren't
>illegal, and in some cases you can do things with them that are both reasonable
>and useful, but in other situations they may be problematic.
Indeed. But I was thinking more in terms of grapheme clusters, as
defined in UAX #29. Is a defective combining sequence a grapheme
cluster? Probably not according to the definition "what the user thinks
of as a character or basic unit of the language". But the boundary rule
"/Break at the start and end of text./" implies that the algorithm will
count a defective combining sequence at the start of text (and possibly
what follows) as a default grapheme cluster. So it is "in some way
defective" both as a grapheme cluster and as a character sequence.
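To make the point concrete, here is a rough Python sketch of how a segmenter ends up treating a leading combining mark as a cluster of its own. It is only an approximation under stated assumptions: it attaches marks (general categories Mn, Mc, Me) to the preceding character and breaks after control and format characters, and it does not implement the full UAX #29 rule set (no Hangul, CR LF, or other rules). The names `is_mark`, `is_defective` and `approx_clusters` are made up for this illustration.

```python
import unicodedata

def is_mark(ch):
    # Combining marks have general category Mn, Mc or Me.
    return unicodedata.category(ch).startswith("M")

def is_defective(seq):
    # D17a: a combining character sequence that does not start
    # with a base character.
    return bool(seq) and is_mark(seq[0])

def approx_clusters(text):
    # Simplified segmentation (an assumption, not the UAX #29 algorithm):
    # a mark joins the previous cluster unless there is no previous
    # cluster or the previous cluster is a control/format character.
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        after_control = bool(prev) and unicodedata.category(prev[0]) in ("Cc", "Cf")
        if prev and is_mark(ch) and not after_control:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# U+0301 COMBINING ACUTE ACCENT at the start of text has no base to attach to.
parts = approx_clusters("\u0301abc")
print(len(parts), is_defective(parts[0]))  # 4 True
```

Because of the break at the start of text, the lone U+0301 is reported as a cluster even though no user would call it a character.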
I note the following in UAX #29, which backs up my comments on functions
to count characters:
> In those rare circumstances where end-users need character counts, the
> counts should correspond to the grapheme cluster boundaries.
This implies that end users should not require counts of code units or
code points.
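As an illustration of how far the three counts can diverge, the following Python sketch reports code units, code points and an approximate cluster count for the same string. The cluster count is an assumption-laden shortcut (every non-mark character starts a cluster, and so does a mark at the very start of the text), not the UAX #29 algorithm; `count_units` is a hypothetical name chosen for this example.

```python
import unicodedata

def count_units(text):
    # Approximate cluster count: the first character always starts a
    # cluster (break at start of text); after that, only non-marks do.
    clusters = sum(
        1
        for i, ch in enumerate(text)
        if i == 0 or not unicodedata.category(ch).startswith("M")
    )
    return {
        "utf8_code_units": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
        "approx_clusters": clusters,
    }

# "e" + U+0301: one user-perceived character, two code points.
print(count_units("e\u0301"))
# U+1D11E MUSICAL SYMBOL G CLEF: one code point, two UTF-16 code units.
print(count_units("\U0001D11E"))
```

For "e" followed by U+0301 this gives 3 UTF-8 code units, 2 UTF-16 code units, 2 code points and 1 cluster; only the last figure matches what an end user would count.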
-- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/