From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Nov 24 2009 - 19:05:26 CST
Karl Williamson wrote:
> Thanks for your reply. I'm afraid I'm still confused.
>
> The sentence before Table 1b is the first mention in this document of
> combining character sequences; it would be nice if it discussed what
> they were, and why even mention them at all? In the past, I just
> presumed they were an earlier concept that was superseded by grapheme
> clusters.
It is an earlier concept. But it is not superseded by grapheme
clusters.
>
> They are discussed some in 3.6 of the actual standard, and here there
> seem to me to be contradictions:
>
> "• A grapheme cluster is similar, but not identical to a combining
> character sequence. A combining character sequence starts with a base
> character and extends across any subsequent sequence of combining marks,
> nonspacing or spacing. A combining character sequence is most directly
> relevant to processing issues related to normalization, comparison, and
> searching.
> • A grapheme cluster starts with a grapheme base and extends across any
> subsequent sequence of nonspacing marks. A grapheme cluster is most
> directly relevant to text rendering and such processes as cursor
> placement and text selection in editing."
>
> This seems to me to imply that a base character is always the first item
> of a combining character sequence,
Usually, yes, but not definitionally. Read D56 and D57 carefully.
A *defective* combining character sequence doesn't start with
a base character, but is a combining character sequence nonetheless.
> and the word 'any' seems to me to
> imply 0 or more marks following it.
For a grapheme cluster, yes. A single base character *is*
a grapheme cluster. It is *not* a combining character sequence.
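
The distinction between these cases can be sketched in Python. This is only an approximation of D56 and D57: the helper names (`is_combining`, `is_base`, `classify`) are hypothetical, the base-character test is a simplified reading of D51 based on General_Category, and the function classifies a whole string rather than finding maximal sequences inside longer text.

```python
import unicodedata

# ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER may also appear in a
# combining character sequence (assumption: following D56).
JOINERS = {"\u200c", "\u200d"}

def is_combining(ch):
    # Combining characters have General_Category Mn, Mc, or Me.
    return unicodedata.category(ch).startswith("M")

def is_base(ch):
    # Simplified D51: a graphic character that is not a combining mark.
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"

def classify(s):
    """Classify a whole string; illustrative only."""
    if not s:
        return "empty"
    tail_ok = all(is_combining(c) or c in JOINERS for c in s[1:])
    if is_base(s[0]) and len(s) > 1 and tail_ok:
        return "combining character sequence"
    if (is_combining(s[0]) or s[0] in JOINERS) and tail_ok:
        return "defective combining character sequence"
    return "not a combining character sequence"

print(classify("e\u0301"))  # base + COMBINING ACUTE ACCENT
print(classify("\u0301"))   # mark with no base
print(classify("e"))        # lone base character
```

Under these assumptions the lone "e" is reported as not being a combining character sequence, even though it is a grapheme cluster on its own.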
> And this doesn't help me understand why there is the concept of a
> combining character sequence and why that is more relevant than a
> grapheme cluster to normalization, comparison, and searching.
Normalization is not defined in terms of grapheme clusters.
Grapheme clusters are about segmentation issues in text (which
is why they are defined in UAX #29, the UAX about text segmentation).
Normalization, on the other hand, is *definitionally* concerned
with combining character sequences, because at the core
of normalization is the canonical ordering of sequences of
combining marks. See the Canonical Ordering Algorithm subsection
of Section 3.11 Normalization Forms in the latest posted
version of the standard.
--Ken
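
The role of canonical ordering in normalization can be illustrated with a short Python sketch using the standard `unicodedata` module. The variable names are chosen for illustration only; the example assumes the combining classes of U+0301 (230) and U+0323 (220) as assigned in the Unicode Character Database.

```python
import unicodedata

# Two spellings of the same text: "a" followed by two combining marks
# in opposite orders.
a = "a\u0301\u0323"  # a + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
b = "a\u0323\u0301"  # a + COMBINING DOT BELOW + COMBINING ACUTE ACCENT

nfd_a = unicodedata.normalize("NFD", a)
nfd_b = unicodedata.normalize("NFD", b)

# Canonical ordering sorts the marks by canonical combining class,
# so both spellings end up as the same code point sequence.
print(a == b)          # the raw strings differ
print(nfd_a == nfd_b)  # the normalized strings match
print([unicodedata.combining(c) for c in nfd_a])
```

The reordering operates on the marks that follow the base character, which is why the combining character sequence, rather than the grapheme cluster, is the unit that matters here.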