From: karl williamson (public@khwilliamson.com)
Date: Wed Nov 25 2009 - 13:13:13 CST
Kenneth Whistler wrote:
> Karl Williamson wrote:
>
>> Thanks for your reply. I'm afraid I'm still confused.
>>
>> The sentence before Table 1b is the first mention in this document of
>> combining character sequences; it would be nice if it discussed what
>> they were, and why even mention them at all? In the past, I just
>> presumed they were an earlier concept that was superseded by grapheme
>> clusters.
>
> It is an earlier concept. But it is not superseded by grapheme
> clusters.
>
>> They are discussed some in 3.6 of the actual standard, and here there
>> seem to me to be contradictions:
>>
>> "• A grapheme cluster is similar, but not identical to a combining
>> character sequence. A combining character sequence starts with a base
>> character and extends across any subsequent sequence of combining marks,
>> nonspacing or spacing. A combining character sequence is most directly
>> relevant to processing issues related to normalization, comparison, and
>> searching.
>> • A grapheme cluster starts with a grapheme base and extends across any
>> subsequent sequence of nonspacing marks. A grapheme cluster is most
>> directly relevant to text rendering and such processes as cursor
>> placement and text selection in editing."
>>
>> This seems to me to imply that a base character is always the first item
>> of a combining character sequence,
>
> Usually, yes, but not definitionally. Read D56 and D57 carefully.
> A *defective* combining character sequence doesn't start with
> a base character, but is a combining character sequence nonetheless.
Shouldn't the phrase in the first bullet item then be, "A grapheme
cluster is similar, but not identical to a combining character sequence.
A combining character sequence *generally* starts with a base character
and extends across any subsequent sequence of combining marks, ..."?
Also, the FAQ (http://unicode.org/faq/char_combmark.html#1) is wrong, as
it says "A combining character sequence is a base character followed by
any number of combining characters." That should be "one or more"
instead of "any number of". I presume you don't have to wait for a
formal open comment period to revise the FAQs.
>> and the word 'any' seems to me to
>> imply 0 or more marks following it.
>
> For a grapheme cluster, yes. A single base character *is*
> a grapheme cluster. It is *not* a combining character sequence.
>
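To make the distinction concrete for myself, here is a minimal sketch of
how I read D56 and D57. It is a simplification, not the standard's
wording: it ignores the "maximal" requirement, treats anything with
General Category M* (plus ZWJ and ZWNJ) as combining, and treats L, N,
P, S and Zs as base characters. The names classify, is_base and
is_combining are my own, not anything defined by Unicode.

```python
import unicodedata

# ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER may appear in a sequence.
JOINERS = {"\u200C", "\u200D"}

def is_combining(ch):
    # Assumption: "combining character" == General Category M* (Mn, Mc, Me).
    return unicodedata.category(ch).startswith("M") or ch in JOINERS

def is_base(ch):
    # Assumption: a base character is a graphic character that is not a mark.
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"

def classify(s):
    """Classify a whole string under a simplified reading of D56/D57."""
    if not s:
        return "empty"
    if all(is_combining(c) for c in s):
        # No base character at the start: defective, but still a sequence.
        return "defective combining character sequence"
    if is_base(s[0]) and len(s) > 1 and all(is_combining(c) for c in s[1:]):
        # Base character followed by ONE OR MORE combining characters.
        return "combining character sequence"
    return "not a combining character sequence"
```

Under this reading, classify("e\u0301") is a combining character
sequence, classify("\u0301") alone is a defective one, and classify("e")
is neither, which matches "one or more" rather than "any number of".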
Another seeming contradiction in the documentation: Chapter 3 says that
an extended grapheme cluster has no linguistic basis, which implies to
me that a non-extended grapheme cluster does. Yet, based on the text in
UAX #29, the extended version comes closest to matching what a user
would view as a single character.
>
>> And this doesn't help me understand why there is the concept of a
>> combining character sequence and why that is more relevant than a
>> grapheme cluster to normalization, comparison, and searching.
>
> Normalization is not defined in terms of grapheme clusters.
> Grapheme clusters are about segmentation issues in text (which
> is why they are defined in UAX #29, the UAX about text segmentation).
>
> Normalization, on the other hand, is *definitionally* concerned
> with combining character sequences, because at the core
> of normalization is the canonical ordering of sequences of
> combining marks. See the Canonical Ordering Algorithm subsection
> of Section 3.11 Normalization Forms in the latest posted
> version of the standard.
>
> --Ken
>
>
>
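The point about canonical ordering can be seen with a small experiment.
This is only a sketch using Python's standard unicodedata module, not
the algorithm text from Section 3.11; the helper names ccc_list and
canonical_order are my own. It takes a base letter followed by two
combining marks with different canonical combining classes and shows
that NFD reorders the marks, so both input orders end up identical.

```python
import unicodedata

def ccc_list(s):
    # Canonical combining class of each code point (0 for base characters).
    return [unicodedata.combining(c) for c in s]

def canonical_order(s):
    # NFD decomposes and then applies canonical ordering to the marks.
    return unicodedata.normalize("NFD", s)

# a + COMBINING ACUTE ACCENT (ccc 230) + COMBINING DOT BELOW (ccc 220)
acute_first = "a\u0301\u0323"
# Same marks in the opposite order.
dot_first = "a\u0323\u0301"

print(ccc_list(acute_first))                  # classes before reordering
print(ccc_list(canonical_order(acute_first))) # classes after reordering
print(canonical_order(acute_first) == canonical_order(dot_first))
```

The reordering only makes sense over the run of marks attached to a
base, which is why the combining character sequence, not the grapheme
cluster, is the unit that matters here.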
Thanks for your help