Re: ? Wrong definitions for combining character sequence in tr 29

From: karl williamson (public@khwilliamson.com)
Date: Wed Nov 25 2009 - 13:13:13 CST

    Kenneth Whistler wrote:
    > Karl Williamson wrote:
    >
    >> Thanks for your reply. I'm afraid I'm still confused.
    >>
    >> The sentence before Table 1b is the first mention in this document of
    >> combining character sequences; it would be nice if it discussed what
    >> they were, and why even mention them at all? In the past, I just
    >> presumed they were an earlier concept that was superseded by grapheme
    >> clusters.
    >
    > It is an earlier concept. But it is not superseded by grapheme
    > clusters.
    >
    >> They are discussed some in 3.6 of the actual standard, and here there
    >> seem to me to be contradictions:
    >>
    >> "• A grapheme cluster is similar, but not identical to a combining
    >> character sequence. A combining character sequence starts with a base
    >> character and extends across any subsequent sequence of combining marks,
    >> nonspacing or spacing. A combining character sequence is most directly
    >> relevant to processing issues related to normalization, comparison, and
    >> searching.
    >> • A grapheme cluster starts with a grapheme base and extends across any
    >> subsequent sequence of nonspacing marks. A grapheme cluster is most
    >> directly relevant to text rendering and such processes as cursor
    >> placement and text selection in editing."
    >>
    >> This seems to me to imply that a base character is always the first item
    >> of a combining character sequence,
    >
    > Usually, yes, but not definitionally. Read D56 and D57 carefully.
    > A *defective* combining character sequence doesn't start with
    > a base character, but is a combining character sequence nonetheless.

    Shouldn't the first bullet item then read, "A grapheme
    cluster is similar, but not identical to a combining character sequence.
    A combining character sequence *generally* starts with a base character
    and extends across any subsequent sequence of combining marks, ..."?
    Also, the FAQ (http://unicode.org/faq/char_combmark.html#1) is wrong: it
    says "A combining character sequence is a base character followed by
    any number of combining characters." That should be "one or more"
    instead of "any number of". I presume you don't have to wait for a
    formal open comment period to revise the FAQs.

    >> and the word 'any' seems to me to
    >> imply 0 or more marks following it.
    >
    > For a grapheme cluster, yes. A single base character *is*
    > a grapheme cluster. It is *not* a combining character sequence.
    >
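
    To check my understanding of the distinction, here is a minimal Python
    sketch of how I read D56 and D57. The helper names (is_combining,
    is_base, classify) are hypothetical, the General_Category tests only
    approximate the standard's definitions of base and combining
    characters, and the sketch tests a whole string rather than finding
    maximal sequences inside longer text.

```python
import unicodedata

# Assumption: D56 lets ZWNJ (U+200C) and ZWJ (U+200D) extend a sequence.
JOINERS = {"\u200c", "\u200d"}


def is_combining(ch: str) -> bool:
    """Approximation: General_Category Mn, Mc, or Me, plus ZWJ/ZWNJ."""
    return unicodedata.category(ch).startswith("M") or ch in JOINERS


def is_base(ch: str) -> bool:
    """Approximation: a graphic character that is not a combining mark."""
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"


def classify(seq: str) -> str:
    """Classify a whole string under the simplified reading above."""
    if not seq:
        return "empty"
    first, rest = seq[0], seq[1:]
    # Base character followed by ONE OR MORE combining characters.
    if is_base(first) and rest and all(is_combining(c) for c in rest):
        return "combining character sequence"
    # No base character at the start: defective, but still a sequence.
    if all(is_combining(c) for c in seq):
        return "defective combining character sequence"
    return "not a combining character sequence"
```

    Under this reading, classify("e") reports that a lone base character
    is not a combining character sequence, while classify("\u0301")
    reports a defective one.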

    Another seeming contradiction in the documentation: Chapter 3 says
    that an extended grapheme cluster has no linguistic basis, which implies
    to me that a non-extended grapheme cluster does. Yet, based on the text
    in tr29, the extended version comes closest to matching what a user
    would view as a single character.
    >
    >> And this doesn't help me understand why there is the concept of a
    >> combining character sequence and why that is more relevant than a
    >> grapheme cluster to normalization, comparison, and searching.
    >
    > Normalization is not defined in terms of grapheme clusters.
    > Grapheme clusters are about segmentation issues in text (which
    > is why they are defined in UAX #29, the UAX about text segmentation).
    >
    > Normalization, on the other hand, is *definitionally* concerned
    > with combining character sequences, because at the core
    > of normalization is the canonical ordering of sequences of
    > combining marks. See the Canonical Ordering Algorithm subsection
    > of Section 3.11 Normalization Forms in the latest posted
    > version of the standard.
    >
    > --Ken
    >
    >
    >
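
    The role of canonical ordering can be illustrated with a small Python
    sketch. canonical_order is a hypothetical helper, not the full
    algorithm from Section 3.11: it only stable-sorts each run of
    characters with a non-zero Canonical_Combining_Class and performs no
    decomposition. It relies on unicodedata.combining from the standard
    library, and the demo compares against unicodedata.normalize.

```python
import unicodedata


def canonical_order(text: str) -> str:
    """Simplified sketch: reorder each run of combining marks by ccc."""
    out = []
    run = []  # pending characters with non-zero combining class
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter ends the run; sorted() is stable, so marks with
            # equal combining class keep their original relative order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


if __name__ == "__main__":
    # q + COMBINING DOT ABOVE (ccc 230) + COMBINING DOT BELOW (ccc 220)
    sample = "q\u0307\u0323"
    ordered = canonical_order(sample)
    print([format(ord(c), "04X") for c in ordered])
    # ['0071', '0323', '0307']
    print(ordered == unicodedata.normalize("NFD", sample))
```

    The same reordering applies to a defective sequence with no base
    character, which is one way to see why normalization is stated in
    terms of combining character sequences rather than grapheme clusters.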

    Thanks for your help


