There are a lot of good questions here. Some comments:
At 11:38 AM 3/6/02 -0600, Richard.Gillam@trilogy.com wrote:
> >The sole change required would be for the CGJ to be Me instead of Mn.
> >If we made this change, it would provide for a mechanism for
> >representing diacritics over multiple characters, without the addition
> >of any other characters -- or the wait for them to be encoded.
>
>Let me make sure I have this straight: Say you had ää and wanted to draw a
>breve over the whole pair of characters. How would you express that?
>
>I'm guessing that the answer is:
>
>a<umlaut><CGJ>a<umlaut><CGJ><breve>
>
>What it looks like is that you're taking the INVISIBLE ENCLOSING MARK
>which someone proposed a while back and giving this semantic to the
>CGJ. Seems like this gives the CGJ at least two distinct jobs:
>
>1) It causes the grapheme clusters on either side to be treated as a
>single grapheme cluster.
>2) It causes the preceding grapheme cluster to be treated as a single unit
>for the purpose of applying non-spacing marks (i.e., a non-spacing mark
>normally applies to the preceding base character; a CGJ causes it to be
>applied to the preceding grapheme cluster instead).
>
>The name "combining grapheme joiner" only suggests job #1 to me, and that
>makes me a little dubious about extending its charter to include job
>#2. Can we be completely confident that situations won't arise where the
>semantics of CGJ will be ambiguous, where you don't know for sure whether
>meaning #1 or meaning #2 is intended? Even if we can, will the double
>usage be confusing to people?
This is a tough question and like you, I suspect that we don't have the answer.
>I think you can disambiguate them by specifying the following rules...
>
>1) If CGJ is followed by a non-combining character, meaning #1 (the
>original CGJ meaning) is intended.
>2) If CGJ is followed by a combining character, meaning #2 (IEM) is intended.
>
>...but I don't know that this is a good idea.
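
A minimal sketch of how the two disambiguation rules quoted above could be applied mechanically, in Python. The function name `classify_cgj` and the two labels are made up for illustration, and "combining character" is approximated here as general category Mn, Mc, or Me. The rules as quoted say nothing about a CGJ at the very end of the text; the sketch assumes meaning #1 in that case.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

def classify_cgj(text):
    """Return (index, label) for each CGJ in text, per the two quoted rules."""
    results = []
    for i, ch in enumerate(text):
        if ch != CGJ:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else None
        if nxt is not None and unicodedata.category(nxt) in ("Mn", "Mc", "Me"):
            # Rule 2: followed by a combining character, so meaning #2 (IEM-like).
            results.append((i, "apply-mark-to-cluster"))
        else:
            # Rule 1: followed by a non-combining character, so meaning #1.
            # Assumption: a trailing CGJ is also treated as meaning #1.
            results.append((i, "join-clusters"))
    return results

# a<umlaut><CGJ>a<umlaut><CGJ><breve>
sample = "a\u0308" + CGJ + "a\u0308" + CGJ + "\u0306"
print(classify_cgj(sample))
```

Since CGJ is itself a combining character, a CGJ directly followed by another CGJ falls under rule 2 in this sketch, which may or may not be what was intended.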
>
>[start of off-topic rambling]
>
>I haven't read the most recent draft of Unicode 3.2 yet, but this whole
>grapheme-cluster thing has always felt rather ill-defined to me,
>especially when it comes to how grapheme clusters and combining marks
>behave. As I see it, grapheme clusters have the following purposes:
Be careful - Base character + combining marks are also 'clusters' (even
those not using CGJ) and many of these 'rules' do not apply to them.
>1) In a text-editing application, arrow keys generally move forward and
>back an entire grapheme cluster at a time.
Not true for clusters containing Mc (spacing combining marks)
>2) In a text-editing application, the backspace and delete keys generally
>delete whole grapheme clusters.
Not true for clusters containing Mc (spacing combining marks)
>3) Grapheme clusters are always kept together on a single line, even in
>cases where words aren't.
>4) A search on a piece of text shouldn't report a hit if the matching text
>doesn't begin and end on grapheme-cluster boundaries.
I suspect that this may not be true for clusters containing Mc (spacing
combining marks), but I may be wrong about this one. It depends on whether
it makes sense to allow searches for common prefixes regardless of whether
they are continued with an Mc.
>5) Language-sensitive comparison should generally treat grapheme clusters
>as single units (i.e., a grapheme cluster maps to a single collation
>element, not to one collation element for each component part).
Not true; o-umlaut may be collated as if it were oe under some tailorings.
Similar things may happen with other clusters. [I realize that this is not
the same as sorting o and umlaut as two units, but the simplistic 'one
cluster-one unit' rule is deceptive]
>6) Enclosing marks apply to the preceding grapheme cluster.
>7) Sometimes, the other combining marks also apply to the preceding
>grapheme cluster.
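
To illustrate the reply to item 5 above, here is a toy Python sort key under which o-umlaut sorts as if it were spelled "oe". This is only a string substitution, not a real collation tailoring (those work with weighted collation elements); `toy_sort_key` and the word list are invented for the example.

```python
import unicodedata

def toy_sort_key(word):
    """Toy key: one cluster (o-umlaut) is compared as two letters, 'o' then 'e'."""
    # NFC so that o + COMBINING DIAERESIS and precomposed o-umlaut behave alike.
    composed = unicodedata.normalize("NFC", word).lower()
    return composed.replace("\u00f6", "oe")

words = ["Gotz", "G\u00f6tz", "Goetze", "Gobel"]
print(sorted(words, key=toy_sort_key))
```

The point is only that a single cluster can contribute more than one unit to the comparison, so "one cluster, one unit" does not hold in general.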
>
>Leaving aside for a moment the fact that I'm not sure the same sequences
>of characters should be considered "grapheme clusters" for all of the
>above purposes, 6 and 7 bother me.
>
>The big problem with 6 is that we've stated that a combining character
>sequence is a grapheme cluster. An enclosing mark, being a combining
>character, would thus be part of a combining character sequence. So
>you've got some sequence of code points being treated as a "grapheme
>cluster" solely for the purpose of figuring out how to draw the enclosing
>mark. The enclosing mark gets treated as part of the same "grapheme
>cluster" as the characters it encloses (and, for that matter, any
>following combining marks) for all other purposes. You've got grapheme
>clusters inside grapheme clusters. This seems confusing and weird.
>
>7 is even more problematic. Unicode 3.2 says explicitly that a
>non-spacing mark applies to an entire Hangul syllable, and not just to the
>last jamo, when the syllable is spelled out in jamo, and that it does this
>because a Hangul syllable is a grapheme cluster. But when a "grapheme
>cluster" is formed with a CGJ, a non-spacing mark only applies to the last
>character in the grapheme cluster (unless, if we adopt this new rule, the
>last character happens to be another CGJ). It's not clear whether a
>generic non-spacing mark (such as a tilde or macron) would apply to an
>entire Indic syllable cluster (following the Hangul-syllable precedent) or
>just to the last component (following the CGJ precedent). And, of course,
>if a non-spacing mark follows a normal combining character sequence, it's
>just considered a component part of a grapheme cluster and applies to the
>preceding base character.
>
>Enclosing marks, on the other hand, always apply to the immediately
>preceding grapheme cluster, so they interact with grapheme clusters (and
>particularly with CGJ) differently from non-spacing marks. I'm guessing
>combining spacing marks interact with grapheme clusters the same way
>non-spacing marks do, but this isn't clear either. In any event, you've
>now got different types of combining marks behaving differently in ways
>that you didn't have before grapheme clusters were introduced, and this
>seems questionable.
No, Me always behaved differently. Take applying a circumflex to a sequence
ending in combining (enclosing) circle. Clearly the circumflex needs to be
positioned relative to the circle, i.e. centered, even if the base
character would have had a non-centered accent (e.g. accent on top of a J
or L might not be centered on the glyph, but centered on the stem). In
other words, Me always created an unanalyzable cluster.
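
For reference, the general categories involved in the circumflex-on-enclosing-circle case can be inspected with Python's standard `unicodedata` module. Note that this reports values from whatever Unicode version ships with the Python in use, not the 3.2 draft under discussion; `describe` is a helper name made up for this sketch.

```python
import unicodedata

def describe(text):
    """List (code point, general category, character name) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]

# J, then COMBINING ENCLOSING CIRCLE (Me), then COMBINING CIRCUMFLEX ACCENT (Mn)
for row in describe("J\u20dd\u0302"):
    print(*row)

# CGJ itself is currently a non-spacing mark (Mn).
print(*describe("\u034f")[0])
```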
>I think I'm coming around to the idea that there should be one concept of
>"a group of code points" that's used to affect analysis algorithms such as
>searching, sorting, line breaking, and arrow-key movement, and another
>"group of code points" that's used to affect mark positioning, and that
>different formatting characters be used to control the partitioning of
>code points into the different types of groups. In any event, I feel that
>the grapheme-cluster concept in its current state (or at least in its
>state as of about six weeks ago) isn't as well-thought-out as it should be.
Since Mc (the spacing combining marks) are typically edited one at a time,
and are not, like Mn (the non-spacing marks), fused into the cluster for
editing, it's tough to come up with a *single* set of rules that works for
all 7 of your contexts.
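
A rough sketch of the editing behaviour described above, assuming that only Mn and Me fuse with what precedes them while an Mc starts its own editing unit. This is a deliberate simplification for illustration, not the grapheme-cluster definition in the standard, and `edit_units` is a hypothetical name.

```python
import unicodedata

# Assumption: for cursor movement and deletion, only these categories
# attach to the preceding unit; Mc (spacing combining marks) do not.
FUSED = {"Mn", "Me"}

def edit_units(text):
    """Split text into the units an arrow key or backspace would act on."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch) in FUSED:
            units[-1] += ch   # non-spacing or enclosing mark: fuse
        else:
            units.append(ch)  # base character or spacing mark: new unit
    return units

print(len(edit_units("a\u0308b")))      # a + diaeresis (Mn), then b
print(len(edit_units("\u0915\u093e")))  # DEVANAGARI KA + VOWEL SIGN AA (Mc)
```

Because CGJ is Mn, this sketch attaches it to the preceding unit but does nothing to join the following base character, which is exactly the kind of context-specific rule that a single definition has trouble covering.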