Public Review Issue #27

Joiner/Nonjoiner in Combining Character Sequences

Unicode 4.0 describes the structure of Khmer syllables, saying that they may contain an interior ZWJ. There is a problem with this that needs to be resolved in 4.0.1, because some of the characters later in the syllable can be combining characters. This paper describes a proposal to fix this problem. As a part of the proposal, a choice has to be made between two alternatives.

Background

Here is the current (4.0) definition of combining character sequences:

D17 Combining character sequence: A character sequence consisting of either a base character followed by a sequence of one or more combining characters, or a sequence of one or more combining characters.

Thus a combining character sequence cannot contain a ZWJ or any other Cf. Any use of a ZWJ before a combining mark produces a defective combining character sequence (D17a), which isolates the combining mark from any preceding base character. This is clearly not the desired behavior, both in rendering and in any text analysis based on combining character sequences.
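
The isolation effect can be illustrated with a minimal sketch. It assumes a simplified reading of D17 in which combining characters are those with General Category M, base characters are approximated as categories L, N, P, S and Zs, and everything else (including Cf) belongs to neither class. The names kind and segment are hypothetical helpers, not part of any standard API, and a Latin letter stands in for a Khmer base purely for brevity.

```python
import unicodedata

def kind(ch):
    """Classify a character for this sketch: 'mark', 'base', or 'other'."""
    cat = unicodedata.category(ch)
    if cat.startswith("M"):
        return "mark"
    # Assumption: graphic, non-combining characters are approximated
    # by the L, N, P, S categories plus Zs.
    if cat[0] in "LNPS" or cat == "Zs":
        return "base"
    return "other"  # Cf, Cc, and so on

def segment(text):
    """Split text into (label, string) units following the 4.0 wording of D17."""
    units = []
    for ch in text:
        k = kind(ch)
        if k == "mark" and units and units[-1][0] != "other":
            label, s = units[-1]
            # A mark extends a base (making a full sequence) or a defective run.
            new_label = "ccs" if label in ("base", "ccs") else "defective"
            units[-1] = (new_label, s + ch)
        elif k == "mark":
            # No preceding base or mark to attach to: defective sequence.
            units.append(("defective", ch))
        else:
            units.append((k, ch))
    return units
```

Under these assumptions, segment("a\u0301") yields a single combining character sequence, while segment("a\u200D\u0301") yields three units: the base, the ZWJ on its own, and a defective sequence holding the acute.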

In terms of grapheme clusters, default grapheme clusters do not include ZWJ; as a matter of fact, default grapheme clusters, except for Hangul Jamo Syllables, are a subset of combining sequences. See Grapheme Cluster Boundaries. What constitutes a tailored grapheme cluster is up to a particular process, and so one could contain a ZWJ. However, any combining mark after a ZWJ does not apply to a previous base character within that tailored grapheme cluster, so the use of a ZWJ would isolate that combining mark. Such a sequence would not correspond to anything used in a natural language.

There was extensive discussion of this, and Ken Whistler did an analysis of the Cf characters, to see which, if any, are candidates for inclusion in combining character sequences. Based on that and subsequent discussion, the only two Cf characters that are viable candidates are:

200C ZERO WIDTH NON-JOINER
200D ZERO WIDTH JOINER

Proposal

The proposal is to allow both of these characters in combining character sequences, and define the scope of the joining behavior as follows: If a ZWJ or ZWNJ occurs directly before a base character (the current normal situation), it indicates a request that that base character and the previous base character have joining (resp. non-joining) forms. If a ZWJ or ZWNJ is within (instead of on the boundary of) a CCS, it should just affect the adjacent characters.

Note: of course, whenever a glyph changes shape, it can affect the shape and placement of neighboring characters. For example, when a ligature forms it can affect the placement of accents applied to the characters in that ligature.

Examples:

B C C C J B -- affects the two bases (current situation)
B C C J C B -- affects the 2nd and 3rd combining marks.
B C J C C B -- affects the 1st and 2nd combining marks.
B J C C C B -- affects the first B and the first C.

B = base, C = Combining mark, J = joiner or non-joiner
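
The scoping rule behind these examples can be sketched as follows. This is only an illustration over the B/C/J pattern notation used above, not over real text; the function name joiner_scope is hypothetical, and the sketch assumes exactly one J per pattern, with a character on each side of it.

```python
def joiner_scope(pattern):
    """Return the indices of the two characters affected by the single J.

    pattern is a string over 'B' (base), 'C' (combining mark) and
    'J' (joiner or non-joiner), e.g. "BCCJCB".
    """
    j = pattern.index("J")
    if j == 0 or j == len(pattern) - 1:
        raise ValueError("J needs a character on both sides")
    nxt = j + 1
    if pattern[nxt] == "B":
        # J directly before a base: it links that base with the previous base.
        prev = pattern.rfind("B", 0, j)
        if prev < 0:
            raise ValueError("no previous base character")
    else:
        # J inside a combining character sequence: it affects its neighbors.
        prev = j - 1
    return prev, nxt
```

For instance, joiner_scope("BCCCJB") gives (0, 5), the two bases, while joiner_scope("BCCJCB") gives (2, 4), the 2nd and 3rd combining marks.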

All effects are on the sequence as it would be after canonical reordering. Thus in the following, the joiner affects the shape of the <acute> and the <grave>.

A <acute> <dot> <joiner> <grave> B

Note: for Khmer, canonical reordering is not an issue, since all the marks are ccc=0.
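
The reordering in the A <acute> <dot> <joiner> <grave> B example can be checked with a short sketch. It assumes that <dot> means U+0323 COMBINING DOT BELOW (ccc=220), which is what makes it sort ahead of the acute (ccc=230); with a dot above there would be no reordering. The function name reorder_example is hypothetical.

```python
import unicodedata

def reorder_example():
    """Return the example sequence after canonical reordering (via NFD)."""
    # A, acute (ccc=230), dot below (ccc=220, an assumption), ZWJ, grave, B
    s = "A\u0301\u0323\u200D\u0300B"
    return unicodedata.normalize("NFD", s)

for ch in reorder_example():
    # Print each code point with its canonical combining class.
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")
```

The dot below moves in front of the acute, and the joiner (ccc=0) blocks any reordering across it, so the acute and the grave end up on either side of the joiner.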

Main Issue

Now, there are two ways to accomplish the above.

  1. Change the general category of ZWJ and ZWNJ to Mn.
  2. Keep the general category as Cf.

Both of these alternatives would still require other changes for a consistent model. The following describes some of the changes that would be needed under either alternative.

Here are the properties of the characters:
http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?ch=200C#here
http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?ch=200D#here
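
As an alternative to the links above, a few of the relevant properties can be inspected locally. The sketch below uses Python's standard unicodedata module, which reports values from whatever Unicode version the interpreter bundles, not necessarily 4.0; the helper name zw_properties is hypothetical.

```python
import unicodedata

def zw_properties():
    """Collect General Category, ccc and Bidi class for ZWNJ and ZWJ."""
    props = {}
    for ch in ("\u200C", "\u200D"):
        props[unicodedata.name(ch)] = {
            "gc": unicodedata.category(ch),         # General Category
            "ccc": unicodedata.combining(ch),       # canonical combining class
            "bidi": unicodedata.bidirectional(ch),  # Bidi class
        }
    return props

for name, values in zw_properties().items():
    print(name, values)
```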

BIDI Algorithm: They are BN, but also have special behavior in Joiners. So they can either stay Cf or change to Mn; it wouldn't make any difference in this algorithm.

Line Break: They are already treated as combining marks, so it makes no difference whether they are Cf or Mn.

Normalization. Would not need a formal change either way, since the algorithm is stated in terms of the CCC, not the general category. Some examples might be better changed slightly, since they use the term base character.

Identifiers. If the ZWJ and ZWNJ are really part and parcel of Khmer words, then it would be much better if they were Mn than Cf, since Cf characters are removed from identifiers. We would have to restructure the approach somewhat if we wanted them to be exceptionally left in.
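
The concern can be made concrete with a small sketch of the removal step. It assumes that removal means dropping every character whose General Category is Cf; the function name strip_format_chars is hypothetical, and Latin letters stand in for Khmer ones.

```python
import unicodedata

def strip_format_chars(identifier):
    """Drop all Cf characters, as in the identifier handling described above."""
    return "".join(ch for ch in identifier if unicodedata.category(ch) != "Cf")
```

With ZWJ and ZWNJ as Cf, strip_format_chars("a\u200Db") returns "ab", so two identifiers that differ only in a joiner would collapse into one.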

Script Analysis. If we don't change the GC to Mn, we should change the Script property value to Inherited.

Default Grapheme Cluster, Word, Sentence boundaries: If we don't change the GC to Mn, we need to change the composition of Grapheme_Extend to include ZWJ and ZWNJ.

Regular Expressions: Mostly fixed if we fix grapheme clusters, etc. However, if we don't change the GC to Mn, we need to change R1.4, item 2. There was some text to go into the next version on word break, saying that if you are using Perl-style word break, the best heuristic is to treat combining marks as letters, since that will give better results (because most combining marks are applied to letters). That would have to be changed to also include joiners. We would also need to change the word character definition in Table C.

Collation: Here is the data:

200C ; [.0000.0000.0000.0000] # [200C] ZERO WIDTH NON-JOINER
200D ; [.0000.0000.0000.0000] # [200D] ZERO WIDTH JOINER

Some other Mn and Cf characters are also completely ignorable, so that is not inconsistent either way; however, one should tailor Khmer in that case if the shape makes a semantic difference.

Definitions. The affected definitions are "applies" and "combining character sequence" in Chapter 3. The current 4.0.1 definition accepted by the UTC is:

D14 Combining character: a graphic character with the General Category of Combining Mark (M).

For clarity, the bullet should change to something like:

If we accept alternative 2 (keep the general category as Cf), then we would need to change "other combining marks" to "other combining marks or ZWJ or ZWNJ".