Grapheme break

L2/01-339

From: Mark Davis [mark@macchiato.com]

Sent: Monday, September 10, 2001 11:16 AM

Subject: UTC Agenda Item: Grapheme Break

I want to reiterate that I think we made a mistake by not including the inverse of GRAPHEME JOINER, and that we do need to take some action.

The much more important requirement from the field is for the inverse: the grapheme break (GB). This would be used to indicate that a particular sequence in a given language is *not* actually a grapheme cluster. That would allow, for example, Slovak dictionaries and databases to flag with a GB the 1% of cases where "ch" is to be sorted as two separate characters, rather than having to flag with a GJ the 99% of cases where it *is* considered a single character. [Another example, which has come up on the Unicode list recently, is "aa" in Danish.] It is clearly preferable to flag the exceptions rather than the normal cases in those languages.
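To make the Slovak case concrete, here is a minimal sketch of a sort key that treats "ch" as one collation unit unless a break marker sits between the two letters. Everything in it is an assumption for illustration: no GB code point is assigned, so a private-use character stands in for it; the alphabet is a simplified a-z ordering with "ch" placed after "h" and no accented letters; and the names GB, collation_units and sort_key are hypothetical.

```python
# Hypothetical stand-in for the proposed GRAPHEME BREAK. No such code point
# is assigned; a private-use character is used purely for illustration.
GB = "\uE000"

# Simplified Slovak-like ordering: a-z with the digraph "ch" after "h".
# Accented letters are deliberately omitted from this sketch.
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {unit: i for i, unit in enumerate(ALPHABET)}


def collation_units(word):
    """Split a word into collation units, honouring the GB marker."""
    units = []
    i = 0
    while i < len(word):
        if word[i] == GB:
            # The marker contributes nothing itself; it only keeps the
            # letters on either side from pairing up.
            i += 1
        elif word.startswith("ch", i):
            units.append("ch")
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def sort_key(word):
    """Map a word to a list of ranks; unknown letters raise KeyError."""
    return [RANK[unit] for unit in collation_units(word.lower())]
```

With this key, sorted(["chata", "hrad", "cesta"], key=sort_key) places "chata" after "hrad", while the same string with GB between "c" and "h" sorts among the "c" words. Only the exceptional words need the marker, which is the point of the argument above.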

Let's look at the alternatives for breaking grapheme clusters:

- ZWSP (aka 'allow line break'). Won't work, since it allows a line break at that point.

- SHY (soft hyphen). Won't work, since the position may not be a hyphenation point.

- ZWJ & ZWNJ. Won't work, since they can cause/break ligatures / cursive connections where not desired.

- ZWNBSP: May work. The only one that we might be able to overload is ZWNBSP (aka 'disallow line break'), or better yet, its new semantic replacement WORD JOINER. I believe that such an overload would work for Latin and most other scripts. It would not work for a script that:

(a) allows line breaks between letters and would thus need WORD JOINER to manually indicate specific positions that disallow a line break (Thai and other languages that break between letters), AND

(b) has multi-base character grapheme clusters in collation, AND

Scripts and/or situations in which all three of these conditions are fulfilled may be so unusual that we could stretch the semantics for WORD JOINER, and avoid encoding the inverse function as a separate character.

However, the biggest barrier to this is that the semantics conflict conceptually to such a high degree: *join* words vs *break* grapheme clusters.

- Others. The other possibilities are even uglier. Here is the set of Cf and Cc characters:

0000..001F ; Cc # [32] <control>..<control>

007F..009F ; Cc # [33] <control>..<control>

070F ; Cf # SYRIAC ABBREVIATION MARK

180B..180E ; Cf # [4] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN VOWEL SEPARATOR

200C..200F ; Cf # [4] ZERO WIDTH NON-JOINER..RIGHT-TO-LEFT MARK

202A..202E ; Cf # [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE

206A..206F ; Cf # [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES

FEFF ; Cf # ZERO WIDTH NO-BREAK SPACE

FFF9..FFFB ; Cf # [3] INTERLINEAR ANNOTATION ANCHOR..INTERLINEAR ANNOTATION TERMINATOR

1D173..1D17A ; Cf # [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE

E0001 ; Cf # LANGUAGE TAG

E0020..E007F ; Cf # [96] TAG SPACE..CANCEL TAG

I think by far the cleanest thing to do is to encode another character. However, should we decide against that, we need to decide which of the above should have its semantics enlarged ("stretched") to encompass the usage.
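For anyone wanting to re-derive the Cc and Cf inventory listed above, the following is a rough sketch using Python's standard unicodedata module. It reports whatever version of the character database ships with the interpreter, which is an assumption worth stating loudly: that version is not the one the list above was taken from, so some assignments will differ. The function name format_and_control_ranges is hypothetical.

```python
import unicodedata


def format_and_control_ranges():
    """Return (start, end, category) runs of Cc and Cf code points.

    Categories come from the Unicode data bundled with this Python
    build, not from the version the list in this message was based on.
    """
    runs = []
    for cp in range(0x110000):
        cat = unicodedata.category(chr(cp))
        if cat not in ("Cc", "Cf"):
            continue
        # Extend the previous run when it is contiguous and same category.
        if runs and runs[-1][2] == cat and runs[-1][1] == cp - 1:
            runs[-1] = (runs[-1][0], cp, cat)
        else:
            runs.append((cp, cp, cat))
    return runs


def format_run(start, end, cat):
    """Render one run in the 'XXXX..YYYY ; Cat' style used above."""
    span = f"{start:04X}" if start == end else f"{start:04X}..{end:04X}"
    return f"{span} ; {cat}"
```

Calling format_run(*run) on each element of format_and_control_ranges() gives lines in the same shape as the table, which makes a side-by-side comparison straightforward.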