L2/01-339
From: Mark Davis [mark@macchiato.com]
Sent: Monday, September 10, 2001 11:16 AM
Subject: UTC Agenda Item: Grapheme Break
I want to reiterate that I think we made a mistake by not including the
inverse of GRAPHEME JOINER, and that we do need to take some action.
The much more important requirement from the field is for the inverse: the
grapheme break (GB). This would be used to indicate that a particular sequence
in a given language is *not* actually a grapheme cluster. That would allow, for
example, Slovak dictionaries and databases to flag with a GB the 1% of cases
where "ch" in Slovak is to be sorted as two separate characters, and would not
require flagging with a GJ the 99% of cases where it *is* considered a single
character in Slovak. [Another example, which has come up on the Unicode list
recently, is "aa" in Danish.] It is clearly preferable to flag the exceptions
rather than the normal cases in those languages.
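To make the Slovak case concrete, here is a minimal sketch of how a collation front end might split text into sort units when only the exceptions are flagged. It is an illustration under stated assumptions, not a proposal for an implementation: no GRAPHEME BREAK character is encoded, so `GB` below is a private-use stand-in, `collation_units` is a hypothetical helper, and the sample word is assumed to be a compound in which "c" and "h" belong to different parts.

```python
# Hypothetical stand-in: no GRAPHEME BREAK character is encoded, so a
# private-use code point is used purely for illustration.
GB = "\uE000"


def collation_units(text, contractions=("ch",), gb=GB):
    """Split text into sort units, treating each contraction as one unit
    unless a GB marker sits between its letters."""
    units = []
    i = 0
    while i < len(text):
        if text[i] == gb:
            # The marker only blocks a contraction; it is not a unit itself.
            i += 1
            continue
        match = next((c for c in contractions if text.startswith(c, i)), None)
        if match is not None:
            units.append(match)
            i += len(match)
        else:
            units.append(text[i])
            i += 1
    return units


# Normal case: "ch" sorts as a single letter, with no markup needed.
print(collation_units("chata"))
# Exception: the marker keeps "c" and "h" apart.
print(collation_units("viac" + GB + "hlasný"))
```

Only the second word carries any markup, which is the point of flagging the exceptions rather than the normal cases.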
Let's look at the alternatives for breaking grapheme clusters:
- ZWSP (aka 'allow line break'). Won't work, since it allows a line break at
  that point.
- SHY (soft hyphen). Won't work, since the position may not be a hyphenation
  point.
- ZWJ & ZWNJ. Won't work, since they can cause or break ligatures and cursive
  connections where not desired.
- ZWNBSP: May work. The only one that we might be able to overload is ZWNBSP
  (aka 'disallow line break'), or better yet, its new semantic replacement WORD
  JOINER. I believe that such an overload would work for Latin and most other
  scripts. It would not work for a script that:
  (a) allows line break between letters and would thus need WORD JOINER to
      manually indicate specific positions that disallow linebreak (Thai and
      other languages that break between letters), AND
  (b) has multi-base character grapheme clusters in collation, AND
  (c) sometimes treats those multi-base character grapheme clusters as separate
      letters in collation.
  Scripts and/or situations in which all three of these conditions are
  fulfilled may be so unusual that we could stretch the semantics for WORD
  JOINER, and avoid encoding the inverse function as a separate character.
  However, the biggest barrier to this is that the semantics conflict
  conceptually to such a high degree: *join* words vs *break* grapheme clusters.
- Others. The other possibilities are even uglier. Here is the set of Cf and
  Cc characters:
0000..001F    ; Cc # [32] <control>..<control>
007F..009F    ; Cc # [33] <control>..<control>
070F          ; Cf #      SYRIAC ABBREVIATION MARK
180B..180E    ; Cf #  [4] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN VOWEL SEPARATOR
200C..200F    ; Cf #  [4] ZERO WIDTH NON-JOINER..RIGHT-TO-LEFT MARK
202A..202E    ; Cf #  [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
206A..206F    ; Cf #  [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
FEFF          ; Cf #      ZERO WIDTH NO-BREAK SPACE
FFF9..FFFB    ; Cf #  [3] INTERLINEAR ANNOTATION ANCHOR..INTERLINEAR ANNOTATION TERMINATOR
1D173..1D17A  ; Cf #  [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0001         ; Cf #      LANGUAGE TAG
E0020..E007F  ; Cf # [96] TAG SPACE..CANCEL TAG
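For anyone who wants to regenerate a list like the one above, the following is a rough sketch using Python's standard `unicodedata` module. `category_ranges` is a hypothetical helper name. Note the important caveat: the module reflects whatever Unicode version ships with the interpreter, so the Cf ranges it reports will not match the list in this message, which reflects the character properties at the time of writing. The Cc ranges are expected to match.

```python
import sys
import unicodedata


def category_ranges(category, limit=sys.maxunicode + 1):
    """Return (first, last) code point ranges below `limit` whose
    General_Category equals `category`, according to this interpreter's
    bundled Unicode data."""
    ranges = []
    start = None
    for cp in range(limit):
        if unicodedata.category(chr(cp)) == category:
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges


# Print the control characters in the same layout as the list above.
for first, last in category_ranges("Cc", limit=0x100):
    print(f"{first:04X}..{last:04X} ; Cc # [{last - first + 1}]")
```

Calling `category_ranges("Cf")` gives the format characters for the interpreter's Unicode version; comparing that output with the list above shows how the candidate set has changed since.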
I think by far the cleanest thing to do is to encode another character.
However, should we decide against that, we need to decide which of the above
should have its semantics enlarged ("stretched") to encompass the usage.