L2/02-045R
To: UTC
Re: Rules for default grapheme clustering
From: Kent Karlsson
Date: 2002-02-04
Grapheme clustering must be done in
a way that is independent of where in a combining sequence a Grapheme_link
(combining) character occurs, since most Grapheme_links are of combining class
9, and thus are movable when doing canonical reordering. Therefore a Grapheme_link need not be last in
a combining sequence, and even if it is, it need not be the last combining
character in the sequence after normalisation.
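The reordering effect described above can be observed with Python's standard `unicodedata` module. This is a minimal sketch; it assumes that DEVANAGARI SIGN VIRAMA (U+094D, combining class 9) is a representative Grapheme_link character and uses DEVANAGARI SIGN NUKTA (U+093C, combining class 7) as the other combining mark. The variable names are illustrative only.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA (base letter)
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, canonical combining class 9
NUKTA = "\u093C"   # DEVANAGARI SIGN NUKTA, canonical combining class 7

# The link character is entered before the nukta ...
typed = KA + VIRAMA + NUKTA

# ... but canonical reordering sorts adjacent marks by combining class,
# so the class-9 link character moves behind the class-7 nukta.
normalized = unicodedata.normalize("NFD", typed)

for ch in normalized:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")
```

After normalisation the link character is no longer adjacent to the base letter, so a clustering rule keyed to a fixed position of the link inside the combining sequence would give different results for canonically equivalent strings.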
The rules for grapheme clustering must therefore not depend on where
in a combining sequence the link character occurs, nor on where in
a combining sequence an "enclosing+" character (a class wider than
general category Me) occurs. The
latter does (and should), however, affect the "scope" of ensuing
combining characters in the same grapheme cluster: an A (an "enclosing+" character) and any combining
characters following it in the same combining sequence apply to the
entire preceding part of the grapheme cluster, not just to its last
letter. Nesting of clusters is prevented because
an occurrence of an A stops any further clustering.
(Note
that grapheme clustering is often, but not always, related to collation
clustering; Hangul is a major exception.)
Definitions of symbols used in
the rules below:
CR: Carriage Return.
LF: Line Feed.
VT: Line Tabulation.
FF: Form Feed.
JoinControl:
Combining: Any combining mark (M&). This includes all characters in Link, the variation selectors, and all characters in EnclosingCombining and NonEnclosingCombining.
EnclosingCombining:
NonEnclosingCombining: A combining mark that is not enclosing: every character in Combining that is not in EnclosingCombining.
Link:
LogicalOrderException:
NonCombiningExtender:
SymbolBase:
LetterBase:
L: Hangul leading jamo U+1100..U+115F.
V: Hangul vowel jamo U+1160..U+11A2.
T: Hangul trailing jamo U+11A8..U+11F9.
LV: Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V>.
LVT: Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V,T>.
Any: Any character (includes all of the above).
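As an illustration of the Hangul symbols defined above, the following sketch assigns one of L, V, T, LV or LVT to a character. The jamo ranges are the ones listed in the definitions; the LV/LVT split relies on the standard arithmetic layout of the precomposed syllable block (U+AC00..U+D7A3, 28 trailing-consonant slots per LV pair). The function name `hangul_class` is hypothetical.

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable
S_COUNT = 11172   # number of precomposed syllables (19 * 21 * 28)
T_COUNT = 28      # trailing slots per <L,V> pair, slot 0 meaning "no T"


def hangul_class(ch):
    """Return 'L', 'V', 'T', 'LV', 'LVT', or None for non-Hangul input."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A2:
        return "V"
    if 0x11A8 <= cp <= 0x11F9:
        return "T"
    if S_BASE <= cp < S_BASE + S_COUNT:
        # A syllable with trailing index 0 decomposes to <L,V>, otherwise <L,V,T>.
        return "LV" if (cp - S_BASE) % T_COUNT == 0 else "LVT"
    return None


print(hangul_class("\uAC00"), hangul_class("\uAC01"))
```

The ranges are hard-coded from this document and are not meant to track later additions to the jamo blocks.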
Rules for where there is no grapheme
break:
Do not break between a CR and LF, VT or FF (assuming that reports 13 and 14 are revised similarly).
CR × (LF | VT | FF)    (1)
Do not break conjoining Hangul sequences. There is no break between L and T, since a minimal insertion of fillers gives L Vf T.
L × (L | V | T | LV | LVT)    (2)
(V | LV) × (V | T)    (3)
(T | LVT) × T    (4)
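Rules (2) to (4) amount to a small pairwise lookup on the Hangul symbols. A minimal sketch, assuming the two characters have already been classified as L, V, T, LV or LVT; the function name `hangul_no_break` is hypothetical.

```python
def hangul_no_break(left, right):
    """True if rules (2)-(4) forbid a break between the two Hangul classes."""
    if left == "L":                    # rule (2)
        return right in {"L", "V", "T", "LV", "LVT"}
    if left in {"V", "LV"}:            # rule (3)
        return right in {"V", "T"}
    if left in {"T", "LVT"}:           # rule (4)
        return right == "T"
    return False


# L directly followed by T stays together, as stated above.
print(hangul_no_break("L", "T"))
# A trailing jamo followed by a new leading jamo is not covered by rules (2)-(4).
print(hangul_no_break("T", "L"))
```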
Do not break between a base character and a combining mark, or within a sequence of combining marks.
(SymbolBase | LetterBase | Combining) × Combining    (5)
Do not break (by default) between non-combining preextenders (these have the property Logical_order_exception) and a letter, nor between a letter combining sequence and a non-combining postextender (letter modifiers, and some Thai and Lao vowels that would have been combining if the Brahmic script model had been followed fully).
LogicalOrderException × LetterBase    (6)
(LetterBase Combining*) × NonCombiningExtender    (7)
Do not break conjoining combining sequences. The following two rules apply if and only if the match of NonEnclosingCombining+ contains at least one Link character (such a character is combining).
(LetterBase NonEnclosingCombining+) × JoinControl    (8)
(LetterBase NonEnclosingCombining+ JoinControl) × LetterBase    (9)
If none of the above applies, break between any (other) adjacent pair of characters.
Any ÷ Any    (10)
Break at beginning and end of text.
beginning of text ÷ Any    (11)
Any ÷ end of text    (12)
Note that
a Link in a letter/digit based combining sequence makes it (the combining
sequence) “conjoin” with the next letter/digit combining sequence, but that an
EnclosingCombining in the combining sequence makes it non-conjoining and
overrides any Link (before or after); this prevents nesting. Note also that an EnclosingCombining
character and any follow-on combining characters apply to (the
preceding part of) the cluster, not just the last base in it.
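The rules and the note above can be exercised with a small sketch that works on class labels rather than on code points. The caller supplies one label per character, because the memberships of several classes are not spelled out in this document. The sketch assumes that every Link character also counts as NonEnclosingCombining, as the condition on rules (8) and (9) implies; it applies those two rules literally as written, and it does not model variation selectors separately. The names `conjoining_prefix`, `no_break` and `clusters` are hypothetical.

```python
NON_ENCLOSING = {"Link", "NonEnclosingCombining"}  # assumption: Link is a subset
COMBINING = NON_ENCLOSING | {"EnclosingCombining"}
BASE_OR_COMBINING = {"SymbolBase", "LetterBase"} | COMBINING


def conjoining_prefix(classes, end):
    """True if classes[:end] ends in LetterBase NonEnclosingCombining+ with a Link."""
    i, seen_link = end - 1, False
    while i >= 0 and classes[i] in NON_ENCLOSING:
        seen_link = seen_link or classes[i] == "Link"
        i -= 1
    return seen_link and i >= 0 and classes[i] == "LetterBase"


def no_break(classes, i):
    """True if rules (1)-(9) forbid a break between classes[i-1] and classes[i]."""
    a, b = classes[i - 1], classes[i]
    if a == "CR" and b in {"LF", "VT", "FF"}:                  # rule (1)
        return True
    if a == "L" and b in {"L", "V", "T", "LV", "LVT"}:         # rule (2)
        return True
    if a in {"V", "LV"} and b in {"V", "T"}:                   # rule (3)
        return True
    if a in {"T", "LVT"} and b == "T":                         # rule (4)
        return True
    if a in BASE_OR_COMBINING and b in COMBINING:              # rule (5)
        return True
    if a == "LogicalOrderException" and b == "LetterBase":     # rule (6)
        return True
    if b == "NonCombiningExtender":                            # rule (7)
        j = i - 1
        while j >= 0 and classes[j] in COMBINING:
            j -= 1
        if j >= 0 and classes[j] == "LetterBase":
            return True
    if b == "JoinControl" and conjoining_prefix(classes, i):   # rule (8)
        return True
    if a == "JoinControl" and b == "LetterBase" and conjoining_prefix(classes, i - 1):
        return True                                            # rule (9)
    return False                                               # rule (10): break


def clusters(classes):
    """Split a list of class labels into grapheme clusters (rules (11), (12) implicit)."""
    out, start = [], 0
    for i in range(1, len(classes)):
        if not no_break(classes, i):
            out.append(classes[start:i])
            start = i
    if classes:
        out.append(classes[start:])
    return out


# A Link lets the sequence conjoin across a JoinControl ...
print(clusters(["LetterBase", "Link", "JoinControl", "LetterBase"]))
# ... but an EnclosingCombining in the same sequence overrides the Link.
print(clusters(["LetterBase", "Link", "EnclosingCombining", "JoinControl", "LetterBase"]))
```

Because `conjoining_prefix` stops at the first EnclosingCombining it meets, an enclosing mark on either side of the Link blocks rules (8) and (9), which is one way to obtain the non-nesting behaviour described in the note.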