L2/02-045
To: UTC
Re: Rules for default grapheme clustering
From: Kent Karlsson
Date: 2001-01-28
Grapheme clustering must not depend on where in a combining sequence a Grapheme_link (combining) character occurs: most Grapheme_links have combining class 9 and can therefore move under canonical reordering. A Grapheme_link thus need not be last in a combining sequence, and even if it is, it need not be the last combining character in the sequence after normalisation.
The rules for grapheme clustering must therefore be independent of where in a combining sequence the link character occurs, and likewise of where in a combining sequence an "enclosing+" character (a class broader than general category Me) occurs. The position of the latter does (and should), however, affect the "scope" of ensuing combining characters in the same grapheme cluster: an A (an "enclosing+" character) and the combining characters that follow it in the same combining sequence apply to the entire preceding part of the grapheme cluster, not just to its last letter. Nested clustering is prevented, because the occurrence of an A blocks any further clustering.
(Note that grapheme clustering is often, but not always, related to collation clustering; Hangul is a major exception.)
Definitions of symbols used in the rules below:
CR |
Carriage Return. |
LF |
Line Feed. |
VT |
Line Tabulation. |
FF |
Form Feed. |
JoinControl |
|
Combining |
Any combining mark (M&). This includes all characters in Link and the variation selectors, as well as all of EnclosingCombining and NonEnclosingCombining. |
EnclosingCombining |
|
NonEnclosingCombining |
A combining mark that is not enclosing: all in Combining that are not in EnclosingCombining. |
Link |
|
LogicalOrderException |
|
NonCombiningExtender |
|
SymbolBase |
|
LetterBase |
|
L |
Hangul leading jamo U+1100..U+115F. |
V |
Hangul vowel jamo U+1160..U+11A2. |
T |
Hangul trailing jamo U+11A8..U+11F9. |
LV |
Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V>. |
LVT |
Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V,T>. |
Any |
Any character (includes all of the above). |
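As an illustration of the definitions above, the following is a minimal Python sketch of a classifier for the symbols that are fully specified in this document: the line-control characters, the Hangul jamo ranges as listed, LV and LVT, and Combining. The function name `symbol_class` and the fallback label "Other" are hypothetical. LV versus LVT is derived from the standard arithmetic layout of the precomposed Hangul syllable block (U+AC00..U+D7A3, 28 trailing slots per leading/vowel pair). Symbols whose definitions are not given here (JoinControl, Link, EnclosingCombining, LogicalOrderException, NonCombiningExtender, SymbolBase, LetterBase) are deliberately not modelled.

```python
import unicodedata

S_BASE, S_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllable block
T_COUNT = 28                     # trailing-consonant slots per <L,V> pair


def symbol_class(ch: str) -> str:
    """Return the symbol name for one character (partial sketch)."""
    cp = ord(ch)
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if ch == "\x0b":
        return "VT"
    if ch == "\x0c":
        return "FF"
    # Jamo ranges exactly as listed in the definitions above.
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A2:
        return "V"
    if 0x11A8 <= cp <= 0x11F9:
        return "T"
    if S_BASE <= cp <= S_LAST:
        # A syllable with trailing index 0 decomposes to <L,V>, otherwise <L,V,T>.
        return "LV" if (cp - S_BASE) % T_COUNT == 0 else "LVT"
    # Combining: any general category starting with M (Mn, Mc, Me).
    if unicodedata.category(ch).startswith("M"):
        return "Combining"
    return "Other"  # hypothetical label: no more specific symbol modelled here
```

For example, `symbol_class("\uAC00")` gives "LV" and `symbol_class("\uAC01")` gives "LVT", since the second syllable carries a trailing consonant.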
Rules for where there is no grapheme break:
Do not break between a CR and LF, VT or FF (assuming that reports 13 and 14 are revised similarly). |
|||
CR |
× |
(LF | VT | FF) |
(1) |
Do not break Hangul syllable sequences. There is no break between L and T since a minimal insertion of fillers gives L Vf T. Combining characters are included at the right hand side here, since L, V, T, LV, and LVT are autoconjoining, and are therefore not included in LetterBase. |
|||
L |
× |
(L | V | T | LV | LVT | Combining) |
(2) |
(V | LV) |
× |
(V | T | Combining) |
(3) |
(T | LVT) |
× |
(T | Combining) |
(4) |
Do not break between a base character and a combining mark, or within a sequence of combining marks. |
|||
(SymbolBase | LetterBase | Combining) |
× |
Combining |
(5) |
Do not break (by default) between non-combining preextenders (these have the property Logical_order_exception) and a letter, nor between a letter combining sequence and a non-combining postextender (letter modifiers, and some Thai and Lao vowels that would have been combining if the Brahmic script model had been followed fully). |
|||
LogicalOrderException |
× |
LetterBase |
(6) |
(LetterBase Combining*) |
× |
NonCombiningExtender |
(7) |
The following two rules apply only if, in addition, the text matched by NonEnclosingCombining+ contains at least one Link character (such a character is combining). |
|||
(LetterBase NonEnclosingCombining+) |
× |
JoinControl |
(8) |
(LetterBase NonEnclosingCombining+ JoinControl) |
× |
LetterBase |
(9) |
If none of the above is true, break after any character. |
|||
Any |
÷ |
|
(10) |
Note that a Link in a letter/digit based combining sequence makes it (the combining sequence) “conjoin” with the next letter/digit combining sequence, but that an EnclosingCombining in the combining sequence makes it non-conjoining and overrides any Link (before or after); this prevents nesting. Note also that an EnclosingCombining character and any follow-on combining characters apply to (the preceding part of) the cluster, not just the last base in it.
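The pairwise rules can be sketched in Python as follows. This is only an illustration under stated assumptions, not a complete implementation: it covers rules (1) to (5) and the default rule (10). Rules (6) to (9) are omitted because they depend on LogicalOrderException, NonCombiningExtender, Link and JoinControl, whose membership is not given in this document. The names `classify`, `no_break` and `clusters` are hypothetical, and the class "Base" is an assumed stand-in for SymbolBase | LetterBase (here: any character that is not a control, separator, combining mark or Hangul element).

```python
import unicodedata


def classify(ch: str) -> str:
    """Map one character to a symbol used by the rules (partial sketch)."""
    cp = ord(ch)
    if ch == "\r":
        return "CR"
    if ch in ("\n", "\x0b", "\x0c"):
        return "LF_VT_FF"  # right-hand side of rule (1)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A2:
        return "V"
    if 0x11A8 <= cp <= 0x11F9:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        # 28 trailing slots per <L,V> pair; slot 0 means no trailing jamo.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    cat = unicodedata.category(ch)
    if cat.startswith("M"):
        return "Combining"
    if cat[0] in "CZ":
        return "Other"  # controls and separators: not treated as a base
    return "Base"       # ASSUMPTION: stands in for SymbolBase | LetterBase


def no_break(a: str, b: str) -> bool:
    """True if rules (1)-(5) forbid a break between classes a and b."""
    if a == "CR" and b == "LF_VT_FF":                                   # (1)
        return True
    if a == "L" and b in {"L", "V", "T", "LV", "LVT", "Combining"}:     # (2)
        return True
    if a in {"V", "LV"} and b in {"V", "T", "Combining"}:               # (3)
        return True
    if a in {"T", "LVT"} and b in {"T", "Combining"}:                   # (4)
        return True
    if a in {"Base", "Combining"} and b == "Combining":                 # (5)
        return True
    return False                                                        # (10)


def clusters(text: str) -> list[str]:
    """Split text into clusters by testing each adjacent character pair."""
    out: list[str] = []
    for ch in text:
        if out and no_break(classify(out[-1][-1]), classify(ch)):
            out[-1] += ch
        else:
            out.append(ch)
    return out
```

With this sketch, `clusters("e\u0301a")` keeps the acute accent with its base and yields two clusters, while `clusters("\uAC00\u11A8\uAC01")` attaches the trailing jamo to the preceding LV syllable and breaks before the following LVT syllable. Because each decision looks only at one adjacent pair, the sketch cannot express the Link condition on rules (8) and (9), which needs to inspect the whole combining sequence.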