L2/02-045R

To: UTC

Re: Rules for default grapheme clustering

From: Kent Karlsson

Date: 2002-02-04

 

            Grapheme clustering must not depend on where in a combining sequence a Grapheme_Link (combining) character occurs. Most Grapheme_Link characters have canonical combining class 9 and can therefore move during canonical reordering. A Grapheme_Link thus need not be last in a combining sequence, and even if it is, it need not be the last combining character in the sequence after normalisation.
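The reordering effect can be checked with Python's standard unicodedata module. This is only an illustrative sketch: the pairing of a Devanagari virama with COMBINING ACUTE ACCENT is a contrived sequence chosen for this example, not one taken from this document.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA (a base letter)
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, canonical combining class 9
ACUTE = "\u0301"   # COMBINING ACUTE ACCENT, canonical combining class 230

# The virama is typed last in the combining sequence...
original = KA + ACUTE + VIRAMA
# ...but canonical reordering sorts marks by combining class, so it moves.
normalized = unicodedata.normalize("NFD", original)

print([unicodedata.combining(c) for c in original])  # [0, 230, 9]
print([f"U+{ord(c):04X}" for c in normalized])       # ['U+0915', 'U+094D', 'U+0301']
```

After normalisation the virama precedes the accent, so a rule that looked only at the last combining character would give different results for canonically equivalent strings.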

            The rules for grapheme clustering must therefore be independent of where in a combining sequence the link character occurs and, for that matter, of where in a combining sequence an "enclosing+" character (a class broader than general category Me) occurs.  The position of the latter does (and should), however, affect the "scope" of ensuing combining characters in the same grapheme cluster:  an A (an "enclosing+" character) and the combining characters following an A in the same combining sequence apply to the entire preceding part of the grapheme cluster, not just to its last letter.  Nested clustering is prevented because the occurrence of an A stops any further clustering.

            (Note that grapheme clustering is often, but not always, related to collation clustering; Hangul is a major exception.)

            Definitions of symbols used in the rules below:

 

CR

Carriage Return.

LF

Line Feed.

VT

Line Tabulation.

FF

Form Feed.

JoinControl

Join_Control, as determined by the UCD.

Combining

Any combining mark (M&). This includes all characters in Link, the variation selectors, and all characters in EnclosingCombining and NonEnclosingCombining.

EnclosingCombining

Enclosing_Combining (not yet in the UCD), as determined by the UCD.  A combining mark that is enclosing.  Includes all Me characters, and all combining Brahmic-derived dependent vowels.

NonEnclosingCombining

A combining mark that is not enclosing: all in Combining that are not in EnclosingCombining.

Link

Grapheme_Link, as determined by the UCD.  Includes linking viramas and the combining grapheme joiner.

LogicalOrderException

Logical_Order_Exception, as determined by the UCD.  Some Thai and Lao vowels.

NonCombiningExtender

Grapheme_Extend (MODIFIED!!), as determined by the UCD.  Lm, and some Thai and Lao vowels.  [NonCombiningExtender = Lm + 0e30 + 0e32 + 0e33 + 0e45 + 0eb0 + 0eb2 + 0eb3 + 0ebd]       (what about TAMIL SIGN VISARGA?)

SymbolBase

Isolated_Base (not yet in the UCD), as determined by the UCD.  Symbols and punctuation (including spaces).  [SymbolBase = P& + S& + Zs + Cn + Co + LogicalOrderException + NonCombiningExtender]

LetterBase

Grapheme_Base (MODIFIED!!), as determined by the UCD.  Includes L, V, T, LV, LVT (even though they are autoconjoining).  Does not include symbols or punctuation.  [LetterBase = L& + N& – LogicalOrderException – NonCombiningExtender]

L

Hangul leading jamo U+1100..U+115F.

V

Hangul vowel jamo U+1160..U+11A2.

T

Hangul trailing jamo U+11A8..U+11F9.

LV

Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V>.

LVT

Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V,T>.

Any

Any character (includes all of the above).
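The bracketed formulas for NonCombiningExtender, SymbolBase and LetterBase can be approximated from general categories. The following is a rough sketch only. It assumes that "X&" means every general category beginning with X, and that LogicalOrderException covers U+0E40..U+0E44 and U+0EC0..U+0EC4 (the text above says only "some Thai and Lao vowels"). The function name classify is hypothetical. It returns the most specific label, so a character labelled LogicalOrderException or NonCombiningExtender is also a member of SymbolBase per the formula above.

```python
import unicodedata

# Code points listed explicitly in the NonCombiningExtender formula.
NON_COMBINING_EXTENDER_EXTRA = {
    0x0E30, 0x0E32, 0x0E33, 0x0E45, 0x0EB0, 0x0EB2, 0x0EB3, 0x0EBD,
}
# Assumption: the Thai and Lao prefixed vowels.
LOGICAL_ORDER_EXCEPTION = set(range(0x0E40, 0x0E45)) | set(range(0x0EC0, 0x0EC5))


def classify(ch):
    """Return the most specific class label for one character (sketch)."""
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cp in LOGICAL_ORDER_EXCEPTION:
        return "LogicalOrderException"
    if cat == "Lm" or cp in NON_COMBINING_EXTENDER_EXTRA:
        return "NonCombiningExtender"
    if cat[0] in "LN":                      # L& + N& minus the two sets above
        return "LetterBase"
    if cat[0] in "PS" or cat in ("Zs", "Cn", "Co"):
        return "SymbolBase"
    if cat[0] == "M":                       # M&
        return "Combining"
    return "Other"                          # controls, format characters, etc.


for sample in ["a", "5", "!", " ", "\u0301", "\u0E40", "\u0E30", "\u02B0"]:
    print(f"U+{ord(sample):04X}", classify(sample))
```

The sketch does not distinguish Link, EnclosingCombining or JoinControl, since those depend on property data that the standard library does not expose.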

 

            Rules for where there is no grapheme break:

 

Do not break between a CR and LF, VT or FF (assuming that reports 13 and 14 are revised similarly).

CR × (LF | VT | FF)    (1)

Do not break conjoining Hangul sequences.  There is no break between L and T since a minimal insertion of fillers gives L  Vf  T.

L × (L | V | T | LV | LVT)    (2)

(V | LV) × (V | T)    (3)

(T | LVT) × T    (4)
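Rules (2) to (4) can be sketched as a small lookup table. The jamo ranges are the ones given in the definitions; the test for LV versus LVT relies on the standard arithmetic layout of the precomposed syllable block U+AC00..U+D7A3 (28 syllables per leading/vowel pair, the first of which has no trailing consonant). The names hangul_class and hangul_no_break are hypothetical.

```python
def hangul_class(ch):
    """Return 'L', 'V', 'T', 'LV', 'LVT', or None for non-Hangul."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A2:
        return "V"
    if 0x11A8 <= cp <= 0x11F9:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        # Every 28th precomposed syllable has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None


# Left class -> set of right classes before which there is no break.
NO_BREAK = {
    "L": {"L", "V", "T", "LV", "LVT"},   # rule (2)
    "V": {"V", "T"}, "LV": {"V", "T"},   # rule (3)
    "T": {"T"}, "LVT": {"T"},            # rule (4)
}


def hangul_no_break(left, right):
    """True if rules (2)-(4) forbid a break between the two characters."""
    return hangul_class(right) in NO_BREAK.get(hangul_class(left), ())


print(hangul_no_break("\u1100", "\u1161"))  # L × V   -> True
print(hangul_no_break("\u1100", "\u11A8"))  # L × T   -> True
print(hangul_no_break("\uAC01", "\u1161"))  # LVT ÷ V -> False
```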

Do not break between a base character and a combining mark, or within a sequence of combining marks.

(SymbolBase | LetterBase | Combining) × Combining    (5)

Do not break (by default) between a non-combining preextender (these have the property Logical_Order_Exception) and a letter, nor between a letter combining sequence and a non-combining postextender (letter modifiers, and some Thai and Lao vowels that would have been combining if the Brahmic script model had been followed fully).

LogicalOrderException × LetterBase    (6)

(LetterBase Combining*) × NonCombiningExtender    (7)
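Rules (5) to (7) can be sketched over a list of class labels rather than characters, which sidesteps the question of where the property data comes from. The labels are assumed to be the most specific class of each character; since the SymbolBase formula includes LogicalOrderException and NonCombiningExtender, both count as bases for rule (5). The function name no_break_5_to_7 is hypothetical.

```python
# Everything that may precede a Combining under rule (5).
RULE_5_LEFT = {
    "SymbolBase", "LetterBase", "Combining",
    "LogicalOrderException", "NonCombiningExtender",  # members of SymbolBase
}


def no_break_5_to_7(classes, i):
    """True if rules (5)-(7) forbid a break before classes[i] (i >= 1)."""
    prev, cur = classes[i - 1], classes[i]
    if cur == "Combining" and prev in RULE_5_LEFT:
        return True                                   # rule (5)
    if prev == "LogicalOrderException" and cur == "LetterBase":
        return True                                   # rule (6)
    if cur == "NonCombiningExtender":
        # Rule (7): skip back over Combining* and require a LetterBase.
        j = i - 1
        while j >= 0 and classes[j] == "Combining":
            j -= 1
        return j >= 0 and classes[j] == "LetterBase"
    return False


seq = ["LogicalOrderException", "LetterBase", "Combining", "NonCombiningExtender"]
print([no_break_5_to_7(seq, i) for i in range(1, len(seq))])  # [True, True, True]
```

Rule (7) is the only one of the three that needs more than the two adjacent classes, because the left context may contain any number of combining marks.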

Do not break conjoining combining sequences.  The following two rules apply if and only if the match of NonEnclosingCombining+ also contains at least one Link character (such a character is combining).

(LetterBase NonEnclosingCombining+) × JoinControl    (8)

(LetterBase NonEnclosingCombining+ JoinControl) × LetterBase    (9)
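A sketch of rules (8) and (9), again over class labels, shows how the Link condition can be tested without regard to where the Link sits in the combining sequence. It assumes that Link characters are non-enclosing and are labelled "Link", with other non-enclosing marks labelled "NonEnclosingCombining". The helper names are hypothetical, and the rules are implemented exactly as written above, so rule (9) requires a JoinControl.

```python
NON_ENCLOSING = {"NonEnclosingCombining", "Link"}


def linked_sequence_before(classes, end):
    """True if classes[:end] ends with LetterBase NonEnclosingCombining+
    and that run of combining marks contains at least one Link."""
    j = end - 1
    saw_link = False
    while j >= 0 and classes[j] in NON_ENCLOSING:
        saw_link = saw_link or classes[j] == "Link"
        j -= 1
    # saw_link implies that at least one combining mark was consumed.
    return saw_link and j >= 0 and classes[j] == "LetterBase"


def no_break_8_9(classes, i):
    """True if rule (8) or (9) forbids a break before classes[i]."""
    if i < 1:
        return False
    if classes[i] == "JoinControl":
        return linked_sequence_before(classes, i)          # rule (8)
    if classes[i] == "LetterBase" and classes[i - 1] == "JoinControl":
        return linked_sequence_before(classes, i - 1)      # rule (9)
    return False


# The Link is in the middle of the combining sequence; the rule still applies.
seq = ["LetterBase", "NonEnclosingCombining", "Link",
       "NonEnclosingCombining", "JoinControl", "LetterBase"]
print(no_break_8_9(seq, 4), no_break_8_9(seq, 5))  # True True

# An enclosing mark interrupts the NonEnclosingCombining+ match.
blocked = ["LetterBase", "Link", "EnclosingCombining", "JoinControl"]
print(no_break_8_9(blocked, 3))  # False
```

Because the backward scan accepts only non-enclosing marks, an EnclosingCombining anywhere in the sequence makes the pattern fail, whether the Link comes before or after it.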

If none of the above is true, break between any (other) adjacent character pairs.

Any ÷ Any    (10)

Break at beginning and end of text.

beginning of text ÷ Any    (11)

Any ÷ end of text    (12)
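The default rules (10) to (12) amount to a simple loop: start a new cluster at every position unless some no-break rule applies. A minimal sketch follows, with rule (1) standing in as the only no-break rule. The function names clusters and cr_rule are hypothetical, and a full implementation would combine predicates for all of rules (1) to (9).

```python
def clusters(items, no_break):
    """Split items into clusters; no_break(items, i) says whether a break
    is forbidden between items[i - 1] and items[i]."""
    out = []
    for i, item in enumerate(items):
        if i > 0 and no_break(items, i):
            out[-1].append(item)      # some rule joins this item to the left
        else:
            out.append([item])        # rules (10) and (11): break here
    return out                        # rule (12): the last cluster ends at end of text


def cr_rule(chars, i):
    """Rule (1): no break between CR and a following LF, VT or FF."""
    return chars[i - 1] == "\r" and chars[i] in ("\n", "\v", "\f")


print(clusters(list("a\r\nb"), cr_rule))  # [['a'], ['\r', '\n'], ['b']]
```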

 

 

Note that a Link in a letter/digit based combining sequence makes it (the combining sequence) “conjoin” with the next letter/digit combining sequence, but that an EnclosingCombining in the combining sequence makes it non-conjoining and overrides any Link (before or after); this prevents nesting.  Note also that an EnclosingCombining character and any follow-on combining characters apply to (the preceding part of) the cluster, not just the last base in it.