Regex for Grapheme Cluster Breaks

From: Mark Davis ☕️ via Unicode <unicode_at_unicode.org>
Date: Wed, 3 Jan 2018 10:16:36 +0100

I had a UTC action to adjust
http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
by updating the regex and making any other necessary changes to the
surrounding text.

Here is what I've come up with for an EBNF formulation. The $X names are
the GCB (Grapheme_Cluster_Break) property values.

cluster = crlf | $Control | precore* core postcore* ;

crlf = $CR $LF ;

precore = $Prepend ;

postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] );

core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence
| [^$Control $CR $LF] );

hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ;

ri-sequence = $RI $RI ;

skin-sequence = $E_Base $E_Modifier ;

xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} ) (?:
$Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ;

virama-sequence = [$Virama $ZWJ] $LinkingConsonant ;

I have tools to turn that into a (lovely) regex:

\p{gcb=cr}\p{gcb=lf}|\p{gcb=control}|\p{gcb=Prepend}*(?:\p{gcb=l}*(?:\p{gcb=v}+|\p{gcb=lv}\p{gcb=v}*|\p{gcb=lvt})\p{gcb=t}*|\p{gcb=l}+|\p{gcb=t}+|\p{gcb=ri}\p{gcb=ri}|(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic})(?:\p{gcb=Extend}*\p{gcb=zwj}(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic}))*|[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[^\p{gcb=control}\p{gcb=cr}\p{gcb=lf}])(?:[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[\p{gcb=Extend}\p{gcb=zwj}\p{gcb=Virama}\p{gcb=SpacingMark}])*

(It is a bit shorter if some more property names/values are abbreviated.)
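
For readers who want to reproduce the conversion, here is a minimal sketch in
Python of how such an EBNF could be flattened into a regex by plain textual
substitution. It is not the tool mentioned above: the names RULES, TOKEN and
expand are made up for this illustration, every inlined rule is wrapped in
(?:...) for safety, and the output is therefore more verbose than the regex
shown, differing in grouping and in the spelling of the value names. Python's
own re module does not understand \p{...}, so the result is only printed here.

```python
import re

# The grammar from the mail, without the trailing semicolons.
RULES = {
    "cluster": "crlf | $Control | precore* core postcore*",
    "crlf": "$CR $LF",
    "precore": "$Prepend",
    "postcore": "(?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] )",
    "core": "(?: hangul-syllable | ri-sequence | xpicto-sequence"
            " | virama-sequence | [^$Control $CR $LF] )",
    "hangul-syllable": "$L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+",
    "ri-sequence": "$RI $RI",
    "skin-sequence": "$E_Base $E_Modifier",
    "xpicto-sequence": r"(?: skin-sequence | \p{Extended_Pictographic} )"
                       r" (?: $Extend* $ZWJ"
                       r" (?: skin-sequence | \p{Extended_Pictographic} ))*",
    "virama-sequence": "[$Virama $ZWJ] $LinkingConsonant",
}

TOKEN = re.compile(r"""
    (?P<prop>\\p\{[^}]*\})      # already a property escape: keep as is
  | \$(?P<gcb>\w+)              # $Name becomes \p{gcb=Name}
  | (?P<rule>[a-z][a-z-]*)      # rule reference: inline its expansion
  | (?P<space>\s+)              # EBNF whitespace is not significant
""", re.VERBOSE)

def expand(rules, name):
    """Recursively inline rule references, starting from rule `name`."""
    def repl(m):
        if m.group("prop"):
            return m.group("prop")
        if m.group("gcb"):
            return r"\p{gcb=%s}" % m.group("gcb")
        if m.group("rule"):
            return "(?:" + expand(rules, m.group("rule")) + ")"
        return ""  # drop whitespace
    return TOKEN.sub(repl, rules[name])

print(expand(RULES, "cluster"))
```

The sketch assumes the grammar has no recursive rules, which holds for the
rules above; a recursive rule would make expand loop forever.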

I then tested against the current test file: GraphemeBreakTest.txt. There
is one outlying failure with that test file:

813) ☝̈🏻

hex: 261D 0308 1F3FB

test: [0, 4]

ebnf: [0, 2, 4]

I believe that is a problem with the test rather than the EBNF, but I need
to track it down in any event.
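
To make the discrepancy concrete, the following Python sketch traces how the
EBNF arrives at [0, 2, 4] for this sequence. It makes several assumptions:
the three code points are given hand-assigned classes (B for E_Base, E for
Extend, M for E_Modifier) instead of being looked up in the UCD, only the
fragment of the grammar relevant to this case is modelled, and the offsets
are taken to be UTF-16 code units, the only reading under which three code
points end at 4. The VARIANT pattern is hypothetical and not part of the
EBNF; it only shows which change would reproduce the test file's answer.
The sketch does not settle whether the test or the grammar is at fault.

```python
import re

# Hand-assigned classes for just the three code points in the failing test
# (an assumption of this sketch; real code would read the UCD data files).
CLASS = {
    0x261D: "B",   # E_Base (also Extended_Pictographic)
    0x0308: "E",   # Extend
    0x1F3FB: "M",  # E_Modifier
}

def boundaries(text, cluster_re):
    """Return cluster boundaries as UTF-16 offsets, as in the test output."""
    symbols = "".join(CLASS[ord(ch)] for ch in text)
    # offsets[i] is the UTF-16 offset at which code point i starts.
    offsets = [0]
    for ch in text:
        offsets.append(offsets[-1] + (2 if ord(ch) > 0xFFFF else 1))
    cuts = [0]
    pos = 0
    while pos < len(symbols):
        pos = cluster_re.match(symbols, pos).end()
        cuts.append(offsets[pos])
    return cuts

# Fragment of the EBNF: core is a skin-sequence (BM), a lone pictographic (B)
# or any other character; postcore is reduced to Extend.
EBNF = re.compile(r"(?:BM|B|.)E*")

# Hypothetical variant that lets Extend* sit between base and modifier.
VARIANT = re.compile(r"(?:BE*M|B|.)E*")

text = "\u261D\u0308\U0001F3FB"
print(boundaries(text, EBNF))     # [0, 2, 4]
print(boundaries(text, VARIANT))  # [0, 4]
```

With the EBNF fragment, U+0308 is absorbed as postcore of U+261D, and
U+1F3FB then has to start a cluster of its own because E_Modifier is not
among the postcore classes.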

A regex is much easier for many applications to use than the current rule
syntax, so I'm going to see whether the other segmentations could be
reformulated as EBNFs (ideally corresponding to regular grammars, or in
the worst case, to PEGs).

Feedback is welcome.


Mark
Received on Wed Jan 03 2018 - 03:17:18 CST
