Regex for Grapheme Cluster Breaks from Mark Davis ☕️ via Unicode on 2018-01-03 (Unicode Mail List Archive)

From: Mark Davis ☕️ via Unicode <unicode_at_unicode.org>
Date: Wed, 3 Jan 2018 10:16:36 +0100

I had a UTC action to adjust
http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
to update the regex and make any other necessary changes to the
surrounding text.

Here is what I've come up with for an EBNF formulation. Each $X stands for
a Grapheme_Cluster_Break (GCB) property value.

cluster = crlf | $Control | precore* core postcore* ;

crlf = $CR $LF ;

precore = $Prepend ;

postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] );

core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence | [^$Control $CR $LF] );

hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ;

ri-sequence = $RI $RI ;

skin-sequence = $E_Base $E_Modifier ;

xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} ) (?: $Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ;

virama-sequence = [$Virama $ZWJ] $LinkingConsonant ;

I have tools to turn that into a (lovely) regex:

\p{gcb=cr}\p{gcb=lf}|\p{gcb=control}|\p{gcb=Prepend}*(?:\p{gcb=l}*(?:\p{gcb=v}+|\p{gcb=lv}\p{gcb=v}*|\p{gcb=lvt})\p{gcb=t}*|\p{gcb=l}+|\p{gcb=t}+|\p{gcb=ri}\p{gcb=ri}|(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic})(?:\p{gcb=Extend}*\p{gcb=zwj}(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic}))*|[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[^\p{gcb=control}\p{gcb=cr}\p{gcb=lf}])(?:[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[\p{gcb=Extend}\p{gcb=zwj}\p{gcb=Virama}\p{gcb=SpacingMark}])*

(It is a bit shorter if some more property names/values are abbreviated.)
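
For readers who want to experiment, here is a rough sketch of how such an expansion could be done. It is not the tool mentioned above; the names RULES, expand and to_regex are hypothetical, and the approach (textual substitution of nonterminals, then of $X property references) is only one possible way to do it. The output wraps every substituted rule in a non-capturing group and keeps the property names as spelled in the EBNF, so it is more verbose than the regex shown above but should describe the same language. Python's own re module does not understand \p{gcb=...}, so the result is only built as a string here, for use with an engine that supports that syntax.

```python
import re

# Rules transcribed from the EBNF above (trailing ';' dropped).
RULES = {
    "cluster": "crlf | $Control | precore* core postcore*",
    "crlf": "$CR $LF",
    "precore": "$Prepend",
    "postcore": "(?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] )",
    "core": "(?: hangul-syllable | ri-sequence | xpicto-sequence"
            " | virama-sequence | [^$Control $CR $LF] )",
    "hangul-syllable": "$L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+",
    "ri-sequence": "$RI $RI",
    "skin-sequence": "$E_Base $E_Modifier",
    "xpicto-sequence": r"(?: skin-sequence | \p{Extended_Pictographic} )"
                       r" (?: $Extend* $ZWJ"
                       r" (?: skin-sequence | \p{Extended_Pictographic} ))*",
    "virama-sequence": "[$Virama $ZWJ] $LinkingConsonant",
}

# A nonterminal is a rule name that is not part of a longer word or $X name.
_NONTERMINAL = re.compile(
    r"(?<![\w$-])(?:%s)(?![\w-])" % "|".join(map(re.escape, RULES))
)


def expand(name):
    """Recursively replace nonterminals in a rule body with grouped bodies."""
    return _NONTERMINAL.sub(
        lambda m: "(?:" + expand(m.group(0)) + ")", RULES[name]
    )


def to_regex(start="cluster"):
    """Expand the grammar from `start` and emit a \\p{gcb=...} style regex."""
    pattern = expand(start)
    # $X  ->  \p{gcb=X}
    pattern = re.sub(r"\$(\w+)", r"\\p{gcb=\1}", pattern)
    # The EBNF uses whitespace only as a separator, so drop it.
    return re.sub(r"\s+", "", pattern)


if __name__ == "__main__":
    print(to_regex())
```

Always grouping the substituted rules keeps the sketch short and avoids precedence mistakes around | and *; a real tool would presumably omit groups that are not needed.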

I then tested against the current test file: GraphemeBreakTest.txt. There
is one outlying failure with that test file:

813) ☝̈🏻

hex: 261D 0308 1F3FB

test: [0, 4]

ebnf: [0, 2, 4]

I believe that is a problem with the test rather than the BNF, but I need
to track it down in any event.
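
To make it easier to see where the two results diverge, here is a small sketch that reproduces both. Everything in it is an assumption made for illustration: the three-entry property table is hand-written (E_Base for 261D, Extend for 0308, E_Modifier for 1F3FB), only the skin-sequence and Extend parts of the grammar are modelled, and the offsets are taken to be UTF-16 code units, which is what a final offset of 4 for three code points suggests. The second pattern allows Extend* between E_Base and E_Modifier; that is one possible reading of what the test expects, not a confirmed diagnosis.

```python
import re

# Hypothetical, hand-written property table covering only this test case.
# Each GCB value is mapped to a one-letter symbol so plain `re` can be used.
GCB_SYMBOL = {
    0x261D: "B",   # E_Base
    0x0308: "X",   # Extend
    0x1F3FB: "M",  # E_Modifier
}

# Fragment of the EBNF: core is skin-sequence (E_Base E_Modifier) or any
# single character; postcore is reduced to Extend.
EBNF_CLUSTER = re.compile(r"(?:BM|.)X*")

# Variant that tolerates Extend* between E_Base and E_Modifier.
EXTEND_TOLERANT_CLUSTER = re.compile(r"(?:BX*M|.)X*")


def utf16_boundaries(text, pattern):
    """Return cluster boundaries of `text` as UTF-16 code unit offsets."""
    symbols = "".join(GCB_SYMBOL[ord(ch)] for ch in text)
    boundaries = [0]
    offset = 0
    for match in pattern.finditer(symbols):
        for ch in text[match.start():match.end()]:
            # Supplementary code points take two UTF-16 code units.
            offset += 2 if ord(ch) > 0xFFFF else 1
        boundaries.append(offset)
    return boundaries


TEXT = "\u261D\u0308\U0001F3FB"

if __name__ == "__main__":
    print("ebnf:", utf16_boundaries(TEXT, EBNF_CLUSTER))             # [0, 2, 4]
    print("test:", utf16_boundaries(TEXT, EXTEND_TOLERANT_CLUSTER))  # [0, 4]
```

In the first pattern the Extend between the base and the modifier prevents skin-sequence from matching, so 261D 0308 forms one cluster and 1F3FB starts another.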

A regex is much easier for many applications to use than the current rule
syntax, so I'm going to see whether the other segmentations can be
reformulated as EBNFs (ideally corresponding to regular grammars or, in
the worst case, to PEGs).

Feedback is welcome.

Mark
Received on Wed Jan 03 2018 - 03:17:18 CST
