Possible bug in formal grammar for extended grapheme cluster

From: David P. Kendal via Unicode <unicode_at_unicode.org>
Date: Sun, 17 Dec 2017 16:16:20 +0100

Hi,

It’s possible I’m missing something, but the formal grammar/regular
expression given for extended grapheme clusters appears to have a bug
in it.
<https://unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters>

The bug is here:

    RI-Sequence := Regional_Indicator+

If the formal grammar is intended to match exactly the rules given in
the “Grapheme Cluster Boundary Rules” section below it, then this
should be

    RI-Sequence := Regional_Indicator Regional_Indicator

since as given it would cause any number of RI characters to coalesce
into a single grapheme cluster, instead of pairs of characters. That
is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one
grapheme cluster instead of the correct two.
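
To make the difference concrete, here is a rough sketch in Python that
applies both forms of the production to the example text. It models only
the RI-Sequence production, not the full grammar, and the names
RI_SEQUENCE_AS_PUBLISHED, RI_SEQUENCE_PROPOSED and split_clusters are my
own, not anything from UAX #29.

```python
import re

# Regional Indicator symbols occupy U+1F1E6..U+1F1FF.
RI = "[\U0001F1E6-\U0001F1FF]"

# Hypothetical names: the production as published vs. the proposed fix.
RI_SEQUENCE_AS_PUBLISHED = re.compile(f"{RI}+")
RI_SEQUENCE_PROPOSED = re.compile(f"{RI}{RI}")


def split_clusters(pattern, text):
    """Return the successive matches of `pattern` in `text`.

    Simplification: a trailing unpaired RI is not reported here, whereas
    the full grammar would fall back to a single-code-point cluster.
    """
    return [m.group(0) for m in pattern.finditer(text)]


text = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"

published = split_clusters(RI_SEQUENCE_AS_PUBLISHED, text)
proposed = split_clusters(RI_SEQUENCE_PROPOSED, text)

# Number of clusters found under each form of the production.
print(len(published), len(proposed))
```

With the published form all four code points come back as a single
match; with the two-character form they come back as two pairs.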

-- 
dpk (David P. Kendal) · Nassauische Str. 36, 10717 DE · http://dpk.io/
   we do these things not because they are easy,      +49 159 03847809
      but because we thought they were going to be easy
          — ‘The Programmers’ Credo’, Maciej Cegłowski
Received on Sun Dec 17 2017 - 11:00:26 CST
