Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?
mathias at qiwi.be
Thu Apr 24 16:07:58 CDT 2014
On 24 Apr 2014, at 21:38, Whistler, Ken <ken.whistler at sap.com> wrote:
> Grapheme_Extend characters per se do not "apply" to anything.
> They are a mixture of different General_Category types -- mostly combining
> marks, but not all. The concept of applying to a base only refers to
> combining marks proper.
> The proper use of the Grapheme_Extend property is in the context of the
> text segmentation algorithms defined in UAX #29, and in particular:
> See that document for the proper use. They are relevant to the determination of grapheme cluster boundaries.
> And by the way, it is a very bad idea to be writing a program to just unilaterally strip away grapheme extenders from input strings. In particular, many dependent vowels in Indic scripts are defined as grapheme extenders. If you strip them away, the input string will just end up as random trash. That is very, very different from something which is trying to strip diacritics and accent marks off of Latin letters.
I agree. Don’t worry: I am not actually writing such a program; it was just an example to simplify my question.
The real program attempts to reverse a string while accounting for combining marks and grapheme extenders. Before reversing the code points one by one, some things need to happen:
* For combining marks, I use a regular expression that looks for a character that is not a combining mark, followed by any number of combining marks, and then I swap the combining marks with the preceding character.
* Now I’m trying to figure out what to do about grapheme extenders (if anything). I was thinking: look for any non-grapheme extender symbol (or should it be only `Grapheme_Base` characters? Your reply suggested it shouldn’t) followed by a single grapheme extender (or should it be several, like with combining marks?), and then swap them. Would that be a correct approach?
I realize reversing a string has nothing to do with text segmentation – but ignoring grapheme extenders leads to unexpected results (since after reversing the code points, the grapheme extender might extend the wrong character): https://github.com/mathiasbynens/esrever/issues/5
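To make the swap-then-reverse idea concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the code from the linked project). The names `reverse_with_extenders` and `is_combining_mark` are hypothetical. It assumes "combining mark" means General_Category Mn, Mc, or Me. Python's standard `unicodedata` module does not expose `Grapheme_Extend`, so the "extender" test is passed in as a predicate and a real `Grapheme_Extend` check would have to come from elsewhere. The sketch treats a run of several extenders as one unit, which seems consistent with the UAX #29 rule that no boundary falls before an Extend character.

```python
import unicodedata


def is_combining_mark(ch):
    # Assumption: "combining mark" means General_Category Mn, Mc, or Me.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def reverse_with_extenders(text, is_extender=is_combining_mark):
    """Reverse text code point by code point, keeping each run of
    extenders attached to the character that preceded it."""
    swapped = []
    i, n = 0, len(text)
    while i < n:
        base = text[i]
        j = i + 1
        # Collect every extender that directly follows the base.
        while j < n and is_extender(text[j]):
            j += 1
        extenders = text[i + 1:j]
        # Move the extenders in front of the base, in reverse order,
        # so the final reversal restores "base, then extenders".
        swapped.append(extenders[::-1] + base)
        i = j
    # Plain code point reversal of the pre-swapped string.
    return "".join(swapped)[::-1]


# "man" + U+0303 COMBINING TILDE + "ana"
word = "man\u0303ana"
print(word[::-1] == "ana\u0303nam")                      # True: naive reversal moves the tilde onto "a"
print(reverse_with_extenders(word) == "anan\u0303am")    # True: the tilde stays on "n"
```

Note that the extenders are reversed as well as moved: for a base followed by two marks, the pre-swapped order has to be second mark, first mark, base, otherwise the final reversal would leave the marks in the wrong order. A different predicate can be supplied for experiments, for example `lambda ch: ch == "\u200c"` for U+200C ZERO WIDTH NON-JOINER, which has `Grapheme_Extend=Yes` but is not a combining mark.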