Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

From: Richard Wordingham via Unicode <unicode_at_unicode.org>
Date: Mon, 11 Dec 2017 10:16:31 +0000

On Sun, 10 Dec 2017 21:14:18 -0800
Manish Goregaokar via Unicode <unicode_at_unicode.org> wrote:

> > GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant
>
> You can also explicitly request ligatureification with a ZWJ, so
> perhaps this rule should be something like
>
> (Virama ZWJ? | ZWJ) x Extend* LinkingConsonant
>
> -Manish
>
> On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ☕️ via Unicode <
> unicode_at_unicode.org> wrote:
>
> > 1. You make a good point about the GB9c. It should probably instead
> > be something like:
> >
> > GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant

This change is unnecessary. Let us start from Draft 1, which has:

GB9: × (Extend | ZWJ | Virama)
GB9c: (Virama | ZWJ ) × LinkingConsonant

If the classes used in the rules are to be disjoint, we then have to
split Extend into something like ViramaExtend and OtherExtend to allow
normalised (NFC/NFD) text, at which point we may as well continue to
have rules that work without any normalisation. Informally,

ViramaExtend = Extend and ccc ≠ 0.

OtherExtend = Extend and ccc = 0.

(We might need to put additional characters in ViramaExtend.)

This gives us rules:

GB9': × (OtherExtend | ViramaExtend | ZWJ | Virama)

GB9c': (Virama | ZWJ ) ViramaExtend* × LinkingConsonant

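So, for a sequence <virama, ZWJ, nukta, LinkingConsonant>, GB9' alone
gives us the first two joins and says nothing about the position before
LinkingConsonant:

virama × ZWJ × nukta LinkingConsonant

and GB9c' then supplies the remaining join:

virama × ZWJ × nukta × LinkingConsonant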
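
The effect of the two rules can be checked mechanically. The following
is only a rough sketch in Python, not a real segmenter: it works on
class labels rather than characters, the function names are made up for
illustration, every position not covered by GB9' or GB9c' is assumed to
be a break (÷), and nukta is represented by its class ViramaExtend.

```python
# Hypothetical class labels; real assignments would come from the UCD.
VIRAMA = "Virama"
ZWJ = "ZWJ"
VIRAMA_EXTEND = "ViramaExtend"
OTHER_EXTEND = "OtherExtend"
LINKING_CONSONANT = "LinkingConsonant"


def no_break_gb9(classes, i):
    """GB9': no break before OtherExtend, ViramaExtend, ZWJ or Virama."""
    return classes[i] in {OTHER_EXTEND, VIRAMA_EXTEND, ZWJ, VIRAMA}


def no_break_gb9c(classes, i):
    """GB9c': (Virama | ZWJ) ViramaExtend* × LinkingConsonant."""
    if classes[i] != LINKING_CONSONANT:
        return False
    j = i - 1
    # Skip back over any run of ViramaExtend.
    while j >= 0 and classes[j] == VIRAMA_EXTEND:
        j -= 1
    return j >= 0 and classes[j] in {VIRAMA, ZWJ}


def render(classes, rules):
    """Show × where some rule forbids a break, ÷ everywhere else."""
    out = [classes[0]]
    for i in range(1, len(classes)):
        joined = any(rule(classes, i) for rule in rules)
        out.append("×" if joined else "÷")
        out.append(classes[i])
    return " ".join(out)


# <virama, ZWJ, nukta, LinkingConsonant>, with nukta as ViramaExtend.
seq = [VIRAMA, ZWJ, VIRAMA_EXTEND, LINKING_CONSONANT]

print(render(seq, [no_break_gb9]))
# Virama × ZWJ × ViramaExtend ÷ LinkingConsonant
print(render(seq, [no_break_gb9, no_break_gb9c]))
# Virama × ZWJ × ViramaExtend × LinkingConsonant
```

Without the ViramaExtend* in GB9c', the nukta would sit between the
ZWJ and the consonant and the last join would be lost.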

---
In Rule GB9c, what examples justify including ZWJ? Are they just the C1
half-forms? As far as I know, the rule

GB9c'': Virama (ZWJ | ViramaExtend)* × LinkingConsonant

might be more appropriate.
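
To make the difference between GB9c' and GB9c'' concrete, here is a
small Python sketch. It is only an illustration under assumptions: the
function names are invented, the input is the list of class labels
immediately preceding a LinkingConsonant, and no other rules are
modelled.

```python
def joins_gb9c_prime(prefix):
    """GB9c': (Virama | ZWJ) ViramaExtend* × LinkingConsonant."""
    j = len(prefix) - 1
    # Skip back over trailing ViramaExtend only.
    while j >= 0 and prefix[j] == "ViramaExtend":
        j -= 1
    return j >= 0 and prefix[j] in {"Virama", "ZWJ"}


def joins_gb9c_double_prime(prefix):
    """GB9c'': Virama (ZWJ | ViramaExtend)* × LinkingConsonant."""
    j = len(prefix) - 1
    # Skip back over any mix of ZWJ and ViramaExtend.
    while j >= 0 and prefix[j] in {"ZWJ", "ViramaExtend"}:
        j -= 1
    return j >= 0 and prefix[j] == "Virama"


# Class labels preceding the LinkingConsonant in each test case.
cases = [
    ["Virama"],
    ["Virama", "ZWJ", "ViramaExtend"],
    ["Other", "ZWJ"],  # ZWJ with no virama before it
]

for prefix in cases:
    print(prefix, joins_gb9c_prime(prefix), joins_gb9c_double_prime(prefix))
# ['Virama'] True True
# ['Virama', 'ZWJ', 'ViramaExtend'] True True
# ['Other', 'ZWJ'] True False
```

In this sketch the two rules differ only when a ZWJ is not preceded by
a virama: GB9c' joins the consonant to it, GB9c'' does not.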
Richard.
Received on Mon Dec 11 2017 - 04:17:10 CST