Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

From: Ken Whistler via Unicode <unicode_at_unicode.org>
Date: Tue, 29 May 2018 07:27:21 -0700

On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
> How would one know that they are misapplied? And what if the author of
> the text has broken your rules? Are such texts never to be transcribed
> to pukka Unicode?

Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061,
Script=Latin) doesn't automatically make the Tamil vowel "inherit" the
Latin script property value, nor should it.

That said, if someone decides they want that sequence, and their text has
"broken my rules", so be it. I'm just not going to assume anything
particular about that text. Note that in terms of trying to determine
whether such a string is (naively) alphabetic, such a sequence doesn't
interfere with the determination. On the other hand, a process concerned
about text runs, script assignment, validity for domains, or other such
issues *will* be sensitive to such a boundary -- and should not be
overruled by some generic determination that combining marks inherit all
the properties of their base.
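
To make the boundary point concrete, here is a minimal sketch of per-code-point
script run segmentation. The three-entry SCRIPT table and the names script_of
and script_runs are hypothetical, made up for this illustration; a real process
would take Script values from the UCD (Scripts.txt) and would handle Common and
Inherited more carefully than this does.

```python
# Tiny hand-made stand-in for the Script property (illustration only).
SCRIPT = {
    0x0061: "Latin",      # LATIN SMALL LETTER A
    0x0301: "Inherited",  # COMBINING ACUTE ACCENT
    0x0BC0: "Tamil",      # TAMIL VOWEL SIGN II
}


def script_of(cp):
    """Look up the (assumed) Script value of a code point."""
    return SCRIPT.get(cp, "Unknown")


def script_runs(text):
    """Split text into (script, substring) runs.

    Only characters whose own Script value is Inherited or Common are
    attached to the preceding run. A mark with an explicit script value
    keeps it, whatever base it happens to follow.
    """
    runs = []
    for ch in text:
        sc = script_of(ord(ch))
        if runs and (sc in ("Inherited", "Common") or sc == runs[-1][0]):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        else:
            runs.append((sc, ch))
    return runs


latin_with_acute = script_runs("a\u0301")
latin_with_tamil_ii = script_runs("a\u0bc0")
```

Under those assumptions, <a, 0301> stays a single Latin run, while <a, 0BC0>
splits into a Latin run followed by a Tamil run.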

>
>
> Even without knowing exactly what is wanted, it looks to me as though
> it isn't. If he wants to allow <pulli, ZWNJ> as a substring, which
> he should, then that fails because there is no overlap between
> p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.

Yes, so if you are working with strings for Indic scripts (or for that
matter, Arabic), you add Join_Control to the mix:

Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control

gets you a decent approximation of what is (naively) expected to fall
within an "alphabetic" string for most scripts.

For those following along, Alphabetic is roughly meant to cover the ABC,
かきくけこ,... plus ideographic elements of most scripts. Diacritic picks up
most of the applied combining marks, including nuktas, viramas, and tone
marks. Extender picks up spacing elements that indicate length,
reduplication, iteration, etc. And joiners are, well, joiners.
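
The union above could be sketched roughly as follows. This assumes the property
data comes from lines in the "code point(s) ; Property # comment" layout used by
PropList.txt and DerivedCoreProperties.txt. SAMPLE is a handful of lines written
by hand for illustration, not a copy of either file, and load_properties and
naively_alphabetic are hypothetical helper names.

```python
# Hand-written sample in the UCD property-file layout (illustration only).
SAMPLE = """\
# code point(s) ; Property # comment
0B95          ; Alphabetic    # TAMIL LETTER KA
0BC0          ; Alphabetic    # TAMIL VOWEL SIGN II
0BCD          ; Diacritic     # TAMIL SIGN VIRAMA
00B7          ; Extender      # MIDDLE DOT
200C..200D    ; Join_Control  # ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER
"""

WANTED = ("Alphabetic", "Diacritic", "Extender", "Join_Control")


def load_properties(text):
    """Parse property lines into {property name: set of code points}."""
    props = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        cps, name = (field.strip() for field in line.split(";"))
        first, _, last = cps.partition("..")  # single code point or range
        lo, hi = int(first, 16), int(last or first, 16)
        props.setdefault(name, set()).update(range(lo, hi + 1))
    return props


def naively_alphabetic(s, props):
    """True if s is non-empty and every code point is in the union."""
    allowed = set().union(*(props.get(p, set()) for p in WANTED))
    return bool(s) and all(ord(c) in allowed for c in s)


props = load_properties(SAMPLE)
ka_pulli_zwnj = naively_alphabetic("\u0b95\u0bcd\u200c", props)
ka_space = naively_alphabetic("\u0b95 ", props)
```

With this sample data, <ka, pulli, ZWNJ> passes and <ka, SPACE> does not. With
Alphabetic alone, the pulli would already be rejected, which is where this
thread started.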

If one wants finer categorization specifically for Indic scripts, then I
would suggest turning to the Indic_Syllabic_Category property instead of
a union of PropList.txt properties and/or some twiddling with
General_Category values.

--Ken
Received on Tue May 29 2018 - 09:27:51 CDT
