Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic? from Ken Whistler via Unicode on 2018-05-28 (Unicode Mail List Archive)

From: Ken Whistler via Unicode <unicode_at_unicode.org>
Date: Mon, 28 May 2018 21:45:04 -0700

On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote:
> Hello Sundar,
>
> On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:
>> Hi,
>>
>> In languages like Ruby or Java
>> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
>>
>> functions to check if a character is alphabetic do that by looking for
>> the 'Alphabetic' property (defined true if it's in one of the L
>> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
>> Tamil text, this works out well for independent vowels and consonants
>> (which are in Lo), and for most dependent signs (which are in Mc or Mn
>> but have the 'Other_Alphabetic' property), but the very common pulli
>> (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
>> concluding any string containing it to be non-alphabetic.
>>
>> This doesn't make sense to me since the Virama “◌்” is as much of an
>> alphabetic character as any of the "Dependent Vowel" characters which
>> have been given the 'Other_Alphabetic' property. Is there a rationale
>> behind this difference, or is it an oversight to be corrected?
>
> I suggest submitting an error report via
> https://www.unicode.org/reporting.html. I haven't studied the issue in
> detail (sorry, just no time this week), but it sounds reasonable to
> give the VIRAMA the 'Other_Alphabetic' property.

Please don't. This is not an error in the Unicode property assignments,
which have been stable in scope for Alphabetic for some time now.

The problem is in assuming that the Java or Ruby isAlphabetic() API,
which simply reports the Unicode property value Alphabetic for a
character, suffices for identifying a string as somehow "wordlike". It
doesn't.
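
A quick way to see which code points fail the check is sketched below in
Ruby. It assumes Ruby 2.4 or later (for String#match?) and relies on the
regex engine's \p{Alphabetic} property class. The helper name
non_alphabetic_chars is made up for illustration and is not part of any
library.

```ruby
# Hypothetical helper: returns the code points in +str+ that do NOT carry
# the Unicode Alphabetic property, formatted as U+XXXX.
def non_alphabetic_chars(str)
  str.each_char
     .reject { |ch| ch.match?(/\p{Alphabetic}/) }
     .map { |ch| format("U+%04X", ch.ord) }
end

# The word "Tamil" in Tamil script: TA, MA, VOWEL SIGN I, LLLA, VIRAMA.
tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"
p non_alphabetic_chars(tamil)
```

For this string the only code point reported should be U+0BCD, the pulli:
the consonants are Lo, and the vowel sign is Mc with Other_Alphabetic.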

The approximation you are looking for is to add Diacritic to Alphabetic.
That will automatically pull in all the nuktas and viramas/killers for
Brahmi-derived scripts. It also will pull in the harakat for Arabic and
similar abjads, which are also not Alphabetic in the property values.
And it will pull in tone marks for various writing systems.

For good measure, also add Extender, which will pick up length marks and
iteration marks.
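
The approximation described above (Alphabetic plus Diacritic plus
Extender) could be sketched in Ruby roughly as follows. This assumes
Ruby 2.4 or later and that the regex engine accepts the \p{Diacritic} and
\p{Extender} property names. The constant WORDLIKE and the method
wordlike? are illustrative names, not an existing API.

```ruby
# Rough "wordlike" test: every character in the string has at least one of
# the Alphabetic, Diacritic, or Extender properties. This is only an
# approximation; it does no word segmentation.
WORDLIKE = /\A[\p{Alphabetic}\p{Diacritic}\p{Extender}]+\z/

def wordlike?(str)
  WORDLIKE.match?(str)
end

# The word "Tamil" in Tamil script, ending in the pulli (U+0BCD).
tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"

puts tamil.match?(/\A\p{Alphabetic}+\z/) # Alphabetic alone
puts wordlike?(tamil)                    # Alphabetic + Diacritic + Extender
```

With Alphabetic alone the Tamil word is rejected because of the pulli;
with the widened class it should be accepted.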

Please do not assume that the Alphabetic property just automatically
equates to "what I would write in a word". Or that it should be adjusted
to somehow make that happen. It would be highly advisable to study *all*
the UCD properties in more depth, before starting to report bugs in one
or another simply because using a single property doesn't produce the
string classification one assumes should be correct in a particular case.

Of course, to get a better approximation of what actually constitutes a
"word" in a particular writing system, instead of using raw property
APIs, one should be using a WordBreak iterator, preferably one tailored
for the language in question.

--Ken

>
> I'd recommend to mention examples other than Tamil in your report
> (assuming they exist).
>
> BTW, what's the method you are using in Ruby? If there's a problem in
> Ruby (which I don't think; it's just using Unicode data), then please
> make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I
> should be able to follow up on that.
>
> Regards, Martin.
>
Received on Mon May 28 2018 - 23:45:29 CDT
