Re: Script Names

From: Marco Cimarosti (marco.cimarosti@europe.com)
Date: Mon May 22 2000 - 08:45:19 EDT


Mark Davis wrote:
>>There is a new proposed technical report on the Unicode site.
>>document: http://www.unicode.org/unicode/reports/tr24/

Good job! A very useful piece of information.

But how does this combine with Normalization Forms?

A naive character-by-character application of the Script property from this file would give different results depending on whether the same grapheme is expressed in precomposed or decomposed form.

E.g.: U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE) is "script = Latin", i.e. the letter and the accent together are treated as "script = Latin, Latin". However, the equivalent decomposed sequence U+0041, U+0300 (LATIN CAPITAL LETTER A, COMBINING GRAVE ACCENT) is "script = Latin, Common".
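
To make the discrepancy concrete, here is a minimal sketch in Python. The TOY_SCRIPT table and the scripts_per_char helper are hypothetical names of mine; the table only copies the three values quoted above and is not read from the actual data file.

```python
import unicodedata

# Hypothetical toy table: only the three values quoted in this message,
# not the real contents of the proposed data file.
TOY_SCRIPT = {
    0x00C0: "Latin",   # LATIN CAPITAL LETTER A WITH GRAVE
    0x0041: "Latin",   # LATIN CAPITAL LETTER A
    0x0300: "Common",  # COMBINING GRAVE ACCENT
}

def scripts_per_char(text):
    """Look up each code point on its own, ignoring its neighbours."""
    return [TOY_SCRIPT.get(ord(ch), "Unknown") for ch in text]

precomposed = "\u00C0"
decomposed = unicodedata.normalize("NFD", precomposed)  # U+0041, U+0300

print(scripts_per_char(precomposed))  # ['Latin']
print(scripts_per_char(decomposed))   # ['Latin', 'Common']
```

The two strings are canonically equivalent, yet the per-character lookup reports a different set of scripts for each.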

To remove this ambiguity, why not assume that a combining character has the same script property as the base character it is applied to?

This would, however, have some tricky consequences (although not necessarily wrong ones):

- The "script" property of shared diacritics (e.g. U+0300 COMBINING GRAVE ACCENT) would be variable and context-dependent.

- Script-specific combining marks could get assigned to a different script if used in an unusual context. E.g. U+093E (DEVANAGARI VOWEL SIGN AA) would be "script = Bengali" when following U+0995 (BENGALI LETTER KA).
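
The inheritance rule and both of its consequences can be sketched as follows. This is only an illustration under my own assumptions: the TOY_SCRIPT table is hypothetical, and "combining character" is approximated by a General Category starting with "M", which is a choice of mine rather than something the report specifies.

```python
import unicodedata

# Hypothetical toy table covering only the characters mentioned above.
TOY_SCRIPT = {
    0x0041: "Latin",       # LATIN CAPITAL LETTER A
    0x0300: "Common",      # COMBINING GRAVE ACCENT
    0x0995: "Bengali",     # BENGALI LETTER KA
    0x093E: "Devanagari",  # DEVANAGARI VOWEL SIGN AA
}

def is_mark(ch):
    """Assumption: treat General Category Mn, Mc and Me as combining."""
    return unicodedata.category(ch).startswith("M")

def scripts_with_inheritance(text):
    """Give each mark the script of the preceding base character."""
    result = []
    base_script = None
    for ch in text:
        if is_mark(ch) and base_script is not None:
            result.append(base_script)
        else:
            script = TOY_SCRIPT.get(ord(ch), "Unknown")
            result.append(script)
            if not is_mark(ch):
                base_script = script
    return result

print(scripts_with_inheritance("A\u0300"))       # ['Latin', 'Latin']
print(scripts_with_inheritance("\u0995\u093E"))  # ['Bengali', 'Bengali']
```

A mark with no preceding base simply keeps its table value in this sketch; what should happen in that case is left open here.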

A different approach could be to assume a particular normalization (e.g. Normalization Form D), and remove from Script.txt all characters whose property can be derived from their decomposition.
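
A rough sketch of this second approach, again with hypothetical names (NFD_ONLY_SCRIPT, scripts_after_nfd) and a toy table that deliberately has no entry for the precomposed U+00C0:

```python
import unicodedata

# Hypothetical toy table listing only characters that survive NFD;
# the precomposed U+00C0 is left out on purpose.
NFD_ONLY_SCRIPT = {
    0x0041: "Latin",   # LATIN CAPITAL LETTER A
    0x0300: "Common",  # COMBINING GRAVE ACCENT
}

def scripts_after_nfd(text):
    """Normalize to NFD first, then look up each code point."""
    decomposed = unicodedata.normalize("NFD", text)
    return [NFD_ONLY_SCRIPT.get(ord(ch), "Unknown") for ch in decomposed]

print(scripts_after_nfd("\u00C0"))   # ['Latin', 'Common']
print(scripts_after_nfd("A\u0300"))  # ['Latin', 'Common']
```

Both spellings now give the same answer, but the accent is still reported as "Common", so this makes the result consistent without making the grapheme purely "Latin".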

_ Marco


