Mark Davis wrote:
>>There is a new proposed technical report on the Unicode site.
>>document: http://www.unicode.org/unicode/reports/tr24/
Good job! A very useful piece of information.
But how does this combine with Normalization Forms?
A naive character-by-character application of the Script property from this file would give different results depending on whether the same grapheme is expressed in precomposed or decomposed form.
E.g.: U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE) is "script = Latin", i.e. the letter and the accent are both treated as Latin. However, the canonically equivalent decomposed sequence U+0041, U+0300 (LATIN CAPITAL LETTER A, COMBINING GRAVE ACCENT) is "script = Latin, Common".
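The mismatch can be reproduced with a minimal sketch. The table TOY_SCRIPT and the function scripts_per_char are hypothetical names made up for illustration; the table holds only the three code points discussed here, with the values quoted in this message, and is not read from the actual data file.

```python
import unicodedata

# Hypothetical excerpt of the script data: only the code points from the example,
# with the values quoted in this message (an assumption, not the real file).
TOY_SCRIPT = {
    0x00C0: "Latin",   # LATIN CAPITAL LETTER A WITH GRAVE
    0x0041: "Latin",   # LATIN CAPITAL LETTER A
    0x0300: "Common",  # COMBINING GRAVE ACCENT
}

def scripts_per_char(text):
    """Look up the script of each code point independently."""
    return [TOY_SCRIPT[ord(ch)] for ch in text]

precomposed = "\u00C0"
# Canonical decomposition yields U+0041 followed by U+0300.
decomposed = unicodedata.normalize("NFD", precomposed)

print(scripts_per_char(precomposed))  # ['Latin']
print(scripts_per_char(decomposed))   # ['Latin', 'Common']
```

The two strings are canonically equivalent, yet the per-character lookup reports a different set of scripts for each.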
To remove this ambiguity, why not assume that a combining character has the same script property as the base character it is applied to?
This would, however, have some tricky (although not necessarily wrong) consequences:
- The "script" property of shared diacritics (e.g. U+0300 COMBINING GRAVE ACCENT) would be variable and context-dependent.
- Script-specific combining marks could get assigned to a different script if used in a strange context. E.g. U+093E (DEVANAGARI VOWEL SIGN AA) would be "script = Bengali" when following U+0995 (BENGALI LETTER KA).
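The rule suggested above, and both of its consequences, can be sketched as follows. This is only an illustration under stated assumptions: TOY_SCRIPT and scripts_inheriting_base are hypothetical names, the table covers just the code points mentioned in this message, and "combining character" is approximated by the general categories Mn, Mc and Me.

```python
import unicodedata

# Hypothetical excerpt of the script data (assumed values, not the real file).
TOY_SCRIPT = {
    0x0041: "Latin",       # LATIN CAPITAL LETTER A
    0x0300: "Common",      # COMBINING GRAVE ACCENT
    0x0995: "Bengali",     # BENGALI LETTER KA
    0x093E: "Devanagari",  # DEVANAGARI VOWEL SIGN AA
}

def scripts_inheriting_base(text):
    """Give each combining mark the script of the preceding base character."""
    result = []
    base_script = None
    for ch in text:
        # Assumption: any character with a Mark category (Mn, Mc, Me) combines.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and base_script is not None:
            result.append(base_script)
        else:
            own_script = TOY_SCRIPT[ord(ch)]
            if not is_mark:
                base_script = own_script
            result.append(own_script)
    return result

print(scripts_inheriting_base("A\u0300"))       # ['Latin', 'Latin']
print(scripts_inheriting_base("\u0995\u093E"))  # ['Bengali', 'Bengali']
```

The first call shows the shared accent taking its script from context; the second shows the Devanagari vowel sign being reported as Bengali.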
A different approach could be to assume a particular normalization (e.g. Normalization Form D), and remove all derivable characters from Script.txt.
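A rough sketch of this second approach, again with a hypothetical toy table rather than the real file: entries whose code point changes under Normalization Form D are dropped as derivable, and text is normalized to NFD before the lookup. The names TOY_SCRIPT, REDUCED and scripts_nfd are made up for illustration.

```python
import unicodedata

# Hypothetical excerpt of the script data (assumed values, not the real file).
TOY_SCRIPT = {
    0x00C0: "Latin",   # LATIN CAPITAL LETTER A WITH GRAVE
    0x0041: "Latin",   # LATIN CAPITAL LETTER A
    0x0300: "Common",  # COMBINING GRAVE ACCENT
}

def is_derivable(cp):
    """True if the code point is not stable under NFD, i.e. it decomposes."""
    ch = chr(cp)
    return unicodedata.normalize("NFD", ch) != ch

# Keep only the entries that cannot be derived from a decomposition.
REDUCED = {cp: s for cp, s in TOY_SCRIPT.items() if not is_derivable(cp)}

def scripts_nfd(text):
    """Normalize to NFD first, then look up each code point in the reduced table."""
    return [REDUCED[ord(ch)] for ch in unicodedata.normalize("NFD", text)]

print(sorted(hex(cp) for cp in REDUCED))            # ['0x300', '0x41']
print(scripts_nfd("\u00C0") == scripts_nfd("A\u0300"))  # True
```

With this choice the precomposed and decomposed spellings always yield the same answer, at the cost of requiring a normalization pass before every lookup.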
_ Marco
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:03 EDT