Re: Merging combining classes, was: New contribution N2676

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Oct 29 2003 - 14:06:32 CST


From: "Jim Allan" <jallan@smrtytrek.com>

> Kent Karlson posted:
>
> > COMBINING COMMA BELOW is not "attached", even though cedilla is.
> > A turned comma above is not _attached_ above...
>
> Correct. COMBINING COMMA BELOW belongs to combining class 220.
>
> However by Unicode specifications both it and an attached lower cedilla
> on _g_ may be rendered by unattached turned comma above which interacts
> with characters not in their respective combining classes. And this new
> turned comma above of necessity would always be applied before normal
> upper class 230 diacritics.
>
> I don't see anything wrong with this in itself, but it is not what the
> standard wording about combining classes suggests when taken on its own.

This is another example where the standard combining classes, designed for
data interchange, turn out not to be very appropriate for rendering with fonts.

Maybe another set of combining classes could be defined that describes more
precisely how characters interact, and that would allow defining a similar
normalized form, generated by layout engines before they use font data and
tables.
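To make the idea concrete, here is a minimal sketch of such a rendering-side
reordering. It assumes the layout engine would reuse the same stable sort as
canonical ordering, only with a pluggable class function. The name
reorder_marks and the class_of parameter are illustrative assumptions, not
part of any existing specification.

```python
import unicodedata

def reorder_marks(text, class_of=unicodedata.combining):
    """Stable-sort each run of combining marks by the class class_of returns.

    With the default class_of this mirrors canonical ordering of marks;
    passing another function gives a hypothetical rendering order.
    """
    out, run = [], []
    for ch in text:
        if class_of(ch) == 0:
            # A starter ends the current run of marks: flush it, sorted.
            out.extend(sorted(run, key=class_of))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=class_of))  # flush the trailing run
    return "".join(out)
```

With the UCD classes, an acute (230) typed before a cedilla (202) is moved
after it. A rendering-specific class function could order the same two marks
differently without touching the interchange form.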

This set of combining classes could be defined by reference to the combining
classes already defined in the UCD, with an override file specifying the
overrides needed for rendering.
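As a rough sketch of what "by reference" could mean in practice: look up a
rendering class in a small override table first, and fall back to the UCD
value otherwise. The table contents, the value 229 (unused in the UCD, chosen
here only because it sorts just below 230) and the function name are all
assumptions made for illustration.

```python
import unicodedata

# Hypothetical rendering overrides: code point -> rendering class.
# 229 is an illustrative value, not one assigned by the UCD.
RENDERING_OVERRIDES = {
    0x0327: 229,  # COMBINING CEDILLA, UCD combining class 202
}

def rendering_class(ch, overrides=RENDERING_OVERRIDES):
    """Return the override class if one exists, else the UCD combining class."""
    return overrides.get(ord(ch), unicodedata.combining(ch))
```

Characters without an entry keep their UCD class, so the override data only
has to list the exceptions.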

Unlike the UCD, this class override file could specify new combining values
not only for individual characters, but maybe also for groups of characters.
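One possible shape for such a file, sketched under the assumption that it
borrows the usual UCD line syntax (code point or range, semicolon, value,
optional # comment). The format, the sample values and the parse_overrides
name are hypothetical; nothing like this is defined by Unicode today.

```python
SAMPLE = """
# code point or range ; hypothetical rendering class
0327       ; 229  # single character: COMBINING CEDILLA
0312..0315 ; 231  # a group of characters given as a range
"""

def parse_overrides(text):
    """Parse override lines into a dict mapping code point -> class value."""
    overrides = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        target, value = (field.strip() for field in line.split(";"))
        first, _, last = target.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point if no range
        for cp in range(start, end + 1):
            overrides[cp] = int(value)
    return overrides
```

Expanding ranges at load time keeps the lookup a plain dictionary access; a
real implementation might prefer to keep ranges compact.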

There is currently a file with a related role, defined mostly for collation
purposes in Thai and Lao, where specific characters are reordered. However,
it exists for a reason other than combining and rendering, as it applies to
base characters (i.e. distinct combining sequences).

For the case of Hebrew, such a combining class override tool would also be
useful. But there may also be other specificities that require other sets of
combining class values. For now we know of three applications that would
benefit from such extended combining classes: string identity for interchange
(through NF* normalizations), collation (a particular but complex case of
transforming strings into sub-cluster items), and rendering. Why not similar
classes for other transforms such as word-breaking, line-breaking,
hyphenation, standard transliterations...?

Should Unicode publish such extra properties, to help improve
interoperability of common algorithms that work on Unicode-encoded texts, so
that they produce similar results? I think this set of extra properties could
also help font designers improve their multilanguage support, and would
benefit users, i.e. those who create and edit encoded texts, those who print
or read them, and those who search text. It would also avoid inserting so many
unneeded format control characters to get the expected result, and would
promote the adoption of a common encoding model for grapheme clusters.

Any ideas?


