From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Mar 24 2006 - 07:56:30 CST
From: "Antoine Leca" <Antoine10646@leca-marti.org>
> I agree U+0D57 (as are its siblings xx55, xx56 or xx57 in the other scripts)
> do have the same properties etc. as the vowel signs, so this use could be
> possible without surgical operations on the UCD. But the current (5.0 draft)
> database says... :
> 0D57 MALAYALAM AU LENGTH MARK
> * only a representation of the right half of 0D4C
> And I am not sure this should be interpreted as you did.
> In fact, I read the word "only" as implying... the complete contrary.
> The French translation is not clearer:
> 0D57 SIGNE DE LONGUEUR MALAYALAM AOU
> * simplement la représentation de la moitié droite de 0D4C
Unicode is clear in the chapter describing the Indic scripts. These characters were encoded mostly for compatibility with older standards that could not reorder vowels or break them into two parts. So instead of encoding a single AU vowel, these old standards decomposed it into its two parts: the first part, which is the base vowel sign and is encoded first, and the au length mark, which is encoded after it and changes the semantic to the actual vowel (so this au length mark does not modify the Unicode base letter but the leading vowel mark; this is unusual compared with other Unicode combining marks).
Some may have used this au length mark in isolation in texts that describe the orthography. But this still does not make it a true vowel sign; in isolation it acts more like a symbol, and it has no meaning as a letter in the actual language.
In fact, using the decomposed form is not recommended, and any text processor should consider the pair made of the "base" vowel and the au length mark as a whole, recomposing them as much as possible if they occur in decomposed form in some legacy texts converted to Unicode. (That's why they have a canonical composition.) But a renderer would likely decompose the AU vowel again into its two parts, and then apply reordering to the first part, leaving the AU length mark at the end, after the consonants of the cluster, possibly before a bindu that could still modify it, but in any case before the anusvara that could be there (to denote a nasalisation of the vowel and/or a final dead "n" consonant, depending on regional accents).
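As a small illustration of the recomposition mentioned above, here is a sketch using Python's standard unicodedata module. It assumes only the canonical decomposition of U+0D4C into U+0D46 followed by U+0D57 as given in the UCD; the helper names recompose and decompose are mine and purely illustrative.

```python
import unicodedata

KA = "\u0D15"         # MALAYALAM LETTER KA
SIGN_E = "\u0D46"     # MALAYALAM VOWEL SIGN E (the part drawn to the left)
AU_LENGTH = "\u0D57"  # MALAYALAM AU LENGTH MARK (the part drawn to the right)
SIGN_AU = "\u0D4C"    # MALAYALAM VOWEL SIGN AU (the two-part vowel sign)

def recompose(text: str) -> str:
    """What a text processor could do with legacy decomposed data (NFC)."""
    return unicodedata.normalize("NFC", text)

def decompose(text: str) -> str:
    """What a renderer could do before reordering the two parts (NFD)."""
    return unicodedata.normalize("NFD", text)

legacy = KA + SIGN_E + AU_LENGTH
print([f"U+{ord(c):04X}" for c in recompose(legacy)])
print([f"U+{ord(c):04X}" for c in decompose(KA + SIGN_AU)])
```

The first line printed should show the pair folded into the single AU vowel sign, the second the two parts again.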
Now if the right part of the vowel, or a simple vowel written to the right, is encoded before the candrabindu or anusvara, that right part should be reordered to the end of the cluster by the renderer or font, so that the candrabindu and/or anusvara will be rendered as diacritics of the graphical "base" consonant.
When you have done all this reordering of leading vowel parts, trailing vowel parts, bindus and anusvara, only the consonants remain in the middle (each possibly with its nukta, but each consonant+nukta pair should be treated as a whole, as if it were a single unit). In this list of consonants, one of them adopts a full letter form; it is normally the last encoded one, except in some cases where a few dead consonants are reordered after the live consonant (and in that case, the moved dead consonant can adopt either the form of a diacritic or that of a full consonant).
When all these consonants have been reordered, what remains is just a possible list of leading consonants in half-form (the last one may actually be a live consonant phonetically, but it is not the one that will carry the vowel diacritics or final anusvara), followed by a consonant in full form (unless it is truncated to half-form by ZWJ with no other consonant after it), and possibly followed by consonants that have been reordered and moved forward to be shown in subjoined form (for example REPHA, which is the subjoined form of a reordered leading dead RA).
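To make the first reordering steps a bit more concrete, here is a rough sketch for a single Malayalam cluster. The class lists are my own assumption and deliberately incomplete, the function name visual_order is hypothetical, and the dead-consonant and REPHA reordering described above is not attempted at all.

```python
import unicodedata

# Assumed, incomplete class lists for illustration only.
PRE_BASE_VOWELS = {"\u0D46", "\u0D47", "\u0D48"}  # vowel signs E, EE, AI: drawn to the left
FINAL_MARKS = {"\u0D02", "\u0D03"}                # anusvara, visarga

def visual_order(cluster: str) -> str:
    """Move leading vowel parts to the front and final marks to the end."""
    # NFD splits a two-part vowel sign such as U+0D4C into U+0D46 + U+0D57.
    parts = unicodedata.normalize("NFD", cluster)
    leading = [c for c in parts if c in PRE_BASE_VOWELS]
    final = [c for c in parts if c in FINAL_MARKS]
    # Consonants, viramas and trailing vowel parts keep their encoded order.
    middle = [c for c in parts
              if c not in PRE_BASE_VOWELS and c not in FINAL_MARKS]
    return "".join(leading + middle + final)

# KA + VOWEL SIGN AU + ANUSVARA
reordered = visual_order("\u0D15\u0D4C\u0D02")
print([f"U+{ord(c):04X}" for c in reordered])
```

With this input the left part U+0D46 ends up before KA, and the AU length mark stays after the consonant but before the anusvara.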
However, I wonder how one could render a REPHA under a half-form final consonant. My opinion is that ZWJ does not block that REPHA from being reordered further, and that ZWJ is technically part of the consonant cluster rather than encoded after it: it can be used to block the formation of a ligature between a dead consonant and another consonant (if such a ligature exists in the corresponding script) so that the dead consonant remains in half-form. But I may be wrong, and ZWJ may also block REPHA from moving further to the right, for example onto a consonant encoded after ZWJ which adopts a full form, given that there is still no full-form consonant before ZWJ.
To block the REPHA from going further, one would have to use ZWNJ instead, and so the REPHA will join with the last dead consonant before ZWNJ.
Now you can apply locale-specific conjunct ligatures pair by pair, starting from the last pair: each pair has a consonant in half-form and another consonant just after it in full form (or in conjunct form if it is itself a ligature).
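A minimal sketch of this pairwise ligation, working from the last pair backwards. The glyph names and the contents of the table are invented for the example; a real font would supply its own, per locale.

```python
def apply_conjuncts(glyphs, table):
    """Replace (half-form, following glyph) pairs by a conjunct, last pair first."""
    out = list(glyphs)
    i = len(out) - 2
    while i >= 0:
        pair = (out[i], out[i + 1])
        if pair in table:
            # The new conjunct can itself be the second member of the next pair.
            out[i:i + 2] = [table[pair]]
        i -= 1
    return out

# Hypothetical glyph names and ligature table.
CONJUNCTS = {
    ("ta.half", "ra"): "ta_ra",
    ("sa.half", "ta_ra"): "sa_ta_ra",
}

print(apply_conjuncts(["sa.half", "ta.half", "ra"], CONJUNCTS))
print(apply_conjuncts(["ka.half", "ta.half", "ra"], CONJUNCTS))
```

In the second call only the last pair has an entry in the table, so the leading consonant simply stays in half-form.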
I have still not detailed everything here, but this kind of algorithm is the one that will work with all Indic scripts (including Tibetan, and even Thai, except that with Thai no reordering of leading vowels or vowel parts is necessary, as these parts are encoded in visual order for compatibility with TIS-620). The differences are in the list of letters to consider, but almost all cases are already present in the Devanagari script: it's just a matter of generalization and of creating precise lists of the letters to which each rule applies (and these vary according to scripts, or to the written traditions of some languages).
All this can be formulated as a context-free grammar that works on a limited alphabet, and so it can be processed by a finite-state automaton. Such an automaton will have a state table of finite size, which can be represented in the substitution tables of an OpenType font. So custom support by the renderer for a script is not needed, as long as the font already contains the substitution rules and can map the final automaton states to a defined glyph for final rendering.
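Here is a toy version of such an automaton, recognizing a simplified Devanagari cluster of the shape consonant [nukta] (virama consonant [nukta])* [virama] [vowel sign] [modifier]. The character classes, the state names and the grammar itself are my simplifications for illustration; ZWJ, ZWNJ and independent vowels are left out.

```python
def char_class(ch):
    """Map a character to a coarse class (assumed, simplified ranges)."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant
    if cp == 0x093C:
        return "N"  # nukta
    if cp == 0x094D:
        return "H"  # virama
    if 0x093E <= cp <= 0x094C:
        return "V"  # dependent vowel sign
    if 0x0901 <= cp <= 0x0903:
        return "M"  # candrabindu, anusvara, visarga
    return None

# Finite state table: (state, class) -> next state.
TRANSITIONS = {
    ("start", "C"): "cons",
    ("cons", "N"): "nukta",
    ("cons", "H"): "halant",
    ("cons", "V"): "vowel",
    ("cons", "M"): "mod",
    ("nukta", "H"): "halant",
    ("nukta", "V"): "vowel",
    ("nukta", "M"): "mod",
    ("halant", "C"): "cons",
    ("vowel", "M"): "mod",
}
ACCEPTING = {"cons", "nukta", "halant", "vowel", "mod"}

def is_cluster(text):
    """Return True if text is exactly one cluster of the simplified grammar."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING

print(is_cluster("\u0915\u094D\u0937"))  # KA + virama + SSA
print(is_cluster("\u093F\u0915"))        # vowel sign I before KA
```

The table has a fixed number of entries whatever the length of the text, which is the property that lets it be stored as data instead of being hard-coded in a renderer.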
Support in the renderer is only needed if the complete set of rules is not encoded in the font, and the font only contains a few descriptive mappings specific to the script (the missing substitution rules are inferred by the renderer, which has the complete set: the feature just allows mapping the pseudo-glyph ids contained in the renderer's table to the actual glyph ids in the font).
However, if the renderer only considers its own rules and then looks up in the font table only the minimum set of rules that it needs, it may fail to apply substitutions that are implemented in the font table (so it may miss interesting ligatures...). If the renderer still honors the other rules stored in the font table, it will be OK, and we'll get the best of both worlds. There may be reasons why a renderer will choose not to honor all substitutions specified by the font; for example, if a locale forbids some ligatures or wants an alternate ligature, and the font was designed only with rules valid for one locale.
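A sketch of that "best of both worlds" merge, under my own assumptions: rules are plain (pair -> ligature) mappings, the font's rule wins when both sides define the same pair, and the locale is expressed as a set of forbidden ligature names. None of this reflects the actual layout of an OpenType table; all names are hypothetical.

```python
def effective_rules(renderer_rules, font_rules, locale_forbidden=frozenset()):
    """Combine renderer and font substitutions, then drop locale-forbidden ones."""
    merged = dict(renderer_rules)
    merged.update(font_rules)  # assumption: the font overrides the renderer
    return {pair: lig for pair, lig in merged.items()
            if lig not in locale_forbidden}

# Hypothetical rule sets.
renderer_rules = {("ka.half", "ssa"): "ka_ssa"}
font_rules = {
    ("ka.half", "ssa"): "ka_ssa.alt",  # font-specific drawing of the same conjunct
    ("da.half", "da"): "da_da",        # extra ligature the renderer does not know
}

print(effective_rules(renderer_rules, font_rules))
print(effective_rules(renderer_rules, font_rules, locale_forbidden={"da_da"}))
```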
So who will implement the substitution and reordering tables: the font or the renderer? If it's the renderer, that simplifies font development a bit for the font designer, who just has to concentrate on providing the necessary glyphs for the forms described by the "feature" standardized by the renderer, and then map those glyphs in the cmap or in the feature table. But for other applications that don't use that renderer, the font will not work as expected, so another feature table will be needed for each other specific renderer.
Conclusion: we get too many features to implement in the font, and it then becomes simpler to just provide a complete set of rules accurate for one or more locales for which the font was designed, without even implementing those features.
The renderer should accept such a font anyway, using just the cmap (the font provides all the other needed rules). This will be necessary notably if the font design does not work the way assumed by the renderer, as with decorative fonts, fonts in a "handwritten script" style, or fonts with shadowed styles or differences of stroke weight that require finer typography and more distinct glyphs.