L2/05-111 (subset)

[This is the subset of L2/05-111, "Comments on Public Review Issues (Feb 4, 2005 - May 4, 2005)" related to PRI 66.]

66 Encoding of Chillu Forms in Malayalam

From: Cibu
Date: 2005-03-22 18:30:55 -0800
Subject: on Public Review Issue #66: Encoding of Chillu Forms in Malayalam

Hi,

Since Chillu-NA and NA + visible VIRAMA can give a word different meanings, we cannot let the rendering system choose between them. Therefore, here are my preferences, in decreasing order:

1) Explicitly encode Chillu characters. Various issues are discussed in detail below.

2) <NA, VIRAMA> (without any joiner) should be mapped to NA with a visible Virama, because this will enforce uniformity. That is, Consonant + VIRAMA will produce a visible Virama symbol, irrespective of whether the consonant is capable of forming a Chillu or not. Example: SA + VIRAMA and NA + VIRAMA will both have a visible Virama symbol.

Issues in current representation of a Chillu letter as Consonant + Virama + ZWJ

1) ZWJ and ZWNJ are supposed to be font directives, directing a font to select from two or more semantically same renderings. In case of Malayalam, this is no longer true. ZWJ becomes an alien language construct introduced to Malayalam by Unicode to produce Chillu letters. Thus, it is possible to produce two semantically different words, which differ only by ZWJ in their Unicode representation. Example: അവന്‍ (avan – meaning 'he') & അവന്‌ (avan~ - meaning 'for him')

2) When a word is searched for in Unicode text, the search algorithm should ignore ZWJ and ZWNJ, because it should not care about how the word is rendered. By the first point, this does not hold for Malayalam. However, if the search algorithm does not ignore ZWJ and ZWNJ, it is certain to miss some words that are semantically the same but rendered differently through the use or omission of ZWJ/ZWNJ.

3) Chillu of a consonant is different from its C1-conjoining form without inherent അ (A).

3.1) Phonetic differences

Consider the combination: Vow + CC + Con.
Vow - a vowel
CC - a consonant capable of forming Chillu
Con - a consonant

When CC takes its Chillu form, it joins more closely with Vow. This produces a noticeable small stop between CC and Con.

When CC takes its C2/C1-conjoining form without inherent അ (A), it is pronounced closer to Con.

Examples: ഉണര്‍വ്‌ ഉണര്വ്‌ (unlike its pair, not a meaningful word) കല്‍വിളക്ക്‌ വില്വാദ്രി കണ്‍വട്ടം കണ്വന്‍

4) Chillu of a consonant can be treated as Anusvara

A. R. Raja Raja Varma states in his Keralapanineeyam (the foremost grammar book of Malayalam): "Anusvara is the Chillu of MA". Thus, we can say that Malayalam has more than one Anusvara: there is an Anusvara for MA, and there are Anusvaras for NA, NNA, LA, etc. This is essentially the same as saying that Malayalam has a number of Chillus, which include those of MA, NA, LA, etc.

If we look closely, the phonetic rules are also the same for Anusvara and the other Chillus, most importantly the half-stop property (please see Appendix A) when it occurs in the middle of a word. Examples:

സംയുക്തം സാമ്യം കല്‍വിളക്ക്‌ വില്വാദ്രി കണ്‍വട്ടം കണ്വന്‍

Essentially this means Unicode should do either of:

1. Include separate character locations for Chillu characters
   - solves the confusion of ല്‍ (Chillu of LA/TA) (see below)
   - addresses the above-mentioned Chillu representation issues
2. Allow Anusvara to be encoded as MA + Virama + ZWJ
   - does not change the existing encoding for Chillu
   - does not address the previously explained Chillu representation issues
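Issues 1) and 2) above can be made concrete with a small Python sketch. It is only an illustration, not part of any proposal: the helper names are made up for this example, and the strings are spelled with escapes so that the otherwise invisible joiners can be seen.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# A + VA + NA + VIRAMA, shared by both words
AVAN_BASE = "\u0d05\u0d35\u0d28\u0d4d"

avan_he = AVAN_BASE + ZWJ        # Chillu-NA: avan, 'he'
avan_for_him = AVAN_BASE + ZWNJ  # visible Virama: avan~, 'for him'


def strip_joiners(text: str) -> str:
    """Drop ZWJ/ZWNJ, as a joiner-insensitive search would (hypothetical helper)."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


def joiner_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring the joiners."""
    return strip_joiners(a) == strip_joiners(b)


# An exact comparison keeps the two words apart ...
print(avan_he == avan_for_him)
# ... while a joiner-insensitive comparison conflates them.
print(joiner_insensitive_equal(avan_he, avan_for_him))
```

Either choice is unsatisfactory for Malayalam: ignoring the joiners merges two different words, while respecting them separates spellings that differ only in rendering.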

Background
----------

A) Overloading of visible Virama in Malayalam

Following are its functions:

A.1) At the end of a word, it acts as the quarter vowel ഉ (U). Example: അവന്‌ (avan~)

A.2) In the middle of a word, it means the consonant before it is forming a conjunct with the consonant after it. Example: ശബ്‌ദം (Sabdam). In this context, it does not produce any sound whatsoever.

Functionality (A.2) was overloaded onto this grapheme when the typesetting-friendly new orthography was introduced. Unicode recognizes functionality (A.2) alone for the visible Virama of Malayalam. This contributes to the problem that the Unicode representations of അവന്‍ (avan) and അവന്‌ (avan~) differ only by ZWJ/ZWNJ.

B) Evolution & Confusion of ല്‍ (Chillu LA/TA)

For Sanskrit words used in Malayalam, ത (TA) is pronounced as it is only when a vowel or semi-vowel comes after it. On all other occasions, it is pronounced as ല (LA).

An example would be ഉത്സവം (ulsavam). Even though its Sanskrit-derived form is ഉത്‌സവം (uthsavam), it is pronounced in Malayalam as ഉല്‌സവം (ulsavam).

This means the Chillu form of ത (TA) should be pronounced as if it were the Chillu form of ല (LA). Thus, ല്‍ (Chillu LA/TA) is in a very curious situation:

B.1) Grapheme level: Graphically it is Chillu of ത (TA).
B.2) Character level: It can represent the characters – either ത (TA) or ല (LA).
B.3) Phoneme level: Its pronunciation is the Chillu of ല (LA).

Reference: കേരളപാണിനീയം (kEraLapaaNineeyam), പീഠിക (peeThika) - A. R. Raja Raja Varma

thanks,
Cibu

-- More about me: http://www.blogger.com/profile/1246232

Date/Time: Tue May 3 18:31:33 CDT 2005
Contact: Antoine Leca

We need to go one step further back than what has been presented in order to understand the issue a bit better. I apologize for the lengthy explanation, and for the lack of definitive, clear-cut answers.

The model for the use of the joiners in Indic conjuncts, in the framework of The Unicode Standard, has been designed and refined over the years. (This discussion is drawn from the very good exposition by Peter Constable, whom I want to thank here, in a paper made available as Public Review 37 in the spring of 2004; that paper discusses the “other” scripts, where it is more often C2 that changes form; yet the discussion is worth considering here, because the founding Devanagari script as well as the Malayalam script under study behave the same in this respect.) It considers three variations in the way a conjunct can be rendered:

a. a specific glyph is used for the conjunct;
b. a generic form (traditionally called half-consonant in Devanagari and by extension in the other scripts) is used for the dead C1 consonant, and the normal form for C2 is used;
c. the dead C1 is shown with a visible mark (called halant हलन्त in Hindi, candrakkala ചന്ദ്രക്കല in Malayalam), and the normal form for C2 is also used.

It also distinguishes three sequences for a conjunct, in order:
    1. <C1, VIRAMA, C2>
    2. <C1, VIRAMA, ZWJ, C2>
    3. <C1, VIRAMA, ZWNJ, C2>

Each of these sequences expresses a restriction over the preceding one. That is, when sequence 1 is used (by the writer), the rendering engine should use the first available of the three renderings: there is no restriction, and the most appropriate form is to be used. When the third kind of sequence (using ZWNJ) is used, only the c form is acceptable. Up to this point, this is the basic model (as described in The Unicode Standard, version 1.0, volume 1, 1991), and it is in accordance with just about every other use of the joiners. Applied to Malayalam, this means that if a conjunct exists (as in the prototypical N.MA ന്മ example), the sequence <U+0D28, U+0D4D, U+0D2E> should show it, while the sequence <U+0D28, U+0D4D, U+200C, U+0D2E> should render ന്‌മ instead, using the default way to render a conjunct. On the other hand, when there is no specific glyph for the conjunct, as for example in the case of L.THA ല്ഥ (a contrived example of a meaningless conjunct that is not expected to appear), the rendering will always be the same, using candrakkala ചന്ദ്രക്കല.

Later revisions of The Unicode Standard introduced the intermediate step, the sequence of kind 2, with ZWJ. Its assigned meaning was to restrict rendering to the two latter schemes only, or in other words to disallow the use of a specific glyph to render the conjunct (at least, as it appears in print).
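The three kinds of sequence and the renderings each one permits can be summarized in a short Python sketch. It is an illustration only: the function name, the table and the letter codes for schemes a, b and c are invented for this example and are not taken from any standard text.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
VIRAMAS = {"\u094d", "\u0d4d"}  # Devanagari and Malayalam VIRAMA

# Schemes permitted for each kind of sequence, in order of preference.
ALLOWED_SCHEMES = {1: "abc", 2: "bc", 3: "c"}


def sequence_kind(seq: str) -> int:
    """Classify <C1, VIRAMA, [joiner,] C2> as kind 1, 2 or 3 (hypothetical helper)."""
    if len(seq) < 3 or seq[1] not in VIRAMAS:
        raise ValueError("expected <C1, VIRAMA, ...>")
    if seq[2] == ZWJ:
        return 2
    if seq[2] == ZWNJ:
        return 3
    return 1


n_ma = "\u0d28\u0d4d\u0d2e"              # NA, VIRAMA, MA
n_ma_zwnj = "\u0d28\u0d4d" + ZWNJ + "\u0d2e"

print(sequence_kind(n_ma), ALLOWED_SCHEMES[sequence_kind(n_ma)])
print(sequence_kind(n_ma_zwnj), ALLOWED_SCHEMES[sequence_kind(n_ma_zwnj)])
```

Each kind allows a suffix of the a, b, c list, which is what "a restriction over the preceding one" amounts to.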

Let us pause at this scheme. One striking point here is that it goes against the intended use of ZWJ: logically, ZWJ would be used to request a closer conjunct, that is, a representation that occurs earlier in the a, b, c list. Such a case occurs in modern Devanagari, with conjuncts like ट्ट TT.TTA or ङ्ख NG.KHA, which are nowadays often shown with a visible halant, while the traditional Sanskrit form uses a stacked conjunct. As a result, the simple sequence (kind 1) <U+091F, U+094D, U+091F> is usually rendered by glyphs as in scheme c, and neither ZWNJ nor ZWJ (under the current rules) would have any effect, since c is already the most restricted option. And there is no obvious way to request a stacked conjunct, that is, a representation according to scheme a, with a dedicated glyph.

Beyond this particular case, this scheme appears to work for Devanagari (at least when one does not try to re-use the same character for another meaning, as happened with the so-called eyelash-ra).

Here we need a second digression. In schemes b and c, the C2 consonant is always unmodified with respect to its standalone form, that is, the form it would have if it stood outside any conjunct. As a result, the same process can be applied when there is no C2 consonant, particularly at the end of a word. However, there is an important difference here: in such a case the a scheme (the specific glyph) is… the c scheme, using the halant हलन्त! So in such a case, the sequence <C1, VIRAMA, ZWJ> should be understood as a request to use scheme b (if available). [In fact, this is historically how ZWJ was introduced into this game: it was equated with the ISCII-88 character DB, INV, the invisible consonant (D9 in ISCII-91), whose effect as the C2 consonant in a conjunct is to force the use of scheme b, lacking any “specific glyph” proper to scheme a. Yet ISCII INV has many other possible uses, many of which are not achievable using ZWJ in today’s Unicode.]

The whole scheme was also thought to work for Malayalam, considering the cillakṣaram ചില്ലക്ഷരം in a similar way as Devanagari half-consonants, in such a way that they are not used if a specific glyph exists, but they are preferred over the use of the visible candrakkala ചന്ദ്രക്കല. That is, use of cillakṣaram ചില്ലക്ഷരം for the C1 consonant is preferred over the use of candrakkala ചന്ദ്രക്കല, but a specific conjunct is definitively better. This axiom was established as the unwritten rule to render Malayalam.

When it comes to the final position in a word (the principal use of the cillakṣaram ചില്ലക്ഷരം), the same general rule could be used: sequences 2 and 3 restrict the rendering to less usual forms.

As a result, we have the following rules for rendering:

  • the normal rendering for a <C1, U+0D4D, C2> conjunct is the specific glyph, if it exists (scheme a); else, if C1 can form a cillakṣaram ചില്ലക്ഷരം, this scheme (b) is used; else, a visible candrakkala ചന്ദ്രക്കല is shown on top of C1 (scheme c).
  • at the end of a word, the rendering for <C, U+0D4D> is the cillakṣaram ചില്ലക്ഷരം if it exists (scheme b); else, a visible candrakkala ചന്ദ്രക്കല is shown on top of C (scheme c).
  • the use of ZWNJ restricts rendering to scheme c.
  • the use of ZWJ forces a cillakṣaram over an existing conjunct; as a result, the sequence <C, U+0D4D, U+200D> is a sure way to represent a cillakṣaram ചില്ലക്ഷരം, independently of context.
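The four rules above can be sketched as a small decision function in Python. This is only an illustration under assumptions: the function name, its boolean parameters (whether the font has a glyph for the conjunct, whether the consonant has a cillu form) and the letter codes for schemes a, b and c are invented for the example; a real rendering engine would derive them from the font.

```python
from typing import Optional

ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def malayalam_scheme(
    has_cillu_form: bool,
    conjunct_glyph_exists: bool = False,
    joiner: Optional[str] = None,
    word_final: bool = False,
) -> str:
    """Return 'a', 'b' or 'c' for <C, U+0D4D, [joiner], [C2]> (hypothetical helper)."""
    if joiner == ZWNJ:
        # ZWNJ restricts rendering to the visible candrakkala.
        return "c"
    if joiner == ZWJ:
        # ZWJ forces the cillu; falling back to 'c' when the consonant has
        # no cillu form is an assumption drawn from the general model.
        return "b" if has_cillu_form else "c"
    if not word_final and conjunct_glyph_exists:
        # Plain sequence inside a word: the specific glyph wins.
        return "a"
    # No specific glyph, or end of word: cillu if available, else candrakkala.
    return "b" if has_cillu_form else "c"


# <NA, VIRAMA, MA> with a conjunct glyph available in the font
print(malayalam_scheme(has_cillu_form=True, conjunct_glyph_exists=True))
# <NA, VIRAMA> at the end of a word
print(malayalam_scheme(has_cillu_form=True, word_final=True))
```

Written this way, the ambiguity discussed below is easy to see: without a joiner, the same <C, U+0D4D> prefix can come out as scheme a or scheme b depending on what follows and on the font.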

To answer the specific question of the issue, under this scheme, the sequence <U+0D28, U+0D4D> is either a part of a conjunct (like in N.MA ന്മ), or it is the cillu n ന്.

While this explanation might appear clear and logical when laid out this way, the mere existence of this issue shows that it did not succeed; I believe there are three reasons for this lack of success.

First, while the schemes for rendering Devanagari or Tamil were explained in detail as early as 1992, it was not until 2001 that The Unicode Standard, then at version 4.0, cared to explain how Malayalam “worked”. Furthermore, at the same time an agency of the Kerala government published another, quite distinct, standard on the use and rendering of Malayalam using the Unicode encoding as a basis. To add more confusion, a leading operating system publisher was developing, in about the same time frame, its own “solution” for the Malayalam script, a solution which, when it was published in 2004, appeared to be neither the scheme explained above nor the scheme proposed in Kerala. Given this confusion, it is understandable that other people trying to implement Malayalam rendering are either asking for support, proposing inadequate solutions, or simply postponing development.

A second important point is that the scheme above was conceived as a derivation of the Devanagari case. Of course, this was to be expected, since ISCII is also heavily based on the rules for Devanagari, and the scholarly material available to “Western” experts is often biased toward Devanagari too. This would not be much of a problem if it did not have two important consequences:

- First, cillakṣaram ചില്ലക്ഷരം are not shown in the exposition the way speakers of Malayalam see them. The above exposition might lead a casual reader to think that I consider cillakṣaram ചില്ലക്ഷരം to be an equivalent of Devanagari half-consonants (those without the right leg), but they would be wrong. I merely observe that cillakṣaram ചില്ലക്ഷരം behave the same way inasmuch as they are preferred to the halant form while a specific conjunct would be preferred over them, so that a similar mechanism can be conceived. I do not see any further similarities; for example, I know that the glyphs for the cillus are expected to be seen on a keyboard layout, but the writer would not expect such a key to be the equivalent of <C, U+0D4D>. I am really speaking about a deficit of explanation and, more generally, of attention toward the native audience.

- Then, the plain form of a cillakṣaram ചില്ലക്ഷരം is a 3-codepoint sequence; since they are often used in Malayalam, this creates a clear overhead that goes against the very nature of the virama model (which is based on the observations that the a vowel is by far the most frequent, and that conjuncts are rarer than simple consonants: this makes ISCII, and hence Unicode, quite economical encodings for the Indian languages.)

The result is that the process is difficult to understand unless one is familiar with Hindi rendering (!), and it ignores basic facts such as the existence of keys that directly create the 3-codepoint sequences (to be sure to represent the cillakṣaram ചില്ലക്ഷരം, even if it should not be the sequence to use, as we will see shortly).

The third reason is another process that is occurring meanwhile. For about 40 years, the government of Kerala has been trying to promote a reform of the script. The basis for this reform was to reduce the number of ligatures; the impact is particularly important in two areas: education and the printing industry (it was said that 900 glyphs were needed to print Malayalam; of course this was more of a concern with lead typography than it is now; yet it is still easier to create a 150-glyph electronic font than one with 900!) I cannot form a definitive judgment about whether or not the reform will finally succeed, some 40 years from now. But for the present generation as a whole, which learned the traditional style at school and which is using Unicode right now, there is a clear need to be able to deal with the two forms. Of course one can encounter zealots in both camps; but I feel Unicode should ignore them and provide a solution that “works everywhere”.

And in such a context we again encounter the problem I described above, about the loss of the original function of the ZWJ, which is to ask for a more compact yet irregular rendering. Consider Y.K.KA യ്ക്ക and L.K.KA ല്ക്ക, two conjuncts that are neither common nor rare; under the traditional style, they were represented with a stacked conjunct, the C1 consonant in nominal form and below it a subjoined form of the K.KA ക്ക conjunct, usually lacking the top part.

With the reform, both conjuncts are declared obsolete. For the former, this means that <U+0D2F, U+0D4D> is considered separately and rendered with candrakkala ചന്ദ്രക്കല, as യ്, followed by the K.KA conjunct ക്ക; on any Malayalam keyboard layout, the typing order will stay the same, using five keys; in any case, the resulting encoding will be compatible with the traditional rendering: it only needs to be rendered with a font that has the old conjunct. But for the latter, <U+0D32, U+0D4D> will be rendered as ല്, cillu l; and this glyph will be independently present on keyboards, so a writer can legitimately enter it using that key (rather than the two-key sequence LA ല then candrakkala ചന്ദ്രക്കല); and the software will then insert a ZWJ inside the conjunct (in order to preserve the inferred intention of the author to use a cillakṣaram ചില്ലക്ഷരം), when there is no such intention; worse, lacking a view of the result in traditional fonts, the author will not notice anything wrong: after all, there is no difference in the reformed style between <0D32, 0D4D, 0D15, 0D4D, 0D15> and <0D32, 0D4D, 200D, 0D15, 0D4D, 0D15> [I do not know if this is a typo, but I have seen such a distinction in a small glossary, written in traditional style: അയല്ക്കാരന് “neighbour” uses a cillakṣaram ചില്ലക്ഷരം, similar to the base word അയല്, while പാല്ക്കാരന് “milkman” does not and uses a complex conjunct.] This problem occurs because there is no way to indicate whether a given cillu could, or should not, be “swallowed” into a larger conjunct when presented in a different context.

It is important to note here that such a concern has NOT been studied by the Kerala IT Mission, since they restricted their field to the reformed style.

If the present model did not succeed, what could happen with the “solution” to encode five more codepoints?

First, it should be clear that these new codepoints introduce a whole new set of possible combinations, so it might be expected that all the cases described here could be covered. Yet the resulting complexity will not allow easy implementations, at least for a complete solution that covers both traditional and reformed styles. Of particular interest here is the fact that the new codepoints are a new kind of animal: they are not consonants, but they are not completely dead consonants either (since there exists at least one stacked conjunct, <U+0D28, U+0D4D, U+0D31>, pronounced /nṯ/, shown as cillu n on top of ṟa, which should probably be encoded using the new codepoint.)

Then, since the new codepoints will to some extent replace the present <C, VIRAMA, ZWJ> sequence, one should define rules to recode existing text; the same rules could also be implemented in the rendering engines, in order to deal with the deprecated sequences; as is implied by the text of the issue, the difficulty here would be to identify common rules among the current implementations and the uses that may have been made of them.
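A naive recoding rule of this kind might look like the following Python sketch. It rests on loud assumptions: the function name is invented, and the target code points in the table are the atomic chillu letters that were only assigned later (Unicode 5.1), not anything fixed at the time of this discussion; any other five values would serve equally well for the illustration.

```python
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER

# Base consonant -> assumed atomic chillu code point (see the note above).
CHILLU_FOR = {
    "\u0d23": "\u0d7a",  # NNA -> CHILLU NN
    "\u0d28": "\u0d7b",  # NA  -> CHILLU N
    "\u0d30": "\u0d7c",  # RA  -> CHILLU RR
    "\u0d32": "\u0d7d",  # LA  -> CHILLU L
    "\u0d33": "\u0d7e",  # LLA -> CHILLU LL
}


def recode_chillus(text: str) -> str:
    """Replace every <C, VIRAMA, ZWJ> with one chillu code point (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] in CHILLU_FOR and text[i + 1:i + 3] == VIRAMA + ZWJ:
            out.append(CHILLU_FOR[text[i]])
            i += 3  # skip the consonant, the virama and the joiner
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# A, VA, NA, VIRAMA, ZWJ (avan, 'he') becomes A, VA, CHILLU N
print([hex(ord(ch)) for ch in recode_chillus("\u0d05\u0d35\u0d28\u0d4d\u200d")])
```

The rule is blind to context: it would also recode a ZWJ that a keyboard inserted in the middle of a sequence such as <0D32, 0D4D, 200D, 0D15, 0D4D, 0D15>, which is exactly the kind of case where the author's intention is unclear.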

Another difficulty is that, to date, the only usable proposal for the rendering rules is the one issued by the Kerala IT Mission. But as I said above, it fails to address the issue of the traditional script, which is a substantial problem of its own.

Introducing new codepoints will clearly help with the first problem, as it could be seen as an acceptance of the 2001 proposal from Kerala (or an improvement over it). Since it would make Malayalam substantially different from Devanagari, this could also be seen positively. Regarding the second problem, right now nothing can be said: it depends entirely on the rules that are to be fixed regarding the handling of the unobvious sequences, or equivalently the ways to encode the special cases. The third problem is entirely open at this point: I believe these new codepoints could be very well adapted to reformed Malayalam; but I am not so sure they will allow every other text in traditional style to be encoded; and I doubt they will allow a text composed by a writer who only knew the reformed script to be displayed legibly in traditional style.