Re: Encoding Bengali Vowel forms (again)

From: Mark Davis
Date: Wed May 03 2000 - 02:26:10 EDT

Ken, I also think you have stated it well. I have no real set position on this issue, but I believe a couple of the points in favor of #2 are, on closer examination, not strong arguments in favor of that position.

> The entire sequences, Vowel_A_zophola_AA and Vowel_E_zophola_AA,
> constitute vowel-initial syllables. This would be handled more
> naturally for text processes if a structural solution comparable
> to that of Devanagari were chosen...

In all of the common text processes: rendering, word-break, etc., it will make essentially no difference that the vowel is a "base" instead of a consonant. If someone can come up with a case where it would make a difference in implementation, that would be useful to discuss.

> Rendering would be simpler, since for both of these sequences,
> there would simply be a one-to-one character-to-glyph mapping,
> as for the other long...

Once you have a system that handles Indic rendering, adding more or fewer items to a ligature table will make no real difference in complexity or performance.
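
To make the ligature-table point concrete, here is a minimal sketch, assuming a toy shaper that does nothing but greedy longest-match lookup. The table contents and glyph names (`ka_zophola`, `a_zophola`, `e_zophola`) are invented for illustration and do not come from any real font or rendering engine; a real Indic shaper also handles reordering and much else.

```python
# Hypothetical ligature table: code point sequence -> glyph name.
# All glyph names here are invented for illustration.
LIGATURES = {
    "\u0995\u09CD\u09AF": "ka_zophola",  # consonant KA + halant + YA
    "\u0985\u09CD\u09AF": "a_zophola",   # approach #1: vowel A + halant + YA
    "\u098F\u09CD\u09AF": "e_zophola",   # approach #1: vowel E + halant + YA
}
MAX_LEN = max(len(key) for key in LIGATURES)


def shape(text):
    """Greedy longest-match substitution; unmatched characters map to uXXXX."""
    glyphs = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            glyph = LIGATURES.get(text[i:i + n])
            if glyph is not None:
                glyphs.append(glyph)
                i += n
                break
        else:
            # No table entry starts here: emit a default per-character glyph.
            glyphs.append("u%04X" % ord(text[i]))
            i += 1
    return glyphs


print(shape("\u0985\u09CD\u09AF\u09BE"))
```

In this sketch the two vowel entries are just extra rows in the table; the lookup loop is the same whether or not they are present.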

I agree with Apurva that if we go down the road of sanctioning #1, it makes it trickier to do #2. What one would probably have to do is to allow #1 into the indefinite future, while making #2 the preferred mechanism when it becomes available.

I think what it comes down to is:
#1 is available today (or in the very near term), but seems odd to people used to halant as a 'vowel killer'
#2 is more natural for users, but would be at least a year or two out by the time it grinds through both UTC and WG2.


Apurva Joshi wrote:

> Many thanks Ken, for all the details below. My short [respectful] response
> is as follows:
> I am for approach (II), even though approach (I) is available now, seems to
> be easily extensible, and may already have been implemented by many
> developers. I think it's certainly worth the wait to define unitary
> characters for the LetterE_YaPhola_VowelSignAa and
> LetterA_YaPhola_VowelSignAa.
> Reasons: [feel free to correct me where I might be wrong]
> 1.
> I think that if we implement (I) now, the possibility of defining newer
> unitary characters for the same might be dim. This is given the fact that
> once things are implemented using a certain approach, people get used to it
> and it's pretty difficult to change them later.
> 2.
> If we go ahead with approach (I) now and code points 'are' allocated in the
> future, I guess we might also be unknowingly adding to 'legacy' issues for
> processing Indic: since future implementation will have to take care of both
> approaches for downward compatibility for a good amount of time.
> 3.
> From the script perspective, it would seem that the most logical sequences
> to generate the above characters would be:
> [I quote from the approach (I) below]
> >Vowel_A_zophola_AA = 0985 09CD 09AF 09BE ( a- halant ya -aa )
> >Vowel_E_zophola_AA = 098F 09CD 09AF 09BE ( e- halant ya -aa ).
> To do so in any Indic script that is alive and being used today,
> unfortunately makes me a good deal uncomfortable. Anyone familiar with the
> well defined ground rules of Indic scripts would certainly say that it goes
> against the very purpose of the halant in such scripts, i.e. to remove an
> inherent vowel. I would prefer not to redefine such rules to suit short term
> goals.
> It would be nice to hear the views of people who use Bangla, for this. I
> also look forward to a possibility that such encoding requirements for any
> script [not just Indic], will continue to get required attention and time.
> Thanks,
> -apurva
> -----Original Message-----
> From: Kenneth Whistler
> Sent: Tuesday, May 02, 2000 7:45 PM
> To: Unicode List
> Cc:
> Subject: RE: Encoding Bengali Vowel forms (again)
> A number of interesting arguments have been brought forward on this
> thread in response to Md. Abdul Malik's statement of the problem
> regarding the zophola-AA in Bengali.
> It seems to me that everyone has agreed about the problem itself:
> how to represent, in Unicode, the particular Bengali written
> initial sequences Vowel_A_zophola_AA and Vowel_E_zophola_AA that
> have been innovated in Bengali, apparently primarily for writing
> English words adapted into Bengali.
> It also seems that there is consensus that a solution involving
> private use characters is out of the question, because of the needs
> for interoperability and reliable text exchange.
> That leaves essentially two approaches.
>
> I. Represent these sequences using a halant (virama)
>      Vowel_A_zophola_AA = 0985 09CD 09AF 09BE ( a- halant ya -aa )
>      Vowel_E_zophola_AA = 098F 09CD 09AF 09BE ( e- halant ya -aa )
>
> II. Represent these sequences with newly coded characters
>      Vowel_A_zophola_AA = 0991 (structurally analogous to candra-o)
>      Vowel_E_zophola_AA = 098D (structurally analogous to candra-e)
>
> With a separate opinion that there should be two newly coded
> characters, but that they should be encoded as Bengali-specific
> additions, at 09FB and 09FC, presumably, because the zophola-AA
> forms have a separate graphic etymology in Bengali, and are not
> *formally* analogous to candra-o and candra-e, although they
> respond to the same functional requirement for transliteration
> of English sounds as the Devanagari innovations.
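
For readers who want to inspect the approach I sequences, the following is a minimal sketch using Python's standard `unicodedata` module. The dictionary name `APPROACH_I` and the helper `describe` are made up for this example; only the code points come from the message.

```python
import unicodedata

# Approach I sequences exactly as listed in the message.
APPROACH_I = {
    "Vowel_A_zophola_AA": "\u0985\u09CD\u09AF\u09BE",  # a- halant ya -aa
    "Vowel_E_zophola_AA": "\u098F\u09CD\u09AF\u09BE",  # e- halant ya -aa
}


def describe(seq):
    """Return (code point, character name) pairs for each character in seq."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in seq]


for label, seq in APPROACH_I.items():
    print(label)
    for code_point, name in describe(seq):
        print("   ", code_point, name)
```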
> The advantages I have been hearing or envision for solution I) include:
> - The characters are already encoded, so it is merely a matter
> of teaching the rendering engines about these exceptional
> sequences, and not of going through the process of getting
> formal acceptance of more encoded characters.
> - The zophola form is already a regular conjunct form for the
> consonant + halant + ya sequences, so this is not inventing
> some new shaping behavior, but merely extending it to a
> new context, defining the behavior when the halant + ya
> follows either of two particular independent vowels.
> - This solution follows naturally, by a process of extension by
> structural analogy, what must have been how the originators
> of this convention invented the usage in the first place.
> The advantages I have been hearing or envision for solution II) include:
> - The entire sequences, Vowel_A_zophola_AA and Vowel_E_zophola_AA,
> constitute vowel-initial syllables. This would be handled more
> naturally for text processes if a structural solution comparable
> to that of Devanagari were chosen, i.e. each vowel-initial
> syllable is coded as a single character, as an "independent vowel",
> even when the grapheme is arguably composed of visual parts.
> - Rendering would be simpler, since for both of these sequences,
> there would simply be a one-to-one character-to-glyph mapping,
> as for the other long independent vowels or for candra-e and
> candra-o in Devanagari.
> - Holes are present in the Bengali code chart layout at the positions
> structurally correspondent to Devanagari candra-e and candra-o, so these two sequences
> could be encoded while maintaining structural correspondences,
> even though ISCII does not define these particular extensions
> for Bengali. (This would be irrelevant if the two characters
> were to be encoded as extensions to Bengali.)
> And of course, in each case the advantages of one solution can be
> reconsidered as disadvantages for the other solution.
> Maybe others can state further advantages to one solution or the other.
> On balance, tossing in my 2 cents on this, I would have to favor the
> first approach, while admitting the advantages of the second approach.
> The great attraction is that solution I) is available now. It is merely
> a matter of specifying the particular behavior for halant + ya following
> a- or e- in Bengali. This is much less trouble than pushing through
> the 2- to 3-year process to get two new characters accepted for
> encoding in the international standard. In terms of the rendering
> side of the problem, the modifications ought to be rather minor for
> Bengali. A rendering engine is already going to have to be looking
> up triplets of <C- halant C-> to check if conjuncts are available in
> the font. Merely extending the initial class of that triplet to
> include a- and e- for Bengali ought to do the trick. Or alternatively,
> as Marco pointed out, a doublet check on <halant ya-> combinations could
> be implemented to unconditionally use the zophola form. I don't know
> if that would overgeneralize for Bengali, but the more conservative
> triplet checking should catch all cases. Effectively this is no more
> work than teaching the rendering engine that two new unitary characters
> are available as independent vowels. It ought to be a wash, since it
> is not a matter of introducing fundamentally different behavior, but
> just extending slightly already existing behavior.
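
The triplet and doublet checks described above might be sketched roughly as follows. This is only an illustration under simplifying assumptions: the consonant class is taken to be the assigned letters in U+0995..U+09B9, nukta forms are ignored, font lookup is left out, and the function names are invented here.

```python
import unicodedata

HALANT = "\u09CD"
YA = "\u09AF"

# Simplified consonant class: assigned letters in KA..HA (assumption).
CONSONANTS = {
    chr(cp)
    for cp in range(0x0995, 0x09BA)
    if unicodedata.category(chr(cp)) == "Lo"
}
# Extension for approach I: independent vowels A and E as initial members.
EXTRA_INITIALS = {"\u0985", "\u098F"}


def is_conjunct_triplet(first, second, third):
    """Conservative triplet check, extended for a-/e- + halant + ya."""
    if second != HALANT:
        return False
    if first in CONSONANTS:
        return third in CONSONANTS
    # The vowel initials only qualify when the third element is YA.
    return first in EXTRA_INITIALS and third == YA


def is_zophola_doublet(second, third):
    """Alternative doublet check: halant + ya regardless of what precedes."""
    return second == HALANT and third == YA


print(is_conjunct_triplet("\u0985", HALANT, YA))
```

The triplet version only widens the initial class for the one context in question, whereas the doublet version fires after any preceding character, which is where the worry about overgeneralizing comes from.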
> As for the drawback regarding syllable structure, it seems to me that
> this also ought to be a relatively minor extension. While it is true
> that it would be simpler for *all* vowel-initial syllables to have
> the single character independent vowels, with no matras following,
> it is also true that any code which is determining syllables in an
> Indic script already has to identify <C- halant C- -v> sequences as
> syllables, as well. It seems to me that the Vowel_A_zophola_AA sequence
> would fit rather easily as an exceptional case into that pattern --
> and the fact that it is an exceptional case functionally, limited
> almost entirely to loanword vocabulary, means that its exceptional
> processing ought not to be a major problem.
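
As a rough illustration of treating the sequence as an exceptional case in syllable detection, here is a sketch using a regular expression. The character classes are deliberately simplified assumptions (plain ranges, no nukta, no candrabindu or other trailing signs), and the pattern is not taken from any actual implementation.

```python
import re

# Simplified character classes (assumptions for illustration only).
C = "[\u0995-\u09B9]"    # consonants
H = "\u09CD"             # halant
V = "[\u09BE-\u09CC]"    # dependent vowel signs (matras)
IV = "[\u0985-\u0994]"   # independent vowels
A_OR_E = "[\u0985\u098F]"
YA = "\u09AF"

SYLLABLE = re.compile(
    f"(?:{A_OR_E}{H}{YA}{V}?"    # exceptional case: a-/e- + halant + ya (+ matra)
    f"|{C}(?:{H}{C})*{V}?"       # regular <C- halant C- -v> pattern
    f"|{IV})"                    # bare independent vowel
)

text = "\u0985\u09CD\u09AF\u09BE\u09AA"
print(SYLLABLE.findall(text))
```

The exceptional alternative is listed first so that it wins over the bare independent vowel; without it, this toy pattern would split the sequence after the initial vowel.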
> For collation and ordering, it is true that a single code point would
> make the required tables a little simpler, but entering the
> Vowel_A_zophola_AA as a sequence into an ordering table is also allowed
> by the current algorithms, so equivalent ordering behavior can be
> achieved with either approach.
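
Entering the sequence into an ordering table as a unit could look something like the sketch below. The ordering shown (placing the sequence between AA and I) is an arbitrary assumption made only so the example has something to sort; the message does not say where the sequence should collate, and the names `ORDER` and `sort_key` are invented.

```python
A_ZOPHOLA_AA = "\u0985\u09CD\u09AF\u09BE"

# Hypothetical ordering table; the position of the sequence is arbitrary.
ORDER = ["\u0985", "\u0986", A_ZOPHOLA_AA, "\u0987"]
WEIGHT = {unit: i for i, unit in enumerate(ORDER)}
LONGEST = max(len(unit) for unit in ORDER)


def sort_key(text):
    """Build a list of weights, matching multi-character units first."""
    key = []
    i = 0
    while i < len(text):
        for n in range(min(LONGEST, len(text) - i), 0, -1):
            unit = text[i:i + n]
            if unit in WEIGHT:
                key.append(WEIGHT[unit])
                i += n
                break
        else:
            # Characters outside the toy table sort after it, by code point.
            key.append(len(ORDER) + ord(text[i]))
            i += 1
    return key


words = ["\u0987", A_ZOPHOLA_AA, "\u0986", "\u0985"]
print(sorted(words, key=sort_key) == ORDER)
```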
> Next, I'm a bit worried about approach II) for normalization purposes.
> If new unitary characters were encoded for these two sequences, it
> would be necessary to determine whether they were canonical equivalents
> of the sequences suggested in solution I). If they are determined
> to be canonical equivalents, then the new characters would have to
> be added to the composition exclusions table for normalization
> (see UTR #15), and people would not get the results they expect
> under normalization Form C (i.e., you would end up with the sequence
> expressed as a decomposed sequence, anyway). If they are *not*
> determined to be canonical equivalents, then that always raises
> the question of why what looks like it *ought* to be equivalent is
> in fact not treated as equivalent by the software. One could just
> rule that halants are disallowed after independent vowels (i.e. this
> is just a "bad spelling"), but by the time you get around to doing
> this and have the new encoded characters in place, people may already
> have data represented for Bengali Vowel_A_zophola_AA using the
> sequence, and they can be expected to ask why they cannot do so.
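
The normalization behavior can be checked with Python's standard `unicodedata` module. The sketch below first confirms that the approach I sequence passes through NFC and NFD unchanged. It then shows, purely as an analogy that is not drawn from the message, what a composition exclusion does to an already-encoded Bengali letter: U+09DF (YYA) has a canonical decomposition and is excluded from composition, so Form C leaves it decomposed.

```python
import unicodedata

A_ZOPHOLA_AA = "\u0985\u09CD\u09AF\u09BE"

# The approach I sequence has no canonical composition, so both forms keep it.
print(unicodedata.normalize("NFC", A_ZOPHOLA_AA) == A_ZOPHOLA_AA)
print(unicodedata.normalize("NFD", A_ZOPHOLA_AA) == A_ZOPHOLA_AA)

# Analogy only: a precomposed letter on the composition exclusion list
# comes out of NFC as its decomposed sequence (YA + NUKTA).
YYA = "\u09DF"
nfc_yya = unicodedata.normalize("NFC", YYA)
print([f"{ord(ch):04X}" for ch in nfc_yya])
```

A new unitary character declared canonically equivalent to the sequence would presumably behave like YYA here, which is the unexpected Form C result the paragraph above warns about.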
> Finally, regarding the question that was raised about how to represent
> a candrabindu or other combining mark for the sequence, I think the
> answer is fairly simple:
> Vowel_A_zophola_AA + candrabindu = 0985 09CD 09AF 09BE 0981
> ( a- halant ya -aa candrabindu )
> You place it in the same relative position to the sequence as you
> would if you were placing a candrabindu after a single character
> for an independent vowel, or if you were placing a candrabindu after
> a conjunct plus matra combination. It ought really to be no different.
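
A small sketch of that placement rule, assuming the mark is simply appended after the complete syllable in each of the three cases. The helper name `with_candrabindu` is invented for this example.

```python
CANDRABINDU = "\u0981"


def with_candrabindu(syllable):
    """Append the candrabindu after the complete syllable."""
    return syllable + CANDRABINDU


def hex_points(text):
    """Format a string as space-separated code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


cases = {
    "independent vowel": "\u0986",                   # single character AA
    "conjunct plus matra": "\u0995\u09CD\u09B7\u09BE",
    "Vowel_A_zophola_AA": "\u0985\u09CD\u09AF\u09BE",
}
for label, syllable in cases.items():
    print(label, "=", hex_points(with_candrabindu(syllable)))
```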
> If it is unclear to implementers, it should be spelled out in
> detail. (This question is comparable to a question which came up in
> WG2 about where to represent a combining bangjeom tone mark for
> a Hangul syllable when the Hangul syllable itself is represented
> as a sequence of conjoining jamos.)
> In fact, come to think of it, it would be nice if the
> Bengali experts could help supply a *real* Bengali script introduction
> for the standard, so that Bengali implementers would have information
> about such oddities as the zophola and would have a common basis
> on which to develop Unicode implementations.
> --Ken
