RE: Encoding Bengali Vowel forms (again)

From: Apurva Joshi (apurvaj@microsoft.com)
Date: Wed May 03 2000 - 00:26:26 EDT


Many thanks Ken, for all the details below. My short [respectful] response
is as follows:
I am for approach (II). Even though approach (I) is available now, seems to
be easily extendable, and may already have been implemented by many
developers, I think it's certainly worth the wait to define unitary
characters for the LetterE_YaPhola_VowelSignAa and
LetterA_YaPhola_VowelSignAa.

Reasons: [feel free to correct me where I might be wrong]
1.
I think that if we implement (I) now, the possibility of defining newer
unitary characters for the same might be dim. This is given the fact that
once things are implemented using a certain approach, people get used to it
and it's pretty difficult to change them later.

2.
If we go ahead with approach (I) now and code points 'are' allocated in the
future, I guess we might also be unknowingly adding to 'legacy' issues for
processing Indic, since future implementations will have to take care of both
approaches for backward compatibility for a good amount of time.

3.
From the script perspective, it would seem that the most logical sequences
to generate the above characters would be:
[I quote from the approach (I) below]
>Vowel_A_zophola_AA = 0985 09CD 09AF 09BE ( a- halant ya -aa )
>Vowel_E_zophola_AA = 098F 09CD 09AF 09BE ( e- halant ya -aa ).
To do so in any Indic script that is alive and being used today
unfortunately makes me a good deal uncomfortable. Anyone familiar with the
well-defined ground rules of Indic scripts would certainly say that it goes
against the very purpose of the halant in such scripts, i.e. to remove an
inherent vowel. I would prefer not to redefine such rules to suit short-term
goals.

It would be nice to hear the views of people who use Bangla on this. I
also look forward to the possibility that such encoding requirements for any
script [not just Indic] will continue to get the attention and time they need.
Thanks,
-apurva

-----Original Message-----
From: Kenneth Whistler [mailto:kenw@sybase.com]
Sent: Tuesday, May 02, 2000 7:45 PM
To: Unicode List
Cc: kenw@sybase.com
Subject: RE: Encoding Bengali Vowel forms (again)

A number of interesting arguments have been brought forward on this
thread in response to Md. Abdul Malik's statement of the problem
regarding the zophola-AA in Bengali.

It seems to me that everyone has agreed about the problem itself:
how to represent, in Unicode, the particular Bengali written
initial sequences Vowel_A_zophola_AA and Vowel_E_zophola_AA that
have been innovated in Bengali, apparently primarily for writing
English words adapted into Bengali.

It also seems that there is consensus that a solution involving
private use characters is out of the question, because of the needs
for interoperability and reliable text exchange.

That leaves essentially two approaches.

I. Represent these sequences using a halant (virama)

   Vowel_A_zophola_AA = 0985 09CD 09AF 09BE ( a- halant ya -aa )
   Vowel_E_zophola_AA = 098F 09CD 09AF 09BE ( e- halant ya -aa )

II. Represent these sequences with newly coded characters

   Vowel_A_zophola_AA = 0991 (structurally analogous to candra-o)
   Vowel_E_zophola_AA = 098D (structurally analogous to candra-e)

   With a separate opinion that there should be two newly coded
   characters, but that they should be encoded as Bengali-specific
   additions, at 09FB and 09FC, presumably, because the zophola-AA
   forms have a separate graphic etymology in Bengali, and are not
   *formally* analogous to candra-o and candra-e, although they
   respond to the same functional requirement for transliteration
   of English sounds as the Devanagari innovations.
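
For readers who want to inspect the code points involved, here is a minimal
Python sketch. The constant and function names are illustrative only and do
not come from the thread. It spells out the approach I sequences exactly as
listed above and checks that the two positions proposed under approach II
(098D and 0991) are unassigned in the Unicode data shipped with the Python
interpreter.

```python
import unicodedata

# Approach I sequences, copied from the listing above.
VOWEL_A_ZOPHOLA_AA = "\u0985\u09CD\u09AF\u09BE"  # a- halant ya -aa
VOWEL_E_ZOPHOLA_AA = "\u098F\u09CD\u09AF\u09BE"  # e- halant ya -aa


def describe(text):
    """Return (hex code point, character name) pairs for each character."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def is_unassigned(code_point):
    """True if the interpreter's Unicode data has no character at code_point."""
    return unicodedata.category(chr(code_point)) == "Cn"


if __name__ == "__main__":
    for pair in describe(VOWEL_A_ZOPHOLA_AA):
        print(pair)
    # The two positions suggested under approach II.
    print(is_unassigned(0x098D), is_unassigned(0x0991))
```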

The advantages I have been hearing or envision for solution I) include:

   - The characters are already encoded, so it is merely a matter
     of teaching the rendering engines about these exceptional
     sequences, and not of going through the process of getting
     formal acceptance of more encoded characters.

   - The zophola form is already a regular conjunct form for the
     consonant + halant + ya sequences, so this is not inventing
     some new shaping behavior, but merely extending it to a
     new context, defining the behavior when the halant + ya
     follows either of two particular independent vowels.

   - This solution follows naturally, by a process of extension by
     structural analogy, what must have been how the originators
     of this convention invented the usage in the first place.

The advantages I have been hearing or envision for solution II) include:

   - The entire sequences, Vowel_A_zophola_AA and Vowel_E_zophola_AA,
     constitute vowel-initial syllables. This would be handled more
     naturally for text processes if a structural solution comparable
     to that of Devanagari were chosen, i.e. each vowel-initial
     syllable is coded as a single character, as an "independent vowel",
     even when the grapheme is arguably composed of visual parts.

   - Rendering would be simpler, since for both of these sequences,
     there would simply be a one-to-one character-to-glyph mapping,
     as for the other long independent vowels or for candra-e and
     candra-o in Devanagari.

   - Holes are present in the Bengali code chart layout at the
     positions structurally correspondent to Devanagari candra-e and
     candra-o, so these two sequences could be encoded while
     maintaining structural correspondences, even though ISCII does
     not define these particular extensions for Bengali. (This would
     be irrelevant if the two characters were to be encoded as
     extensions to Bengali.)

And of course, in each case the advantages of one solution can be
reconsidered as disadvantages for the other solution.

Maybe others can state further advantages to one solution or the other.

On balance, tossing in my 2 cents on this, I would have to favor the
first approach, while admitting the advantages of the second approach.

The great attraction is that solution I) is available now. It is merely
a matter of specifying the particular behavior for halant + ya following
a- or e- in Bengali. This is much less trouble than pushing through
the 2- to 3-year process to get two new characters accepted for
encoding in the international standard. In terms of the rendering
side of the problem, the modifications ought to be rather minor for
Bengali. A rendering engine is already going to have to be looking
up triplets of <C- halant C-> to check if conjuncts are available in
the font. Merely extending the initial class of that triplet to
include a- and e- for Bengali ought to do the trick. Or alternatively,
as Marco pointed out, a doublet check on <halant ya-> combinations could
be implemented to unconditionally use the zophola form. I don't know
if that would overgeneralize for Bengali, but the more conservative
triplet checking should catch all cases. Effectively this is no more
work than teaching the rendering engine that two new unitary characters
are available as independent vowels. It ought to be a wash, since it
is not a matter of introducing fundamentally different behavior, but
just extending slightly already existing behavior.
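
The triplet and doublet checks described above can be sketched roughly as
follows. This is only an illustration under simplifying assumptions: the
function names are made up, the "font lookup" is reduced to finding
positions in a string, and the consonant class is simply every assigned
letter in 0995..09B9. A real rendering engine would work on glyph runs and
consult the font's conjunct tables.

```python
import unicodedata

HALANT = "\u09CD"
YA = "\u09AF"
LETTER_A = "\u0985"
LETTER_E = "\u098F"

# Assumption: treat every assigned letter in U+0995..U+09B9 as a consonant.
CONSONANTS = {
    chr(cp)
    for cp in range(0x0995, 0x09BA)
    if unicodedata.category(chr(cp)) == "Lo"
}

# The extension discussed above: let a- and e- start the triplet too.
EXTENDED_INITIALS = CONSONANTS | {LETTER_A, LETTER_E}


def zophola_triplets(text, initials):
    """Return start indices of <initial, halant, ya> triplets in text."""
    return [
        i
        for i in range(len(text) - 2)
        if text[i] in initials and text[i + 1] == HALANT and text[i + 2] == YA
    ]


def zophola_doublets(text):
    """Return indices of every <halant, ya> pair, whatever precedes it."""
    return [
        i
        for i in range(len(text) - 1)
        if text[i] == HALANT and text[i + 1] == YA
    ]
```

With only CONSONANTS as the initial class, the sequence 0985 09CD 09AF 09BE
produces no triplet match. With EXTENDED_INITIALS it matches at index 0, and
ordinary consonant + halant + ya conjuncts match under both classes. The
doublet check finds the halant at index 1 without looking at the preceding
character at all.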

As for the drawback regarding syllable structure, it seems to me that
this also ought to be a relatively minor extension. While it is true
that it would be simpler for *all* vowel-initial syllables to have
the single character independent vowels, with no matras following,
it is also true that any code which is determining syllables in an
Indic script already has to identify <C- halant C- -v> sequences as
syllables, as well. It seems to me that the Vowel_A_zophola_AA sequence
would fit rather easily as an exceptional case into that pattern --
and the fact that it is an exceptional case functionally, limited
almost entirely to loanword vocabulary, means that its exceptional
processing ought not to be a major problem.
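
As a rough illustration of how small that extension could be, here is a
simplified syllable pattern in Python. The character classes are assumptions
made for this sketch and are far from a complete description of Bengali
syllable structure (no nukta, no candrabindu or anusvara, no final halant).
The only point is that the exceptional <a- or e-, halant, ya, matra> case
fits in as one more alternative next to the usual <C- halant C- -v> pattern.

```python
import re

# Simplified, assumed character classes for this sketch.
C = "[\u0995-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9]"  # consonants
H = "\u09CD"                                            # halant
V = "[\u0985-\u098C\u098F\u0990\u0993\u0994]"          # independent vowels
M = "[\u09BE-\u09C4\u09C7\u09C8\u09CB\u09CC]"          # dependent vowel signs

SYLLABLE = re.compile(
    # Exceptional case: a- or e- followed by halant + ya and an optional matra.
    # It is listed first so the bare independent vowel does not match alone.
    f"(?:[\u0985\u098F]{H}\u09AF{M}?"
    # Regular case: consonant cluster joined by halants, optional matra.
    f"|{C}(?:{H}{C})*{M}?"
    # Plain independent vowel.
    f"|{V})"
)


def syllables(text):
    """Split text into the syllables recognized by the sketch pattern."""
    return SYLLABLE.findall(text)
```

Under these assumptions, the string 0985 09CD 09AF 09BE 0995 09CD 09AF 09BE
splits into two syllables of four characters each: the exceptional
vowel-initial one and an ordinary conjunct plus matra.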

For collation and ordering, it is true that a single code point would
make the required tables a little simpler, but entering the
Vowel_A_zophola_AA as a sequence into an ordering table is also allowed
by the current algorithms, so equivalent ordering behavior can be
achieved with either approach.
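
To make the point about ordering tables concrete, here is a toy sketch in
Python. The weights and the position chosen for the sequence (between a- and
aa-) are invented purely for illustration and are not a claim about correct
Bengali collation. The sketch only shows that a multi-character entry can
sit in the table next to single characters, using greedy longest-match
lookup.

```python
# Hypothetical ordering table: keys may be single characters or sequences.
ORDER = {
    "\u0985": 10,                    # a-
    "\u0985\u09CD\u09AF\u09BE": 11,  # Vowel_A_zophola_AA as a sequence
    "\u0986": 20,                    # aa-
    "\u0987": 30,                    # i-
}
MAX_LEN = max(len(entry) for entry in ORDER)
FALLBACK = 1000  # characters not in the table sort after everything listed


def sort_key(text):
    """Build a tuple of weights, preferring the longest matching table entry."""
    key = []
    i = 0
    while i < len(text):
        for length in range(MAX_LEN, 0, -1):
            chunk = text[i:i + length]
            if chunk in ORDER:
                key.append(ORDER[chunk])
                i += len(chunk)
                break
        else:
            # No table entry starts here: fall back to code point order.
            key.append(FALLBACK + ord(text[i]))
            i += 1
    return tuple(key)
```

Sorting with key=sort_key then treats the four-character sequence as a
single collation element, which is the same behavior a single code point
with weight 11 would give.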

Next, I'm a bit worried about approach II) for normalization purposes.
If new unitary characters were encoded for these two sequences, it
would be necessary to determine whether they were canonical equivalents
of the sequences suggested in solution I). If they are determined
to be canonical equivalents, then the new characters would have to
be added to the composition exclusions table for normalization
(see UTR #15), and people would not get the results they expect
under normalization Form C (i.e., you would end up with the sequence
expressed as a decomposed sequence, anyway). If they are *not*
determined to be canonical equivalents, then that always raises
the question of why what looks like it *ought* to be equivalent is
in fact not treated as equivalent by the software. One could just
rule that halants are disallowed after independent vowels (i.e. this
is just a "bad spelling"), but by the time you get around to doing
this and have the new encoded characters in place, people may already
have data represented for Bengali Vowel_A_zophola_AA using the
sequence, and they can be expected to ask why they cannot do so.
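
The normalization behavior can be checked with Python's standard unicodedata
module. The sketch below, with illustrative names, shows two things. First,
the approach I sequence is left unchanged by both NFC and NFD. Second, as an
existing analogous case that is not taken from this thread, BENGALI LETTER
YYA (09DF) is canonically equivalent to 09AF 09BC and is excluded from
composition, so Normalization Form C returns the decomposed sequence. That
is the outcome described above for any new character that was made a
canonical equivalent after the fact.

```python
import unicodedata

VOWEL_A_ZOPHOLA_AA = "\u0985\u09CD\u09AF\u09BE"  # a- halant ya -aa
LETTER_YYA = "\u09DF"                            # existing excluded composite
YA_NUKTA = "\u09AF\u09BC"                        # its canonical decomposition


def is_stable(text):
    """True if text is unchanged by both NFC and NFD normalization."""
    return all(
        unicodedata.normalize(form, text) == text for form in ("NFC", "NFD")
    )


def nfc(text):
    """Shorthand for Normalization Form C."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(is_stable(VOWEL_A_ZOPHOLA_AA))
    print(nfc(LETTER_YYA) == YA_NUKTA)
```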

Finally, regarding the question that was raised about how to represent
a candrabindu or other combining mark for the sequence, I think the
answer is fairly simple:

   Vowel_A_zophola_AA + candrabindu = 0985 09CD 09AF 09BE 0981
                                      ( a- halant ya -aa candrabindu )

You place it in the same relative position to the sequence as you
would if you were placing a candrabindu after a single character
for an independent vowel, or if you were placing a candrabindu after
a conjunct plus matra combination. It ought really to be no different.
If it is unclear to implementers, it should be spelled out in
detail. (This question is comparable to a question which came up in
WG2 about where to represent a combining bangjeom tone mark for
a Hangul syllable when the Hangul syllable itself is represented
as a sequence of conjoining jamos.)
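
Spelled out as a small Python sketch (the helper name is made up for this
example), the candrabindu is simply appended in logical order, whatever the
shape of the base it follows:

```python
import unicodedata

CANDRABINDU = "\u0981"

INDEPENDENT_VOWEL = "\u0986"                     # aa-
CONJUNCT_PLUS_MATRA = "\u0995\u09CD\u09AF\u09BE"  # ka halant ya -aa
VOWEL_A_ZOPHOLA_AA = "\u0985\u09CD\u09AF\u09BE"  # a- halant ya -aa


def with_candrabindu(base):
    """Append the candrabindu after the complete base sequence."""
    return base + CANDRABINDU


if __name__ == "__main__":
    for base in (INDEPENDENT_VOWEL, CONJUNCT_PLUS_MATRA, VOWEL_A_ZOPHOLA_AA):
        marked = with_candrabindu(base)
        print(" ".join(f"{ord(ch):04X}" for ch in marked))
    print(unicodedata.name(CANDRABINDU))
```

In all three cases the mark comes last in the stored sequence, and the
result for the zophola case is the 0985 09CD 09AF 09BE 0981 sequence given
above.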

In fact, come to think of it, it would be nice if the
Bengali experts could help supply a *real* Bengali script introduction
for the standard, so that Bengali implementers would have information
about such oddities as the zophola and would have a common basis
on which to develop Unicode implementations.

--Ken


