RE: Encoding Bengali Vowel forms (again)

From: Apurva Joshi (apurvaj@microsoft.com)
Date: Fri Apr 28 2000 - 18:46:52 EDT


Please see my responses below, bearing in mind that I am not a linguist,
though I try my best to understand the points of view of all those who
impact communication_future: users, encoding standards, linguists,
developers, font designers etc. [no order of preference implied].
Thanks,
-apurva

-----Original Message-----
From: Peter Constable [mailto:peter_constable@sil.org]
Sent: Friday, April 28, 2000 8:33 AM
To: unicode@unicode.org; zzak@csi.com; opentype@list.sirius.com
Subject: Re: Encoding Bengali Vowel forms (again)

       Marco>> As usual, I cannot stop spitting my little word :-|

       Antoine>I believe I am as bad as you are. :-|.

       OK, I'll go along. :-|

       I'm very much inclined to agree with Marco that nothing *new*
       is needed, and also with Antoine that interested parties should
       discuss alternatives and agree on what will be done.

       Marco said:
>In general, viramas are just characters as any other,
>and can occur *everywhere*. And this is a general
>feature of Unicode: with few reasonable exceptions
>(e.g. unpaired surrogates), Unicode does not have a
>"syntax" that stipulates which sequences of
>characters are legal and which are not.
[apurva:] A virama can be input by a user anywhere in a sequence of
characters. However, that does not mean it would always result in output
that is meaningful to the user receiving it, or considered logical in a
[Indic] script.
I know of at least the following occurrences of the virama that are
illogical when constructing syllables in Indic:
1. when it follows another virama [I guess Sinhala permits this]
2. when it follows a vowel sign
3. when it follows an independent vowel
4. when it occurs on its own [i.e. when it's not for display purposes]
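
The four positions listed above can be sketched as a small check. This is only an illustration, assuming Bengali text and rough code point ranges for the vowel classes; the function names and reason strings are hypothetical, and nothing here is mandated by Unicode, which accepts all of these sequences.

```python
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA (hasanta)


def is_bengali(ch):
    # Whole Bengali block, U+0980..U+09FF.
    return "\u0980" <= ch <= "\u09FF"


def is_independent_vowel(ch):
    # Rough range U+0985..U+0994; it also spans a few unassigned code points.
    return "\u0985" <= ch <= "\u0994"


def is_vowel_sign(ch):
    # Rough range U+09BE..U+09CC; it also spans a few unassigned code points.
    return "\u09BE" <= ch <= "\u09CC"


def suspicious_viramas(text):
    """Return (index, reason) for each virama found in one of the four positions."""
    findings = []
    for i, ch in enumerate(text):
        if ch != VIRAMA:
            continue
        prev = text[i - 1] if i > 0 else ""
        if prev == VIRAMA:
            findings.append((i, "after virama"))
        elif prev and is_vowel_sign(prev):
            findings.append((i, "after vowel sign"))
        elif prev and is_independent_vowel(prev):
            findings.append((i, "after independent vowel"))
        elif not prev or not is_bengali(prev):
            findings.append((i, "on its own"))
    return findings


# Ka + virama + La: an ordinary conjunct, nothing flagged.
print(suspicious_viramas("\u0995\u09CD\u09B2"))
# LetterA + virama + Ya + vowel sign Aa: flagged under case 3.
print(suspicious_viramas("\u0985\u09CD\u09AF\u09BE"))
```

A check like this would belong in an input method or a linter, not in the encoding itself.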

The known purposes of the virama in Indic are as follows:
1.
Causing conjunct formation:
consonantX + virama + consonantY -> conjunctXY
eg:
Ka + virama + La -> KLa
Here the virama is said to remove the inherent vowelA in Ka, in order to
make it receptive to combine with another consonant. [eg. Klesha: meaning
sorrow].
Or,
Da + virama + Ya -> DYa [eg. Vidyaa: knowledge].
2.
Shortening a syllable final consonant, done specifically by visually adding
the virama:
consonantX + virama -> Halant form of consonantX
eg:
Ma + virama -> MaVirama
These occur frequently in Sanskrit as word final syllables. [eg. Sukham:
meaning happiness].
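
As a rough illustration of the two purposes above, the sequences can be written out as code points. The choice of Devanagari is an assumption made here because the example words are Sanskrit; the helper name `spell` is hypothetical. Whether a conjunct or a visible halant is actually drawn depends on the font and the rendering engine, not on the encoded text.

```python
import unicodedata

KA, LA, DA, YA, MA = "\u0915", "\u0932", "\u0926", "\u092F", "\u092E"
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

# Purpose 1: consonantX + virama + consonantY -> conjunctXY
KLA = KA + VIRAMA + LA
DYA = DA + VIRAMA + YA

# Purpose 2: consonantX + virama -> halant form of consonantX
MA_HALANT = MA + VIRAMA


def spell(text):
    """List the Unicode character names in stored (logical) order."""
    return [unicodedata.name(ch) for ch in text]


for label, seq in [("KLa", KLA), ("DYa", DYA), ("MaVirama", MA_HALANT)]:
    print(label, spell(seq))
```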

Given the above purposes of the Virama, if the following occurs [not just in
Bengali]:
LetterA + virama + consonantYa + vowelSignAa
it would imply the removal of LetterA [the full vowel] itself. Going by the
rules, this would not be logical.

Specifically in Bengali, the use of Ya [a semi-vowel] to obtain a CandraE
and CandraO is more of a contemporary workaround that has been resorted to
in the absence of a newly devised independent vowel. Hence I prefer to
consider its inclusion in the 'specific addition' section of Bengali.

[As an aside:
I prefer the name Halant to Virama.
Halant-> that which removes the inherent vowel.
Virama -> stop, full stop.
Swalpa virama (short stop) -> comma.
These indigenous words might change depending on the script and language
used, e.g. Halant in Bangla [Bengali] is known as hasanta.]

       Marco's general comment about Unicode not having a syntax
       (apart from things like surrogates) is, in my understanding,
       mostly but not 100% true. For example, the standard does
       indicate that Devanagari dependent vowels are to be encoded
       after their consonant (in logical order) while Thai vowels are
       encoded in visual order (which sometimes means before the
       consonant). It's necessary to mandate some things of this sort
       so that the standard will get implemented in software, and
       implemented in a consistent manner such that data interchange
       is possible (and that's the purpose for a character encoding
       standard). It would be a big problem for data interchange if
       Devanagari dependent vowels were sometimes encoded before and
       sometimes after the consonant at the whim of individual
       implementers.
[apurva:] I agree.
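
The ordering difference described above can be seen by listing the stored code points. This is a minimal sketch; the two syllables (Devanagari "ki" and Thai "ke") are examples chosen here, not taken from the discussion, and `names` is a hypothetical helper.

```python
import unicodedata

# Devanagari "ki": the vowel sign I is drawn to the LEFT of KA,
# but it is stored AFTER the consonant (logical order).
DEVANAGARI_KI = "\u0915\u093F"

# Thai "ke": SARA E is drawn to the left of KO KAI
# and is also stored BEFORE it (visual order).
THAI_KE = "\u0E40\u0E01"


def names(text):
    """Character names in the order the code points are stored."""
    return [unicodedata.name(ch) for ch in text]


print(names(DEVANAGARI_KI))
print(names(THAI_KE))
```

In both syllables the vowel is displayed on the left; only the stored order differs, which is why the standard has to state it.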

       In my mind, more of this is actually needed. Several months
       ago, we were working on our Yi font, and the samples that our
       clients showed us had occasional use of a middle dot as
       punctuation. Now, how many choices might there be for encoding
       this? I never made a thorough count, but it's more than one. I
       inquired on this list and with UTC to see if anyone could tell
       me what this punctuation character is and how it should be
       encoded, and nobody gave a definitive answer, probably because
       nobody had considered it before. We ended up using U+30FB
       KATAKANA MIDDLE DOT since this would have the
       fullwidth/monowidth properties needed for Yi. But what if
       another implementer chose to use one of the other characters
       with a similar visual appearance? The result would be a
       hindrance to successful interchange.
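
To make the ambiguity concrete, here is a sketch that lists a few dot-like characters with their East Asian Width property. The candidate list is an assumption made for illustration, not the count mentioned above, and `describe` is a hypothetical helper.

```python
import unicodedata

# Illustrative candidates only; other look-alikes exist.
CANDIDATES = ["\u00B7", "\u2022", "\u2219", "\u22C5", "\u30FB", "\uFF65"]


def describe(ch):
    """Return (code point, character name, East Asian Width class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
    )


for ch in CANDIDATES:
    print(describe(ch))
```

Two implementers picking different entries from such a list would produce text that looks the same but does not compare equal.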

       But I'm rambling. My point is that it is important for this
       issue to be discussed and that implementers agree on a
       solution. But, what Marco said about nothing prohibiting
       combining virama in new ways is absolutely true, as far as I
       know.

       Now, Apurva wrote:

>The semantics of Ya in conjunct formation and for
>use with LetterA /LetterE is very different.

       Semantics are different in what sense? Do you mean that they
       would represent different things phonologically/linguistically,
       or that different Unicode semantics would be required? If it's
       just a matter of different linguistic significance, that is a
       non-issue. The letter "g" has a different phonological value
       in "rag" than in "rough"; "e" has a different phonological
       value in "feet" than in "fate". But that doesn't mean
       different encodings are needed for these.
[apurva:] Please see my earlier response above for the Ya. In addition: The
Ya_phola in question is not phonological. It has wider impact. Because
theoretically this would then make it possible for other semi-vowels [Ra,
La, Va] to combine with an independent vowel. I am pro evolution of scripts.
However, while permitting newer possibilities we might also want to take
care that in the process we don't trample on rules already set.

       There is nothing about the Unicode semantics of Bengali
       characters that prohibits using what is already there. All
       that's needed is to abandon certain assumptions, which Marco
       has already discussed. (I'll forward that message to the
       OpenType list for the benefit of people on that list who aren't
       on Unicode.) If you want to propose adding new characters to
       Unicode, you need to have good reasons why an implementation
       using the existing characters is inadequate *in terms of text
       processing issues* (not in terms of how speakers/writers think
       of the orthography - that is essentially irrelevant).
[apurva:] I would like to think of an encoding standard [Unicode] not only
as that which takes care of 'text processing', but also as providing a means
to cleanly address changes in scripts that have taken place due to:
1. constraints in earlier technologies
2. script evolution [or the lack thereof].

I guess that's what we are all aiming at.

       As far as using the PUA is concerned, yes, that's an option.
       It becomes problematic, however, if you want all implementers
       to agree on particular PUA characters. Let's say everybody
       interested in Bengali gets together and agrees that E000 and
       E001 will be used for Vowel A_zophola_AA and Vowel
       E_zophola_AA, and let's suppose further that Apurva and co
       implement Uniscribe and some OT fonts based on this. In the
       mean time, somebody else has (as they are free to do) defined
       for their use E000 and E001 for a couple of Ethiopic characters
       that are being considered for future addition to Unicode.
       (That's a real situation - we're currently doing some work on
       Ethiopic, and we have made a number of such PUA assignments.)
       Now, that person has an Ethiopic font, and they want to display
       some text using MS software. They'll be pretty upset if
       Uniscribe munges their PUA characters. It's a legal use of
       Unicode for MS to define PUA characters for particular uses
       (though they are encouraged to do so near the top of the PUA
       range, and they really ought to publicly document what they
       do so that users will know what to expect of their software).
       But if they want to be concerned about what end users may want
       to do with their software, they need to think very carefully
       about any PUA assignments they make. As far as encouraging a
       widespread pseudo-standard use of the PUA, that is potentially
       counter to the intention of Unicode, particularly if you are
       trying to get a number of major software developers to go along.
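
The collision scenario above can be sketched as two private agreements that claim the same code points. Both tables are hypothetical: the Bengali labels follow the names used in the paragraph, and the Ethiopic entries are placeholders, not real assignments.

```python
import unicodedata


def is_private_use(ch):
    # General category "Co" marks private-use code points.
    return unicodedata.category(ch) == "Co"


# Hypothetical private agreement among Bengali implementers.
bengali_pua = {
    "\uE000": "Vowel A_zophola_AA",
    "\uE001": "Vowel E_zophola_AA",
}

# Hypothetical assignments made independently by an Ethiopic project.
ethiopic_pua = {
    "\uE000": "Ethiopic candidate 1 (placeholder)",
    "\uE001": "Ethiopic candidate 2 (placeholder)",
}


def collisions(a, b):
    """Code points that both private agreements assign."""
    return sorted(set(a) & set(b))


for ch in collisions(bengali_pua, ethiopic_pua):
    print(f"U+{ord(ch):04X}: {bengali_pua[ch]!r} vs {ethiopic_pua[ch]!r}")
```

Nothing in the text itself says which agreement applies, so software that hard-codes one reading will mishandle the other.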

       I have no problem with a couple of PUA characters being used by
       a group of people interested in Bengali as an interim solution
       for the potential characters. Getting some particular support
       for that in Uniscribe would be, I think, not a good thing, and
       I'd be very surprised if MS would entertain that possibility.
       (But then, if you use the PUA, you don't need any smart font
       behaviour for these characters.)
[apurva:] I might not look favourably on the use of PUA for this.

       But I'd argue with Marco in favour of your other proposed
       interim solution, and I'd argue that it shouldn't be just an
       interim solution but rather the permanent solution.
[apurva:] Pardon my being blunt here. But, Indic scripts [like Malayalam]
have had to see a change in orthography and typographical quality [some,
sadly for the worse] due to some interim solutions [constraints in some
earlier typesetting systems]. Unfortunately, these solutions have not
been looked at as interim, but as permanent [they have existed for decades].
As a result, a whole generation of young people in India who have not had
the opportunity to see the original orthography of the script, think that
the current incorrectly implemented solutions 'are' the way it has to be!

Hence it would be prudent of us to try our best to look at the long term
effects too, that technology [here an encoding standard] tends to usher in
with itself.
Thanks,
-apurva

       Peter Constable

       From: <Antoine.Leca@renault.fr> AT Internet on 04/27/2000 07:07
             AM

       To: Peter Constable/IntlAdmin/WCT, <unicode@unicode.org> AT
             Internet@Ccmail
       cc: <unicode@unicode.org> AT Internet@Ccmail, <zzak@csi.com>
             AT Internet@Ccmail
       
       Subject: Re: Encoding Bengali Vowel forms (again)


