Please see my responses below, bearing in mind that I am not a
linguist, though I try my best to understand the points of view of all those who
influence the future of communication: users, encoding standards, linguists,
developers, font designers, etc. [no order of preference implied].
Thanks,
-apurva
-----Original Message-----
From: Peter Constable [mailto:peter_constable@sil.org]
Sent: Friday, April 28, 2000 8:33 AM
To: unicode@unicode.org; zzak@csi.com; opentype@list.sirius.com
Subject: Re: Encoding Bengali Vowel forms (again)
Marco>> As usual, I cannot stop spitting my little word :-|
Antoine>I believe I am as bad as you are. :-|.
OK, I'll go along. :-|
I'm very much inclined to agree with Marco that nothing *new*
is needed, and also with Antoine that interested parties should
discuss alternatives and agree on what will be done.
Marco said:
>In general, viramas are just characters as any other,
>and can occur *everywhere*. And this is a general
>feature of Unicode: with few reasonable exceptions
>(e.g. unpaired surrogates), Unicode does not have a
>"syntax" that stipulates which sequences of
>characters are legal and which are not.
[apurva:] A virama can be input by a user anywhere in a sequence of
characters. However, that does not mean it will always result in output
that is meaningful to the user receiving it, or that is considered logical in an
[Indic] script.
I know of at least the following occurrences of the virama that are illogical
when constructing syllables in Indic:
1. when it follows another virama [I guess Sinhala permits this]
2. when it follows a vowel sign
3. when it follows an independent vowel
4. when it occurs on its own [ie. when it's not for display purposes]
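The four cases above can be sketched as a small check. This is only an illustration, assuming Devanagari code points and a rough character classification of my own; the function names are hypothetical and none of this is a rule taken from the Unicode Standard.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def classify(ch):
    """Rough character classes for this sketch only (Devanagari block)."""
    if ch == VIRAMA:
        return "virama"
    if "\u0904" <= ch <= "\u0914":
        return "independent vowel"
    if "\u0915" <= ch <= "\u0939":
        return "consonant"
    if "\u093E" <= ch <= "\u094C":
        return "vowel sign"
    return "other"

def questionable_viramas(text):
    """Return (index, reason) for each virama in one of the four positions."""
    findings = []
    for i, ch in enumerate(text):
        if ch != VIRAMA:
            continue
        # A virama at the very start has nothing before it.
        prev = classify(text[i - 1]) if i > 0 else "other"
        if prev == "virama":
            findings.append((i, "follows another virama"))
        elif prev == "vowel sign":
            findings.append((i, "follows a vowel sign"))
        elif prev == "independent vowel":
            findings.append((i, "follows an independent vowel"))
        elif prev == "other":
            findings.append((i, "occurs on its own"))
    return findings

# KA + VIRAMA + LA is an ordinary conjunct, so nothing is reported.
print(questionable_viramas("\u0915\u094D\u0932"))
# A + VIRAMA puts the virama after an independent vowel.
print(questionable_viramas("\u0905\u094D"))
```

The check only reports positions; whether a given script or language actually permits one of them (as may be the case for Sinhala) is a separate question.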
The known purposes of the virama in Indic are as follows:
1. Causing conjunct formation:
consonantX + virama + consonantY -> conjunctXY
eg:
Ka + virama + La -> KLa
Here the virama is said to remove the inherent vowelA in Ka, so that
it can combine with another consonant. [eg. Klesha: meaning
sorrow].
Or,
Da + virama + Ya -> DYa [eg. Vidyaa: knowledge].
2. Shortening a syllable-final consonant, done specifically by visually adding
the virama:
consonantX + virama -> Halant form of consonantX
eg:
Ma + virama -> MaVirama
These occur frequently in Sanskrit as word final syllables. [eg. Sukham:
meaning happiness].
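As a minimal sketch, the two purposes correspond to the following stored character sequences. Devanagari code points are assumed here for illustration; the variable names are mine, and whether a conjunct or a visible halant is actually drawn depends on the font and the rendering engine, not on this code.

```python
import unicodedata

KA, LA, DA, YA, MA = "\u0915", "\u0932", "\u0926", "\u092F", "\u092E"
VIRAMA = "\u094D"

def describe(seq):
    """List the Unicode character names in a sequence."""
    return [unicodedata.name(ch) for ch in seq]

# Purpose 1: consonantX + virama + consonantY (conjunct formation)
conjunct_kla = KA + VIRAMA + LA
conjunct_dya = DA + VIRAMA + YA

# Purpose 2: consonantX + virama (halant form of a final consonant)
halant_ma = MA + VIRAMA

for seq in (conjunct_kla, conjunct_dya, halant_ma):
    print(" + ".join(describe(seq)))
```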
Given the above purposes of the Virama, if the following occurs [not just in
Bengali]:
LetterA + virama + consonantYa + vowelSignAa
it would imply the removal of LetterA [the full vowel] itself. Going by the
rules, this would not be logical.
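For reference, the sequence in question can be written out for Bengali as below. This sketch only lists the stored characters and their general categories; it assumes the Bengali code points shown and does not say anything about how a renderer should display the sequence.

```python
import unicodedata

# LetterA + virama + consonantYa + vowelSignAa, in Bengali
sequence = "\u0985\u09CD\u09AF\u09BE"

def inspect(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]

for code_point, category, name in inspect(sequence):
    print(code_point, category, name)
```

Nothing at the level of character properties rejects this sequence; the objection above concerns script logic, which the character database does not encode.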
Specifically in Bengali, the use of Ya [a semi-vowel] to obtain a CandraE
and CandraO is more of a contemporary workaround, resorted to
because no newer independent vowel was devised. Hence I prefer to
consider its inclusion in the 'specific addition' section of Bengali.
[As an aside:
I prefer the name Halant to Virama.
Halant-> that which removes the inherent vowel.
Virama -> stop, full stop.
Swalpa virama (short stop) -> comma.
These indigenous words might change depending on the script and language
used eg: Halant in Bangla [Bengali] is known as hasanta.]
Marco's general comment about Unicode not having a syntax
(apart from things like surrogates) is, in my understanding,
mostly but not 100% true. For example, the standard does
indicate that Devanagari dependent vowels are to be encoded
after their consonant (in logical order) while Thai vowels are
encoded in visual order (which sometimes means before the
consonant). It's necessary to mandate some things of this sort
so that the standard will get implemented in software, and
implemented in a consistent manner such that data interchange
is possible (and that's the purpose for a character encoding
standard). It would be a big problem for data interchange if
Devanagari dependent vowels were sometimes encoded before and
sometimes after the consonant at the whim of individual
implementers.
[apurva:] I agree.
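The ordering difference can be seen in the stored code points. The sketch below assumes two example syllables of my own choosing (Devanagari KA with VOWEL SIGN I, and Thai KO KAI with SARA E); in both, the vowel is displayed to the left of the consonant, but only Thai stores it first.

```python
import unicodedata

# Devanagari: logical order, consonant first, then the dependent vowel.
devanagari_ki = "\u0915\u093F"
# Thai: visual order, the preposed vowel is stored before the consonant.
thai_ke = "\u0E40\u0E01"

def names(text):
    """List the Unicode character names in storage order."""
    return [unicodedata.name(ch) for ch in text]

print(names(devanagari_ki))
print(names(thai_ke))
```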
In my mind, more of this is actually needed. Several months
ago, we were working on our Yi font, and the samples that our
clients showed us had occasional use of a middle dot as
punctuation. Now, how many choices might there be for encoding
this? I never made a thorough count, but it's more than one. I
inquired on this list and with UTC to see if anyone could tell
me what this punctuation character is and how it should be
encoded, and nobody gave a definitive answer, probably because
nobody had considered it before. We ended up using U+30FB
KATAKANA MIDDLE DOT since this would have the
fullwidth/monowidth properties needed for Yi. But what if
another implementer chose to use one of the other characters
with a similar visual appearance? The result would be a
hindrance to successful interchange.
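To make the interchange problem concrete, here is a small sketch listing a few dot-like characters. The candidate list is my own assumption, not the set that was actually considered, and the sample text uses U+A000 only as a stand-in Yi syllable.

```python
import unicodedata

# Hypothetical candidates with a similar visual appearance.
CANDIDATES = ["\u00B7", "\u2027", "\u2219", "\u22C5", "\u30FB", "\uFF65"]

def summarize(ch):
    """Return (code point, East Asian Width property, name)."""
    return (f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch), unicodedata.name(ch))

for ch in CANDIDATES:
    print(*summarize(ch))

# Two implementers pick different dots for the same punctuation mark.
text_a = "\uA000\u30FB\uA000"
text_b = "\uA000\u00B7\uA000"
print(text_a == text_b)      # the strings do not compare equal
print("\u30FB" in text_b)    # a search for one dot misses the other
```

The width property is what distinguishes U+30FB here: it is classed as wide, which matches the fullwidth behaviour wanted for Yi.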
But I'm rambling. My point is that it is important for this
issue to be discussed and that implementers agree on a
solution. But, what Marco said about nothing prohibiting
combining virama in new ways is absolutely true, as far as I
know.
Now, Apurva wrote:
>The semantics of Ya in conjunct formation and for
>use with LetterA /LetterE is very different.
Semantics are different in what sense? Do you mean that they
would represent different things phonologically/linguistically,
or that different Unicode semantics would be required? If it's
just a matter of different linguistic significance, that is a
non-issue. The letter "g" has different phonological meaning
in "rag" and in "rough"; "e" has different phonological
meaning in "feet" and in "fate". But that doesn't mean
different encodings are needed for these.
[apurva:] Please see my earlier response above for the Ya. In addition: the
Ya_phola in question is not phonological. It has a wider impact, because
theoretically this would then make it possible for other semi-vowels [Ra,
La, Va] to combine with an independent vowel. I am pro evolution of scripts.
However, while permitting newer possibilities, we might also want to take care
that in the process we don't trample on rules already set.
There is nothing about the Unicode semantics of Bengali
characters that prohibits using what is already there. All
that's needed is to abandon certain assumptions, which Marco
has already discussed. (I'll forward that message to the
OpenType list for the benefit of people on that list who aren't
on Unicode.) If you want to propose adding new characters to
Unicode, you need to have good reasons why an implementation
using the existing characters is inadequate *in terms of text
processing issues* (not in terms of how speakers/writers think
of the orthography - that is essentially irrelevant).
[apurva:] I would like to think of an encoding standard [Unicode] not only
as that which takes care of 'text processing', but also as providing a means
to cleanly address changes in scripts that have taken place due to:
1. constraints in earlier technologies
2. script evolution [or the lack thereof].
I guess that's what we are all aiming at.
As far as using the PUA is concerned, yes, that's an option.
It becomes problematic, however, if you want all implementers
to agree on particular PUA characters. Let's say everybody
interested in Bengali gets together and agrees that E000 and
E001 will be used for Vowel A_zophola_AA and Vowel
E_zophola_AA, and let's suppose further that Apurva and co
implement Uniscribe and some OT fonts based on this. In the
mean time, somebody else has (as they are free to do) defined
for their use E000 and E001 for a couple of Ethiopic characters
that are being considered for future addition to Unicode.
(That's a real situation - we're currently doing some work on
Ethiopic, and we have made a number of such PUA assignments.)
Now, that person has an Ethiopic font, and they want to display
some text using MS software. They'll be pretty upset if
Uniscribe munges their PUA characters. It's a legal use of
Unicode for MS to define PUA characters for particular uses
(though they are encouraged to do so near the top of the PUA
range, and they really ought to publicly document what they
do so that users will know what to expect of their software).
But if they want to be concerned about what end users may want
to do with their software, they need to think very carefully
about any PUA assignments they make. As for encouraging a
widespread pseudo-standard use of the PUA, that is potentially
counter to the intention of Unicode, particularly if you are
trying to get a number of major software developers to go along.
I have no problem with a couple of PUA characters being used by
a group of people interested in Bengali as an interim solution
for the potential characters. Getting some particular support
for that in Uniscribe would be, I think, not a good thing, and
I'd be very surprised if MS would entertain that possibility.
(But then, if you use the PUA, you don't need any smart font
behaviour for these characters.)
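The collision described above can be sketched as follows. The two assignment tables are hypothetical stand-ins for private agreements (the Ethiopic entries are deliberately left unspecified); only the general category and the lack of a character name come from the Unicode character data.

```python
import unicodedata

def is_bmp_private_use(ch):
    """True if ch lies in the BMP Private Use Area, U+E000..U+F8FF."""
    return 0xE000 <= ord(ch) <= 0xF8FF

# Hypothetical private agreement among Bengali implementers.
bengali_agreement = {
    "\uE000": "Vowel A_zophola_AA",
    "\uE001": "Vowel E_zophola_AA",
}
# Hypothetical private agreement made independently for Ethiopic work.
ethiopic_agreement = {
    "\uE000": "Ethiopic character (unspecified)",
    "\uE001": "Ethiopic character (unspecified)",
}

for ch in ("\uE000", "\uE001"):
    # PUA code points have category "Co" and no standard name.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, None))

# The same code points carry two incompatible private meanings.
collisions = sorted(set(bengali_agreement) & set(ethiopic_agreement))
print([f"U+{ord(ch):04X}" for ch in collisions])
```

Software that hard-codes one table has no way to tell, from the text alone, which agreement the author of a document had in mind.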
[apurva:] I might not look favourably on the use of PUA for this.
But I'd argue with Marco in favour of your other proposed
interim solution, and I'd argue that it shouldn't be just an
interim solution but rather the permanent solution.
[apurva:] Pardon my being blunt here. But Indic scripts [like Malayalam]
have seen changes in orthography and typographical quality [some,
sadly, for the worse] because of interim solutions [constraints in some
earlier typesetting systems]. Unfortunately, these solutions have not
been treated as interim but as permanent [they have existed for decades].
As a result, a whole generation of young people in India, who have not had
the opportunity to see the original orthography of the script, think that
the current incorrectly implemented solutions 'are' the way it has to be!
Hence it would be prudent of us to try our best to look also at the long-term
effects that technology [here an encoding standard] tends to usher in.
Thanks,
-apurva
Peter Constable
From: <Antoine.Leca@renault.fr> AT Internet on 04/27/2000 07:07 AM
To: Peter Constable/IntlAdmin/WCT, <unicode@unicode.org> AT Internet@Ccmail
cc: <unicode@unicode.org> AT Internet@Ccmail, <zzak@csi.com> AT Internet@Ccmail
Subject: Re: Encoding Bengali Vowel forms (again)