From: Christopher John Fynn (cfynn@gmx.net)
Date: Wed Jun 25 2003 - 21:39:16 EDT
Difficulties due to the present combining class values attached
to these characters most frequently occur with
abbreviations/contractions and/or with cursive scripts. With
abbreviations it is common to have two or more vowels on a
consonant stack. In cursive or semi-cursive forms of Tibetan
script the subjoined vowels U+0F71, U+0F74 and U+0F75 form ligatures
with the consonant(s) in the stack, while above-headline
vowels such as U+0F72, U+0F7A and U+0F7C sometimes form a
ligature with the following consonant or punctuation mark.
In Dzongkha (Bhutanese) abbreviated spellings are often the
usual way of writing words and a semi-cursive form of Tibetan
script (Joyig) is standard - so the problem frequently occurs.
I have a 225-page dictionary, and several other lists, of common
abbreviations which are full of examples where this problem
occurs.
I've attached a couple of real and fairly simple examples.
Example 1
========
Following normal orthographic rules the characters to produce
Example1_gtuig.jpg would be entered as:
U+0F42 U+0F4F U+0F74 U+0F72 U+0F42
If the characters remain in that order there is no problem -
the first U+0F42 is straightforward, the isolated character is
displayed as a simple glyph "uni0F42"
the sequence U+0F4F U+0F74 is replaced by a ligature
"uni0F4F0F74"
U+0F72 U+0F42 is replaced by a ligature "uni0F720F42"
Now if the text goes through a "normalisation" process the same
text ends up reordered as:
U+0F42 U+0F4F U+0F72 U+0F74 U+0F42
because the combining class value of U+0F72 is less than that of
U+0F74.
To render this there is no change for the first character but I
now need a lookup to render the whole sequence:
U+0F4F U+0F72 U+0F74 U+0F42 with two glyphs
"uni0F4F0F74 uni0F720F42"
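The reordering in Example 1 can be reproduced with any conformant normaliser. The following is a minimal sketch using Python's standard unicodedata module; the helper name code_points is only an illustrative assumption, and NFD is used here simply because every normalisation form applies the same canonical reordering step.

```python
import unicodedata

# The sequence from Example 1, in the order it is normally typed.
typed = "\u0F42\u0F4F\u0F74\u0F72\u0F42"


def code_points(text):
    """Format a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Show the canonical combining class of each character.
for ch in typed:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")

# Canonical reordering sorts each run of non-zero classes in ascending order,
# so U+0F72 moves in front of U+0F74.
normalized = unicodedata.normalize("NFD", typed)
print("typed:     ", code_points(typed))
print("normalised:", code_points(normalized))
```

The two printed sequences differ only in the order of the two vowels, which is exactly the difference the font then has to cope with.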
Example 2
========
Following normal orthographic rules the characters to produce
Example1_gtuop.jpg would be entered as:
U+0F42 U+0F4F U+0F74 U+0F7C U+0F54
If the characters remain in that order there is no problem -
the first U+0F42 is as in the first example
the sequence U+0F4F U+0F74 is replaced by a ligature
"uni0F4F0F74"
U+0F7C U+0F54 is replaced by a ligature "uni0F7C0F54"
However, since the combining class value of U+0F7C is less than
that of U+0F74,
after a "normalisation" process the same text ends up reordered
as:
U+0F42 U+0F4F U+0F7C U+0F74 U+0F54
and the whole sequence:
U+0F4F U+0F7C U+0F74 U+0F54 needs to be replaced with the
two glyphs "uni0F4F0F74 uni0F7C0F54".
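To make the effect on the lookups in Examples 1 and 2 concrete, here is a toy model of adjacent-pair ligature substitution. It is only a sketch under stated assumptions: the LIGATURES table and the shape function are hypothetical and much simpler than real OpenType lookups or the tables of any actual font; only the glyph names are taken from the examples above.

```python
import unicodedata

# Hypothetical pairwise ligature table, using the glyph names from the examples.
LIGATURES = {
    ("uni0F4F", "uni0F74"): "uni0F4F0F74",
    ("uni0F72", "uni0F42"): "uni0F720F42",
    ("uni0F7C", "uni0F54"): "uni0F7C0F54",
}


def shape(text):
    """Map characters to glyph names, replacing adjacent pairs by ligatures."""
    glyphs = [f"uni{ord(ch):04X}" for ch in text]
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


typed_1 = "\u0F42\u0F4F\u0F74\u0F72\u0F42"  # Example 1 as typed
typed_2 = "\u0F42\u0F4F\u0F74\u0F7C\u0F54"  # Example 2 as typed

for typed in (typed_1, typed_2):
    reordered = unicodedata.normalize("NFD", typed)
    print("typed:     ", shape(typed))
    print("normalised:", shape(reordered))
```

With the typed order each pair is adjacent and both ligatures are found. After normalisation the vowels have swapped places, neither pair is adjacent any more, and this simple table produces only unligated glyphs; covering that case needs a separate lookup for the whole four-character sequence.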
Example 3 - (Example3_aMi-aiM.jpg)
==============================
This is taken from an entirely different source, the "TibetBT"
font which was specially created for a project in Sichuan
digitising the Tibetan bstan-'gyur (a vast canonical collection
of texts in over 200 large volumes originally translated
from Sanskrit into Tibetan). The glyph set of the font is the
same as the set of Tibetan stacks found in that collection.
All stacks including any combining vowels are implemented as
precomposed ligatures. This font can be downloaded from
(though it is wrapped-up in a Windows "setup.exe" file).
Here we have two stacks which one would naturally enter as
U+0F68 U+0F7E U+0F72 and U+0F68 U+0F72 U+0F7E respectively. No
problem so long as the characters remain in that order. However
since U+0F72 has a combining class value greater than that of
U+0F7E - in a process of "normalisation" U+0F72 would always
float to the end and both strings would end up as U+0F68 U+0F7E
U+0F72 and be indistinguishable.
If there were only a few and fixed number of cases like the
first two examples it would not be *much* of a problem to add
the extra lookups - even though my font would need both "many to
one" and "many to many" lookups to handle it. But there are
*numerous* cases I already know of and there is no fixed and
final list of such abbreviations. So I should really build the
tables in my font to be able to handle almost any possibility.
If the combining classes of vowels & marks were based on the
expected order where subjoined vowels are always written before
any above-headline vowels, this would be reasonably
straightforward to do - but given the order in which they may
now wind up after normalisation, it requires adding a huge
number of complex lookups to the tables in my font. Once I've
done this it is going to be very difficult to test all the
permutations.
Because of the number of additional lookups I need it is also
likely there will be a hefty performance hit - especially on
reflowing large documents. Unfortunately the third example
can't simply be fixed by font lookups since two distinct
combinations wind up being identical and hence would have to be
rendered identically.
If I wrote a piece of software where values I'd assigned caused
problems and inefficiencies like this, I'd count it as a major
fault or bug and hurry to fix it by assigning the correct
values. I know the Tibetan characters were discussed in great
detail by a number of "experts" at the time they were encoded -
however there was little or no substantial discussion amongst
these experts about the canonical combining class values
assigned to the characters by the UTC. If the combining
classes of Tibetan dependent vowels had been based on the order
in which these characters are normally written or typed there
would not be this problem in processing them.
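The point about class values following the written order can be illustrated with a small sketch of the canonical reordering step that takes the class lookup as a parameter. Everything named here is an assumption for illustration: canonical_reorder is a simplified stand-in for the reordering step of a normaliser, and the numbers in HYPOTHETICAL_CCC are invented purely to put subjoined vowels below above-headline vowels; they are not values from this message or from any standard.

```python
import unicodedata


def canonical_reorder(text, ccc=unicodedata.combining):
    """Stable-sort each run of characters with non-zero class by class value."""
    out = []
    run = []
    for ch in text:
        if ccc(ch) == 0:
            # A class-0 character ends the current run of combining marks.
            out.extend(sorted(run, key=ccc))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


# HYPOTHETICAL values only: subjoined vowels sort before above-headline vowels.
HYPOTHETICAL_CCC = {
    "\u0F71": 129,
    "\u0F74": 130,
    "\u0F72": 132,
    "\u0F7A": 132,
    "\u0F7C": 132,
}


def hypothetical_ccc(ch):
    """Use the invented value if there is one, else the real class."""
    return HYPOTHETICAL_CCC.get(ch, unicodedata.combining(ch))


typed = "\u0F42\u0F4F\u0F74\u0F72\u0F42"  # Example 1 as typed

with_real = canonical_reorder(typed)
with_hypothetical = canonical_reorder(typed, hypothetical_ccc)

print("real classes keep typed order:        ", with_real == typed)
print("hypothetical classes keep typed order:", with_hypothetical == typed)
```

With the real classes the two vowels are swapped, while with classes that follow the written order the typed sequence passes through unchanged, so the simple pairwise ligature lookups would continue to apply.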
I believe that correcting the canonical combining class values
of these characters is the best solution. Leaving things as
they are is going to cause a lot of extra work for implementors
and inefficiencies in implementations. There is no work-around
for the problem illustrated by Example 3. Someone suggested
encoding an otherwise identical set of characters with the
correct combining class values and deprecating the existing
ones, but this is not a real solution, only a kludge. And how
could encoding otherwise identical characters in ISO/IEC 10646
be justified since that standard does not specify canonical
combining class values of characters?
- Chris
Christopher Fynn
4 Chester Court
84 Salusbury Road
London NW6 6PA