From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 24 2003 - 19:11:31 EDT
Chris Fynn wrote:
> In Unicode's UnicodeData.txt (
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt )
> 0F7E has a Canonical Combining Class Value (CCCV) of 0;
> 0F71 a CCCV of 129;
> 0F72 0F7A 0F7B 0F7C 0F7D and 0F80 a CCCV of 130;
> 0F74 a CCCV of 132;
> and 0F82 and 0F83 have a CCCV of 230.
>
> By normal Tibetan & Dzongkha spelling, writing, and input rules
> Tibetan script stacks should be entered and written: 1 headline
> consonant (0F40-0F6A), any subjoined consonant(s) (0F90-
> 0FBC), achung (0F71), shabkyu (0F74), any above headline
> vowel(s) (0F72 0F7A 0F7B 0F7C 0F7D and 0F80); any ngaro (0F7E,
> 0F82 and 0F83)
>
> So following normal Tibetan & Dzongkha input and spelling rules
> the relative ordering of these characters should be:
> A. 0F71
> B. 0F74
> C. 0F72 0F7A 0F7B 0F7C 0F7D and 0F80
> D. 0F7E, 0F82 and 0F83
>
> The fact that, in a process of "canonical decomposition" or
> "normalisation", these combining characters can get reordered
> in a bizarre order relative to each other
Actually, looking at this data, while I can see that the
combining classes are assigned less than optimally, I don't
see that this creates any practical problem for Tibetan data.
You are saying, in effect, that the stack structure has
the following position classes (treating the consonant stack
itself as the more tightly bound unit that I will just
symbolize as CS):
CS - achung - shabkyu - vowelsabove - ngaro
And since shabkyu has cc=132 whereas the vowelsabove have
cc=130, they would reorder out of expected order if
normalized. However, for most text the shabkyu (u-below)
would be in complementary distribution with the vowels
above, so the effective positional classes are:
              { vowelsabove }
CS - achung - { shabkyu     } - ngaro
And in this case, the relative combining class of the vowels
doesn't really matter, since we wouldn't be seeing both
present to reorder around each other.
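
To make the minority case concrete, here is a minimal sketch of what
normalization does when a shabkyu and a vowel above do occur in the
same stack. It assumes Python's standard unicodedata module carries the
UnicodeData.txt combining classes quoted above; U+0F40 is only a
stand-in for the consonant stack CS, and the variable names are
illustrative.

```python
import unicodedata

KA = "\u0f40"        # a headline consonant, standing in for the stack CS
SHABKYU = "\u0f74"   # vowel sign u, written below the stack
GIGU = "\u0f72"      # vowel sign i, one of the vowels above

# Canonical combining classes of the two vowel signs.
classes = {f"{ord(c):04X}": unicodedata.combining(c) for c in (SHABKYU, GIGU)}

# Traditional spelling order: stack, shabkyu, vowel above.
typed = KA + SHABKYU + GIGU
normalized = unicodedata.normalize("NFD", typed)

print(classes)
print([f"{ord(c):04X}" for c in normalized])
```

Because 130 sorts before 132, the vowel above ends up in front of the
shabkyu after normalization. A stack with only one of the two is left
alone.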
I'm guessing that you are claiming there are instances where
the shabkyu does cooccur with other vowels above as well.
Wouldn't those, if they do occur, represent a distinctly
minority case in terms of the overall processing? The short
summaries of Tibetan writing that I've seen don't even mention
it as a possibility, since even the few diphthongs in -u
are written with a separate stack <0F60, 0F74> to the
right of the main stack.
> causes difficulties
> with culturally correct collation (where 0F7E, 0F82 and 0F83
> should have an equal value) - and especially it necessitates
> making lookups in smart fonts far more complex and inefficient
> than they should have to be.
And I'm not seeing the problem here, either. Since the
combining class of 0F7E is 0, and not some other random
value, it isn't going to reorder around the other vowel
marks. If it is entered in the traditional spelling order you
have indicated, then it is going to stay in that position;
normalization won't move it. And since the equivalent
0F82 and 0F83 sift to the end of the syllable, with their
high combining class, they'll end up in the same position
as the 0F7E ngaro if normalized.
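
A small sketch of that behavior, again assuming Python's standard
unicodedata module and using U+0F40 plus U+0F72 as an arbitrary example
stack. The helper name nfd_hex is made up for illustration.

```python
import unicodedata

def nfd_hex(text):
    """Return the NFD form of text as a list of hex code points."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", text)]

KA, GIGU = "\u0f40", "\u0f72"
NGARO_7E, NGARO_82, NGARO_83 = "\u0f7e", "\u0f82", "\u0f83"

# 0F7E has combining class 0, so it stays where it was typed.
traditional = nfd_hex(KA + GIGU + NGARO_7E)

# 0F82 and 0F83 have class 230, so they move behind the vowel (class 130)
# even when typed in front of it.
sifted_82 = nfd_hex(KA + NGARO_82 + GIGU)
sifted_83 = nfd_hex(KA + NGARO_83 + GIGU)

print(traditional, sifted_82, sifted_83)
```

In all three results the ngaro sits in the last position of the stack.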
The only problem you'd have is with Tibetan data where a
0F7E ngaro is entered in other than the optimal spelling
order you indicated. Such a sequence won't compare equal
unless you add a spelling equivalence rule on top of the
canonical equivalence. But there are a number of such edge
cases for Brahmic scripts -- not just Tibetan.
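
One way such a spelling equivalence rule could look, as a rough sketch
only: normalize first, then move any 0F7E that was typed in front of
vowel signs to the position after them. The function name
spelling_normalize and the regular expression are assumptions made for
illustration, not part of any standard, and the rule covers only the
vowel signs listed earlier in this thread.

```python
import re
import unicodedata

# Achung, shabkyu and the vowels above, as listed in the thread.
VOWELS = "\u0f71\u0f72\u0f74\u0f7a\u0f7b\u0f7c\u0f7d\u0f80"

# A 0F7E ngaro immediately followed by one or more of those vowel signs.
MISPLACED_NGARO = re.compile(f"(\u0f7e)([{VOWELS}]+)")

def spelling_normalize(text):
    """Apply NFD, then move a misplaced 0F7E behind the vowel signs."""
    nfd = unicodedata.normalize("NFD", text)
    return MISPLACED_NGARO.sub(r"\2\1", nfd)

optimal = "\u0f40\u0f72\u0f7e"   # stack, vowel, ngaro
other = "\u0f40\u0f7e\u0f72"     # stack, ngaro, vowel

print(unicodedata.normalize("NFD", optimal) == unicodedata.normalize("NFD", other))
print(spelling_normalize(optimal) == spelling_normalize(other))
```

Canonical normalization alone keeps the two spellings distinct; they
compare equal only once the extra rule is applied.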
Culturally correct collation is first a matter of giving
the three ngaro characters equivalent weights. Beyond that,
as you indicated, the weighting of the syllables (or stacks)
is complicated, and isn't going to be affected by 0F7E
having combining class 0 in any case.
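
As a toy illustration of the first step only, the three ngaro characters
can be folded to a single value before comparison. This is a sketch
under loud assumptions: toy_sort_key is a hypothetical helper, it
compares by code point after folding, and it does not attempt the
syllable-level weighting that real Tibetan collation needs.

```python
import unicodedata

# Fold 0F82 and 0F83 onto 0F7E so all three ngaro get the same weight.
NGARO_FOLD = str.maketrans({"\u0f82": "\u0f7e", "\u0f83": "\u0f7e"})

def toy_sort_key(text):
    """Normalize, then treat the three ngaro characters as one."""
    return unicodedata.normalize("NFD", text).translate(NGARO_FOLD)

stack = "\u0f40\u0f72"
keys = {toy_sort_key(stack + ngaro) for ngaro in "\u0f7e\u0f82\u0f83"}

print(len(keys))
```

With the stack spelled in the traditional order, all three variants
produce the same key.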
>
> (In Tibetan script fonts 0F71 and 0F74 are often ligated with
> preceding consonant (+ subjoined consonants) combined as a
> single glyph whereas above headline vowels are almost always
> treated as non spacing combining marks.)
Yes, but the only point where this would be a problem would
be for stacks with a shabkyu (u vowel) *and* another vowel.
And even for such cases, wouldn't this be handled effectively
by 6 triples in the ligature tables which would identify
any shabkyu moved after one of the other 6 vowels?
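
The six reordered sequences a font would need to recognize can be
listed mechanically. This is a minimal sketch assuming Python's
standard unicodedata module; how a given font format would encode the
entries is outside its scope.

```python
import unicodedata

SHABKYU = "\u0f74"
VOWELS_ABOVE = "\u0f72\u0f7a\u0f7b\u0f7c\u0f7d\u0f80"

# Spelling order puts the shabkyu first; normalization moves it behind
# each of the six vowels above, since they all have the lower class 130.
reordered = [unicodedata.normalize("NFD", SHABKYU + v) for v in VOWELS_ABOVE]

for seq in reordered:
    print(" ".join(f"{ord(c):04X}" for c in seq))
```

Each printed line is one vowel above followed by 0F74, which is the
order a normalized stack would present to the font.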
>
> Currently there seems to be no easy or standardized work around
> for these problems and the standard seems to say that the
> relative values of assigned Canonical Combining Class Values
> cannot be changed.
They cannot.
> Any suggestions as to how to create a standardized work around
> for these incorrect values?
I guess I'm not getting it. I don't see the need for a
"standardized" work around, here.
--Ken
>
> - Chris