L2/01-308

 

Serious bug in Khmer, Myanmar combining classes

August 8, 2001

 

This document is a compendium of three e-mails that result from Ken Whistler’s contribution in L2/01-307:

 

Martin Hosken, 2001-08-08

Ken Whistler,  2001-08-07

Martin Hosken, 2001-08-06

 

 

 

Dear Ken,

 

MH> My reading of this implies to me that koo should be stored as 1000 1031

> 102C 1039 200C.

 

KW>Yes, in my haste, I overlooked the use of 200C. You are right, of course.

 

MH> This approach also fixes the U+1039, U+1037 confusion, since U+200C is a

> base character and so breaks up the ordering and you can have free order

> again :)

 

KW>Yes. I think that is right.

 

In that case, I am probably in a position to attempt a first draft of a

specification for this. Given the complaints over the form of my previous

descriptions, I would appreciate suggestions on how to write this up so

that others can best assess and use it.

 

KW>No I wouldn't say that, but mostly because the combining class definitions

we ended up with when normalization locked everything down predict

that. Actually, I think this is fundamentally a distinction between a

Latin/Greek/Cyrillic letter plus accent and a Brahmic consonant plus matra.

The systems work differently, and I wouldn't be too

worried about claiming that the order of matras is "spelling" whereas

the order of accents is "free variation" to be normalized away.

 

Hmm. I can see what you are saying. I think the plea is that we can use the

normalization algorithm to resolve some of the more messy aspects of Indic

scripts. The problem here is that the normalization algorithm isn't up to

it (since you need to associate a combining order with a sequence as well

as a single code). And the starting point is a good description of the

canonical order of combining characters in relation to each other and the

base character.

 

>The opposite position that *anything* attached above or below should

be allowed in free variation order is just as liable to overgeneralization

errors.

 

Is this before or after normalization?

 

I think my issue here is that keeping things in the right order at data

entry time is hard work. This is not something that needs to be done for

Latin since normalization sorts out the intermediate mess. The desire is

for the same facility for Brahmic scripts. And this raises the question as

to whether the normalization algorithm should be upgraded to make it

possible. And that would require some thought. Although, is doing the

thinking worth it if such an idea is ruled out on some other principle?

 

Martin Hosken

 

 

 

From: Kenneth Whistler [kenw@sybase.com]

Sent: Tuesday, August 07, 2001 9:17 PM

 

Subject:    Re: serious bug in Khmer, Myanmar combining classes

 

> >koo 1000 1031 102C 1039

 

>

> These all make sense, except for one issue:

>

> p250 of TUS3.0 says "The virama ... also participates in some common

> constructions where it appears as a visible sign, commonly termed killer.

> In this usage where it appears as a visible diacritic, U+1039 is followed

> by a U+200C..., as with Devanagari."

>

> My reading of this implies to me that koo should be stored as 1000 1031

> 102C 1039 200C.

 

Yes, in my haste, I overlooked the use of 200C. You are right, of course.

 

>

> This approach also fixes the U+1039, U+1037 confusion, since U+200C is a

> base character and so breaks up the ordering and you can have free order

> again :)

 

Yes. I think that is right.

 

> >In my opinion, in *both* of these instances, the right way

> >to proceed is to specify the correct order, and to characterize

> >the other order as a *spelling* error -- not as a canonicalization

> >error.

>

> Would you make the same requirement in a Latin context, that the relative

> ordering of underdot and acute accent, say, should be a spelling issue rather

> than a normalization issue? Methinks there is somewhat of a double

> standard going on here (please read that as a purely technical,

> non-pejorative statement).

 

No I wouldn't say that, but mostly because the combining class definitions

we ended up with when normalization locked everything down predict

that. Actually, I think this is fundamentally a distinction between a

Latin/Greek/Cyrillic letter plus accent and a Brahmic consonant plus matra.

The systems work differently, and I wouldn't be too

worried about claiming that the order of matras is "spelling" whereas

the order of accents is "free variation" to be normalized away.

 

The opposite position that *anything* attached above or below should

be allowed in free variation order is just as liable to overgeneralization

errors.

 

>

> A nice solution. But I think a better solution is to follow the requirement

> for a visual killer to have a U+200C and then we can have the correct

> order:

>

> ang2 1021 1004 1039 200C 1037

> ang3 1021 1004 1039 200C 1038

 

I agree that this solves the problem in a way that best matches the

syllabic structure.

 

--Ken

 

 

From: Martin_Hosken@sil.org

Sent: Monday, August 06, 2001 10:48 PM

 

Subject:    Re: serious bug in Khmer, Myanmar combining classes

 

 

Dear Ken,

 

Thank you for looking into this.

 

>Peter said implementations will end up having to do an ad hoc

>kind of normalization, and that's a problem.

 

While the combining classes do need correcting, there are other issues

that combining classes alone do not resolve, and so I am not sure that

Burmese can be resolved purely by the existing normalization algorithm.

 

>ka  1000

>kaa 1000 102C

>ki  1000 102D

>kii 1000 102E

>ku  1000 102F

>kuu 1000 1030

>ke  1000 1031

>kai 1000 1032

>ko  1000 1031 102C

>koo 1000 1031 102C 1039

>kã  1000 1036

>kui 1000 102F 102D

 

These all make sense, except for one issue:

 

p250 of TUS3.0 says "The virama ... also participates in some common

constructions where it appears as a visible sign, commonly termed killer.

In this usage where it appears as a visible diacritic, U+1039 is followed

by a U+200C..., as with Devanagari."

 

My reading of this implies to me that koo should be stored as 1000 1031

102C 1039 200C.

 

This approach also fixes the U+1039, U+1037 confusion, since U+200C is a

base character and so breaks up the ordering and you can have free order

again :)

 

>In the end, things were horsetraded down to what we've got, and

>like it or not, we are stuck with it now. (Note, for the record,

>that the Myanmar participants agreed to the idea of encoding

>-ui as a sequence of two characters, so this wasn't just something

>foisted on them by glyph-oriented Westerners ignorant of the

>vocalic pattern.)

 

Personally, I think the results are excellent; we just need to resolve

implementation issues.

 

>In my opinion, in *both* of these instances, the right way

>to proceed is to specify the correct order, and to characterize

>the other order as a *spelling* error -- not as a canonicalization

>error.

 

Would you make the same requirement in a Latin context, that the relative

ordering of underdot and acute accent, say, should be a spelling issue rather

than a normalization issue? Methinks there is somewhat of a double

standard going on here (please read that as a purely technical,

non-pejorative statement).
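
The Latin behaviour referred to here can be checked with a short Python sketch using the standard unicodedata module. It is only an illustration: the variable names are invented for this example, and the combining class values come from whatever version of the Unicode Character Database ships with the Python interpreter.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


# The same letter typed with its two marks in either order.
acute_first = "a" + ACUTE + DOT_BELOW
dot_first = "a" + DOT_BELOW + ACUTE

# The marks have different non-zero combining classes, so canonical
# ordering sorts them and both inputs end up as the same string.
latin_classes = [unicodedata.combining(c) for c in (ACUTE, DOT_BELOW)]
same_after_nfd = nfd(acute_first) == nfd(dot_first)
```

Because both marks carry distinct non-zero combining classes, the order in which they were typed is not preserved by normalization, which is the "free variation" being contrasted with matra order.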

 

>The way to eliminate the visual ambiguity in the -ui case is to

>write a Myanmar renderer such that if it encounters the two

>pieces of the -ui vowel in the wrong order, it displays them

>visually wrong (intentionally), rather than quietly stacking

>them as if they were spelled in the correct order. That will

>give correct feedback for all of the potentially ambiguous

>cases.

 

That is always a good way to deal with errors :)
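
A check of the kind a renderer or spelling tool might apply to the -ui case can be sketched in Python as follows. This is an assumption-laden illustration, not a description of any real renderer: it takes the order listed in this thread (kui = 1000 102F 102D) as the only correct spelling, and the function name find_reversed_ui is hypothetical.

```python
import unicodedata

KA = "\u1000"      # MYANMAR LETTER KA
SIGN_I = "\u102D"  # MYANMAR VOWEL SIGN I
SIGN_U = "\u102F"  # MYANMAR VOWEL SIGN U

# Order of the two parts of -ui as listed in this thread.
# Treating it as the only correct spelling is an assumption of this sketch.
EXPECTED_UI = SIGN_U + SIGN_I
REVERSED_UI = SIGN_I + SIGN_U


def find_reversed_ui(text):
    """Return the index of every occurrence of the two parts in reversed order."""
    hits = []
    start = text.find(REVERSED_UI)
    while start != -1:
        hits.append(start)
        start = text.find(REVERSED_UI, start + 1)
    return hits


kui_listed = KA + EXPECTED_UI
kui_reversed = KA + REVERSED_UI

# Both vowel signs have combining class 0, so normalization never swaps
# them; the reversed spelling survives and has to be caught separately.
ui_classes = [unicodedata.combining(c) for c in EXPECTED_UI]
reversed_survives_nfc = unicodedata.normalize("NFC", kui_reversed) == kui_reversed
```

Since normalization leaves both orders untouched, any feedback about the wrong order has to come from a separate check such as this one, or from the renderer displaying the sequence visibly wrong.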

 

>Furthermore, one would expect that Myanmar input methods would

>provide single key access to all of the two-part vowels, in any

>case, as for most Indic keyboarding systems. This will work to

>help keep the -ui's and -o's correct in the underlying store.

 

Not if you are following the Burmese typewriter keyboard, for example. It

is a little strong for the encoding specification to pass the buck of its

problems to the data entry system :)

 

>ang1     1021 1004 1039  ( a -nga -killer )

>ang2     1021 1004 1039 1037

>ang3     1021 1004 1039 1038

>

>This is the order that I think makes the most linguistic sense,

>

>The problem is that the combining class of the killer is 9, as

>for all other halants (viramas), whereas the combining class of

>the 1037 dot below is 7, and the combining class of the 1038

>

>The alternative would be to specify that the correct spelling

>of tone marks applied to consonant-final syllables is

>to place the tone marks *before* the syllable-final killer:

>

>ang2     1021 1004 1037 1039

>            0    0    7    9

>

>ang3     1021 1004 1038 1039

>            0    0    0    9

>

>In this way, despite the mismatch in combining classes for 1037 and

>1038, both of these expressions would be in canonical order, which

>would bode better for systematic processing, despite the somewhat

>counterintuitive notion of putting the tone mark in between the

>final consonant and its killer. (In particular, for ang3, the 1039

>killer would have to rearrange around the 1038 visarga, so that

>it correctly appeared on top of the 1004 nga.)
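
The effect of these combining classes on the two candidate orders can be reproduced with Python's standard unicodedata module. This is a minimal sketch; the constant and helper names are chosen for this example only, and the class values are those in the character database bundled with the interpreter.

```python
import unicodedata

A = "\u1021"          # MYANMAR LETTER A
NGA = "\u1004"        # MYANMAR LETTER NGA
DOT_BELOW = "\u1037"  # MYANMAR SIGN DOT BELOW, combining class 7
VISARGA = "\u1038"    # MYANMAR SIGN VISARGA, combining class 0
KILLER = "\u1039"     # MYANMAR SIGN VIRAMA (the "killer"), combining class 9


def classes(text):
    """Return the canonical combining class of each code point."""
    return [unicodedata.combining(c) for c in text]


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


# Killer before tone mark: the "linguistic" order.
ang2_killer_first = A + NGA + KILLER + DOT_BELOW  # classes 0 0 9 7
ang3_killer_first = A + NGA + KILLER + VISARGA    # classes 0 0 9 0

# Tone mark before killer: the alternative order.
ang2_tone_first = A + NGA + DOT_BELOW + KILLER    # classes 0 0 7 9
ang3_tone_first = A + NGA + VISARGA + KILLER      # classes 0 0 0 9

# Canonical ordering swaps 9 7 into 7 9, so the killer-first spelling of
# ang2 is silently rewritten, while ang3 (visarga is class 0) is not.
ang2_is_reordered = nfd(ang2_killer_first) == ang2_tone_first
ang3_is_unchanged = nfd(ang3_killer_first) == ang3_killer_first
```

In this sketch both tone-first spellings are already in canonical order, whereas the killer-first spellings behave inconsistently: ang2 is reordered and ang3 is left alone.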

 

A nice solution. But I think a better solution is to follow the requirement

for a visual killer to have a U+200C and then we can have the correct

order:

 

ang2 1021 1004 1039 200C 1037

ang3 1021 1004 1039 200C 1038
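
That U+200C keeps the killer and the tone mark from being reordered can be confirmed with a small Python sketch. It assumes nothing beyond the standard unicodedata module; the names are illustrative only.

```python
import unicodedata

A = "\u1021"          # MYANMAR LETTER A
NGA = "\u1004"        # MYANMAR LETTER NGA
DOT_BELOW = "\u1037"  # MYANMAR SIGN DOT BELOW, combining class 7
VISARGA = "\u1038"    # MYANMAR SIGN VISARGA, combining class 0
KILLER = "\u1039"     # MYANMAR SIGN VIRAMA (the "killer"), combining class 9
ZWNJ = "\u200C"       # ZERO WIDTH NON-JOINER, combining class 0


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


ang2 = A + NGA + KILLER + ZWNJ + DOT_BELOW
ang3 = A + NGA + KILLER + ZWNJ + VISARGA

# ZWNJ has combining class 0, so canonical ordering cannot move the
# dot below (class 7) across it to sit in front of the killer (class 9).
zwnj_class = unicodedata.combining(ZWNJ)
ang2_is_stable = nfd(ang2) == ang2
ang3_is_stable = nfd(ang3) == ang3
```

With the ZWNJ in place both sequences come through normalization unchanged, so the killer-then-tone order can be stored as typed.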

 

*******************************************************************************

 

* >What this is all pointing to, in my opinion, is that we are desperately *

* >in need of implementation guidelines for Myanmar (and for Khmer) in *

* >the same kind of detail as for Devanagari, so that these ordering *

* >issues and ambiguities can be nailed down in sufficient detail to *

* >enable a text model of properly spelled Myanmar (and Khmer). Otherwise, *

* >we will not be able to interchange text successfully. Or at least, *

* >while the text itself could be interchanged, it would be spelled *

* >by drastically different conventions -- and since for Indic scripts, *

* >the "spellings" involve complicated interactions with the rendering *

* >rules, a spelling that works for Renderer A might result in *

* >illegible gibberish in Renderer B, which was assuming different *

* >spelling conventions. That would fail the Unicode plain text *

* >criteria for interoperability. *

********************************************************************************

 

Yes, yes, yes, yes, yes, yes, yes!

 

>All my ruminations on this topic are gladly contributed to the

>cause, but I think it is imperative that someone who actually

>has implementation experience with Myanmar in a real system

>take the lead on this.

 

I would be happy to contribute, given that I am building implementation

experience, albeit rather slowly.

 

Martin Hosken