L2/01-308
August 8, 2001
This document is a compendium of three e-mails that resulted from Ken Whistler’s contribution in L2/01-307:
Martin Hosken, 2001-08-08
Ken Whistler, 2001-08-07
Martin Hosken, 2001-08-06

MH> My reading of this implies to me that koo should be stored as 1000 1031
> 102C 1039 200C.

KW>Yes, in my haste, I overlooked the use of 200C. You are right, of course.

MH> This approach also fixes the U+1039, U+1037 confusion, since U+200C is a
> base character and so breaks up the ordering and you can have free order
> again :)

KW>Yes. I think that is right.

In that case, I am probably in a position to have a go at writing a first
attempt at specifying this stuff. Given the complaints over the form of my
previous attempts at describing this stuff, I would appreciate suggestions
as to how to write this up so that it can be assessed/used in the best form
for others.

KW>No I wouldn't say that, but mostly because the combining class definitions
> we ended up with when normalization locked everything down predict
> that. Actually, I think this is fundamentally a Latin/Greek/Cyrillic
> letter plus accent distinction from a Brahmic consonant plus matra.
> The systems work differently, and I wouldn't be too
> worried about claiming that the order of matras is "spelling" whereas
> the order of accents is "free variation" to be normalized away.

Hmm. I can see what you are saying. I think the plea is that we can use the
normalization algorithm to resolve some of the more messy aspects of Indic
scripts. The problem here is that the normalization algorithm isn't up to
it (since you need to associate a combining order with a sequence as well
as a single code). And the starting point is a good description of the
canonical order of combining characters in relation to each other and the
base character.
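
The contrast drawn in this exchange can be illustrated with a minimal Python sketch. It assumes only the standard unicodedata module; the helper names classes and same_after_nfd are made up for this example and do not come from the e-mails. It compares a Latin letter carrying an acute and a dot below typed in either order with the two orders of the Myanmar -ui vowel pieces (U+102F, U+102D):

```python
import unicodedata

def classes(text):
    """Return the canonical combining class of each character in text."""
    return [unicodedata.combining(ch) for ch in text]

def same_after_nfd(x, y):
    """True if x and y are identical after NFD, i.e. canonically equivalent."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)

# Latin: a + acute + dot below, and a + dot below + acute.
latin_acute_first = "a\u0301\u0323"
latin_dot_first = "a\u0323\u0301"

# Myanmar: ka + vowel sign U + vowel sign I (the order listed for kui),
# and the same two vowel signs in the reverse order.
kui_u_first = "\u1000\u102F\u102D"
kui_i_first = "\u1000\u102D\u102F"

print(classes(latin_acute_first))                         # [0, 230, 220]
print(same_after_nfd(latin_acute_first, latin_dot_first))  # True
print(classes(kui_u_first))                               # [0, 0, 0]
print(same_after_nfd(kui_u_first, kui_i_first))            # False
```

The two accents have different nonzero combining classes, so normalization sorts them into one order. Both Myanmar vowel signs have class 0, so the existing algorithm leaves either order untouched and the difference remains a matter of spelling.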

>The opposite position that *anything* attached above or below should
>be allowed in free variation order is just as liable to overgeneralization
>errors.

Is this before or after normalization?

I think my issue here is that keeping things in the right order at data
entry time is hard work. This is not something that needs to be done for
Latin since normalization sorts out the intermediate mess. The desire is
for the same facility for Brahmic scripts. And this raises the question as
to whether the normalization algorithm should be upgraded to make it
possible. And that would require some thought. Although is doing the
thinking worth it if such an idea is ruled out on some other principle?

Martin Hosken

From: Kenneth Whistler [kenw@sybase.com]
Sent: Tuesday, August 07, 2001 9:17 PM
Subject: Re: serious bug in Khmer, Myanmar combining classes

> >koo 1000 1031 102C 1039
>
> These all make sense, except for one issue:
>
> p250 of TUS3.0 says "The virama ... also participates in some common
> constructions where it appears as a visible sign, commonly termed killer.
> In this usage where it appears as a visible diacritic, U+1039 is followed
> by a U+200C..., as with Devanagari.
>
> My reading of this implies to me that koo should be stored as 1000 1031
> 102C 1039 200C.

Yes, in my haste, I overlooked the use of 200C. You are right, of course.

> This approach also fixes the U+1039, U+1037 confusion, since U+200C is a
> base character and so breaks up the ordering and you can have free order
> again :)

Yes. I think that is right.
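
The effect of the intervening U+200C can be checked with a short Python sketch. It assumes the standard unicodedata module; hex_seq is a hypothetical helper written for this example. U+1039 has combining class 9 and U+1037 has class 7, while U+200C has class 0 and therefore stops the two marks from being reordered around it:

```python
import unicodedata

def hex_seq(text):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

# ang2 with the killer directly followed by the dot below.
ang2_plain = "\u1021\u1004\u1039\u1037"
# ang2 with U+200C between the killer and the dot below.
ang2_zwnj = "\u1021\u1004\u1039\u200C\u1037"

# Canonical ordering moves the class 7 mark in front of the class 9 mark.
print(hex_seq(unicodedata.normalize("NFC", ang2_plain)))  # 1021 1004 1037 1039
# With U+200C in between, the stored order survives normalization.
print(hex_seq(unicodedata.normalize("NFC", ang2_zwnj)))   # 1021 1004 1039 200C 1037
```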

> >In my opinion, in *both* of these instances, the right way
> >to proceed is to specify the correct order, and to characterize
> >the other order as a *spelling* error -- not as a canonicalization
> >error.
>
> Would you make the same requirement in a Latin context, that the relative
> ordering of underdot and acute accent, say, should be a spelling issue rather
> than a normalization issue? Methinks there is somewhat of a double
> standard going on here (please read that as a purely technical,
> non-pejorative statement).

No I wouldn't say that, but mostly because the combining class definitions
we ended up with when normalization locked everything down predict
that. Actually, I think this is fundamentally a Latin/Greek/Cyrillic
letter plus accent distinction from a Brahmic consonant plus matra.
The systems work differently, and I wouldn't be too
worried about claiming that the order of matras is "spelling" whereas
the order of accents is "free variation" to be normalized away.

The opposite position that *anything* attached above or below should
be allowed in free variation order is just as liable to overgeneralization
errors.

> A nice solution. But I think a better solution is to follow the requirement
> for a visual killer to have a U+200C and then we can have the correct
> order:
>
> ang2 1021 1004 1039 200C 1037
> ang3 1021 1004 1039 200C 1038

I agree that this solves the problem in a way that best matches the
syllabic structure.

--Ken
Sent: Monday, August 06, 2001 10:48 PM
Subject: Re: serious bug in Khmer, Myanmar combining classes

Dear Ken,

Thank you for looking into this.

>Peter said implementations will end up having to do an ad hoc
>kind of normalization, and that's a problem.

While correcting the combining classes is a need, there are other issues
that are not resolved by combining classes alone and so I am not sure that
Burmese can be resolved purely by the existing normalization algorithm.

>ka   1000
>kaa  1000 102C
>ki   1000 102D
>kii  1000 102E
>ku   1000 102F
>kuu  1000 1030
>ke   1000 1031
>kai  1000 1032
>ko   1000 1031 102C
>koo  1000 1031 102C 1039
>kã   1000 1036
>kui  1000 102F 102D

These all make sense, except for one issue:

p250 of TUS3.0 says "The virama ... also participates in some common
constructions where it appears as a visible sign, commonly termed killer.
In this usage where it appears as a visible diacritic, U+1039 is followed
by a U+200C..., as with Devanagari.

My reading of this implies to me that koo should be stored as 1000 1031
102C 1039 200C.

This approach also fixes the U+1039, U+1037 confusion, since U+200C is a
base character and so breaks up the ordering and you can have free order
again :)

>In the end, things were horsetraded down to what we've got, and
>like it or not, we are stuck with it now. (Note, for the record,
>that the Myanmar participants agreed to the idea of encoding
>-ui as a sequence of two characters, so this wasn't just something
>foisted on them by glyph-oriented Westerners ignorant of the
>vocalic pattern.)

Personally, I think the results are excellent; we just need to resolve
implementation issues.

>In my opinion, in *both* of these instances, the right way
>to proceed is to specify the correct order, and to characterize
>the other order as a *spelling* error -- not as a canonicalization
>error.

Would you make the same requirement in a Latin context, that the relative
ordering of underdot and acute accent, say, should be a spelling issue rather
than a normalization issue? Methinks there is somewhat of a double
standard going on here (please read that as a purely technical,
non-pejorative statement).

>The way to eliminate the visual ambiguity in the -ui case is to
>write a Myanmar renderer such that if it encounters the two
>pieces of the -ui vowel in the wrong order, it displays them
>visually wrong (intentionally), rather than quietly stacking
>them as if they were spelled in the correct order. That will
>give correct feedback for all of the potentially ambiguous
>cases.

That is always a good way to deal with errors :)
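
The detection step such a renderer would need can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the order listed earlier in this message for kui (1000 102F 102D) as the correct spelling, treats the reverse order as the error, and uses the hypothetical names MISORDERED_UI and find_misordered_ui. How the renderer would then display the error is not covered.

```python
import re

# Assumed correct spelling of -ui: vowel sign U (102F), then vowel sign I (102D).
# The pattern below matches the reverse order, which is treated as misspelled.
MISORDERED_UI = re.compile("\u102D\u102F")

def find_misordered_ui(text):
    """Return the offsets at which vowel sign I is directly followed by vowel sign U."""
    return [match.start() for match in MISORDERED_UI.finditer(text)]

print(find_misordered_ui("\u1000\u102F\u102D"))  # [] (spelled as listed for kui)
print(find_misordered_ui("\u1000\u102D\u102F"))  # [1] (pieces in the other order)
```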

>Furthermore, one would expect that Myanmar input methods would
>provide single key access to all of the two-part vowels, in any
>case, as for most Indic keyboarding systems. This will work to
>help keep the -ui's and -o's correct in the underlying store.

Not if you are following the Burmese typewriter keyboard, for example. It
is a little strong for the encoding specification to pass the buck of its
problems to the data entry system :)

>ang1 1021 1004 1039 ( a -nga -killer )
>ang2 1021 1004 1039 1037
>ang3 1021 1004 1039 1038
>
>This is the order that I think makes the most linguistic sense,
>
>The problem is that the combining class of the killer is 9, as
>for all other halants (viramas), whereas the combining class of
>the 1037 dot below is 7, and the combining class of the 1038
>
>The alternative would be to specify that the correct spelling
>of tone marks applied to consonant-final syllables is
>to place the tone marks *before* the syllable-final killer:
>
>ang2 1021 1004 1037 1039
>        0    0    7    9
>
>ang3 1021 1004 1038 1039
>        0    0    0    9
>
>In this way, despite the mismatch in combining classes for 1037 and
>1038, both of these expressions would be in canonical order, which
>would bode better for systematic processing, despite the somewhat
>counterintuitive notion of putting the tone mark in between the
>final consonant and its killer. (In particular, for ang3, the 1039
>killer would have to rearrange around the 1038 visarga, so that
>it correctly appeared on top of the 1004 nga.)

A nice solution. But I think a better solution is to follow the requirement
for a visual killer to have a U+200C and then we can have the correct
order:

ang2 1021 1004 1039 200C 1037
ang3 1021 1004 1039 200C 1038
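
The candidate spellings discussed above can be compared with a small Python sketch. It assumes the standard unicodedata module; combining_classes, is_canonically_ordered and the labels in the spellings table are illustrative names, not part of any proposal. A string is taken to be in canonical order when no combining mark directly follows a mark with a higher nonzero combining class.

```python
import unicodedata

def combining_classes(text):
    """Return the canonical combining class of each character in text."""
    return [unicodedata.combining(ch) for ch in text]

def is_canonically_ordered(text):
    """True if no adjacent pair has classes a > b > 0."""
    classes = combining_classes(text)
    return not any(a > b > 0 for a, b in zip(classes, classes[1:]))

# Labels are descriptive only; the code point sequences are those quoted above.
spellings = {
    "ang2, tone before killer": "\u1021\u1004\u1037\u1039",
    "ang3, tone before killer": "\u1021\u1004\u1038\u1039",
    "ang2, killer then tone": "\u1021\u1004\u1039\u1037",
    "ang2, killer + 200C + tone": "\u1021\u1004\u1039\u200C\u1037",
    "ang3, killer + 200C + tone": "\u1021\u1004\u1039\u200C\u1038",
}

for label, text in spellings.items():
    print(label, combining_classes(text), is_canonically_ordered(text))
```

Under these assumptions only the spelling with the killer directly followed by the 1037 dot below reports classes 0 0 9 7 and fails the check; both the tone-before-killer spellings and the spellings with U+200C pass it.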

*******************************************************************************
>What this is all pointing to, in my opinion, is that we are desperately
>in need of implementation guidelines for Myanmar (and for Khmer) in
>the same kind of detail as for Devanagari, so that these ordering
>issues and ambiguities can be nailed down in sufficient detail to
>enable a text model of properly spelled Myanmar (and Khmer). Otherwise,
>we will not be able to interchange text successfully. Or at least,
>while the text itself could be interchanged, it would be spelled
>by drastically different conventions -- and since for Indic scripts,
>the "spellings" involve complicated interactions with the rendering
>rules, a spelling that works for Renderer A might result in
>illegible gibberish in Renderer B, which was assuming different
>spelling conventions. That would fail the Unicode plain text
>criteria for interoperability.
*******************************************************************************

Yes, yes, yes, yes, yes, yes, yes!

>All my ruminations on this topic are gladly contributed to the
>cause, but I think it is imperative that someone who actually
>has implementation experience with Myanmar in a real system
>take the lead on this.

I would be happy to help contribute, given I am building implementation
experience, but am doing it rather slowly.

Martin Hosken