On 1/9/2017 2:24 PM, Richard Wordingham wrote:
Where, if anywhere, is the encoding of plain text specified? I am
particularly concerned with the arrangement of the code sequences for
non-spacing abstract characters once one has determined an encoding for
the abstract characters.
For example, a naive reading of TUS 9.0 Section 16.4 Subsection
"Ordering of Syllable Components" would lead one to believe that the
word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
SIGN U, U+17C6 KHMER SIGN NIKAHIT>.
Richard,
the group of Khmer experts that developed the recent label
generation rules for root zone domain names considers that ordering
the only one supported. You can find their specification here:
https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
That document states:
7.4 Context of COENG Sign (U+17D2)
The sign ្ KHMER SIGN COENG (U+17D2) used for subscripting
consonants must occur between two consonants. If it occurs between
any other categories, it is not in a valid context so the label is
not well formed. Further, the consonant following it must not
include ឡ KHMER LETTER LA (U+17A1), ...
So, you are not alone in thinking that the COENG goes between
consonants.
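The portion of rule 7.4 quoted above can be sketched as a small check.
This is only an illustration, not the LGR's own formalism: the name
coeng_context_ok is hypothetical, the consonant range U+1780..U+17A2 is
an assumption taken from the Khmer block, and since the quoted rule is
truncated, any further exclusions it lists are not covered.

```python
COENG = "\u17D2"  # KHMER SIGN COENG
LA = "\u17A1"     # KHMER LETTER LA, excluded after COENG by the quoted rule


def is_consonant(ch):
    # Assumption: consonants are the Khmer block range KA..QA.
    return "\u1780" <= ch <= "\u17A2"


def coeng_context_ok(label):
    """True if every COENG sits directly between two consonants and is
    not followed by LA (only the part of rule 7.4 quoted above)."""
    for i, ch in enumerate(label):
        if ch != COENG:
            continue
        if i == 0 or i == len(label) - 1:
            return False  # nothing before or after the COENG
        prev, nxt = label[i - 1], label[i + 1]
        if not (is_consonant(prev) and is_consonant(nxt)):
            return False
        if nxt == LA:
            return False
    return True


# KHA + COENG + NYO + VOWEL SIGN U + NIKAHIT, the ordering from the question
print(coeng_context_ok("\u1781\u17D2\u1789\u17BB\u17C6"))  # True
# KA + COENG + LA, rejected because LA may not follow COENG
print(coeng_context_ok("\u1780\u17D2\u17A1"))              # False
```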
Did they just make this up? No, they followed what is laid out in
the standard:
On page 621 of Unicode 9.0.0
(http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) you find:
Subscript Consonants. Subscript consonant signs differ from
independent consonant characters and are called coeng (literally,
“foot, leg”) after their subscript position. While a consonant
character can constitute an orthographic syllable by itself, a
subscript consonant sign cannot. Note that U+17A1 ឡ khmer letter la
does not have a corresponding subscript consonant sign in standard
Khmer.... Subscript consonant signs are used to represent any
consonant following the first consonant in an orthographic syllable.
and on page 624:
.... each of these [subscript consonant] signs is represented by the
sequence of two characters: a special control character (U+17D2 khmer
sign coeng) and a corresponding consonant character.
That text fixes the order MAIN CONSONANT + COENG OPERATOR +
SUBSCRIPT CONSONANT with sufficient clarity (as do all the examples
and tables).
However, on further investigation,
I cannot find any text that says that <U+1781, U+17C6, U+17D2, U+1789,
U+17BB> would not be compliant with the Unicode standard. Have I
missed anything?
In this example, your coeng operator U+17D2 is out of order: while it
is followed by a consonant, it does not in turn immediately follow the
main consonant, because a NIKAHIT sign is inserted between them.
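To make the difference between the two orderings visible, the following
sketch lists the character names of each sequence using Python's
standard unicodedata module. The helper names describe and before_coeng
and the variable names are illustrative only.

```python
import unicodedata

COENG = "\u17D2"

# Ordering read from Section 16.4, and the alternative ordering asked about
expected = "\u1781\u17D2\u1789\u17BB\u17C6"
queried = "\u1781\u17C6\u17D2\u1789\u17BB"


def describe(seq):
    # One "U+XXXX NAME" entry per code point, in encoded order.
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]


def before_coeng(seq):
    # Name of the character immediately preceding the first COENG.
    i = seq.index(COENG)
    return unicodedata.name(seq[i - 1])


for label, seq in (("expected", expected), ("queried", queried)):
    print(label)
    for line in describe(seq):
        print("  " + line)
    print("  before COENG:", before_coeng(seq))
```

In the first sequence the character before COENG is KHMER LETTER KHA;
in the second it is KHMER SIGN NIKAHIT.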
Again, from the Root Zone LGR document we find an explicit rule:
7.10 Context of NIKAHIT SIGN (U+17C6)
The sign ំ KHMER SIGN NIKAHIT (U+17C6) can only be preceded by a
consonant or a shifter or one of the subset of dependent vowels
tagged “dependent-vowel-1” in the repertoire table (ា ុ), i.e.
vowel signs AA and U.
That would allow the NIKAHIT to be placed where you suggest, if it
were not for the rule on the coeng operator (7.4).
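Rule 7.10 on its own can be sketched in the same way. This is a rough
illustration under stated assumptions: the name nikahit_context_ok is
hypothetical, the consonant range U+1780..U+17A2 comes from the Khmer
block, and the shifters are assumed to be U+17C9 MUUSIKATOAN and
U+17CA TRIISAP. The "dependent-vowel-1" set is the pair named in the
quoted rule, vowel signs AA (U+17B6) and U (U+17BB).

```python
NIKAHIT = "\u17C6"                   # KHMER SIGN NIKAHIT
SHIFTERS = {"\u17C9", "\u17CA"}      # assumption: MUUSIKATOAN, TRIISAP
DEP_VOWEL_1 = {"\u17B6", "\u17BB"}   # vowel signs AA and U (quoted rule)


def is_consonant(ch):
    # Assumption: consonants are the Khmer block range KA..QA.
    return "\u1780" <= ch <= "\u17A2"


def nikahit_context_ok(label):
    """True if every NIKAHIT is preceded by a consonant, a shifter,
    or a dependent-vowel-1 character (rule 7.10 as quoted)."""
    for i, ch in enumerate(label):
        if ch != NIKAHIT:
            continue
        if i == 0:
            return False  # nothing precedes the NIKAHIT
        prev = label[i - 1]
        if not (is_consonant(prev) or prev in SHIFTERS
                or prev in DEP_VOWEL_1):
            return False
    return True


# NIKAHIT after vowel sign U (ordering read from Section 16.4)
print(nikahit_context_ok("\u1781\u17D2\u1789\u17BB\u17C6"))  # True
# NIKAHIT directly after the main consonant (the ordering asked about)
print(nikahit_context_ok("\u1781\u17C6\u17D2\u1789\u17BB"))  # True
```

Taken alone, this check accepts both orderings; the second one is
rejected only once the coeng rule (7.4) is applied as well.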
Now, it is a known fact that the label generation rules are slightly
more restrictive than the rules for general text. (See also section
5 in that document).
See the text on p. 622 in TUS 9.0.0, where the following exception
is noted:
"The subscript consonant signs in the Khmer script can be used to
denote a final consonant, although this practice is uncommon."
The associated example shows MAIN CONSONANT + VOWEL + NIKAHIT +
COENG + FINAL CONSONANT.
Another exception that is noted on p. 623 is the following:
"While these subscript consonant signs are usually attached to a
consonant character, they can also be attached to an independent
vowel character. Although this practice is relatively rare, it is
used in one very common word, meaning “to give.”"
Taken together, it would appear that, unless your example fits the
first of these two exceptions, the NIKAHIT in it is out of order.
(The label generation rules disallow both of these exceptions
in an attempt to streamline the rules, sacrificing a number of
potential domain names. Equivalent rule sets for validating text
would have to be more complete.)
One might hope that the subsection about 'logical order' in TUS 9.0
Section 2.2 Unicode Design Principles would help, but:
1) Section 3 'Conformance' says nothing about logical order; and
2) The subsection about 'logical order' seems to assume that there
exists a common practice; it does not actually place any requirement
on this common practice.
Richard.
I don't think either of these general sections is intended to
provide the correct or expected ordering of characters for complex
scripts. Any preferred ordering that doesn't result by happenstance
from normalization would presumably be described in the text of the
script section, such as Section 16.4 Khmer, in TUS 9.0.0.
http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf
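Whether normalization settles the ordering can be checked directly.
The sketch below uses Python's standard unicodedata module (whose
Unicode data version depends on the Python build) to normalize both
orderings discussed in this thread and to print each character's
canonical combining class. The variable names and the helper
normalized_forms are illustrative only.

```python
import unicodedata

# Ordering read from Section 16.4, and the alternative ordering asked about
expected = "\u1781\u17D2\u1789\u17BB\u17C6"
queried = "\u1781\u17C6\u17D2\u1789\u17BB"


def normalized_forms(seq):
    # Map each normalization form name to the normalized string.
    return {form: unicodedata.normalize(form, seq)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    e = normalized_forms(expected)[form]
    q = normalized_forms(queried)[form]
    # Report whether each sequence is unchanged and whether the two
    # sequences become identical after normalization.
    print(form, e == expected, q == queried, e == q)

# Canonical combining classes; canonical reordering only affects runs
# of adjacent characters with non-zero classes.
for ch in queried:
    print(f"U+{ord(ch):04X}", unicodedata.combining(ch))
```

Neither sequence is changed by any of the four forms, and the two
never become equal, so normalization does not pick an ordering here.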
A./
Received on Tue Jan 10 2017 - 02:06:49 CST