PRI #176: Properties of Two Khmer Characters

Public Review Issue #176: Properties of Two Khmer Characters

The UTC is considering potential changes to the General_Category property values and default collation weighting of two Khmer characters:

U+17B4 KHMER VOWEL INHERENT AQ
U+17B5 KHMER VOWEL INHERENT AA

These two characters are not part of official Khmer orthography. They were included in the Unicode Standard originally only as a mechanism for representation of inherent vowels for Sanskrit texts transcribed in Khmer script. They do not have any visible display: they are represented in the Unicode code charts with dashed box glyphs. Their current General_Category value is Cf (Format).

The first issue concerns the General_Category value. These two characters are not actually format controls, as they do not affect the display of text in any way. The assignment of gc=Cf seems to have resulted from the use of a dashed box for the representative glyphs, without sufficient analysis of the characters' functions. Because they are instead used to transcribe inherent vowels, the UTC is considering whether to change the General_Category value to gc=Mn (which would make more sense for dependent vowels) or simply to gc=Lo (because they represent "letters" of a sort, although they are not part of Khmer orthographic rules).
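Implementers who want to check which General_Category value their own platform reports for these two characters can do so with a few lines of code. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe is an illustrative assumption. The value printed depends on the version of the Unicode Character Database bundled with the interpreter, so it may not match the gc=Cf assignment described here.

```python
import unicodedata

# The two code points discussed in this review issue.
KHMER_INHERENT_VOWELS = (0x17B4, 0x17B5)


def describe(code_point: int) -> tuple[str, str, str]:
    """Return (U+XXXX label, character name, General_Category)."""
    ch = chr(code_point)
    label = f"U+{code_point:04X}"
    return (label, unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    # The reported category reflects this interpreter's bundled Unicode data.
    print("Unicode data version:", unicodedata.unidata_version)
    for cp in KHMER_INHERENT_VOWELS:
        print(*describe(cp))
```

Code that branches on the category of these characters (for example, stripping all gc=Cf characters from input) is the kind of implementation that a change to gc=Mn or gc=Lo could affect.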

The second issue concerns the default collation weighting of the characters. In the Default Unicode Collation Element Table (DUCET) for the Unicode Collation Algorithm (see UTS #10), these two characters are currently given primary weights between the primary weights of Khmer independent vowels and Khmer dependent vowels, that is, in the same order as these vowels occur in the Unicode code charts. However, both CLDR and Mimer tailor U+17B4 and U+17B5 to be ignorable for collation. Although the inherent vowels do convey a sound difference for specialized implementations, they are not part of Khmer orthography. As a result, ignoring them by default in the DUCET may be the best choice as well.
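The practical effect of making a character ignorable for collation can be illustrated with a toy model. The sketch below is not the Unicode Collation Algorithm and does not use DUCET weights: it uses code point values as stand-in primary weights, and the names IGNORABLE and toy_sort_key are assumptions introduced only for illustration. It shows that when U+17B4 and U+17B5 are ignorable, strings that differ only by those characters produce identical sort keys.

```python
# Toy model only: real collation uses multi-level weights from the DUCET
# (UTS #10). Here a character's code point stands in for its primary weight.
IGNORABLE = frozenset({"\u17B4", "\u17B5"})


def toy_sort_key(text: str, ignorable: frozenset = IGNORABLE) -> tuple[int, ...]:
    """Build a sort key, skipping any character treated as ignorable."""
    return tuple(ord(ch) for ch in text if ch not in ignorable)


if __name__ == "__main__":
    plain = "\u1780"            # KHMER LETTER KA
    marked = "\u1780\u17B4"     # KA followed by KHMER VOWEL INHERENT AQ

    # With the two characters ignorable, both strings get the same key.
    print(toy_sort_key(plain) == toy_sort_key(marked))

    # With nothing ignorable, the extra character contributes a weight.
    print(toy_sort_key(plain, frozenset()) == toy_sort_key(marked, frozenset()))
```

Under the current DUCET weighting, the second comparison corresponds to the default behavior; under the CLDR and Mimer tailorings, the first does.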

These considerations have arisen from an investigation of certain inconsistencies in the DUCET weightings. It turns out that U+17B4 and U+17B5 are the only Unicode characters with gc=Cf which are not ignorable for collation by default. It would probably be good to eliminate that particular exception, but it is not obvious which solution would be best for these two characters. The anomaly could be addressed by changing the General_Category values, by changing the default collation weights to be ignorable, or by doing both.

The UTC is seeking feedback on this topic. In particular, the UTC would be interested in learning of any current implementations which might be adversely affected by any of the proposed modifications to the General_Category and/or default collation weighting of these two characters.