Re: "Data-visualization" for Unicode Collation: Khmer?

From: Mark Davis
Date: Wed Mar 22 2000 - 10:05:56 EST

Thanks. I was surprised at how easy it was to generate some reasonable HTML for illustrating the ordering (of course, I already had the sample collation code from TR10). And you see lots of patterns in the charts that are very difficult to see in name lists.

The collation data currently covers only the Unicode 2.1 repertoire. We need to get information on the new scripts, and feedback like yours certainly helps. If anyone else has information on the collation order of the new scripts, please send it in.


Maurice Bauhahn wrote:

> Hello Mark,
> Thank you for the collation tables...a tidy bit of programming! I searched
> through every page but could not find 178X series of Khmer characters. Has
> Khmer not made it into the collation sequence? Although combinations of Khmer
> characters have a very complex ordering, the consonants dominate over the
> vowels. The consonants 1780-17A2 are encoded in alphabetic order. The
> independent vowels 17A3-17B3 (not exactly in alphabetic order) have a similar
> priority (slightly lower) and are never in a cluster [well...almost never:
> there may be a situation where there is an explicit vowel with a subscript
> independent vowel, but it must be exceedingly rare] with the dependent vowels
> at 17B4-17C5 (which are in alphabetic order). The signs are next in priority but
> have only rather weak collation sequencing 17C6-17D1 (the first six have
> fixed sequencing). The sign indicating that the next character is a
> subscript (17D2) does not itself have a collation sequence as I understand
> it, but it does imply that the character following it (usually a consonant
> but rarely an independent vowel) will bring the cluster to a lower collation
> priority. All other characters break the collation sequence.
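
To make the ranges above concrete, here is a minimal sketch of a classifier for single characters. The ranges are copied from the description; the class names and the function `khmer_class` are made up for illustration and do not come from TR10 or any library.

```python
# Code point ranges as given in the message above (hex, inclusive).
# The class names are illustrative assumptions, not standard terms.
KHMER_CLASSES = [
    (0x1780, 0x17A2, "consonant"),
    (0x17A3, 0x17B3, "independent_vowel"),
    (0x17B4, 0x17C5, "dependent_vowel"),
    (0x17C6, 0x17D1, "sign"),
    (0x17D2, 0x17D2, "subscript_marker"),
]


def khmer_class(ch: str) -> str:
    """Return the class of one character, or "other" if it is outside the listed ranges."""
    cp = ord(ch)
    for low, high, name in KHMER_CLASSES:
        if low <= cp <= high:
            return name
    return "other"
```

Anything that comes back as "other" would, per the description, break the collation sequence.
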
> Hence an alphabetic sort would have:
> Consonant/Independent Vowel - Primary sort*
> 17D2 and Consonant/Independent Vowel - Secondary sort (first subscript)
> 17D2 and Consonant/Independent Vowel - Tertiary sort (second subscript)
> Vowel - Quadriary sort (starting with inherent vowels 17B4 and 17B5 which are
> normally not encoded [17B4 assumed if no explicit vowel])
> Sign - Quintary sort
> (please pardon the spelling of the fourth and fifth level sort...not sure
> what is right)
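
The five levels listed above might be sketched as a tuple sort key for one cluster. This is only a rough illustration under several assumptions of mine: raw code points stand in for real weights (the independent vowels are said not to be in exact alphabetic order), a missing subscript is given weight 0 so that it sorts first, all signs are kept in written order, and the independent-vowel equivalences in the footnote are ignored. The name `cluster_key` is hypothetical.

```python
COENG = 0x17D2  # sign marking the following character as a subscript


def cluster_key(cluster: str) -> tuple:
    """Build a (base, subscript 1, subscript 2, vowel, signs) key for one cluster."""
    cps = [ord(c) for c in cluster]
    if not cps:
        raise ValueError("empty cluster")
    base = cps[0]        # primary: consonant or independent vowel
    subscripts = []      # secondary and tertiary: characters following 17D2
    vowel = 0x17B4       # fourth level: 17B4 assumed if no explicit vowel
    signs = []           # fifth level
    i = 1
    while i < len(cps):
        cp = cps[i]
        if cp == COENG and i + 1 < len(cps):
            subscripts.append(cps[i + 1])
            i += 2
            continue
        if 0x17B4 <= cp <= 0x17C5:
            vowel = cp
        elif 0x17C6 <= cp <= 0x17D1:
            signs.append(cp)
        i += 1
    # Pad with 0 so clusters without subscripts sort before those with them.
    first, second = (subscripts + [0, 0])[:2]
    return (base, first, second, vowel, tuple(signs))
```

Sorting a list of clusters with `sorted(clusters, key=cluster_key)` then compares the base first, the subscripts next, and the vowel and signs last, which is the priority order described.
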
> *There is some complication in this...Independent Vowels equate in collation
> to different Consonant/Vowel (Primary and Quadriary) combinations:
> 17A3 = 17A2
> 17A4=17A2+17CB (a sign!)
> 17A5=17A2+17B7
> 17A6=17A2+17B9
> 17A7=17A2+17BB
> 17A8=17A2+17BB+
> 17A9=17A2+17BC
> 17AA=17A2+17BC+
> 17AB=179A+17B9
> 17AC=179A+17BA
> 17AD=179B+17B9
> 17AE=179B+17B9
> 17AF=17A2+17C2
> 17B0=17A2+17C3
> 17B1=17A2+17C4
> 17B2=17A2+17C4
> 17B3=17A2+17C5
> Note that 17A7 and 17A8 have nearly the same value...however the latter has a
> final consonantal sound not treated as equivalent to a second consonant but
> weighted slightly differently from the first.
> Note that 17A9 and 17AA have nearly the same value...however the latter has a
> final consonantal sound not treated as equivalent to a second consonant but
> weighted slightly differently from the first.
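
The equivalence list above could be held as a lookup table, roughly as follows. The entries are transcribed exactly as written in the message, including the two rows that end in "+", whose extra weight is not specified and is only flagged here, not filled in. The names `EQUIVALENTS`, `TRAILING_PLUS` and `expand` are my own and purely illustrative.

```python
# Independent vowel -> equivalent sequence, transcribed as written above.
EQUIVALENTS = {
    0x17A3: (0x17A2,),
    0x17A4: (0x17A2, 0x17CB),
    0x17A5: (0x17A2, 0x17B7),
    0x17A6: (0x17A2, 0x17B9),
    0x17A7: (0x17A2, 0x17BB),
    0x17A8: (0x17A2, 0x17BB),
    0x17A9: (0x17A2, 0x17BC),
    0x17AA: (0x17A2, 0x17BC),
    0x17AB: (0x179A, 0x17B9),
    0x17AC: (0x179A, 0x17BA),
    0x17AD: (0x179B, 0x17B9),
    0x17AE: (0x179B, 0x17B9),
    0x17AF: (0x17A2, 0x17C2),
    0x17B0: (0x17A2, 0x17C3),
    0x17B1: (0x17A2, 0x17C4),
    0x17B2: (0x17A2, 0x17C4),
    0x17B3: (0x17A2, 0x17C5),
}

# Rows written with a trailing "+": an additional, unspecified weight
# for the final consonantal sound. Only recorded here, not modelled.
TRAILING_PLUS = {0x17A8, 0x17AA}


def expand(text: str) -> str:
    """Replace each independent vowel with its listed equivalent sequence."""
    out = []
    for ch in text:
        seq = EQUIVALENTS.get(ord(ch))
        if seq is None:
            out.append(ch)
        else:
            out.extend(chr(cp) for cp in seq)
    return "".join(out)
```

Running `expand` before building sort keys would make an independent vowel compare like its consonant-plus-vowel counterpart; how the "+" rows should differ from their neighbours is left open, as in the notes.
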
> What in addition do I need to do to facilitate Khmer being considered in
> Collation?
> Pensively;-)
> Maurice
> wrote:
> > While on the plane last week, I wrote a program that generates charts of
> > the default Unicode collation ordering. These charts display the actual
> > characters in order, rather than merely listing them by name. If you are
> > interested, you can find the charts at
> > Feedback is welcome.
> >
> > Mark
> > ___
> > Mark Davis, IBM Center for Java Technology, Cupertino
> > (408) 777-5850 [fax: 5891]
> >
> --
> Maurice Bauhahn
> 2 Meadow Way
> Dorney Reach
> SL6 0DS
> United Kingdom
> Home Tel: +44(0)1628 626068
> Work Tel: +44(0)118 9016020
> Home Email:
> Work Email:
