Personally, I find it interesting to see which and how many characters are affected by the difference in binary ordering between UTF-8 and UTF-16.
Affected are all code points in two ranges:
U+e000..U+ffff
U+10000..U+10ffff
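To make the effect concrete, here is a minimal Python sketch that compares two strings by their encoded bytes in both forms. The helper names (utf8_key, utf16_key, order_differs) are made up for this illustration, and big-endian UTF-16 is assumed so that byte order matches code unit order.

```python
def utf8_key(s: str) -> bytes:
    # Binary sort key: the UTF-8 bytes of the string.
    return s.encode("utf-8")


def utf16_key(s: str) -> bytes:
    # Binary sort key: big-endian UTF-16, so comparing bytes is the
    # same as comparing 16-bit code units.
    return s.encode("utf-16-be")


def order_differs(a: str, b: str) -> bool:
    """True if a and b compare differently in UTF-8 and UTF-16 binary order."""
    return (utf8_key(a) < utf8_key(b)) != (utf16_key(a) < utf16_key(b))


bmp_high = "\ue000"           # first code point of the affected BMP range
bmp_low = "\ud7ff"            # last code point below the surrogate block
supplementary = "\U00010000"  # first supplementary code point

print(order_differs(bmp_high, supplementary))  # True
print(order_differs(bmp_low, supplementary))   # False
```

In UTF-16 a supplementary character starts with a lead surrogate (0xd800..0xdbff), which is smaller than any code unit from 0xe000 up, while in UTF-8 its lead byte (0xf0..0xf4) is larger than the lead bytes 0xee/0xef used for U+e000..U+ffff. Two code points from the same range, or anything below U+d800, keep the same relative order in both forms.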
The second range contains assignments for characters that are "rare" in the "average text".
The first range is interesting: it consists mostly of the PUA range of the BMP, some "specials", and compatibility character assignments.
There are - aside from private use characters and the specials U+fff0..U+fffd - only 20 code points that "survive" an NFKD transformation:
12 CJK Unified Ideographs (U+fa__)
1 U+fb1e HEBREW POINT JUDEO-SPANISH VARIKA
2 ornate parentheses (U+fd3e/f)
2 combining ligature halves (U+fe20/1)
2 combining tilde halves (U+fe22/3)
1 U+feff ZWNBSP
So, given normalized text (NFKD), there are only 20 assigned, non-compatibility, non-special characters that sort either before or after those "very rare" supplementary characters when one binary sorts UTF-8/16 strings.
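For anyone who wants to redo the count, here is a rough Python sketch using the standard unicodedata module. The function name nfkd_survivors is made up for this illustration. It assumes that "survives" means the code point is unchanged by NFKD, and it skips private use (Co) and unassigned (Cn) code points as well as the specials block. The result depends on the Unicode version built into the Python interpreter, and later versions have added characters in this range, so the number will probably be larger than 20.

```python
import unicodedata


def nfkd_survivors(first: int = 0xE000, last: int = 0xFFFF) -> list[int]:
    """Code points in first..last that are assigned, not private use,
    not in the specials block, and unchanged by NFKD."""
    result = []
    for cp in range(first, last + 1):
        if 0xFFF0 <= cp <= 0xFFFD:
            continue  # the "specials" U+fff0..U+fffd
        ch = chr(cp)
        if unicodedata.category(ch) in ("Co", "Cn"):
            continue  # private use, unassigned, or noncharacter
        if unicodedata.normalize("NFKD", ch) == ch:
            result.append(cp)
    return result


survivors = nfkd_survivors()
print(unicodedata.unidata_version, len(survivors))
for cp in survivors:
    print(f"U+{cp:04x} {unicodedata.name(chr(cp), '<unnamed>')}")
```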
I leave it up to the list to consider this... ;-)
markus
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT