Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Jun 05 2001 - 17:33:00 EDT


Personally, I find it interesting to see which and how many characters are affected by the difference in binary ordering between UTF-8 and UTF-16.
Affected are all code points in two ranges:
    U+e000..U+ffff
    U+10000..U+10ffff
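
To make the ordering difference concrete, here is a minimal sketch in Python. The helper name binary_order and the three sample characters are only illustrative assumptions, not anything from the standard. It sorts the same strings by their encoded bytes, using big-endian UTF-16 so that byte order matches 16-bit code unit order.

```python
def binary_order(strings, encoding):
    """Sort strings by the raw bytes of the given encoding.

    Hypothetical helper for illustration. With "utf-16-be", comparing
    bytes is equivalent to comparing 16-bit code units.
    """
    return sorted(strings, key=lambda s: s.encode(encoding))


# One character below U+E000, one in U+E000..U+FFFF, one supplementary.
samples = ["\U00010000", "\ue000", "\u4e00"]

utf8_order = binary_order(samples, "utf-8")
utf16_order = binary_order(samples, "utf-16-be")

# UTF-8 keeps code point order; UTF-16 puts the supplementary character
# (encoded with surrogates D800..DFFF) ahead of U+E000.
print([f"U+{ord(s):04X}" for s in utf8_order])
print([f"U+{ord(s):04X}" for s in utf16_order])
```

Only characters in the two ranges above change their relative position; everything below U+D800 sorts identically in both forms.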

The second range contains assignments for characters that are "rare" in the "average text".

The first range is interesting: It consists mostly of the PUA range of the BMP, some "specials", and of compatibility character assignments.
There are - aside from private use characters and the specials U+fff0..U+fffd - only 20 code points that "survive" an NFKD transformation:

    12 CJK Unified Ideographs (U+fa__)
    1 U+fb1e HEBREW POINT JUDEO-SPANISH VARIKA
    2 ornate parentheses (U+fd3e/f)
    2 combining ligature halves (U+fe20/1)
    2 combining tilde halves (U+fe22/3)
    1 U+feff ZWNBSP

So, given normalized text (NFKD), there are only 20 assigned, non-compatibility, non-special characters whose position relative to those "very rare" supplementary characters (before or after) depends on whether one sorts the strings in UTF-8 or in UTF-16 binary order.
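
For anyone who wants to reproduce the count, the following is a rough sketch using Python's unicodedata module. The function name nfkd_survivors is made up for illustration. Note that Python ships a newer version of the Unicode data than the one the figures above are based on, so the result will include later additions to this range (the variation selectors U+FE00..U+FE0F, for example) and will be larger than 20.

```python
import unicodedata


def nfkd_survivors(first=0xE000, last=0xFFFF):
    """List code points in [first, last] that NFKD leaves unchanged.

    Hypothetical helper. Skips private use (Co) and unassigned code
    points or noncharacters (Cn), and the specials U+FFF0..U+FFFD.
    """
    survivors = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Co", "Cn"):
            continue
        if 0xFFF0 <= cp <= 0xFFFD:
            continue
        # A character "survives" if NFKD maps it to itself.
        if unicodedata.normalize("NFKD", ch) == ch:
            survivors.append(cp)
    return survivors


survivors = nfkd_survivors()
for cp in survivors:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```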

I leave it up to the list to consider this... ;-)

markus


