RE: UTF-8S ??? UTF-16F !!!

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Jun 13 2001 - 07:41:38 EDT


Markus Kuhn wrote:
> Oh my god! Please don't. THIS IS UGLY AND AWFUL!!!
> [...] In particular, they should not even think about
> proposing this evil idea for standardization. Yuck!!!

Welcome to the thread, and congratulations on this excellent summary of the
topic.

If Oracle doesn't provide anything in writing, someone could copy and paste
this piece of your post and publish *it* as an unofficial specification for
UTF-8s.

> I think, Oracle et al. should consider to use instead of UTF-16 what I
> propose to call UTF-16F (F for "fixed") in their B-trees, to maintain
> UCS binary sorting order:
>
> Conversion between UTF-16 and UTF-16F works as follows:
>
> unsigned short utf16_to_utf16f(unsigned short u)
> {
>   assert(u <= 0xffff);
>   /* shift surrogates into the top 0x800 code positions of
>      16-bit space */
>   if (u >= 0xe000)
>     return u - 0x800;
>   if (u >= 0xd800)
>     return u + 0x2000;
>   return u;
> }

No! This is not fair!!! ;-)

I won't cross-post to you anymore, not even to warn about an incoming danger! 8-)

If I hadn't indulged in bodily needs (I went to lunch), I would have
proposed what I called "UTF-16S" (where the S stands for either "sortable" or
"shifted") before you:

        int putchar_utf16s(char32 c)
        {
           /* Area 1: U+0000..U+D7FF -> 0000..D7FF
              (BMP pre-surro: copied) */
           if (c <= 0xD7FF)
              return putchar_ucs2(c);
           /* Area 2: U+D800..U+DFFF -> F800..FFFF
              (BMP surro: shifted up by 0x2000) */
           if (c >= 0xD800 && c <= 0xDFFF)
              return putchar_ucs2(c + 0x2000);
           /* Area 3: U+E000..U+FFFF -> D800..F7FF
              (BMP post-surro: shifted down by 0x0800) */
           if (c >= 0xE000 && c <= 0xFFFF)
              return putchar_ucs2(c - 0x0800);
           /* Area 4: U+10000..U+10FFFF -> F800,FC00..FBFF,FFFF
              (UTF-16S surro pairs).
              Note: with ||, the second unit is written only when the
              first call returns 0, so this relies on putchar_ucs2
              returning 0 on success. */
           if (c <= 0x10FFFF)
              return putchar_ucs2(0xF800 | ((c - 0x10000) >> 10))
                  || putchar_ucs2(0xFC00 | (c & 0x3FF));
           /* Out of range */
           return -1;
        }
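
For illustration only, here is a self-contained sketch of the same mapping
that writes into a caller-supplied buffer instead of calling putchar_ucs2,
together with the inverse mapping on single code units. The names
utf16_to_utf16s, utf16s_to_utf16 and encode_utf16s are hypothetical, as is
the convention of returning the number of code units written (0 for out of
range); none of this comes from an actual specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map one UTF-16 code unit to its UTF-16S/F form. */
static uint16_t utf16_to_utf16s(uint16_t u)
{
    if (u >= 0xE000)
        return (uint16_t)(u - 0x0800);  /* E000..FFFF -> D800..F7FF */
    if (u >= 0xD800)
        return (uint16_t)(u + 0x2000);  /* D800..DFFF -> F800..FFFF */
    return u;                           /* 0000..D7FF unchanged */
}

/* Hypothetical inverse: map one UTF-16S/F code unit back to UTF-16. */
static uint16_t utf16s_to_utf16(uint16_t s)
{
    if (s >= 0xF800)
        return (uint16_t)(s - 0x2000);  /* F800..FFFF -> D800..DFFF */
    if (s >= 0xD800)
        return (uint16_t)(s + 0x0800);  /* D800..F7FF -> E000..FFFF */
    return s;                           /* 0000..D7FF unchanged */
}

/* Encode code point c into out[]; returns the number of code units
   written (1 or 2), or 0 if c is above U+10FFFF (assumed convention). */
static size_t encode_utf16s(uint32_t c, uint16_t out[2])
{
    if (c <= 0xFFFF) {
        out[0] = utf16_to_utf16s((uint16_t)c);
        return 1;
    }
    if (c <= 0x10FFFF) {
        c -= 0x10000;
        out[0] = (uint16_t)(0xF800 | (c >> 10));    /* shifted high surrogate */
        out[1] = (uint16_t)(0xFC00 | (c & 0x3FF));  /* shifted low surrogate */
        return 2;
    }
    return 0;
}
```

With this mapping U+FFFF encodes as F7FF and U+10000 as F800 FC00, so
comparing code units gives the same order as comparing code points.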

BTW, Carl W. Brown has already proposed something similar, but he warned
that the problem in his (your? my? our?) proposal is the BOM.

0xFEFF (the BOM) and 0xFFFE (its byte-swapped counterpart) would become
something else. In UTF-16F/S, they would become 0xF6FF and 0xFFF6 respectively
(0xFEFF - 0x0800 = 0xF6FF, which reads as 0xFFF6 when byte-swapped). In
UCS-2/UTF-16, the first one is a PUA character and the second one is
unassigned.
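
The arithmetic behind those two values can be checked with a few lines of C.
This is only a sketch; shift_post_surrogate and swap_bytes are hypothetical
helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the UTF-16F/S shift applied to a code unit in
   E000..FFFF, which is the range the BOM (FEFF) falls into. */
static uint16_t shift_post_surrogate(uint16_t u)
{
    return (uint16_t)(u - 0x0800);
}

/* Hypothetical helper: swap the two bytes of a 16-bit code unit, as seen
   by a reader using the opposite byte order. */
static uint16_t swap_bytes(uint16_t u)
{
    return (uint16_t)((u << 8) | (u >> 8));
}
```

Here shift_post_surrogate(0xFEFF) gives 0xF6FF, and swap_bytes(0xF6FF) gives
0xFFF6.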

So, in order for UTF-16F/S to be distinguished from UCS-2/UTF-16, it would be
necessary that:

1) The signature (0xF6 0xFF or 0xFF 0xF6) is declared mandatory in
UTF-16F/S;
2) 0xFFF6 is declared a non-character, like 0xFFFE;
3) Code point 0xF6FF is removed from the PUA and redefined as an alternative
BOM.
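
Assuming all three conditions held, a reader could tell the encodings apart
from the first two bytes alone. The following is a sketch under that
assumption; the enum and the function name detect_signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for the signature check. */
enum signature {
    SIG_NONE,       /* no recognised signature */
    SIG_UTF16_BE,   /* FE FF */
    SIG_UTF16_LE,   /* FF FE */
    SIG_UTF16S_BE,  /* F6 FF */
    SIG_UTF16S_LE   /* FF F6 */
};

/* Classify the first two bytes of a buffer. This is only reliable if the
   UTF-16F/S signature is mandatory and F6FF/FFF6 cannot occur as ordinary
   characters at the start of UCS-2/UTF-16 text (conditions 1 to 3). */
static enum signature detect_signature(const unsigned char *p, size_t n)
{
    if (n < 2)
        return SIG_NONE;
    if (p[0] == 0xFE && p[1] == 0xFF)
        return SIG_UTF16_BE;
    if (p[0] == 0xFF && p[1] == 0xFE)
        return SIG_UTF16_LE;
    if (p[0] == 0xF6 && p[1] == 0xFF)
        return SIG_UTF16S_BE;
    if (p[0] == 0xFF && p[1] == 0xF6)
        return SIG_UTF16S_LE;
    return SIG_NONE;
}
```

Without condition 3, the bytes F6 FF at the start of unsigned big-endian
UCS-2 text would be a legitimate PUA character, so the check would be
ambiguous.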

Unfortunately, condition 3 is against Unicode policies, so this could not be
accepted even if the UTC wanted to...

_ Marco


