Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Tue Jul 03 2001 - 06:07:57 EDT


Tue, 3 Jul 2001 01:50:56 -0700, Michael (michka) Kaplan <michka@trigeminal.com> wrote:

>> It's a pity that UTF-16 doesn't encode characters up to U+FFFFF, such
>> that code points corresponding to lone surrogates can be encoded as
>> pairs of surrogates.
>
> Unfortunately, we would then be stuck with what happens when two such
> surrogate surrogates are next to each other....

There is no problem with that.

Encoding: A character U+0000..D7FF or U+E000..FFFF is encoded as a single
16-bit word. A character U+D800..DFFF or U+10000..FFFFF is encoded as two
16-bit words: 0xD800 + (ch >> 10) and 0xDC00 + (ch & 0x3FF).
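
A minimal sketch of the encoding rule above, in Python. Note that this
is the hypothetical scheme described in this message, not UTF-16 as
actually defined (there is no 0x10000 offset, and U+D800..DFFF is
itself encoded as a pair). The function name encode_word_stream is
made up for illustration.

```python
def encode_word_stream(code_points):
    """Encode code points as 16-bit words using the scheme proposed here.

    Assumption: this is NOT standard UTF-16. Code points are limited to
    U+0000..U+FFFFF, and U+D800..DFFF is encoded as two words.
    """
    words = []
    for ch in code_points:
        if not 0 <= ch <= 0xFFFFF:
            raise ValueError(f"code point out of range: {ch:#x}")
        if ch <= 0xD7FF or 0xE000 <= ch <= 0xFFFF:
            # Single-word case: the word is the code point itself.
            words.append(ch)
        else:
            # Two-word case: U+D800..DFFF or U+10000..FFFFF.
            words.append(0xD800 + (ch >> 10))    # top 10 bits
            words.append(0xDC00 + (ch & 0x3FF))  # low 10 bits
    return words
```

For example, encode_word_stream([0xD800]) gives the pair 0xD836 0xDC00,
and encode_word_stream([0xFFFFF]) gives 0xDBFF 0xDFFF.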

Decoding: A word 0x0000..D7FF or 0xE000..FFFF stands for itself.
Otherwise a word 0xD800..DBFF must be followed by a word 0xDC00..DFFF,
and the code obtained from them must be in the range U+D800..DFFF or
U+10000..FFFFF. The word stream is invalid in other cases (unpaired
surrogates or surrogates which encode a character which could be
encoded using a single word).
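
The decoding rule can be sketched the same way. Again this is only an
illustration of the hypothetical scheme in this message, not of real
UTF-16, and the name decode_word_stream is an assumption. It shows why
two encoded lone surrogates next to each other cause no trouble: each
one occupies its own well-formed pair.

```python
def decode_word_stream(words):
    """Decode 16-bit words produced by the scheme proposed here.

    Assumption: NOT standard UTF-16. A pair decodes without any
    0x10000 offset and must yield U+D800..DFFF or U+10000..FFFFF.
    """
    code_points = []
    i = 0
    while i < len(words):
        w = words[i]
        if not 0 <= w <= 0xFFFF:
            raise ValueError(f"not a 16-bit word: {w:#x}")
        if w <= 0xD7FF or w >= 0xE000:
            # The word stands for itself.
            code_points.append(w)
            i += 1
            continue
        if w >= 0xDC00:
            raise ValueError("trailing word without a leading word")
        if i + 1 >= len(words) or not 0xDC00 <= words[i + 1] <= 0xDFFF:
            raise ValueError("leading word without a trailing word")
        ch = ((w - 0xD800) << 10) | (words[i + 1] - 0xDC00)
        if not (0xD800 <= ch <= 0xDFFF or ch >= 0x10000):
            # Reject pairs for code points that fit in a single word.
            raise ValueError("pair encodes a single-word code point")
        code_points.append(ch)
        i += 2
    return code_points
```

For example, the four words 0xD836 0xDC00 0xD837 0xDFFF decode to the
two code points U+D800 and U+DFFF, while the pair 0xD800 0xDC41 is
rejected because it would encode U+0041.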

This gives an unambiguous mapping from every code point U+0000..U+FFFFF
to a sequence of one or two 16-bit words. The code space has exactly
20 bits. The code points corresponding to surrogates could even be
allocated to real characters.
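
The 20-bit claim can be checked with a little arithmetic, sketched here
in Python. The variable names are made up for illustration; the ranges
are the ones given in the encoding rule of this message.

```python
# Code points encoded as a single 16-bit word:
# U+0000..D7FF and U+E000..FFFF.
single = (0xD7FF - 0x0000 + 1) + (0xFFFF - 0xE000 + 1)    # 0xF800

# Code points encoded as two words:
# U+D800..DFFF and U+10000..FFFFF.
paired = (0xDFFF - 0xD800 + 1) + (0xFFFFF - 0x10000 + 1)  # 0xF0800

# Every code point falls in exactly one class, 2**20 in total.
total = single + paired

# 0x400 leading words times 0x400 trailing words give 2**20 possible
# pairs; the pairs left unused are those that would encode a code
# point already covered by a single word.
pair_slots = 0x400 * 0x400
unused_pairs = pair_slots - paired

print(hex(total), hex(unused_pairs))
```

So the unused pairs are exactly as many as the single-word code points,
which is why the decoder has to reject them to keep the mapping
one-to-one.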

Unicode issues would be simpler if UTF-16 as defined today did not
exist. UTF-16 spreads its ugliness to other encoding forms, and many
people think that Unicode implies 16 bits per character. There is a
tendency to use UTF-16 internally and ignore characters above U+FFFF,
treating surrogates as real characters which must come in pairs in
order to encode glyphs.

I suppose that we are stuck with UTF-16 forever, so please at least
don't spread surrogates to UTF-8 and UTF-32, which don't need to treat
the range U+D800..DFFF in any special way. It was hard enough for me
to accept that the code point space ends at a funny address U+10FFFF.
UTF-8 was so nice at 31 bits.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK


