Re: Rationale for U+10FFFF?

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Mar 06 2000 - 13:42:08 EST


Hello,

I find it amusing that "my" question from a year ago (and proposal of "UTF-20") is back... :-)

The reason for all of this is simply that the preferred form of Unicode is UTF-16, which has precisely the code point range of 0..0x10ffff. This is because UTF-16 is carefully designed to not have more than one possible encoding for any single code point: the two-word range is offset by the size of the single-word range.
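As a quick check of that arithmetic, here is a minimal Python sketch. The constant and function names are illustrative only and are not taken from ICU or any other library. It shows that 1024 lead surrogates times 1024 trail surrogates, offset by the size of the single-word range, end exactly at 0x10ffff.

```python
BMP_SIZE = 0x10000               # code points reachable with one 16-bit word
LEAD_COUNT = 0xDC00 - 0xD800     # 1024 lead (high) surrogates
TRAIL_COUNT = 0xE000 - 0xDC00    # 1024 trail (low) surrogates

# The two-word range starts right after the single-word range.
MAX_CODE_POINT = BMP_SIZE + LEAD_COUNT * TRAIL_COUNT - 1


def decode_pair(lead, trail):
    """Combine a surrogate pair into a supplementary code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return BMP_SIZE + ((lead - 0xD800) << 10) + (trail - 0xDC00)


def encode_pair(cp):
    """Split a supplementary code point into its only possible surrogate pair."""
    if not (BMP_SIZE <= cp <= MAX_CODE_POINT):
        raise ValueError("not a supplementary code point")
    v = cp - BMP_SIZE
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)
```

Because of the offset, no pair can produce a value below 0x10000, so a BMP code point never has a two-word form.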

UTF-16, therefore, does not need any range checks - either you have a BMP code point or put it together from a surrogate pair. UTF-8, on the other hand, comes with the problem that for a single code point you have (almost always) multiple encodings, which makes string searching, binary comparison, and security checks (see sections in modern RFCs about embedded NUL and control characters) difficult when such "irregular" sequences are used.
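To illustrate the "irregular" sequences, here is a deliberately naive decoder sketch in Python. The function name is made up for this example. It follows only the bit layout of lead and trail bytes and performs no shortest-form check, so several different byte sequences come out as the same code point.

```python
def naive_decode(seq):
    """Decode one UTF-8 sequence by bit layout only (no shortest-form check)."""
    lead = seq[0]
    if lead < 0x80:
        return lead
    if lead >> 5 == 0b110:
        cp, trail_count = lead & 0x1F, 1
    elif lead >> 4 == 0b1110:
        cp, trail_count = lead & 0x0F, 2
    elif lead >> 3 == 0b11110:
        cp, trail_count = lead & 0x07, 3
    else:
        raise ValueError("lead byte not handled in this sketch")
    if len(seq) != trail_count + 1:
        raise ValueError("wrong number of trail bytes")
    for b in seq[1:]:
        if b >> 6 != 0b10:
            raise ValueError("not a trail byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp
```

With this decoder, `b"\x00"`, `b"\xC0\x80"` and `b"\xE0\x80\x80"` all yield NUL, and `b"\xC0\xAF"` yields "/". A byte-level search for 0x00 or 0x2F would miss the longer forms, which is the security problem mentioned above.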

I am actually trying to put together (for ICU) macros that do the same operations - get code point and increment, decrement and get code point, etc. - with all three UTFs, and doing it "safely" with all the error checking is quite a pain with UTF-8. I had to move most of it into functions because the macros became too large. Doing another check for the code point <= 0x10ffff does not cause any significant performance degradation here.
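For a feel of what the "safe" variant involves, here is a rough Python sketch of a "get code point and increment" operation for UTF-8. It is not the ICU code. The name `utf8_next`, the -1 error value, and the choice to reject surrogates are assumptions made for this example. Note that the 0x10ffff check is a single extra comparison next to the shortest-form check.

```python
def utf8_next(s, i):
    """Return (code_point, next_index) for the sequence starting at s[i].

    Ill-formed input yields (-1, i + 1). Hypothetical helper, not ICU's API.
    """
    lead = s[i]
    i += 1
    if lead < 0x80:
        return lead, i
    if 0xC2 <= lead <= 0xDF:
        cp, count, minimum = lead & 0x1F, 1, 0x80
    elif 0xE0 <= lead <= 0xEF:
        cp, count, minimum = lead & 0x0F, 2, 0x800
    elif 0xF0 <= lead <= 0xF4:
        cp, count, minimum = lead & 0x07, 3, 0x10000
    else:
        return -1, i                      # stray trail byte or invalid lead
    if i + count > len(s):
        return -1, i                      # truncated sequence
    for k in range(count):
        b = s[i + k]
        if b & 0xC0 != 0x80:
            return -1, i                  # not a trail byte
        cp = (cp << 6) | (b & 0x3F)
    # Shortest-form check, the extra range check, and (an assumption of
    # this sketch) rejection of surrogate code points.
    if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return -1, i
    return cp, i + count
```

For example, `utf8_next(b"\xF4\x8F\xBF\xBF", 0)` gives `(0x10FFFF, 4)`, while `utf8_next(b"\xF4\x90\x80\x80", 0)` gives `(-1, 1)` because the decoded value would be 0x110000.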

It is also widely believed that a million code points are plenty. That makes UTF-8, which is defined with sequences of up to 6 bytes covering values up to 0x7fffffff, unnecessarily unwieldy. With hindsight (which tends to provide a clear view!), it would have been better to design UTF-8 such that

- a code point can be encoded only in one way
- at most 4 bytes are used
- only the actual range up to 0x10ffff is covered
- the decoding is easier by having a fixed format for lead bytes instead of the current variable-length format that requires a lookup table or "find first 0 bit" machine operations
- the C1 control set is not used for multi-byte sequences (UTF-8 was designed to be "File-System-Safe", not "VT-100-safe"...)

All this is possible and easy with 64 trail bytes and 21 lead bytes. However, we have to live with a suboptimal UTF-8 where we need byte-based encodings - no one wants a new UTF for general purpose.
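The byte budget behind that statement can be checked with a few lines of Python. The post does not say how the 21 lead bytes would be divided among 2-, 3- and 4-byte sequences, so the split used below is purely a hypothetical one chosen for illustration, as are all the names.

```python
ASCII_BYTES = 128    # 0x00..0x7f, one byte each
C1_BYTES = 32        # 0x80..0x9f, left unused
TRAIL_BYTES = 64
LEAD_BYTES = 21


def capacity(two, three, four):
    """Code points covered when each length starts where the previous ends."""
    return (ASCII_BYTES
            + two * TRAIL_BYTES
            + three * TRAIL_BYTES ** 2
            + four * TRAIL_BYTES ** 3)


# Hypothetical split of the 21 lead bytes (an assumption, not from the post):
# 1 lead for 2-byte, 16 for 3-byte, 4 for 4-byte sequences.
SPLIT = (1, 16, 4)

bytes_used = ASCII_BYTES + C1_BYTES + TRAIL_BYTES + LEAD_BYTES
covered = capacity(*SPLIT)
```

Here `bytes_used` is 245 of the 256 byte values, and `covered` is 1114304, which is at least the 0x110000 (1114112) code points needed. Other splits would work as well. This only shows that the numbers fit.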

markus


