RE: Rationale for U+10FFFF?

From: Murray Sargent (murrays@microsoft.com)
Date: Mon Mar 06 2000 - 17:29:34 EST


Hey guys, range checks are essentially as efficient as AND operations (in
C/C++ :-), namely

        if(IN_RANGE(n1, ch, n2))
        {
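                /* ch is in the inclusive range [n1, n2] */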
        }

where the IN_RANGE() macro is defined as:

#define IN_RANGE(n1, ch, n2) ((unsigned)((ch) - (n1)) <= (unsigned)((n2) - (n1)))

For constant n1 and n2, this requires only an extra subtraction relative to
an if statement with an AND, and it still needs only a single conditional
branch. So performance-wise it's immaterial whether you use an AND or a
range check in C/C++.
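
For illustration, here's a lead-surrogate test written both ways (a rough
sketch; 0xD800..0xDBFF is the standard lead-surrogate range):

        /* range check: ch in [0xD800, 0xDBFF] */
        if(IN_RANGE(0xD800, ch, 0xDBFF))
        {
                /* lead (high) surrogate */
        }

        /* equivalent AND check: the top six bits must be 110110 */
        if((ch & 0xFC00) == 0xD800)
        {
                /* lead (high) surrogate */
        }

With constant bounds the first form compiles to one subtraction, one
unsigned compare, and one branch - about the same work as the mask and
compare in the second form.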

Now in Java, it's a bit more painful since Java doesn't have unsigned
types....

Murray

> -----Original Message-----
> From: Harald Tveit Alvestrand [SMTP:Harald@Alvestrand.no]
> Sent: Monday, March 06, 2000 2:01 PM
> To: Unicode List
> Subject: Re: Rationale for U+10FFFF?
>
> At 10:35 06.03.00 -0800, Markus Scherer wrote:
>
> >UTF-16, therefore, does not need any range checks - either you have a BMP
> >code point or put it together from a surrogate pair.
>
> except if the second character isn't a surrogate, in which case you have an
> erroneously encoded string. Range check required.
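
A rough sketch of that range check when pulling one code point out of
UTF-16 (the function below is made up for illustration; the surrogate
ranges are the standard ones, and IN_RANGE is the macro from above):

        /* return the code point starting at s[i], or -1 for an
           ill-formed sequence (unpaired surrogate) - sketch only */
        long utf16_next(const unsigned short *s, int i, int len)
        {
                unsigned short lead = s[i];
                if(!IN_RANGE(0xD800, lead, 0xDFFF))
                        return lead;            /* plain BMP code point */
                if(lead <= 0xDBFF && i + 1 < len &&
                   IN_RANGE(0xDC00, s[i + 1], 0xDFFF))
                        return 0x10000 + ((long)(lead - 0xD800) << 10)
                                       + (s[i + 1] - 0xDC00);
                return -1;                      /* unpaired surrogate */
        }
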
>
> > UTF-8, on the other hand, comes with the problem that for a single code
> > point you have (almost always) multiple encodings,
>
> how?
> with the rule in force that no unnecessary bytes be used for encoding, I
> can't see a way to make multiple UTF-8 encodings of the same string.
>
> There exist non-valid octet strings that a UTF-8 decoder that did no range
> checks might turn into a number without an error message, but that's hardly
> a strange thing.
>
> > which makes string searching, binary comparison, and security checks
> > (see sections in modern RFCs about embedded NUL and control characters)
> > difficult when such "irregular" sequences are used.
>
> The embedded NUL and controls are all outlawed in properly formed UTF-8.
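
For example (an illustration using the standard UTF-8 bit layout): a decoder
that only masks and shifts would happily turn the ill-formed pair 0xC0 0xAF
into 0x2F ('/') and 0xC0 0x80 into NUL.  Rejecting those overlong forms is
one more range check on the decoded value:

        /* sketch: decode a 2-byte UTF-8 sequence, rejecting overlong
           (non-shortest) forms; returns the code point or -1 */
        int utf8_decode2(unsigned char b0, unsigned char b1)
        {
                int cp;
                if((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
                        return -1;              /* not a 2-byte sequence */
                cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
                if(cp < 0x80)
                        return -1;              /* overlong - reject */
                return cp;
        }
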
>
> >I am actually trying to put together (for ICU) macros that do the same
> >operations - get code point and increment, decrement and get code point,
> >etc. - with all three UTFs, and doing it "safely" with all the error
> >checking is quite a pain with UTF-8. I had to move most of it into
> >functions because the macros became too large. Doing another check for the
> >code point <=0x10ffff does not cause any significant performance
> >degradation here.
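
(That last test is just one more application of a range check; with the
IN_RANGE macro above, a sketch of the final check on a decoded code point
cp would be

        if(!IN_RANGE(0, cp, 0x10FFFF))
        {
                /* ill-formed: out of range */
        }

i.e. a single extra compare per decoded code point.)
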
> >
> >It is also widely believed that a million code points are plenty. This
> >makes UTF-8 unnecessarily unwieldy. With hindsight (tends to provide a
> >clear view!), it would have been better to design UTF-8 such that
> >
> >- a code point can be encoded only in one way
>
> done, unless I missed something
>
> >- at most 4 bytes are used
>
> done with 17-plane limit
>
> >- only the actual range up to 0x10ffff is covered
>
> of questionable value - see previous discussion
>
> >- the decoding is easier by having a fixed format for lead bytes instead
> >of the current variable-length format that requires a lookup table or
> >"find first 0 bit" machine operations
>
> a fixed format that takes less than 1 byte?
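
What the current variable-length format costs, roughly (an illustrative
sketch, not ICU's actual code): before reading any trail bytes the decoder
has to classify the lead byte with a cascade of compares, a 256-entry
lookup table, or a "find first 0 bit" style instruction:

        /* sketch: sequence length implied by a UTF-8 lead byte
           (0 = trail byte or ill-formed lead) */
        int utf8_seq_len(unsigned char lead)
        {
                if(lead < 0x80) return 1;       /* 0xxxxxxx             */
                if(lead < 0xC0) return 0;       /* 10xxxxxx: trail byte */
                if(lead < 0xE0) return 2;       /* 110xxxxx             */
                if(lead < 0xF0) return 3;       /* 1110xxxx             */
                if(lead < 0xF8) return 4;       /* 11110xxx             */
                return 0;                       /* 5-/6-byte forms      */
        }
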
>
> >- the C1 control set is not used for multi-byte sequences (UTF-8 was
> >designed to be "File-System-Safe", not "VT-100-safe"...)
>
> that argument I agree with....
>
> >All this is possible and easy with 64 trail bytes and 21 lead bytes.
> >However, we have to live with a suboptimal UTF-8 where we need byte-based
> >encodings - no one wants a new UTF for general purpose.
>
> agreed.
>
> Harald
>
> --
> Harald Tveit Alvestrand, EDB Maxware, Norway
> Harald.Alvestrand@edb.maxware.no


