At 10:35 06.03.00 -0800, Markus Scherer wrote:
>UTF-16, therefore, does not need any range checks - either you have a BMP
>code point or put it together from a surrogate pair.
except if the second code unit isn't a trail surrogate, in which case you
have an erroneously encoded string. Range check required.
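
A minimal sketch of that range check, in Python for brevity. The function name decode_utf16_units is made up for illustration and is not an ICU API; input is assumed to be a list of 16-bit integers.

```python
def decode_utf16_units(units):
    """Decode a list of 16-bit code units into a list of code points.

    Raises ValueError on any unpaired surrogate.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:            # lead (high) surrogate
            if i + 1 >= len(units):
                raise ValueError("lead surrogate at end of input")
            t = units[i + 1]
            if not (0xDC00 <= t <= 0xDFFF):  # the range check in question
                raise ValueError("lead surrogate not followed by a trail surrogate")
            out.append(0x10000 + ((u - 0xD800) << 10) + (t - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:          # trail surrogate with no lead
            raise ValueError("unpaired trail surrogate")
        else:
            out.append(u)                    # plain BMP code point
            i += 1
    return out
```

So the well-formed path needs no check on the resulting value, but the second unit of a pair still has to be tested before it is combined.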
> UTF-8, on the other hand, comes with the problem that for a single code
> point you have (almost always) multiple encodings,
how?
with the rule in force that no unnecessary bytes be used for encoding, I
can't see a way to make multiple UTF-8 encodings of the same string.
There exist invalid octet strings that a UTF-8 decoder doing no range
checks might turn into a number without an error message, but that's hardly
a strange thing.
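
To illustrate the difference, here is a rough sketch of a two-byte decoder with and without the shortest-form check. The function decode_two_byte and its strict flag are hypothetical names for this example only.

```python
def decode_two_byte(b0, b1, strict=True):
    """Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point."""
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a two-byte sequence")
    cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    if strict and cp < 0x80:
        # Shortest-form rule: values below 0x80 must be encoded in one byte.
        raise ValueError("overlong encoding")
    return cp
```

With strict=False, the octets C0 80 come out as the number 0; with the check in place they are rejected, so only one encoding per code point survives.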
> which makes string searching, binary comparison, and security checks
> (see sections in modern RFCs about embedded NUL and control characters)
> difficult when such "irregular" sequences are used.
The overlong encodings of NUL and the controls are all outlawed in properly
formed UTF-8.
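
A small illustration of the concern, using Python's built-in UTF-8 codec as an example of a strict decoder. The byte string is invented for the example.

```python
data = b"abc\xc0\x80def"   # C0 80 is the overlong two-byte form of U+0000

# A plain byte search for NUL does not see the overlong form.
found_nul = b"\x00" in data

# Python's built-in UTF-8 codec is strict and refuses the sequence.
try:
    data.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Here found_nul is False and rejected is True: the byte-level check misses the hidden NUL, and it is the strict decoder that stops it.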
>I am actually trying to put together (for ICU) macros that do the same
>operations - get code point and increment, decrement and get code point,
>etc. - with all three UTFs, and doing it "safely" with all the error
>checking is quite a pain with UTF-8. I had to move most of it into
>functions because the macros became too large. Doing another check for the
>code point <=0x10ffff does not cause any significant performance
>degradation here.
>
>It is also widely believed that a million code points are plenty. This
>makes UTF-8 unnecessarily unwieldy. With hindsight (tends to provide a
>clear view!), it would have been better to design UTF-8 such that
>
>- a code point can be encoded only in one way
done, unless I missed something
>- at most 4 bytes are used
done with 17-plane limit
>- only the actual range up to 0x10ffff is covered
of questionable value - see previous discussion
>- the decoding is easier by having a fixed format for lead bytes instead
>of the current variable-length format that requires a lookup table or
>"find first 0 bit" machine operations
a fixed format that takes less than 1 byte?
>- the C1 control set is not used for multi-byte sequences (UTF-8 was
>designed to be "File-System-Safe", not "VT-100-safe"...)
that argument I agree with....
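
For reference, the "find first 0 bit" step on the lead byte can be sketched as below. The names leading_ones and utf8_sequence_length are illustrative, and the sketch assumes the 4-byte maximum that goes with the 17-plane limit. It only classifies the bit pattern and does not reject lead bytes such as C0 or C1 that can only begin overlong forms.

```python
def leading_ones(byte):
    """Count the 1 bits at the top of an 8-bit value, stopping at the first 0."""
    n = 0
    mask = 0x80
    while mask and (byte & mask):
        n += 1
        mask >>= 1
    return n


def utf8_sequence_length(lead):
    """Return the sequence length implied by a UTF-8 lead byte (1 to 4)."""
    n = leading_ones(lead)
    if n == 0:
        return 1                      # 0xxxxxxx: single-byte (ASCII)
    if n == 1:
        raise ValueError("trail byte (10xxxxxx) cannot start a sequence")
    if n > 4:
        raise ValueError("lead byte implies more than 4 bytes")
    return n                          # 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4
```

A lookup table indexed by the lead byte would give the same answers; the loop just spells out what the table or the machine instruction computes.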
>All this is possible and easy with 64 trail bytes and 21 lead bytes.
>However, we have to live with a suboptimal UTF-8 where we need byte-based
>encodings - no one wants a new UTF for general purpose.
agreed.
Harald
-- Harald Tveit Alvestrand, EDB Maxware, Norway Harald.Alvestrand@edb.maxware.no