So far, the Unicode Standard has defined code points as the contiguous range 0..0x10ffff.
Some definitions are fuzzy in the standard, with hopes of clarification in Unicode 4.0.
It is true that UTF-16 cannot encode the code point sequence <d800 dc00>
(those code units would read back as the single code point 10000),
but it can encode <d800 0061 dc00>.
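A minimal Java sketch of that round trip (Java strings are sequences of
UTF-16 code units; the class name is just for illustration):

    public class SurrogateRoundTrip {
        public static void main(String[] args) {
            // Append the code points d800 and dc00 one by one.
            String s = new StringBuilder()
                    .appendCodePoint(0xD800)
                    .appendCodePoint(0xDC00)
                    .toString();
            // The code unit sequence <d800 dc00> reads back as the single
            // code point 10000, not as the two original code points.
            System.out.printf("%04x%n", s.codePointAt(0));   // 10000

            // With another code unit in between, both surrogates survive.
            String t = new StringBuilder()
                    .appendCodePoint(0xD800)
                    .appendCodePoint(0x0061)
                    .appendCodePoint(0xDC00)
                    .toString();
            System.out.printf("%04x %04x %04x%n",
                    t.codePointAt(0), t.codePointAt(1), t.codePointAt(2));   // d800 0061 dc00
        }
    }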
There are at least three reasons not to forbid the representation
of surrogate code points in UTF-16 (and also UTF-32),
or to deny that surrogates are code points at all:
1. Compatibility.
UTF-16 was explicitly created to be backwards compatible with UCS-2.
Valid UCS-2 text must be valid UTF-16 text.
In UCS-2, code points d800..dfff were legal, so they must remain legal in UTF-16.
2. Performance.
When you iterate through a UTF-16/32 string, you don't want to forbid
surrogate code points, because checking for and rejecting them adds
complexity to your logic.
In fact, iterating through UTF-16 text currently does not produce any
decoding errors.
When you go through <d800 0061 dc00 d800 dc01> you get the code points
d800, 0061, dc00, 10001 (see the first sketch after this list).
Similarly, you don't want to forbid appending d800 to a string:
the application might deliberately append code units one at a time
(with dc00 to follow), or it might simply be blind to surrogates
(a UCS-2 application) and pass code units through one by one,
with the reasonable expectation that a surrogate pair would be
rejoined by default.
3. Properties.
An API that takes a code point and returns a property for that code point
must be able to deal with surrogate code points because there are non-trivial
properties assigned to them, e.g., general category Cs.
Surrogate code points have been listed in the UCD for a long time,
which shows that they are different from illegal code point values
like 0x110000 or -1 (see the second sketch after this list).
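Here is a minimal Java sketch of the iteration in point 2 (again, the
class name is only for illustration). Paired surrogates are rejoined,
unpaired surrogates come through as surrogate code points, and no
decoding error is raised:

    public class IterateUtf16 {
        public static void main(String[] args) {
            // The UTF-16 code unit sequence <d800 0061 dc00 d800 dc01>.
            String s = "\uD800\u0061\uDC00\uD800\uDC01";
            int i = 0;
            while (i < s.length()) {
                int cp = s.codePointAt(i);     // unpaired surrogate -> surrogate code point
                System.out.printf("%04x%n", cp);
                i += Character.charCount(cp);  // advance by 1 or 2 code units
            }
            // Prints: d800, 0061, dc00, 10001
        }
    }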
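And a second sketch for point 3: the standard java.lang.Character
methods treat a surrogate as a real code point whose general category
is Cs, while 0x110000 and -1 are rejected as code point values:

    public class SurrogateProperties {
        public static void main(String[] args) {
            // A surrogate code point is a valid code point with real
            // properties: its general category is Cs (Character.SURROGATE).
            System.out.println(Character.isValidCodePoint(0xD800));               // true
            System.out.println(Character.getType(0xD800) == Character.SURROGATE); // true

            // By contrast, 0x110000 and -1 are not code points at all.
            System.out.println(Character.isValidCodePoint(0x110000));             // false
            System.out.println(Character.isValidCodePoint(-1));                   // false
        }
    }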
markus