From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Thu May 15 2003 - 12:29:59 EDT
Ben Dougall wrote:
> <guess> that area is full of surrogates. so they need another code point
> to make up a single character. on their own 0xd800-0xdfff are 1/2
> characters :) </guess>
Oh, no! Again, you are confusing code-points and code-units
(in other words: Unicode and its UTFs).
Code points:
- In Unicode, a surrogate code-point (U+D800 to U+DFFF) is not assigned
any character. These code-points are therefore illegal, so none of them
can be contained in actual (legal) data.
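To make the code-point side concrete, here is a minimal Python sketch. The helper names is_surrogate_code_point and can_encode are made up for this illustration. It assumes Python 3, whose str type happens to tolerate a lone surrogate code-point in memory but whose standard codecs refuse to serialize one.

```python
import unicodedata


def is_surrogate_code_point(cp: int) -> bool:
    """Hypothetical helper: True if cp lies in the range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: report whether text can be serialized as encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Python allows a str to hold a lone surrogate code-point in memory ...
lone = chr(0xD800)

# ... and the character database classifies it as "Cs" (Surrogate),
# not as any letter, digit, or symbol.
category = unicodedata.category(lone)

# None of the UTFs will serialize it with the default (strict) error handler.
encodable = {enc: can_encode(lone, enc) for enc in ("utf-8", "utf-16", "utf-32")}
```

In this sketch, encodable maps every listed encoding to False, which is the practical meaning of "cannot be contained in legal data".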
Code units:
- In UTF-8, there is no such thing as a surrogate code-unit,
as the code units are only 8 bits wide.
- In UTF-16, a pair of surrogate code-units encodes a character
beyond the BMP (and a non-surrogate code-unit encodes a character
in the BMP).
- In UTF-32, a surrogate code-unit is illegal, as it would
encode an illegal surrogate code-point.
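The code-unit side can be sketched in the same way. The snippet below, again only an illustration under the assumption of Python 3, splits one character beyond the BMP into its code units under each UTF. The helper code_units is a made-up name, and the byte-order-explicit codecs ("utf-16-le", "utf-32-le") are used so that no byte-order mark is mixed into the result.

```python
import struct


def code_units(text: str, encoding: str, width: int) -> list:
    """Hypothetical helper: split encoded text into code units.

    encoding must be a little-endian codec without a byte-order mark,
    and width is the size of one code unit in bytes (1, 2, or 4).
    """
    data = text.encode(encoding)
    fmt = {1: "B", 2: "H", 4: "I"}[width]
    return list(struct.unpack(f"<{len(data) // width}{fmt}", data))


ch = "\U00010400"  # one character (one code-point) beyond the BMP

utf8 = code_units(ch, "utf-8", 1)       # four 8-bit code units
utf16 = code_units(ch, "utf-16-le", 2)  # two 16-bit code units: a surrogate pair
utf32 = code_units(ch, "utf-32-le", 4)  # one 32-bit code unit: the code-point itself

# A character inside the BMP needs only a single, non-surrogate UTF-16 code unit.
bmp = code_units("A", "utf-16-le", 2)
```

Here utf8 is [0xF0, 0x90, 0x90, 0x80], utf16 is [0xD801, 0xDC00], and utf32 is [0x10400]. Only the UTF-16 list contains values from the range 0xD800-0xDFFF, and they are code units, not code-points.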
In a nutshell: Unicode is not UTF-16.
Best wishes,
Otto Stolz