From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 04 2003 - 12:29:20 EST
From: "David E. Hollingsworth" <deh@fastanimals.com>
> I believe this is described pretty well in sections 3.8 & 3.9 (plus
> conformance requirement C12b) of Unicode 4.0.
>
> Surrogate pairs are for UTF-16 only. For UTF-8 & UTF-32, surrogates
> (pairs or otherwise) are ill-formed code unit sequences, and
> conformant processes must treat them as erroneous.
Well, encoding surrogates is indeed forbidden in UTF-8, but not in CESU-8,
where a surrogate pair, with each surrogate encoded as its own 3-byte
sequence, is the only allowed way to encode characters outside the BMP.
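To make the difference concrete, here is a minimal Python sketch of a CESU-8
encoder. The name cesu8_encode is hypothetical (Python ships no CESU-8
codec); it only assumes the standard "utf-8" codec and its "surrogatepass"
error handler. Supplementary code points are split into a UTF-16 surrogate
pair, and each surrogate is then written as its own 3-byte sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative helper, not a standard codec)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points use the same bytes as UTF-8. A lone surrogate
            # in the input raises UnicodeEncodeError here, on purpose.
            out += ch.encode("utf-8")
        else:
            # Split the supplementary code point into a surrogate pair,
            # exactly as UTF-16 would.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" allows the 3-byte form of a surrogate.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


# U+10400 is 4 bytes in UTF-8 but 6 bytes (two 3-byte sequences) in CESU-8.
utf8_bytes = "\U00010400".encode("utf-8")
cesu8_bytes = cesu8_encode("\U00010400")
```

With this sketch, utf8_bytes is F0 90 90 80 while cesu8_bytes is
ED A0 81 ED B0 80; for text limited to the BMP the two encodings are
byte-identical.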
Still, CESU-8 has its own ill-formed sequences:
- those that use encoded sequences of more than 3 bytes (the 4-byte UTF-8
form) for code points outside the BMP;
- those that encode unpaired surrogate code points.
This second form, however, is quite common in legacy applications that allow
unpaired surrogate code points and handle them as if they coded individual
characters. This is allowed for internal string handling (typical in Java,
and in most C/C++ applications that map wchar_t to a 16-bit integer), but
text that contains unpaired surrogate code points should not be interchanged
as CESU-8.
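The two ill-formed cases can be checked before interchange. The following
Python sketch is only an illustration under stated assumptions: the function
name is_well_formed_cesu8 is hypothetical, and it leans on the standard
"utf-8" codec with the "surrogatepass" error handler to decode 3-byte
surrogate encodings into individual surrogate code points.

```python
def is_well_formed_cesu8(data: bytes) -> bool:
    """Return True if data looks like well-formed CESU-8 (illustrative)."""
    # Lead bytes 0xF0 and above start 4-byte UTF-8 sequences (or are
    # invalid); CESU-8 never uses sequences longer than 3 bytes.
    if any(b >= 0xF0 for b in data):
        return False
    try:
        # Encoded surrogates come back as separate surrogate code points.
        text = data.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        return False

    expect_low = False
    for ch in text:
        cp = ord(ch)
        if expect_low:
            # A high surrogate must be followed by a low surrogate.
            if not 0xDC00 <= cp <= 0xDFFF:
                return False
            expect_low = False
        elif 0xD800 <= cp <= 0xDBFF:
            expect_low = True
        elif 0xDC00 <= cp <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return False
    # A trailing high surrogate is also unpaired.
    return not expect_low
```

An application that tolerates unpaired surrogates internally could run a
check like this at the boundary and refuse to emit the text as CESU-8 when
it fails.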