Re: UTF-16 inside UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 04 2003 - 12:29:20 EST

Next message: YTang0648@aol.com: "Re: UTF-16 inside UTF-8"

Previous message: Peter Kirk: "Re: UTF-16 inside UTF-8"
In reply to: David E. Hollingsworth: "Re: UTF-16 inside UTF-8"
Next in thread: Jill Ramonsky: "RE: UTF-16 inside UTF-8"

From: "David E. Hollingsworth" <deh@fastanimals.com>
> I believe this is described pretty well in sections 3.8 & 3.9 (plus
> conformance requirement C12b) of Unicode 4.0.
>
> Surrogate pairs are for UTF-16 only. For UTF-8 & UTF-32, surrogates
> (pairs or otherwise) are ill-formed code unit sequences, and
> conformant processes must treat them as erroneous.

Well, encoding surrogates is indeed forbidden in UTF-8, but not in CESU-8,
where a pair of encoded surrogates is the only allowed way to represent
characters outside the BMP.
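
As a rough sketch of what that means in practice (the names cesu8_encode and
_three_bytes are made up for illustration and are not part of any library): a
supplementary code point is first split into a UTF-16 surrogate pair, and each
surrogate is then written as its own 3-byte sequence.

```python
def _three_bytes(u: int) -> bytes:
    # Generic 3-byte UTF-8 style pattern: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])


def cesu8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as CESU-8 (illustrative sketch only)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        # Inside the BMP, CESU-8 and UTF-8 produce the same bytes.
        return chr(cp).encode("utf-8")
    # Outside the BMP: build the UTF-16 surrogate pair first...
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    # ...then encode each surrogate separately as 3 bytes (6 bytes in total).
    return _three_bytes(high) + _three_bytes(low)


print(cesu8_encode(0x10000).hex())        # eda080edb080
print(chr(0x10000).encode("utf-8").hex())  # f0908080 (the UTF-8 form)
```

So U+10000 takes six bytes in CESU-8 where UTF-8 uses four.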

Still, CESU-8 has ill-formed sequences of its own:
- those that use encoded sequences of more than 3 bytes (that is, the 4-byte
UTF-8 form) for code points outside the BMP
- those that contain unpaired surrogate code points.
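
The two cases above could be checked along these lines. This is only a sketch
under my own assumptions: cesu8_is_well_formed is a hypothetical name, and the
function also rejects the usual UTF-8 errors (stray continuation bytes,
overlong forms) that the list above does not mention.

```python
def cesu8_is_well_formed(data: bytes) -> bool:
    """Return True if data looks like well-formed CESU-8 (illustrative sketch)."""
    i, n = 0, len(data)
    pending_high = False  # True right after a high surrogate was decoded
    while i < n:
        b = data[i]
        if b < 0x80:
            length, cp = 1, b
        elif 0xC2 <= b <= 0xDF:
            length, cp = 2, b & 0x1F
        elif 0xE0 <= b <= 0xEF:
            length, cp = 3, b & 0x0F
        else:
            # Stray continuation byte, overlong lead (C0/C1), or a lead byte
            # of a sequence longer than 3 bytes: the first case in the list.
            return False
        if i + length > n:
            return False  # truncated sequence
        for c in data[i + 1:i + length]:
            if c & 0xC0 != 0x80:
                return False  # not a continuation byte
            cp = (cp << 6) | (c & 0x3F)
        if length == 3 and cp < 0x800:
            return False  # overlong 3-byte form
        is_high = 0xD800 <= cp <= 0xDBFF
        is_low = 0xDC00 <= cp <= 0xDFFF
        # A low surrogate must appear exactly when a high surrogate is pending;
        # anything else is an unpaired surrogate: the second case in the list.
        if pending_high != is_low:
            return False
        pending_high = is_high
        i += length
    return not pending_high  # a trailing high surrogate is unpaired too


print(cesu8_is_well_formed(bytes.fromhex("eda080edb080")))  # True
print(cesu8_is_well_formed(bytes.fromhex("f0908080")))      # False
print(cesu8_is_well_formed(bytes.fromhex("eda080")))        # False
```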

This second form, however, is quite common in legacy applications that allow
unpaired surrogate code points and handle them as if they encoded individual
characters. This is allowed for internal string handling (typical in Java,
and in most C/C++ applications that map wchar_t to a 16-bit integer), but
text that contains unpaired surrogate code points should not be interchanged
as CESU-8.
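
An application that keeps such 16-bit strings internally could scan them
before interchange. A minimal sketch, assuming the string is available as a
sequence of 16-bit integers (first_unpaired_surrogate is a hypothetical helper
name, not an existing API):

```python
def first_unpaired_surrogate(units) -> int:
    """Return the index of the first unpaired surrogate, or -1 if there is none.

    `units` is assumed to be a sequence of 16-bit code units (ints).
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return i
        i += 1
    return -1


print(first_unpaired_surrogate([0x41, 0xD800, 0xDC00]))  # -1
print(first_unpaired_surrogate([0x41, 0xD800]))          # 1
```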

This archive was generated by hypermail 2.1.5 : Tue Nov 04 2003 - 13:20:29 EST