From: Doug Ewell (doug@ewellic.org)
Date: Wed Dec 30 2009 - 21:15:00 CST
Andrew Lipscomb <ewwa at chattanooga dot net> wrote:
> Except that UTF-32 *isn't* on the banned list that started this
> thread--discouraged, though, as I understand it. The fourth one was
> CESU-8 (which, granted, has only one character that can be encoded two
> ways, the NULL).
CESU-8 doesn't have any characters that can be encoded two ways. You
may be thinking of a different encoding.
CESU-8 is simply UTF-8 applied to UTF-16 code units instead of Unicode
scalar values. A supplementary character like U+10000 is encoded as <ED
A0 80 ED B0 80> instead of <F0 90 80 80>. (Note that UTR #26
incorrectly quotes this as <ED AE 80 ED B0 80>, which is the CESU-8
encoding for U+F0000, an earlier example.) All BMP characters,
including U+0000 NULL, are encoded the same in both CESU-8 and UTF-8,
which of course is the biggest problem with CESU-8.
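To make the byte sequences above easy to check, here is a minimal Python sketch of the rule "UTF-8 applied to UTF-16 code units". The function names utf8_pattern and cesu8_encode are made up for this illustration and do not come from any library; the sketch assumes the input is an ordinary Python str and does no validation.

```python
def utf8_pattern(value: int) -> bytes:
    """Apply the 1-, 2-, or 3-byte UTF-8 bit layout to a value below 0x10000."""
    if value < 0x80:
        return bytes([value])
    if value < 0x800:
        return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    return bytes([
        0xE0 | (value >> 12),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8 (illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary character: split into a UTF-16 surrogate pair.
            cp -= 0x10000
            units = (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF))
        else:
            # BMP character: a single UTF-16 code unit.
            units = (cp,)
        # Each 16-bit code unit is encoded separately with the UTF-8 layout.
        for unit in units:
            out += utf8_pattern(unit)
    return bytes(out)


print(cesu8_encode("\U00010000").hex())      # CESU-8 form of U+10000
print("\U00010000".encode("utf-8").hex())    # UTF-8 form of U+10000
print(cesu8_encode("\U000F0000").hex())      # CESU-8 form of U+F0000
```

For BMP-only input, including U+0000, the output of cesu8_encode matches Python's built-in str.encode("utf-8") byte for byte; the two differ only once a supplementary character appears.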
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ http://is.gd/2kf0s