From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 13 2003 - 17:42:59 EDT
----- Original Message -----
From: "Rick McGowan" <rick@unicode.org>
To: <unicode@unicode.org>
Sent: Wednesday, August 13, 2003 10:15 PM
Subject: Convert UTF code update
> Following on a recent bug report, and to fix problems with the last public
> release, I have recently updated the "Convert UTF" sample code on the
> Unicode web site. You can find the latest "alpha" code here:
>
> http://www.unicode.org/Public/ALPHA/CVTUTF-1-1/
>
> There are some changes in "ConvertUTF.c" to better catch illegal
> sequences, and a one-line change in "harness.c" to fix a buffer problem
> what was independently reported by a few people.
>
> If you're a developer and you have a chance to look at this code and try
> the harness, I would appreciate any error reports.

I just noted the following fragment in ConvertUTF16toUTF8():
    /* Figure out how many bytes the result will require */
    if (ch < (UTF32)0x80) {             bytesToWrite = 1;
    } else if (ch < (UTF32)0x800) {     bytesToWrite = 2;
    } else if (ch < (UTF32)0x10000) {   bytesToWrite = 3;
    } else if (ch < (UTF32)0x200000) {  bytesToWrite = 4;
    } else {                            bytesToWrite = 2;
                                        ch = UNI_REPLACEMENT_CHAR;
    }
Shouldn't the line:
    } else if (ch < (UTF32)0x200000) {  bytesToWrite = 4;
instead say:
    } else if (ch < (UTF32)0x110000) {  bytesToWrite = 4;
so that it produces legal UTF-8 (according to the isLegalUTF8 function),
by not encoding beyond the first 17 planes of UCS-4 (i.e. what is currently
the only legal UTF-32 codespace)?
For now, the C fragment allows the first 32 planes of UCS-4 to be encoded
using the legacy UTF-8 scheme (old RFC version), which goes beyond what
UTF-16 can currently represent...
As long as there is no way in UTF-16 to go beyond the first 17 planes of
UCS-4, the extra planes should not be encodable there using the old UTF-8
rules.
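
The 17-plane limit follows directly from the surrogate arithmetic. The
sketch below uses a hypothetical helper, combine_surrogates, with assumed
stand-in typedefs, to show that the largest possible surrogate pair
(0xDBFF, 0xDFFF) yields 0x10FFFF, the last code point of plane 16.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t UTF32;   /* assumed stand-in typedefs */
typedef uint16_t UTF16;

/*
 * Hypothetical helper: combine a high surrogate (0xD800..0xDBFF) and a low
 * surrogate (0xDC00..0xDFFF) into a single code point. Each surrogate
 * contributes 10 bits, so the result ranges over 0x10000..0x10FFFF.
 */
static UTF32 combine_surrogates(UTF16 hi, UTF16 lo) {
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return (((UTF32)hi - 0xD800) << 10) + ((UTF32)lo - 0xDC00) + 0x10000;
}
```

Since no pair can produce a value of 0x110000 or more, a converter whose
input is UTF-16 never needs the range 0x110000..0x1FFFFF.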