Re: Convert UTF code update

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 13 2003 - 17:42:59 EDT

Next message: Jony Rosenne: "RE: Questions on ZWNBS - for line initial holam plus alef"

Previous message: Philippe Verdy: "Re: Questions on ZWNBS - for line initial holam plus alef"
In reply to: Rick McGowan: "Convert UTF code update"

----- Original Message -----
From: "Rick McGowan" <rick@unicode.org>
To: <unicode@unicode.org>
Sent: Wednesday, August 13, 2003 10:15 PM
Subject: Convert UTF code update

> Following on a recent bug report, and to fix problems with the last public
> release, I have recently updated the "Convert UTF" sample code on the
> Unicode web site. You can find the latest "alpha" code here:
>
> http://www.unicode.org/Public/ALPHA/CVTUTF-1-1/
>
> There are some changes in "ConvertUTF.c" to better catch illegal
> sequences, and a one-line change in "harness.c" to fix a buffer problem
> what was independently reported by a few people.
>
> If you're a developer and you have a chance to look at this code and try
> the harness, I would appreciate any error reports.

I just noted the following fragment in ConvertUTF16toUTF8():

        /* Figure out how many bytes the result will require */
        if (ch < (UTF32)0x80) {            bytesToWrite = 1;
        } else if (ch < (UTF32)0x800) {    bytesToWrite = 2;
        } else if (ch < (UTF32)0x10000) {  bytesToWrite = 3;
        } else if (ch < (UTF32)0x200000) { bytesToWrite = 4;
        } else {                           bytesToWrite = 2;
                                           ch = UNI_REPLACEMENT_CHAR;
        }

Shouldn't the line:
} else if (ch < (UTF32)0x200000) { bytesToWrite = 4;
say instead:
} else if (ch < (UTF32)0x110000) { bytesToWrite = 4;
so that it produces legal UTF-8 (according to the isLegalUTF8 function)
by not encoding anything beyond the first 17 planes of UCS-4 (i.e. the
only currently legal UTF-32 codespace)?
As written, the C fragment allows encoding the first 32 planes of UCS-4
(0x200000 = 32 x 0x10000) with the legacy UTF-8 scheme (old RFC version),
which goes beyond what UTF-16 can currently represent...
As long as there is no way in UTF-16 to go beyond the first 17 planes
of UCS-4 (0x110000 = 17 x 0x10000), the extra planes should not be
encodable there using the old UTF-8 rules.
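
To make the proposed bound concrete, here is a minimal standalone sketch of
the length computation with the 0x110000 limit. It is not the code from
ConvertUTF.c: the names sketch_utf8_length and SKETCH_REPLACEMENT_CHAR are
hypothetical, and it uses uint32_t where the sample code uses UTF32. It also
assumes three bytes for the replacement branch, since U+FFFD falls in the
0x800..0xFFFF range, whereas the fragment quoted above sets 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UNI_REPLACEMENT_CHAR (U+FFFD). */
#define SKETCH_REPLACEMENT_CHAR 0xFFFDu

/*
 * Returns how many UTF-8 bytes *ch needs when encoding is restricted to
 * the first 17 planes (0x000000..0x10FFFF). Anything at or above 0x110000
 * is replaced in place by U+FFFD. Surrogate handling is assumed to have
 * happened before this step and is not checked here.
 */
size_t sketch_utf8_length(uint32_t *ch)
{
    assert(ch != NULL);
    if (*ch < 0x80u)     return 1;
    if (*ch < 0x800u)    return 2;
    if (*ch < 0x10000u)  return 3;
    if (*ch < 0x110000u) return 4;   /* planes 1..16 only */
    *ch = SKETCH_REPLACEMENT_CHAR;   /* outside the legal codespace */
    return 3;                        /* U+FFFD takes three bytes */
}
```

With this bound, 0x10FFFF still yields four bytes, while 0x110000 and
anything above it (including values below 0x200000 that the current test
lets through) is replaced instead of being encoded.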


This archive was generated by hypermail 2.1.5 : Wed Aug 13 2003 - 18:27:17 EDT