Re: Convert UTF code update

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 13 2003 - 17:42:59 EDT


    ----- Original Message -----
    From: "Rick McGowan" <rick@unicode.org>
    To: <unicode@unicode.org>
    Sent: Wednesday, August 13, 2003 10:15 PM
    Subject: Convert UTF code update

    > Following on a recent bug report, and to fix problems with the last
    > public release, I have recently updated the "Convert UTF" sample code
    > on the Unicode web site. You can find the latest "alpha" code here:
    >
    > http://www.unicode.org/Public/ALPHA/CVTUTF-1-1/
    >
    > There are some changes in "ConvertUTF.c" to better catch illegal
    > sequences, and a one-line change in "harness.c" to fix a buffer
    > problem that was independently reported by a few people.
    >
    > If you're a developer and you have a chance to look at this code and
    > try the harness, I would appreciate any error reports.

    I just noted the following fragment in ConvertUTF16toUTF8():

        /* Figure out how many bytes the result will require */
        if (ch < (UTF32)0x80) { bytesToWrite = 1;
        } else if (ch < (UTF32)0x800) { bytesToWrite = 2;
        } else if (ch < (UTF32)0x10000) { bytesToWrite = 3;
        } else if (ch < (UTF32)0x200000) { bytesToWrite = 4;
        } else { bytesToWrite = 2;
            ch = UNI_REPLACEMENT_CHAR;
        }

    Shouldn't the line:
        } else if (ch < (UTF32)0x200000) { bytesToWrite = 4;
    say instead:
        } else if (ch < (UTF32)0x110000) { bytesToWrite = 4;
    so that it produces only legal UTF-8 (according to the isLegalUTF8
    function), by not encoding beyond the first 17 planes of UCS-4 (i.e.
    the only currently legal UTF-32 codespace)?
    For now, the C fragment allows the first 32 planes of UCS-4 to be
    encoded in the legacy UTF-8 scheme (old RFC version), which goes
    beyond what UTF-16 can currently represent.
    As long as there is no way in UTF-16 to go beyond the first 17
    planes of UCS-4, the extra planes should not be encodable there
    using the old UTF-8 rules.
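
    To make the suggestion concrete, here is a minimal sketch of the
    byte-count step with the tighter bound. It is not the code from
    ConvertUTF.c: the helper names utf8_bytes_for and combine_surrogates,
    the constant MAX_LEGAL_CODE_POINT, and the UTF32 typedef are my own
    assumptions for illustration. The second helper only shows why
    0x10FFFF is the largest value a UTF-16 surrogate pair can produce.
    Note that this sketch returns 3 for the replacement case, because
    U+FFFD itself falls in the three-byte range; the quoted fragment
    uses 2 there.

```c
#include <stdint.h>

typedef uint32_t UTF32;  /* assumption: a 32-bit unsigned code unit type */

#define UNI_REPLACEMENT_CHAR ((UTF32)0x0000FFFD)
#define MAX_LEGAL_CODE_POINT ((UTF32)0x0010FFFF)  /* hypothetical name: last code point of plane 16 */

/* Hypothetical helper: return the number of UTF-8 bytes needed for *ch.
 * Values beyond the 17 planes are replaced with U+FFFD instead of being
 * encoded as 4-byte sequences that isLegalUTF8-style checks would reject. */
static int utf8_bytes_for(UTF32 *ch)
{
    if (*ch < (UTF32)0x80)           return 1;
    if (*ch < (UTF32)0x800)          return 2;
    if (*ch < (UTF32)0x10000)        return 3;
    if (*ch <= MAX_LEGAL_CODE_POINT) return 4;  /* same as ch < 0x110000 */
    *ch = UNI_REPLACEMENT_CHAR;
    return 3;  /* U+FFFD encodes as EF BF BD */
}

/* Standard surrogate-pair arithmetic: the largest pair (0xDBFF, 0xDFFF)
 * yields 0x10FFFF, so UTF-16 input can never legitimately exceed it. */
static UTF32 combine_surrogates(UTF32 hi, UTF32 lo)
{
    return ((hi - (UTF32)0xD800) << 10) + (lo - (UTF32)0xDC00) + (UTF32)0x10000;
}
```

    With these definitions, combine_surrogates(0xDBFF, 0xDFFF) gives
    0x10FFFF, which still takes the 4-byte branch, while 0x110000 and
    anything above it is turned into the replacement character.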


