Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Dec 10 2003 - 01:13:03 EST


    Peter Kirk <peterkirk at qaya dot org> wrote:

    >> The "wcslen" has nothing whatsoever to do with the Unicode standard,
    >> but it has all to do with the *C* standard. And, according to the C
    >> standard, "wcslen" must simply count the number "wchar_t" array
    >> elements from the location pointed to by its argument up to the first
    >> "wchar_t" element whose value is L'\0'. Full stop.
    >
    > OK, as a C function handling wchar_t arrays it is not expected to
    > conform to Unicode. But if it is presented as a function available to
    > users for handling Unicode text, for determining how many characters
    > (as defined by Unicode) are in a string, it should conform to Unicode,
    > including C9.

    wcslen() is very definitely presented as a function for counting
    _code_units_. You can't even rely on it to count Unicode characters
    accurately, if a wchar_t is 16 bits long, because supplementary
    characters will require two code units (a high + low surrogate pair).
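
    To make that concrete, here is a small illustrative program (my own
    sketch, nothing from any standard) showing that wcslen() reports
    wchar_t code units, with results that depend on how wide wchar_t is
    on the platform in question:

    #include <stdio.h>
    #include <wchar.h>

    /* Sketch only: wcslen() counts wchar_t code units, not Unicode
       characters; the numbers printed depend on the platform's wchar_t
       width. */
    int main(void)
    {
        /* "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
           character, but two code units on any platform. */
        const wchar_t *decomposed = L"e\u0301";

        /* U+10400 is a supplementary character.  Where wchar_t is 16 bits
           (e.g. Windows), it is stored as a high + low surrogate pair and
           wcslen() reports 2; where wchar_t is 32 bits (e.g. most Unix
           systems), wcslen() reports 1. */
        const wchar_t *supplementary = L"\U00010400";

        printf("e + combining acute:     %u code units\n",
               (unsigned)wcslen(decomposed));
        printf("supplementary character: %u code units\n",
               (unsigned)wcslen(supplementary));
        return 0;
    }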

    Programmers rely on primitive functions like wcslen() to do what they do
    very rapidly, and not to change their meaning in new versions of the
    language standard. It would be very handy to have a suite of C
    functions that normalize their input string to any of the four Unicode
    normalization forms (NFC, NFD, NFKC, NFKD), or that compare strings or
    measure their length with normalization taken into account, but those
    would have to be all-new functions.
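
    Purely as an illustration of what such all-new functions might look
    like, here is a sketch of some possible prototypes. The names and
    signatures below are invented for this message; nothing like them
    exists in the C standard library:

    #include <stddef.h>
    #include <wchar.h>

    /* Hypothetical prototypes only -- invented names, not standard C. */

    /* Write the canonical (NFC) form of 'src' into 'dst', storing at most
       'dstsize' wchar_t elements including the terminator; return the
       number of elements the fully normalized result needs. */
    size_t wcsnorm_nfc(wchar_t *dst, const wchar_t *src, size_t dstsize);

    /* Compare two strings under canonical equivalence, i.e. as if both
       had first been normalized to the same form, so that "e" + U+0301
       compares equal to U+00E9. */
    int wcsnormcmp(const wchar_t *s1, const wchar_t *s2);

    /* Count user-perceived characters (grapheme clusters) rather than
       wchar_t code units. */
    size_t wcsgraphemelen(const wchar_t *s);

    (Libraries such as ICU already offer this kind of functionality
    through their own C APIs, but the standard library itself does not.)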

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


