Re: Counting characters or bytes in UTF-8?

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Tue Sep 12 2000 - 04:33:35 EDT


Yves Arrouye wrote:
>
> > 2. The original intent of strncpy() was to provide a means of copying both
> > bytes and characters. Since the assumption was 1 byte == 1 char, there was
> > no problem with this. In addition to the problem in #1, though, UTF-8
> > introduces these issues:
>
> I've always looked at the strxxx() functions as manipulating characters
> (strings of), and the memxxx() ones (memcpy, memcmp, actually bxxx() in my
> time) as manipulating bytes.

Unfortunately, the C Standard legislated it the other way round:
the count arguments of both the memxxx() *and* the strnxxx()
functions are clearly specified as byte counts, not as counts of
(multibyte) characters.
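
To illustrate the byte-count semantics, here is a minimal sketch. The
wrapper copy_bytes() is a hypothetical name made up for this example; only
strncpy() itself comes from the C library. The sketch assumes the source
string is UTF-8 and that dst has room for n + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper (illustration only): copy at most n BYTES of src
   into dst with strncpy(), then NUL-terminate.
   Assumption: dst points to at least n + 1 bytes. */
void copy_bytes(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    strncpy(dst, src, n);   /* n is a byte count, not a character count */
    dst[n] = '\0';          /* strncpy() does not terminate when src is longer than n */
}
```

With the UTF-8 string "\xC3\xA9t\xC3\xA9" (three characters in five bytes),
n = 2 yields just the first character, while n = 4 yields two characters
followed by a dangling lead byte 0xC3, that is, a character cut in half.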

As far as I know, all implementations with more-than-1-byte characters
(in practice the East Asian ones, and the European ones for the
Videotext codesets and the related T.51/T.61) take the short and easy
way and use byte counts. Some invented special supplementary functions
to deal with multibyte character counts, for example for dealing with
"widths".
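
As a rough idea of what such a supplementary function could look like for
UTF-8, here is a sketch. Both function names are hypothetical and belong to
no particular implementation, and the code assumes well-formed UTF-8 (it
does no validation). It relies only on the fact that UTF-8 continuation
bytes have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in a
   NUL-terminated string, assuming valid UTF-8. Continuation bytes
   (10xxxxxx) are not counted. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: number of bytes occupied by the first max_chars
   characters of s (or by all of s if it is shorter). The result is a
   byte count that can be passed to memcpy() or strncpy() without
   splitting a character. */
size_t utf8_prefix_bytes(const char *s, size_t max_chars)
{
    size_t bytes = 0;
    assert(s != NULL);
    while (s[bytes] != '\0') {
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {
            /* Start of a new character: stop if the quota is used up. */
            if (max_chars == 0)
                break;
            max_chars--;
        }
        bytes++;
    }
    return bytes;
}
```

For "\xC3\xA9t\xC3\xA9", utf8_char_count() gives 3 and
utf8_prefix_bytes(s, 2) gives 3, so a copy limited to 3 bytes keeps the
first two characters whole.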

Antoine


