Re: Counting characters or bytes in UTF-8?

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Tue Sep 12 2000 - 04:33:35 EDT

Next message: Michael Everson: "Re: TATAP => TATAR"
Previous message: John Hudson: "Re: Tamil glyphs"
Maybe in reply to: Lars Marius Garshol: "Counting characters or bytes in UTF-8?"

Yves Arrouye wrote:
>
> > 2. The original intent of strncpy() was to provide a means of copying both
> > bytes and characters. Since the assumption was 1 byte == 1 char, there was
> > no problem with this. In addition to the problem in #1, though, UTF-8
> > introduces these issues:
>
> I've always looked at the strxxx() functions as manipulating characters
> (strings of), and the memxxx() ones (memcpy, memcmp, actually bxxx() in my
> time) as manipulating bytes.

Unfortunately, the C Standard legislated otherwise: the count
arguments of both the memxxx() *and* the strnxxx() functions are
clearly specified as byte counts, not as counts of (multibyte)
characters.
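
To make the point concrete, here is a minimal sketch, assuming UTF-8 input. The helper name copy_n_bytes() is made up for this illustration; only strncpy() and strlen() come from the standard library. It shows that the n passed to strncpy() is a number of bytes, so a small n can cut a multibyte character in half.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most n BYTES of src into dst (whose
   capacity is dst_size bytes) using strncpy(), then terminate dst
   explicitly, since strncpy() does not do so when src is longer than n.
   Returns the number of bytes now stored in dst. */
size_t copy_n_bytes(char *dst, size_t dst_size, const char *src, size_t n)
{
    if (dst_size == 0)
        return 0;
    if (n > dst_size - 1)
        n = dst_size - 1;       /* leave room for the terminating NUL */
    strncpy(dst, src, n);       /* n is a byte count, not a character count */
    dst[n] = '\0';
    return strlen(dst);
}
```

With the three-character string "\xC3\xA9t\xC3\xA9" (five bytes in UTF-8), asking for n = 3 yields the first two characters, and asking for n = 1 yields only the lead byte 0xC3, which is not a valid character on its own.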

As far as I know, all implementations with more-than-1-byte characters
(in practice, the East Asian ones and the European ones for the
Videotext codesets and the related T.51/T.61) take the short and easy
way and use byte counts. Some invented special supplementary functions
to deal with multibyte character counts, for example for "widths".
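
For UTF-8 specifically, such supplementary functions might look like the sketch below. The names utf8_count() and utf8_prefix_bytes() are invented for this example and do not belong to any standard library. The sketch assumes a NUL-terminated, well-formed UTF-8 string, and it counts code points, not display widths or combining sequences.

```c
#include <stddef.h>

/* Hypothetical helper: count the code points in a well-formed UTF-8
   string. Every byte that is not a continuation byte (bit pattern
   10xxxxxx) starts a new character. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Hypothetical helper: return the largest byte length that is at most
   max_bytes and ends on a character boundary, so that a byte-counted
   copy of that length never splits a multibyte character. */
size_t utf8_prefix_bytes(const char *s, size_t max_bytes)
{
    size_t len = 0;

    /* Advance up to max_bytes, stopping early at the terminating NUL. */
    while (len < max_bytes && s[len] != '\0')
        len++;

    /* If the next byte is a continuation byte, the cut falls inside a
       character: back up to that character's lead byte. */
    while (len > 0 && ((unsigned char)s[len] & 0xC0) == 0x80)
        len--;

    return len;
}
```

The result of utf8_prefix_bytes() can be passed as the byte count to memcpy() or strncpy(), which keeps the byte-oriented functions usable while still cutting only between characters.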

Antoine

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:13 EDT