Re: Counting characters or bytes in UTF-8?

From: Yves Arrouye (yves@realnames.com)
Date: Tue Sep 12 2000 - 02:49:35 EDT


> 2. The original intent of strncpy() was to provide a means of copying both
> bytes and characters. Since the assumption was 1 byte == 1 char, there was
> no problem with this. In addition to the problem in #1, though, UTF-8
> introduces these issues:

I've always regarded the strxxx() functions as operating on strings of
characters, and the memxxx() ones (memcpy, memcmp; bxxx() in my day) as
operating on bytes.

I would thus try to avoid having functions like uni_strncpy() handle
anything but characters. Unfortunately, character-oriented APIs in UTF-8 are
not paragons of performance, since finding the nth character means scanning
the string from its start. It may therefore be better to provide a
byte-oriented API and some way to get the byte offset of the nth character
of a string, along with the opposite operation. I would then not call the
function uni_strncpy(), but maybe uni_bytescopy() or uni_memcpy(), to
minimize confusion.
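
To make the two offset operations concrete, here is a minimal sketch in C.
It assumes NUL-terminated, well-formed UTF-8 input and does no validation.
The names uni_char_to_byte_offset() and uni_byte_to_char_offset() are
hypothetical, chosen only for illustration; they are not part of any
existing library.

```c
#include <stddef.h>

/* In UTF-8, continuation bytes have the form 10xxxxxx; every other byte
   starts a new character. */
static int is_lead_byte(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Hypothetical helper: byte offset of the nth character (0-based) of s.
   If s holds n characters or fewer, returns the byte length of s. */
size_t uni_char_to_byte_offset(const char *s, size_t n)
{
    size_t i = 0;
    size_t chars = 0;

    while (s[i] != '\0') {
        if (is_lead_byte((unsigned char)s[i])) {
            if (chars == n)
                return i;
            chars++;
        }
        i++;
    }
    return i;
}

/* Hypothetical helper, the opposite operation: number of characters that
   begin before byte offset off. Scanning stops at the terminating NUL. */
size_t uni_byte_to_char_offset(const char *s, size_t off)
{
    size_t i;
    size_t chars = 0;

    for (i = 0; i < off && s[i] != '\0'; i++) {
        if (is_lead_byte((unsigned char)s[i]))
            chars++;
    }
    return chars;
}
```

With these, copying the first n characters reduces to a plain memcpy() of
uni_char_to_byte_offset(s, n) bytes, which keeps the copy itself
byte-oriented. Both helpers are linear in the length of the string, so a
caller doing many lookups would want to cache offsets rather than rescan.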

YA


