Counting characters or bytes in UTF-8?

From: Doug Ewell (dewell@compuserve.com)
Date: Mon Sep 11 2000 - 11:31:49 EDT


Lars Marius Garshol <larsga@garshol.priv.no> wrote:

> We have a uni_strncpy function name that is mapped to some function
> that performs the same task as the standard strncpy function and the
> name is mapped differently depending on platform and internal text
> encoding.
>
> The question is what the 'n' argument counts. In 16-bit mode it is
> obviously characters and in non-Unicode mode there is no distinction
> between bytes and characters. However, what do we count with UTF-8?
> My intuition tells me that it will be bytes, since the function will
> not be aware that it is processing UTF-8 at all.

The problem is that uni_strncpy() has no way to know what you want to
do with the (possibly truncated) target string. If you are displaying
it and want no more than 'n' characters, then you want a character
count. If you are copying to a fixed-byte-length buffer and want to be
sure not to overrun it, then you want a byte count.
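
To make the difference concrete, here is a minimal sketch in C. The name
utf8_char_count() is made up for illustration, and "character" here means
a code point, which is only one of several possible definitions. It
assumes the input is valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: counts code points in a valid, NUL-terminated
 * UTF-8 string by counting every byte that is not a continuation byte
 * (continuation bytes have the bit pattern 10xxxxxx).
 */
static size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For a string with one two-byte character among four ASCII letters,
strlen() reports 6 bytes while utf8_char_count() reports 5 characters,
so the two meanings of 'n' diverge as soon as the text leaves ASCII.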

You may want to make uni_strncpy() UTF-8-aware, so that 'n' counts
characters, and use memcpy() or the standard strncpy() when you are
interested in a byte count.
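
A rough sketch of what a UTF-8-aware copy might look like follows. The
function name utf8_copy_chars() and its signature are my own invention,
not an existing API, and the code assumes the source is valid,
NUL-terminated UTF-8. It takes both limits, a character count and a
buffer size in bytes, so the caller can state which one matters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper, not part of any library.
 * Copies at most max_chars code points from src into dst, writes at most
 * dst_size bytes including the terminating NUL, and never splits a
 * multi-byte sequence. Returns the number of bytes copied, not counting
 * the NUL.
 */
static size_t utf8_copy_chars(char *dst, size_t dst_size,
                              const char *src, size_t max_chars)
{
    size_t bytes = 0;   /* bytes copied so far */
    size_t chars = 0;   /* code points copied so far */

    assert(dst != NULL && src != NULL && dst_size > 0);

    while (src[bytes] != '\0' && chars < max_chars) {
        /* Length of this character: lead byte plus continuation bytes. */
        size_t len = 1;
        while (is_continuation((unsigned char)src[bytes + len]))
            len++;

        /* Stop if the whole character plus the NUL would not fit. */
        if (bytes + len >= dst_size)
            break;

        memcpy(dst + bytes, src + bytes, len);
        bytes += len;
        chars++;
    }
    dst[bytes] = '\0';
    return bytes;
}
```

Unlike strncpy(), this sketch always terminates the target and does not
pad it with NULs. If only the byte limit matters, pass a large max_chars;
the copy still stops on a character boundary rather than in the middle
of a sequence.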

-Doug Ewell
 Fullerton, California


