Re: Counting characters or bytes in UTF-8?

From: addison@inter-locale.com
Date: Wed Sep 13 2000 - 05:42:11 EDT


Hi Yves,

You're correct that the memxxx functions are the ones really intended for
moving bytes. My point was that the strxxx functions were designed (and
are often used) as if every character were exactly one byte. I guess I
wasn't very clear in my last post. Let me try again:

1. It makes sense to use bytes for addressing in a UTF-8 implementation,
since UTF-8 is a variable-width encoding whose code unit is an 8-bit
value. UTF-16 implementations typically use a 16-bit value for the same
reason.

2. Neither UTF-8 nor UTF-16 will have a "trivial" implementation, since
the functions must be aware of the underlying encoding scheme in order to
avoid destroying data.

  NB> Moving a pointer around is faster in UTF-x than in traditional
multibyte enabling, since finding a character boundary in "legacy" MBCS
typically requires a scan from the beginning of the string, whereas UTF-x
can use bitmasking and/or code point ranges.

  Also> You will end up needing more functions than the C library has,
since you will need functions for pointer movement and buffer allocation,
among other things (the standard C library doesn't provide such
functionality because of an implicit byte == char assumption). A common
mistake is to try to force an implementation to match the library 1:1
(or to force the implementers to write all the code to handle the other
requirements themselves).
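
To make the bitmasking and pointer-movement remarks in point 2 concrete,
here is a minimal sketch in C. It assumes well-formed, NUL-terminated
UTF-8, and every function name in it (utf8_is_trail, utf8_next, utf8_prev,
utf8_offset_of, utf8_index_of) is made up for illustration; none of them
comes from an existing library.

```c
#include <stddef.h>

/* A UTF-8 trail (continuation) byte always has the bit pattern 10xxxxxx. */
static int utf8_is_trail(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Advance to the start of the next character, or stay on the final NUL. */
static const char *utf8_next(const char *p)
{
    if (*p == '\0')
        return p;
    ++p;
    while (utf8_is_trail((unsigned char)*p))  /* NUL is not a trail byte */
        ++p;
    return p;
}

/* Step back to the start of the previous character, never before start.
   No scan from the beginning of the string is needed. */
static const char *utf8_prev(const char *start, const char *p)
{
    if (p == start)
        return p;
    --p;
    while (p > start && utf8_is_trail((unsigned char)*p))
        --p;
    return p;
}

/* Byte offset of the nth character (0-based). If the string has fewer
   than n characters, the byte length of the string is returned. */
static size_t utf8_offset_of(const char *s, size_t n)
{
    const char *p = s;
    while (n > 0 && *p != '\0') {
        p = utf8_next(p);
        --n;
    }
    return (size_t)(p - s);
}

/* The opposite operation: how many characters start before byte_offset. */
static size_t utf8_index_of(const char *s, size_t byte_offset)
{
    size_t count = 0, i;
    for (i = 0; i < byte_offset && s[i] != '\0'; ++i)
        if (!utf8_is_trail((unsigned char)s[i]))
            ++count;
    return count;
}
```

For the string "h\xC3\xA9llo" (an "e" with acute accent encoded as two
bytes), utf8_offset_of(s, 2) gives 3 and utf8_index_of(s, 3) gives 2.
"Character" here means a code point; combining sequences would need more
work on top of this.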

3. Creating a working Unicode string library to replace the standard C
library functions (or the STL in C++) is not a quick exercise, because
there are many important issues involved in supporting a code space of
over 1 million code points (not to mention the need for some of these
functions to be locale-aware too).
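
As one small illustration of what "aware of the underlying encoding"
means for a replacement function, here is a sketch of a byte-limited copy
that never cuts a character in half. The name utf8_copy_bytes is
hypothetical, the input is assumed to be well-formed, NUL-terminated
UTF-8, and (unlike strncpy) this version always terminates the
destination.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst, using at most dst_size bytes including the NUL.
   If the text has to be truncated, the cut is moved back to a character
   boundary so no partial UTF-8 sequence is left at the end.
   Returns the number of bytes copied, not counting the NUL. */
static size_t utf8_copy_bytes(char *dst, size_t dst_size, const char *src)
{
    size_t len, n;

    if (dst_size == 0)
        return 0;

    len = strlen(src);
    n = (len < dst_size - 1) ? len : dst_size - 1;

    if (n < len) {
        /* src[n] is the first byte that will be dropped. If it is a
           trail byte (10xxxxxx), the cut falls inside a character, so
           back up until the cut sits on that character's lead byte. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            --n;
    }

    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

With the source "a", a euro sign (three bytes), then "b", a 4-byte
destination receives just "a": the euro sign does not fit whole, so it is
dropped rather than split. This only protects code point boundaries;
keeping combining sequences together is one of the further issues a real
library would have to deal with.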

Best Regards,

Addison

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Mon, 11 Sep 2000, Yves Arrouye wrote:

> > 2. The original intent of strncpy() was to provide a means of copying both
> > bytes and characters. Since the assumption was 1 byte == 1 char, there was
> > no problem with this. In addition to the problem in #1, though, UTF-8
> > introduces these issues:
>
> I've always looked at the strxxx() functions as manipulating characters
> (strings of), and the memxxx() ones (memcpy, memcmp, actually bxxx() in my
> time) as manipulating bytes.
>
> I would thus try to avoid having functions like uni_strncpy() handle
> anything but characters. Unfortunately, character-oriented APIs in UTF-8 are
> not paragons of performance, so it may be better to provide a byte-oriented
> API and some way to get the byte offset of the nth character of string,
> along with the opposite operation. I would then not call the function
> uni_strncpy(), but maybe uni_bytescopy() or uni_memcpy(), to minimize
> confusion.
>
> YA
>
>


