Re: Non-ascii string processing?

From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Oct 05 2003 - 16:19:46 CST


Hi Doug,

Here are some thoughts.

> If you really aren't processing anything but the ASCII characters
> within
> your strings, like "<" and ">" in your example, you can probably get
> away with keeping your existing byte-oriented code. At least you won't
> get false matches on the ASCII characters (this was a primary design
> goal of UTF-8).

Yes, and in fact UTF-8 doesn't generate any false matches when
searching for a valid UTF-8 string within another valid UTF-8 string.

Even if there is UTF-8 between the < and >, the processing works
just fine.
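Here's a quick sketch of what I mean (plain C, my own example, assuming
well-formed UTF-8): a byte-level search for ASCII delimiters can never
land in the middle of a multi-byte sequence, because every byte of such
a sequence has its high bit set.

/* Quick sketch, not production code: byte search for ASCII delimiters
 * in UTF-8. Lead and continuation bytes are all 0x80..0xFF, so they
 * can never collide with an ASCII byte such as '<' or '>'. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "café" in UTF-8: the 'é' is the two bytes 0xC3 0xA9. */
    const char *text = "<tag>caf\xC3\xA9</tag>";

    const char *open  = strchr(text, '<');   /* plain byte search       */
    const char *close = strchr(text, '>');   /* never matches 0xC3/0xA9 */

    if (open && close && close > open)
        printf("first tag spans bytes %ld..%ld\n",
               (long)(open - text), (long)(close - text));
    return 0;
}

The same reasoning covers memchr, strstr with an ASCII needle, and so on.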

> However, if your goal is to simplify processing of arbitrary UTF-8
> text,
> including non-ASCII characters, I haven't found a better way than to
> read in the UTF-8, convert it on the fly to UTF-32, and THEN do your
> processing on the fixed-width UTF-32. That way you don't have to do
> one
> thing for Basic Latin characters and something else for the rest.

Well, I can do most processing just fine, as I said. I only have a
problem with lexical string processing (treating A and å as equivalent)
or spell checking. And lexical string processing is already so complex
that it probably won't make much difference whether I use UTF-32 or
UTF-8, because of combining characters and the like.
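For example (a rough sketch of my own, not anything from a real
library): 'å' is U+00E5, which is two bytes in UTF-8, and it also has a
decomposed form, so no 256-entry byte folding table can match it
against 'a'.

/* Sketch only: why a = å matching forces you out of pure byte mode.
 * The precomposed and decomposed forms render identically but differ
 * bytewise, so matching them needs decoding plus normalization. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *precomposed = "\xC3\xA5";    /* U+00E5                  */
    const char *decomposed  = "a\xCC\x8A";   /* U+0061 U+030A           */

    printf("byte lengths: %zu vs %zu\n",
           strlen(precomposed), strlen(decomposed));
    printf("bytewise equal? %s\n",
           strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
    return 0;
}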

> You will probably hear from some very prominent Unicode people that
> converting to UTF-16 is better, because "most" characters are in the
> BMP, for which UTF-16 uses half as much memory. But this approach
> doesn't really solve the variable-width problem -- it merely moves it,
> from "ASCII vs. non-ASCII" to "BMP vs. non-BMP." Unless you are
> keeping
> large amounts of text in memory, or are working with a small device
> such
> as a handheld, the extra size of UTF-32 compared to UTF-16 is unlikely
> to be a big problem, and you have the advantage of dealing with a
> fixed-width representation for the entire Unicode code space.

Unfortunately, I'm more concerned about the speed of converting the
UTF-8 to UTF-32 and back, because usually I can process my UTF-8 with
byte functions.
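The conversion pass you describe would look roughly like this (a sketch
of my own, assuming well-formed input, with error handling omitted);
every byte gets examined and branched on, which is exactly the overhead
I'd like to avoid:

/* Rough sketch: UTF-8 -> UTF-32, assuming well-formed input.
 * Returns the number of code points written to dst. */
#include <stdint.h>
#include <stddef.h>

size_t utf8_to_utf32(const unsigned char *src, size_t len, uint32_t *dst)
{
    size_t i = 0, out = 0;
    while (i < len) {
        unsigned char b = src[i];
        if (b < 0x80) {                       /* 1-byte sequence (ASCII) */
            dst[out++] = b;               i += 1;
        } else if ((b & 0xE0) == 0xC0) {      /* 2-byte sequence         */
            dst[out++] = ((uint32_t)(b & 0x1F) << 6)
                       |  (src[i+1] & 0x3F);  i += 2;
        } else if ((b & 0xF0) == 0xE0) {      /* 3-byte sequence         */
            dst[out++] = ((uint32_t)(b & 0x0F) << 12)
                       | ((uint32_t)(src[i+1] & 0x3F) << 6)
                       |  (src[i+2] & 0x3F);  i += 3;
        } else {                              /* 4-byte sequence         */
            dst[out++] = ((uint32_t)(b & 0x07) << 18)
                       | ((uint32_t)(src[i+1] & 0x3F) << 12)
                       | ((uint32_t)(src[i+2] & 0x3F) << 6)
                       |  (src[i+3] & 0x3F);  i += 4;
        }
    }
    return out;
}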

> All of this assumes that you don't have multi-character processing
> issues, like combining characters and normalization, or culturally
> appropriate sorting, in which case your character processing WILL be
> more complex than ASCII no matter which CES you use.

Yes. Actually, I still haven't seen any reason not to use
byte-oriented functions for UTF-8. Thanks for trying, though!

Maybe someone whose native language isn't English and who spends a lot
of time writing string processing code could help me with suggestions
for tasks that need character modes? (Like lexical processing where
a = å, or spell checking.)


