UTF-8s programming problems

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Tue Jun 12 2001 - 22:39:17 EDT


UTF-8s is reminiscent of a problem that I had installing a certain vendor's
terminals. Each screen was about 2K of data. The terminal communications
protocol broke the data into 128-byte chunks. Each block had a small header
and the terminal would wait for a response before the next block was sent.
My client was trying to connect using an X.25 link over a satellite link.
Each 128-byte block required two X.25 packets and the response required another
packet. X.25 also has its own pacing and response transmissions. Satellite
links take about 1.4 seconds to get from one location to another. The
actual calculations are complex but as you can see the round trip for 1/16th
of the data is about 3 seconds. The result was unusable.

One obvious programming problem is UTF-8s to UTF-32 conversion. But others
are more subtle. The major problem is that, like the example above, UTF-8s
encodes part of a character (one surrogate) as if it were a character. You
end up encoding a single character as two three-byte UTF-8 sequences, each
of which looks like a character of its own.
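
To make the byte layout concrete, here is a minimal Python sketch of what
"one surrogate encoded as a character" looks like. The helper name to_utf8s
is my own for illustration and is not part of any library; it assumes the
input is an ordinary string with no unpaired surrogates.

```python
def to_utf8s(text):
    """Encode text the UTF-8s way: each UTF-16 surrogate gets its own
    three-byte sequence. Hypothetical helper, for illustration only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split the supplementary code point into a surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" lets Python emit the three-byte form
                # of a lone surrogate, which strict UTF-8 forbids.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


ch = "\U00010400"              # one character outside the BMP
utf8 = ch.encode("utf-8")      # one four-byte sequence
utf8s = to_utf8s(ch)           # two three-byte sequences
print(utf8.hex(" "), "|", utf8s.hex(" "))
```

For U+10400 the UTF-8 form is F0 90 90 80, while the UTF-8s form is
ED A0 81 ED B0 80, that is, D801 and DC00 each encoded separately.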

I am currently working on code that supports UTF-8 and I am implementing a
function library for it.

Take the example of xiu8_strtok. It in turn calls xiu8_strspn and
xiu8_strpbrk. Each of these routines scans for delimiters using a set of
delimiters in the form of a UTF-8 character string. If each surrogate is
encoded separately, the scan will find a match for any character that shares
either surrogate with a delimiter. Now what happens? Suppose that when
scanning for the start of the first token we skip delimiters and we get a
match on the high surrogate of a pair but not the low surrogate. This means
that we start our first token string in the middle of our first character.
If the ending delimiter match is a low surrogate, we will replace that with
a null and terminate the token string with another half character.
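
The false match can be sketched in a few lines of Python. This is not the
xiu8_strspn code; span_delims and its helpers are hypothetical stand-ins
that treat every UTF-8 sequence as one character, which is the assumption
such a routine would make.

```python
def seq_len(lead):
    """Length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    return 2


def split_units(data):
    """Split a byte string into its UTF-8 sequences ("characters")."""
    units, i = [], 0
    while i < len(data):
        n = seq_len(data[i])
        units.append(data[i:i + n])
        i += n
    return units


def span_delims(data, delims):
    """strspn-like: count the leading bytes of data that consist only
    of characters found in delims."""
    delim_set = set(split_units(delims))
    count = 0
    for unit in split_units(data):
        if unit not in delim_set:
            break
        count += len(unit)
    return count


# Delimiter set: U+10400. Text: U+10401 followed by "abc".
# The two characters differ, but share the high surrogate D801.
skip_utf8 = span_delims(b"\xf0\x90\x90\x81abc", b"\xf0\x90\x90\x80")
skip_utf8s = span_delims(b"\xed\xa0\x81\xed\xb0\x81abc",
                         b"\xed\xa0\x81\xed\xb0\x80")
print(skip_utf8, skip_utf8s)
```

With real UTF-8 nothing is skipped. With UTF-8s three bytes are skipped,
so the token would begin at ED B0 81, a low surrogate on its own.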

If I am chunking UTF-8 data into a buffer to convert to UTF-16 and then
translate to a charset, I can break the UTF-8s code in the middle of a
character, and even if my UTF-8s to UTF-16 converter works, I will have
broken UTF-16 data.
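
Here is a rough Python sketch of that chunking problem. The function
chunk_on_sequence_boundaries is a hypothetical chunker of my own that is
careful never to cut inside a UTF-8 sequence; the 5-byte buffer size is
chosen only to force the split.

```python
def chunk_on_sequence_boundaries(data, size):
    """Cut data into chunks of at most size bytes, backing up so that
    no chunk ends in the middle of a UTF-8 sequence."""
    chunks, start = [], 0
    while start < len(data):
        end = min(start + size, len(data))
        # If the next byte is a continuation byte, we are mid-sequence.
        while end > start and end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            raise ValueError("buffer smaller than one sequence")
        chunks.append(data[start:end])
        start = end
    return chunks


def to_utf16_units(chunk):
    """Convert one UTF-8s chunk to UTF-16 code units. Valid for UTF-8s
    input only, where every sequence maps to exactly one code unit."""
    return [ord(c) for c in chunk.decode("utf-8", "surrogatepass")]


# "ab" followed by U+10400 in UTF-8s (D801 and DC00 as three bytes each).
utf8s_chunks = chunk_on_sequence_boundaries(b"ab\xed\xa0\x81\xed\xb0\x80", 5)
utf16_chunks = [to_utf16_units(c) for c in utf8s_chunks]

# The same text in real UTF-8 keeps the character in one piece.
utf8_chunks = chunk_on_sequence_boundaries(b"ab\xf0\x90\x90\x80", 5)
print(utf16_chunks, utf8_chunks)
```

The chunker does its job, yet the first UTF-16 buffer ends with the
unpaired high surrogate D801 and the second starts with DC00. With real
UTF-8 the same chunker backs up and keeps the four bytes together.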

Functions like xiu8_CharNext, xiu8_CharCnt, xiu8_CharLen etc. do not work.
I could go on but further examples are redundant.
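
As one small illustration, assuming a character count that simply counts
lead bytes (a common approach, though not necessarily what xiu8_CharCnt
does), the answer changes with the encoding. The name char_count is my own.

```python
def char_count(data):
    """Count characters by counting every byte that is not a UTF-8
    continuation byte (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


count_utf8 = char_count(b"\xf0\x90\x90\x80")            # U+10400 in UTF-8
count_utf8s = char_count(b"\xed\xa0\x81\xed\xb0\x80")   # U+10400 in UTF-8s
print(count_utf8, count_utf8s)
```

The same single character counts as one in UTF-8 and as two in UTF-8s.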

You end up breaking so much code just for a sorting sequence when comparing
UTF-16 in Unicode code point sequence is so easy to write.
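
For comparison, here is a sketch of one well-known way to compare UTF-16
in code point order: remap each code unit so that surrogates sort above
the rest of the BMP. The function names are mine, and this is only one
possible approach.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    raw = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


def fix_up(unit):
    """Remap a code unit so that unit order equals code point order:
    E000..FFFF moves down below the surrogates, D800..DFFF moves up."""
    if unit >= 0xE000:
        return unit - 0x800
    if unit >= 0xD800:
        return unit + 0x2000
    return unit


def code_point_order_key(s):
    """Sort key that orders UTF-16 strings by Unicode code point."""
    return [fix_up(u) for u in utf16_units(s)]


words = ["\U00010400", "\uff5e", "a"]
binary_order = sorted(words, key=utf16_units)
code_point_order = sorted(words, key=code_point_order_key)
print([w.encode("unicode_escape") for w in code_point_order])
```

In plain binary UTF-16 order U+10400 sorts before U+FF5E, because D801 is
less than FF5E. With the fix-up applied, U+FF5E comes first, as it does in
code point order.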

Carl


