A modest proposal for UTF-8s

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Wed Jun 13 2001 - 12:40:05 EDT


What do you really want? A UTF-8 like encoding with UTF-16 sorting.

The major problem with UTF-8s as it is currently proposed is that it
violates the basic tenet of all UTF encodings that you can determine
character length from the first part of the encoding. The same holds for
most character sets. With the current UTF-8s proposal, characters starting
with ED can be either 3 or 6 bytes long. Encodings like ISO-2022, where you
cannot do this, are almost worthless for anything other than data transport.
If you want to actually manipulate the data, you have to transform it to a
better encoding. Even strncpy-type functions should only copy complete
characters. There is very
little you can do with data without accurate character boundary information.
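
To make the ED ambiguity concrete, here is a minimal Python sketch. It
assumes the proposal encodes each UTF-16 code unit as its own UTF-8-style
sequence, so a supplementary character becomes two 3-byte surrogate
sequences. The function names are illustrative and not part of any proposal
or library.

```python
def utf8_length_from_lead(lead: int) -> int:
    """Sequence length implied by a standard UTF-8 lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {lead:#04x}")


def encode_utf8s_as_proposed(text: str) -> bytes:
    """Encode every UTF-16 code unit as a separate UTF-8-style sequence.

    Assumption: this mirrors the UTF-8s proposal under discussion.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form for D800..DFFF.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp_char = encode_utf8s_as_proposed("\ud7ff")       # one BMP character
supp_char = encode_utf8s_as_proposed("\U00010000")  # one supplementary character

print(bmp_char.hex(" "), len(bmp_char))
print(supp_char.hex(" "), len(supp_char))
```

Both encodings begin with ED, yet the first character occupies 3 bytes and
the second 6, so the lead byte alone no longer tells you where the character
ends.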

If you have a UTF-8s you will also need a UTF-32s. Most of the newer
wchar_t implementations are using 4-byte wide characters. So it makes more
sense to design the UTF-32s first. If we make the assumption that UTF-32s
will take all code points above U+DFFF and shift them after plane 16 then we
have a way to encode them as a single UTF character.

  UTF-16    UTF-32s     UTF-8s

  E000      00110000    F4908080
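
A short Python sketch of this mapping follows. It assumes the shift is a
constant offset of 0x102000 applied only to U+E000..U+FFFF (inferred from
the single row above), and that the result is then written with the
ordinary UTF-8 bit layout. The names to_utf32s, encode_utf8_pattern and
utf8s are hypothetical.

```python
SHIFT = 0x110000 - 0xE000  # 0x102000, inferred from the table row


def to_utf32s(cp: int) -> int:
    """Map a Unicode scalar value to its hypothetical UTF-32s value."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    # Only U+E000..U+FFFF moves; it lands just past plane 16.
    return cp + SHIFT if 0xE000 <= cp <= 0xFFFF else cp


def encode_utf8_pattern(value: int) -> bytes:
    """Apply the UTF-8 bit layout to a value of at most 21 bits."""
    if value < 0x80:
        return bytes([value])
    if value < 0x800:
        return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    if value < 0x10000:
        return bytes([0xE0 | (value >> 12),
                      0x80 | ((value >> 6) & 0x3F),
                      0x80 | (value & 0x3F)])
    return bytes([0xF0 | (value >> 18),
                  0x80 | ((value >> 12) & 0x3F),
                  0x80 | ((value >> 6) & 0x3F),
                  0x80 | (value & 0x3F)])


def utf8s(cp: int) -> bytes:
    """Encode one code point in the single-sequence UTF-8s sketched here."""
    return encode_utf8_pattern(to_utf32s(cp))


print(f"{to_utf32s(0xE000):08X}", utf8s(0xE000).hex().upper())

# Byte order of this encoding versus UTF-16 code unit order.
sample = [0xFFFF, 0x10000, 0xD7FF, 0x10FFFF, 0xE000, 0x41]
by_utf8s = sorted(sample, key=utf8s)
by_utf16 = sorted(sample, key=lambda cp: chr(cp).encode("utf-16-be"))
print(by_utf8s == by_utf16)
```

Under these assumptions every character is a single sequence whose length
follows from its lead byte, and a plain byte comparison gives the UTF-16
ordering for the sample. The shifted characters use F4 90 and above, which
standard UTF-8 does not allow, so the result is still not interchangeable
with UTF-8.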

When should you use UTF-8s? If the goal is to have UTF-8s used internally in
a product like Oracle, so that it sorts the same as UTF-16 and the data will
actually be retrieved in UTF-16 sequence, there is no issue because the
encoding will only be used internally. If the UTF-8s is transformed to
UTF-8 for I/O, again we have no issues.

If the user actually wants UTF-8s input and output streams then the question
is why? Why should it look anything like UTF-8? It is not interchangeable
with UTF-8. You cannot send it to a browser or even use the UTF-8 string
handling routines to manipulate the data. Any OS UTF-8 functions you try to
use with UTF-8s will not work.

If you intend to cheat and say that you will limit your characters to
plane 0, this is not only a GROSS VIOLATION of the standard but,
ironically, it makes the argument for UTF-8s go away, because without
non-plane-0 characters UTF-8 and UTF-16 sort the same.
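
The sorting claim is easy to check. The sketch below compares binary sort
order under standard UTF-8 and UTF-16 (big-endian, so that byte order
matches code unit order); the sample strings are arbitrary choices of mine.

```python
def utf8_order(strings):
    """Sort by the raw bytes of the standard UTF-8 encoding."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def utf16_order(strings):
    """Sort by UTF-16 code units (big-endian bytes compare the same way)."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))


bmp_only = ["\uff5e", "A", "\u00e9", "\ud7ff", "\ue000"]
with_supplementary = bmp_only + ["\U00010000"]

same_for_bmp = utf8_order(bmp_only) == utf16_order(bmp_only)
same_with_supp = (utf8_order(with_supplementary)
                  == utf16_order(with_supplementary))

print(same_for_bmp)    # plane 0 only
print(same_with_supp)  # one supplementary character added
```

With plane 0 data only, the two orders agree. Adding a single supplementary
character makes them differ: UTF-8 sorts U+10000 after U+FF5E, while UTF-16
sorts its surrogate pair before U+E000.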

The big question is what good is UTF-8s as proposed? What can you do with
it? Why would you want it?

The problem, as I picture it, is a situation like this. You have a Sun
Solaris server with an Oracle database. You know that the wchar_t
implementation is not Unicode so you would like to use the UTF-8 services.
You figure that UTF-8s will look enough like UTF-8 that it will fool the
UTF-8 to UTF-16 converter. You ship the data as UTF-16 to your Windows
client and everything works. The problem is that you need the same sort
sequences on your client code as your database.

What is missing is that the UTF-8 services will break with UTF-8s data. In
fact you will be just as messed up using the Sun wide character support
with your data in UTF-16 encoding as you will be using UTF-8s, so you have
little choice in the matter. At this stage of the game the only real
solution is a cross-platform Unicode support package like ICU. This is
why I am dedicating man-months of pro bono work to make ICU easier to
implement for both new and existing applications. We really don't need a
new encoding; we need good software to implement what we have. After more
than 15 years of fighting code pages, I see Unicode as the only way to go.
I will do what I can to see Unicode truly succeed.

Carl


