UTF-16 problems

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Jun 11 2001 - 15:54:31 EDT


I think we all recognize that UTF-16 has a problem: a binary sort of its
16-bit code units does not naturally follow Unicode code point order. It
would have been nice if the surrogate codes had been placed at the end of
the sort order.
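
A small Python sketch makes the sort problem concrete. The helper name
utf16_units and the two sample characters are illustrative choices only.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+FF5E sits near the top of the BMP; U+10000 is the first supplementary
# code point and is encoded in UTF-16 as the surrogate pair D800 DC00.
samples = ["\uff5e", "\U00010000"]

by_code_point = sorted(samples)                  # Python compares str by code point
by_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(samples, key=utf16_units)

print(by_code_point == by_utf8)    # True: UTF-8 bytes keep code point order
print(by_code_point == by_utf16)   # False: D800 sorts before FF5E
```

Any BMP character in U+E000..U+FFFF shows the same inversion against any
supplementary character.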

However, if we try to look at the problem from different perspectives we
might come up with an alternate solution.

Since we cannot reassign code points, we might instead assign alternate code
points. The private use area, for example, could be assigned an alternate
private area in another plane. Users could assign these code points to their
private characters. Most of the remaining code points are presentation
forms and compatibility encodings. If I have a special presentation form,
duplicating the same presentation form in another plane should not present a
problem. Compatibility characters are more difficult but also represent
alternate forms of the same character. Half-width and full-width characters
are also alternate forms.

The big problem is the BOM. The BOM (U+FEFF) serves two functions. It is
both a zero width no-break space character and a byte order mark.

How will this be implemented? The UTF-16x encoding would shift these high
characters to an alternate plane. They would be legitimate characters in
these positions. For UCS-2 compatibility they can be shifted back to UCS-2.

If we use the end of the last plane we can maintain sort order compatibility
with UTF-32 and UTF-8.
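
One way to picture the shift is the Python sketch below. It assumes, purely
for illustration, that U+E000..U+FFFF move up by 0x100000 to
U+10E000..U+10FFFF at the end of plane 16. The exact target range is not
fixed here, and that range already holds private use code points, so a real
design would have to resolve the collision. The names shift_up, shift_down
and utf16x_units are hypothetical.

```python
SHIFT_LOW, SHIFT_HIGH = 0xE000, 0xFFFF   # BMP range above the surrogates
SHIFT_OFFSET = 0x100000                  # assumed target: end of plane 16


def shift_up(text):
    """Move U+E000..U+FFFF to the assumed alternate range."""
    return "".join(
        chr(ord(c) + SHIFT_OFFSET) if SHIFT_LOW <= ord(c) <= SHIFT_HIGH else c
        for c in text
    )


def shift_down(text):
    """Undo shift_up, e.g. for an application that only handles UCS-2."""
    low, high = SHIFT_LOW + SHIFT_OFFSET, SHIFT_HIGH + SHIFT_OFFSET
    return "".join(
        chr(ord(c) - SHIFT_OFFSET) if low <= ord(c) <= high else c
        for c in text
    )


def utf16x_units(text):
    """Code units of the hypothetical UTF-16x form: shift, then plain UTF-16."""
    data = shift_up(text).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


samples = ["\uff5e", "\U00010000", "a", "\ue000"]
# After the shift no single code unit is above D7FF, so a binary sort of the
# code units agrees with a code point sort of the shifted text.
print(sorted(samples, key=utf16x_units) == sorted(samples, key=shift_up))  # True
```

The sketch does not treat U+FEFF specially, and shift_down is only lossless
when the input contains no characters that were already in the assumed
target range.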

The BOM will be different. A BOM at the start of a block has to be
considered a BOM. For comparisons, BOMs will either be removed or
compare equal, so their sort sequence does not matter. You will have to
continue to encode a BOM the same way as before, but this single exception
should not create problems.
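
A minimal Python sketch of the "remove the BOM before comparing" option,
working on already-decoded text. The function names are illustrative
assumptions, not part of any existing API.

```python
BOM = "\ufeff"


def strip_leading_bom(text):
    """Drop a single U+FEFF at the start of a block; leave any others alone."""
    return text[1:] if text.startswith(BOM) else text


def compare_key(text):
    """Sort key under which a leading BOM never affects the ordering."""
    return strip_leading_bom(text)


blocks = ["\ufeffbeta", "alpha", "\ufeffalpha"]
print([strip_leading_bom(b) for b in sorted(blocks, key=compare_key)])
# ['alpha', 'alpha', 'beta']
```

Only the leading U+FEFF is treated as a byte order mark; one in the middle
of a block is kept as an ordinary character.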

I think that UTF-16x would be a better approach than UTF-8s. I am sure that
I have missed some issues, so feel free to comment. In any case, UTF-16x
would naturally be in Unicode code point order. It would be easy to
transform to UCS-2 for applications that do not support UTF-16.

Carl


