RE: UTF-16 problems

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Jun 11 2001 - 20:37:11 EDT


Michka,

I guess that we can agree to disagree. I can see that, if for nothing else,
having UTF-16 sort in Unicode code point order under a simple binary
comparison has real performance advantages. You don't see it much in C code,
but some assembler implementations can really benefit. For example, on an IBM
370/390 you can compare up to 16MB with a single machine instruction. Having
to adjust for surrogate sequences on each character adds significant
overhead.
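
To make the overhead concrete, here is a rough C sketch (the function names
are mine, not from any particular library) of what adjusting on every
comparison looks like. A plain binary compare of the code units can be a
single instruction on the hardware above; comparing in code point order means
decoding surrogate pairs as you go:

#include <stdint.h>

/* Sketch only: read the next code point from a NUL-terminated UTF-16
 * string, combining a lead/trail surrogate pair into one supplementary
 * code point. */
static uint32_t next_cp(const uint16_t **p)
{
    uint32_t c = *(*p)++;
    if (c >= 0xD800 && c <= 0xDBFF && **p >= 0xDC00 && **p <= 0xDFFF) {
        /* lead + trail surrogate -> supplementary code point */
        c = 0x10000 + ((c - 0xD800) << 10) + (*(*p)++ - 0xDC00);
    }
    return c;
}

/* Compare two UTF-16 strings in Unicode code point order.  A plain
 * code-unit (binary) comparison gives a different answer whenever one
 * string has a lead surrogate (0xD800..0xDBFF) where the other has a BMP
 * code unit in 0xE000..0xFFFF. */
int utf16_cmp_codepoint_order(const uint16_t *a, const uint16_t *b)
{
    for (;;) {
        uint32_t ca = next_cp(&a), cb = next_cp(&b);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;   /* both strings ended at the same point */
    }
}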

I am proposing that we fix UTF-16. Probably the most common use of these
code points is for hankata (half-width katakana). If the application does
not have UTF-16 support, it will work as it does now. UTF-16 applications
will translate either code point to the code page katakana character. Going
back to UTF-16, it will use the high-end surrogate form. If it sends the
data to a system that does not have UTF-16 support, it will have to convert
by shifting the characters to the alternate code points (UTF-16 to UCS-2).
The UTF-8 representation will be 4 bytes rather than 3, and UTF-16 will
require 2 code units rather than 1.
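
For reference, here is a sketch of the standard surrogate-pair arithmetic
behind the "2 code units rather than 1" above (this is the existing UTF-16
rule, not the remapped code point values of my proposal, which I have not
spelled out here; the function name is just for illustration):

#include <stdint.h>

/* Encode one code point as UTF-16.  A BMP code point takes one 16-bit
 * unit; a supplementary code point (U+10000..U+10FFFF) is split into a
 * lead/trail surrogate pair and takes two.  Returns the number of units
 * written (out must have room for 2). */
int encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {                          /* BMP: 1 code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* supplementary: 2 units */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* lead (high) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* trail (low) surrogate */
    return 2;
}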

UTF-16 fonts will have to map both code points. In other words, it adds a
little more overhead, but it eliminates the need for either UTF-8s or
UTF-32s. Providing two code positions for the same character is a requirement
of UTF-32s anyway, so the impact of this change is far less than splitting
UTF-8 into two incompatible systems. This one is at least interchangeable.

The real beauty of this system is that even when converting from UTF-16 to
UCS-2, the UCS-2 applications will still sort in the same relative order as
the UTF-16, UTF-8, and UTF-32 applications. You will produce different UTF-8
sequences because they will correspond to the new UTF-16 code points, so
there will be some minor changes to the code that converts from UCS-2 to
UTF-8. But even if that code is not adjusted, you will still get valid UTF-8;
it just will not sort properly. That is certainly preferable to a broken
encoding.
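
And for the UTF-8 side, a sketch of the standard encoder (again, the name is
mine) showing why a BMP character takes 3 bytes while a supplementary
character takes 4, and why either form is still well-formed UTF-8 -- the
bytes just land in different lead-byte ranges, which is what moves the binary
sort position:

#include <stdint.h>

/* Encode one code point as UTF-8; returns the byte count written to out
 * (out must have room for 4).  No validation of cp (surrogate range,
 * values above 0x10FFFF) in this sketch. */
int encode_utf8(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) {                                  /* ASCII: 1 byte */
        out[0] = (uint8_t)cp;
        return 1;
    } else if (cp < 0x800) {                          /* 2 bytes, lead 0xC2..0xDF */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {                        /* BMP, e.g. katakana: 3 bytes, lead 0xE0..0xEF */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    } else {                                          /* supplementary: 4 bytes, lead 0xF0..0xF4 */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
}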

Carl

-----Original Message-----
From: Michael (michka) Kaplan [mailto:michka@trigeminal.com]
Sent: Monday, June 11, 2001 3:47 PM
To: Carl W. Brown; unicode
Subject: Re: UTF-16 problems

From: "Carl W. Brown" <cbrown@xnetinc.com>

> At first I thought the same thing, but I have changed my mind. There are
> problems, but the problems are with UTF-16, not UTF-8. I don't think that I
> am the only one who thinks that UTF-8s will create more problems than it
> fixes.
>
> Worse yet, they will have to "fix" UTF-32 as well.
>
> The point of this message is to fix UTF-16, which is the source of the
> problem. These changes are no more of a stretch than UTF-32s. The UTF-32s
> proposal that I heard involves replicating the same code points to get
> these code points to sort high like UTF-16.
>
> What this does is legitimize the code point shift for UTF-16, UTF-8,
> and UTF-32 so that the transforms all work, all sort the same, and the
> binary sort and Unicode sort orders are the same.
>
> It does involve a minor normalization transform, but you have to do that
> for UTF-32s anyway, and UTF-32s is required if you allow support of UTF-8s.
> The big difference is that you don't change any UTF protocols or develop
> two mutually exclusive transforms that are so similar that they might be
> confused. Besides, this transform keeps UTF-8 to 4 bytes, not 6, and will
> work with the existing UTF-8 software.
>
> The beauty of this proposal is that UCS-2 (plane 0 only) codes will sort
> in the same order as the post-transform UTF-16 codes.

Carl,

I would agree with you except for one thing... no one needs this to solve
their implementation issues! Why would everyone want to turn around and
change all their implementations, including the lazy folks who are asking
others to change for their sake, to support something that no one wants to
do?

The whole UTF-8S mess is a bunch of people asking for a lascivious license to
be lecherously lazy (they should have called it UTF-8L in effigy). No one is
interested in doing a bunch of work here:

1) There is the group of people who took responsibility for their
implementations at some point in the last seven years to properly support
supplementary characters. They do not want to do any extra work since their
implementations already work just fine.

2) There is the group of people who are scrambling around trying to get
their laziness canonized as the forward-looking savior of a solution that
all of us were too foolish to realize is vital -- they do not want to do any
work either (except marketing work to convince everyone how right they are).

3) There is the group of people who can't believe how far this has come.

I understand the technical merit of the suggestion, and it is technically
superior to the UTF-8S plan (this is of course not saying much, but your
plan is well thought out!). The problem is that this is a solution that is
looking for a problem.

The only people who have the problem are the ones who were not thinking
ahead, and they do not want to throw away their current solution; they are
too in love with it.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/


