RE: Playing with Unicode (was: Re: UTF-17)

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Tue Jun 26 2001 - 13:38:13 EDT


Peter,

>
> >6) UTF-16X (also named UTF-16S or UTF-16F) is definitely humor,
> although I
> >am probably not the only one to think that it is technically more
> "serious"
> >than UTF-8S.
>
> I didn't get the impression that it was presented with humour in mind. I
> didn't read the original message in which it was introduced carefully,
> however. But it seems to me that it did get picked up as a serious
> possibility for consideration / comparison in the UTF-8s discussion.
>
I was very serious when I proposed utf-16x. The proposal was that we
replicate all of the characters at the end of plane 0 in plane 16 so that
all sorts would be equal (the lesser of two evils). Fortunately, much of
that range is the private use area, which can be relocated because it is
not our responsibility to maintain. Most of the rest are various
presentation forms, not primary characters.

Also I do not like 's' protocols and feel that 'x' protocols are far
superior by nature.

Carl W. Brown
President X.Net, Inc.

P.S.

In all seriousness all I propose at this time is that we leave the end of
Plane 16 open for now.

My "serious" proposal was meant to demonstrate that we can develop protocols
that are better than encoding UTF-16 code units in UTF-8 and then having to
double decode to get anything to work. I was willing to bet almost anything
that utf-16x would never get adopted, because the real reason that utf-8s
was proposed as a standard is that it is already supported by the existing
code, not because it is the best solution.
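
To make the "double decode" concrete, here is a minimal Python sketch. It
assumes utf-8s means "encode each UTF-16 code unit, surrogates included, as
its own UTF-8 sequence"; the function names are invented for illustration
and come from no product or proposal text.

```python
def encode_utf8s(text: str) -> bytes:
    """Encode each UTF-16 code unit as its own UTF-8 sequence (assumed utf-8s)."""
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    # 'surrogatepass' lets Python emit the 3-byte forms for D800..DFFF,
    # which a strict UTF-8 encoder would refuse to produce.
    return "".join(map(chr, units)).encode("utf-8", "surrogatepass")


def decode_utf8s(data: bytes) -> str:
    """Decode in two passes: bytes to 16-bit units, then units to characters."""
    # Pass 1: each 3-byte sequence becomes one (possibly surrogate) unit.
    units = data.decode("utf-8", "surrogatepass")
    # Pass 2: pair the surrogates up by round-tripping through UTF-16.
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


if __name__ == "__main__":
    ch = "\U00010000"
    print(ch.encode("utf-8").hex())      # plain UTF-8: one 4-byte sequence
    print(encode_utf8s(ch).hex())        # assumed utf-8s: two 3-byte sequences
    print(decode_utf8s(encode_utf8s(ch)) == ch)
```

A strict UTF-8 decoder rejects the six-byte form, which is why the second
pass through UTF-16 cannot be avoided.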

In a sense utf-16x was serious, but I developed it to take the air out of
the arguments that the utf-8s proponents had for maintaining sort order. I
still think that if you compare UTF-16 in code point order then you are
providing the optimal use of Unicode. If Oracle had wanted to, they could
have used a modified UTF-16 key and maintained binary code point order.
UTF-8 is already in code point order. As long as they used UTF-8 they had
no problems. The problems came with the introduction of UTF-16 support in
Oracle 9.x. They could have implemented an AL32UTF16 that shifts the
surrogates to the end and shifts the high-end plane 0 characters down.
Retrieval would reverse the process so that the data would come out in
valid UTF-16 encoding, but the database would sort in Unicode code point
order.
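
A rough sketch of that shift, in Python. The offsets are my assumption of
how such a key could be built (surrogates D800..DFFF moved up to F800..FFFF,
E000..FFFF moved down to D800..F7FF); "AL32UTF16" does not exist, and the
function names are hypothetical.

```python
def to_sort_key(text: str) -> bytes:
    """Remap UTF-16 units so that a binary compare gives code point order."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit >= 0xE000:
            unit -= 0x0800      # high-end plane 0 characters move down
        elif unit >= 0xD800:
            unit += 0x2000      # surrogates move to the end
        out += unit.to_bytes(2, "big")
    return bytes(out)


def from_sort_key(key: bytes) -> str:
    """Reverse the remapping so retrieval yields valid UTF-16 again."""
    out = bytearray()
    for i in range(0, len(key), 2):
        unit = int.from_bytes(key[i:i + 2], "big")
        if unit >= 0xF800:
            unit -= 0x2000      # back to D800..DFFF
        elif unit >= 0xD800:
            unit += 0x0800      # back to E000..FFFF
        out += unit.to_bytes(2, "big")
    return out.decode("utf-16-be")


if __name__ == "__main__":
    words = ["\uFFFD", "\U00010000", "\uE000", "A", "\U0010FFFF", "\uD7FF"]
    by_key = sorted(words, key=to_sort_key)
    by_raw = sorted(words, key=lambda w: w.encode("utf-16-be"))
    # Python compares str values by code point, so sorted(words) is the target.
    print(by_key == sorted(words), by_raw == sorted(words))
```

Both functions touch only units at D800 and above, so everything below the
surrogate range is stored unchanged.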

I am fairly sure that they have no clients with Oracle 8.x databases that
hold utf-8s encoded non-plane-0 characters. I warned my clients not to use
surrogates with Oracle 8.x databases. I also cannot believe that they would
be so short-sighted as not to develop a full UTF-8 encoder. If MS can put
surrogate support into Windows 2000, then they can put it into Oracle 8.0.
I am sure that the development of Oracle 8.0 started much later than NT 5.0.

I am very serious that I do not want a UTF-xx encoding in which we cannot
detect the character length from the first encoding value. I have been
through enough MBCS problems to know that character integrity is very
important.
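
For illustration, this is the property in question, sketched in Python. It
assumes UTF-8 limited to four bytes per character (the range needed for
planes 0 through 16); the function names are made up for this example.

```python
def utf8_length(lead: int) -> int:
    """Length in bytes of a UTF-8 character, taken from its first byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 character")


def utf16_length(lead: int) -> int:
    """Length in 16-bit units of a UTF-16 character, from its first unit."""
    if 0xD800 <= lead <= 0xDBFF:
        return 2                     # high surrogate: one more unit follows
    if 0xDC00 <= lead <= 0xDFFF:
        raise ValueError(f"0x{lead:04X} is a low surrogate, not a first unit")
    return 1


if __name__ == "__main__":
    for ch in ["A", "\u00E9", "\u20AC", "\U00010000"]:
        data = ch.encode("utf-8")
        print(hex(ord(ch)), utf8_length(data[0]), len(data))
```

With surrogates encoded as separate 3-byte sequences, the lead byte 0xED
reports three bytes even though the whole character occupies six, and that
is exactly the loss of character integrity I object to.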

The bottom line is that since you cannot use UTF-8 functions with utf-8s,
why not use UTF-16? You have to decode utf-8s into UTF-16 before you can
use it. You will notice that I never got an answer. End of story.

Carl



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:19 EDT