UTF8 vs AL32UTF8

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Fri Jun 08 2001 - 17:09:14 EDT


Jianping,

I will change the subject.

>For Oracle's naming convention, we will consider your concern here. But given
>the clear definition and recommendation, I don't think user will be confused.

Is there other documentation that I am missing?

Looking at your documentation, you call UTF-8s UTF8 and standard UTF-8
AL32UTF8. To me this is very misleading.

Suppose you build your Unicode database as UTF8. You start using the
data for a web application. What happens when you send UTF-8s data to a web
browser? It will work most of the time but will give you funny results from
time to time. This could create a difficult bug for people to find.
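
A minimal sketch of that failure mode, using Python's strict UTF-8 decoder as a stand-in for a conforming decoder (an assumption; what a given browser actually does with such bytes may differ). The six-byte sequence is the one quoted later in this thread for U+10000; the helper name is hypothetical.

```python
# Six-byte form quoted in this thread for U+10000: each UTF-16 surrogate
# (D800, DC00) is encoded separately as a three-byte sequence.
utf8s_bytes = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

# Standard UTF-8 encodes the same character in four bytes.
utf8_bytes = "\U00010000".encode("utf-8")


def decodes_as_strict_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(" "))
print(decodes_as_strict_utf8(utf8_bytes))
print(decodes_as_strict_utf8(utf8s_bytes))
```

Data that stays inside the Basic Multilingual Plane is byte-for-byte identical in both forms, which is why the problem would only show up from time to time.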

Worse yet you have a client who has a Unicode database encoded as UTF8.
They have mostly in-house processing. Now they want to use the data in a
web application. There is no charset for UTF-8s. Will you provide a
converter to convert from UTF-8 to UTF-8s and back? If so, would it not be
easier to provide a code point order compare routine for UTF-16?
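
To illustrate how small such a routine can be, here is a sketch in Python. It is not Oracle's code or any library's API; the function names are hypothetical. It relies on a well-known fix-up: at the first differing code unit, surrogates (D800..DFFF) are shifted above E000..FFFF so that supplementary characters sort after all BMP characters. It assumes well-formed UTF-16 input.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of a string as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def _fixup(unit: int) -> int:
    """Remap a code unit so unit order matches code point order."""
    if unit >= 0xE000:
        return unit - 0x800    # E000..FFFF moves down to D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # surrogates move up to F800..FFFF
    return unit


def compare_code_point_order(a: list, b: list) -> int:
    """Compare two UTF-16 code unit sequences; return -1, 0, or 1."""
    for ua, ub in zip(a, b):
        if ua != ub:
            return -1 if _fixup(ua) < _fixup(ub) else 1
    # One sequence is a prefix of the other: the shorter sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


bmp = utf16_units("\uFF5E")            # a BMP character above the surrogate range
supp = utf16_units("\U00010000")       # a supplementary character (surrogate pair)
print(bmp > supp)                      # plain code unit order
print(compare_code_point_order(bmp, supp))
```

The fix-up only runs at the first mismatch, so the cost over a plain binary compare is a couple of extra comparisons per call.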

Since the issue is sorting sequences there might be another alternative.
You could use UTF-8s internally and convert to UTF-8 for I/O. This way you
would always be working with valid Unicode data at the API layer. You could
then support real UTF-8 data with UTF-16 sorting.
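
A rough sketch of what conversion at the I/O boundary could look like, again in Python. The function names are hypothetical, and "UTF-8s" here means only the form described in this thread (each UTF-16 code unit encoded as one to three bytes), not a formally defined encoding. It leans on Python's "surrogatepass" error handler to let the three-byte surrogate sequences through.

```python
def utf8s_to_utf8(data: bytes) -> bytes:
    """Convert UTF-8s bytes to standard UTF-8 for output."""
    # Decode, letting three-byte surrogate sequences through as lone surrogates.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 so each surrogate pair joins into one code point.
    # An unpaired surrogate raises UnicodeDecodeError here.
    text = with_surrogates.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
    return text.encode("utf-8")


def utf8_to_utf8s(data: bytes) -> bytes:
    """Convert standard UTF-8 bytes to UTF-8s for internal storage."""
    utf16 = data.decode("utf-8").encode("utf-16-be")
    units = (int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2))
    # Encode every UTF-16 code unit on its own, surrogates included.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


external = "a\U00010000".encode("utf-8")
internal = utf8_to_utf8s(external)
print(internal.hex(" "))
print(utf8s_to_utf8(internal) == external)
```

With the conversion confined to the boundary, only standard UTF-8 would ever be visible at the API layer.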

How are other databases handling UTF-16 compares? Have you considered UTF-16
code point order support? How do you sort AL16UTF16?

Carl

-----Original Message-----
From: Jianping Yang [mailto:Jianping.Yang@oracle.com]
Sent: Friday, June 08, 2001 11:33 AM
To: Carl W. Brown
Cc: unicode@unicode.org
Subject: Re: UTF-8 syntax

Carl,

Please be focused on the subject we have discussed on this email chain. If you
cannot understand the issue in this chain, just don't waste your time here.

For Oracle's naming convention, we will consider your concern here. But given
the clear definition and recommendation, I don't think user will be confused.

Regards,
Jianping.

"Carl W. Brown" wrote:

> Jianping,
>
> UTF-16 is an encoding system for Unicode. Encoding does not indicate sort
> order. It is just an encoding, that is all. If you want to compare two
> fields, they should be compared in either the collating sequence for the
> locale or Unicode code point order.
>
> If I follow your argument further we should ensure that EUC-J, Shift-JIS,
> iso-2022-jp and Unicode have the same sort order.
>
> It is not hard to compare UTF-16 data in code point sequence. DO THE RIGHT
> THING!!!
>
> What is really bad about Oracle's proposed UTF-8 implementation is that the
> incorrect encoding is called UTF8 and the real UTF-8 is called AL32UTF8. Be
> honest with your users. Let them know the real facts. If they use your
> UTF8 encoding they can get into trouble. Do it now before users get into a
> migration jam.
>
> UTF8
>
> The UTF8 character set encodes characters in one to three bytes. Surrogate
> pairs require six bytes.
>
> AL32UTF8
>
> The AL32UTF8 character set encodes characters in one to three bytes.
> Surrogate pairs require four bytes.
>
> From this documentation it would seem that UTF8 is the real thing and the
> AL32UTF8 is an Oracle special encoding.
>
> If you are going to have a non-compliant encoding then you should call it
> AL16UTF8 and call the other UTF8. This would be consistent with your
> AL16UTF16 encoding selection. You should also change the documentation to:
>
> UTF8
>
> The UTF8 character set encodes characters in one to three bytes. Surrogate
> pairs require four bytes.
>
> AL16UTF8
>
> The AL16UTF8 character set encodes characters in one to three bytes.
> Surrogate pairs use a non-standard encoding that requires six bytes. This
> encoding provides the same sort order as AL16UTF16 but will not work with
> standard UTF-8 encoders and decoders.
>
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
> Behalf Of Jianping Yang
> Sent: Thursday, June 07, 2001 6:51 PM
> To: Peter_Constable@sil.org
> Cc: unicode@unicode.org
> Subject: Re: UTF-8 syntax
>
> I don't get the point of this argument, as UTF-8S maps exactly to UTF-16 at
> the UTF-16 code unit level, which means one UTF-16 code unit will be mapped
> to either one, two, or three bytes in UTF-8S. So if you are saying there is
> an ambiguity in UTF-8S, it should also apply to UTF-16, which does not make
> sense to me.
>
> Regards,
> Jianping.
>
> Peter_Constable@sil.org wrote:
>
> > On 06/07/2001 10:38:15 AM DougEwell2 wrote:
> >
> > >The ambiguity comes from the fact that, if I am using UTF-8s and I want to
> > >represent the sequence of (invalid) scalar values <D800 DC00>, I must use
> > >the UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent the
> > >(valid) scalar value <10000>, I must *also* use the UTF-8s sequence
> > ><ED A0 80 ED B0 80>. Unless you have a crystal ball or are extremely good
> > >with tarot cards, you have no way, upon reverse-mapping the UTF-8s sequence
> > ><ED A0 80 ED B0 80>, to know whether it is supposed to be mapped back to
> > ><D800 DC00> or to <10000>.
> >
> > This brings out a good point. We can't yet say that UTF-8s is ambiguous
> > since it is not formally defined. What this does highlight, though, is a
> > gap in the proposal that must be addressed before it could be considered: a
> > well-formed definition for UTF-8 must (by D29) provide a *unique*
> > representation for *all* USVs, and unless the proposal is amended to remove
> > D800 - DFFF from the codespace, it must be amended to provide some unique
> > means of representing things like U+D800. What it is *not allowed* to be is
> > ambiguous. If UTF-8s considers <ED A0 80 ED B0 80> to mean U+10000, then it
> > must provide some sequence other than <ED A0 80> to mean U+D800.
> >
> > >Premise: Unicode should not, and does not, define ambiguous UTFs.
> > > I think we agree on this.
> >
> > Yes.
> >
> > >Premise: UTF-8s is ambiguous in its handling of surrogate code points.
> > > I tried to prove this above.
> > >
> > >Conclusion: Unicode should not define UTF-8s.
> >
> > I definitely agree with the idea you're getting at, but am just looking from
> > a very slightly different angle. The conclusion does not necessarily follow
> > because UTF-8s is only a proposal that potentially can be modified. If you
> > say, "UTF-8s as has been currently proposed would be inconsistent with
> > D29", then I agree. The proposed definition for UTF-8s *could* potentially
> > be revised, though, and so the argument that a UTF-8s cannot be added to
> > Unicode doesn't hold.
> >
> > UTF-8s definitely is not tenable as currently proposed, given the current
> > definitions. I think we agree on that.
> >
> > - Peter
> >
>
> > --------------------------------------------------------------------------
> > Peter Constable
> >
> > Non-Roman Script Initiative, SIL International
> > 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> > Tel: +1 972 708 7485
> > E-mail: <peter_constable@sil.org>


