Re: UTF8 vs AL32UTF8

From: Jianping Yang (Jianping.Yang@oracle.com)
Date: Fri Jun 08 2001 - 18:56:08 EDT


Carl,

"Carl W. Brown" wrote:

> Jianping,
>
> I will change the subject.
>
> >For Oracle's naming convention, we will consider your concern here. But
> >given the clear definition and recommendation, I don't think users will be
> >confused.
>
> Is there other documentation that I am missing?
>
> Looking at your documentation, you call UTF-8s UTF8 and standard UTF-8
> AL32UTF8. To me this is very misleading.
>

We have clearly documented the character set definitions for UTF8 and AL32UTF8
in our manual. If you look at them, you can easily map UTF8 to UTF-8S and
AL32UTF8 to UTF-8.
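[The mapping is easiest to see on a concrete character. A small Python sketch
for the archive (Python's `surrogatepass` error handler stands in for a
UTF-8S/CESU-8-style encoder here; this is an illustration, not Oracle code):

```python
# U+10000, the first supplementary character, in the two forms.

# Standard UTF-8 (Oracle's AL32UTF8): one four-byte sequence.
utf8 = '\U00010000'.encode('utf-8')
assert utf8 == b'\xf0\x90\x80\x80'

# UTF-8S/CESU-8 style (Oracle's UTF8): the UTF-16 surrogate pair
# <D800 DC00> encoded as two three-byte sequences, six bytes total.
cesu8 = '\ud800\udc00'.encode('utf-8', 'surrogatepass')
assert cesu8 == b'\xed\xa0\x80\xed\xb0\x80'

# A strict UTF-8 decoder rejects the six-byte form outright.
try:
    cesu8.decode('utf-8')
except UnicodeDecodeError:
    print('six-byte form is not valid UTF-8')
```
]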

>
> Suppose you build your Unicode database as UTF8. You start using the
> data for a web application. What happens when you send UTF-8s data to a web
> browser? It will work most of the time but will give you funny results from
> time to time. This could create a difficult bug for people to find.
>

In this case, if you want to insert and retrieve strings in standard UTF-8
encoding, you can set NLS_LANG to AL32UTF8 on the client side. Oracle's
architecture provides character set conversion between the server (UTF8) and
the client (AL32UTF8), so you can treat the server as a black box.
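[A hypothetical client environment, added for the archive; the
`AMERICAN_AMERICA` language/territory part is only a placeholder, the point is
the character set suffix:

```shell
# Declare the client character set as standard UTF-8 (AL32UTF8);
# Oracle's NLS layer then converts to and from the server's UTF8 set.
export NLS_LANG=AMERICAN_AMERICA.AL32UTF8
echo "$NLS_LANG"
```
]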

>
> Worse yet, you have a client who has a Unicode database encoded as UTF8.
> They have mostly in-house processing. Now they want to use the data in a
> web application. There is no charset for UTF-8s. Will you provide a
> converter to convert from UTF-8 to UTF-8s and back? If so, would it not be
> easier to provide a code point order compare routine for UTF-16?
>

As I said above, this is already supported.
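[The code point order compare routine Carl asks about is in any case small to
sketch. A hedged Python illustration of the well-known code-unit fix-up (the
technique, not Oracle's implementation): shifting BMP units U+E000 and above
down by 0x800 and surrogates up by 0x2000 makes UTF-16 code-unit order
coincide with code point order:

```python
def utf16_units(s: str) -> list[int]:
    """The UTF-16BE code units of s, as integers."""
    b = s.encode('utf-16-be')
    return [int.from_bytes(b[i:i + 2], 'big') for i in range(0, len(b), 2)]

def fixup(u: int) -> int:
    """Remap one code unit so adjusted unit order == code point order."""
    if u >= 0xE000:
        return u - 0x800   # BMP E000..FFFF move below the surrogate range
    if u >= 0xD800:
        return u + 0x2000  # surrogates (supplementary chars) move above it
    return u

def codepoint_key(s: str) -> list[int]:
    return [fixup(u) for u in utf16_units(s)]

# U+FFFD (BMP) sorts after U+10000 in raw UTF-16 binary order,
# but before it in code point order:
a, b = '\ufffd', '\U00010000'
assert a.encode('utf-16-be') > b.encode('utf-16-be')
assert codepoint_key(a) < codepoint_key(b)
```
]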

>
> Since the issue is sorting sequences there might be another alternative.
> You could use UTF-8s internally and convert to UTF-8 for I/O. This way you
> would always be working with valid Unicode data at the API layer. You could
> then support real UTF-8 data with UTF-16 sorting.
>

You have a choice here depending on your needs. You can choose either AL32UTF8
or UTF8 on the server side or the client side. For OCI programming, you can
also choose UTF-16.

>
> How are other databases handling UTF-16 compares? Have you considered UTF-16
> code point order support? How do you sort AL16UTF16?
>

In a binary sort, AL16UTF16 sorts in the same order as UTF8, which differs
from AL32UTF8. But you can set NLS_SORT to UNICODE_BINARY, which sorts
AL32UTF8 in AL16UTF16 binary order.
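[The difference is easy to demonstrate outside the database. Assuming, for the
sake of illustration, that AL32UTF8 data compares as its UTF-8 bytes and
AL16UTF16 data as its UTF-16BE bytes (a sketch, not Oracle internals):

```python
# U+FF61 is in the BMP; U+10000 is supplementary.
a, b = '\uff61', '\U00010000'

# UTF-8 / AL32UTF8 binary order agrees with code point order:
assert a.encode('utf-8') < b.encode('utf-8')          # EF BD A1 < F0 90 80 80

# UTF-16BE / AL16UTF16 binary order reverses the pair, because the
# BMP code unit FF61 is larger than the lead surrogate D800:
assert a.encode('utf-16-be') > b.encode('utf-16-be')  # FF 61 > D8 00 DC 00
```
]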

Only AL16UTF16 and UTF8 can be used as the NCHAR character set, and NCHAR
semantics are totally independent of the character set. This lets you build a
portable NCHAR application that uses either AL16UTF16 or UTF8, whichever gives
better storage efficiency for a particular locale.

Regards,
Jianping.

>
> Carl
>
> -----Original Message-----
> From: Jianping Yang [mailto:Jianping.Yang@oracle.com]
> Sent: Friday, June 08, 2001 11:33 AM
> To: Carl W. Brown
> Cc: unicode@unicode.org
> Subject: Re: UTF-8 syntax
>
> Carl,
>
> Please stay focused on the subject we have discussed in this email chain.
> If you cannot understand the issue in this chain, just don't waste your
> time here.
>
> For Oracle's naming convention, we will consider your concern here. But
> given the clear definition and recommendation, I don't think users will be
> confused.
>
> Regards,
> Jianping.
>
> "Carl W. Brown" wrote:
>
> > Jianping,
> >
> > UTF-16 is an encoding system for Unicode. An encoding does not indicate
> > sort order; it is just an encoding, that is all. If you want to compare
> > two fields, they should be compared in either the collating sequence for
> > the locale or Unicode code point order.
> >
> > If I follow your argument further, we should ensure that EUC-JP,
> > Shift-JIS, iso-2022-jp, and Unicode have the same sort order.
> >
> > It is not hard to compare UTF-16 data in code point sequence. DO THE
> > RIGHT THING!!!
> >
> > What is really bad about Oracle's proposed UTF-8 implementation is that
> > the incorrect encoding is called UTF8 and the real UTF-8 is called
> > AL32UTF8. Be honest with your users. Let them know the real facts. If
> > they use your UTF8 encoding, they can get into trouble. Do it now, before
> > users get into a migration jam.
> >
> > UTF8
> >
> > The UTF8 character set encodes characters in one to three bytes.
> > Surrogate pairs require six bytes.
> >
> > AL32UTF8
> >
> > The AL32UTF8 character set encodes characters in one to three bytes.
> > Surrogate pairs require four bytes.
> >
> > From this documentation it would seem that UTF8 is the real thing and
> > AL32UTF8 is an Oracle special encoding.
> >
> > If you are going to have a non-compliant encoding, you should call it
> > AL16UTF8 and call the other one UTF8. This would be consistent with your
> > AL16UTF16 encoding selection. You should also change the documentation to:
> >
> > UTF8
> >
> > The UTF8 character set encodes characters in one to three bytes.
> > Surrogate pairs require four bytes.
> >
> > AL16UTF8
> >
> > The AL16UTF8 character set encodes characters in one to three bytes.
> > Surrogate pairs use a non-standard encoding that requires six bytes. This
> > encoding provides the same sort order as AL16UTF16 but will not work with
> > standard UTF-8 encoders and decoders.
> >
> > -----Original Message-----
> > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> > Behalf Of Jianping Yang
> > Sent: Thursday, June 07, 2001 6:51 PM
> > To: Peter_Constable@sil.org
> > Cc: unicode@unicode.org
> > Subject: Re: UTF-8 syntax
> >
> > I don't get the point of this argument, as UTF-8S maps exactly to UTF-16
> > code units: one UTF-16 code unit maps to one, two, or three bytes in
> > UTF-8S. So if you are saying there is ambiguity in UTF-8S, it should also
> > apply to UTF-16, which does not make sense to me.
> >
> > Regards,
> > Jianping.
> >
> > Peter_Constable@sil.org wrote:
> >
> > > On 06/07/2001 10:38:15 AM DougEwell2 wrote:
> > >
> > > >The ambiguity comes from the fact that, if I am using UTF-8s and I want
> > > >to represent the sequence of (invalid) scalar values <D800 DC00>, I must
> > > >use the UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent
> > > >the (valid) scalar value <10000>, I must *also* use the UTF-8s sequence
> > > ><ED A0 80 ED B0 80>. Unless you have a crystal ball or are extremely
> > > >good with tarot cards, you have no way, upon reverse-mapping the UTF-8s
> > > >sequence <ED A0 80 ED B0 80>, to know whether it is supposed to be
> > > >mapped back to <D800 DC00> or to <10000>.
> > >
> > > This brings out a good point. We can't yet say that UTF-8s is ambiguous
> > > since it is not formally defined. What this does highlight, though, is a
> > > gap in the proposal that must be addressed before it could be considered:
> > > a well-formed definition for UTF-8 must (by D29) provide a *unique*
> > > representation for *all* USVs, and unless the proposal is amended to
> > > remove D800 - DFFF from the codespace, it must be amended to provide
> > > some unique means of representing things like U+D800. What it is *not
> > > allowed* to be is ambiguous. If UTF-8s considers <ED A0 80 ED B0 80> to
> > > mean U+10000, then it must provide some sequence other than <ED A0 80>
> > > to mean U+D800.
> > >
> > > >Premise: Unicode should not, and does not, define ambiguous UTFs.
> > > > I think we agree on this.
> > >
> > > Yes.
> > >
> > > >Premise: UTF-8s is ambiguous in its handling of surrogate code points.
> > > > I tried to prove this above.
> > > >
> > > >Conclusion: Unicode should not define UTF-8s.
> > >
> > > I definitely agree with the idea you're getting at, but am just looking
> > > from a very slightly different angle. The conclusion does not
> > > necessarily follow, because UTF-8s is only a proposal that can
> > > potentially be modified. If you say, "UTF-8s as currently proposed would
> > > be inconsistent with D29", then I agree. The proposed definition for
> > > UTF-8s *could* potentially be revised, though, and so the argument that
> > > UTF-8s cannot be added to Unicode doesn't hold.
> > >
> > > UTF-8s definitely is not tenable as currently proposed, given the
> > > current definitions. I think we agree on that.
> > >
> > > - Peter
> > >
> >
> > > ---------------------------------------------------------------------------
> > > Peter Constable
> > >
> > > Non-Roman Script Initiative, SIL International
> > > 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> > > Tel: +1 972 708 7485
> > > E-mail: <peter_constable@sil.org>





This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT