Jianping,
>We clearly documented the character set definitions for UTF8 and AL32UTF8
>in our manual. If you look at them, you can easily map UTF8 to UTF-8S and
>AL32UTF8 to UTF-8.
I missed this documentation. All I saw was that one uses 6 bytes and the
other uses 4 bytes for surrogate pairs. That kind of explanation is
meaningless to most DBAs. If told to implement UTF-8, they will select UTF8
because they don't know better.
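
To make the byte difference concrete, here is a rough sketch (modern Python,
purely illustrative; the sample character is arbitrary) of what the two forms
do with one supplementary character:

    ch = '\U00010400'                  # a supplementary (beyond-BMP) character
    utf8 = ch.encode('utf-8')          # 4 bytes: F0 90 90 80  (standard UTF-8 / AL32UTF8)
    pair = '\ud801' + '\udc00'         # the same character as a UTF-16 surrogate pair
    utf8s = pair.encode('utf-8', 'surrogatepass')
                                       # 6 bytes: ED A0 81 ED B0 80  (UTF-8s-style / UTF8)

A standard UTF-8 decoder will reject the six-byte form, which is exactly the
interoperability problem.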
I think you should add a reference here explaining where to find
clarification of the consequences of the choice.
Have you given any thought to limiting NLS_LANG to AL32UTF8 only? That would
give you the UTF-16-like sorting without generating invalid UTF-8 data. It
would solve the problem without creating new ones. The best of all worlds.
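
To illustrate the sorting point (again a rough Python sketch, hypothetical,
just to show that the two binary orders genuinely differ and that valid UTF-8
data can still be sorted in UTF-16 order when a client needs that):

    a, b = '\uff5e', '\U00010000'                   # U+FF5E vs. U+10000
    a.encode('utf-8') < b.encode('utf-8')           # True:  UTF-8 bytes follow code point order
    a.encode('utf-16-be') < b.encode('utf-16-be')   # False: U+10000 becomes D800 DC00, before FF5E
    # Valid Unicode data can still be ordered the UTF-16 way with a sort key:
    rows = sorted([b, a], key=lambda s: s.encode('utf-16-be'))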
If a client wants data in UTF-16 sequence, it is probably because they are
retrieving your UTF-8s data as UTF-16. I cannot imagine that they would want
to retrieve actual UTF-8s data, because there is nothing they could do with
it: no converters work with it, and it has no valid charset.
Could you point me to the documentation that explains the differences
between UTF8 and AL32UTF8?
Carl
-----Original Message-----
From: Jianping Yang [mailto:Jianping.Yang@oracle.com]
Sent: Friday, June 08, 2001 3:56 PM
To: Carl W. Brown
Cc: unicode@unicode.org
Subject: Re: UTF8 vs AL32UTF8
Carl,
"Carl W. Brown" wrote:
> Jianping,
>
> I will change the subject.
>
> >For Oracle's naming convention, we will consider your concern here. But
> >given the clear definition and recommendation, I don't think users will be
> >confused.
>
> Is there other documentation that I am missing?
>
> Looking at your documentation, you call UTF-8s UTF8 and standard UTF-8
> AL32UTF8. To me this is very misleading.
>
We clearly documented the character set definitions for UTF8 and AL32UTF8 in
our manual. If you look at them, you can easily map UTF8 to UTF-8S and
AL32UTF8 to UTF-8.
>
> Suppose you build your Unicode database as UTF8. You start using the
> data for a web application. What happens when you send UTF-8s data to a web
> browser? It will work most of the time but will give you funny results from
> time to time. This could create a difficult bug for people to find.
>
In this case, if you want to insert and retrieve strings in UTF-8 encoding
form, you can set your NLS_LANG to AL32UTF8 on the client side. Oracle's
architecture will provide character set conversion between the server (UTF8)
and the client (AL32UTF8), so you can treat the server as a black box.
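
For example, a client might set something like this before connecting (a
hypothetical illustration only; the AL32UTF8 suffix is the part that matters):

    import os
    # NLS_LANG is language_territory.charset; the character set component tells
    # the Oracle client libraries to exchange data with the server as standard
    # UTF-8, and the server converts to and from its own UTF8 storage.
    os.environ['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'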
>
> Worse yet, you have a client who has a Unicode database encoded as UTF8.
> They have mostly in-house processing. Now they want to use the data in a
> web application. There is no charset for UTF-8s. Will you provide a
> converter to convert from UTF-8 to UTF-8s and back? If so, would it not be
> easier to provide a code point order compare routine for UTF-16?
>
As I said above, this is already supported.
>
> Since the issue is sorting sequences, there might be another alternative.
> You could use UTF-8s internally and convert to UTF-8 for I/O. This way you
> would always be working with valid Unicode data at the API layer. You could
> then support real UTF-8 data with UTF-16 sorting.
>
You have a choice here depending on your needs. You can choose either
AL32UTF8 or UTF8 on the server side or the client side. For OCI programming,
you can also choose UTF-16.
>
> How are other databases handling UTF-16 compares? Have you considered
> UTF-16 code point order support? How do you sort AL16UTF16?
>
In a binary sort, AL16UTF16 sorts in the same order as UTF8, which is
different from AL32UTF8. But you can set NLS_SORT to UNICODE_BINARY, which
will sort AL32UTF8 in AL16UTF16 binary order.
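
For completeness, comparing UTF-16 code units in code point order needs only
a small fix-up before comparison; a rough sketch (generic Python, not Oracle
code):

    def code_point_order_key(utf16be_bytes):
        # Remap each 16-bit code unit so that comparing the resulting lists
        # yields Unicode code point order instead of raw code unit order.
        key = []
        for i in range(0, len(utf16be_bytes), 2):
            u = int.from_bytes(utf16be_bytes[i:i+2], 'big')
            if u >= 0xE000:
                u -= 0x800       # BMP E000..FFFF drops below the remapped surrogates
            elif u >= 0xD800:
                u += 0x2000      # surrogates (supplementary characters) move to the top
            key.append(u)
        return key

    # code_point_order_key(s1) < code_point_order_key(s2) compares two UTF-16BE
    # byte strings in code point order.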
Only AL16UTF16 and UTF8 can be used as the NCHAR character set, and NCHAR
semantics are totally independent of the character set, which enables you to
build portable NCHAR applications that use either AL16UTF16 or UTF8,
whichever benefits you when considering storage space for a particular
locale.
Regards,
Jianping.
>
> Carl
>
> -----Original Message-----
> From: Jianping Yang [mailto:Jianping.Yang@oracle.com]
> Sent: Friday, June 08, 2001 11:33 AM
> To: Carl W. Brown
> Cc: unicode@unicode.org
> Subject: Re: UTF-8 syntax
>
> Carl,
>
> Please stay focused on the subject we have discussed in this email chain.
> If you cannot understand the issue in this chain, just don't waste your
> time here.
>
> For Oracle's naming convention, we will consider your concern here. But
> given the clear definition and recommendation, I don't think users will be
> confused.
>
> Regards,
> Jianping.
>
> "Carl W. Brown" wrote:
>
> > Jianping,
> >
> > UTF-16 is an encoding system for Unicode. Encoding does not indicate sort
> > order. It is just encoding, that is all. If you want to compare two
> > fields, they should be compared in either the collating sequence for the
> > locale or Unicode code point order.
> >
> > If I follow your argument further, we should ensure that EUC-J,
> > Shift-JIS, iso-2022-jp and Unicode have the same sort order.
> >
> > It is not hard to compare UTF-16 data in code point sequence. DO THE
> > RIGHT THING!!!
> >
> > What is really bad about Oracle's proposed UTF-8 implementation is that
> > the incorrect encoding is called UTF8 and the real UTF-8 is called
> > AL32UTF8. Be honest with your users. Let them know the real facts. If
> > they use your UTF8 encoding they can get into trouble. Do it now before
> > users get into a migration jam.
> >
> > UTF8
> >
> > The UTF8 character set encodes characters in one to three bytes.
> > Surrogate pairs require six bytes.
> >
> > AL32UTF8
> >
> > The AL32UTF8 character set encodes characters in one to three bytes.
> > Surrogate pairs require four bytes.
> >
> > From this documentation it would seem that UTF8 is the real thing and
> > AL32UTF8 is an Oracle special encoding.
> >
> > If you are going to have a non-compliant encoding, you should call it
> > AL16UTF8 and call the other UTF8. This would be consistent with your
> > AL16UTF16 encoding selection. You should also change the documentation
> > to:
> >
> > UTF8
> >
> > The UTF8 character set encodes characters in one to three bytes.
> > Surrogate pairs require four bytes.
> >
> > AL16UTF8
> >
> > The AL16UTF8 character set encodes characters in one to three bytes.
> > Surrogate pairs use a non-standard encoding that requires six bytes. This
> > encoding provides the same sort order as AL16UTF16 but will not work with
> > standard UTF-8 encoders and decoders.
> >
> > -----Original Message-----
> > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> > Behalf Of Jianping Yang
> > Sent: Thursday, June 07, 2001 6:51 PM
> > To: Peter_Constable@sil.org
> > Cc: unicode@unicode.org
> > Subject: Re: UTF-8 syntax
> >
> > I don't get the point of this argument, as UTF-8S is exactly mapped to
> > UTF-16 code unit by code unit, which means one UTF-16 code unit will be
> > mapped to either one, two, or three bytes in UTF-8S. So if you are saying
> > there is ambiguity in UTF-8S, it should also apply to UTF-16, which does
> > not make sense to me.
> >
> > Regards,
> > Jianping.
> >
> > Peter_Constable@sil.org wrote:
> >
> > > On 06/07/2001 10:38:15 AM DougEwell2 wrote:
> > >
> > > >The ambiguity comes from the fact that, if I am using UTF-8s and I want
> > > >to represent the sequence of (invalid) scalar values <D800 DC00>, I must
> > > >use the UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent
> > > >the (valid) scalar value <10000>, I must *also* use the UTF-8s sequence
> > > ><ED A0 80 ED B0 80>. Unless you have a crystal ball or are extremely
> > > >good with tarot cards, you have no way, upon reverse-mapping the UTF-8s
> > > >sequence <ED A0 80 ED B0 80>, to know whether it is supposed to be
> > > >mapped back to <D800 DC00> or to <10000>.
> > >
> > > This brings out a good point. We can't yet say that UTF-8s is ambiguous
> > > since it is not formally defined. What this does highlight, though, is a
> > > gap in the proposal that must be addressed before it could be considered:
> > > a well-formed definition for UTF-8 must (by D29) provide a *unique*
> > > representation for *all* USVs, and unless the proposal is amended to
> > > remove D800 - DFFF from the codespace, it must be amended to provide some
> > > unique means of representing things like U+D800. What it is *not allowed*
> > > to be is ambiguous. If UTF-8s considers <ED A0 80 ED B0 80> to mean
> > > U+10000, then it must provide some sequence other than <ED A0 80> to mean
> > > U+D800.
> > >
> > > >Premise: Unicode should not, and does not, define ambiguous UTFs.
> > > > I think we agree on this.
> > >
> > > Yes.
> > >
> > > >Premise: UTF-8s is ambiguous in its handling of surrogate code points.
> > > > I tried to prove this above.
> > > >
> > > >Conclusion: Unicode should not define UTF-8s.
> > >
> > > I definitely agree with the idea you're getting at, but am just looking
> > > from a very slightly different angle. The conclusion does not necessarily
> > > follow because UTF-8s is only a proposal that potentially can be
> > > modified. If you say, "UTF-8s as has been currently proposed would be
> > > inconsistent with D29", then I agree. The proposed definition for UTF-8s
> > > *could* potentially be revised, though, and so the argument that a UTF-8s
> > > cannot be added to Unicode doesn't hold.
> > >
> > > UTF-8s definitely is not tenable as currently proposed, given the
> > > current definitions. I think we agree on that.
> > >
> > > - Peter
> > >
> > > ---------------------------------------------------------------------------
> > > Peter Constable
> > >
> > > Non-Roman Script Initiative, SIL International
> > > 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> > > Tel: +1 972 708 7485
> > > E-mail: <peter_constable@sil.org>