RE: UTF-8 Syntax

From: toby_phipps@peoplesoft.com
Date: Mon Jun 11 2001 - 05:33:09 EDT


Carl W. Brown <cbrown@xnetinc.com> wrote:
>In the case of strcmp the problem is that this won't even work on UCS-2.
>It detects the end of string with a single byte 0x00. You have to use a
>special Unicode compare routine, and this routine needs to be fixed to
>produce proper compares. Most likely your problem with COBOL is that it is
>not UTF-16 enabled, so it does UCS-2 compares.

Sorry - read "wcscmp" for "strcmp". 4 years of Unicode coding and still
slipping back to old function names!
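
To make the embedded-zero-byte point concrete, here is a minimal sketch of
my own (assuming a platform where wchar_t is a 16-bit code unit, such as
Windows; elsewhere the same idea applies with char16_t):

    /* strcmp() stops at the first 0x00 byte, so it cannot compare
     * UCS-2/UTF-16 data, whose code units routinely contain zero bytes.
     * wcscmp() compares whole wchar_t code units instead. */
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void) {
        const wchar_t a[] = L"AB";   /* code units 0x0041 0x0042 */
        const wchar_t b[] = L"AZ";   /* code units 0x0041 0x005A */

        /* Viewed as a char*, every other byte is 0x00, so strcmp sees two
         * identical (or empty) strings and reports equality. */
        printf("strcmp:  %d\n", strcmp((const char *)a, (const char *)b));

        /* wcscmp compares code units and sees the real difference. */
        printf("wcscmp:  %d\n", wcscmp(a, b));
        return 0;
    }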

I am puzzled by your COBOL comparison statement - any implementation using
UCS-2 will compare equivalently to a UTF-16 implementation, unless you're
talking about a UTF-16 based system with a function that makes it binary
compare equivalently to UTF-8 or UTF-32, which is the topic of this whole
thread.

>Your point #2 logic is circular. Just because old code designed to support
>UCS-2 may be used with UTF-16 does not mean that it is doing the right
>thing. Where in the Unicode documentation does it support UTF-16 binary
>sorting order as a valid Unicode sort order? I can not find it anywhere.

Sorting data by its binary representation is not something that the Unicode
Consortium or TUS has to ordain. It's a given in most programming
environments, unless something special is written into compare functions to
perform a non-binary sort. Remember - we're sorting data for internal
comparison here, not for presentation to the user or through an API.
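
To be concrete about why that binary order is the issue (a sketch of my
own, not anyone's product code): the same pair of characters sorts
differently under a plain binary/code-unit compare depending on whether it
is held as UTF-16 or as UTF-8/UTF-32, which is exactly the mismatch UTF-8S
is meant to paper over.

    /* U+FF5E (FULLWIDTH TILDE) vs U+10400 (a supplementary character).
     * By code point -- and therefore in UTF-8 and UTF-32 binary order --
     * U+FF5E comes first.  In UTF-16, U+10400 becomes the surrogate pair
     * 0xD801 0xDC00, and 0xD801 < 0xFF5E, so the order flips. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        const unsigned char u8_a[] = {0xEF, 0xBD, 0x9E};        /* U+FF5E  */
        const unsigned char u8_b[] = {0xF0, 0x90, 0x90, 0x80};  /* U+10400 */
        const uint16_t u16_a[] = {0xFF5E};                      /* U+FF5E  */
        const uint16_t u16_b[] = {0xD801, 0xDC00};              /* U+10400 */

        printf("UTF-8 order:  %d\n", memcmp(u8_a, u8_b, 3));         /* < 0 */
        printf("UTF-16 order: %d\n", (int)u16_a[0] - (int)u16_b[0]); /* > 0 */
        return 0;
    }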

>Your point #3 supports not having UTF-8s. Good systems that support both
>ASCII & EBCDIC have special logic to ensure that they compare the same way.
>Thus if I have an AS400 or IBM mainframe server running EBCDIC I must sort
>the same way as my workstations.

Yes, but why force new code to perform the same ugly key-generation and
shifting that legacy systems have to perform? If I'm writing a new system
completely in Unicode, I should be able to free my system from having to
deal with all the issues of legacy encodings, such as dealing with
different binary sorts for in-memory comparison. It's one of the benefits
Unicode brought the world, although it appears that it was more by accident
than by design.
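
For anyone who hasn't had to write that kind of logic, here is roughly what
the key generation looks like. This is a simplified sketch of my own; the
translate table covers only letters, digits and space, and a real system
needs a full, code-page-specific 256-byte table:

    /* Build an ASCII-collating key from EBCDIC data so that the EBCDIC host
     * and the ASCII workstations can both use plain binary compares. */
    #include <stddef.h>

    static unsigned char e2a[256];    /* EBCDIC byte -> ASCII byte */

    static void init_e2a(void) {
        int i;
        e2a[0x40] = ' ';                                           /* space */
        for (i = 0; i < 9;  i++) e2a[0xC1 + i] = (unsigned char)('A' + i);
        for (i = 0; i < 9;  i++) e2a[0xD1 + i] = (unsigned char)('J' + i);
        for (i = 0; i < 8;  i++) e2a[0xE2 + i] = (unsigned char)('S' + i);
        for (i = 0; i < 9;  i++) e2a[0x81 + i] = (unsigned char)('a' + i);
        for (i = 0; i < 9;  i++) e2a[0x91 + i] = (unsigned char)('j' + i);
        for (i = 0; i < 8;  i++) e2a[0xA2 + i] = (unsigned char)('s' + i);
        for (i = 0; i < 10; i++) e2a[0xF0 + i] = (unsigned char)('0' + i);
    }

    /* Every key shared between the two systems has to pass through this. */
    static void make_sort_key(const unsigned char *ebcdic, size_t len,
                              unsigned char *key) {
        for (size_t i = 0; i < len; i++)
            key[i] = e2a[ebcdic[i]];
    }

    int main(void) {
        const unsigned char ebcdic_ibm[] = {0xC9, 0xC2, 0xD4};  /* "IBM" */
        unsigned char key[4] = {0};
        init_e2a();
        make_sort_key(ebcdic_ibm, 3, key);
        return key[0] == 'I' ? 0 : 1;  /* key now compares like ASCII "IBM" */
    }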

>I can see by point #4 that you have not worked on database internals.
>Primary key structures are usually separate instances of the data key
>information. This is not true of files like linked-list files, but modern
>databases are too sophisticated to be able to share key and data records.
>They can use any encoding. I for example can sort EBCDIC records in ASCII
>order and still use binary compares by building internal keys that compare
>correctly.

On the contrary. Before my current position, I did hard time at a certain
large database vendor that I will decline to name in this context, but
let's not get into personal backgrounds. What you are describing amounts
to a functional index - an index built with values that you are referring
to as "internal keys" used for optimization purposes. Please re-read my
original description of the fetch process where the data value can be
retrieved from the index without the need to refer to a data block. Take
for example a table with one primary key and several other non-key columns.
If I create an index on the primary key, using a binary sort, and then
perform a query against this key, ordered by the key, a well-optimized
database system will return the values read directly from the index, as it
has no need to refer to the data blocks. If however, I include one or more
of the non-key (and non-indexed) columns in the query, the database will
have to refer to the data blocks to retrieve those values, and the
advantage of performing an index-only scan is lost. This is all irrelevant
to the UTF-8S discussion - the point I'm trying to make is that indexing
values in their original binary form as opposed to a transformation of that
form is significantly beneficial for query performance.
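
If it helps, here is a toy model of the two fetch paths, purely my own
simplification (a real B-tree, buffer cache and optimizer are obviously far
more involved):

    /* When the key is stored in the index in its original binary form, an
     * ordered query on the key never touches the data blocks; asking for a
     * non-indexed column forces a trip to them. */
    #include <stdio.h>

    struct index_entry {      /* simplified B-tree leaf entry               */
        char key[16];         /* primary key bytes, original encoding       */
        int  row_id;          /* where the full row lives in the data blocks */
    };

    struct row {              /* what actually sits in a data block         */
        char key[16];
        char other_col[32];   /* non-key, non-indexed column                */
    };

    /* SELECT key FROM t ORDER BY key  -- satisfied from the index alone.  */
    static void index_only_scan(const struct index_entry *ix, int n) {
        for (int i = 0; i < n; i++)
            printf("%s\n", ix[i].key);       /* no data-block access needed */
    }

    /* SELECT key, other_col FROM t ORDER BY key -- every entry now needs
     * its data block dereferenced, losing the index-only advantage.       */
    static void index_scan_with_fetch(const struct index_entry *ix, int n,
                                      const struct row *data_blocks) {
        for (int i = 0; i < n; i++) {
            const struct row *r = &data_blocks[ix[i].row_id];
            printf("%s %s\n", r->key, r->other_col);
        }
    }

    int main(void) {
        struct row blocks[2]     = {{"K1", "alpha"}, {"K2", "beta"}};
        struct index_entry ix[2] = {{"K1", 0}, {"K2", 1}};
        index_only_scan(ix, 2);
        index_scan_with_fetch(ix, 2, blocks);
        return 0;
    }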

>Point # 6. UTF-EBCDIC is a serious problem. It does not work with all
>EBCDIC codepages. Take the case of Japanese where they have replaced all of
>the lower case Latin characters with single width katakana characters. Try
>to transform to UTF-EBCDIC. It is, like UTF-8s, a short term solution to a
>problem that creates more problems later on.

UTF-EBCDIC works reasonably well in Japanese EBCDIC environments, as long
as one is careful. For example, if I take data in CCSID290 (Extended
single-byte Japanese - katakana in place of lowercase Latin) and transcode
it to UTF-EBCDIC via Unicode, all is well, as long as the programs using
the encoding are truly using UTF-EBCDIC rules (they expect lowercase Latin
characters to be in their invariant/CCSID1047 positions), not CCSID290
rules. UTF-EBCDIC is a very practical solution to a problem that is common
amongst many vendors, as is UTF-8S.

>My problem is that Oracle calls UTF-8s UTF8 and UTF-8 is AL31UTF8. I think
>that people should understand this is a short term workaround only. It
>should have "!!!User Beware!!! - you had better know what you are doing"
>written all over it. You also need to have a good migration plan to phase
>out UTF-8s as soon as you can.

This is simply an Oracle naming issue, nothing to do with the proposal for
UTF-8S. Whether Oracle names it UTF8 or UTF-8 or AL16UTF8 is something
proprietary to Oracle. Like any good vendor, they are very cautious about
breaking existing implementations dependent on their system's behaviour
(hmmm - sounds like a standards body we're all acutely aware of), and so
will probably need to continue supporting their character set name "UTF8"
to mean UTF-8S. Yes, they will need to document well the fact that their
"UTF8" is not truly UTF-8, but is in fact "UTF-8S". Yes, they can most
probably give notice of the renaming of their UTF8 character set in a
future release. This naming is internal to Oracle and its users, and
something that I trust they will address. It should not affect the
discussion of what UTF-8S is, and how it will be used/misused.

>This is what we mean by an internal implementation that does not have to be
>and should not be sanctioned by UTC.

I reiterate - UTF-8S is not something that will be used only by Oracle
internally. PeopleSoft plans to use it in places where we need an 8-bit
representation that sorts in binary equivalently to UTF-16. We'll use it
in our COBOL. We'll use it in our memory cache. SAP has the same needs in
their systems and databases (yes, they are a database vendor - see
http://www.sapdb.org ). Yes, all these uses may be internal to each
vendor, but as Uma has stated, internal representation leaks out. If any
significant number of vendors are going to be using this encoding
internally in their systems, wouldn't it make sense to have a UTR
describing what this representation is, when it is useful, and how to deal
with data presented to you in that encoding should the situation arise?
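
For what it's worth, the core of such an encoder is small. The sketch below
is my own reading of the proposal's central property (every UTF-16 code
unit, surrogate halves included, is written independently using the usual
1-, 2- or 3-byte UTF-8 patterns), not anyone's shipping code; its output
compares byte-for-byte in the same order as a UTF-16 code-unit compare,
which is the whole attraction:

    #include <stdint.h>
    #include <stddef.h>

    /* Encode UTF-16 code units to the "UTF-8S"-style form discussed above.
     * Surrogates are NOT paired up into 4-byte sequences; each 16-bit unit
     * is emitted on its own, so memcmp() over the output orders strings
     * exactly like a code-unit-wise UTF-16 compare.  Returns bytes written;
     * 'out' must have room for 3 bytes per input code unit. */
    size_t utf16_to_utf8s(const uint16_t *in, size_t n, unsigned char *out) {
        size_t o = 0;
        for (size_t i = 0; i < n; i++) {
            uint16_t u = in[i];
            if (u < 0x80) {
                out[o++] = (unsigned char)u;
            } else if (u < 0x800) {
                out[o++] = (unsigned char)(0xC0 | (u >> 6));
                out[o++] = (unsigned char)(0x80 | (u & 0x3F));
            } else {                     /* includes surrogates D800-DFFF */
                out[o++] = (unsigned char)(0xE0 | (u >> 12));
                out[o++] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
                out[o++] = (unsigned char)(0x80 | (u & 0x3F));
            }
        }
        return o;
    }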

Toby.

--
Toby Phipps
PeopleTools Product Manager - Global Technology
PeopleSoft, Inc.
tphipps@peoplesoft.com


