RE: UTF-8 Syntax

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Jun 11 2001 - 15:13:52 EDT


-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of toby_phipps@peoplesoft.com
Sent: Monday, June 11, 2001 2:33 AM
To: unicode@unicode.org
Subject: RE: UTF-8 Syntax

>Carl W. Brown <cbrown@xnetinc.com> wrote:
>>In the case of strcmp the problem is that this won't even work on UCS-2.
>>It detects the end of string with a single byte 0x00. You have to use a
>>special Unicode compare routine, and this routine needs to be fixed to
>>produce proper compares. Most likely your problem with COBOL is that it
>>is not UTF-16 enabled, so it does UCS-2 compares.

>Sorry - read "wcscmp" for "strcmp". 4 years of Unicode coding and still
>slipping back to old function names!
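
For the record, the failure mode is easy to demonstrate. A minimal C
sketch (illustrative only): every wide character of plain Latin text
contains 0x00 bytes, which strcmp treats as terminators, so it reports
two different strings equal.

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void) {
        wchar_t w1[] = L"AB";   /* wide strings: high bytes are 0x00 */
        wchar_t w2[] = L"AC";

        /* strcmp stops at the first 0x00 byte inside the wide units,
           so it never sees the real difference: prints 0 (equal) */
        printf("strcmp:  %d\n", strcmp((const char *)w1, (const char *)w2));

        /* wcscmp compares whole wide units and finds 'B' < 'C' */
        printf("wcscmp: %d\n", wcscmp(w1, w2));
        return 0;
    }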

>I am puzzled by your COBOL comparison statement - any implementation using
>UCS-2 will compare equivalently to a UTF-16 implementation, unless you're
>talking about a UTF-16 based system with a function that makes it binary
>compare equivalently to UTF-8 or UTF-32, which is the topic of this whole
>thread.

If you are going to compare everything in the same sequence, then you
should change your UTF-16 compare logic to compare in Unicode code point
order. In other words, the sorting sequence should not be tied to the
encoding form. This should be part of the UCS-2 to UTF-16 upgrade.

>>Your point #2 logic is circular. Just because old code designed to
>>support UCS-2 may be used with UTF-16 does not mean that it is doing the
>>right thing. Where in the Unicode documentation does it support UTF-16
>>binary sorting order as a valid Unicode sort order? I cannot find it
>>anywhere.

>Sorting data by its binary representation is not something that the Unicode
>Consortium or TUS has to ordain. It's a given in most programming
>environments, unless something special is written into compare functions to
>perform a non-binary sort. Remember - we're sorting data for internal
>comparison here, not for presentation to the user or through an API.

If you do a binary-only sort, you should not expect it to survive a
translation or transform.
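
A concrete example: U+FF5E and U+10400 order differently under UTF-16 and
UTF-8 binary compares, so a binary sort done in one form breaks as soon
as the data is transformed to the other. A minimal C demonstration:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        unsigned char  u8_ff5e[]   = {0xEF, 0xBD, 0x9E};       /* UTF-8  U+FF5E  */
        unsigned char  u8_10400[]  = {0xF0, 0x90, 0x90, 0x80}; /* UTF-8  U+10400 */
        unsigned short u16_ff5e[]  = {0xFF5E};                 /* UTF-16 U+FF5E  */
        unsigned short u16_10400[] = {0xD801, 0xDC00};         /* UTF-16 U+10400 */

        /* UTF-8 binary order is code point order: U+FF5E sorts first */
        printf("UTF-8:  %s first\n",
               memcmp(u8_ff5e, u8_10400, 3) < 0 ? "U+FF5E" : "U+10400");

        /* UTF-16 binary order puts the surrogate pair (0xD801...) first */
        printf("UTF-16: %s first\n",
               u16_ff5e[0] < u16_10400[0] ? "U+FF5E" : "U+10400");
        return 0;
    }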

>>Your point #3 supports not having UTF-8s. Good systems that support
>>both ASCII & EBCDIC have special logic to ensure that they compare the
>>same way. Thus if I have an AS400 or IBM mainframe server running EBCDIC
>>I must sort the same way as my workstations.

>Yes, but why force new code to perform the same ugly key-generation and
>shifting that legacy systems have to perform? If I'm writing a new system
>completely in Unicode, I should be able to free my system from having to
>deal with all the issues of legacy encodings, such as dealing with
>different binary sorts for in-memory comparison. It's one of the benefits
>Unicode brought the world, although it appears that it was more by accident
>than by design.

That would be nice in an ideal world. UCS-2 was usually touted as a way
to simplify your logic because you got rid of MBCS encodings. Now we have
UTF-16, which is another form of MBCS. I suspect that is why many of the
late adopters of Unicode are going to UTF-32.
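
The MBCS parallel is easy to see: a supplementary character occupies two
16-bit code units, so code unit counts and code point counts diverge just
as byte counts and character counts do in MBCS. A small illustration:

    #include <stdio.h>

    int main(void) {
        /* "A" followed by U+10400 (encoded as a surrogate pair) */
        unsigned short s[] = {0x0041, 0xD801, 0xDC00, 0x0000};
        int i, units = 0, cps = 0;
        for (i = 0; s[i]; i++) {
            units++;
            if (s[i] < 0xD800 || s[i] > 0xDBFF) /* count all but lead surrogates */
                cps++;
        }
        printf("code units: %d, code points: %d\n", units, cps); /* 3, 2 */
        return 0;
    }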

>>I can see by point #4 that you have not worked on database internals.
>>Primary key structures are usually separate instances of the data key
>>information. This is not true of files like linked-list files, but
>>modern databases are too sophisticated to share key and data records.
>>They can use any encoding. I, for example, can sort EBCDIC records in
>>ASCII order and still use binary compares by building internal keys that
>>compare correctly.

>On the contrary. Before my current position, I did hard time at a certain
>large database vendor that I will decline to name in this context, but
>let's not get into personal backgrounds. What you are describing amounts
>to a functional index - an index built with values that you are referring
>to as "internal keys" used for optimization purposes. Please re-read my
>original description of the fetch process where the data value can be
>retrieved from the index without the need to refer to a data block. Take
>for example a table with one primary key and several other non-key columns.
>If I create an index on the primary key, using a binary sort, and then
>perform a query against this key, ordered by the key, a well-optimized
>database system will return the values read directly from the index, as it
>has no need to refer to the data blocks. If however, I include one or more
>of the non-key (and non-indexed) columns in the query, the database will
>have to refer to the data blocks to retrieve those values, and the
>advantage of performing an index-only scan is lost. This is all irrelevant
>to the UTF-8S discussion - the point I'm trying to make is that indexing
>values in their original binary form as opposed to a transformation of that
>form is significantly beneficial for query performance.

What I meant to say is that, to get your primary keys in an order that
sorts with binary compares, you can encode the key in any form you want.
The API should deliver the data in the proper data encoding, which can be
different from the key encoding.
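
For example (a sketch only, with buffer sizing, validation, and error
handling omitted, and with a hypothetical helper name), the internal key
could be built in UTF-8, whose binary order is code point order, while
the record data stays UTF-16:

    #include <stdint.h>
    #include <stddef.h>

    /* Build a UTF-8 sort key from well-formed UTF-16 data.  The keys
       then order in code point order under a plain memcmp. */
    size_t make_utf8_key(const uint16_t *s, unsigned char *key)
    {
        size_t n = 0;
        while (*s) {
            uint32_t c = *s++;
            if (c >= 0xD800 && c <= 0xDBFF)          /* surrogate pair */
                c = 0x10000 + ((c - 0xD800) << 10) + (*s++ - 0xDC00);
            if (c < 0x80) {
                key[n++] = (unsigned char)c;
            } else if (c < 0x800) {
                key[n++] = (unsigned char)(0xC0 | (c >> 6));
                key[n++] = (unsigned char)(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                key[n++] = (unsigned char)(0xE0 | (c >> 12));
                key[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
                key[n++] = (unsigned char)(0x80 | (c & 0x3F));
            } else {
                key[n++] = (unsigned char)(0xF0 | (c >> 18));
                key[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
                key[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
                key[n++] = (unsigned char)(0x80 | (c & 0x3F));
            }
        }
        return n;
    }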

>>Point #6. UTF-EBCDIC is a serious problem. It does not work with all
>>EBCDIC codepages. Take the case of Japanese, where they have replaced
>>all of the lower case Latin characters with single-width katakana
>>characters. Try to transform to UTF-EBCDIC. It is, like UTF-8s, a
>>short-term solution to a problem that creates more problems later on.

>UTF-EBCDIC works reasonably well in Japanese EBCDIC environments, as long
>as one is careful. For example, if I take data in CCSID290 (Extended
>single-byte Japanese - katakana in place of lowercase Latin), and transcode
>it to UTF-EBCDIC via Unicode, all is well, as long as the programs using
>the encoding are truly using UTF-EBCDIC rules (they expect lowercase Latin
>characters to be in their invariant/CCSID1047 positions), not CCSID290
>rules. UTF-EBCDIC is a very practical solution to a problem that is common
>amongst many vendors, as is UTF-8S.

>>My problem is that Oracle calls UTF-8s UTF8 while UTF-8 is AL32UTF8. I
>>think that people should understand this is a short-term workaround
>>only. It should have !!!User Beware!!! (you had better know what you are
>>doing) written all over it. You also need to have a good migration plan
>>to phase out UTF-8s as soon as you can.

>This is simply an Oracle naming issue, nothing to do with the proposal for
>UTF-8S. Whether Oracle names it UTF8 or UTF-8 or AL16UTF8 is something
>proprietary to Oracle. Like any good vendor, they are very cautious about
>breaking existing implementations dependent on their system's behaviour
>(hmmm - sounds like a standards body we're all acutely aware of), and so
>will probably need to continue supporting their character set name "UTF8"
>to mean UTF-8S. Yes, they will need to document well the fact that
>their "UTF8" is not truly UTF-8, but is in fact "UTF-8S". Yes, they can
>most probably give notice of the renaming of their UTF8 character set in a
>future release. This naming is internal to Oracle and its users, and
>something that I trust they will address. It should not affect the
>discussion of what UTF-8S is, and how it will be used/misused.

I agree that this is an Oracle issue.

>>This is what we mean by an internal implementation that does not have to
>>be
>>and should not be sanctioned by UTC.

>I reiterate - UTF-8S is not something that will be used only by Oracle
>internally. PeopleSoft plans to use it in places where we need an 8-bit
>representation that sorts in binary equivalently to UTF-16. We'll use it
>in our COBOL. We'll use it in our memory cache. SAP has the same needs in
>their systems and databases (yes, they are a database vendor - see
>http://www.sapdb.org ). Yes, all these uses may be internal to each
>vendor, but as Uma has stated, internal representation leaks out. If any
>significant number of vendors are going to be using this encoding
>internally in their systems, wouldn't it make sense to have a UTR
>describing what this representation is, when it is useful, and how to deal
>with data presented to you in that encoding should the situation arise?

I guess the bottom line is that I still don't understand why you want to
retrieve data in UTF-8s format. If you want, you can fetch data as UTF-16
from Oracle, DB2, SQL Server, etc. COBOL has such poor string support that
UTF-16 is far easier to use than UTF-8s. This is especially true if you
have a transform that cannot be used outside of your own system. If it is
to do a transform to UTF-EBCDIC, maybe we should address that issue.

I see UTF-8s in the same light as Sun's implementation of non-Unicode wide
character support. If they had called it something other than wchar_t we
could deal with it. Now it makes cross-platform Unicode support that much
more difficult.

If it were significantly different from UTF-8 then it would be OK. But code
that works most of the time is a disaster. It is so close that I see UTF-8
and UTF-8s mixing. It is also difficult to detect the difference between
the two. It will require extra overhead to completely validate all the data
you receive. I can see people getting sloppy.
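
The only byte sequences that distinguish the two are the 3-byte encodings
of surrogate code units (0xED followed by a second byte in 0xA0..0xBF),
which are ill-formed in real UTF-8, so a receiver has to scan for them.
A minimal sketch of that check (the function name is just illustrative):

    #include <stddef.h>

    /* Return 1 if the buffer contains an encoded surrogate code unit,
       the signature of UTF-8s data; real UTF-8 never has one. */
    int looks_like_utf8s(const unsigned char *p, size_t len)
    {
        size_t i;
        for (i = 0; i + 1 < len; i++)
            if (p[i] == 0xED && p[i + 1] >= 0xA0 && p[i + 1] <= 0xBF)
                return 1;   /* encoded surrogate: ill-formed as UTF-8 */
        return 0;
    }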

How are you planning to handle UTF-32? Implementing UTF-32s?

>Toby.

>--
>Toby Phipps
>PeopleTools Product Manager - Global Technology
>PeopleSoft, Inc.
>tphipps@peoplesoft.com


