RE: UTF-8 Syntax

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Sun Jun 10 2001 - 19:09:51 EDT


Toby,

In the case of strcmp the problem is that it won't even work on
UCS-2: it detects the end of the string at the first single byte
0x00. You have to use a special Unicode compare routine, and that
routine in turn needs to be fixed to produce proper compares. Most
likely your problem with COBOL is that it is not UTF-16 enabled, so
it does UCS-2 compares.
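
To make this concrete, here is a minimal C sketch (the type and
function names are mine, purely for illustration). strcmp() walks
bytes, so it truncates UCS-2/UTF-16 data at the first 0x00 byte; a
16-bit code unit compare fixes that, and the "fixed" routine below
additionally reorders the surrogate range so that strings compare in
code point order:

    /* UCS-2/UTF-16 code unit; strings are 0x0000-terminated. */
    typedef unsigned short UChar;

    /* strcmp() is hopeless here: the UCS-2 string "AB" is the byte
       sequence 41 00 42 00 (little-endian), so strcmp() sees the
       one-byte string "A" and stops at the embedded 0x00. */

    /* Naive UCS-2 compare: binary order over 16-bit units.  Not
       surrogate-aware, so UTF-16 data does not compare in code
       point order. */
    int ucs2cmp(const UChar *a, const UChar *b)
    {
        while (*a && *a == *b) { a++; b++; }
        return (int)*a - (int)*b;
    }

    /* Fix-up: slide the surrogate block 0xD800-0xDFFF above
       0xE000-0xFFFF so that supplementary characters sort after
       all BMP characters, i.e. in code point order. */
    static unsigned fixup(UChar c)
    {
        if (c >= 0xE000) return c - 0x800;
        if (c >= 0xD800) return c + 0x2000;
        return c;
    }

    /* Fixed UTF-16 compare: produces code point order. */
    int utf16cmp(const UChar *a, const UChar *b)
    {
        while (*a && *a == *b) { a++; b++; }
        return (int)fixup(*a) - (int)fixup(*b);
    }

The fix-up is the standard trick for code point order compares;
libraries such as ICU take the same approach.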

It is easier to use the old UCS-2 code with UTF-16 data because few people
have added true UTF-16 support.

The problem is that today's circumventions will lead to bigger headaches
later on.

I agree with your point #1, but are Microsoft & IBM committed to
following Oracle?

The logic of your point #2 is circular. Just because old code
designed to support UCS-2 may be used with UTF-16 does not mean that
it is doing the right thing. Where in the Unicode documentation does
it endorse UTF-16 binary sorting order as a valid Unicode sort order?
I cannot find it anywhere.
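
The root of the problem is where the surrogate range sits. Reusing
the UChar type and ucs2cmp() from the sketch above (the characters
are chosen by me for illustration):

    /* Reusing UChar and ucs2cmp() from above. */
    int demo(void)
    {
        UChar halfwidth_stop[] = { 0xFF61, 0x0000 };         /* U+FF61  */
        UChar supplementary[]  = { 0xD800, 0xDC00, 0x0000 }; /* U+10000 */

        /* 0xD800 < 0xFF61, so ucs2cmp() reports U+10000 < U+FF61
           even though 0x10000 > 0xFF61 as code points.  UTF-8 and
           UTF-32 binary order give the opposite answer.  Neither
           binary order is sanctioned anywhere as a collation. */
        return ucs2cmp(supplementary, halfwidth_stop);       /* < 0 */
    }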

Your point #3 supports not having UTF-8s. Good systems that support
both ASCII & EBCDIC have special logic to ensure that they compare
the same way. Thus if I have an AS/400 or IBM mainframe server
running EBCDIC, it must sort the same way as my workstations.

I can see from point #4 that you have not worked on database
internals. Primary key structures are usually separate instances of
the key data. That is not true of structures like linked-list files,
but modern databases are too sophisticated to share key and data
records. They can use any encoding. For example, I can sort EBCDIC
records in ASCII order and still use binary compares by building
internal keys that compare correctly.
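
Here is a sketch in C of the internal-key technique I mean (the
translate table is a placeholder; a real one comes from the collation
definition). The engine stores a transformed key alongside each
record so that a plain memcmp() on the keys yields the desired order:

    #include <stddef.h>

    /* Maps each EBCDIC byte to its rank in the desired (ASCII)
       collation; the real table depends on the codepage in use. */
    extern const unsigned char ebcdic_ascii_rank[256];

    /* Build an index key from an EBCDIC field so that a plain
       memcmp() on keys sorts the records in ASCII order. */
    void build_key(unsigned char *key, const unsigned char *field,
                   size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            key[i] = ebcdic_ascii_rank[field[i]];
    }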

Point #5 is debatable. The Oracle code was not designed to support
UTF-8s. It was designed to support UCS-2 to UTF-8 conversions. The
implementation accidentally produces UTF-8s. However, because the
current Oracle code is not designed to support surrogates, users can
go either way. Oracle can just as easily fix the UTF-8 support to
add surrogate processing as put out illegitimate UTF-8 sequences. It
is merely a matter of what they call each encoding. They could have
UTF8 and AL16UTF8; it would be a simple matter of renaming the two.
Current databases should not have any surrogate-encoded data, so
migration is simple.
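
The "accident" is easy to see in code. A converter written against
UCS-2 transforms one 16-bit unit at a time; feed it UTF-16 and each
half of a surrogate pair becomes its own three-byte sequence, which
is exactly UTF-8s. A sketch of mine (reusing the UChar typedef from
above), not Oracle's actual code:

    #include <stddef.h>

    /* Written for UCS-2, where every 16-bit unit is a complete
       character.  Run UTF-16 through it and each surrogate unit is
       emitted as its own three-byte sequence: U+10000 (the pair
       0xD800 0xDC00) comes out as ED A0 80 ED B0 80 -- six bytes
       of UTF-8s -- instead of the correct four-byte UTF-8 form
       F0 90 80 80. */
    size_t ucs2_unit_to_utf8(UChar c, unsigned char *out)
    {
        if (c < 0x80) {
            out[0] = (unsigned char)c;
            return 1;
        }
        if (c < 0x800) {
            out[0] = (unsigned char)(0xC0 | (c >> 6));
            out[1] = (unsigned char)(0x80 | (c & 0x3F));
            return 2;
        }
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }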

Point #6. UTF-EBCDIC is a serious problem. It does not work with all
EBCDIC codepages. Take the case of Japanese, where all of the
lowercase Latin characters have been replaced with single-width
katakana characters. Try to transform that to UTF-EBCDIC. Like
UTF-8s, it is a short-term solution to a problem that creates more
problems later on.

However, I am sympathetic to your problem. It will take a while for
products to support UTF-16 properly. In the meantime you have to
produce products that work. It is not reasonable to expect you to
write your own COBOL compiler just to get UTF-16 compares to work.

My problem is that Oracle calls UTF-8s "UTF8" while true UTF-8 is
AL32UTF8. I think that people should understand this is a short-term
workaround only. It should have "!!!User Beware!!! You had better
know what you are doing" written all over it. You also need a good
migration plan to phase out UTF-8s as soon as you can.

I would also have no problem if Oracle implemented AL16UTF8 as a data type
but only allowed users to read and write legitimate UTF-8 data. In most
cases I suspect that you will be reading the data as UTF-16 anyway.

This is what we mean by an internal implementation that does not have to be
and should not be sanctioned by UTC.

Carl

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
On Behalf Of toby_phipps@peoplesoft.com
Sent: Friday, June 08, 2001 4:11 PM
To: unicode@unicode.org
Cc: unicore@unicode.org
Subject: Re: UTF-8 Syntax

As one of the proponents of the UTF-8S proposal, I feel compelled to
respond to some of the recent comments regarding the proposal on the
unicode and unicore lists. Although there have been some good comments
about how the goals of the proposal could be accomplished without a new
encoding form, there have also been numerous arguments against UTF-8S
varying from the simply unprofessional (the WTF thread) through to the
blatantly false (encoding doesn't imply a collation). Let me address
each of the comments individually. There's been a lot of talk about the
UTF-8S proposal on both the unicode and unicore lists, so please
forgive me (and notify me if you feel the need) if I have missed any of
the salient points that require a response.

--
Toby Phipps
PeopleTools Product Manager - Global Technology
PeopleSoft, Inc.
tphipps@peoplesoft.com

1. UTF-8S doesn't need to be "accepted" or "approved" by the UTC, as its use is within a proprietary, closed system.

Nothing could be further from the truth. Just look at which companies are pushing the proposal (Oracle, SAP, PeopleSoft). These organizations all share the same technological issue, but are also direct competitors. We share a common technology - that of large SQL databases, and in the case of PeopleSoft and SAP, heterogeneity across many different SQL databases. We need a commonly understood UTF-8 encoding that can be used as a database encoding, an in-memory encoding and other "internal" forms, but at the same time, passed between systems from different vendors. PeopleSoft and SAP support a range of database platforms, including Oracle, Microsoft SQL Server, and IBM DB2. Communication between *applications* from one vendor to a *database* from another vendor is not a closed system.

2. An encoding form does not imply a collation

False. The most basic collation in any system is the binary order of the codepoints in their current encoding. That's what C gives you with the strcmp() function, what COBOL gives you with " > ", and what Java gives you with its basic string classes. Even though the binary collation of each Unicode transformation makes no linguistic sense, developers all over the world make use of binary-collation string comparisons to optimize code, especially when dealing with huge volumes of data. Just looking at PeopleSoft's tens of millions of lines of code, the great majority of our collation-dependent comparisons (e.g. comparisons returning more information than simple equivalence) are used for performance and optimization.

There are most definitely cases where we need a linguistic comparison, and we have the appropriate syntax in each of our languages (except COBOL) to deal with this. However, these cases are rare, and typically the developer is aware that they are performing a collation whose result will be visible to the user and therefore needs to be in linguistic order.

Given the proliferation of UTF-16-based programming languages (Java, Microsoft Win32 C/C++, increasing numbers of non-Win32 C compilers), the combination of a UTF-16-based database client communicating with a UTF-8-based database server is common. Without UTF-8S (and UTF-32S to a lesser extent) as a database encoding, creating a single, portable database client in a UTF-16-based language environment that can operate against a database backend encoded in any of the Unicode transforms would be very difficult. Introducing an alternative database encoding along the lines of UTF-8S would allow the same UTF-16-based client application to operate against either a UTF-8S or UTF-16 database without change.
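
To make this concrete (my example, with the byte values written out by hand): a UTF-16 client comparing 16-bit units and a server comparing bytes with memcmp() will disagree over true UTF-8 but agree over UTF-8S:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* U+FF61 vs U+10000 in the candidate server encodings. */
        unsigned char u8_ff61[]   = { 0xEF, 0xBD, 0xA1 };
        unsigned char u8_10000[]  = { 0xF0, 0x90, 0x80, 0x80 };
        unsigned char u8s_10000[] = { 0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80 };

        /* A UTF-16 client sees 0xD800 < 0xFF61: U+10000 sorts first.
           A UTF-8 server sees 0xF0 > 0xEF: U+10000 sorts last, so
           client and server disagree.  A UTF-8S server sees
           0xED < 0xEF: U+10000 sorts first, matching the client. */
        printf("UTF-8 : %d\n", memcmp(u8_10000,  u8_ff61, 3)); /* > 0 */
        printf("UTF-8S: %d\n", memcmp(u8s_10000, u8_ff61, 3)); /* < 0 */
        return 0;
    }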

3. Vendors can't expect that other encodings collate the same in binary, so why expect this of the Unicode transforms.

This is true. We can't expect most other encodings to compare the same in binary. This often leads us to the situation where we only support servers and clients that share the same encoding. Before we supported Unicode, with a couple of exceptions (EBCDIC being one), this was the case at PeopleSoft - we required our servers and clients to share the same encoding. In reality, this wasn't a big deal for our customer base - there was very little utility in running a server in ISO 8859-2 and a client in ISO 8859-1. Only the lower 7 bits represent common characters (and were therefore usable), so the system may as well have been running in 7-bit ASCII. Where this did hurt was with the CJK encodings. We don't support running a Shift-JIS client against an EUC-JP database server. Binary collation is just one reason; expansion/contraction of character lengths is another. The implementation of Unicode across our systems fixed most of this problem. We all changed our database column size quantities to be character-based, not byte-based, so the character length issue went away, and until real surrogates appeared on the scene with Unicode 3.1, we could rely on a common binary collation between client and server tiers.

4. A database should be able to provide sorted output in any collation, not just the binary collation of its encoding

True. However, for most SQL databases (at least those that use sorted b-tree indexes, such as Oracle, Microsoft SQL Server, Sybase, DB2/UDB etc.), it is much, much faster and more efficient to provide data collated in the binary order of the database's encoding than in any other collation. Why? Because column indexes are stored on-disk in binary-sorted order. To return a pre-ordered result set for a SQL query, the database simply has to do what's known as an "index-only scan". In this case, the values returned in the result set are read directly from the index, and the actual data blocks don't need to be fetched.
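
The mechanics in a toy C sketch (the page layout is invented purely for illustration): index pages hold fixed-width keys pre-sorted in the encoding's binary order, so a pre-ordered range scan is a binary search plus a sequential walk that never touches the row data:

    #include <string.h>

    #define KEY_LEN 16

    /* Toy index page: nkeys fixed-width keys, stored pre-sorted in
       memcmp() order (i.e. the encoding's binary order). */
    typedef struct {
        int           nkeys;
        unsigned char keys[256][KEY_LEN];
    } IndexPage;

    /* First position whose key is >= probe; keys[pos..] can then be
       streamed out already sorted -- an "index-only scan". */
    int index_lower_bound(const IndexPage *p,
                          const unsigned char probe[KEY_LEN])
    {
        int lo = 0, hi = p->nkeys;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (memcmp(p->keys[mid], probe, KEY_LEN) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }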

Of course, just about every database allows the result set to be in a collation other than the binary sort of the database's encoding. There are several ways of doing this. One is to sort data in temporarily-allocated memory. This is incredibly inefficient, not only because significant amounts of temporary space need to be allocated and freed, but also because the entire result set of the query has to be processed and sorted before the first row is returned. With result sets involving several million rows, this is a very significant overhead, especially if the typical user only looks at the first couple of hundred. So, some vendors allow the creation of additional indexes, sorted by a weighted collation key of the original value. This works well in practice; however, it still doesn't allow for "index-only scans" as in the binary collation example, as the index stores only the numerical collation key, and not the actual value. After fetching the row from the sorted index, the database must then fetch the actual data from the data block.

Given this architecture (which is common across many SQL database platforms), the most efficient way to encode the database is to use an encoding where the binary representation of the data on-disk matches the collation expected most often by the database's clients. In the case of a database with many UTF-16 clients, a database encoding of UTF-16 or UTF-8S makes sense.

5. Oracle is pushing this proposal as it makes it easier for them to support surrogates without changing their architecture

False. Oracle already supports UTF-8S (called UTF8 in their engine for historical reasons), true UTF-8, and UTF-16, all as core database encodings. Oracle gains little from having the UTF-8S encoding accepted as a UTR other than a simple nomenclature to describe one of its supported encodings. It is the large-scale users of Oracle Unicode databases such as SAP and PeopleSoft who are strongly encouraging them to get a common industry acceptance of the UTF-8S transformation, for several reasons.

- We believe we won't be the only vendors to have the requirement of equivalent binary sorts across different Unicode encodings. Ignoring non-BMP characters, we have this equivalence now, and I can confidently guess that the majority of database-based Unicode systems today aren't using non-BMP characters in their systems, so their reliance on equivalent binary sorting has not yet become acutely obvious.

- We need some well-known way of describing the encoding of data in the database. This is important for discussions with our customers, documentation and technical architecture disclosures. Without an accepted name such as UTF-8S, we'll be forever talking about the fact that our internal data representation is "like UTF-8, but with individually encoded surrogate pairs". Why do people need to know what our internal database representation is? Because we'll be speaking it over database APIs (e.g. PeopleSoft applications to a host Oracle database). Application developers will see it in-memory when they use our debugging tools. It may "leak" into debug or trace files when things go wrong.

6. The UTF-8S proposal is asking for a "quasi-standard" acceptance which we haven't seen before

False. The Unicode Consortium publishes the Unicode Standard (TUS) and several Unicode Standard Annexes (UAX) which comprise TUS. These are standardized components, and share components (such as the UTF-8 transformation and the code allocations) with ISO 10646. In addition to TUS, the Unicode Consortium publishes Unicode Technical Reports (UTR). UTRs are intended to make life easier for implementors of TUS by providing common techniques for character representation, encoding, collation and more. There is absolutely no requirement for anyone to implement any component of a UTR in order to claim compliance with TUS. They are for guidance only.

We are proposing UTF-8S as the topic of a UTR. As such, there is no compulsion for any implementor of Unicode to support such an encoding. There is nothing compelling the encoding to be registered in the IANA registry or to be recognized by a web browser or XML parser. All we are asking is that the form of such an encoding be published and recognized, so it can be referred to and used by implementors of the Unicode Standard who share the need for equivalent binary collation that we have identified, a need that is not specific to one organization.

This is very similar to the acceptance of UTF-EBCDIC as UTR #16. PeopleSoft is a big user of UTF-EBCDIC. We use it in our COBOL when it's running on an EBCDIC platform. We use it in trace files and dump files on our EBCDIC platforms. Do we expect it to be recognized in HTML? No. XML? No. The same is true for UTF-8S.


