RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Jun 04 2001 - 15:55:11 EDT


Mark,

I think that I am missing some point. From what I hear, the issue is that
they want a way to support identical compares. The particular order is not
important; what is important is that UTF-8 and UTF-16 data collate the same.

Point #1 - I don't understand why this is a standards issue. The way you
build keys is an internal design issue. You can use BOCU or whatever.

Point #2 - You cannot do binary compares between UTF-16 and UTF-8 keys.
You must do one of the following:

        Use UTF-16 for all keys
        Use UTF-8 for all keys
        Convert all UTF-16 keys to UTF-8 for compares
        Convert all UTF-8 keys to UTF-16 for compares

For either of the first two cases there is no issue.

For the last two you must convert. Compared with the total overhead of
converting two UCS-2 code units into two three-byte UTF-8s sequences, it is
likely cheaper to convert a pair of UTF-16 surrogates into a single
four-byte UTF-8 character, or a four-byte UTF-8 character into a pair of
UTF-16 surrogates.
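
To illustrate, here is a minimal C sketch of the direct conversion (my own
illustration, with a made-up function name), assuming the caller has
already validated the surrogate pair:

    /* Convert a validated UTF-16 surrogate pair directly into one
     * 4-byte UTF-8 sequence. Assumes hi is in 0xD800..0xDBFF and
     * lo is in 0xDC00..0xDFFF; validation is the caller's job. */
    #include <stdint.h>

    static int surrogate_pair_to_utf8(uint16_t hi, uint16_t lo,
                                      uint8_t out[4])
    {
        /* Recombine the pair into a code point in U+10000..U+10FFFF. */
        uint32_t cp = 0x10000u +
            (((uint32_t)(hi - 0xD800u) << 10) | (lo - 0xDC00u));

        out[0] = (uint8_t)(0xF0 | (cp >> 18));           /* lead byte */
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));  /* continuations */
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }

The UTF-8s route would instead run the three-byte encoder twice, once per
surrogate, which is the extra overhead being weighed above.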

__________________________________________________________________________

This leaves me very confused as to the reason for requesting UTF-8s. The
other reason that comes to mind is the "red herring" reason: if they told
you the real reason, you would never approve it.

I know you are familiar with the efforts to upgrade ICU to support UTF-16.
It was not easy, and some of the situations were very subtle. One obvious
problem is the issue of what a character is. The nice 1-to-1 mapping of
UCS-2 is gone; UTF-16 is now just another MBCS with all of its inherent
problems.

It becomes very tempting for a developer whose software systems may not be
as well organized as ICU's to foist the problem of UTF-16 back on the user
and the OS by ignoring surrogates altogether. If they support UTF-8 then
they have a problem, because they cannot just ignore surrogates. If the
Unicode Consortium legitimizes UTF-8s, then they can make it someone else's
problem. It puts them in a position to compel others to add UTF-8s support
because it is a sanctioned form of Unicode.

__________________________________________________________________________

If you endorse UTF-8s, then please set up some restrictions on its use:

1) All interfaces supporting UTF-8s must also support UTF-8.

2) All data passed to an interface or stored by a system using UTF-8s must
be retrievable in UTF-8 format.

3) All data stored in UTF-8s must be retrievable with UTF-8 keys.

If this is not done you will end up bifurcating UTF-8 use. If I buy one
component using UTF-8 and another using UTF-8s, the end user will have a
real mess on their hands, converting back and forth and dealing with
Unicode in two sorting sequences depending on the interface.

We might as well be asking the user to work in code pages again. It is like
designing applications that are required to support both Shift-JIS and
EUC-JP simultaneously.

Carl

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of Mark Davis
Sent: Monday, June 04, 2001 8:47 AM
To: DougEwell2@cs.com; unicode@unicode.org
Cc: Peter_Constable@sil.org
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

I am not, myself, in favor of UTF-8s. However, I do want to point out a few
things.

1) Normalization does not particularly favor one side or the other.

A binary compare is used because of performance, typically when you don't
care about the internal ordering from an international perspective (such as
a B-Tree for file systems). It does not prevent you from later imposing a
localized sort order (e.g. when the files are displayed in a window, they
can be sorted by name (or date, or author, etc) at that time).

For performance reasons, in that case it is simply not a good idea to do
normalization when you compare. You are choosing a binary compare precisely
because it is a fast, well-defined comparison operation, and invoking
normalization at comparison time defeats that goal. While normalization at
comparison time can be fairly fast (take the slow path only when the quick
check described in UAX #15 fails), it will never be anywhere near as fast
as a binary compare.

The best practice for that case is to enforce normalization on data fields
*when the text is inserted into the field*. If one does, then canonical
equivalents will compare as equal, whether they are encoded in UTF-8,
UTF-8s, or UTF-16 (or, for that matter, BOCU).
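
To make that concrete, here is a minimal C sketch of normalize-on-insert,
written against ICU4C's unorm2 API (which postdates this message and is
used purely for illustration); make_key and keys_equal are hypothetical
names:

    #include <unicode/unorm2.h>
    #include <unicode/ustring.h>

    /* Normalize text to NFC once, when it is inserted into the field.
     * Returns the NFC length in UChars, or -1 on failure. */
    int32_t make_key(const UChar *src, int32_t srcLen,
                     UChar *dst, int32_t dstCap)
    {
        UErrorCode status = U_ZERO_ERROR;
        const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
        int32_t len = unorm2_normalize(nfc, src, srcLen, dst, dstCap,
                                       &status);
        return U_FAILURE(status) ? -1 : len;
    }

    /* With keys normalized up front, comparison stays a pure binary
     * (code unit) compare, keeping the fast path fast. */
    int keys_equal(const UChar *a, int32_t aLen,
                   const UChar *b, int32_t bLen)
    {
        return aLen == bLen && u_memcmp(a, b, aLen) == 0;
    }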

2) Auto-detection does not particularly favor one side or the other.

UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a
supplementary character expressed with two 3-byte values, you know you do
not have pure UTF-8. If you ever encounter a supplementary character
expressed with a 4-byte value, you know you don't have pure UTF-8s. If you
never encounter either one, why does it matter? Every character you read is
valid and correct.
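
Here is a minimal C sketch of that check (my illustration; the names are
made up), assuming the input is otherwise well-formed:

    #include <stddef.h>

    enum utf8_kind { UNDECIDED, SAW_UTF8, SAW_UTF8S };

    /* Scan for the one construct that differs between the two forms:
     * a 0xF0..0xF4 lead byte starts a 4-byte UTF-8 supplementary
     * character, while 0xED followed by 0xA0..0xAF is a high surrogate
     * encoded as a 3-byte sequence, i.e. UTF-8s. */
    enum utf8_kind classify(const unsigned char *p, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (p[i] >= 0xF0 && p[i] <= 0xF4)
                return SAW_UTF8;
            if (p[i] == 0xED && i + 1 < n &&
                p[i + 1] >= 0xA0 && p[i + 1] <= 0xAF)
                return SAW_UTF8S;
        }
        return UNDECIDED;  /* no supplementary characters either way */
    }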

Auto-detection works on the basis of statistical probability. With
sufficient non-ASCII characters, the chance that text obeys the UTF-8 byte
restrictions and is not UTF-8 is very low (see Martin Duerst's messages on
this from some time ago*). Essentially the same is true of UTF-8s.

Mark

* Martin, it'd be nice to resurrect your note into one of the Unicode FAQs.

----- Original Message -----
From: <DougEwell2@cs.com>
To: <unicode@unicode.org>
Cc: <Peter_Constable@sil.org>
Sent: Monday, June 04, 2001 00:10
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

> In a message dated 2001-06-03 18:04:17 Pacific Daylight Time,
> Peter_Constable@sil.org writes:
>
> > It would seem to me that there's another issue that has to be taken
> > into consideration here: normalisation. You can't just do a simple
> > sort using raw binary comparison; you have to normalise strings
> > before you compare them, even if the comparison is a binary compare.
>
> I would be surprised if that has even been considered. Normalization
> is one of those fine details of Unicode, like directionality and
> character properties, that may be completely unknown to a development
> team that thinks the strict binary order of UTF-16 code points makes a
> suitable collation order. This is a sign of a company or development
> team that thinks Unicode support is a simple matter of handling 16-bit
> characters instead of 8-bit.
>
> While we are at it, here's another argument against the existence of
> both UTF-8 and this new UTF-8s. Recently there was a discussion about
> the use of the U+FEFF signature in UTF-8 files, with a fair number of
> Unicode experts arguing against its necessity because UTF-8 is so easy
> to detect heuristically. Without reopening that debate, it is worth
> noting that UTF-8s could not be distinguished from UTF-8 by that
> technique. By definition D29, UTF-8s must support encoding of unpaired
> surrogates (as UTF-8 already does), and thus a UTF-8s sequence like
> ED A0 80 ED B0 80 could ambiguously represent either the two unpaired
> surrogates U+D800 U+DC00 or the legitimate Unicode code point U+10000.
> Such a sequence -- the only difference between UTF-8 and UTF-8s --
> could appear in either encoding, but with different interpretations,
> so auto-detection would not work.
>
> Summary: UTF-8s is bad.
>
> -Doug Ewell
> Fullerton, California
>


