RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Wed Jun 06 2001 - 14:08:21 EDT


Mark,

I like the clever ICU technique for sorting in code point order.

U_CAPI int32_t U_EXPORT2
u_strcmpCodePointOrder(const UChar *s1, const UChar *s2) {
    static const UChar utf16Fixup[32]={
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0x2000, 0xf800, 0xf800, 0xf800, 0xf800
    };
    UChar c1, c2;
    int32_t diff;

    /* rotate each code unit's value so that surrogates get the
       highest values */
    for(;;) {
        c1=*s1;
        c1+=utf16Fixup[c1>>11]; /* additional "fix-up" line */
        c2=*s2;
        c2+=utf16Fixup[c2>>11]; /* additional "fix-up" line */

        /* now c1 and c2 are in UTF-32-compatible order */
        diff=(int32_t)c1-(int32_t)c2;
        if(diff!=0 || c1==0 /* redundant: || c2==0 */) {
            return diff;
        }
        ++s1;
        ++s2;
    }
}

The surrogates are shifted up to the high end of the sorting sequence, and
the code units above the surrogate range are shifted down. This is a very
low-overhead technique that might be included in the Unicode documentation.
Using this technique avoids the need for UTF-8s: with this type of compare,
UTF-16 (compared in code point order) has the same sorting sequence as UTF-8
and UTF-32. The code also preserves the UTF-16 data typing; UChar is an
unsigned 16-bit integer.
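
To make the rotation concrete, here is a small standalone sketch (my own
illustration built around the same table, not ICU code) showing what the
fix-up does to a few sample code units:

#include <stdio.h>
#include <stdint.h>

typedef uint16_t UChar; /* same width as ICU's UChar */

static const UChar utf16Fixup[32]={
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0x2000, 0xf800, 0xf800, 0xf800, 0xf800
};

int main(void) {
    /* 0x0000-0xD7FF are unchanged; 0xE000-0xFFFF wrap down to
       0xD800-0xF7FF; surrogates 0xD800-0xDFFF rise to 0xF800-0xFFFF,
       the top of the 16-bit range */
    static const UChar samples[]={0x0041, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0xFFFF};
    int i;
    for(i=0; i<6; ++i) {
        UChar c=samples[i];
        printf("%04X -> %04X\n", c, (UChar)(c+utf16Fixup[c>>11]));
    }
    return 0;
}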

If you did not want to preserve the unsigned 16-bit type, you could instead
add a correction factor to the surrogates to make them higher than
0x0000FFFF. This would also make them sort higher than the rest of the code
units, but I don't think it would have any less overhead.
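
That variant might look something like this (a sketch only, using an
arbitrary correction factor of 0x10000):

#include <stdint.h>

typedef uint16_t UChar;

/* widen each code unit to a signed 32-bit value, lifting the
   surrogates 0xD800-0xDFFF above 0xFFFF so they sort after
   every other code unit */
static int32_t widen(UChar c) {
    return (c>=0xd800 && c<=0xdfff) ? (int32_t)c+0x10000 : (int32_t)c;
}

int32_t u16cmpWidened(const UChar *s1, const UChar *s2) {
    for(;;) {
        int32_t c1=widen(*s1);
        int32_t c2=widen(*s2);
        if(c1!=c2 || *s1==0) {
            return c1-c2;
        }
        ++s1;
        ++s2;
    }
}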

The point is that there are techniques that "do the right thing" with very
little overhead and are faster than converting to UTF-32. All systems should
sort in standard Unicode code point order regardless of encoding; that way
everyone is reading from the same page.

Carl

Note that this code fragment is from ICU, which is open source. See
http://oss.software.ibm.com/icu/ for further details.

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
Behalf Of Carl W. Brown
Sent: Tuesday, June 05, 2001 11:09 AM
To: unicode@unicode.org
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

Mark,

Now I understand.

If they implement a UTF-16 strcmp function (the case-sensitive counterpart
of a UTF-16 strcasecmp/stricmp), you will get the same result as a UTF-8 or
UTF-32 compare. To me, this seems like the way to go.

Normally a strcmp function just loops through the strings, comparing them
character by character. If the loop checks for surrogates and compares
UTF-32 code points, you will always get the same result for all encodings:
the standard Unicode code point order.
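
A minimal sketch of such a loop (my own illustration, not any particular
product's code):

#include <stdint.h>

typedef uint16_t UChar;

/* read one code point, consuming a surrogate pair when one is
   present; an unpaired surrogate is returned as-is */
static uint32_t nextCodePoint(const UChar **p) {
    uint32_t c=*(*p)++;
    if(c>=0xd800 && c<=0xdbff) {
        uint32_t t=**p;
        if(t>=0xdc00 && t<=0xdfff) {
            ++*p;
            c=0x10000+((c-0xd800)<<10)+(t-0xdc00);
        }
    }
    return c;
}

/* compare two NUL-terminated UTF-16 strings in code point order */
int32_t u16cmpCodePoints(const UChar *s1, const UChar *s2) {
    for(;;) {
        uint32_t c1=nextCodePoint(&s1);
        uint32_t c2=nextCodePoint(&s2);
        if(c1!=c2 || c1==0) {
            return (int32_t)c1-(int32_t)c2;
        }
    }
}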

Ultimately this is the "do it right the first time" way of implementing
Unicode.

Carl

-----Original Message-----
From: Mark Davis [mailto:markdavis34@home.com]
Sent: Monday, June 04, 2001 9:23 PM
To: Carl W. Brown; unicode@unicode.org
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

Nobody has ever proposed binary compares between UTF-8 and UTF-16 strings.

The scenario is:

Client software uses UTF-16.

Database software uses UTF-8s.

Client wants to have string A < string B if and only if Database has A < B
(where A and B are in the respective client/database encodings).

The point of standardization (for those who favor it) is that you can then
properly tag the data in the database when transferring it between different
systems (instead of either incorrectly tagging it as UTF-8, or correctly
tagging it with a private name -- but one that other people don't
understand).

I don't think the companies in favor of UTF-8s are trying to avoid
supporting supplementary characters at all. They see it (rightly or wrongly)
as a way to solve a problem they have in this scenario without a performance
hit.

Mark

----- Original Message -----
From: "Carl W. Brown" <cbrown@xnetinc.com>
To: <unicode@unicode.org>
Sent: Monday, June 04, 2001 12:55
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

> Mark,
>
> I think that I am missing some point. From what I hear, the issue is that
> they want a way to support identical compares. The order itself is not
> important; what is important is that they collate the same.
>
> Point #1 - I don't understand why this is a standards issue. The way you
> build keys is an internal design issue. You can use BOCU or whatever.
>
> Point #2 - You cannot do binary compares between UTF-16 and UTF-8 keys.
> You must do one of the following:
>
> Use UTF-16 for all keys
> Use UTF-8 for all keys
> Convert all UTF-16 keys to UTF-8 for compares
> Convert all UTF-8 keys to UTF-16 for compares
>
> With either of the first two options there is no issue.
>
> For the last two options you must convert. And if you look at the total
> conversion overhead, converting a pair of UTF-16 surrogates to a single
> 4-byte UTF-8 character (or a 4-byte UTF-8 character to a pair of UTF-16
> surrogates) is likely to be less overhead than converting two UCS-2
> characters to two 3-byte UTF-8s characters.
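
To make the two conversions concrete, here is a rough sketch (my own
illustration, not code from any of the products discussed):

#include <stdint.h>

/* plain UTF-8: a surrogate pair becomes one 4-byte sequence */
static int pairToUtf8(uint16_t lead, uint16_t trail, uint8_t *out) {
    uint32_t cp=0x10000+(((uint32_t)lead-0xd800)<<10)+(trail-0xdc00);
    out[0]=(uint8_t)(0xf0|(cp>>18));
    out[1]=(uint8_t)(0x80|((cp>>12)&0x3f));
    out[2]=(uint8_t)(0x80|((cp>>6)&0x3f));
    out[3]=(uint8_t)(0x80|(cp&0x3f));
    return 4;
}

/* UTF-8s: each surrogate code unit becomes its own 3-byte sequence,
   so U+10000 comes out as ED A0 80 ED B0 80 instead of F0 90 80 80 */
static int unitToUtf8s(uint16_t c, uint8_t *out) {
    out[0]=(uint8_t)(0xe0|(c>>12));
    out[1]=(uint8_t)(0x80|((c>>6)&0x3f));
    out[2]=(uint8_t)(0x80|(c&0x3f));
    return 3;
}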
>
> __________________________________________________________________________
>
> This leaves me very confused as to the reason for requesting UTF-8s. The
> other reason that comes to mind is the "red herring" reason. If they told
> you the real reason you would never approve it.
>
> I know you are familiar with the efforts to upgrade ICU to support UTF-16.
> It was not easy, and some of the situations were very subtle. One obvious
> problem is the issue of what a character is. The nice 1-to-1 mapping of
> UCS-2 is gone; UTF-16 is now just another MBCS with all of its inherent
> problems.
>
> It becomes very tempting for a developer whose software may not be as
> well organized as ICU's to foist the problem of UTF-16 back on the user
> and the OS by ignoring surrogates altogether. If they support UTF-8 then
> they have a problem, because they cannot just ignore surrogates. If the
> Unicode Consortium legitimizes UTF-8s then they can make it someone
> else's problem. It puts them in a position to compel others to add
> UTF-8s support because it is a sanctioned form of Unicode.
>
> __________________________________________________________________________
>
> If you endorse UTF-8s then please set up some restrictions as to its use.
>
> 1) All interfaces supporting UTF-8s must also support UTF-8.
>
> 2) All data passed to an interface or stored by a system using UTF-8s must
> be retrievable in UTF-8 format.
>
> 3) All data stored in UTF-8s must be retrievable with UTF-8 keys.
>
> If this is not done you will end up bifurcating UTF-8 use. If I buy one
> component using UTF-8 and another using UTF-8s, the end user will have a
> real mess on their hands, converting back and forth and dealing with
> Unicode in two sorting sequences depending on the interface.
>
> We might as well be asking the user to work in code pages again. It is
> like designing applications that are required to support both Shift-JIS
> and EUC-JP simultaneously.
>
> Carl
>
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Mark Davis
> Sent: Monday, June 04, 2001 8:47 AM
> To: DougEwell2@cs.com; unicode@unicode.org
> Cc: Peter_Constable@sil.org
> Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
>
>
> I am not, myself, in favor of UTF-8s. However, I do want to point out a
> few things.
>
> 1) Normalization does not particularly favor one side or the other.
>
> A binary compare is used because of performance, typically when you don't
> care about the internal ordering from an international perspective (such
> as a B-tree for file systems). It does not prevent you from later imposing a
> localized sort order (e.g. when the files are displayed in a window, they
> can be sorted by name (or date, or author, etc) at that time).
>
> For performance reasons, in that case it is simply not a good idea to do
> normalization when you compare. You are choosing a binary compare simply
> because it is a fast, well-defined comparison operation, and invoking
> normalization at comparison time will defeat one of the goals. While
> normalization at comparison can be pretty fast (only taking the slow path
> when the quick check fails, as described in UAX #15), it will never be
> anywhere near as fast as a binary compare.
>
> The best practice for that case is to enforce normalization on data fields
> *when the text is inserted in the field*. If one does, then canonical
> equivalents will compare as equal, whether they are encoded in UTF-8,
> UTF-8s, or UTF-16 (or, for that matter, BOCU).
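
With ICU, for example, enforcing that could be a one-call wrapper applied
at insertion time (a sketch only; error handling is elided and the
surrounding storage API is assumed):

#include <unicode/unorm.h>

/* normalize to NFC once, when the field is written, so that later
   binary compares treat canonical equivalents identically no matter
   which encoding form the store uses */
int32_t normalizeForStorage(const UChar *src, int32_t srcLen,
                            UChar *dst, int32_t dstCap) {
    UErrorCode status=U_ZERO_ERROR;
    int32_t len=unorm_normalize(src, srcLen, UNORM_NFC, 0,
                                dst, dstCap, &status);
    return U_FAILURE(status) ? -1 : len;
}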
>
> 2) Auto-detection does not particularly favor one side or the other.
>
> UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a
> supplementary character expressed with two 3-byte values, you know you do
> not have pure UTF-8. If you ever encounter a supplementary character
> expressed with a 4-byte value, you know you don't have pure UTF-8s. If you
> never encounter either one, why does it matter? Every character you read
> is valid and correct.
>
> Auto-detection works on the basis of statistical probability. With
> sufficient non-ASCII characters, the chance that text obeys the UTF-8 byte
> restrictions and is not UTF-8 is very low (see Martin Duerst's messages on
> this from some time ago*). Essentially the same is true of UTF-8s.
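
A sketch of what such a check could look like (my illustration; the
function name is made up):

#include <stddef.h>
#include <stdint.h>

/* returns 1 if a 4-byte lead (F0-F4, pure UTF-8's encoding of
   supplementary characters) is seen, 2 if an encoded surrogate
   (ED followed by A0-BF, the UTF-8s form) is seen, and 0 if
   neither occurs -- in which case either label fits the data */
int detectSupplementaryForm(const uint8_t *b, size_t n) {
    size_t i;
    for(i=0; i<n; ++i) {
        if(b[i]>=0xf0 && b[i]<=0xf4) {
            return 1;
        }
        if(b[i]==0xed && i+1<n && b[i+1]>=0xa0) {
            return 2;
        }
    }
    return 0;
}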
>
> Mark
>
> * Martin, it'd be nice to resurrect your note into one of the Unicode FAQs.
>
> ----- Original Message -----
> From: <DougEwell2@cs.com>
> To: <unicode@unicode.org>
> Cc: <Peter_Constable@sil.org>
> Sent: Monday, June 04, 2001 00:10
> Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
>
>
> > In a message dated 2001-06-03 18:04:17 Pacific Daylight Time,
> > Peter_Constable@sil.org writes:
> >
> > > It would seem to me that there's another issue that has to be taken
> > > into consideration here: normalisation. You can't just do a simple
> > > sort using raw binary comparison; you have to normalise strings before
> > > you compare them, even if the comparison is a binary compare.
> >
> > I would be surprised if that has even been considered. Normalization is
> > one of those fine details of Unicode, like directionality and character
> > properties, that may be completely unknown to a development team that
> > thinks the strict binary order of UTF-16 code points makes a suitable
> > collation order. This is a sign of a company or development team that
> > thinks Unicode support is a simple matter of handling 16-bit characters
> > instead of 8-bit.
> >
> > While we are at it, here's another argument against the existence of
> > both UTF-8 and this new UTF-8s. Recently there was a discussion about
> > the use of the U+FEFF signature in UTF-8 files, with a fair number of
> > Unicode experts arguing against its necessity because UTF-8 is so easy
> > to detect heuristically. Without reopening that debate, it is worth
> > noting that UTF-8s could not be distinguished from UTF-8 by that
> > technique. By definition D29, UTF-8s must support encoding of unpaired
> > surrogates (as UTF-8 already does), and thus a UTF-8s sequence like
> > ED A0 80 ED B0 80 could ambiguously represent either the two unpaired
> > surrogates U+D800 U+DC00 or the legitimate Unicode code point U+10000.
> > Such a sequence -- the only difference between UTF-8 and UTF-8s --
> > could appear in either encoding, but with different interpretations,
> > so auto-detection would not work.
> >
> > Summary: UTF-8s is bad.
> >
> > -Doug Ewell
> > Fullerton, California
> >


