Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Mark Davis (markdavis34@home.com)
Date: Wed Jun 06 2001 - 23:47:47 EDT


Thanks. That's Markus's invention.

Mark

----- Original Message -----
From: "Carl W. Brown" <cbrown@xnetinc.com>
To: <unicode@unicode.org>
Sent: Wednesday, June 06, 2001 11:08
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

> Mark,
>
> I like the clever ICU technique for sorting in code point order.
>
> U_CAPI int32_t U_EXPORT2
> u_strcmpCodePointOrder(const UChar *s1, const UChar *s2) {
>     static const UChar utf16Fixup[32]={
>         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>         0x2000, 0xf800, 0xf800, 0xf800, 0xf800
>     };
>     UChar c1, c2;
>     int32_t diff;
>
>     /* rotate each code unit's value so that surrogates get the
>        highest values */
>     for(;;) {
>         c1=*s1;
>         c1+=utf16Fixup[c1>>11]; /* additional "fix-up" line */
>         c2=*s2;
>         c2+=utf16Fixup[c2>>11]; /* additional "fix-up" line */
>
>         /* now c1 and c2 are in UTF-32-compatible order */
>         diff=(int32_t)c1-(int32_t)c2;
>         if(diff!=0 || c1==0 /* redundant: || c2==0 */) {
>             return diff;
>         }
>         ++s1;
>         ++s2;
>     }
> }
>
> The surrogates are shifted up to the high end of the sorting sequence and
> the code points higher than the surrogates are shifted down. This is a very
> low overhead technique that might be included in the Unicode documentation.
> Using this technique avoids the need for UTF-8s. Using this type of compare
> means that UTF-16 (compared in codepoint order) has the same sorting
> sequence as UTF-8 and UTF-32. This code preserves the UTF-16 data typing.
> UChar is an unsigned 16 bit integer.
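The effect of the fixup table can be checked directly. This is a standalone sketch mirroring the quoted ICU fragment (the function name is invented for illustration):

```c
#include <stdint.h>

/* Standalone copy of the fixup step from the quoted ICU fragment, to
 * show the rotation: surrogates (0xD800..0xDFFF) gain 0x2000 and land
 * at 0xF800..0xFFFF, while code units 0xE000..0xFFFF gain 0xF800 and
 * wrap (mod 2^16) down to 0xD800..0xF7FF. */
static uint16_t rotate(uint16_t c) {
    static const uint16_t utf16Fixup[32] = {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0x2000, 0xF800, 0xF800, 0xF800, 0xF800
    };
    return (uint16_t)(c + utf16Fixup[c >> 11]);
}
```

After the rotation, an ordinary unsigned 16-bit compare yields code point order.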
>
> If you did not want to preserve the unsigned integer you could just add a
> correction factor to the surrogates to make them higher than 0x0000FFFF.
> This would also make them sort higher than the rest of the code points but I
> don't think it would have any less overhead.
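That variant might be sketched like this (illustrative only, not ICU code; the names are invented): widen each code unit to 32 bits and push surrogate code units above 0xFFFF before comparing.

```c
#include <stdint.h>

/* Sketch of the variant described above (not ICU code): widen each
 * UTF-16 code unit to a 32-bit value and add a correction factor to
 * surrogate code units (0xD800..0xDFFF) so they sort above all other
 * BMP code units, matching UTF-32 code point order. */
static int32_t fixup32(uint16_t c) {
    return (c >= 0xD800 && c <= 0xDFFF) ? (int32_t)c + 0x10000 : (int32_t)c;
}

static int32_t u16cmp_codepoint_order(const uint16_t *s1, const uint16_t *s2) {
    for (;;) {
        int32_t diff = fixup32(*s1) - fixup32(*s2);
        if (diff != 0 || *s1 == 0) {  /* mismatch or end of both strings */
            return diff;
        }
        ++s1;
        ++s2;
    }
}
```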
>
> The point is that there are techniques that are faster than converting to
> UTF-32 that add very little overhead and "do the right thing". All systems
> should sort in standard Unicode code point order regardless of encoding.
> This way everyone is reading from the same page.
>
> Carl
>
> Note this code fragment is from ICU. This is Open Source code. See
> http://oss.software.ibm.com/icu/ for further details.
>
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Carl W. Brown
> Sent: Tuesday, June 05, 2001 11:09 AM
> To: unicode@unicode.org
> Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
>
>
> Mark,
>
> Now I understand.
>
> If they implement a UTF-16 strcmp function that is a case-sensitive version
> of a UTF-16 strcasecmp(stricmp) you will get the same result as a UTF-8 or
> UTF-32 compare. To me, it seems like this is the way to go.
>
> Normally a strcmp function just loops through the strings comparing them
> character by character. If the loop checks for surrogates and compares
> UTF-32 code points you will always get the same result for all encodings:
> the standard Unicode code point order.
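The loop described above -- check for surrogates, combine pairs into UTF-32 code points, and compare those -- might look roughly like this (a sketch, not ICU's implementation; the helper name is invented):

```c
#include <stdint.h>

/* Sketch of a strcmp-style loop that compares NUL-terminated UTF-16
 * strings in code point order by combining valid surrogate pairs into
 * UTF-32 values before comparing (illustrative; not ICU code). */
static uint32_t next_code_point(const uint16_t **p) {
    uint32_t c = *(*p)++;
    if (c >= 0xD800 && c <= 0xDBFF && **p >= 0xDC00 && **p <= 0xDFFF) {
        /* combine a lead and trail surrogate into one code point */
        c = 0x10000 + ((c - 0xD800) << 10) + (*(*p)++ - 0xDC00);
    }
    return c;
}

static int u16cmp_utf32(const uint16_t *s1, const uint16_t *s2) {
    for (;;) {
        uint32_t c1 = next_code_point(&s1);
        uint32_t c2 = next_code_point(&s2);
        if (c1 != c2) {
            return c1 < c2 ? -1 : 1;
        }
        if (c1 == 0) {  /* both strings ended together */
            return 0;
        }
    }
}
```

An unpaired surrogate simply compares by its own code unit value, so malformed input still yields a deterministic order.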
>
> Ultimately this is the "do it right the first time" way of implementing
> Unicode.
>
> Carl
>
>
>
> -----Original Message-----
> From: Mark Davis [mailto:markdavis34@home.com]
> Sent: Monday, June 04, 2001 9:23 PM
> To: Carl W. Brown; unicode@unicode.org
> Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
>
>
> Nobody has ever proposed binary compares between UTF-8 and UTF-16 strings.
>
> The scenario is:
>
> Client software uses UTF-16.
>
> Database software uses UTF-8s.
>
> Client wants to have string A < string B if and only if Database has A < B.
> (where A and B are in the respective client/database encodings).
>
> The point of standardization (for those who favor it) is that you can then
> properly tag the data in the database when transferring it between different
> systems (instead of either incorrectly tagging it as UTF-8, or correctly
> tagging it with a private name -- but one that other people don't
> understand).
>
> I don't think the companies in favor of UTF-8s are trying to avoid
> supporting supplementary characters at all. They see it (rightly or wrongly)
> as a way to solve a problem they have in this scenario without a performance
> hit.
>
> Mark
>
> ----- Original Message -----
> From: "Carl W. Brown" <cbrown@xnetinc.com>
> To: <unicode@unicode.org>
> Sent: Monday, June 04, 2001 12:55
> Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
>
>
> > Mark,
> >
> > I think that I am missing some point. From what I hear the issue is that
> > they want a way to support identical compares. This order is not important.
> > What is important is that they collate the same.
> >
> > Point #1 - I don't understand why this is a standards issue. The way you
> > build keys is an internal design issue. You can use BOCU or whatever.
> >
> > Point #2 - You cannot do binary compares between UTF-16 and UTF-8 keys.
> > You must:
> >
> > Use UTF-16 for all keys
> > Use UTF-8 for all keys
> > Convert all UTF-16 keys to UTF-8 for compares
> > Convert all UTF-8 keys to UTF-16 for compares
> >
> > For one of the first two cases there is no issue.
> >
> > For the second two you must convert. If you look at the total conversion
> > overhead of converting two UCS-2 characters to two UTF-8s characters it is
> > likely to be less overhead to convert a pair of UTF-16 surrogates to a
> > single UTF-8 character or to convert a UTF-8 character to a pair of UTF-16
> > surrogates.
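For a sense of what that conversion costs, turning one surrogate pair into a single 4-byte UTF-8 character is only a few shifts and masks. This is an illustrative helper, not from any library:

```c
#include <stdint.h>

/* Illustrative sketch (not library code): convert one UTF-16 surrogate
 * pair into the single 4-byte UTF-8 sequence for the supplementary
 * code point it represents. Returns the number of bytes written (4),
 * or 0 if the inputs are not a valid lead/trail pair. */
static int surrogate_pair_to_utf8(uint16_t lead, uint16_t trail, uint8_t out[4]) {
    if (lead < 0xD800 || lead > 0xDBFF || trail < 0xDC00 || trail > 0xDFFF) {
        return 0;
    }
    /* reassemble the supplementary code point (U+10000..U+10FFFF) */
    uint32_t cp = 0x10000 + (((uint32_t)(lead - 0xD800) << 10) | (trail - 0xDC00));
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```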
> >
> >
> > __________________________________________________________________________
> >
> > This leaves me very confused as to the reason for requesting UTF-8s. The
> > other reason that comes to mind is the "red herring" reason. If they told
> > you the real reason you would never approve it.
> >
> > I know you are familiar with the efforts to upgrade ICU to support UTF-16.
> > It was not easy and some of the situations were very subtle. One obvious
> > problem is the issue of what is a character. The nice 1 to 1 mapping in
> > UCS-2 is gone. UTF-16 is now just another MBCS with all of its inherent
> > problems.
> >
> > It becomes very tempting for a developer whose software systems may not be
> > as well organized as ICU's to decide to foist the problem of UTF-16 back
> > on the user and the OS by ignoring surrogates altogether. If they support
> > UTF-8 then they have a problem because they cannot just ignore surrogates.
> > If the Unicode Consortium legitimizes UTF-8s then they can make it someone
> > else's problem. It puts them in a position to compel others to add UTF-8s
> > support because it is a sanctioned form of Unicode.
> >
> >
> > __________________________________________________________________________
> >
> > If you endorse UTF-8s then please set up some restrictions as to its use:
> >
> > 1) All interfaces supporting UTF-8s must also support UTF-8.
> >
> > 2) All data passed to an interface or stored by a system using UTF-8s must
> > be retrievable in UTF-8 format.
> >
> > 3) All data stored in UTF-8s must be retrievable with UTF-8 keys.
> >
> > If this is not done you will end up bifurcating UTF-8 use. If I buy one
> > component using UTF-8 and another using UTF-8s the end user will have a
> > real mess on their hands converting back and forth and dealing with
> > Unicode in two sorting sequences depending on the interface.
> >
> > We might as well be asking the user to work in code pages again. It is
> > like designing applications that are required to support both Shift-JIS
> > and EUC-JP simultaneously.
> >
> > Carl
> >
> > -----Original Message-----
> > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> > Behalf Of Mark Davis
> > Sent: Monday, June 04, 2001 8:47 AM
> > To: DougEwell2@cs.com; unicode@unicode.org
> > Cc: Peter_Constable@sil.org
> > Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
> >
> >
> > I am not, myself, in favor of UTF-8s. However, I do want to point out a
> > few things.
> >
> > 1) Normalization does not particularly favor one side or the other.
> >
> > A binary compare is used because of performance, typically when you don't
> > care about the internal ordering from an international perspective (such
> > as a B-Tree for file systems). It does not prevent you from later imposing
> > a localized sort order (e.g. when the files are displayed in a window,
> > they can be sorted by name (or date, or author, etc.) at that time).
> >
> > For performance reasons, in that case it is simply not a good idea to do
> > normalization when you compare. You are choosing a binary compare simply
> > because it is a fast, well-defined comparison operation. Invoking
> > normalization at comparison time will defeat one of those goals. While
> > normalization at comparison time can be pretty fast (only taking the slow
> > path when the quick check fails -- as described in UAX #15), it will never
> > be anywhere near as fast as a binary compare.
> >
> > The best practice for that case is to enforce normalization on data fields
> > *when the text is inserted in the field*. If one does, then canonical
> > equivalents will compare as equal, whether they are encoded in UTF-8,
> > UTF-8s, or UTF-16 (or, for that matter, BOCU).
> >
> > 2. Auto-detection does not particularly favor one side or the other.
> >
> > UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a
> > supplementary character expressed with two 3-byte values, you know you do
> > not have pure UTF-8. If you ever encounter a supplementary character
> > expressed with a 4-byte value, you know you don't have pure UTF-8s. If you
> > never encounter either one, why does it matter? Every character you read
> > is valid and correct.
> >
> > Auto-detection works on the basis of statistical probability. With
> > sufficient non-ASCII characters, the chance that text obeys the UTF-8 byte
> > restrictions and is not UTF-8 is very low (see Martin Duerst's messages on
> > this from some time ago*). Essentially the same is true of UTF-8s.
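The non-overlap argument above can be sketched as a byte scan (illustrative, not a full UTF-8 validator; the names are invented): a 4-byte lead (0xF0..0xF4) rules out pure UTF-8s, while 0xED followed by 0xA0..0xBF encodes a surrogate in 3 bytes and rules out pure UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch (not a full validator): classify a byte buffer
 * by scanning for the two mutually exclusive markers. 0xF0..0xF4
 * starts a 4-byte sequence (a supplementary character in UTF-8), so
 * the text cannot be pure UTF-8s; 0xED followed by 0xA0..0xBF encodes
 * a surrogate in 3 bytes, so the text cannot be pure UTF-8. If
 * neither marker appears, either label is consistent. */
enum detect_result { EITHER = 0, NOT_UTF8S = 1, NOT_UTF8 = 2 };

static enum detect_result detect_utf8_flavor(const uint8_t *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (p[i] >= 0xF0 && p[i] <= 0xF4) {
            return NOT_UTF8S;  /* 4-byte form: supplementary in UTF-8 */
        }
        if (p[i] == 0xED && i + 1 < n && p[i + 1] >= 0xA0 && p[i + 1] <= 0xBF) {
            return NOT_UTF8;   /* 3-byte-encoded surrogate: UTF-8s */
        }
    }
    return EITHER;
}
```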
> >
> > Mark
> >
> > * Martin, it'd be nice to resurrect your note into one of the Unicode
> > FAQs.
> >
> > ----- Original Message -----
> > From: <DougEwell2@cs.com>
> > To: <unicode@unicode.org>
> > Cc: <Peter_Constable@sil.org>
> > Sent: Monday, June 04, 2001 00:10
> > Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
> >
> >
> > > In a message dated 2001-06-03 18:04:17 Pacific Daylight Time,
> > > Peter_Constable@sil.org writes:
> > >
> > > > It would seem to me that there's
> > > > another issue that has to be taken into consideration here:
> > > > normalisation. You can't just do a simple sort using raw binary
> > > > comparison; you have to normalise strings before you compare them,
> > > > even if the comparison is a binary compare.
> > >
> > > I would be surprised if that has even been considered. Normalization is
> > > one of those fine details of Unicode, like directionality and character
> > > properties, that may be completely unknown to a development team that
> > > thinks the strict binary order of UTF-16 code points makes a suitable
> > > collation order. This is a sign of a company or development team that
> > > thinks Unicode support is a simple matter of handling 16-bit characters
> > > instead of 8-bit.
> > >
> > > While we are at it, here's another argument against the existence of
> > > both UTF-8 and this new UTF-8s. Recently there was a discussion about
> > > the use of the U+FEFF signature in UTF-8 files, with a fair number of
> > > Unicode experts arguing against its necessity because UTF-8 is so easy
> > > to detect heuristically. Without reopening that debate, it is worth
> > > noting that UTF-8s could not be distinguished from UTF-8 by that
> > > technique. By definition D29, UTF-8s must support encoding of unpaired
> > > surrogates (as UTF-8 already does), and thus a UTF-8s sequence like
> > > ED A0 80 ED B0 80 could ambiguously represent either the two unpaired
> > > surrogates U+D800 U+DC00 or the legitimate Unicode code point U+10000.
> > > Such a sequence -- the only difference between UTF-8 and UTF-8s --
> > > could appear in either encoding, but with different interpretations,
> > > so auto-detection would not work.
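The ambiguity can be checked by decoding the 3-byte forms directly (a sketch with an invented helper name): each ED xx xx triple decodes to one 16-bit value, and the two results are exactly the pair D800 DC00 that a UTF-8s reader would combine into U+10000.

```c
#include <stdint.h>

/* Sketch (invented helper): decode one 3-byte UTF-8-style sequence
 * (lead byte E0..EF, two continuation bytes 80..BF) into its 16-bit
 * scalar value, without rejecting surrogate values. */
static uint16_t decode3(const uint8_t b[3]) {
    return (uint16_t)(((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F));
}
```

Read as UTF-8, the two triples are the unpaired surrogates U+D800 U+DC00; read as UTF-8s, the same six bytes mean U+10000 -- which is exactly why the heuristic cannot tell the two encodings apart.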
> > >
> > > Summary: UTF-8s is bad.
> > >
> > > -Doug Ewell
> > > Fullerton, California
> > >
> >



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT