Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Mark Davis (markdavis34@home.com)
Date: Mon Jun 04 2001 - 11:47:20 EDT


I am not, myself, in favor of UTF-8s. However, I do want to point out a few
things.

1) Normalization does not particularly favor one side or the other.

A binary compare is used for performance, typically when you don't care
about the internal ordering from an international perspective (such as a
B-Tree for file systems). It does not prevent you from later imposing a
localized sort order (e.g. when the files are displayed in a window, they
can be sorted by name, date, author, etc. at that time).

For performance reasons, it is simply not a good idea in that case to
normalize when you compare. You are choosing a binary compare precisely
because it is a fast, well-defined comparison operation, and invoking
normalization at comparison time defeats the first of those goals.
Normalization at comparison time can be pretty fast (you only take the slow
path when the quick check fails, as described in UAX #15), but it will never
be anywhere near as fast as a binary compare.
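
To make the cost concrete, here is a minimal sketch of normalizing at
comparison time, assuming Python 3.8 or later. The function name
normalized_equal is mine, for illustration only. Even when both strings are
already normalized, each one has to be scanned before the compare can start,
which is work a plain binary compare never does.

```python
import unicodedata

def normalized_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings for canonical equivalence at comparison time."""
    # is_normalized() (Python 3.8+) lets already-normalized input skip the
    # allocation of a new string; only the failing case takes the slow path.
    if not unicodedata.is_normalized(form, a):
        a = unicodedata.normalize(form, a)
    if not unicodedata.is_normalized(form, b):
        b = unicodedata.normalize(form, b)
    # Only now can the cheap code point by code point compare run.
    return a == b

# Precomposed e-acute versus "e" followed by a combining acute accent.
print(normalized_equal("\u00e9", "e\u0301"))
```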

The best practice for that case is to enforce normalization on data fields
*when the text is inserted in the field*. If you do, then canonical
equivalents will compare as equal under a binary compare, whether they are
encoded in UTF-8, UTF-8s, or UTF-16 (or, for that matter, BOCU).
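
A rough sketch of that practice in Python follows. The helper name
to_stored_form and the choice of NFC are my assumptions; any one form works
as long as it is applied consistently. Python ships no UTF-8s codec, so only
UTF-8 and UTF-16 are shown.

```python
import unicodedata

def to_stored_form(text: str) -> str:
    """Normalize once, at insertion time (NFC is an assumed choice)."""
    return unicodedata.normalize("NFC", text)

# Two canonically equivalent spellings of the same file name.
a = to_stored_form("caf\u00e9")    # precomposed e-acute
b = to_stored_form("cafe\u0301")   # "e" plus combining acute accent

# After insertion-time normalization, a plain byte compare agrees in
# every encoding form, with no normalization work at compare time.
for encoding in ("utf-8", "utf-16-be"):
    assert a.encode(encoding) == b.encode(encoding)
```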

2) Auto-detection does not particularly favor one side or the other.

UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a
supplementary character expressed with two 3-byte values, you know you do
not have pure UTF-8. If you ever encounter a supplementary character
expressed with a 4-byte value, you know you don't have pure UTF-8s. If you
never encounter either one, why does it matter? Every character you read is
valid and correct.
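
The distinction can be sketched as a byte scan. This is an illustration
only: the function name classify and its return labels are mine, and it
assumes the input is otherwise well-formed (it does not validate
continuation bytes, and an unpaired surrogate is treated as an ordinary
3-byte sequence).

```python
def classify(data: bytes) -> str:
    """Return 'UTF-8', 'UTF-8s', 'either', or 'neither' for the input."""
    saw_four_byte = False       # supplementary character, UTF-8 style
    saw_surrogate_pair = False  # supplementary character, UTF-8s style
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if 0xF0 <= b <= 0xF4:
            saw_four_byte = True
            i += 4
        elif (b == 0xED and i + 5 < n
              and 0xA0 <= data[i + 1] <= 0xAF      # high surrogate
              and data[i + 3] == 0xED
              and 0xB0 <= data[i + 4] <= 0xBF):    # low surrogate
            saw_surrogate_pair = True
            i += 6
        elif b >= 0xE0:
            i += 3
        elif b >= 0xC0:
            i += 2
        else:
            i += 1
    if saw_four_byte and saw_surrogate_pair:
        return "neither"
    if saw_four_byte:
        return "UTF-8"
    if saw_surrogate_pair:
        return "UTF-8s"
    return "either"

print(classify(b"\xed\xa0\x80\xed\xb0\x80"))    # two 3-byte values
print(classify("\U00010000".encode("utf-8")))   # one 4-byte value
```

The first call uses the sequence ED A0 80 ED B0 80 from the message quoted
below; the second is the 4-byte form of the same character, U+10000.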

Auto-detection works on the basis of statistical probability. With
sufficient non-ASCII characters, the chance that text obeys the UTF-8 byte
restrictions and is not UTF-8 is very low (see Martin Duerst's messages on
this from some time ago*). Essentially the same is true of UTF-8s.
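
As a small illustration of that kind of detection, here is a sketch that
treats "decodes cleanly under a strict UTF-8 decoder" as the test. The
function name looks_like_utf8 is hypothetical, and a real detector would
also want a minimum amount of non-ASCII text before trusting the answer.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes obey the UTF-8 byte restrictions."""
    try:
        data.decode("utf-8")  # Python's decoder is strict by default
    except UnicodeDecodeError:
        return False
    return True

sample = "D\u00fcrst, caf\u00e9"
print(looks_like_utf8(sample.encode("utf-8")))    # real UTF-8
print(looks_like_utf8(sample.encode("latin-1")))  # same text in Latin-1
```

Note that Python 3's strict decoder also rejects surrogates encoded as
3-byte values, so it reports the UTF-8s form of a supplementary character
as not being UTF-8.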

Mark

* Martin, it'd be nice to resurrect your note into one of the Unicode FAQs.

----- Original Message -----
From: <DougEwell2@cs.com>
To: <unicode@unicode.org>
Cc: <Peter_Constable@sil.org>
Sent: Monday, June 04, 2001 00:10
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

> In a message dated 2001-06-03 18:04:17 Pacific Daylight Time,
> Peter_Constable@sil.org writes:
>
> > It would seem to me that there's
> > another issue that has to be taken into consideration here:
> > normalisation. You can't just do a simple sort using raw binary
> > comparison; you have to normalise strings before you compare them,
> > even if the comparison is a binary compare.
>
> I would be surprised if that has even been considered. Normalization is
> one of those fine details of Unicode, like directionality and character
> properties, that may be completely unknown to a development team that
> thinks the strict binary order of UTF-16 code points makes a suitable
> collation order. This is a sign of a company or development team that
> thinks Unicode support is a simple matter of handling 16-bit characters
> instead of 8-bit.
>
> While we are at it, here's another argument against the existence of both
> UTF-8 and this new UTF-8s. Recently there was a discussion about the use
> of the U+FEFF signature in UTF-8 files, with a fair number of Unicode
> experts arguing against its necessity because UTF-8 is so easy to detect
> heuristically. Without reopening that debate, it is worth noting that
> UTF-8s could not be distinguished from UTF-8 by that technique. By
> definition D29, UTF-8s must support encoding of unpaired surrogates (as
> UTF-8 already does), and thus a UTF-8s sequence like ED A0 80 ED B0 80
> could ambiguously represent either the two unpaired surrogates U+D800
> U+DC00 or the legitimate Unicode code point U+10000. Such a sequence --
> the only difference between UTF-8 and UTF-8s -- could appear in either
> encoding, but with different interpretations, so auto-detection would not
> work.
>
> Summary: UTF-8s is bad.
>
> -Doug Ewell
> Fullerton, California
>



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT