Fw: UTF-8 Syntax

From: Mark Davis (mark@macchiato.com)
Date: Wed Jun 13 2001 - 11:09:43 EDT


----- Original Message -----
From: "Mark Davis" <mark@macchiato.com>
To: <toby_phipps@peoplesoft.com>; <unicore@unicode.org>
Sent: Monday, June 11, 2001 08:06
Subject: Re: UTF-8 Syntax

> I would rank the situations, from an interpreter's point of view, in the
> following order.
>
> 1. Best
> - There is only one standard tag, "UTF-8", for an 8-bit code-unit UTF.
> - When you hit the tag "UTF-8", you can always -- in practice -- assume
> that the data conforms to the definition*.
>
> 2. Medium
> - There are two standard tags, "UTF-8" and "UTF-8S", for 8-bit code-unit
> UTFs.
> - When you hit the tag "UTF-8", you can always -- in practice -- assume
> that the data conforms to its definition*.
> - When you hit the tag "UTF-8S", you can always -- in practice -- assume
> that the data conforms to its definition*.
>
> 3. Bad
> - There is only one standard tag, "UTF-8", for an 8-bit code-unit UTF.
> - When you hit the tag "UTF-8", you can't -- in practice -- assume that
> the data conforms to its definition; it may in fact be UTF-8S in format
> but tagged as UTF-8.
>
> 4. Worst
> - There are two standard tags, "UTF-8" and "UTF-8S", for 8-bit code-unit
> UTFs.
> - When you hit the tag "UTF-8", you can't -- in practice -- assume that
> the data conforms to its definition.**
> - When you hit the tag "UTF-8S", you can't -- in practice -- assume that
> the data conforms to its definition.**
>
> * "assume that the data conforms" == you could safely reject as corrupt
> anything that didn't conform, and your customers wouldn't bitch at you.
> ** "you can't assume that the data conforms" == you don't really know
> which it is without looking at the data, but your customers will expect
> you to handle it seamlessly anyway, despite the trouble it causes.
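
As a rough illustration of what "looking at the data" could involve, here is a minimal Python sketch. It assumes UTF-8S means that supplementary characters are written as a pair of 3-byte encoded surrogates (6 bytes) and that 4-byte sequences are not allowed, which is how the form is described later in this thread. The function names are hypothetical, and Python's own codecs stand in for a real conformance parser.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 (encoded surrogates are rejected)."""
    try:
        data.decode("utf-8")  # Python's strict decoder refuses ED A0..BF xx
        return True
    except UnicodeDecodeError:
        return False


def is_utf8s(data: bytes) -> bool:
    """True if data fits the assumed UTF-8S layout described above."""
    # Bytes F0..FF could only start a 4-byte (or invalid) sequence.
    if any(b >= 0xF0 for b in data):
        return False
    try:
        # "surrogatepass" lets 3-byte encoded surrogates through.
        text = data.decode("utf-8", "surrogatepass")
        # Round-trip through strict UTF-16 to reject unpaired surrogates.
        text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


def sniff(data: bytes) -> str:
    """Report which of the two forms the bytes are consistent with."""
    utf8, utf8s = is_strict_utf8(data), is_utf8s(data)
    if utf8 and utf8s:
        return "either"  # no supplementary characters, so the forms agree
    if utf8:
        return "UTF-8"
    if utf8s:
        return "UTF-8S"
    return "neither"
```

Note that data containing no supplementary characters is identical in both forms, so a check like this can only separate the two when such characters actually occur.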
>
> (1) is the best of all possible worlds. We always know what we get, and it
> is only one form.
> (2) is not great, since it introduces another format, but one can live
> with it. As long as the data is correctly tagged, it is no more of a
> problem than dealing with any other UTF.
> (3) is not good. You can never depend on the data being in a particular
> form.
> (4) is the worst; it has the disadvantages of both (2) and (3).
>
> I think part of the issue is that pros are seeing (3) as the current
> state, with (2) as the future if we add a definition for "UTF-8S", and
> the contras are seeing (1) as the current state, and (3) as the future if
> we add a definition of "UTF-8S".
>
> Mark
>
> ----- Original Message -----
> From: <toby_phipps@peoplesoft.com>
> To: <unicore@unicode.org>
> Sent: Monday, June 11, 2001 02:48
> Subject: Re: UTF-8 Syntax
>
>
> >
> > Martin Duerst <duesrt@w3.org> wrote:
> > >The problem you are talking about would occur if we had a database
> > >that stored data both in UTF-8 and UTF-8S, at random, and we wanted
> > >to query it with UTF-8 (or UTF-8S or UTF-16, doesn't make a
> > >difference).
> >
> > Agreed - having a single database store both UTF-8 and UTF-8S style data
> > is a very bad idea, and would lead to the situation that Mark Davis
> > describes, where dual comparisons would have to be performed for each
> > character with multiple possible representations across the UTF-8 and
> > UTF-8S forms. This is something that can easily be filtered out at the
> > SQL API, however, given that it will already need to perform a
> > transformation between the differing encodings of server and client.
> > The client API receiving data in UTF-16 would convert it to its
> > appropriate UTF-8 or UTF-8S form depending on the server's encoding.
> > The only danger lies where the client and server are both running in
> > the same encoding, and the client sends a malformed character (a 6-byte
> > surrogate pair in the case of UTF-8, a 4-byte supplementary-character
> > sequence in the case of UTF-8S). This would need to be detected as an
> > encoding error (if it's a strict system).
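
The client-side conversion described above could be sketched roughly as follows in Python. This is only an illustration: `encode_utf8s` and `encode_for_server` are hypothetical names, the server encoding is assumed to arrive as a plain tag string, and UTF-8S is assumed to mean "encode each UTF-16 code unit separately in up to 3 bytes".

```python
import struct


def encode_utf8s(text: str) -> bytes:
    """Encode text in the assumed UTF-8S form, one UTF-16 code unit at a time."""
    raw = text.encode("utf-16-be")
    units = struct.unpack(f">{len(raw) // 2}H", raw)
    # Treat every 16-bit unit (surrogates included) as its own code point,
    # so each surrogate becomes a 3-byte sequence.
    return "".join(map(chr, units)).encode("utf-8", "surrogatepass")


def encode_for_server(text: str, server_encoding: str) -> bytes:
    """Pick the 8-bit form that matches the server's declared encoding."""
    if server_encoding == "UTF-8":
        return text.encode("utf-8")
    if server_encoding == "UTF-8S":
        return encode_utf8s(text)
    raise ValueError(f"unsupported server encoding: {server_encoding}")
```

For U+10000 this yields the 4 bytes F0 90 80 80 for a "UTF-8" server and the 6 bytes ED A0 80 ED B0 80 for a "UTF-8S" server; text limited to the BMP comes out the same either way.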
> >
> > >- That it may also be a bad idea to have two very similar encoding
> > > forms (UTF-8 and UTF-8S) at all. The danger of mixing those up
> > > and getting both of them into the same database is at least as
> > > big as the danger of assuming that binary ordering gets preserved
> > > between UTF-8 and UTF-16.
> >
> > This is no worse than the fact that UTF-8 and ISO 8859-1 are similar.
> > Unless your database has a strict encoding conformance parser for all
> > input data, there's nothing stopping you storing 8859-1 data in a UTF-8
> > database. To the casual user, these two encodings are very similar,
> > especially when dealing with English data. Add to that the fact that, in
> > everyday text, surrogate characters will be many times less prevalent
> > than the 8-bit range of ISO 8859-1, and the possibility of mixing 8859-1
> > data in a UTF-8 system seems much more real than that of UTF-8S data in
> > a UTF-8 database. I see preserving binary order as being much more
> > important than this existing problem that we already face today.
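
A small Python sketch of the 8859-1 point, using Python's strict UTF-8 decoder as a stand-in for a "strict encoding conformance parser" (the function name `accepts_as_utf8` is hypothetical):

```python
def accepts_as_utf8(data: bytes) -> bool:
    """Stand-in for a strict input check on a UTF-8 database column."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_only = "cafe".encode("latin-1")       # identical bytes in both encodings
latin1_accent = "caf\u00e9".encode("latin-1")  # ends in the single byte E9
utf8_accent = "caf\u00e9".encode("utf-8")      # ends in the bytes C3 A9
```

Here `ascii_only` and `utf8_accent` pass the check while `latin1_accent` does not, so a strict parser catches this particular case. Pure-ASCII 8859-1 data is indistinguishable from UTF-8, and some 8859-1 byte runs (for example C3 A9) also happen to be valid UTF-8, so the check cannot catch every mix-up.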
> >
> > >Another point: In various contributions, the assumption was made
> > >that it would be inefficient to use some of the comparison tricks
> > >that Mark and Markus have proposed. This may definitely be true
> > >for UTF-8, because in that case, you have to compare speed with
> > >memcmp, which is probably implemented in assembler if not in
> > >hardware. But for UTF-16, at least on little-endian machines,
> > >memcmp won't give the right result anyway, and so using the
> > >tricks proposed won't slow down comparison speed too much.
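
The ordering issue behind this paragraph can be shown with a short Python sketch. The `fixup` function below is one commonly described way to make UTF-16 code units compare in code point order; whether it matches the exact tricks referred to here is an assumption, and all names are hypothetical.

```python
import struct


def utf16_units(text: str) -> tuple:
    """Return the UTF-16 code units of text as a tuple of integers."""
    raw = text.encode("utf-16-be")
    return struct.unpack(f">{len(raw) // 2}H", raw)


def fixup(unit: int) -> int:
    """Move surrogates above U+E000..U+FFFF so units sort like code points."""
    if unit >= 0xE000:
        return unit - 0x800   # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000  # D800..DFFF -> F800..FFFF
    return unit


def code_point_order_key(text: str) -> tuple:
    """Sort key giving code point order from UTF-16 code units."""
    return tuple(fixup(u) for u in utf16_units(text))


bmp_high = "\uff5e"          # BMP character above the surrogate range
supplementary = "\U00010000"  # encoded as D800 DC00 in UTF-16

# Byte-wise UTF-8 comparison follows code point order.
utf8_order_ok = bmp_high.encode("utf-8") < supplementary.encode("utf-8")
# Plain UTF-16 code unit comparison puts the supplementary character first.
utf16_order_differs = utf16_units(supplementary) < utf16_units(bmp_high)
# With the fix-up applied, UTF-16 agrees with UTF-8 again.
fixed_order_ok = code_point_order_key(bmp_high) < code_point_order_key(supplementary)
# Byte-wise comparison of little-endian UTF-16 is wrong even inside the BMP.
le_bytes_misorder = "\u0100".encode("utf-16-le") < "a".encode("utf-16-le")
```

All four flags evaluate to True. The last one illustrates the remark about little-endian machines: U+0100 is stored as 00 01 and "a" as 61 00, so a byte-wise comparison ranks U+0100 first.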
> >
> > The loss of performance for a string comparison is just one reason why
> > these proposals would be difficult to work with. Consider that:
> > 1. The "modified" comparison function would need to be implemented,
> > supported and popularized in each programming language where UTF-16 data
> > could be expected to appear (C, C++, Java, COBOL etc.)
> > 2. Many millions of lines of code from many vendors would have to be
> > modified to use this function. In the case of PeopleSoft, this impact is
> > not only to our code, but to a myriad of 3rd party products that we
> > integrate into our system, that would also need to be changed to use the
> > new string comparison function.
> >
> > Toby.
> >
> > --
> > Toby Phipps
> > PeopleTools Product Manager - Global Technology
> > PeopleSoft, Inc.
> > tphipps@peoplesoft.com
> >
> >
> >
> >
> >
>
>


