Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Mark Davis (markdavis34@home.com)
Date: Mon Jun 04 2001 - 23:58:22 EDT


I must not have been clear, since I think we are essentially in agreement.
Let me try again.

- I am well aware that one can accept 6-byte supplementary characters on
input in UTF-8. (Did you really think I wasn't?)

- Peter was saying that you couldn't tell the difference between UTF-8 and
UTF-8S, in terms of autodetection.

- For autodetection, the task is to map a sequence of input bytes (where the
encoding ID is missing) to the *correct* sequence of output code points.
When doing this with UTF-8/8s, here is what happens.

a) If you hit any illegal sequence as defined in
http://www.unicode.org/unicode/reports/tr27/:
"Table 3.1B. Legal UTF-8 Byte Sequences" then you know that the input is
neither UTF-8 nor UTF-8S, and you cannot interpret any further.

b) Otherwise, if you hit a sequence of the form (F0..F4, XX, XX, XX), you
know it is not valid UTF-8s (but you can accept and correctly interpret that
sequence in a lenient receiver). Otherwise,

c) if you hit a sequence of the form (ED, A0..AF, XX, ED, B0..BF, XX), you
know it is not valid UTF-8 (but you can accept and correctly interpret that
sequence in a lenient receiver).

[Rather than actually match this pattern, in practice it is easier to go
ahead and convert the sequence and then check the output code points.]

d) Otherwise, anything else is valid, and has exactly the same
interpretation in both UTF-8 and UTF-8s.
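
Steps (a)-(d) can be sketched as a small classifier. This is my own
illustration, not code from the thread; the function and result names are
made up, and it checks the shape of sequences rather than doing a full
conversion (the bracketed note above points out that converting and checking
the output code points is often easier in practice):

```python
# Sketch (hypothetical names): classify a byte string per steps (a)-(d).
# Returns "invalid", "not UTF-8s", "not UTF-8", "mixture", or "either".

def classify(data: bytes) -> str:
    saw_4byte = False       # case (b): F0..F4 lead byte (UTF-8 supplementary)
    saw_pair = False        # case (c): ED A0..AF XX ED B0..BF XX (UTF-8s)
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII
            i += 1
        elif 0xC2 <= b <= 0xDF:           # 2-byte sequence
            if i + 1 >= n or not 0x80 <= data[i+1] <= 0xBF:
                return "invalid"          # case (a)
            i += 2
        elif 0xE0 <= b <= 0xEF:           # 3-byte sequence
            if i + 2 >= n:
                return "invalid"
            b1, b2 = data[i+1], data[i+2]
            lo = 0xA0 if b == 0xE0 else 0x80
            # Note: strict UTF-8 (TR27) forbids ED A0..BF; we admit it here
            # precisely because UTF-8s uses it for surrogates.
            if not (lo <= b1 <= 0xBF and 0x80 <= b2 <= 0xBF):
                return "invalid"
            if b == 0xED and 0xA0 <= b1 <= 0xAF:
                # High surrogate: must be followed by ED B0..BF XX (case c).
                if (i + 5 >= n or data[i+3] != 0xED
                        or not 0xB0 <= data[i+4] <= 0xBF
                        or not 0x80 <= data[i+5] <= 0xBF):
                    return "invalid"
                saw_pair = True
                i += 6
                continue
            if b == 0xED and 0xB0 <= b1 <= 0xBF:
                return "invalid"          # unpaired low surrogate
            i += 3
        elif 0xF0 <= b <= 0xF4:           # 4-byte sequence: case (b)
            if i + 3 >= n:
                return "invalid"
            b1, b2, b3 = data[i+1], data[i+2], data[i+3]
            lo = 0x90 if b == 0xF0 else 0x80
            hi = 0x8F if b == 0xF4 else 0xBF
            if not (lo <= b1 <= hi and 0x80 <= b2 <= 0xBF
                    and 0x80 <= b3 <= 0xBF):
                return "invalid"
            saw_4byte = True
            i += 4
        else:
            return "invalid"              # case (a): illegal lead byte
    if saw_4byte and saw_pair:
        return "mixture"
    if saw_4byte:
        return "not UTF-8s"
    if saw_pair:
        return "not UTF-8"
    return "either"                       # case (d)
```

As claimed above, most real input lands in "either" or "invalid", where the
distinction between UTF-8 and UTF-8s never arises.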

The overwhelming majority of the time, valid cases of UTF-8 or UTF-8s will
fall under condition (d), and invalid cases will fall under (a). And in all
of those cases, you couldn't autodetect whether you were in UTF-8 or UTF-8s.
But it doesn't matter -- the codes are correctly interpreted in either case.

If you do hit a case (b) or (c), you could tell whether you were in UTF-8 or
UTF-8s (or a mixture). But it doesn't much matter in this case either.
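
To make cases (b) and (c) concrete, here is how a single supplementary
character comes out in each encoding. This is my own worked example, not
from the thread; the `three_byte` helper is a hypothetical name:

```python
# Illustration: U+10000 encoded as UTF-8 vs. UTF-8s (CESU-8 style).
ch = "\U00010000"

utf8 = ch.encode("utf-8")          # one 4-byte sequence: case (b)

# UTF-8s encodes the UTF-16 surrogate pair (D800 DC00), each half as a
# 3-byte UTF-8-style sequence: case (c).
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

def three_byte(cp):
    """Pack a 16-bit code unit into a 3-byte UTF-8-shaped sequence."""
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

utf8s = three_byte(high) + three_byte(low)

print(utf8.hex(" "))    # f0 90 80 80
print(utf8s.hex(" "))   # ed a0 80 ed b0 80
```

The two byte sequences share no prefix, which is why a decoder that sees
either pattern can tell which encoding it is *not* looking at.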

When you are autodetecting between Latin-1 and Latin-2, it matters a great
deal which you determine the source to be. Otherwise you can't possibly map
it to the correct code points.

With autodetection between UTF-8 and UTF-8s, it doesn't matter, since the
same byte sequences couldn't mean anything else. The only downside is that
you are less likely to distinguish either of them (with a lenient parser)
from other encodings; but only infinitesimally less likely.

Mark

----- Original Message -----
From: "Michael (michka) Kaplan" <michka@trigeminal.com>
To: "Mark Davis" <mark@macchiato.com>; <DougEwell2@cs.com>;
<unicode@unicode.org>
Cc: <Peter_Constable@sil.org>
Sent: Monday, June 04, 2001 09:44
Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

> From: "Mark Davis" <markdavis34@home.com>
>
> > 2. Auto-detection does not particularly favor one side or the other.
> >
> > UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a
> > supplementary character expressed with two 3-byte values, you know you
> > do not have pure UTF-8. If you ever encounter a supplementary character
> > expressed with a 4-byte value, you know you don't have pure UTF-8s. If
> > you never encounter either one, why does it matter? Every character you
> > read is valid and correct.
>
> I would have to disagree with this point, since it is considered legal to
> accept (but not emit) six-byte supplementary characters in UTF-8 as it
> stands today. Thus there is some severe overlap between existing
> implementations -- every single implementation that was not thinking ahead
> to surrogate pairs, for example. This would include dozens of MS apps, the
> versions of Oracle prior to them adding official UTF-8 support, and a ton
> of other products.
>
> MichKa
>
> Michael Kaplan
> Trigeminal Software, Inc.
> http://www.trigeminal.com/
>
>



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT