Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Mon Jun 04 2001 - 12:44:49 EDT


From: "Mark Davis" <markdavis34@home.com>

> 2. Auto-detection does not particularly favor one side or the other.
>
> UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a
> supplementary character expressed with two 3-byte values, you know you do
> not have pure UTF-8. If you ever encounter a supplementary character
> expressed with a 4-byte value, you know you don't have pure UTF-8s. If you
> never encounter either one, why does it matter? Every character you read
> is valid and correct.

I would have to disagree with this point, since it is considered legal to
accept (but not emit) six-byte supplementary characters in UTF-8 as it
stands today. Thus there is some severe overlap between existing
implementations -- every single implementation that was not thinking ahead
to surrogate pairs, for example. This would include dozens of MS apps, the
versions of Oracle prior to them adding official UTF-8 support, and a ton of
other products.
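
To make the overlap concrete, here is a rough Python sketch of the two forms
and of a lenient reader that accepts both. The function names (encode_utf8s,
classify, lenient_decode) are made up for illustration and do not come from
any product mentioned above, and classify assumes the input is otherwise
well-formed. It only illustrates the argument: a decoder that accepts the
six-byte form cannot use it as a signal that the data is "not UTF-8".

```python
def encode_utf8(cp: int) -> bytes:
    """Standard UTF-8: a supplementary character is one 4-byte sequence."""
    return chr(cp).encode("utf-8")


def encode_utf8s(cp: int) -> bytes:
    """UTF-8s style (hypothetical helper): a supplementary character is
    written as its two UTF-16 surrogates, each as a 3-byte sequence."""
    if cp < 0x10000:
        return chr(cp).encode("utf-8")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
    return b"".join(chr(s).encode("utf-8", "surrogatepass") for s in (high, low))


def classify(data: bytes) -> str:
    """Crude auto-detection, assuming the bytes are otherwise well-formed.
    0xF0..0xF4 can only be a 4-byte lead byte; 0xED followed by 0xA0..0xBF
    can only be an encoded surrogate."""
    four_byte = any(0xF0 <= b <= 0xF4 for b in data)
    surrogate = any(a == 0xED and 0xA0 <= b <= 0xBF
                    for a, b in zip(data, data[1:]))
    if four_byte and surrogate:
        return "mixed"
    if four_byte:
        return "utf-8"
    if surrogate:
        return "utf-8s"
    return "either"


def lenient_decode(data: bytes) -> str:
    """A reader that accepts both forms: decode surrogates as-is, then
    round-trip through UTF-16 so that surrogate pairs are joined."""
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


if __name__ == "__main__":
    cp = 0x10400  # an arbitrary supplementary code point
    for form in (encode_utf8(cp), encode_utf8s(cp)):
        print(form.hex(" "), classify(form), lenient_decode(form) == chr(cp))
```

A strict decoder (Python's own data.decode("utf-8"), for instance) rejects the
six-byte form, so the clean split Mark describes only holds for readers that
are strict in that way.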

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/


