Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Aug 25 1998 - 15:47:30 EDT


Gunther Schadow commented:

>
> (3) UTF-8 and UTF-7 are encodings for Unicode that are most useful for
> the majority of languages used in the continents of North-America,
> South-America, and Europe. All other languages will probably prefer to
> use the 16 bit Unicode integers directly.

This is not generally the case. The choice between UTF-8 and UTF-16
as the encoding form for Unicode-encoded data often turns on criteria
other than the size of the resulting text file. UTF-8 implementations
are generally driven by compatibility with existing APIs and file
structures, and by the tradeoff between the cost of adapting software
to process 16-bit strings and the processing inefficiency of dealing
with variable-width characters.
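
For reference, the size argument in point (3) can be measured directly.
The following is a minimal sketch in Python; the sample strings and the
helper name encoded_sizes are illustrative assumptions, not part of the
original discussion, and UTF-16 is counted without a byte order mark.

```python
# Compare the encoded size of short samples in UTF-8 and UTF-16.
# The sample strings are hypothetical; escapes are used so the
# code points are unambiguous.
samples = {
    "English": "Unicode",
    "German": "Gr\u00f6\u00dfe",            # "Groesse" with o-umlaut and sharp s
    "Japanese": "\u65e5\u672c\u8a9e",       # three CJK ideographs
}

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for name, text in samples.items():
    u8, u16 = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

Size differs by script in both directions, which is one reason it is
rarely the only criterion in choosing an encoding form.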

>
> (4) UTF-8 and UTF-7 are based on the backward compatibility of Unicode
> to US-ASCII (1) but they neglect the backward compatibility of Unicode
> to ISO Latin-1.

UTF-7 is not backward compatible with US-ASCII, since it escapes some
ASCII characters that have particular uses. In fact, UTF-7 is deprecated
for general use--the need for it has been largely eliminated by fixes
to the email protocols. So this discussion should focus on UTF-8.
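
A small illustration of the escaping, sketched with Python's built-in
"utf-7" and "utf-8" codecs (the sample string is an assumption chosen
for the example): the ASCII plus sign is the UTF-7 shift character, so
it cannot pass through unchanged.

```python
# An ASCII-only string containing '+', which UTF-7 reserves as its
# shift character.
text = "a+b"

ascii_bytes = text.encode("ascii")
utf7_bytes = text.encode("utf-7")
utf8_bytes = text.encode("utf-8")

# UTF-8 leaves ASCII text byte-for-byte unchanged; UTF-7 does not.
print(ascii_bytes, utf7_bytes, utf8_bytes)
print("UTF-8 matches ASCII:", utf8_bytes == ascii_bytes)
print("UTF-7 matches ASCII:", utf7_bytes == ascii_bytes)
```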

...
> But I ask you to think why an
> Anglo-Americanocentric UTF is good while a UTF for all scripts based
> on Latin is so bad and politically incorrect to call for (BTW:
> wouldn't vietnamese be supported by ISO Latin-1 as well?). ...

No, it wouldn't, as John Cowan has already pointed out.
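
This is easy to check, for instance with Python's "latin-1" codec. The
helper name fits_latin1 and the sample words below are hypothetical,
chosen only to illustrate the point.

```python
def fits_latin1(text):
    """Return True if every character of text exists in ISO 8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# French fits; Vietnamese needs letters such as U+1EBF and U+1EC7,
# which lie outside the Latin-1 range U+0000..U+00FF.
print(fits_latin1("Fran\u00e7ais"))
print(fits_latin1("Ti\u1ebfng Vi\u1ec7t"))
```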

>
> But may I please ask you (especially the US-residents among the
> fighters for political correctness) at least not to interfere with a
> call for a UTF that is as compatible as Unicode is by itself? I think
> that the issue with UTF-7 and UTF-8 is more about broadening the
> narrow Anglo-American view on the world than to narrow the beautiful
> global view of Unicode towards an Euro-centrism.
>

I don't count myself as a "fighter for political correctness" here.
Basically, UTF-8 works for what it is supposed to do: pass Unicode
data through byte-oriented character protocols. The technical arguments
against adding an "adaptive UTF-8" were well presented by Kevin.
Any other "slice-the-bits-differently-to-preserve-Latin-1" UTF
might be able to side-step some of the technical arguments, but it
would still run up against a practical fact: UTF-8 is already being
implemented fairly widely, whereas a new UTF would have a hard row
to hoe to gain enough implementation adherence to actually be of
any use for interoperability.

Implementers of UTF-8 are not, in my opinion, caught in a "narrow
Anglo-American view on the world" because they are working with
an encoding form that treats 0x00..0x7F differently from 0x80..0xFF.
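
The difference in treatment can be shown concretely. This is a minimal
sketch in Python; the helper name utf8_length is an assumption made for
the example.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# U+0000..U+007F take one byte each, identical to US-ASCII.
# U+0080..U+00FF take two bytes each, unlike their single Latin-1 byte.
for cp in (0x41, 0x7F, 0x80, 0xE9, 0xFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s) in UTF-8")

# U+00E9 (e with acute) as an example of the upper half of Latin-1.
latin1_bytes = chr(0xE9).encode("latin-1")
utf8_bytes = chr(0xE9).encode("utf-8")
print(latin1_bytes, utf8_bytes)
```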

--Ken


