Re: Is there a UTF that allows ISO 8859-1?

From: Peter_Constable@sil.org
Date: Wed Aug 26 1998 - 15:20:41 EDT


Gunther Schadow wrote:

> There are two issues here: having two mutually incompatible UTF
encoding standards is one thing, which is not that bad if you account
for the benefits that your stateful UTF has in terms of instant
compatibility of non-UTF aware software. The other thing is the image
of Unicode in "people's mind". Because there is nothing better than
Unicode (since sliced bread) and because you could comply to Unicode
this easily with the non-UTF-8 encoding, it wouldn't hurt too much.

. . .

> But this requires that you have a hand on the source code! And
that's difficult for the most part, I guess you all know that. In my
(Unix) world, it is no big deal and I think I could clean up my
personal workstation to use Unicode exclusively. But I am not talking
about my personal software and hobby, I am talking about business
empires where programmers are hidden behind meter-thick walls from the
public guarded by legions of sales-people who don't make a big
difference between a vacuum-cleaner and a computer program.

. . .

> . . . now that UTF-sane is no longer dedicated to Latin-1
compatibility, chances are I could even get non-western-europeans into
the boat.

You mentioned the "instant compatibility of non-UTF aware software" as
a benefit of Doug's encoding (let's call it UTF-Doug). I don't
understand: How is any program going to correctly handle the string

57 61 81 01 42 19 81 00 73 61

(Doug's example) and render it as Walesa (in its true Polish spelling)
unless it is programmed to do so? I would expect most programs to
choke or display boxes for several codes, or even some other equally
meaningless glyphs. Something like

Wa[]B[]sa (where [] = square box)

or

Wau:Bu:sa (where u: = u-umlaut)

(These are what appear when an app assumes CP1252 or IBM-PC "extended
ASCII".) This is hardly what I'd call instant compatibility. If I were
using this encoding for Hebrew or Thai text and then viewed that text
in my favourite "unaware" app, it would be even more meaningless. There
is definitely no instant compatibility for text like that.
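
To make the point concrete, here is a minimal sketch in Python of what
an app that knows nothing about the extended encoding would do with
those ten bytes. The helper name show() is my own invention for
illustration; the codec names are the ones Python's standard library
uses, with "cp437" standing in for the IBM-PC code page. Note that
Python's cp1252 table leaves byte 0x81 unassigned, so the sketch asks
for replacement characters instead of an error.

```python
# Doug's example bytes, copied from the message above.
data = bytes.fromhex("57 61 81 01 42 19 81 00 73 61")

def show(text):
    # Hypothetical helper: render unprintable characters as [XX]
    # so that the damage is visible in the output.
    return "".join(
        ch if ch.isprintable() else f"[{ord(ch):02X}]" for ch in text
    )

# ISO 8859-1: every byte maps straight to the code point of the same value.
as_latin1 = data.decode("latin-1")
# CP1252: 0x81 has no assignment, so it becomes U+FFFD (replacement char).
as_cp1252 = data.decode("cp1252", errors="replace")
# IBM-PC code page 437: 0x81 is u-umlaut.
as_cp437 = data.decode("cp437")

for name, text in [("latin-1", as_latin1),
                   ("cp1252", as_cp1252),
                   ("cp437", as_cp437)]:
    print(f"{name:8} {show(text)}")
```

None of the three decodings yields the Polish spelling; each one only
gets the "W", "a", "B", "s", "a" bytes right and turns the rest into
control characters, a replacement mark, or a stray u-umlaut.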

The only "instant compatibility" is with text I encode in UTF-Doug
where all of the characters are also in ISO 8859-1, and that assumes,
of course, that the app understands ISO 8859-1. (Someone's favourite
DOS util won't know what to do with it.) There's nothing impressive
about that. If I encode text using EBCDIC, I shouldn't be surprised if
an app that understands EBCDIC is able to render it correctly. But if
I devise some "EBCDIC-compatible" encoding that allows me to also
encode other characters, my EBCDIC app won't be able to render those
other characters correctly unless I specifically reprogram it to do
so. In other words, with UTF-Doug, apps that know about ISO 8859-1
will be able to handle ISO 8859-1 text correctly, but not my Hebrew or
Thai. The only way around this is to rewrite the app to know about
UTF-Doug. So if all I can expect the app to correctly handle is ISO
8859-1, then why would I plan to use that app for UTF-Doug-encoded
text (unless all the text is just ISO 8859-1 text, in which case I'm
not really using UTF-Doug, am I)? If I really want more than ISO
8859-1, then I'm going to have to teach my program how to do it,
whether I plan to use UTF-Doug, UTF-8, or whatever.

The only advantage of UTF-Doug or "UTF-sane" (or whatever you'd rather
call your original proposal) is that apps that understand ISO 8859-1
will be able to correctly handle text whose characters are all in ISO
8859-1. If all your text is like this, then you already have ISO
8859-1 support and you don't need support for another encoding. If
your text includes other characters, then you are going to have to add
support for those characters; i.e. you will add support for some other
encoding, whether that's UTF-sane, UTF-Doug, UTF-8, or whatever.
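
The same either/or can be sketched in a few lines of Python. This is
only an illustration of the reasoning above, not anyone's proposal:
fits_latin1() is a hypothetical helper name, and the Polish string is
written with escapes for U+0142 and U+0119.

```python
def fits_latin1(text):
    """Return True if every character of text exists in ISO 8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

plain = "Walesa"             # ASCII-only spelling
polish = "Wa\u0142\u0119sa"  # true Polish spelling (l-stroke, e-ogonek)

for text in (plain, polish):
    if fits_latin1(text):
        # Existing ISO 8859-1 support is already enough for this text.
        print(ascii(text), "-> latin-1:", text.encode("latin-1").hex(" "))
    else:
        # Some wider encoding is needed, and the app must be taught it;
        # UTF-8 is used here because it is the one the message recommends.
        print(ascii(text), "-> utf-8:  ", text.encode("utf-8").hex(" "))
```

Whichever branch the text takes, the app either already copes with it
as ISO 8859-1 or has to learn a new encoding; there is no third case
in which the extended text "just works".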

You mention the problem of influencing programmers who are "hidden
behind meter-thick walls . . . ", but you also seem hopeful of being
able to influence "even non-western-europeans" to adopt UTF-Doug. Why
not just try to influence them to adopt UTF-8? That will be a much
easier sell. UTF-Doug doesn't offer any advantages whatsoever over
UTF-8 if what I'm interested in is Thai; in fact, it has a major
disadvantage: nobody else has ever heard of it, let alone supports
it.

My point is, there is no instant compatibility with some encoding that
supports but extends ISO 8859-1, unless a program already understands
ISO 8859-1 and your text is only 8859-1 text. If you want more than
ISO 8859-1 text, you're going to have to teach a program how to do it.
And since you have to do that, why not do it using UTF-8, like lots of
others are already doing, rather than come up with something new? It
seems that you're really trying to fight an uphill battle.

For what it's worth . . .
Peter


