RE: Unicode end-users

From: Murray Sargent (murrays@microsoft.com)
Date: Fri Aug 01 1997 - 16:21:59 EDT


'twould be great if we could somehow require that all the text out there
that isn't in UTF-8 be translated into UTF-8 along with all the programs
out there that don't understand UTF-8. But unfortunately such a utopia
is currently out of reach and we're forced to be able to read/write
existing code-paged text to be compatible with the current files and to
communicate with older software that doesn't understand UTF-8. This
requirement is indeed a substantial burden and was one of the hard
things we had to solve in moving Office 97 to a Unicode base. End users
may well not care about how text is represented electronically, but they
care a great deal about being able to work with their current text files
and those of others. Very few of these files are in UTF-8.

Software has a finite lifetime, so we can imagine that programs that
don't understand UTF-8 will eventually be replaced by those that do.
Documents are likely to survive longer. Eventually the Plan9 ideal
might be reached in a more general environment. But at the very least
that environment will need converters to communicate with the past. I
bet that Plan9 has some such converters hidden away somewhere.
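
To make "converter" concrete, here is a minimal sketch in Python. It is
not how Office 97 or Plan9 does it; the function names and the cp1252
default are assumptions chosen only for illustration. It decodes
code-paged bytes to characters and re-encodes them as UTF-8, and also
goes the other way for older software that only understands a code page.

```python
def codepage_to_utf8(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Decode code-paged bytes and re-encode them as UTF-8.

    The caller has to know (or guess) the source code page; the bytes
    themselves do not say which one was used.
    """
    return raw.decode(codepage).encode("utf-8")


def utf8_to_codepage(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Encode UTF-8 text for software that only reads one code page.

    Characters the code page cannot represent are replaced with '?',
    so this direction can lose information.
    """
    return raw.decode("utf-8").encode(codepage, errors="replace")


legacy = b"caf\xe9"                  # "café" as stored in cp1252
as_utf8 = codepage_to_utf8(legacy)   # the same text as UTF-8 bytes
round_trip = utf8_to_codepage(as_utf8)
```

Going to UTF-8 is lossless as long as the source code page is identified
correctly. Going back is lossy whenever the text contains characters the
target code page lacks, which is part of why the requirement is a burden.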

Re software lifetimes: there's a lot of mainframe software that doesn't
understand how to deal with the year 2000. This software doesn't
understand UTF-8 either and it's been used for years. I suspect that
the year-2000 problem will be solved shortly one way or the other,
since otherwise the software is useless, but it probably won't be
generalized to handle UTF-8. Software, too, can live for a
surprisingly long time.

Murray

> -----Original Message-----
> From: kuhn@cs.purdue.edu [SMTP:kuhn@cs.purdue.edu]
> Sent: Friday, August 01, 1997 12:04 PM
> To: Multiple Recipients of
> Subject: Re: Unicode end-users
>
> Graham Rhind wrote on 1997-08-01 07:39 UTC:
> > Are there plans to enable Unicode to function as
> > ASCII does, for example, so that it is application independent and
> > is of direct use to the user rather than just to software
> > developers?
>
> In the Plan9 operating system (the current work of the guys who
> developed Unix), ASCII was totally replaced by UTF-8 several years
> ago. On Plan9, you use UTF-8 as the *ONLY* character encoding. You
> can therefore use Greek and Cyrillic characters everywhere, just like
> Latin ones: in source code, file names, environment variables, user
> names, passwords, printer names, etc. Plan9 (like Unix) does not have
> ANY notion of code pages and switching between character sets.
> Therefore, you have to introduce UTF-8 all over the place at once,
> which is the simplest and most practical solution. In code page and
> character set switching systems like MIME and Microsoft, Unicode will
> always be just one of several possible encodings and therefore it
> will always be more a part of the end-user's problems than a part of
> the end-user's solution. Unlike Windows, Plan9 does not have separate
> system calls and library functions for Unicode and for 8-bit code
> pages. As an applications programmer, you cannot avoid using UTF-8
> under Plan9.
>
> For Unix, UTF-8 as the exclusive way of representing characters is
> clearly the way to go, because there are no character set switching
> mechanisms and, in order to minimize the changes necessary to
> existing software, the encoding must be ASCII compatible.
>
> End-users do not care about character-set switching. They want to
> have one single, simple-to-understand encoding that is used
> universally. All this talk about the encoding inefficiency of UTF-8
> or UCS-2 compared to 8-bit code pages is just complete academic
> nonsense: a) storage prices drop to 50% every two years and they will
> continue to do so over the next 10 years, and b) only a few percent
> of memory is usually used to store uncompressed text. We live in a
> time where application software does not fit on a single CD-ROM any
> more, so don't claim that 16 or 24 bits per character is an
> unbearable waste of memory. Switching mechanisms are an unbearable
> waste of complexity, however.
>
> Systems like the MIME or ECMA registries with their hundreds of
> different encodings are nothing the end user is interested in. The
> end user wants to type any character, any time, anywhere, and this is
> only possible with a single system-wide encoding. For Unix and most
> Internet protocols, this must be UTF-8 due to the ASCII legacy; for
> other, more modern environments, UCS-2 might be a better approach.
>
> Markus
>
> --
> Markus G. Kuhn, Computer Science grad student, Purdue
> University, Indiana, USA -- email: kuhn@cs.purdue.edu
>
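
The two technical claims in the quoted message, ASCII compatibility and
the size cost relative to 8-bit code pages, can be checked with a small
Python sketch. The sample strings and the choice of cp1253 (a Greek code
page) are assumptions made only for illustration, and UTF-16 stands in
here for UCS-2, which Python does not expose as a separate codec.

```python
# Pure ASCII text is byte-for-byte identical when encoded as UTF-8,
# which is why existing ASCII-based software needs few changes.
ascii_text = "printer names"
ascii_unchanged = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Four Greek letters: one byte each in an 8-bit Greek code page,
# two bytes each in UTF-8 and in UTF-16.
greek = "αβγδ"
sizes = {
    encoding: len(greek.encode(encoding))
    for encoding in ("cp1253", "utf-8", "utf-16-le")
}

print(ascii_unchanged)
print(sizes)
```

For this sample the universal encodings take twice the space of the code
page. Whether that matters in practice is the point being argued above.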


