Re: Unicode end-users

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Fri Aug 01 1997 - 18:34:48 EDT


Glen Perkins wrote on 1997-08-01 21:03 UTC:
> Are there any signs that Plan9 will ever be anything more than a science
> project? Is Plan9, and what it does or doesn't do, likely to matter more
> than, say, the Amiga?

Plan9 is at the moment only an operating system that is made very easily
available to universities, and that a number of enthusiastic computer
science students run on their home PCs. Plan9 is a sort of Unix without
the more than 20 years of legacy that Unix as we know it carries around,
and the absence of that legacy contributes much to its conceptual beauty.

To be fair, I have to say that when Plan9 was converted to UTF-8 in 1992,
almost all installations and applications were hosted inside a single
organization, so the UTF-8 migration took only a few days of work.
See <ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/
UTF-8-Plan9-paper.ps.gz> for the legendary Pike/Thompson USENIX paper about
UTF-8 (then it was still called UTF-2) in Plan9.

Plan9 is going to matter mostly in an intellectual way and not as a brutal
market force like other operating systems. The Plan9 authors are highly
respected in the operating system design community, and many of their
ideas and tools are currently being ported to other platforms. [I've just
yesterday seen a new release of Plan9's shell "rc" for Linux for instance.]

> Are commercial unix vendors (Sun, HP, SGI, etc.) making any moves toward
> switching the OS over to Unicode/ISO 10646, or are they leaving it to
> application developers to add "support" as a "feature" if they feel so
> inclined?

I'll have to leave it to them to answer this.

> What about Linux? What are the chances that ISO 10646 will ever be more
> fundamental to Linux than, say, JPEG or any other popular data type
> that's left to applications to support (or ignore if it's too much
> bother)?

My hope is that UTF-8 will eventually replace ISO 8859-1 as the system
character set, and work on supporting this is in progress.

The problem with "Linux" is that Linux per se is not any coherent product.
The term "Linux" strictly refers only to the kernel that is maintained
by Linus Torvalds from Helsinki University, and the C library and gcc port
maintained by H.J. Lu from MIT. Apart from these two products,
Linux distributions consist of exactly the same collection of
software that has mostly been in use on any kind of Unix or POSIX
environment long before Linux existed: This includes X11, TeX, GNU,
BSD, Netscape, Motif, and thousands of programs from individual Unix
application authors that are also used on Solaris, HPUX, Irix, etc.

The Linux kernel and libc are being prepared to support a Plan9 style
UTF-8 only mode of operation, but unlike Plan9, this is done within the
frame of the ISO C Am. 1 extensions for handling multi-byte encodings.

The kernel changes that were necessary:

 - Console terminal emulator decodes UTF-8 (and displays the 8-bit subset
   of Unicode that the currently loaded VGA-font has available).

 - Drivers of non-Linux file systems (FAT, VFAT, OS/2, WIN NT, etc.)
   that did character set conversion of the remote character set to/from
   ISO 8859-1 now also have to offer conversion to/from UTF-8 as an
   alternative.

C library support: ISO C Am. 1 multibyte functions for comfortable UTF-8
handling.
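
To give an idea of what such a multibyte function has to do for UTF-8,
here is a minimal sketch. It is written in Python for brevity and is not
the ISO C Am. 1 API itself; decode_one is a hypothetical name. It decodes
one character from a byte string and returns the code value together with
the number of bytes consumed, which is roughly the job of mbtowc(). It
assumes only the 1- to 3-byte forms are needed (enough for 16-bit values)
and does not reject overlong encodings.

```python
def decode_one(buf, pos=0):
    """Decode one UTF-8 sequence starting at buf[pos].

    Returns (code_value, bytes_consumed). Only the 1- to 3-byte forms
    (U+0000..U+FFFF) are handled; this is an illustration, not a
    full validator.
    """
    b0 = buf[pos]
    if b0 < 0x80:                      # 0xxxxxxx: plain ASCII
        return b0, 1
    if 0xC0 <= b0 < 0xE0:              # 110xxxxx 10xxxxxx
        need, value = 1, b0 & 0x1F
    elif 0xE0 <= b0 < 0xF0:            # 1110xxxx 10xxxxxx 10xxxxxx
        need, value = 2, b0 & 0x0F
    else:
        raise ValueError("unsupported or invalid lead byte")
    tail = buf[pos + 1:pos + 1 + need]
    if len(tail) != need or any(b & 0xC0 != 0x80 for b in tail):
        raise ValueError("truncated or malformed sequence")
    for b in tail:
        value = (value << 6) | (b & 0x3F)   # append 6 payload bits
    return value, 1 + need
```

An application would call this in a loop, advancing pos by the returned
byte count, instead of assuming one byte per character.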

Application changes: most important is that strlen() cannot be used
any more to determine how many character cells wide a string will be
when written with a monospaced font, because it counts bytes and not
characters. In editors, cursor positioning gets slightly more
complicated, but the necessary code is usually already there to handle
substitutions for control characters like ^M. Libraries like ncurses
for text screen layout have to be modified. X11 library functions have
to handle strings as UTF-8 and have to generate the corresponding
16-bit values for glyph selection.
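
The strlen() point can be illustrated with a short sketch, again in
Python rather than C. The function name cell_width is hypothetical, and
the rules for combining marks and double-width East Asian characters are
my own assumptions that go beyond the simple case discussed here; control
characters are not handled at all.

```python
import unicodedata

def cell_width(text):
    """Approximate number of monospaced character cells `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                     # combining marks share the previous cell
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2                   # wide and fullwidth characters
        else:
            width += 1
    return width

s = "Gr\u00fc\u00dfe"                    # the German word "Gruesse" with u-umlaut and sharp s
byte_len = len(s.encode("utf-8"))        # what strlen() would count: bytes
cells = cell_width(s)                    # what screen layout needs: cells
```

For this string the byte count is 7 while the string occupies 5 cells,
so any layout code that pads or truncates by byte count will be off by
two columns.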

In order to get UTF-8 widely accepted among Linux users however, a number
of non-Linux pieces of software have to be made UTF-8 aware. Most notable
are xterm and editors like emacs. This has not yet been done,
therefore UTF-8 usage under Linux is still experimental.

If we just add UTF-8 support to existing applications unconditionally,
then these applications won't work in existing environments any more.
Therefore, all changes made for UTF-8 support in libraries, applications,
etc. will be dormant until the environment variable LC_CTYPE is set to
a string containing the substring "UTF-8".
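
A minimal sketch of that check, in Python, might look as follows. The
function names are hypothetical. Note that real locale handling usually
also consults LC_ALL and LANG; this sketch deliberately follows only the
rule stated above and looks at LC_CTYPE alone.

```python
import os

def utf8_mode_enabled(environ=os.environ):
    """True if LC_CTYPE is set to a string containing "UTF-8"."""
    return "UTF-8" in environ.get("LC_CTYPE", "")

def text_encoding(environ=os.environ):
    """Pick the encoding an application should use for text I/O."""
    # ISO 8859-1 as the fallback is an assumption matching the current
    # default described in this message.
    return "utf-8" if utf8_mode_enabled(environ) else "iso-8859-1"
```

An application would then route every place where it reads or writes
text through text_encoding(), so that the UTF-8 code paths stay dormant
in existing environments.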

Once most software supports UTF-8, all you have to do is

  - run a script that converts all text files and filenames to UTF-8
    (not necessary if you used only 7-bit ASCII so far anyway)

  - set LC_CTYPE=UTF-8

  - enjoy the new encoding
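
The conversion script in the first step could be sketched like this in
Python. The name convert_tree is hypothetical, and the sketch rests on
strong assumptions: every regular file under the directory is a text
file in ISO 8859-1, nothing has been converted before (a second run
would double-encode non-ASCII text), and there are no symbolic links
that need special care. Binary files would be damaged, so a real script
needs a way to skip them.

```python
import os

def convert_tree(root, old="iso-8859-1"):
    """Re-encode file contents and names under `root` from `old` to UTF-8."""
    broot = os.fsencode(root)            # work on raw bytes, not decoded names
    # Walk bottom-up so that a directory is renamed only after everything
    # inside it has been processed.
    for dirpath, dirnames, filenames in os.walk(broot, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            with open(path, "wb") as f:
                f.write(data.decode(old).encode("utf-8"))
        for name in filenames + dirnames:
            new_name = name.decode(old).encode("utf-8")
            if new_name != name:         # pure ASCII names stay untouched
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new_name))
```

Files and names that contain only 7-bit ASCII come out byte-for-byte
identical, which is why the step is unnecessary for pure ASCII systems.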

With LC_CTYPE=UTF-8, the mount command that mounts a DOS partition
(where everything is in CP437) will tell the kernel to transform the
filenames and text-file contents to UTF-8 instead of ISO 8859-1.
Similarly, by seeing LC_CTYPE=UTF-8, your email software will know in
what encoding to save received messages from now on, etc.
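
What the conversion of a DOS filename amounts to can be sketched in
Python, which ships a CP437 codec. The function name dos_name_to_host is
hypothetical, and replacing unconvertible characters by "?" in the
ISO 8859-1 case is this sketch's choice, not something the kernel
drivers are known to do.

```python
def dos_name_to_host(raw, utf8_mode):
    """Convert a CP437 filename (bytes) to the host encoding (bytes).

    In UTF-8 mode every CP437 character survives. In ISO 8859-1 mode,
    characters with no Latin-1 equivalent (box drawing, Greek, ...)
    are replaced by "?".
    """
    text = raw.decode("cp437")
    if utf8_mode:
        return text.encode("utf-8")
    return text.encode("iso-8859-1", errors="replace")
```

The byte 0x81 (u-umlaut in CP437) converts cleanly either way, but the
box-drawing byte 0xC4 has no ISO 8859-1 equivalent and is only preserved
in UTF-8 mode. That loss-free round trip is one practical argument for
the UTF-8 setting.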

> What does "Posix compliant" *require* of an OS with regard to ISO 10646?

Today nothing as far as I understand, because the current POSIX standard
predates ISO 10646 and UTF-8. There seems however to be a broad consensus
(also laid down in various ISO working papers) among POSIX working group
members that if ISO 10646 is used under a POSIX system, then
UTF-8 is the way to go. Future development of the POSIX standards
will reflect this.

> I can't count the number of times I've been told by a skilled
> applications programmer that he wasn't considering adding Unicode
> support because "nobody on my team speaks Lower Slobovian or Egyptian
> Hieroglyphics or whatever".

That is why it is important to eventually leave the application programmer
no choice between Unicode and something else. If he wants to support the
U.S. trademark sign, then he cannot avoid also supporting the Cyrillic
alphabet at the same time. This way, at least the left-to-right monospaced
Level 1 subset of ISO 10646 (what I would like to call "ISO 10646 Level 0" or
see it specified in a separate standard, say ISO 15646-1) would get very
widely supported, because this Level 0 subset has absolutely no additional
complexity compared to ASCII apart from not being 8-bit any more.
Many applications programmers who are otherwise afraid of Unicode can
*very* quickly get used to this Level 0 of UCS. The amount of training
necessary is less than 30 minutes of explanation of UTF-8 and maybe
another half hour for the new ISO C Am. 1 functions.
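
Since "Level 0" is only my informal term and has no official definition,
the following Python sketch is merely one possible reading of it: no
characters beyond 16 bits, no combining marks, and no right-to-left
characters. The function name is_level0 and the exact set of excluded
bidi classes are assumptions made for illustration.

```python
import unicodedata

# Bidi classes treated as "right-to-left" in this sketch (an assumption).
RTL_CLASSES = {"R", "AL", "RLE", "RLO", "RLI"}

def is_level0(text):
    """True if `text` stays inside the simple subset described above."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            return False                 # does not fit in 16 bits
        if unicodedata.category(ch).startswith("M"):
            return False                 # combining mark
        if unicodedata.bidirectional(ch) in RTL_CLASSES:
            return False                 # would need bidi handling
    return True
```

A string containing the trademark sign and Cyrillic letters passes this
test, while Hebrew text or a base letter followed by a combining accent
does not.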

> Is this likely to happen on major Unix platforms as well, or
> will Unix programmers all need to switch to Java to achieve the same
> effect ;-) ?

We will see. At the moment, Plan9 is the only reference implementation
of UTF-8, and Plan9 is not really a POSIX system. I hope that we will
soon be able to get Linux and various programs that are essential
under Linux (xterm, emacs, etc.) to a point where we can demonstrate
pure UTF-8 operation under Linux. Once this reference platform is up
and running, many more Unix folks will understand what is involved in
UTF-8 support, and I hope that then UTF-8 will also quickly catch on
on other Unix platforms.

Java is an excellent example where the right way has been selected:
Java programmers do not have the choice NOT to use Unicode. However,
I believe that many Java programs will only support what I call the
UCS Level 0 implementation, i.e. no bidi, no combining, nothing that
looks more complicated than 16-bit ASCII. This subset of Unicode will
play a major role in the non-i18n field soon, and it should get an
official name.

Question: With the X Consortium gone, who is now in charge of maintaining
X11? The support of the X11 maintainers will be very important in getting
UTF-8 widespread under Linux and other Unices.

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu


