Re: Linux & Unicode

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Fri Dec 04 1998 - 07:23:03 EST


Arnt Gulbrandsen wrote on 1998-12-04 02:28 UTC:
> > I also think "low" approaches to getting the basics of Unicode support
> > are a good idea to get people started. For instance, I think there's
> > real value in Roman Czyborra's work on producing a simple Unicode font
> > for X11 (http://czyborra.com/unifont/). Sure, this doesn't solve the
> > problem, but it puts the shape of the problem in people's heads, and
> > appeals to hackers.
>
> Now, THIS is a good idea. Also Markus Kuhn's similar work. IMO, it
> does indeed solve a real problem: There aren't enough fonts available.
> And when there is a solution for one problem, people will build on it
> to solve the problems that it makes more tractable.

Exactly. I also agree that ISO 10646-1 efforts for Linux should focus on
"low" approaches first. Let's see UTF-8 first of all as nothing more
than a replacement for ISO 8859-* and perhaps also JIS/GB. The primary
focus should be to extend the most commonplace tools so that they can
use UTF-8 with a WGL4-style Unicode subset. Just ignore those people who
keep telling you that "Unicode is so much more" and scare everyone
off with complex presentation algorithms for Indic and Arabic scripts,
bidi, combining characters, etc. Users of these scripts are currently
less than 1% of the Linux community, so we shouldn't delay things by
demanding everything at once. Let's start very simple.
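
To make the "replacement for ISO 8859-*" view concrete, here is a
minimal sketch in Python of the UTF-8 byte layout. The function names
utf8_encode and latin1_to_utf8 are made up for this illustration, and
the range is capped at U+10FFFF, which is an assumption of this sketch
(it ignores the longer 5- and 6-byte forms that ISO 10646-1 also
defines). It is only meant to show that ASCII bytes pass through
unchanged and that every ISO 8859-1 byte value is already its own code
point; it is not a production converter.

```python
def utf8_encode(code_point):
    """Encode one UCS code point (an int) as UTF-8 bytes."""
    # Assumption of this sketch: accept only U+0000..U+10FFFF, no surrogates.
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:
        # 7-bit ASCII: a single byte, identical to the ASCII byte.
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def latin1_to_utf8(data):
    """Re-encode ISO 8859-1 bytes as UTF-8.

    Each ISO 8859-1 byte value equals its UCS code point, so no
    mapping table is needed for this particular 8859 part.
    """
    return b"".join(utf8_encode(byte) for byte in data)
```

For example, latin1_to_utf8(b"caf\xe9") yields b"caf\xc3\xa9": the three
ASCII bytes are untouched and only the accented letter grows to two
bytes. The other ISO 8859 parts would need a mapping table first.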

OK, here is what already exists:

 - There is some UTF-8 support in glibc2 and more is to come,
   but it is not documented and nobody knows how to use it in
   applications. The free online "Easy Guide to UTF-8 for Unix C
   Programmers" still has to be written!

 - There has been some basic UTF-8 support in the Linux text mode console
   and keyboard driver since 1994, but nobody uses it, because everyone
   uses xterm instead of the console to do real work. Getting xterm
   to handle UTF-8 is a top priority task, because all other non-GUI
   tools depend on it!

 - I have extended the 6x13 xterm default font to 2800 characters. This
   project is now completed <http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html>
   and I have sent it to the XFree86 people, hoping it will find its way
   into the next release.

 - Roman Czyborra <czyborra@cs.tu-berlin.de> has assembled a very complete
   8x16/16x16 character font, Unifont, which I also hope will find its way
   into major distributions soon. It is not a fixed-width font though, so
   I see some problems with applications being able to use the full
   repertoire immediately, especially since there is no standard for which
   Unicode characters should be twice as wide in a biwidth font
   (as there is for JIS).

 - The only major UTF-8 capable application at the moment is the Yudit
   editor.
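
The biwidth problem mentioned for Unifont can be sketched as follows.
This is only an illustration in Python: it borrows the East Asian Width
property of Unicode Standard Annex #11 as exposed by Python's
unicodedata module, which is an assumption of this sketch and not
something the tools discussed here provide. The helper names cell_width
and string_width are hypothetical, and the rule used (0 cells for
combining characters, 2 for wide and fullwidth characters, 1 otherwise)
is one plausible convention rather than an established standard.

```python
import unicodedata


def cell_width(ch):
    """Return how many terminal cells one character would occupy.

    Assumed convention for this sketch:
      0 for combining characters (they overlay the previous cell),
      2 for East Asian Wide (W) and Fullwidth (F) characters,
      1 for everything else.
    """
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_width(text):
    """Sum the cell widths of all characters in text."""
    return sum(cell_width(ch) for ch in text)
```

An application that lays out text in character cells would call
string_width() instead of counting characters or bytes, so that a CJK
ideograph advances the cursor by two columns in a biwidth font.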

Here is what is currently going on:

 - I will soon start extending my font project to many of the other
   misc-fixed-* fonts (7x13, 8x13, 7x14, 9x15, 10x20). I'll probably not have
   the time to extend all of them to the same 2800 character repertoire that
   6x13 now has, but my goal is to reach at least the WGL4 repertoire
   (652 characters, superset of the usual ISO and Microsoft code pages)
   for all of these, plus whatever people want to contribute (e.g., I got
   a 9x15 offer for Ethiopic, basic Greek has already been contributed
   for all sizes, etc.). With 2800 characters, the 6x13 size has reached
   the upper limit for this resolution, but I believe that 9x15 and 10x20
   can easily cover over 4000 characters.

 - It seems that Brent Welch <welch@scriptics.com> is now looking into
   UTF-8 support for his fine exmh mailer.

 - Mark Leisher is looking into UTF-8 support for emacs.

 - Donald Page <donaldp@sco.COM> is trying to find out whether SCO could
   contribute their UTF-8 extension of xterm back to the Open Group sample
   implementation.

 - Thomas E. Dickey <dickey@clark.net>, the maintainer of the XFree86
   version of xterm, is also looking into adding UTF-8 support there.

 - Henry Spencer's regular expression library now contains UTF-8 support.

 - Perl/TCL/Python are in the process of being transformed to use UTF-8
   as their internal string encoding.

 - The GNU Ada95 3.11p compiler now contains UTF-8 support (Ada95 uses
   ISO 10646-1 as its internal character set).

It is most important that the most basic Unix development tools such as
xterm, emacs, vi, grep, bash, etc. become fully UTF-8 compliant first.
It makes little sense to focus on fancy new widget sets like Qt and GTK+
before we have introduced UTF-8 into the classic basic tools.

As long as the classic basic shell tools and editors are not UTF-8
capable, it is too early to recommend that users generally move from
whatever ISO 8859 part they are currently using to UTF-8. And people will
not be motivated to fix these tools as long as xterm does not support
UTF-8.

I have been very keen on switching everything on my system to UTF-8
since 1994, but I haven't done it yet. I will do it as soon as xterm,
vi, emacs, exim, gcc, bash, and readline are ready for operating in a
pure UTF-8 environment. We are getting closer, but we are not there yet
and we have not yet reached critical mass.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


