Re: Unicode support under Linux

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Tue Jan 12 1999 - 09:38:46 EST


Alan Nash wrote on 1999-01-12 01:04 UTC:
> Do you have any suggestions on how to do the following things using
> Unicode on Linux (perhaps none is possible):
>
> (1a) xterm output in UTF8 or Unicode

There is currently no UTF-8 support in the standard xterm provided by
The Open Group or XFree86. The XFree86 version of xterm is currently
maintained by Thomas E. Dickey <dickey@clark.net> (see
<http://www.clark.net/pub/dickey/xterm/>). I have talked to him about adding
UTF-8 support, and he seems interested in the idea but wanted to
complete other things in xterm first. I don't know what the current
status of adding UTF-8 to the XFree86 xterm is.

I am eagerly looking forward to having UTF-8 support in xterm, because
practically all other UTF-8 extensions will depend on having it
available in the commonly used terminal emulator.

Unicode can only reasonably be supported under VT100-style terminal
emulators such as xterm or kermit in the UTF-8 encoding. With "Unicode"
as opposed to UTF-8, I assume that you mean UCS-2 or UTF-16, i.e. a
stream of 16-bit character values. Since VT100 terminal interfaces are
inherently 8-bit blocked, using raw 16-bit values would create
synchronization hazards and would also create very severe
backwards-compatibility problems. UTF-8 is clearly the way to use
Unicode with VT100 terminals as well as in Unix text files, text pipes,
file names, environment variables, etc. UCS-2 or UCS-4 might be used by
some applications in internal data structures, but I expect UTF-8 to
dominate any form of communication between processes, where 7-bit ASCII
or one of its 8-bit extensions is used today and where backwards
compatibility to 7-bit ASCII is highly desirable. Unix is very unlikely
to go the path of Win32 and double all interfaces to handle both ISO
8859 and UCS-2 strings.
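
To make the synchronization hazard concrete, here is a minimal sketch
in Python. The helper name control_bytes and the choice of U+221B
(CUBE ROOT) as the test character are mine, picked only for
illustration. Sent as raw big-endian 16-bit values, that character
contains the byte 0x1B, which a VT100-style terminal reads as ESC;
in UTF-8 every byte of a non-ASCII character is 0x80 or above, and
plain ASCII text is unchanged.

```python
def control_bytes(data):
    """Return the C0 control bytes (0x00-0x1F) and DEL (0x7F) found in data."""
    return [b for b in data if b < 0x20 or b == 0x7F]


# U+221B CUBE ROOT, chosen only because its low byte is 0x1B (ESC).
ch = "\u221b"

ucs2 = ch.encode("utf-16-be")   # raw 16-bit value: bytes 0x22 0x1B
utf8 = ch.encode("utf-8")       # bytes 0xE2 0x88 0x9B

# The 16-bit form smuggles an ESC byte into an 8-bit terminal stream.
ucs2_hazards = control_bytes(ucs2)

# The UTF-8 form contains no control bytes at all.
utf8_hazards = control_bytes(utf8)

# 7-bit ASCII text is byte-for-byte identical in UTF-8.
ascii_unchanged = "ls -l".encode("utf-8") == b"ls -l"
```

Here ucs2_hazards holds the single value 0x1B and utf8_hazards is
empty, which is the backwards-compatibility property argued for above.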

I hope that eventually xterm can be started with some "-utf8" option and
then the displayed text will be interpreted as UTF-8, the keyboard
generates UTF-8 codes, and cut&paste functions will operate with UTF-8
as well. However, this is not a completely trivial modification to
xterm, as many internal data structures of xterm first have to be made
16-bit wide.

> (1b) console output in UTF8 or Unicode

The Linux console has been using Unicode internally since around 1994.
Again, direct 16-bit Unicode (UCS-2) output/input on a VT100 terminal
emulator does not make any sense; UTF-8 has to be used here as well.
UTF-8 has been supported by the Linux console at a rudimentary level for
a long time, but the documentation of all this leaves much to be desired
unfortunately. You can activate UTF-8 display output with ESC % G and
deactivate it with ESC % @ (the official ISO 2022 sequences for UTF-8).
Unfortunately, the keyboard (because it is a separate Linux driver) has
to be switched separately in an obscure way with an appropriate ioctl()
call (I think with the KDSKBMODE and K_UNICODE values from
/usr/include/linux/kd.h; the files vt.c, keyboard.c, and console.c in
/usr/src/linux/drivers/char/ contain the code that handles this).
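
A rough sketch of both steps in Python, under some assumptions: the
numeric values of KDSKBMODE and K_UNICODE are copied from linux/kd.h
as I know them and should be checked against your own headers, the
function name enable_console_utf8 and the default device path are
made up for this example, and the call only works when the file
descriptor refers to a real Linux virtual console (on any other tty
the ioctl raises OSError).

```python
import fcntl
import os

# Values as defined in <linux/kd.h>; verify against your kernel headers.
KDSKBMODE = 0x4B45   # ioctl request: set keyboard mode
K_UNICODE = 0x03     # keyboard mode: deliver keys as UTF-8

UTF8_ON = b"\x1b%G"   # ESC % G: switch display output to UTF-8
UTF8_OFF = b"\x1b%@"  # ESC % @: return to ISO 2022 / ISO 8859 mode


def enable_console_utf8(tty_path="/dev/tty"):
    """Switch display and keyboard of a Linux console to UTF-8.

    Hypothetical helper: tty_path must name a Linux virtual console.
    """
    fd = os.open(tty_path, os.O_RDWR | os.O_NOCTTY)
    try:
        # Step 1: the display side is switched by an escape sequence.
        os.write(fd, UTF8_ON)
        # Step 2: the keyboard driver is switched separately via ioctl().
        fcntl.ioctl(fd, KDSKBMODE, K_UNICODE)
    finally:
        os.close(fd)
```

Calling enable_console_utf8() from a shell on a text console would
perform both switches; writing UTF8_OFF undoes only the display half,
which is the asymmetry complained about here.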

The obscurity of how to activate UTF-8 in the Linux console has caused
this feature to be mostly unused so far. I feel that ESC % G should
activate UTF-8 in *both* the console and the keyboard driver. If people
need separate activation (why?) then this should be done for both via
separate ioctl() calls. Other terminal emulators also switch both the
keyboard and the display simultaneously when the proper ISO 2022
ESC sequence is received. Comments?

We should also support in addition to ESC % G (UTF-8 with standard
return) the ESC % / G (UTF-8 without standard return) sequence, such
that the console can be protected from any accidental return to
ISO 2022 switching by binary output etc.
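
The difference between the two sequences can be shown with a toy
model in Python. This is not how the console driver is implemented;
the function final_mode and the mode names are invented here, and the
scanner knows only these three escape sequences. It merely shows that
after ESC % / G a stray ESC % @ in binary output no longer switches
the terminal back.

```python
UTF8_ON = b"\x1b%G"          # UTF-8 with standard return
UTF8_ON_LOCKED = b"\x1b%/G"  # UTF-8 without standard return
UTF8_OFF = b"\x1b%@"         # standard return to ISO 2022


def final_mode(stream):
    """Return the mode ('iso2022' or 'utf8') after scanning stream."""
    mode = "iso2022"
    locked = False
    i = 0
    while i < len(stream):
        if stream.startswith(UTF8_ON_LOCKED, i):
            mode, locked = "utf8", True
            i += len(UTF8_ON_LOCKED)
        elif stream.startswith(UTF8_ON, i):
            mode = "utf8"
            i += len(UTF8_ON)
        elif stream.startswith(UTF8_OFF, i):
            # The return sequence is honoured only if UTF-8 was entered
            # "with standard return".
            if not locked:
                mode = "iso2022"
            i += len(UTF8_OFF)
        else:
            i += 1  # ordinary data byte
    return mode
```

With this model, final_mode(b"\x1b%G text \x1b%@") gives "iso2022",
while final_mode(b"\x1b%/G text \x1b%@") stays at "utf8".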

See sections 2.8.1 and 2.8.2 of the ISO 2022 ESC sequence registry on
<http://www.itscj.ipsj.or.jp/ISO-IR/>.

> (2) UTF8/Unicode printing

Yudit comes with a UTF-8 printing tool, which you might want to
check out, but not much else exists apart from this one. The
standard plain text file printing tools like GNU enscript all have to be
extended to also eat UTF-8 input text files and use the available
Postscript standard fonts for a best effort presentation of the text.

> (3) UTF8/Unicode editing (ideally emacs)

Yudit is probably the best Linux UTF-8 editor currently available. I use
it routinely to edit/display my UTF-8 files. Mark Leisher
<mleisher@crl.nmsu.edu> seems to be working on UTF-8 support for emacs,
but I don't know about the current status. There is also Plan9's Sam, which
I've not yet used myself.

> (4) Etc: more/less, Unicode text-processing file tools, etc.

None, except for a few character set conversion tools, most of which now
also offer UTF-8 processing (e.g. the GNU recode beta prereleases on
<http://www.iro.umontreal.ca/contrib/recode/>).

> I am particularly interested in support for characters in the 22xx range,
> since I use TeX a lot. My personal dream is to be able to create TeX
> source in UTF8 using characters in the 22xx range, then preprocess that
> and feed it to TeX.

Exact same dream here. However, first of all xterm has to be made UTF-8
aware, because all the other tools depend on it in daily use, then we
can see further. We should try to get a decent UTF-8 -> TeX and
UTF-8 -> LaTeX converter into GNU recode to do the preprocessing
you suggested.
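
As a rough idea of what such a preprocessing step could look like,
here is a small Python sketch. It is not part of GNU recode; the table
TEX_MACROS and the function utf8_to_tex are hypothetical, and the
table covers only a handful of characters from the 22xx block. A
trailing space is emitted after each macro so that a following letter
is not absorbed into the control word.

```python
# Hypothetical, deliberately tiny mapping from U+22xx characters to
# standard TeX math macros.
TEX_MACROS = {
    "\u2200": r"\forall",
    "\u2203": r"\exists",
    "\u2208": r"\in",
    "\u221e": r"\infty",
    "\u2264": r"\leq",
    "\u2265": r"\geq",
}


def utf8_to_tex(data):
    """Decode UTF-8 bytes and replace known symbols by TeX macros.

    Characters without a table entry are passed through unchanged.
    """
    out = []
    for ch in data.decode("utf-8"):
        macro = TEX_MACROS.get(ch)
        if macro is not None:
            out.append(macro + " ")  # space terminates the control word
        else:
            out.append(ch)
    return "".join(out)
```

Feeding it the UTF-8 encoding of U+2200, "x", U+2208, "S" yields the
string \forall x\in S, which TeX accepts inside math mode.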

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


