Re: UTF-8 and Kermit

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Tue Jul 15 1997 - 16:57:58 EDT


Frank da Cruz wrote on 1997-07-15 20:15 UTC:
> > Can Kermit already be switched to handle UTF-8?
>
> Nope, not yet.

Since you already use Unicode internally, this should be *very* easy to
implement. Based on my experience as the person who added UTF-8 support
to the Linux console driver two years ago, I would say adding UTF-8
will require less than one hour of work.

Feel free to use any of the UTF-8 code from linux/drivers/char/console.c
in the Linux kernel distribution.

> > If a terminal emulator handles UTF-8, then the C1 characters will be
> > interpreted AFTER the UTF-8 decoding has taken place.
> >
> I guess. The whole issue of Unicode as an on-the-wire character set, and
> its many possible encodings, especially in terminal emulation, is going to
> be an interesting one for some time to come. I don't know what else to say.

From the rest of your mail I see that you have many misunderstandings
about UTF-8. I hope I can clarify a few things:

You just declare UTF-8 as the character set encoding on the line, and
this activates a piece of code in kermit that first of all sends every
incoming byte through a UTF-8 -> UCS-2 routine, and then you continue
with all the normal processing after this conversion. All communication
coming from the Unix machine will be UTF-8, without any switching.
The purpose of UTF-8 is to make ISO 2022 switching unnecessary.
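
A minimal sketch of such a byte-at-a-time UTF-8 -> UCS-2 step, to show how
little state is involved. This is not the code from console.c or from
Kermit; the names utf8_state, utf8_feed and utf8_to_ucs2 are made up for
illustration. It assumes only 1- to 3-byte sequences matter (that is all
UCS-2 can hold), maps anything else to U+FFFD, and does not reject overlong
forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UCS2_REPLACEMENT 0xFFFDu

/* Hypothetical decoder state: bits collected so far and the number of
   continuation bytes still expected. */
typedef struct {
    uint16_t partial;
    int      pending;
} utf8_state;

/* Feed one incoming byte. Returns 1 and stores a UCS-2 value in *out when
   a character is complete, 0 when more bytes are needed. */
static int utf8_feed(utf8_state *st, unsigned char byte, uint16_t *out)
{
    assert(st != NULL && out != NULL);

    if (st->pending > 0) {
        if ((byte & 0xC0) == 0x80) {            /* continuation byte */
            st->partial = (uint16_t)((st->partial << 6) | (byte & 0x3F));
            if (--st->pending == 0) {
                *out = st->partial;
                return 1;
            }
            return 0;
        }
        /* Simplification: a truncated sequence is dropped silently and
           the current byte is treated as the start of a new character. */
        st->pending = 0;
    }

    if (byte < 0x80) {                          /* plain ASCII */
        *out = byte;
        return 1;
    }
    if ((byte & 0xE0) == 0xC0) {                /* lead byte, 2-byte form */
        st->partial = (uint16_t)(byte & 0x1F);
        st->pending = 1;
        return 0;
    }
    if ((byte & 0xF0) == 0xE0) {                /* lead byte, 3-byte form */
        st->partial = (uint16_t)(byte & 0x0F);
        st->pending = 2;
        return 0;
    }
    *out = UCS2_REPLACEMENT;    /* stray continuation or longer sequence */
    return 1;
}

/* Decode a buffer; returns the number of UCS-2 values written to out. */
static size_t utf8_to_ucs2(const unsigned char *in, size_t n,
                           uint16_t *out, size_t cap)
{
    utf8_state st = { 0, 0 };
    size_t i, used = 0;

    for (i = 0; i < n && used < cap; i++) {
        uint16_t ch;
        if (utf8_feed(&st, in[i], &ch))
            out[used++] = ch;
    }
    return used;
}
```

Everything after this step (C1 handling, escape sequence parsing, display)
would then see 16-bit characters, so the rest of the terminal emulator
would not need to know that UTF-8 was on the line.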

> The day we have to deal with it is the day that you Telnet to (say) a UNIX
> host and the herald and "login:" prompt come out in Unicode.

Every Plan9 machine is already doing this and has been doing this for many
years. See <ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/
UTF-8-Plan9-paper.ps.gz> for UTF-8 as used in Plan9.

> Which is not the
> same thing as logging in (as we do now) in ASCII or Latin-1, etc, and then
> maybe trying to display a Unicode file.

No, it is exactly the same! Since UTF-8 is upward compatible with
7-bit ASCII, you will not even notice that you are on a UTF-8 system
until you get the first non-ASCII character displayed. Just as with
ISO 8859-1. The login: will be sent as exactly the same byte sequence.
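
To make the compatibility claim concrete, here is a small sketch that
encodes a string of 8-bit characters (ASCII or ISO 8859-1) as UTF-8. The
helper names utf8_encode and latin1_to_utf8 are hypothetical. For any
7-bit input the output is byte-for-byte identical to the input; only
characters above 0x7F grow to two bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode one character value below 0x10000 as UTF-8.
   Returns the number of bytes written (1 to 3), or 0 if out of range. */
static size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;             /* ASCII passes unchanged */
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}

/* Re-encode a NUL-terminated ASCII / ISO 8859-1 string as UTF-8.
   Returns the number of bytes written; stops early if out is full. */
static size_t latin1_to_utf8(const char *in, unsigned char *out, size_t cap)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        unsigned char tmp[3];
        size_t len = utf8_encode((unsigned char)*in, tmp);
        if (used + len > cap)
            break;
        memcpy(out + used, tmp, len);
        used += len;
    }
    return used;
}
```

Fed the string "login: ", this produces the same seven bytes it was given;
fed the single ISO 8859-1 byte 0xE4 (a-umlaut), it produces 0xC3 0xA4.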

> And I'm not sure how we'd handle that
> anyway. If it was anything else, we could have a host-resident file viewer
> that sent the proper ISO 2022 sequences before and after the file, but as far
> as I know, there is nothing like that for Unicode / UTF-8 / etc, since these
> do not have the ISO character-set structure.

No!

In a Unicode Unix system (just like in Plan9), absolutely everything
is encoded in UTF-8: text files, file names, environment variable names
and contents, program messages, etc. Unix is much too ASCII oriented to
allow anything but UTF-8 to be used as a Unicode encoding.
Every ASCII file is already a UTF-8 file, so there is not too much
to change (unless you have ISO 8859-1 files which have to be sent
through GNU recode once).

You won't see UCS-2 files on Unix systems. UCS-2 will only be used as
wchar_t inside programs like editors that benefit from an internal
1-word-per-character representation. The ISO C library with its
multi-byte functions (unfortunately not described in K&R 2nd edition,
therefore not known by many C programmers) provides the functions to
convert between UTF-8 (called the "multi-byte encoding" in ISO C)
and wchar_t.
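
A hedged sketch of what that looks like with the standard mbstowcs()
function. Whether the multi-byte encoding really is UTF-8 depends on the
locale the C library is running in; the locale names "C.UTF-8" and
"en_US.UTF-8" tried below are assumptions about what a given system
provides, and the wrapper name mb_to_wide is made up.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a multi-byte string to wide characters via the ISO C library.
   Returns the number of wide characters stored, or -1 if no UTF-8 locale
   could be selected or the input is not valid in that locale. */
static long mb_to_wide(const char *s, wchar_t *out, size_t cap)
{
    /* Assumed locale names; a real system may use different ones. */
    static const char *const candidates[] = { "C.UTF-8", "en_US.UTF-8" };
    size_t i, n;
    int selected = 0;

    assert(s != NULL && out != NULL);
    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL) {
            selected = 1;
            break;
        }
    }
    if (!selected)
        return -1;

    n = mbstowcs(out, s, cap);      /* multi-byte -> wchar_t */
    if (n == (size_t)-1)
        return -1;
    return (long)n;
}
```

The reverse direction would use wcstombs() in the same way.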

> Stated another way: is there a movement afoot to register Unicode, UTF-8,
> etc, with ISO so that they get ISO 2022 escape sequences?

This has been done long ago. You switch from ISO 2022 to UTF-8 with ESC % G
and get back into the ISO 2022 world (as defined in ISO 2022) with ESC % @.
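
For a terminal emulator that does want to honour these two sequences, a
sketch of recognizing them in the incoming byte stream could look like
this. The byte values follow directly from the ASCII codes of ESC, '%',
'G' and '@'; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const unsigned char ENTER_UTF8[3] = { 0x1B, 0x25, 0x47 }; /* ESC % G */
static const unsigned char LEAVE_UTF8[3] = { 0x1B, 0x25, 0x40 }; /* ESC % @ */

typedef enum {
    SWITCH_NONE,        /* not one of the two sequences */
    SWITCH_TO_UTF8,     /* ESC % G seen */
    SWITCH_TO_ISO2022   /* ESC % @ seen */
} charset_switch;

/* Classify the first three bytes at p, if that many are available. */
static charset_switch classify_switch(const unsigned char *p, size_t n)
{
    assert(p != NULL);
    if (n >= 3 && memcmp(p, ENTER_UTF8, 3) == 0)
        return SWITCH_TO_UTF8;
    if (n >= 3 && memcmp(p, LEAVE_UTF8, 3) == 0)
        return SWITCH_TO_ISO2022;
    return SWITCH_NONE;
}
```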

You should definitely read ISO 10646-1/Am. 2 where UTF-8 is defined.
The final text is available on

 ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/ISO-10646-UTF-8.html

See Appendix R.6 for the ISO 2022 ESC sequence for UTF-8.

But as I said: Under Plan9 and Unix you do not switch! Everything just simply
is UTF-8. I want to be able to tell kermit even before I open the
telnet connection that everything will be UTF-8, and then things should
work without any ISO 2022 sequence ever being transmitted.

If you have more questions about how UTF-8 is used under Unix, I'd be
happy to assist. Plan9 and to some degree Linux are excellent test
environments for the pure UTF-8 experience.

Looking forward to UTF-8 support in Kermit ...

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:35 EDT