Re: Coding Systems Different from ISO 2022 (UTF-8)

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Mon Oct 12 1998 - 18:52:29 EDT


Frank da Cruz wrote on 1998-10-12 20:45 UTC:
> Can anybody tell me where to find out what ISO means when it assigns an ISO
> 2022 escape sequence for a "coding system different from ISO 2022" (such as,
> for example, NAPLPS, or UCS-4, or UTF-8)?

That is an important topic for terminal emulator authors considering
adding ESC sequences for (de-)activating UTF-8.

"Coding system different from ISO 2022" just means any system in
which the byte range is no longer structured into the four areas
C0, GL, C1, GR. UTF-8 and UCS-2 are obvious examples of coding systems
that cannot be described within the ISO 2022 world, but so are coding
systems such as CP437 and CP1252, which place graphic characters in
the C1 range and so also lack the C1/GR split.
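
To make the four areas concrete, here is a minimal sketch for an 8-bit
code. The function name iso2022_area is only an illustration, not part
of any standard or library. It shows that the UTF-8 encoding of U+0100
uses a byte from the C1 range as ordinary character data, which is why
UTF-8 does not fit the ISO 2022 structure.

```python
def iso2022_area(byte: int) -> str:
    """Return the ISO 2022 area of an 8-bit code that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte < 0x20:
        return "C0"   # 0x00-0x1F: control characters (ESC is 0x1B)
    if byte < 0x80:
        return "GL"   # 0x20-0x7F: left-hand graphic area
    if byte < 0xA0:
        return "C1"   # 0x80-0x9F: second set of control characters
    return "GR"       # 0xA0-0xFF: right-hand graphic area


# U+0100 is encoded in UTF-8 as the two bytes 0xC4 0x80.
utf8_bytes = "\u0100".encode("utf-8")
areas = [iso2022_area(b) for b in utf8_bytes]
print(areas)  # the second byte lands in the C1 control range
```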

> Is the intention to identify the
> coding system to the recipient, so it can switch to it, and also disable
> ISO-2022 character-set designation and invocation from that moment onwards,
> since we have now switched to a new coding system in which we will not
> necessarily be able to recognize escape sequences for further switching?

Yes. The ISO 2022 commands (invoking G2 into GR, etc.) make no sense as
long as you are in UTF-8. They should be ignored. If you activated UTF-8
using an ISO 2022 sequence such as ESC % G, then the only ISO 2022
sequence that should from then on be accepted and not ignored is, in my
opinion, ESC % @, which terminates the "coding system different from
ISO 2022" and goes back to ISO 2022. If your UTF-8 mode was not
activated using the ISO 2022 sequence (ESC % G), but by some other means
(say a command line option or a hotkey), then even ESC % @ should be
ignored like all other ISO 2022 commands: since you did not come from
ISO 2022, you also should not be able to go back.
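
The rule above can be sketched as a small state machine. This is only
an illustration of my suggestion, not the behaviour of any existing
terminal emulator; the names Mode and next_mode are hypothetical, and
the sketch assumes the escape sequence has already been parsed out of
the input stream.

```python
from enum import Enum


class Mode(Enum):
    ISO2022 = "iso2022"              # normal ISO 2022 operation
    UTF8_SWITCHED = "utf8-switched"  # UTF-8 entered via ESC % G
    UTF8_LOCKED = "utf8-locked"      # UTF-8 forced by option or hotkey


ENTER_UTF8 = b"\x1b%G"  # ESC % G
LEAVE_UTF8 = b"\x1b%@"  # ESC % @


def next_mode(mode: Mode, sequence: bytes) -> Mode:
    """Return the coding-system mode after seeing one escape sequence."""
    if mode is Mode.ISO2022 and sequence == ENTER_UTF8:
        return Mode.UTF8_SWITCHED
    if mode is Mode.UTF8_SWITCHED and sequence == LEAVE_UTF8:
        return Mode.ISO2022
    # Everything else leaves the mode unchanged. In particular,
    # UTF8_LOCKED ignores ESC % @ because it did not come from ISO 2022.
    return mode
```

With this sketch, next_mode(Mode.UTF8_LOCKED, LEAVE_UTF8) stays in
UTF8_LOCKED, while next_mode(Mode.UTF8_SWITCHED, LEAVE_UTF8) returns to
ISO2022.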

> In particular, I'm curious about an environment in which the host switches
> the terminal to the UTF-8 coding system. Since Unicode includes ASCII as
> well as C0 and C1 controls (and so UTF-8 can include both sets of controls
> too), should it be possible to switch out of UTF-8 coding once having
> switched into it?

If you came from ISO 2022, you should be able to go back via ESC % @.
If you did not come from ISO 2022, there should be no way back to it.

> (I know, why would anybody ever want to switch out of UTF-8? :-)

No joke:

I think ISO 2022 is quite annoying in that if one accidentally sends
random bytes to the terminal, there is a good chance that the terminal
is left in an unusable state. This danger does not exist with UTF-8,
because UTF-8 is self-synchronizing and stateless. I am therefore looking
forward to seeing very robust terminal emulators where I can enforce UTF-8
by configuration right from the beginning, and where no possible ESC
sequence can lead me back to the stateful ISO 2022 mess. Disabling ISO
2022 and enforcing UTF-8 can in the long run save you a lot of hotline
calls like "Help, my terminal is messed up, I can only see
gobbledygook".
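
A minimal sketch of what self-synchronizing means in practice: every
UTF-8 continuation byte has the bit pattern 10xxxxxx, so a decoder that
lands in the middle of a character only has to skip such bytes to find
the next character boundary. The helper name resync is hypothetical;
the decode call is the standard Python one.

```python
def resync(data: bytes, pos: int) -> int:
    """Advance pos past UTF-8 continuation bytes (10xxxxxx) to a boundary."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# "x", U+00E9 and "y" encode as the four bytes 78 C3 A9 79.
data = "x\u00e9y".encode("utf-8")

# Position 2 is the second byte of U+00E9; the next boundary is at 3.
boundary = resync(data, 2)
tail = data[boundary:].decode("utf-8")

# A stray byte damages only itself; the text after it decodes normally.
garbled = b"\xa9" + "h\u00e9llo".encode("utf-8")
recovered = garbled.decode("utf-8", errors="replace")
```

Here tail is "y", and recovered is one replacement character (U+FFFD)
followed by the intact text. No earlier byte can change how the later
bytes are interpreted.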

Does your question mean we are soon going to see UTF-8 supported in
Kermit? That would be *very* nice!

Markus

-- 
Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
email: mkuhn at acm.org,  home page: <http://www.cl.cam.ac.uk/~mgk25/>


