Re: UTF-8 and Kermit

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 15 1997 - 17:11:13 EDT


Frank da Cruz asked:

> anyway. If it was anything else, we could have a host-resident file viewer
> that sent the proper ISO 2022 sequences before and after the file, but as far
> as I know, there is nothing like that for Unicode / UTF-8 / etc, since these
> do not have the ISO character-set structure.
>
> I think this is an interesting application and might deserve some attention
> from the list, so I'm copying the list on this. To restate the problem:
>
> Suppose I have a terminal emulator that understands ISO 2022 and all sorts of
> ISO-registered character sets (such as all the ISO 646s and 8859 1-10), and
> which converts them to Unicode and displays them in a Unicode font -- and I do
> have such an emulator:
>
> http://www.columbia.edu/kermit/kuishots.html#shot3
>
> Now suppose I am logged in to a conventional UNIX host and I want to "cat" a
> Unicode file that is either bare Unicode or (more interesting) encoded in
> UTF-8 or other encoding. What escape sequence can be sent to the terminal to
> switch it into and out of Unicode / UTF-8, so that all the regular 8-bit stuff
> before and after the Unicode text appears correctly, and so does the Unicode
> text?

I'll quote here from the Draft corrected ISO/IEC 10646-1, so I get it
right. (ISO/IEC 10646-1: 1993/Cor.2.199x(E)):

17.2 Identification of UCS coded representation form with implementation level.

When the escape sequences from ISO/IEC 2022 are used, the identification
of a coded representation form of UCS (see clause 14) and an implementation
level (see clause 15) specified by ISO/IEC 10646 shall be by a designation
sequence chosen from the following list:

    ESC 02/05 02/15 04/00
       UCS-2 with implementation level 1
    ESC 02/05 02/15 04/01
       UCS-4 with implementation level 1
    ...
    ESC 02/05 02/15 04/05
       UCS-2 with implementation level 3
    ...

If such an escape sequence appears within a CC-data-element conforming to
ISO/IEC 2022, it shall consist only of the sequences of bit combinations
as shown above.

If such an escape sequence appears within a CC-data-element conforming to
ISO/IEC 10646, it shall be padded in accordance with clause 16.
[[kww: that means as 16-bit or 32-bit control characters, depending on
your form of use.]]
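
For readers unfamiliar with the xx/yy notation: it is the ISO 2022 column/row position in the code table, so the octet value is 16 * column + row (02/05 is 0x25, "%"). The sketch below turns the designation sequences quoted above into bytes and shows one reading of the padding rule. The helper names `parse_2022` and `pad` are made up for illustration, and the big-endian order in `pad` is an assumption; clause 16 is not quoted in this message.

```python
ESC = 0x1B  # the ESCAPE control character


def parse_2022(notation: str) -> bytes:
    """Convert e.g. 'ESC 02/05 02/15 04/00' into the octets it denotes.

    Each xx/yy token is a column/row pair: value = 16 * xx + yy.
    """
    out = bytearray()
    for token in notation.split():
        if token == "ESC":
            out.append(ESC)
        else:
            col, row = token.split("/")
            out.append(int(col) * 16 + int(row))
    return bytes(out)


def pad(seq: bytes, width: int) -> bytes:
    """Widen every octet of an escape sequence to `width` octets (2 or 4).

    Assumption: leading zero octets, most significant octet first.
    """
    return b"".join(octet.to_bytes(width, "big") for octet in seq)


# Sequences quoted from clause 17.2 above.
UCS2_LEVEL1 = parse_2022("ESC 02/05 02/15 04/00")
UCS4_LEVEL1 = parse_2022("ESC 02/05 02/15 04/01")
UCS2_LEVEL3 = parse_2022("ESC 02/05 02/15 04/05")
```

In an ordinary ISO 2022 stream the sequence is sent as is (`UCS2_LEVEL1`); inside UCS-2 or UCS-4 data it would be sent as `pad(seq, 2)` or `pad(seq, 4)` respectively.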

[[ skip more sections on identification of subsets and control function sets.]]

17.5 Identification of the coding system of ISO/IEC 2022

When the escape sequences from ISO/IEC 2022 are used, the
identification of a return, or transfer, from UCS to the coding system
of ISO/IEC 2022 shall be by the escape sequence ESC 02/05 04/00. If
such an escape sequence appears within a CC-data-element conforming
to ISO/IEC 10646, it shall be padded in accordance with clause 16.
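
Taken together, 17.2 and 17.5 are enough to sketch the "host-resident file viewer" Frank describes, at least for UCS-2: send the designation unpadded (the stream is still ISO 2022 at that point), send the text, then send the return sequence padded to 16 bits (the stream is UCS-2 at that point). This is only a sketch under assumptions: `wrap_ucs2` is a hypothetical name, big-endian octet order is assumed, and whether any given terminal honors these sequences is a separate question.

```python
# ESC 02/05 02/15 04/05: UCS-2 with implementation level 3 (clause 17.2)
ENTER_UCS2_LEVEL3 = b"\x1b%/E"
# ESC 02/05 04/00: return to the coding system of ISO/IEC 2022 (clause 17.5)
RETURN_TO_2022 = b"\x1b%@"


def wrap_ucs2(text: str) -> bytes:
    """Bracket `text` with the switch-to-UCS-2 and return-to-2022 sequences."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("UCS-2 can only carry characters up to U+FFFF")
    # For characters up to U+FFFF, UTF-16BE and big-endian UCS-2 coincide.
    payload = text.encode("utf-16-be")
    # Inside UCS data the return sequence is padded to 16-bit units
    # (assumption: a leading zero octet before each octet).
    padded_return = b"".join(bytes([0, octet]) for octet in RETURN_TO_2022)
    return ENTER_UCS2_LEVEL3 + payload + padded_return
```

A viewer built this way would write something like `b"8-bit text before\n" + wrap_ucs2(contents) + b"8-bit text after\n"` to the terminal. The same shape would apply to UTF-8 once its registered final byte is known; that value is left out here because it is not given in the text above.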

[[ skip lots more stuff ]]

And here is language from the amendments which define UTF-16 and UTF-8:

Amendment 1, Annex O

O.5 Identification of TF-16

When the escape sequences from ISO/IEC 2022 are used, the identification
of TF-16 and an implementation level (see clause 15)
shall be by a designation sequence chosen from the following list:

ESC 02/05 02/15 04/..
    TF-16 with implementation level 1
...
ESC 02/05 02/15 04/..
    TF-16 with implementation level 3 [[ <=== That's Unicode. kenw ]]

Amendment 2, Annex P

P.6 Identification of UTF-8

When the escape sequences from ISO/IEC 2022 are used, the identification
of UTF-8 and an implementation level (see clause 15)
shall be by a designation sequence chosen from the following list:

ESC 02/05 02/15 04/..
    UTF-8 with implementation level 1
...
ESC 02/05 02/15 04/..
    UTF-8 with implementation level 3

[[ There are editorial notes in my copy to insert the actual sequences
when registered. I imagine this has been done by now, but I don't
have the published version in hand to cite the actual code values. ]]

>
> Stated another way: is there a movement afoot to register Unicode, UTF-8,
> etc, with ISO so that they get ISO 2022 escape sequences?

So I'd say we don't need a movement. It has already been done.

--Ken
 
> (Even though they
> might not fit into the ISO 4873 structure.) If not, should there be? If not,
> then what would be a reasonable way to mix (say) UTF-8 in with a regular
> ASCII or Latin-1 (etc) data stream?
>
> - Frank
>


