Re: names of the control characters

From: Mark Davis (mark@macchiato.com)
Date: Sun Feb 03 2002 - 15:33:18 EST


This has bitten more than a few people. For political reasons, having
to do with keeping the names synchronized with ISO 10646, the name field
(field 1) holds only the placeholder <control> for the control
characters. That is because (at least in theory) people could assign
other semantics to those characters.

Field 10 (called Unicode 1.0 Name) contains names for most of those
characters, and should be used for your purpose. See, for example,
http://www.unicode.org/Public/BETA/Unicode3.2/UnicodeData-3.2.0d1.html
where it says:

"This is the old name as published in Unicode 1.0. This name is only
provided when it is significantly different from the current name for
the character. The value of field 10 for control characters does not
always match the Unicode 1.0 names. Instead, field 10 contains ISO
6429 names for control functions, for printing in the code charts."

Thus the data from
http://www.unicode.org/Public/BETA/Unicode3.2/UnicodeData-3.2.0d8.txt
has the following. Note the use of parentheses for some (but not all)
abbreviated names, and that some of the names follow the updated ISO
6429 names, e.g. CHARACTER TABULATION instead of the better-known
HORIZONTAL TABULATION (HT).

0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
0002;<control>;Cc;0;BN;;;;;N;START OF TEXT;;;;
0003;<control>;Cc;0;BN;;;;;N;END OF TEXT;;;;
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
0005;<control>;Cc;0;BN;;;;;N;ENQUIRY;;;;
0006;<control>;Cc;0;BN;;;;;N;ACKNOWLEDGE;;;;
0007;<control>;Cc;0;BN;;;;;N;BELL;;;;
0008;<control>;Cc;0;BN;;;;;N;BACKSPACE;;;;
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
000B;<control>;Cc;0;S;;;;;N;LINE TABULATION;;;;
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
000D;<control>;Cc;0;B;;;;;N;CARRIAGE RETURN (CR);;;;
000E;<control>;Cc;0;BN;;;;;N;SHIFT OUT;;;;
000F;<control>;Cc;0;BN;;;;;N;SHIFT IN;;;;
0010;<control>;Cc;0;BN;;;;;N;DATA LINK ESCAPE;;;;
0011;<control>;Cc;0;BN;;;;;N;DEVICE CONTROL ONE;;;;
0012;<control>;Cc;0;BN;;;;;N;DEVICE CONTROL TWO;;;;
0013;<control>;Cc;0;BN;;;;;N;DEVICE CONTROL THREE;;;;
0014;<control>;Cc;0;BN;;;;;N;DEVICE CONTROL FOUR;;;;
0015;<control>;Cc;0;BN;;;;;N;NEGATIVE ACKNOWLEDGE;;;;
0016;<control>;Cc;0;BN;;;;;N;SYNCHRONOUS IDLE;;;;
0017;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION BLOCK;;;;
0018;<control>;Cc;0;BN;;;;;N;CANCEL;;;;
0019;<control>;Cc;0;BN;;;;;N;END OF MEDIUM;;;;
001A;<control>;Cc;0;BN;;;;;N;SUBSTITUTE;;;;
001B;<control>;Cc;0;BN;;;;;N;ESCAPE;;;;
001C;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR FOUR;;;;
001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;
001E;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR TWO;;;;
001F;<control>;Cc;0;S;;;;;N;INFORMATION SEPARATOR ONE;;;;
007F;<control>;Cc;0;BN;;;;;N;DELETE;;;;
0080;<control>;Cc;0;BN;;;;;N;;;;;
0081;<control>;Cc;0;BN;;;;;N;;;;;
0082;<control>;Cc;0;BN;;;;;N;BREAK PERMITTED HERE;;;;
0083;<control>;Cc;0;BN;;;;;N;NO BREAK HERE;;;;
0084;<control>;Cc;0;BN;;;;;N;;;;;
0085;<control>;Cc;0;B;;;;;N;NEXT LINE (NEL);;;;
0086;<control>;Cc;0;BN;;;;;N;START OF SELECTED AREA;;;;
0087;<control>;Cc;0;BN;;;;;N;END OF SELECTED AREA;;;;
0088;<control>;Cc;0;BN;;;;;N;CHARACTER TABULATION SET;;;;
0089;<control>;Cc;0;BN;;;;;N;CHARACTER TABULATION WITH JUSTIFICATION;;;;
008A;<control>;Cc;0;BN;;;;;N;LINE TABULATION SET;;;;
008B;<control>;Cc;0;BN;;;;;N;PARTIAL LINE FORWARD;;;;
008C;<control>;Cc;0;BN;;;;;N;PARTIAL LINE BACKWARD;;;;
008D;<control>;Cc;0;BN;;;;;N;REVERSE LINE FEED;;;;
008E;<control>;Cc;0;BN;;;;;N;SINGLE SHIFT TWO;;;;
008F;<control>;Cc;0;BN;;;;;N;SINGLE SHIFT THREE;;;;
0090;<control>;Cc;0;BN;;;;;N;DEVICE CONTROL STRING;;;;
0091;<control>;Cc;0;BN;;;;;N;PRIVATE USE ONE;;;;
0092;<control>;Cc;0;BN;;;;;N;PRIVATE USE TWO;;;;
0093;<control>;Cc;0;BN;;;;;N;SET TRANSMIT STATE;;;;
0094;<control>;Cc;0;BN;;;;;N;CANCEL CHARACTER;;;;
0095;<control>;Cc;0;BN;;;;;N;MESSAGE WAITING;;;;
0096;<control>;Cc;0;BN;;;;;N;START OF GUARDED AREA;;;;
0097;<control>;Cc;0;BN;;;;;N;END OF GUARDED AREA;;;;
0098;<control>;Cc;0;BN;;;;;N;START OF STRING;;;;
0099;<control>;Cc;0;BN;;;;;N;;;;;
009A;<control>;Cc;0;BN;;;;;N;SINGLE CHARACTER INTRODUCER;;;;
009B;<control>;Cc;0;BN;;;;;N;CONTROL SEQUENCE INTRODUCER;;;;
009C;<control>;Cc;0;BN;;;;;N;STRING TERMINATOR;;;;
009D;<control>;Cc;0;BN;;;;;N;OPERATING SYSTEM COMMAND;;;;
009E;<control>;Cc;0;BN;;;;;N;PRIVACY MESSAGE;;;;
009F;<control>;Cc;0;BN;;;;;N;APPLICATION PROGRAM COMMAND;;;;
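The listing above can be turned into a name lookup table. Here is a minimal sketch in Python, assuming you want both the long form and the parenthesized abbreviation as keys. The helper name control_names and the three-line SAMPLE (copied from the listing) are mine for illustration, not part of any Unicode tool:

```python
import re

# Three sample records copied from the listing above.
SAMPLE = """\
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
0084;<control>;Cc;0;BN;;;;;N;;;;;
"""

# Matches "LONG NAME (ABBR)" and captures the two parts separately.
_ABBR = re.compile(r"^(.*?)\s*\(([A-Z0-9]+)\)$")


def control_names(data):
    """Map each name found in field 10 to its code point (hypothetical helper)."""
    table = {}
    for line in data.splitlines():
        fields = line.split(";")
        # Only look at well-formed records for control characters (Cc).
        if len(fields) < 11 or fields[2] != "Cc":
            continue
        old_name = fields[10]
        if not old_name:
            continue  # e.g. 0084 carries no name in field 10
        code_point = int(fields[0], 16)
        table[old_name] = code_point
        match = _ABBR.match(old_name)
        if match:
            # Register "LINE FEED" and "LF" as well as "LINE FEED (LF)".
            table[match.group(1)] = code_point
            table[match.group(2)] = code_point
    return table


names = control_names(SAMPLE)
```

With this table, names["LF"] and names["LINE FEED"] both give 10, while 0084 is simply absent because its field 10 is empty.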

Personally, I think that this is error-prone, and the UTC would be far
better off instead putting the control code names in field 1, and
simply documenting that field 1 contains the character names for
non-control characters and the ISO 6429 names for control characters.

Fewer people like yourselves would be unpleasantly surprised.
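Until something like that happens, a consumer of the file can apply the fallback itself. This is a rough sketch in Python, not anything the UTC has specified; effective_name is a hypothetical helper, and the 0041 record is taken from the same UnicodeData file for contrast:

```python
def effective_name(line):
    """Return (code point, name), using field 10 when field 1 is <control>."""
    fields = line.rstrip("\n").split(";")
    code_point = int(fields[0], 16)
    name, old_name = fields[1], fields[10]
    # Fall back only when field 10 actually holds a name; some C1
    # controls (0080, 0081, 0084, 0099) leave it empty.
    if name == "<control>" and old_name:
        name = old_name
    return code_point, name


print(effective_name("0007;<control>;Cc;0;BN;;;;;N;BELL;;;;"))
# (7, 'BELL')
print(effective_name("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"))
# (65, 'LATIN CAPITAL LETTER A')
print(effective_name("0080;<control>;Cc;0;BN;;;;;N;;;;;"))
# (128, '<control>')
```

Note that the last case still yields <control>, so callers need to decide what to do with the few controls that have no name in either field.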

Mark

—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Jarkko Hietaniemi" <jhi@iki.fi>
To: <mark@macchiato.com>
Sent: Sunday, February 03, 2002 11:03
Subject: names of the control characters

> A question: Perl offers a way to use Unicode characters by name:
>
> use charnames ':full';
>
> $a = "fooba\N{LATIN SMALL LETTER SHARP S}";
>
> but I noticed that the C0 and C1 control characters no longer have
> Official Unicode names, all they have left is <control> in the name
> field and the Unicode 1.0 name. This means that things like
>
> $b = "x\N{HORIZONTAL TABULATION}y";
>
> won't work. What's the story behind the "unnaming" of the C0 and
> C1?
>
> --
> $jhi++; # http://www.iki.fi/jhi/
> # There is this special biologist word we use for 'stable'.
> # It is 'dead'. -- Jack Cohen
>


