L2/11-281
Title: Formal Name Aliases for Control Characters
Source: Ken Whistler
Date: July 20, 2011
Action: For consideration by the UTC

Background

During the month of June there was an extended email discussion on the unicore list about the fallout for Perl 5 of the addition of an SMP character named "BELL" to Unicode 6.0, which ended up having an unexpected side effect, because of the de facto Perl regex usage of "BELL" for the control character U+0007.

A clear consensus emerged from that discussion that it would be useful to somehow formally guarantee that such a name collision would not happen in the future, impacting more de facto usage of labels for control characters in implementations, where they behave like character names.

The problem has arisen in part because control codes have no character names in the Unicode Standard (for a series of historical reasons I won't belabor here), and partly as a result of that, the tools we use to verify the rules for uniqueness within the character namespace haven't been checking against various de facto names that people have long been using for control codes anyway.

Various approaches were suggested to accomplish the goal of avoiding future name collisions. One approach which I suggested garnered a fair amount of agreement, and so in this document I have worked out more of the details, to turn it into a formal proposal for consideration by the UTC.

Proposal

The Unicode character namespace consists of the union of the set of Unicode character names, the names of named character sequences, and the list of formal name aliases defined in NameAliases.txt. The cleanest and least disruptive way of extending that namespace to cover the desired set of control code names is to make use of the formal name alias mechanism. Adding the target names of interest to NameAliases.txt would automatically engage the uniqueness checking by the tools we use to guarantee no name collisions.
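The uniqueness check described above can be illustrated with a minimal sketch. It is not the actual tooling used to validate the Unicode Character Database; the function name `find_collisions`, the (owner, name) pair layout, and the sample entries are all assumptions made for illustration, and name comparison is simplified to exact matching after upper-casing.

```python
from collections import defaultdict


def find_collisions(character_names, sequence_names, formal_aliases):
    """Return every name claimed by more than one owner in the namespace.

    Each argument is an iterable of (owner, name) pairs. The owner is a
    code point written as a hex string, or a space-separated run of code
    points for a named sequence. This layout is an illustrative assumption.
    """
    owners = defaultdict(set)
    # The namespace is the union of all three sources, so they share one table.
    for source in (character_names, sequence_names, formal_aliases):
        for owner, name in source:
            owners[name.strip().upper()].add(owner)
    return {name: sorted(who) for name, who in owners.items() if len(who) > 1}


# Illustrative sample data only, not a complete or authoritative listing.
character_names = [("1F514", "BELL"), ("0041", "LATIN CAPITAL LETTER A")]
sequence_names = [("012B 0300", "LATIN SMALL LETTER I WITH MACRON AND GRAVE")]
# Hypothetical: what would happen if "BELL" were added as an alias for U+0007.
formal_aliases = [("0007", "BELL"), ("0007", "ALERT")]

collisions = find_collisions(character_names, sequence_names, formal_aliases)
print(collisions)
```

With the control code names held in the same table as the character names, the hypothetical "BELL" alias for U+0007 is reported as colliding with U+1F514, which is the kind of conflict the proposal aims to catch before publication.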
The consensus from the unicore discussion was that the names of interest included the ISO 6429 C0 and C1 control function names, which we have long included in the printing of the Unicode names list as informal aliases, anyway. However, to solve the problem most generally, it makes sense to include the occasional alternative names for Unicode format controls (e.g. "BYTE ORDER MARK"), as well as the ubiquitous abbreviations for ASCII controls ("CR", "LF", "TAB", "NBSP", etc.) and the set of equally widespread abbreviations now in use for various Unicode format controls ("ZWNJ", "ZWJ", "RLM", etc.).

Rather than add a separate data file for these kinds of additions, I propose to simply extend the existing NameAliases.txt file for Unicode 6.1, by adding a consensus collection of control code (and other) names to the list, *and* to add a third field to the format of the data file, which will serve to distinguish the types of formal name aliases. Briefly, these would consist of:

1. Corrections for name errors (what we currently have in the data file).
2. ISO 6429 control function names.
3. Other widespread names for control codes or format controls. (Most of these are simply earlier versions of ISO 6429 control function names, but this is also where such items as BYTE ORDER MARK fit.)
4. Abbreviations for control codes ("FF"), format controls ("RLE"), spaces ("NBSP"), variation selectors ("VS1"), and miscellaneous characters ("CGJ").

To make this proposal very explicit, I have attached a provisional draft of NameAliases.txt, with the exact list of additions and type fields I am suggesting, for review and discussion.

Notes Regarding Omissions

I have deliberately omitted three control code names and their abbreviations which occur in one (obsolete) RFC, but which are an artifact of early unapproved drafts of 10646.
To wit:

0080 PADDING CHARACTER (PAD)
0081 HIGH OCTET PRESET (HOP)
0099 SINGLE GRAPHIC CHARACTER INTRODUCER (SGC)

Those 3 were proposed (on spec) in early drafts of 10646, for what became a failed architectural direction for 10646. They would be completely forgotten now except for the persistent (and pernicious) RFC that lists them without indicating their failed status. Nobody has ever implemented them, so they are nothing more than character encoding curiosities.

I have also omitted the format control codes which are script-specific and which have occasionally acquired informal abbreviations in the relevant script-specific documentation. Examples are:

U+070F SYRIAC ABBREVIATION MARK (SAM)
U+180E MONGOLIAN VOWEL SEPARATOR (MVS)

I don't think such script-specific abbreviations rise to the level of general use and recognition which would argue for their inclusion as formal name aliases. Keeping such abbreviations simply mentioned as annotations in the names list ought to be good enough in such cases.

I also have not attempted to cull lists of abbreviations for *all* of the ISO 6429 control functions. My rule of thumb here is roughly this: would an engineer reasonably acquainted with ASCII and Unicode be likely to recognize and identify the abbreviation in question, without resorting to looking it up in some specialized list? I think my suggested list meets that litmus test, and I would not be in favor of adding indefinitely to the list with the goal of covering *all* the abbreviations for control codes that somebody could find. Such completeness would actually make the list less useful by filling it with unrecognizable and unused abbreviations.

NameAliasesProv.txt

# NameAliases-x.x.0.txt
# Date: 2011-07-20, 19:30:00 GMT [KW]
#
# NB: This is a DRAFT of a modified format for NameAliases.txt,
# for consideration by the UTC. It is not an approved data file.
#
# This file is a normative contributory data file in the
# Unicode Character Database.
#
# Copyright (c) 2005-2011 Unicode, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# This file defines the formal name aliases for Unicode characters.
#
# For informative aliases see NamesList.txt
#
# The formal name aliases are divided into four types.
#
# 1. Corrections for serious problems in the character names
# 2. ISO 6429 names for C0 and C1 control functions
# 3. Other commonly occurring names for control codes, format characters,
#    and spaces
# 4. Commonly occurring abbreviations for control codes, format characters,
#    spaces, and variation selectors
#
# The formal name aliases are part of the Unicode character namespace, which
# includes the character names and the names of named character sequences.
# The inclusion of ISO 6429 names and other commonly occurring names and
# abbreviations for control codes and format characters as formal name aliases
# is to help avoid name collisions between Unicode character names and the
# labels which commonly appear in text and/or in implementations such as regex, for
# control codes (which have no Unicode character name) or for format characters.
#
# For documentation, see NamesList.html and http://www.unicode.org/reports/tr44/
#
# FORMAT
#
# Each line has three fields, as described here:
#
# First field:  Code point
# Second field: Alias
# Third field:  Type
#
# The Type labels used are: correction, iso6429, control, abbreviation
# Those Type labels can be mapped to other strings for display, if desired,
# e.g. "preferred", "control name in ISO 6429", "other control code name",
# "abbreviated as", etc.
#
# In case multiple aliases are assigned, additional aliases
# are provided on separate lines. Parsers of this data file should
# take note that the code points are not in numerical order, and that
# the same code point can (and does) occur more than once.
#
#-----------------------------------------------------------------

# 1. Corrections for serious problems in the Unicode character names

01A2;LATIN CAPITAL LETTER GHA;correction
01A3;LATIN SMALL LETTER GHA;correction
0CDE;KANNADA LETTER LLLA;correction
0E9D;LAO LETTER FO FON;correction
0E9F;LAO LETTER FO FAY;correction
0EA3;LAO LETTER RO;correction
0EA5;LAO LETTER LO;correction
0FD0;TIBETAN MARK BKA- SHOG GI MGO RGYAN;correction
2118;WEIERSTRASS ELLIPTIC FUNCTION;correction
2448;MICR ON US SYMBOL;correction
2449;MICR DASH SYMBOL;correction
A015;YI SYLLABLE ITERATION MARK;correction
FE18;PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET;correction
1D0C5;BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS;correction

#-----------------------------------------------------------------

# 2. Aliases provided for ISO 6429 control function names

0000;NULL;iso6429
0001;START OF HEADING;iso6429
0002;START OF TEXT;iso6429
0003;END OF TEXT;iso6429
0004;END OF TRANSMISSION;iso6429
0005;ENQUIRY;iso6429
0006;ACKNOWLEDGE;iso6429

# Note that no formal name alias for the ISO 6429 "BELL" is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.

0008;BACKSPACE;iso6429
0009;CHARACTER TABULATION;iso6429
000A;LINE FEED;iso6429
000B;LINE TABULATION;iso6429
000C;FORM FEED;iso6429
000D;CARRIAGE RETURN;iso6429
000E;SHIFT OUT;iso6429
000F;SHIFT IN;iso6429
0010;DATA LINK ESCAPE;iso6429
0011;DEVICE CONTROL ONE;iso6429
0012;DEVICE CONTROL TWO;iso6429
0013;DEVICE CONTROL THREE;iso6429
0014;DEVICE CONTROL FOUR;iso6429
0015;NEGATIVE ACKNOWLEDGE;iso6429
0016;SYNCHRONOUS IDLE;iso6429
0017;END OF TRANSMISSION BLOCK;iso6429
0018;CANCEL;iso6429
0019;END OF MEDIUM;iso6429
001A;SUBSTITUTE;iso6429
001B;ESCAPE;iso6429
001C;INFORMATION SEPARATOR FOUR;iso6429
001D;INFORMATION SEPARATOR THREE;iso6429
001E;INFORMATION SEPARATOR TWO;iso6429
001F;INFORMATION SEPARATOR ONE;iso6429
007F;DELETE;iso6429
0082;BREAK PERMITTED HERE;iso6429
0083;NO BREAK HERE;iso6429
0084;INDEX;iso6429
0085;NEXT LINE;iso6429
0086;START OF SELECTED AREA;iso6429
0087;END OF SELECTED AREA;iso6429
0088;CHARACTER TABULATION SET;iso6429
0089;CHARACTER TABULATION WITH JUSTIFICATION;iso6429
008A;LINE TABULATION SET;iso6429
008B;PARTIAL LINE FORWARD;iso6429
008C;PARTIAL LINE BACKWARD;iso6429
008D;REVERSE LINE FEED;iso6429
008E;SINGLE SHIFT TWO;iso6429
008F;SINGLE SHIFT THREE;iso6429
0090;DEVICE CONTROL STRING;iso6429
0091;PRIVATE USE ONE;iso6429
0092;PRIVATE USE TWO;iso6429
0093;SET TRANSMIT STATE;iso6429
0094;CANCEL CHARACTER;iso6429
0095;MESSAGE WAITING;iso6429
0096;START OF GUARDED AREA;iso6429
0097;END OF GUARDED AREA;iso6429
0098;START OF STRING;iso6429
009A;SINGLE CHARACTER INTRODUCER;iso6429
009B;CONTROL SEQUENCE INTRODUCER;iso6429
009C;STRING TERMINATOR;iso6429
009D;OPERATING SYSTEM COMMAND;iso6429
009E;PRIVACY MESSAGE;iso6429
009F;APPLICATION PROGRAM COMMAND;iso6429

#-----------------------------------------------------------------

# 3. Aliases provided for other de facto control code names and
#    format control names in widespread use
# These include ISO 6429 control function names valid in
# earlier editions of that standard.

0007;ALERT;control
0009;HORIZONTAL TABULATION;control
000A;NEW LINE;control
000A;END OF LINE;control
000B;VERTICAL TABULATION;control
000E;LOCKING-SHIFT ONE;control
000F;LOCKING-SHIFT ZERO;control
001C;FILE SEPARATOR;control
001D;GROUP SEPARATOR;control
001E;RECORD SEPARATOR;control
001F;UNIT SEPARATOR;control
0088;HORIZONTAL TABULATION SET;control
0089;HORIZONTAL TABULATION WITH JUSTIFICATION;control
008A;VERTICAL TABULATION SET;control
008B;PARTIAL LINE DOWN;control
008C;PARTIAL LINE UP;control
008D;REVERSE INDEX;control
008E;SINGLE-SHIFT 2;control
008F;SINGLE-SHIFT 3;control
0091;PRIVATE USE 1;control
0092;PRIVATE USE 2;control
0096;START OF PROTECTED AREA;control
0097;END OF PROTECTED AREA;control
FEFF;BYTE ORDER MARK;control

#-----------------------------------------------------------------

# 4. Aliases provided for de facto abbreviations of control codes,
#    format controls, spaces, and variation selectors in widespread use

0005;ENQ;abbreviation
0006;ACK;abbreviation
0008;BS;abbreviation
0009;HT;abbreviation
0009;TAB;abbreviation
000A;LF;abbreviation
000A;NL;abbreviation
000A;EOL;abbreviation
000B;VT;abbreviation
000C;FF;abbreviation
000D;CR;abbreviation
000E;SO;abbreviation
000F;SI;abbreviation
0015;NACK;abbreviation
001A;SUB;abbreviation
001B;ESC;abbreviation
001C;FS;abbreviation
001D;GS;abbreviation
001E;RS;abbreviation
001F;US;abbreviation
0020;SP;abbreviation
007F;DEL;abbreviation
0084;IND;abbreviation
0085;NEL;abbreviation
00A0;NBSP;abbreviation
00AD;SHY;abbreviation
034F;CGJ;abbreviation
200B;ZWSP;abbreviation
200C;ZWNJ;abbreviation
200D;ZWJ;abbreviation
200E;LRM;abbreviation
200F;RLM;abbreviation
202A;LRE;abbreviation
202B;RLE;abbreviation
202C;PDF;abbreviation
202D;LRO;abbreviation
202E;RLO;abbreviation
202F;NNBSP;abbreviation
2060;WJ;abbreviation
FE00;VS1;abbreviation
...
FE0F;VS16;abbreviation
FEFF;BOM;abbreviation
FEFF;ZWNBSP;abbreviation
E0100;VS17;abbreviation
...
E01EF;VS256;abbreviation

# Total code points: xx

# EOF
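
As a rough illustration of how an implementation might consume the three-field format proposed in the draft above, here is a minimal parsing sketch. The function name `parse_name_aliases`, the `VALID_TYPES` set, and the returned data layout are assumptions made for this example, not part of the proposal. The sketch follows the FORMAT notes in the draft header: it does not assume the code points are in numerical order, and it collects multiple aliases for the same code point.

```python
from collections import defaultdict

# The four Type labels listed in the draft's FORMAT section.
VALID_TYPES = {"correction", "iso6429", "control", "abbreviation"}


def parse_name_aliases(text):
    """Parse 'code point;alias;type' lines into {code point: [(alias, type), ...]}.

    Hypothetical helper for illustration. Comment text after '#' and blank
    lines are ignored; any other line must have exactly three fields.
    """
    aliases = defaultdict(list)
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        code, alias, kind = fields
        if kind not in VALID_TYPES:
            raise ValueError(f"line {number}: unknown type {kind!r}")
        # The same code point can occur more than once, so append, never overwrite.
        aliases[int(code, 16)].append((alias, kind))
    return dict(aliases)


# A short excerpt of the provisional draft, used as sample input.
SAMPLE = """\
# excerpt from the provisional draft
000A;LINE FEED;iso6429
FEFF;BYTE ORDER MARK;control
000A;NEW LINE;control
000A;LF;abbreviation
FEFF;BOM;abbreviation
"""

parsed = parse_name_aliases(SAMPLE)
for code_point, entries in sorted(parsed.items()):
    print(f"U+{code_point:04X}", entries)
```

The "..." placeholder lines in the draft stand for ranges that are not spelled out, so this sketch would reject them with an error; they would need to be expanded before the file is parsed this way.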