L2/01-423
2001-10-31
Kent Karlsson
The Fifth edition of ECMA-48 (1991), also published as ISO/IEC 6429:1992 (Third edition), renamed several of the C0 and C1 “control” characters so as to internationalise the names somewhat. In particular, references to horizontal, vertical, up and down were changed to refer to character, line, backward and forward respectively. Further, the file, group, record and unit separators appear to have been generalized (I don’t have access to the fourth edition) to information separator four, three, two and one respectively, so as not to always imply a hierarchy, or at least not the particular hierarchy of files, groups, records and units.
The Fifth edition of ECMA-48 is the only edition of ECMA-48 now available (online), and the Third edition of 6429 is the only edition now available (from ISO), and the name changes referred to above even predate the first edition of 10646-1. Therefore, the current ECMA-48/6429 names are the names that should be used in the UnicodeData.txt data file and other UCD data files.
The Fourth edition names for the C0 and C1 control characters, which are now used in UnicodeData.txt, should be preserved as alias names in NamesList.txt.
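As an illustration only, the proposed pairing of current names and preserved aliases can be sketched as a small lookup table. The names are taken from the entries in this document; the table name CONTROL_NAMES and the helper names_list_entry are hypothetical and are not part of any UCD tooling.

```python
# Code point -> (current ECMA-48 name, earlier name kept as an alias).
# The pairs are copied from the entries listed in this document.
CONTROL_NAMES = {
    0x0009: ("CHARACTER TABULATION", "HORIZONTAL TABULATION"),
    0x000B: ("LINE TABULATION", "VERTICAL TABULATION"),
    0x001C: ("INFORMATION SEPARATOR FOUR", "FILE SEPARATOR"),
    0x001D: ("INFORMATION SEPARATOR THREE", "GROUP SEPARATOR"),
    0x001E: ("INFORMATION SEPARATOR TWO", "RECORD SEPARATOR"),
    0x001F: ("INFORMATION SEPARATOR ONE", "UNIT SEPARATOR"),
    0x008B: ("PARTIAL LINE FORWARD", "PARTIAL LINE DOWN"),
    0x008C: ("PARTIAL LINE BACKWARD", "PARTIAL LINE UP"),
}


def names_list_entry(code_point):
    """Format one entry in the style used below (hypothetical helper)."""
    current, alias = CONTROL_NAMES[code_point]
    return f"{code_point:04X} <control>\n= {current}\n= {alias}"


if __name__ == "__main__":
    print(names_list_entry(0x0009))
```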
Some new cross references and short explanations for some characters are also included below, in particular in relation to UAX 13, the soft hyphen, the scan line characters (outdated technology, but a new addition to Unicode), and the new word joiner character.
The new parts are marked with bold red below.
0009 <control>
= CHARACTER TABULATION
= HORIZONTAL TABULATION (HT)
* the name was changed in 1991 to a more international name (lines may be vertical)
000A <control>
= LINE FEED (LF)
= new line (nl), end of line
* see UAX 13
x (carriage return - 000D)
x (next line - 0085)
x (line separator - 2028)
x (paragraph separator - 2029)
000B <control>
= LINE TABULATION
= VERTICAL TABULATION (VT)
* the name was changed in 1991 to a more international name (lines may be vertical)
* see UAX 13
x (line separator - 2028)
000C <control>
= FORM FEED (FF)
= next page, end of page
* see UAX 13
x (line separator - 2028)
000D <control>
= CARRIAGE RETURN (CR)
* see UAX 13
x (line feed - 000A)
x (next line - 0085)
x (line separator - 2028)
x (paragraph separator - 2029)
001A <control>
= SUBSTITUTE
* used in the place of a character that has been found to be invalid or in error
* intended to be introduced by automatic means
x (replacement character - FFFD)
001C <control>
= INFORMATION SEPARATOR FOUR
= FILE SEPARATOR
001D <control>
= INFORMATION SEPARATOR THREE
= GROUP SEPARATOR
001E <control>
= INFORMATION SEPARATOR TWO
= RECORD SEPARATOR
001F <control>
= INFORMATION SEPARATOR ONE
= UNIT SEPARATOR
0020 SPACE
* sometimes considered a control code
* other space characters: 2000-200A
x (no-break space - 00A0)
x (zero width space - 200B)
x (ideographic space - 3000)
x (zero width no-break space - FEFF)
x (word joiner - 2060)
0082 <control>
= BREAK PERMITTED HERE
* used to indicate a point where a line break may occur when text is formatted
* zero width (no stretch)
x (zero width space - 200B)
x (soft hyphen - 00AD)
x (mongolian todo soft hyphen - 1806)
0083 <control>
= NO BREAK HERE
* used to indicate a point where a line break shall not occur when text is formatted
x (zero width no-break space - FEFF)
x (word joiner - 2060)
0085 <control>
= NEXT LINE (NEL)
* see UAX 13
x (line feed - 000A)
x (carriage return - 000D)
x (line separator - 2028)
x (paragraph separator - 2029)
008B <control>
= PARTIAL LINE FORWARD
= PARTIAL LINE DOWN
* the name was changed in 1991 to a more international name (lines may be vertical)
008C <control>
= PARTIAL LINE BACKWARD
= PARTIAL LINE UP
* the name was changed in 1991 to a more international name (lines may be vertical)
00A0 NO-BREAK SPACE
x (space - 0020)
x (figure space - 2007)
x (narrow no-break space - 202F)
x (zero width no-break space - FEFF)
x (word joiner - 2060)
# <noBreak> 0020
00AD SOFT HYPHEN
= discretionary hyphen
* zero width, unless there is an (automatic or explicit) line break after it, in which case it is displayed as a hyphen
* when zero width, a soft hyphen may suppress the display of the following character in some cases for some languages (e.g. webb<SHY>bläddrare displays as webbläddrare, and remiss<SHY>svar as remissvar)
x (mongolian todo soft hyphen - 1806)
x (hyphen - 2010)
x (non-breaking hyphen - 2011)
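The soft hyphen behaviour described in the entry above can be sketched roughly as follows. This is only an illustration: the function name render_soft_hyphen is hypothetical, it handles only the first soft hyphen in a word, and the "three identical letters in a row" test is an assumed simplification of the language-specific rule, chosen because it reproduces the two Swedish examples given above.

```python
SHY = "\u00AD"  # SOFT HYPHEN


def render_soft_hyphen(word, break_at_shy):
    """Return a display form of `word`, looking at its first soft hyphen only."""
    before, shy, after = word.partition(SHY)
    if not shy:
        return word  # no soft hyphen present
    if break_at_shy:
        # A line break follows the soft hyphen: it is shown as a visible hyphen.
        return before + "-\n" + after
    # No line break: the soft hyphen has zero width.
    # Assumed rule (a simplification): if keeping the next letter would give
    # three identical letters in a row, that letter is not displayed.
    if len(before) >= 2 and after and before[-1] == before[-2] == after[0]:
        after = after[1:]
    return before + after


if __name__ == "__main__":
    print(render_soft_hyphen("webb" + SHY + "bläddrare", break_at_shy=False))
    print(render_soft_hyphen("webb" + SHY + "bläddrare", break_at_shy=True))
```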
00B7 MIDDLE DOT
= midpoint (in typography)
= Georgian comma
= Greek middle dot
x (greek ano teleia - 0387)
x (bullet - 2022)
x (one dot leader - 2024)
x (hyphenation point - 2027)
x (bullet operator - 2219)
x (dot operator - 22C5)
x (katakana middle dot - 30FB)
2010 HYPHEN
x (hyphen-minus - 002D)
x (soft hyphen - 00AD)
2011 NON-BREAKING HYPHEN
x (hyphen-minus - 002D)
x (soft hyphen - 00AD)
# <noBreak> 2010
2028 LINE SEPARATOR
* may be used to represent this semantic unambiguously
* see UAX 13
2029 PARAGRAPH SEPARATOR
* may be used to represent this semantic unambiguously
* see UAX 13
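Several entries in this list (000A, 000D, 0085, 2028, 2029) cross-reference each other and UAX 13 because they can all act as line or paragraph boundaries. A minimal sketch of treating them uniformly is given here. The function name normalize_newlines is hypothetical, and the choice of which characters to include (000B and 000C are left out) is an assumption made for this example, not a statement of what UAX 13 requires.

```python
import re

# CR LF is listed first so that the pair counts as a single line ending.
# Then: LF, CR, NEL (0085), LINE SEPARATOR (2028), PARAGRAPH SEPARATOR (2029).
_NEWLINE_FUNCTIONS = re.compile("\r\n|[\n\r\u0085\u2028\u2029]")


def normalize_newlines(text, target="\n"):
    """Replace each newline function in `text` with `target`."""
    return _NEWLINE_FUNCTIONS.sub(target, text)


if __name__ == "__main__":
    sample = "one\r\ntwo\rthree\u0085four\u2028five\u2029six"
    print(normalize_newlines(sample).split("\n"))
```

Note that mapping 2028 and 2029 to the same target discards the distinction between line and paragraph boundaries, which is exactly the distinction those two characters exist to represent unambiguously.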
2060 WORD JOINER
* does not join multiple words, but joins inside words
* unambiguous replacement for FEFF ZERO WIDTH NO-BREAK SPACE
x (zero width no-break space - FEFF)
23BA HORIZONTAL SCAN LINE-1
* the scan line numbers here refer to old low-resolution technology for terminals, with only 9 scan lines per fixed-size (ASCII) character glyph
23BB HORIZONTAL SCAN LINE-3
23BC HORIZONTAL SCAN LINE-7
23BD HORIZONTAL SCAN LINE-9
FEFF ZERO WIDTH NO-BREAK SPACE
= BYTE ORDER MARK (BOM)
* may be used to detect byte order by contrast with FFFE which is not a character
x (<not a character> - FFFE)
x (zero width space - 200B)
x (word joiner - 2060)
x (no break here - 0083)
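The byte order detection mentioned in the FEFF entry can be sketched as follows for 16-bit encoded text. The function name utf16_byte_order is hypothetical, and the sketch deliberately ignores 32-bit forms, where a little-endian signature also begins with the bytes FF FE.

```python
def utf16_byte_order(data):
    """Guess the byte order of 16-bit encoded text from a leading FEFF.

    Returns "big", "little", or None when no byte order mark is present.
    """
    head = data[:2]
    if head == b"\xfe\xff":
        return "big"     # FEFF read in the order it was written
    if head == b"\xff\xfe":
        return "little"  # reads as FFFE, which is not a character
    return None


if __name__ == "__main__":
    print(utf16_byte_order("\ufeffA".encode("utf-16-be")))
    print(utf16_byte_order("\ufeffA".encode("utf-16-le")))
```

The check works because FFFE is guaranteed not to be a character, so a reader that sees FFFE at the start of the data can conclude that the bytes are swapped relative to its own assumption.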