Re: Unicode 1.0 names for control characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 04 2001 - 16:33:01 EST


Doug wrote:

> I am surprised and puzzled by the "Unicode 1.0 Name" changes for some of the
> ASCII and Latin-1 control characters that were introduced in the latest beta
> version of the Unicode 3.2 data file (UnicodeData-3.2.0d5.txt):
>
> U+0009 HORIZONTAL TABULATION ==> CHARACTER TABULATION
> U+000B VERTICAL TABULATION ==> LINE TABULATION
> U+001C FILE SEPARATOR ==> INFORMATION SEPARATOR FOUR
> U+001D GROUP SEPARATOR ==> INFORMATION SEPARATOR THREE
> U+001E RECORD SEPARATOR ==> INFORMATION SEPARATOR TWO
> U+001F UNIT SEPARATOR ==> INFORMATION SEPARATOR ONE
> U+008B PARTIAL LINE DOWN ==> PARTIAL LINE FORWARD
> U+008C PARTIAL LINE UP ==> PARTIAL LINE BACKWARD

Well, *someone* is clearly paying close attention! And the editors haven't even
officially announced the Unicode 3.2 beta period yet.

>
> Were these "new" names (e.g. CHARACTER TABULATION) really the original
> Unicode 1.0 names?

No, they were not. The older names were the Unicode 1.0 names for
U+0009, U+000B, U+001C..U+001F. Unicode 1.0 didn't have *any* names
for C1 control codes.

The official UTC doctrine now is that C0/C1 control characters do not
have Unicode names, formally. But what we do is print ISO 6429 control
function names as aliases in the names list for the charts. (This
was an official decision by the UTC for Unicode 3.0, so is not just
an editorial whim.)
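
To make the "no formal names" point concrete, here is a minimal sketch that assumes Python's standard unicodedata module as the lookup tool (the helper name formal_name is made up for illustration). It only shows that a name lookup on a C0/C1 control character finds nothing, while an ordinary character has a formal name:

```python
import unicodedata

def formal_name(ch):
    """Return the formal Unicode character name, or None if there is none."""
    # unicodedata.name() raises ValueError for unnamed characters unless a
    # default is supplied; None is used here as that default.
    return unicodedata.name(ch, None)

# A few C0/C1 controls from this thread, plus one ordinary letter for contrast.
for cp in (0x0009, 0x000B, 0x001C, 0x0085, 0x0041):
    ch = chr(cp)
    print(f"U+{cp:04X} category={unicodedata.category(ch)} name={formal_name(ch)!r}")
```

The control characters all report general category Cc and no name; whatever text appears beside them in the charts is an alias.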

The mechanism that the names list generation tool currently uses for that
is to print "<control>" in the name area and to grab the Unicode 1.0
name field for the alias (if one exists). This is special-cased code
just for the control characters. The simplest way to bring the
ISO 6429 names in line with the actual, current 6429 standard was to
update the Unicode 1.0 name field for the 8 instances you cite above.
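
The special-casing described above can be sketched roughly as follows. This is not the actual names list tool, only an illustration under stated assumptions: the input is a semicolon-separated UnicodeData.txt record, with the character name in field 1 and the Unicode 1.0 name in field 10, and the output layout (a name line followed by a tab-indented "= alias" line) is a simplified guess at the names list style. The function name names_list_entry and the sample records are hypothetical.

```python
NAME_FIELD = 1             # character name, or "<control>" for C0/C1 controls
UNICODE_1_NAME_FIELD = 10  # the "Unicode 1.0 name" field

def names_list_entry(record):
    """Build names-list style lines for one UnicodeData.txt record."""
    fields = record.rstrip("\n").split(";")
    code = fields[0]
    name = fields[NAME_FIELD]
    old_name = fields[UNICODE_1_NAME_FIELD]

    entry = [f"{code}\t{name}"]
    # Special case: only control characters borrow the Unicode 1.0 name
    # field as an alias, and only when that field is non-empty.
    if name == "<control>" and old_name:
        entry.append(f"\t= {old_name}")
    return entry

# Illustrative records in UnicodeData.txt format (not copied from a release).
for record in (
    "0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
):
    print("\n".join(names_list_entry(record)))
```

Under this scheme, changing what the charts print for a control character only requires editing field 10 of its record, which is why that field was the one updated.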

Incidentally, in case you are worried about historic accuracy here,
the "Unicode 1.0 name" field was already fully suborned for the
Unicode 3.0 publication, since the ISO 6429 C1 function names were
inserted into that field, even though Unicode 1.0 had *no* names
at all for C1 control characters.

> IMHO, the new names CHARACTER TABULATION and LINE TABULATION are much less
> intuitive than HORIZONTAL TABULATION and VERTICAL TABULATION. Sometimes you
> even see the abbreviations HT and VT for these two characters. The new names
> appear to have been invented by someone who imagined a lack of clarity in the
> old names.

Kent explained the standards rationale for updating these. It is a matter
of actually using the names from the published version
of the standard we are nominally referring to.

Incidentally, take a look also at NamesList-3.2.0d3.txt in the same BETA
directory. It shows that all the older C0 names have been retained as
further aliases, since they are actually more familiar to most people,
as you are pointing out.

>
> The "old" names for these six control characters were used as far back as the
> original 1963 version of ASCII, according to Mackenzie (pp. 245-247).

Yep. Venerable names. Honored names. Useful names.

>
> I know this 1.0 name field is not subject to the same rule of "no changes,
> ever" that applies to the regular Character Name field, but why should these
> names be changed at all?

Aliases, actually, from the Unicode point of view, not formal names.

And Kent explained why the aliases were updated.

>
> On this same topic, parenthesized abbreviations have been added to the 1.0
> names for U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE
> RETURN (CR), and U+0085 NEXT LINE (NEL). Does the addition of these
> abbreviations mean that they are now part of the official 1.0 name,

Nope.

> and if
> so, why? Other characters typically don't have abbreviations as part of
> their names, even if they are as meaningful and as commonly used as these,
> and again it is a change from the 1.0 name we have seen for a decade.

Off and on, I work at a project to backrev from UnicodeData-1.1.5.txt
to produce a Unicode 1.0 version of UnicodeData.txt, as it would
have been defined if such a data file had been defined at the
time. (It wasn't.) If I get around to posting that, then people can
use the Unicode name field itself as the documentation of what the
Unicode 1.0 name was!

In the meantime, if you want the old time religion for the Unicode 1.0
names, you can extract them from UnicodeData-2.0.14.txt (the version
officially released with Unicode 2.0), before the field was repurposed
for the Unicode 3.0 publication.
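
A minimal sketch of that extraction, assuming the data file uses the usual semicolon-separated layout with the general category in field 2 and the Unicode 1.0 name in field 10. The function names and the sample records are hypothetical, chosen only to show the shape of the result; comparing two versions this way would produce the kind of list Doug cites.

```python
def unicode_1_names(lines):
    """Map code point -> Unicode 1.0 name field, for control characters only."""
    names = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 11:
            continue  # skip blank or malformed records
        if fields[2] == "Cc" and fields[10]:
            names[int(fields[0], 16)] = fields[10]
    return names

def changed_names(old, new):
    """List (code point, old value, new value) where the field differs."""
    common = sorted(old.keys() & new.keys())
    return [(cp, old[cp], new[cp]) for cp in common if old[cp] != new[cp]]

# Illustrative records only; real input would come from the data files, e.g.
#   with open("UnicodeData-2.0.14.txt", encoding="ascii") as f:
#       old = unicode_1_names(f)
old = unicode_1_names([
    "0008;<control>;Cc;0;BN;;;;;N;BACKSPACE;;;;",
    "0009;<control>;Cc;0;S;;;;;N;HORIZONTAL TABULATION;;;;",
    "000C;<control>;Cc;0;WS;;;;;N;FORM FEED;;;;",
])
new = unicode_1_names([
    "0008;<control>;Cc;0;BN;;;;;N;BACKSPACE;;;;",
    "0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;",
    "000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;",
])
for cp, before, after in changed_names(old, new):
    print(f"U+{cp:04X} {before} ==> {after}")
```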

>
> Perhaps I've been checking the beta files a bit TOO carefully.

I suppose we should add a note to UnicodeData.html, clarifying the
special status of the Unicode 1.0 name field for the control
characters.

--Ken

>
> -Doug Ewell
> Fullerton, California
>
>


