RE: Where is the First> Last> convention documented?

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 13 2007 - 18:17:47 CDT

Next message: Stephane Bortzmeyer: "[kim.davies@icann.org: Tool to convert IDN into image file]"

Previous message: Kenneth Whistler: "RE: Where is the First> Last> convention documented?"
Maybe in reply to: Stephane Bortzmeyer: "Where is the First> Last> convention documented?"

> I have not changed my tune nor even my intimate intuition if what I said was
> not clear and could be interpreted differently.

WhatEVer.

> The need for stable names for C0 and C1 controls remains, and when I speak
> about stability, it's not within the Unicode standard itself (because such
> names are still not present), but within applications or documents needing
> names to reference them in a more clear way than just U+00xx (which is not
> ambiguous but not clear enough, for readers, given that even Unicode needs
> to define "aliases" to reference them in many places in its annexes.

I get it that you think there should be a standard list of names for
all the C0 and C1 control codes.

What you seem to be missing still is that the purpose of the Unicode
control character codes, U+0000..U+001F, U+007F..U+009F is for
interoperable mapping to the ISO-2022 framework C0 and C1 control
codes. And in that context, any particular control code does not
have a fixed control function or name. Usage differs by application.

For example, Marc-8 (http://www.loc.gov/marc/specifications/speccharmarc8.html)
makes use of non ISO-6429 C1 control function assignments, namely:

0x88 non-sorting character(s) begin
0x89 non-sorting character(s) end
0x8D joiner
0x8E nonjoiner

Now in what way would discussing mappings and interoperating with
Marc-8 and Unicode be clarified by referring to, for instance,

U+0088 CHARACTER TABULATION SET
U+0089 CHARACTER TABULATION WITH JUSTIFICATION
U+008D REVERSE LINE FEED
U+008E SINGLE SHIFT TWO
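
A minimal Python sketch of the clash, assuming nothing beyond the four
assignments listed above. The table names MARC8_C1 and ISO6429_C1 and the
function describe() are invented for illustration and come from no library:

```python
# Hypothetical lookup tables; they hold only the four C1 codes quoted above.
MARC8_C1 = {
    0x88: "non-sorting character(s) begin",
    0x89: "non-sorting character(s) end",
    0x8D: "joiner",
    0x8E: "nonjoiner",
}
ISO6429_C1 = {
    0x88: "CHARACTER TABULATION SET",
    0x89: "CHARACTER TABULATION WITH JUSTIFICATION",
    0x8D: "REVERSE LINE FEED",
    0x8E: "SINGLE SHIFT TWO",
}


def describe(code):
    """Return the control function assigned to `code` under each convention."""
    return {"MARC-8": MARC8_C1.get(code), "ISO 6429": ISO6429_C1.get(code)}


for code in sorted(MARC8_C1):
    d = describe(code)
    print(f"U+{code:04X}: MARC-8 = {d['MARC-8']!r}; ISO 6429 = {d['ISO 6429']!r}")
```

The same code point maps to two unrelated functions, so a single fixed
label would be right for at most one of the two usages.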

> So your attempt to say that the proposed names using "<>" or "#" within names
> were non conforming are not relevant. What application need are stable names
> even if those names come from another character property which does not
> respect the current rules for existing standard character names. After all,
> Unicode references the "na1" property

Exegesis for those not completely steeped in the arcana of the UCD...
There are two "name" properties carried in UnicodeData.txt in
the UCD:

# ================================================
# Miscellaneous Properties
# ================================================
...
na ; Name
na1 ; Unicode_1_Name

The "Name" property is the normative, immutable character name
property I have been talking about. "Unicode_1_Name" is
an informative property that is neither complete nor
completely consistent, as it has been put to use in part
just to produce ISO 6429 aliases for C0 and C1 control codes,
for printing in the charts.
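
For anyone who wants to see the two fields side by side, here is a rough
Python sketch that pulls both out of UnicodeData.txt records. It assumes the
usual semicolon-separated layout, with Name in field 1 and Unicode_1_Name in
field 10. The two sample records are typed in by hand for illustration and
should be checked against the actual file; names() is a made-up helper:

```python
# Field positions in a semicolon-separated UnicodeData.txt record.
NAME_FIELD = 1             # "na"  : Name
UNICODE_1_NAME_FIELD = 10  # "na1" : Unicode_1_Name

# Hand-typed sample records (illustrative; verify against the real file).
SAMPLE = """\
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
"""


def names(record):
    """Return (code point, field 1, field 10) for one record."""
    fields = record.split(";")
    return int(fields[0], 16), fields[NAME_FIELD], fields[UNICODE_1_NAME_FIELD]


for line in SAMPLE.splitlines():
    code, na, na1 = names(line)
    print(f"U+{code:04X}  na={na!r}  na1={na1!r}")
```

In the control code record, field 1 holds only the placeholder "<control>",
which is not a character name; the ISO 6429 alias sits in field 10.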

> (see the XML proposed format for the
> UCD), and could as well have another property if it does not want to change
> the value of existing properties. And we have lots of other properties for
> CJK ideographs.

Yes, it is always possible to add more properties, including more
informative name attributes, but you would have to convince the
UTC of the cost/benefit tradeoff in doing so. Note that the
printed Unicode Standard (and the machine-readable NamesList.txt)
is full of informative aliases for characters, but other than
the few normative formal aliases, nobody has seen sufficient requirement
to turn these into formal values of character properties.

Furthermore, if the concern is stability of applications, having
*another* name property isn't going to help at all. It isn't
going to change APIs that return *the* character name for
a Unicode character -- i.e. the normative character name.
All it does is introduce another bunch of names in an informative
list for people to get confused about, frankly.
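
As one illustration of such an API, Python's standard unicodedata module
exposes only the normative name. The sketch below wraps unicodedata.name()
in a hypothetical helper, character_name(), that returns None where no
character name exists:

```python
import unicodedata


def character_name(ch):
    """Return the normative character name of `ch`, or None if it has none."""
    # unicodedata.name() raises ValueError for unnamed characters unless a
    # default is supplied; None is passed as that default here.
    return unicodedata.name(ch, None)


print(character_name("A"))       # a graphic character: has a name
print(character_name("\u0088"))  # a C1 control code: no character name
print(character_name("\t"))      # a C0 control code: no character name
```

An additional informative name property would leave the result of a call
like this unchanged.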

> Most commonly used names are those based on 2/3 character abbreviations, so
> these "aliases" are still the best: "NUL, ..., TAB, LF, VT, FF, CR, ... DC1,
> ..., CSI, ...".
>
> I won't take the 2-characters Keld's mnemonic as they are broken even if
> they remain in old charset definition RFCs:

Ah, *some*thing we can agree on!

> But at least, these names would simplify the writing of new specifications,
> or could help disambiguate some old RFCs by making them more precise if some
> normative reference was simply available to specify this without long lists
> of local definitions in each document needing them (including in the Unicode
> standard annexes where these names are needed and redefined locally).

If you think a standard list of string labels for
C0/C1 control codes (either short ones like "TAB" and "LF" or
long ones like "CHARACTER TABULATION" and "LINE FEED") is
required, then by all means, write an RFC specifying your list.

I just don't think the UTC has any interest in going there
for the Unicode Standard, since it is already on record as
having specified the current scheme as printed in the standard --
namely, control codes have no character name, but are printed
with an (informative) alias to the ISO 6429 control function
name (if one exists).

--Ken


This archive was generated by hypermail 2.1.5 : Thu Sep 13 2007 - 18:20:20 CDT