Re: APL Under-bar Characters from Asmus Freytag (t) on 2015-08-16 (Unicode Mail List Archive)

From: Asmus Freytag (t) <asmus-inc_at_ix.netcom.com>
Date: Sun, 16 Aug 2015 19:16:17 -0700

On 8/16/2015 6:57 PM, alexweiner@alexweiner.com wrote:

Bug APL,

After much discussion with The Unicode Consortium Mailing List,

Can we use this to give the characters unique names?

If you need to "elevate" certain combinations to make them named as units, adding named sequences would be the right ticket.
There are many types of repertoires where the elements are not necessarily characters but also fixed sequences.
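
As a rough sketch of how an implementation might consume such entries, the following Python reads lines in the "name;code points" layout used by NamedSequences.txt. The sample entry and its name are hypothetical, made up purely for illustration; no such name is in the real file, and the helper name parse_named_sequences is likewise an assumption.

```python
def parse_named_sequences(text):
    """Parse lines laid out as NAME;XXXX XXXX ... into a name -> string map."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, _, code_points = line.partition(";")
        table[name.strip()] = "".join(
            chr(int(cp, 16)) for cp in code_points.split()
        )
    return table


# HYPOTHETICAL entry for illustration only; this name is not in the real file.
SAMPLE = """\
# Name;Code point sequence
APL LETTER A UNDERBAR (HYPOTHETICAL);0041 0332
"""

names = parse_named_sequences(SAMPLE)
seq = names["APL LETTER A UNDERBAR (HYPOTHETICAL)"]
print([f"U+{ord(c):04X}" for c in seq])  # ['U+0041', 'U+0332']
```

The point is only that a name can map to a sequence of code points just as easily as to a single code point.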

One example comes to mind: internationalized domain names. The new standards track RFC for an xml representation of (among other things) the repertoire for IDNs includes both code points and sequences on equal footing. It would be great if more identifier definitions were constructed that way.

A./

It seems that they will never be given a new code point:

http://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt

Then maybe we could work off that as a pseudo-standard?

-Alex

-------- Original Message --------
Subject: Re: APL Under-bar Characters
From: David Starner <prosfilaes@gmail.com>
Date: Sun, August 16, 2015 6:27 pm
To: alexweiner@alexweiner.com, Ken Whistler <kenwhistler@att.net>
Cc: unicode@unicode.org

http://unicode.org/policies/stability_policy.html , in particular, the Normalization Policy. The way the APL A with underscore is encoded is the way we've been saying, and Unicode has promised its users that there's no other way of writing it.

The current precedent is that when users ask for things like this, they are told they can't have them; for example, the Lithuanians were told that the way to encode LATIN CAPITAL LETTER A WITH OGONEK AND ACUTE is U+0104 U+0301, not any other way. Such sequences can be listed in http://www.unicode.org/Public/UCD/latest/ucd/NamedSequences.txt so that there is a unique name to refer to them, but there will not be any new codepoint.
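
A minimal sketch of what that looks like in practice, assuming Python's standard unicodedata module: the Lithuanian sequence is already in its normalized composed form, and other spellings of the same letter normalize to it. The variable names are illustrative only.

```python
import unicodedata

# A WITH OGONEK followed by COMBINING ACUTE ACCENT, as described above.
seq = "\u0104\u0301"

nfc = unicodedata.normalize("NFC", seq)
nfd = unicodedata.normalize("NFD", seq)
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+0104', 'U+0301']
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0041', 'U+0328', 'U+0301']

# Other spellings of the same letter converge on the same NFC form.
alt_order = "A\u0301\u0328"      # A, acute, ogonek (marks in the other order)
alt_precomposed = "\u00C1\u0328"  # A WITH ACUTE, then combining ogonek
same = all(
    unicodedata.normalize("NFC", s) == nfc
    for s in (alt_order, alt_precomposed)
)
print(same)  # True
```

Existing data and software rely on that NFC result staying at two code points, which is why a new single code point for the same letter is ruled out.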

On Sun, Aug 16, 2015 at 6:16 PM <alexweiner@alexweiner.com> wrote:

David,

I don't understand what you mean by saying that the standard is set. By Ken's account, The Consortium decided to create a policy specifically regarding this, by vote of APL (and I assume interested Unicode) users worldwide. The Standard itself is in version eight. Why does a vote seem so ridiculous, especially in the case of an addition, rather than a subtraction?

What is the current precedent for this sort of thing?

-Alex

-------- Original Message --------
Subject: Re: APL Under-bar Characters

From: David Starner <prosfilaes@gmail.com>
Date: Sun, August 16, 2015 5:59 pm
To: alexweiner@alexweiner.com, Ken Whistler <kenwhistler@att.net>
Cc: unicode@unicode.org

The standard is set here. The Unicode Consortium has declared that it won't encode precomposed characters that can be created from characters in the standard, because that would be destabilizing and potentially introduce security holes in programs depending on Unicode. If you want, we can have a vote on whether or not APL should use characters with underlines, since I was unfairly locked out of that vote by not being born yet.

On Sun, Aug 16, 2015 at 5:52 PM <alexweiner@alexweiner.com> wrote:

Ken,

You pose a very strong and well-worded response. The historical element really helps to illuminate what I thought was lost knowledge: "Why are there no under-bars?" To this I can only ask one thing:

Can we put this to a vote again? To put things in perspective, I was three years old at the time of the ballot in 1993 and had much larger issues to deal with (comprehending speech, learning to walk, etc.), and was unable to participate in this internationally binding vote.

Perhaps feelings about the under-bar characters have changed since then. I know that the APL landscape is very different than it was in 1993.

I have a copy of one of those IBM books that has the italicized upper-case under-bars. If my proposal for a new vote is well received, maybe we should include those as well, for completeness' sake.

-Alex

-------- Original Message --------
Subject: Re: APL Under-bar Characters

From: Ken Whistler <kenwhistler@att.net>
Date: Sun, August 16, 2015 5:15 pm
To: alexweiner@alexweiner.com
Cc: unicode@unicode.org

Alex,

On 8/16/2015 12:41 PM, alexweiner@alexweiner.com wrote:

As far as I know, APL definitely predates the Unicode consortium. Do you think that The Consortium possibly overlooked the pre-existing under-bar character set?

The answer to that is no.

Initially, Unicode 1.0 attempted to punt the entire APL complex functional symbol
problem by encoding U+2300 APL COMPOSE OPERATOR.

The concept was essentially that any of the combined symbols -- the old
rack of stuff that people complained about entering with symbol/backspace/symbol
keying -- could simply be represented as sequences of existing symbols.
Think of 2300 as an early attempt to introduce an APL "script"-specific
conjunct-forming virama, a la much-later artificially introduced script-specific
joiners. Cf. U+2D7F TIFINAGH CONSONANT JOINER.

But U+2300 APL COMPOSE OPERATOR was an innovation that failed.
It was fiercely opposed *by the APL community*, who wanted it
out of 10646 and replaced with an explicit list of pre-formed complex
functional symbols. Presumably for the same reason we are talking
about here now: essentially that each symbol had to work as a "character",
and in an APL context that meant fixed width and the same data size as
all the other characters.

The removal of Unicode 1.0 U+2300 APL COMPOSE OPERATOR is documented
in Unicode 1.1 as of 1993:

http://www.unicode.org/versions/Unicode1.1.0/

(see page 3)

The addition of APL functional symbols is documented in Section 5.4.8, pp. 39-41.

The exact repertoire that ended up encoded in the standard was the result of meetings
between some Unicode representatives and some folks from the APL community. The names
escape me at the moment, although it might be possible to recover some
information eventually. (Documentation regarding Unicode events in late 1991 is
sparse these days.) At any rate the agreed upon additional repertoire is probably
that included in:

X3L2/92-035, Unicode Request for Additional Characters in ISO/IEC 10646-1.2.
And the rest of the consequences and processing can be dug out of the ballot history record
for the voting on 10646 in 1992.

At any rate, a propos *this* discussion, we agreed that the repertoire would cover
all the complex functional symbols, but *not* the letters
with underscores. And it is not that they were simply overlooked.

How do I know? Well, first, there were APL specialists involved in coming up
with (and promoting) the repertoire that was carried into the 10646 balloting at
the time. It isn't as if a bunch of ignorant Unicoders just grabbed one APL
book off the shelf and coded up the table, not noticing that some stuff was
missing.

Second, the text that is currently in the core specification about this issue,
to wit:

" ... All other APL extensions can be encoded by composition of other
Unicode characters. For example, the APL symbol a underbar can be
represented by U+0061 LATIN SMALL LETTER A + U+0332 COMBINING LOW LINE."
(Unicode 7.0, Section 22.7, p. 772)

is *ancient* text. It was first printed on p. 6-83 of Unicode 2.0 in 1996,
with exactly the same wording. And the only reason it took until 1996 to appear,
instead of 1993, was that the editing of Unicode 2.0 and its code charts
was such a massive task at the time.

So the clear intent in *1993* was to represent any APL letter with underbar
as a combining character sequence -- as noted. The only problem I see there
is that the text in the core spec mistakenly used U+0061 (the lowercase "a")
instead of U+0041 (the uppercase "A") for the exemplification.
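
A small sketch of that representation, assuming Python's standard unicodedata module and using the uppercase "A" as suggested above. It compares the sequence with one of the atomic APL functional symbols (U+2339 is picked here only as an example); the variable names are illustrative.

```python
import unicodedata

a_underbar = "A\u0332"   # LATIN CAPITAL LETTER A + COMBINING LOW LINE
quad_divide = "\u2339"   # an atomically encoded APL functional symbol

print(unicodedata.name(a_underbar[1]))  # COMBINING LOW LINE

# There is no precomposed form, so NFC leaves the sequence at two code points.
nfc = unicodedata.normalize("NFC", a_underbar)
print(len(nfc))  # 2

# Code points and UTF-16 code units for each "character" as the user sees it.
sizes = {
    label: (len(s), len(s.encode("utf-16-le")) // 2)
    for label, s in (("A underbar", a_underbar), ("quad divide", quad_divide))
}
print(sizes)  # {'A underbar': (2, 2), 'quad divide': (1, 1)}
```

So an underscored letter is two code points where each functional symbol is one, which is the "same data size" concern described earlier in this message.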

Third, I can attest that at least some of us at the time -- as early as 1989 -- had
printed copies of IBM EBCDIC code page 293 for APL, which had
the EBCDIC uppercase Latin letters with underscores (italicized, by the way),
together with the regular EBCDIC upper and lowercase letters. [Dates from 1984.]
*And* IBM EBCDIC code page 310 for APL, which dropped all the
regular upper- and lowercase letters but added more symbols.
*And* IBM PC code page 907 (with the underscored uppercase Latin
letters) and PC code page 909 (CP437 hacked up for APL, without the
underscored uppercase Latin letters), which was quickly superseded by
PC code page 910, which also did not use the uppercase Latin letters
with underscores.

So yeah, we knew about these. Encoding them as combining character
sequences instead of as atomic characters was a deliberate decision
taken in 1992. And that decision made it through both UTC and
international balloting for publication in 1993.

--Ken

Received on Sun Aug 16 2015 - 21:17:18 CDT
