Re: Apostrophes, quotation marks, keyboards and typography

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 19 1999 - 20:57:15 EDT


Markus,

>
> Hm, strange argument. Why should the apostrophe in "can't" not be a
> letter if "can't" is just a variant spelling of "cannot"?

The apostrophe in English and the elision mark in French are clearly not
letters of either orthography. They are both mechanisms for maintaining
visible morphological distinctions in the writing system, where
phonological reductions have eliminated sounds, in ways that even
these orthographies -- normally tolerant of "silent letters" -- could
not stomach. Thus, in English, "cant" != "can't", "its" != "it's",
and so on, and in French "lente" != "l'ente". In most cases these
are just orthographical conventions for keeping important grammatical
markings (negation, possession, articles) from "disappearing" into
the word, as an aid to reading. This helps keep English from starting
to look like Algonquian, by the way:

   ayshud@gon t@ð@stor bayw@n@klok.

   ("I should've gone to the store by one o'clock.")

But in any case, this apostrophe is not a letter. It traditionally has
been treated as punctuation, though of an unusual sort. And for
computer implementations, the ASCII heritage (itself derived from the
typewriter heritage) means that it is also hopelessly confused with
the single quotation mark (of either direction or neither). On the
other hand, apostrophes and elisions in English or French typically
should not mark word boundaries for word selection. That means that
any "punctuation" apostrophe should get special treatment when a process
is looking for word boundaries. (They typically don't get handled
correctly for selection by simple processes that depend on character
properties alone. Even Word 97, which correctly selects across an apostrophe,
incorrectly picks up a trailing right single quote, while correctly not
picking up a trailing right double quote on a word selection. Perhaps
Word 2000 finally got this one right.)
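
As a rough illustration of the special treatment described above, here is a minimal Python sketch of word selection that keeps an apostrophe only when it sits between two letters. The names `_WORD` and `select_word` are made up for this example, and the joiner set (U+0027 and U+2019) is an assumption, not a rule taken from any standard or product.

```python
import re

# Assumption for this sketch: a "word" is a run of letters, optionally
# joined by U+0027 or U+2019 when a letter follows the mark immediately.
_WORD = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")

def select_word(text, pos):
    """Return the word covering index pos, or None if pos is not in a word."""
    for match in _WORD.finditer(text):
        if match.start() <= pos < match.end():
            return match.group()
    return None

# A word-internal apostrophe stays inside the selection.
print(select_word("I can't go", 5))
# A trailing right single quote is left out of the selection.
print(select_word("\u2018rock\u2019 on", 2))
```

The same rule also drops the mark in a plural possessive such as "boys’", because a trailing apostrophe and a closing quote look identical to a process that sees only the characters.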

But there are clearly orthographies where a mark that looks like an
apostrophe functions as a letter of the orthography. It is a carrier
of a unit of the phonology, fitting in the system of letters, and is
not some type of punctuation. Similarly for the reversed form seen
in Hawaiian, for example, confusable in form with the left single
quotation mark. As such, these letters do not have the same properties
as either an apostrophe or a quotation mark; they are letters and not
punctuation.

0027;APOSTROPHE;Po;0;ON;;;;;N;APOSTROPHE-QUOTE;;;;
                ^
2019;RIGHT SINGLE QUOTATION MARK;Pf;0;ON;;;;;N;SINGLE COMMA QUOTATION MARK;;;;
                                 ^
02BC;MODIFIER LETTER APOSTROPHE;Lm;0;L;;;;;N;;;;;
                                ^
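
The property fields marked above can be inspected programmatically. The following is a small sketch using Python's standard `unicodedata` module; `describe` is a helper name invented for this example, and the values it reports come from whichever version of the Unicode Character Database the Python build bundles, not from the data file quoted here.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, bidi class) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )

# The three look-alike characters discussed in this message.
for ch in ("\u0027", "\u2019", "\u02BC"):
    name, category, bidi = describe(ch)
    print(f"U+{ord(ch):04X} {category} {bidi:2} {name}")
```

The general category is the field that separates the two punctuation marks (Po, Pf) from the letter (Lm).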

> I am beginning to suspect
> that the original rationale behind MODIFIER LETTER
> APOSTROPHE might be a load of nonsense (similar to the "rationale" for
> the many digraphs in Latin B), but I am looking forward to first reading
> the essay that Michael Everson has announced on the topic before I form
> an opinion. Probably someone just made up this distinction between
> punctuation and letter apostrophe.

The distinctions were there. All three functions (quotation mark,
apostrophe/elision, and letter) got typed with the same typewriter
key (usually Shift-8, before the keyboards were expanded for electric
typewriters). That key imprinted a non-directional glyph, so it could
be used for either side of a quotation. But all of that is beside the
point. Remember, on the early typewriters the
"l" key doubled for a "1" and the "O" key doubled for a "0" as well--
a practice that carried over into people's typing habits on electric
typewriters and even computers long after the key (and character code)
distinctions had been clearly made.

Character encodings prior to Unicode (the Mac was an early example)
made the distinction between the directionless quote 0x27 and the
right quotation mark. But even now on Windows systems it is not
easy to enter the right quotation mark, even though it is present in
all the code pages. So we still have massive confusion in the data.

Unicode 1.0 tried to clarify the distinctions by introducing the
letter U+02BC, and claiming that it should also be the one used
for the apostrophe. However, that turned out to be inconsistent
with the way Windows implementations had to be done, because of the
continuing confusion between apostrophes and the right quotation
mark. So in the Unicode 2.1 clarification, apostrophe was identified
with U+2019 instead. That had the benefit not only of following
industry practice, but also making the property distinctions between
U+02BC and U+2019 clearer. That doesn't mean that handling actual
apostrophes is any easier -- they still cannot be distinguished from
quotation marks by code, but have to be analyzed by context.

>
> Before you write Java programs in Navajo, you had better worry
> about allowing me to use identifiers such as
>
> while (it's_not_yet_ready,_ol'_boy,_so_let's_rock_it_again) {
> rock'n'roll();
> }

Coercing U+02BC to be your apostrophe character would let you do
this (since it would be distinguished from U+0027 and U+2019, which
would be interpreted as quotation marks, and thereby delimiters).
U+02BC should not break an identifier in Java.
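
To make the point concrete, here is a small sketch that uses Python's `str.isidentifier()` as a stand-in. This is an assumption of convenience: Python's identifier rules are not Java's, and in Java the corresponding check would be `Character.isJavaIdentifierPart`. The names `CANDIDATES` and `identifier_report` are invented for this example.

```python
# Three spellings of the same would-be identifier, differing only in
# which apostrophe-like character is used.
CANDIDATES = {
    "U+0027": "it's_not_yet_ready",
    "U+2019": "it\u2019s_not_yet_ready",
    "U+02BC": "it\u02BCs_not_yet_ready",
}

def identifier_report(candidates=CANDIDATES):
    """Map each label to whether its string is a valid Python identifier."""
    return {label: text.isidentifier() for label, text in candidates.items()}

print(identifier_report())
```

Only the U+02BC spelling should be accepted, because that character is a letter (general category Lm), while the other two are punctuation.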

>
> Seriously, computer languages are rather controlled artificial languages
> that are NOT intended for end users. They have developed their very own
> and very special idiosyncrasies (for very good reasons); i18n attempts
> at programming language syntax could very easily be badly misguided, and
> using Navajo in program identifiers is most certainly not what we should
> consider to be good software engineering. Even French, Russian, and
> Japanese identifiers are bad enough, especially when mixed in the same
> program. Lucky you, if you never had to maintain one of these. ASCII for
> identifiers is just fine, and if it forces software engineers to stay
> with English identifiers, then trust me, this is a feature, not a bug.

But your point about computer languages is well taken. Certainly
keywords should not be mucked with in formal syntax. And most attempts
to internationalize internal identifiers are misguided, since the
little help they provide to a group that wants to use non-ASCII
identifiers is often offset by the troubles of code maintenance and
portability.

However, there are occasions where extended repertoires for identifiers
do make sense -- when the identifiers themselves can be externally
visible to end users. Examples of this are column and table names
in SQL, which are visible to people designing the database and making
queries on the database. Some of these can also be hidden by appropriate
layering of graphic abstractions over the database, but such
identifiers are qualitatively different in their visibility than
C/C++ identifiers seen only by the programmer and the compiler.

--Ken


