From: Steve Summit (scs@eskimo.com)
Date: Sat May 20 2006 - 10:58:31 CDT
Thanks for your thoughtful reply, Jukka. I hadn't fully thought
about the lexical class of U+02BC. This is clearly the crux of
the matter. But if we think about it carefully, I'm not at all
sure what we mean when we talk about the character which, as you
say, "we commonly regard as (punctuation) apostrophe".
I'm beginning to think that this idea of a "punctuation
apostrophe" is a notional one that exists only in our heads, a
vestige of the bad old days when U+0027 tried to do everything
and clearly had to be regarded as a punctuation character.
Now that we've separated out U+2018 and U+2019 for the quotation
mark uses, and U+2032 for prime, I'm really not sure we ever have
to think about the plain apostrophe as a "punctuation" character
any more!
In particular, let's look at the Unicode Standard's current text
on the matter, which says in section 6.2 ("General Punctuation",
page 159 in version 4.0.0):
    Letter Apostrophe. U+02BC MODIFIER LETTER APOSTROPHE is
    preferred where the apostrophe is to represent a modifier
    letter (for example, in transliterations to indicate a
    glottal stop). In the latter case, it is also referred
    to as a letter apostrophe.

    Punctuation Apostrophe. U+2019 RIGHT SINGLE QUOTATION
    MARK is preferred where the character is to represent a
    punctuation mark, as for contractions: "We've been here
    before." In this latter case, U+2019 is also referred to
    as a punctuation apostrophe.
Now, the key question is: even in English, in what sense is the
apostrophe in the word "we've" actually a punctuation character?
If you're using a graphical, mouse-based environment, move the
mouse cursor anywhere over the word "we've" and double click.
You get the whole word, as clearly you should. Run the word
through your spellchecker. It does not report "ve" as a separate
word which is misspelled, as clearly you would not want it to.
In both of these cases, the apostrophe (the plain old ASCII
apostrophe) is being treated as part of the word. I've written
word-matching code any number of times, and I tend to treat the
apostrophe as a letter for this purpose.
There are complications, of course. If you put the word 'we've'
in single quotes and double click, you tend to get just the word,
not the quotes. And of course this is because the software is
using some fancier heuristics, treating an embedded apostrophe,
with alphabetics on both sides, differently. But then you get
aberrations with possessives, as can be seen when comparing
"my parents' house" and "my sister's house". *BUT*, and this
is the key point, if the apostrophe character were encoded using
a different code point than the single quote character, the
software wouldn't have to resort to those ambiguous heuristics.
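To make the heuristic concrete, here is a minimal sketch in Python. The
function names and the restriction to ASCII letters are assumptions made
for illustration only; real word-selection code is more elaborate. The
first pattern accepts an apostrophe only when letters sit on both sides,
which handles the quoted 'we've' but drops the apostrophe of "parents'".
The second assumes a hypothetical text in which every apostrophe is
U+02BC and the quotes are U+2018/U+2019, so no guessing is needed.

```python
import re

# Heuristic for ambiguous ASCII text: a word is a run of letters,
# optionally joined by an apostrophe that has letters on both sides.
EMBEDDED = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

# Assumed unambiguous encoding: the apostrophe is always U+02BC, so it
# can simply be listed among the word characters.
DISTINCT = re.compile("[A-Za-z\u02bc]+")


def words_heuristic(text):
    """Split ASCII text into words using the embedded-apostrophe rule."""
    return EMBEDDED.findall(text)


def words_distinct(text):
    """Split text whose apostrophes are U+02BC; no heuristic required."""
    return DISTINCT.findall(text)


if __name__ == "__main__":
    print(words_heuristic("'we've'"))            # surrounding quotes dropped
    print(words_heuristic("my sister's house"))  # possessive kept whole
    print(words_heuristic("my parents' house"))  # trailing apostrophe lost
    print(words_distinct("\u2018my parents\u02bc house\u2019"))
```

With the ambiguous encoding, the plural possessive comes back as plain
"parents"; with the distinct code point it keeps its apostrophe.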
In English, I think the only sense in which the apostrophe is not
a "letter" is that we don't list it along with the other 26 when
we say our ABC's, and we don't treat it as significant in
alphabetization. But, really, surprising as it seems, I think
it's much more like a letter than punctuation; and furthermore,
this is as true in English as in other languages which explicitly
call their apostrophe a "modifier letter". I don't think there's
much useful distinction to be made -- syntactic, semantic, or
otherwise -- between the contractive apostrophe in the English
"we've" and, say, the apostrophe-as-glottal-stop in languages
that use it that way. (And, certainly, any such distinction is
considerably *less* significant than the distinction between
apostrophe and closing single quote!)
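The lexical classes under discussion can be inspected directly. The
following is a small Python sketch using the standard unicodedata and re
modules; the helper names describe and word_runs are made up for this
illustration. It prints the General Category of each candidate
character, then shows how a generic "word character" match treats a
contraction spelled with U+02BC versus one spelled with U+2019.

```python
import re
import unicodedata


def describe(cp):
    """Return (name, General Category, isalpha) for a code point."""
    ch = chr(cp)
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isalpha())


def word_runs(text):
    """Split text into runs of Unicode word characters."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    for cp in (0x0027, 0x02BC, 0x2018, 0x2019):
        name, category, alpha = describe(cp)
        print(f"U+{cp:04X} {name}: category={category}, isalpha={alpha}")

    # U+02BC is a letter (Lm), so the contraction stays in one piece.
    print(word_runs("we\u02bcve"))
    # U+2019 is punctuation (Pf), so the same word splits in two.
    print(word_runs("we\u2019ve"))
```

This is roughly the behavior one wants from double-click selection and
spellchecking, obtained here without any special-case heuristic.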
Continuing:
>> (Me, I'd really like to distinguish apostrophes from quotes in
>> textual data, as they're obviously quite different semantically.)
>
> Many people have expressed the same view. It would have meant that a new
> character would have been defined, for unambiguous use as punctuation
> apostrophe.
Well, yes, but only if you think that the "punctuation
apostrophe" is punctuation! Back when the first Unicode
Standard came out, I really did think that not one but several
"new characters had been defined", and that they had been defined
precisely so that I could start making useful distinctions,
e.g. between U+02BC for the true apostrophe and U+2018 and U+2019
for the quotes. It seemed odd that U+02BC was a "modifier
letter" and not the punctuation I still thought it was, but I was
prepared to overlook this, because the code point was
distinct, and the glyph appearance was correct, and everything
would have worked out fine. (But then Unicode 3.0 went and took
away the useful, liberating new distinction I thought I'd been
granted, and I was crushed.)
> I don't think traditional or modern typography ever distinguishes
> between a punctuation apostrophe and a right single quotation
> mark...
Certainly not, which is why we now have the curious situation
that a character whose name is "Right Single Quotation Mark"
can carry a recommendation saying that it is also "the preferred
character for apostrophe".
(Although with that said, I'm noticing that the visual appearance
of U+02BC and U+2019 under several of the systems I use does tend
to be different, although I suspect that this is due much more to
accidents of implementation than to deliberate design.)
> Thus, the difference would be _purely_ semantic. Would people
> really want to make such distinctions in writing?
Probably not in casual writing, but certainly in precise data
encoding. Wouldn't it be nice, for example, if you could
mechanically replace all the single quotes in a document with
double, or check for proper open/close quote pairing, without
having to worry about the apostrophes?
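As a rough sketch of what that would look like, assume a document in
which U+2018/U+2019 are used only as quotes and every apostrophe is
U+02BC. Both function names below are hypothetical, and the pairing
check is deliberately naive (it only counts opens and closes). Under
that assumption both operations become purely mechanical:

```python
def single_to_double(text):
    """Replace single quotation marks with double ones.

    Assumes U+2018/U+2019 are used only as quotes, never as apostrophes.
    """
    return text.translate({0x2018: 0x201C, 0x2019: 0x201D})


def quotes_balanced(text):
    """Naive check that every U+2018 is closed by a later U+2019."""
    depth = 0
    for ch in text:
        if ch == "\u2018":
            depth += 1
        elif ch == "\u2019":
            depth -= 1
            if depth < 0:  # a close with no matching open
                return False
    return depth == 0


if __name__ == "__main__":
    distinct = "\u2018We\u02bcve been here before,\u2019 she said."
    shared = "\u2018We\u2019ve been here before,\u2019 she said."

    print(quotes_balanced(distinct))   # apostrophe is U+02BC
    print(quotes_balanced(shared))     # apostrophe is U+2019
    print(single_to_double(distinct))
    print(single_to_double(shared))    # the apostrophe is converted too
```

When the apostrophe shares U+2019 with the closing quote, the same two
functions report a false imbalance and turn the apostrophe in the
contraction into a closing double quote.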
But certainly, a big part of this issue is that, practically,
people don't have a good way of making such distinctions in
writing. There's still only one key on most keyboards for both
apostrophe and single quote, and the "smart quotes" feature of
e.g. Microsoft Word is usually able to, in effect, turn that
one key into two, but not into three. So if U+02BC is listed
as the "preferred character for apostrophe", people who have no
convenient way of entering it have to feel guilty that they're
doing something wrong. (In fact, in cynical moments, I have
almost concluded that the main reason for the change in Unicode's
recommendations over time about U+02BC and U+2019 was just to
reduce this guilt among users of Microsoft Word.)
Today, however, if I want to reduce ambiguity by reserving
U+2018/U+2019 for quotes, and using U+0027 or U+02BC for
apostrophes, I get beat up for it: people point out Unicode's
recommendation that U+2019 is preferred, and now *I* have to
feel guilty.
> Similarly, the use of the full stop character "." as a sentence
> termination (period) is semantically quite distinct from its use in
> abbreviations (as in "Mr."), and its use as a decimal separator (in
> English) or as a thousands separator (in many other languages) are
> semantically distinct, too.
Funny you should mention that -- just yesterday I was realizing
that having distinct code points for "full stop" versus "decimal
point", and "comma" versus "thousands separator", would be quite
useful, especially when doing on-the-fly conversion of text to
properly locale-representative forms.
> Making distinctions on purely semantic grounds, for a character
> that is commonly understood as one character with multiple uses,
> would apparently have opened a can of worms.
"Would have"? Remember that Unicode has done exactly that in
several other places as well! We've split off U+2010 Hyphen,
U+2013 En Dash, and U+2212 Minus Sign from the old, ambiguous
ASCII U+002D Hyphen-Minus. We've split off U+2044 Fraction
Slash and U+2215 Division Slash from U+002F Solidus. (Granted,
I'm not sure anyone makes use of all these disambiguations,
though of course in typography the true hyphen is distinct.)
We've got U+212B Angstrom Sign distinct from U+00C5 Latin Capital
Letter A with Ring Above, and several other glyph-identical
characters in the 21xx Letterlike Symbols block. We've got
U+00B5 Micro Sign distinct from U+03BC Greek Small Letter Mu,
although of course that one was forced on us by ISO 8859-1.
I'm not sure why the can of worms is so much squirmier for
apostrophes than for the other characters. I was *hoping* that
what would change, over time and with the help of Unicode's
new distinctions, was that people's "common perception of one
character with multiple uses" would be reduced, that people would
start to recognize the distinctions. Unfortunately, in the case
of apostrophes, we've slid backwards, and Unicode has changed to
reflect the common perception that apostrophes and close single
quotes are still the same.