Another take on the English apostrophe in Unicode

Marcel Schneider charupdate at
Mon Jun 15 01:48:07 CDT 2015

On Thu, Jun 11, 2015, Philippe Verdy wrote:

> The ASCII punctuation characters have been overridden for a lot of different roles. There's simply no way to map them to a category that matches their semantic role. So the ASCII hyphen and apostrophe-quote can only be given a very weak category that just reflects their visual role. "Pd" (dash) is then appropriate for the ASCII hyphen-minus. You can't really tell from the character alone if it is a punctuation mark or a minus sign.

> If it is a minus sign you can reencode it better using the more specific mathematical minus sign. Otherwise, even if it is not a minus sign, it can be:
> - a connector between words in compound words (hyphen)
> - a trailing mark at the end of a line indicating that a word has been broken in the middle (but remember that I asked previously for another character for that role, because this word-breaking hyphen is not necessarily a horizontal hyphen: in dictionaries I've seen small slanted tildes, or small slanted equals signs, used to make the distinction from true hyphens in compound words, also because these breaks are sometimes not between two syllables in "pocket books" with very narrow columns and minimized spacing)
> - a bullet leading items in a vertical list (this should be an en dash, followed by some spacing)
> - a punctuation mark (not necessarily at the beginning of a line) marking the change of person speaking (very common in literature, notably in theatre).

> As a connector between words, there is a demonstrated need to differentiate regular hyphens, longer hyphens (preferably surrounded by thin spaces) for noting intervals (we can use the EN DASH for that), and long hyphens between two separate names that are joined (for example in proper names after marriage; in France, INSEE currently encodes this with TWO successive hyphens, which are also used on French identity cards, passports, social security green cards...).

In most fonts, the glyph of the hyphen-minus U+002D is the same as that of the hyphen U+2010, while the minus sign U+2212 is longer and sits higher, at half the height of the digits, so that it matches the digits it stands between or before. The hyphen and the hyphen-minus, by contrast, are positioned at half the height of lowercase letters. Used as a minus sign, they work well only with Elzevir (old-style) digits. This is why, in most fonts, the hyphen-minus U+002D looks very unpleasant as a minus sign, especially when the plus sign, the equals sign and other operators are present too.
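The three characters compared above can be looked up in the Unicode Character Database to see how they are classified. This is only a small sketch using Python's standard `unicodedata` module; the helper name `describe` is mine, chosen for illustration.

```python
import unicodedata

# The three characters compared in the paragraph above.
CANDIDATES = ["\u002D", "\u2010", "\u2212"]

def describe(ch: str) -> str:
    """Return 'U+XXXX  category  NAME' for a single character."""
    return f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}"

for ch in CANDIDATES:
    print(describe(ch))
```

The hyphen-minus and the hyphen both come out as "Pd" (dash punctuation), while the minus sign is "Sm" (math symbol), which is the weak visual category Philippe describes for U+002D.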

In this, the hyphen differs from the apostrophe U+0027, whose differentiated characters (letter apostrophe U+02BC and single close-quote U+2019) have exactly the same glyph. But hyphen and apostrophe resemble each other in that many fonts contain only some of these related characters and lack the others. So even in Arial, where the letter apostrophe U+02BC is present, the hyphen U+2010 is missing. The user is supposed to use U+002D as a hyphen and U+2212 as the minus sign. The system hyphen displayed at an automatic word break at the end of a line is converted to U+002D for PDF. This isnʼt ideal, as you point out, because to reverse the word break, one canʼt simply replace all U+002D by nothing. Word processors allow removing all instances of (U+002D, EOL), but this can delete some orthographic hyphens. The solution would be to use U+2010 for orthographic hyphens (with compatible fonts) and to let the system place its U+002D.
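To illustrate why that convention would help, here is a minimal sketch of reversing word breaks in extracted text. It assumes that automatic breaks always appear as U+002D followed by a line end and that orthographic hyphens were typed as U+2010; the function name `rejoin_broken_words` is hypothetical, not taken from any word processor.

```python
import re

HYPHEN_MINUS = "\u002D"  # assumed here to be what the system emits at a word break
HYPHEN = "\u2010"        # assumed here to be reserved for orthographic hyphens

def rejoin_broken_words(text: str) -> str:
    """Undo automatic word breaks without touching orthographic hyphens."""
    # Drop every (U+002D, EOL) pair: these are taken to be automatic word breaks.
    text = re.sub(re.escape(HYPHEN_MINUS) + r"\r?\n", "", text)
    # Keep U+2010 at a line end and remove only the line break after it.
    return re.sub(re.escape(HYPHEN) + r"\r?\n", HYPHEN, text)

# A broken word is rejoined correctly.
print(rejoin_broken_words("typo-\ngraphy"))
# An orthographic hyphen typed as U+002D is lost at a line end.
print(rejoin_broken_words("well-\nknown"))
# The same hyphen typed as U+2010 survives.
print(rejoin_broken_words("well\u2010\nknown"))
```

With U+002D in both roles, the second call yields "wellknown"; with U+2010 for the orthographic hyphen, the third call keeps it.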

The letter apostrophe U+02BC is indispensable because the glyph of U+0027 is unfit for typography. We are also told that U+0027 is unstable, but this is mainly due to the autocorrect smart quotes, which can be turned off at input. From now on I use autocorrect to convert U+0027 to U+02BC.
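Outside a word processor, the same conversion could be approximated as follows. This is a sketch under a stated assumption, not a description of any autocorrect feature: it only converts a U+0027 that stands directly between two letters, and the name `to_letter_apostrophe` is made up for the example.

```python
import re

APOSTROPHE = "\u0027"
LETTER_APOSTROPHE = "\u02BC"

# Assumption: a U+0027 directly between two letters is an apostrophe.
# [^\W\d_] matches any Unicode letter (word characters minus digits and underscore).
_BETWEEN_LETTERS = re.compile(r"(?<=[^\W\d_])" + APOSTROPHE + r"(?=[^\W\d_])")

def to_letter_apostrophe(text: str) -> str:
    """Replace word-internal U+0027 with U+02BC; leave every other U+0027 alone."""
    return _BETWEEN_LETTERS.sub(LETTER_APOSTROPHE, text)

print(to_letter_apostrophe("This isn't ideal, one can't say 'no'."))
```

Word-final apostrophes (as in a plural possessive) and word-initial ones are deliberately left untouched here, because they cannot be told apart from single quotes without context.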

Another difference between apostrophes and hyphens, and perhaps the main difference, is that except when they are used for a word break, hyphens generally donʼt need to be replaced at further stages. At input, the user will replace U+002D with U+2212 where appropriate, and the autocorrect may replace two hyphens with an en dash U+2013. In some fonts, U+002D will need to be replaced with U+2010 for glyphic reasons.
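The input-time replacements just mentioned could be sketched like this. The rule for "where appropriate" is my own rough heuristic (a U+002D right before a digit and not right after a letter or digit is taken to be a minus sign), and `normalize_dashes` is a hypothetical name; a real user would decide case by case.

```python
import re

MINUS_SIGN = "\u2212"
EN_DASH = "\u2013"

def normalize_dashes(text: str) -> str:
    """Apply two input-time replacements to U+002D."""
    # Two successive hyphen-minus characters become one en dash.
    text = text.replace("--", EN_DASH)
    # Heuristic assumption: a hyphen-minus directly before a digit, and not
    # directly after a letter or digit, is a minus sign.
    return re.sub(r"(?<!\w)-(?=\d)", MINUS_SIGN, text)

print(normalize_dashes("x = -5, pages 3--7, well-known"))
```

The hyphen in "well-known" stays U+002D, which is the point made above: apart from these cases, hyphens need no later conversion.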

By contrast, quotes are to be converted, as Ted Clancy points out in his paper.
Making one of them ambiguous with the apostrophe was a very bad idea.
Well, I still believe it was *not* the idea of any Unicode Committee, nor of any Standards Body at all.
