Another take on the English apostrophe in Unicode

Philippe Verdy verdy_p at
Thu Jun 11 00:17:11 CDT 2015

The ASCII punctuation characters have been overloaded with many different
roles. There's simply no way to map them to a category that matches their
semantic role. So the ASCII hyphen and apostrophe-quote can only be given a
very weak category that just reflects their visual role. "Pd" (dash) is then
appropriate for the ASCII hyphen-minus. You can't really tell from the
character alone whether it is a punctuation mark or a minus sign.
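
As a quick check of these categories, here is a minimal Python sketch using
the standard unicodedata module. The list of code points is only a selection
made for illustration, and the values reported depend on the Unicode version
bundled with the Python build in use.

```python
import unicodedata

# Code points selected for illustration only (not an exhaustive list).
SAMPLES = [0x002D, 0x2010, 0x2013, 0x2212, 0x00AD, 0x0027, 0x2019]

def describe(cp):
    """Return (U+XXXX label, general category, character name) for a code point."""
    ch = chr(cp)
    return f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>")

for cp in SAMPLES:
    # One line per character: label, general category, name.
    print(*describe(cp))
```

The category alone says nothing about which role a given hyphen-minus plays
in a text; it only groups it with the other dashes.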

If it is a minus sign you can re-encode it better using the more specific
mathematical minus sign (U+2212). Otherwise, if it is not a minus sign, it
can be:
- a connector between words in compound words (hyphen)
- a trailing mark at the end of a line, indicating that a word has been
broken in the middle (but remember that I asked previously for another
character for that role, because this word-breaking hyphen is not
necessarily a horizontal hyphen: in dictionaries I've seen small slanted
tildes, or slanted small equal signs, to make the distinction with true
hyphens used in compound words, also because sometimes these breaks are not
necessarily between two syllables in "pocket books" with very narrow columns
and minimized spacing)
- a bullet leading items in a vertical list (this should be an en dash,
followed by some spacing)
- a punctuation mark (not necessarily at the beginning of a line) marking the
change of person speaking (very common in literature, notably in theatre).
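
The re-encoding of a minus sign can be sketched in Python as follows. The
heuristic is an assumption made for this example, not a rule from any
standard: a hyphen-minus is taken to be a minus sign only when it directly
precedes a digit and does not directly follow a letter, digit or underscore.
The name reencode_minus is hypothetical.

```python
import re

MINUS_SIGN = "\u2212"  # U+2212 MINUS SIGN

# Assumed heuristic: "-" directly before a digit, and not directly after a
# word character, is a unary minus. Compound words ("well-known") and
# unspaced ranges ("3-5") are left untouched.
_UNARY_MINUS = re.compile(r"(?<!\w)-(?=\d)")

def reencode_minus(text):
    """Replace hyphen-minus used as a unary minus with U+2212."""
    return _UNARY_MINUS.sub(MINUS_SIGN, text)
```

A binary minus written with spaces ("a - 3") is deliberately not handled,
because without more context it cannot be told apart from a dash used as
punctuation, which is exactly the ambiguity described above.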

As a connector between words, there's a demonstrated need to differentiate
regular hyphens, longer hyphens (preferably surrounded by thin spaces) for
noting intervals (we can use the EN DASH for that), and long hyphens between
two separate names that are joined (for example in proper names after
marriage; in France, INSEE encodes this for now using TWO successive
hyphens, which are also used on French identity cards, passports, social
security green cards...).
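
For the interval case, a minimal Python sketch could look like this. It
assumes that "digits-digits" not adjacent to other digits or hyphens is a
numeric range; the name mark_intervals is hypothetical and the pattern is
only a rough heuristic.

```python
import re

THIN_SPACE = "\u2009"  # U+2009 THIN SPACE
EN_DASH = "\u2013"     # U+2013 EN DASH
SEPARATOR = THIN_SPACE + EN_DASH + THIN_SPACE

# Assumed heuristic: digits, hyphen-minus, digits, with no further digit or
# hyphen on either side. This skips ISO dates such as 2015-06-11, but it
# cannot tell a range from a subtraction written without spaces.
_RANGE = re.compile(r"(?<![\d-])(\d+)-(\d+)(?![\d-])")

def mark_intervals(text):
    """Rewrite numeric ranges with an EN DASH surrounded by thin spaces."""
    return _RANGE.sub(lambda m: m.group(1) + SEPARATOR + m.group(2), text)
```

The doubled hyphen used by INSEE for joined names is not touched by this
sketch, since it involves no digits.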


Still, nobody has replied to my earlier comment (about a month ago) on the
various forms of the word-breaking hyphen / line-wrapping symbol:

* I'm not speaking about the SHY control, but about the real character
whose glyph appears when SHY is materialized at the end of a line (and which
should be neither a minus nor an en dash, but also not the same as the
orthographic hyphen used between words in a compound word).

* This character is also found (and needed) for breaking long mathematical
formulas, and must be clearly distinct from the regular minus.

* This character is also needed for rendering long lines of programming
code or textual data (it is something that must not be entered in programs
but that must be rendered, because these programs or codes have significant
line breaks: the glyph indicates that the following rendered line break is
to be discarded). Not all programming languages have a syntax allowing an
escape before the line break (such escaping varies: it may be a backslash in
C/C++, or an underscore in Basic), and in data dumps such as CSV files it is
impossible to note such an escape in the data language itself, so we need to
render some specific glyph.

* This character is absolutely needed when rendering on a static medium
(i.e. printing or broadcasting); for dynamic media (such as personal
displays with a personal UI) we could still use scrolling, but users don't
like horizontal scrolling and much prefer reading the text directly. So they
expect to see a distinctive glyph (or icon) that tells apart the line breaks
that are significant from those that just wrap overly long lines, and still
see the distinction from other regular hyphens and minus signs (which are
also significant and very frequently distinct).
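
The rendering behaviour described in the points above can be sketched in
Python. Since no dedicated character exists for this role, the sketch uses
U+21A9 purely as a stand-in glyph; that choice, and the name render_wrapped,
are assumptions made for illustration.

```python
# Stand-in glyph for a soft wrap (assumption: any distinctive mark would do).
WRAP_MARK = "\u21A9"  # U+21A9 LEFTWARDS ARROW WITH HOOK

def render_wrapped(text, width, mark=WRAP_MARK):
    """Split each logical line into display rows of at most `width` characters.

    A row that ends at a soft wrap (not a real line break) gets `mark` in its
    last position; rows that end at a significant line break get nothing.
    The mark exists only in the rendering, never in the data itself.
    """
    if width < 2:
        raise ValueError("width must be at least 2")
    rows = []
    for line in text.split("\n"):
        while len(line) > width:
            # Keep width - 1 characters and reserve the last position for the mark.
            rows.append(line[: width - 1] + mark)
            line = line[width - 1 :]
        rows.append(line)
    return rows
```

For example, render_wrapped("abcdefghij\nxy", 5) gives four rows, the first
two ending with the mark. The sketch counts code points, not display columns
or grapheme clusters, so it is only adequate for simple monospaced text.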

2015-06-11 0:51 GMT+02:00 Ted Clancy <tclancy at>:

> On 4/Jun/2015 19:01, Leo Broukhis wrote:
>> Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for
>> example, the word ack-ack isn't decomposable into words, or even morphemes,
>> "ack" and "ack".
> I do think that U+2010 (HYPHEN) is miscategorised. I think it should have
> General Category = Pc, not Pd. (That is, hyphens are connectors, not
> dashes.) That would make it a "word" character.
> Or, at the very least, U+2010 should have Word Break = MidNumLet (meaning
> it can occur in the middle of numbers or letters). UAX #29 says that U+2010
> deliberately does *not* have Word Break = MidNumLet, though an
> implementation may treat it as if it did. (UAX #29 doesn't give any reasons
> for this decision. I can understand why U+002D (HYPHEN-MINUS) doesn't have
> Word Break = MidNumLet, due to its history of being used as a dash or minus
> sign, but U+2010 should never be used as a dash or minus sign, so I don't
> see the problem.)
> But luckily, the miscategorisation of U+2010 hasn't led to any pressing
> practical problems, unlike the misuse of U+2019 for the apostrophe.
> - Ted
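
The tailoring mentioned in the quoted message (treating U+2010 as if it
could occur in the middle of a word) can be illustrated with a small Python
sketch. This is not an implementation of UAX #29; it is a simplified regex
tokenizer, and the names PLAIN, TAILORED and words are hypothetical.

```python
import re

HYPHEN = "\u2010"  # U+2010 HYPHEN

# Plain tokenizer: a word is a run of letters and digits only.
PLAIN = re.compile(r"[^\W_]+")

# Tailored tokenizer: U+2010 between two alphanumeric runs stays inside the
# word, roughly what a "mid-word" treatment of the hyphen would give.
TAILORED = re.compile(rf"[^\W_]+(?:{HYPHEN}[^\W_]+)*")

def words(text, tailored=False):
    """Return the list of words found in `text`."""
    pattern = TAILORED if tailored else PLAIN
    return pattern.findall(text)
```

With the plain pattern, "ack‐ack" (written with U+2010) comes out as two
words; with the tailored one it stays a single word, while a hyphen-minus is
still a separator in both cases.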