CLDR Ticket #11332(new data)

Opened 10 days ago

Last modified 10 days ago

French needs U+0027 ['] and U+005F [_], and should have U+2015 [―], in exemplar punctuation

Reported by: Marcel Schneider <charupdate@…>
Owned by: anybody
Component: main
Data Locale:
Phase: dsub
Review:
Weeks:
Data Xpath:


Following up on the forum post by Mark Davis http://st.unicode.org/cldr-apps/v#forum/fr//30743 "[v34] 2018-08-09 16:05", I kindly request that the ASCII apostrophe, the underscore and the horizontal bar be included in the set of French punctuation exemplars at French / Core Data / Alphabetic Information / Others: punctuation http://st.unicode.org/cldr-apps/v#/fr/Alphabetic_Information/6cf943e652b01478.

U+0027 and U+005F

Inclusion of U+0027 APOSTROPHE and U+005F LOW LINE is justified because both are on French computer (and typewriter) keyboards and are directly accessible on the mainstream layout. As a result, APOSTROPHE is in current use in place of the preferred (and likewise currently used) U+2019 RIGHT SINGLE QUOTATION MARK. The underscore, also known as the “hyphen on the 8 key” (“tiret du 8”), is part of many end users’ e-mail addresses and therefore should not be excluded from French punctuation, given that the e-mail address plays a crucial role in personal identity on the internet and, consequently, in real life. I note that the rationale for both characters also holds for English and many other languages.
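For reference, the characters discussed in this section can be looked up in the Unicode Character Database. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is hypothetical, and U+2019 is listed purely for comparison with U+0027.

```python
import unicodedata

# Code points named in this section; U+2019 is included only for comparison.
CODE_POINTS = [0x0027, 0x005F, 0x2019]


def describe(cp: int) -> str:
    """Return the code point, its Unicode name and its general category."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.name(ch)} (category {unicodedata.category(ch)})"


for cp in CODE_POINTS:
    print(describe(cp))
```

All three fall under a punctuation general category (the categories starting with `P`), which is at least consistent with listing them among punctuation exemplars.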


U+2015, by contrast, is a less obvious case and could well be a candidate for exclusion, but it should still be included because, despite a number of issues, it is a Unicode character whose semantics match French usage. Its annotation in the French translation of the code charts states that it can be used to introduce questions and responses in dialogues, much as in English:
“peut s’utiliser pour introduire les répliques des dialogues ; le tiret cadratin remplit la même fonction”
(can be used to introduce the replies of dialogues; the em-dash performs the same function)

Therefore U+2015 should be included in the French exemplar punctuation, if only because Unicode chose to encode it and to give it semantics that are also used in French.


The issues, as I see them, lie on both the typographers’ side and Unicode’s side. In the Unicode Standard, the glyph of U+2015 has the same length as that of U+2014, while the two differ in semantics. That semantic difference vanishes in the French translation. Fonts like Cambria, however, show a difference in length, so that the dashes progress by quarters of an em, from the quarter dash U+2010 to the full dash U+2014, with U+2013 as the half dash and U+2015 as the three-quarter dash. Typographers seem to skip the three-quarter dash. That is particularly striking in French, where U+2010 is actually called “tiret quart de cadratin” and U+2013 “tiret demi-cadratin”, relative to U+2014 “tiret cadratin”. So there seems to be no name left for U+2015, which, for consistency, I propose to call “tiret trois-quarts de cadratin”.
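The quarter-em progression described above can be laid out as a small table. This is a sketch only: the widths are the nominal figures given in this ticket for fonts like Cambria, not measurements from any font file, and the names `NOMINAL_WIDTH`, `FRENCH_NAME` and `by_width` are hypothetical.

```python
from fractions import Fraction
import unicodedata

# Nominal widths in ems, as described in the ticket (assumption: not measured).
NOMINAL_WIDTH = {
    0x2010: Fraction(1, 4),
    0x2013: Fraction(1, 2),
    0x2014: Fraction(1, 1),
    0x2015: Fraction(3, 4),
}

# French names quoted in the ticket; the one for U+2015 is the proposed name.
FRENCH_NAME = {
    0x2010: "tiret quart de cadratin",
    0x2013: "tiret demi-cadratin",
    0x2014: "tiret cadratin",
    0x2015: "tiret trois-quarts de cadratin",
}


def by_width() -> list[int]:
    """Return the code points ordered from shortest to longest nominal width."""
    return sorted(NOMINAL_WIDTH, key=NOMINAL_WIDTH.get)


for cp in by_width():
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X}  {NOMINAL_WIDTH[cp]} em  {name}  ({FRENCH_NAME[cp]})")
```

Ordered this way, U+2015 sits between U+2013 and U+2014, which is the slot the proposed French name is meant to fill.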

I cannot leave U+2015 out of the punctuation set I proposed, given that my keyboard layout has U+2015 on Shift+3, U+2014 on Shift+4, and U+2013 on Shift+5, using numeric mnemonics rather than ordering by increasing length or by code point. (The mnemonics would be better on 2, 3 and 4, but Shift+2 is taken by uppercase É, and the third level by digits.) I am about to propose this keyboard layout as a prototype for testing. It is not about completeness (e.g. U+2010 is no longer mapped on any key, nor is U+2012, but U+2011 is), but about not arbitrarily depriving users of an option that does exist in practice with good fonts. Often U+2013 looks too short and U+2014 too long, so that U+2015 is the right choice, provided it has the right length in the typeface.


Change History

comment:1 Changed 10 days ago by charupdate@…

I’m worried about the removal of the TC vote for the (ad hoc) [!-#\&(-*,-/\:;?@\[\]§«»‐-—‘’“”†‡…‹›] item. While excluding the three characters mentioned above, it significantly completed the Accepted Data set [\- ‐ – — , ; \: ! ? . … ’ " “ ” « » ( ) \[ \] § @ * / \& # † ‡] with the non-breaking hyphen U+2011, the single angle quotation marks U+2039 U+203A, and the turned apostrophe-quote U+2018. I was grateful for this TC vote even though the set was slightly reduced (see above), and it would be deplorable if we again had to stick with a truncated punctuation subset.
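The relationship between the two sets can be checked mechanically. The sketch below transcribes both sets by hand from the notation quoted in this comment, so it rests on two assumptions: that the range `‐-—` runs from U+2010 through U+2014, and that no character was lost in transcription. The names `expand`, `ACCEPTED`, `ADHOC` and `REQUESTED` are hypothetical.

```python
def expand(ranges):
    """Expand inclusive (first, last) code point ranges into a set."""
    chars = set()
    for first, last in ranges:
        chars.update(range(first, last + 1))
    return chars


# Accepted Data set, transcribed from the comment (assumption: complete).
ACCEPTED = {ord(c) for c in "-,;:!?.\"()[]@*/&#"} | {
    0x2010, 0x2013, 0x2014, 0x2026, 0x2019, 0x201C, 0x201D,
    0x00AB, 0x00BB, 0x00A7, 0x2020, 0x2021,
}

# Ad hoc set: the ranges !-#  &  (-*  ,-/  and U+2010..U+2014, plus single characters.
ADHOC = expand([(0x21, 0x23), (0x26, 0x26), (0x28, 0x2A), (0x2C, 0x2F),
                (0x2010, 0x2014)]) | {ord(c) for c in ":;?@[]"} | {
    0x00A7, 0x00AB, 0x00BB, 0x2018, 0x2019, 0x201C, 0x201D,
    0x2020, 0x2021, 0x2026, 0x2039, 0x203A,
}

# The three characters requested in this ticket.
REQUESTED = {0x0027, 0x005F, 0x2015}

added = sorted(ADHOC - ACCEPTED)       # what the ad hoc set adds
missing = sorted(REQUESTED - ADHOC)    # requested but absent from the ad hoc set

print("added:  ", " ".join(f"U+{cp:04X}" for cp in added))
print("missing:", " ".join(f"U+{cp:04X}" for cp in missing))
```

If the range is read as assumed, the expansion also covers U+2012, which the comment does not list separately among the additions.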

I could not have anticipated that U+0027, U+005F and U+2015 would be problematic, nor can I accept seeing them excluded. The legacy Accepted set has no authority, as it is incomplete and inconsistent. While including U+0022, it excludes U+0027. If the goal were to define a typographical set, both would have to be excluded. But in a set of French exemplar punctuation, both must be included.

The next inconsistency is the inclusion of the at sign and the number sign alongside the exclusion of the underscore. The former two were not on French typewriters, but the latter was (on the unshifted 8 key). The historical and current use of @ and # in France is not stronger than the use of LOW LINE, called “tiret bas” or formerly “souligné”. In computerized text, it is used to denote underlining by bracketing a word or a phrase, as in other languages.

The legacy set is also inconsistent in that it includes U+2010 but excludes U+2011, U+2012, and U+2015. In practice, the status of U+2010 with respect to U+002D is somewhat akin to the status of U+2015 with respect to U+2014: it is not used, because it is deemed a mere clone of its counterpart. That is untrue of U+2015 (as shown above), but in practice it is true of U+2010, which is a homoglyph of U+002D in nearly all fonts. This is correct (and the fonts that follow the opposite policy, of which I know only one, are wrong), whereas U+2015 should not at all be the homoglyph of U+2014 that it is shown as.

Yet another drawback of the Accepted Data is the exclusion of all single quotation marks, whether angle-shaped or apostrophe-shaped (U+2039 U+203A; U+2018 as counterpart of U+2019). That makes fr-FR look stubborn in contrast with all surrounding locales. I am ashamed of that situation and reject it as unacceptable.

Hence the current set of French punctuation in CLDR was carelessly, incompetently and unprofessionally thrown together when CLDR was set up, and Unicode’s blocking threshold of 20 votes is distracting and discouraging vetters from reviewing the data, all the more so as such sets may be considered non-crucial compared with the thousands of other issues in CLDR, or even irrelevant, since these sets are (AFAIK) not used to check user input for disallowed characters. Nevertheless, I think that if these exemplars are not pet data, they should be completed.

Kind request

Please consider voting for the set that currently has the most support, or at worst adopt any subset of it, but please do not leave the fake set that is currently “Accepted” in place any longer.


Single angle quotation marks

“Guillemets français simples, en forme de chevron (‹ ... ›), séparés de leur contenu par une espace insécable (usage philologique)”
French single quotation marks, angle-shaped (‹ … ›), […] (used in philology)

