CLDR Ticket #11423(accepted)

Opened 7 months ago

Last modified 3 months ago

Group separator migration from U+00A0 to U+202F

Reported by: charupdate@… Owned by: kristi
Component: exemplars-etc Data Locale:
Phase: dvet Review:
Weeks: Data Xpath:
Xref:

Description

To be cost-effective, locales using space as the number group separator should migrate all at once from the wrong U+00A0 to the correct U+202F. I didn’t aim at making French stand out, but at correcting an error in CLDR. Having even the Canadian French sublocale stick with the wrong value makes no sense; it is mainly due to opaque inheritance relationships and to severe constraints on vetters, who apply for fr-FR and are subsequently reduced to looking on helplessly from the sidelines when sublocales are not getting fixed.

Change History

comment:1 Changed 7 months ago by charupdate@…

After having painstakingly caught up on support for a narrow fixed-width no-break space (U+202F),
the industry is now ready to migrate from U+00A0 to U+202F. Doing it in a single rush is far more
cost-effective than migrating one locale this time, another locale next time, a handful of locales the time
after, possibly splitting them up into sublocales with different migration schedules. I really believed that
now that Unicode proves ready to adopt the real group separator in French, all relevant locales would be
consistently pushed to correct that value in release 34. The v34 alpha overview makes clear they
are not.

http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration

I aimed at correcting an error in CLDR, not at making French stand out. Having many locales and
sublocales stick with the wrong value makes no sense any more.

https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d

The only effect is implementers skipping migration for fr-FR while waiting for the others to catch up,
then doing it for all at once.

There seems to be a misunderstanding: The locale setting is whether to use period, comma, space,
apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic.
Whether "space" is NO-BREAK SPACE or NARROW NO-BREAK SPACE is not a locale setting,
but it’s all about Unicode design and Unicode implementation.
I really thought that that was clear and that there’s no need to heavily insist on the ST "French" forum.
When referring to the "French thousands separator" I only meant that unlike comma- or period-using
locales, the French locale uses space and that the group separator space should be the correct one.
That did not mean that French should use another space than the other locales using space.

comment:2 Changed 7 months ago by Marcel Schneider <charupdate@…>

I have to confess that I did focus on French and only applied for fr-FR, but there was a lot of work (see http://cldr.unicode.org/index/downloads/cldr-34#TOC-Growth) waiting for very few vetters. Nevertheless I also cared for English (see various tickets), and also posted on CLDR-users, in a belated P.S., that fr-CA hadn’t caught up with the group separator correction yet: https://unicode.org/pipermail/cldr-users/2018-August/000825.html

Also I’m sorry for failing to provide appropriate feedback after beta release and to post upstream messages urging to make sure all locales using space for group separator be kept in synchrony.

I think the point about not splitting up all the data into locales is a very good one.

There should be a common pool so that all locales using Arabic script have automatically group separator set to ARABIC THOUSANDS SEPARATOR (provided it actually fits all), and those locales using space should only need to specify "space" to automatically get the correct one, ie NARROW NO-BREAK SPACE as soon as Unicode is ready to give it currency in that role.

Also there is a display issue in the charts, where whitespace characters show up as what they are: blanks, regardless of whether they are wide or narrow, justifying or fixed-width. Non-breaking behavior may be inferred from context, but other correct behavior cannot: numbers were supposed to be grouped using a justifying space, so grouping only works halfway, namely where justification is turned off (e.g. in Wikipedia).

comment:3 follow-up: ↓ 4 Changed 7 months ago by verdy_p@…

Note that there's another open ticket about making the CLDR survey display the actual code point used for whitespace, just as it already does for some controls.

While the submission phase was open, I discussed how to get the actual code points: use the browser's developer console to select what is displayed in the page, then enter a JavaScript expression in the console to convert the UTF-16 code units to hexadecimal (these are not necessarily the code points, but all whitespace code points in question here are in the BMP and are therefore encoded in UTF-16 as a single code unit).
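For readers who want to reproduce that check outside the browser, here is a minimal sketch in Python rather than JavaScript. It is not the expression discussed during the submission phase; the function names and the sample string are made up for illustration.

```python
import unicodedata


def utf16_hex(text: str) -> list:
    """Return the UTF-16 code units of text as 4-digit uppercase hex strings."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


def name_spaces(text: str) -> list:
    """List (code point, name) for every space separator (category Zs) in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if unicodedata.category(ch) == "Zs"
    ]


if __name__ == "__main__":
    # Hypothetical cell content, as it might be copied from a chart page.
    sample = "1\u202f234\u00a0567"
    print(utf16_hex(sample))
    print(name_spaces(sample))
```

Since all the spaces at issue are in the BMP, each one appears as exactly one code unit in the first list, and the second list tells the otherwise indistinguishable blanks apart by name.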

Because of this lack of differentiation, it is simply too difficult and too time-consuming to comment on each entry and then vote for changes consistently. So the locale data after the submission and vetting phase is completely incoherent: the change was largely agreed in a few locales, but many locales were left behind (and this causes a lot of inconsistencies in many other locales due to fallback mechanisms).

So we now need to check these whitespace differences everywhere in the CLDR data: we'll need a bot action to restore at least the data consistency, and to get some statistics on agreed changes per locale to see which locales should have their whitespace changed consistently.
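A rough sketch of what such a consistency check could look like, in Python. The XML snippets are hand-made stand-ins that follow the LDML `numbers/symbols/group` layout, not real CLDR files, and the function names are hypothetical; a real bot would read the locale files from the repository and would also have to resolve inheritance.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Hand-made sample data (assumption): one tiny LDML fragment per locale.
SAMPLES: Dict[str, str] = {
    "fr": '<ldml><numbers><symbols numberSystem="latn">'
          '<group>\u202f</group></symbols></numbers></ldml>',
    "fr_CA": '<ldml><numbers><symbols numberSystem="latn">'
             '<group>\u00a0</group></symbols></numbers></ldml>',
    "ru": '<ldml><numbers><symbols numberSystem="latn">'
          '<group>\u00a0</group></symbols></numbers></ldml>',
    "en": '<ldml><numbers/></ldml>',  # no explicit value in this fragment
}


def group_separator(ldml_xml: str) -> Optional[str]:
    """Return the explicit Latin-digits group separator, or None if absent."""
    root = ET.fromstring(ldml_xml)
    node = root.find("./numbers/symbols[@numberSystem='latn']/group")
    return node.text if node is not None else None


def audit(samples: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each separator (as U+XXXX notation) to the locales that use it."""
    report: Dict[str, List[str]] = {}
    for locale, xml_text in samples.items():
        sep = group_separator(xml_text)
        if sep is None:
            continue  # value would come from a parent locale; not resolved here
        key = " ".join(f"U+{ord(ch):04X}" for ch in sep)
        report.setdefault(key, []).append(locale)
    return report


if __name__ == "__main__":
    for separator, locales in audit(SAMPLES).items():
        print(separator, locales)
```

Grouping locales by code point rather than by rendered glyph is the point of the exercise: the two no-break spaces look alike on screen but end up under different keys in the report.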

For now the vetting process has largely failed: it is too inefficient and takes far too much of the vetters' time, and the submission and vetting periods are much too short to have all changes submitted and vetted correctly. Vetting only works item by item and locale by locale, and works best only for obvious terminology or orthography. It does not work for getting consistent votes across groups of related items (item groups in CLDR are much too large, and we don't have any filter to get custom groups on which we can vote globally). We can never reach the expected consensus with enough vetters because the thresholds are too high and this requires far too much work from all of them. In addition, this takes a lot of server resources, and many vetters cannot vote because the CLDR vetting tool now takes far too much browser memory and other resources, and response time is now dramatically slow!

So all we can do is discuss these consistency issues in trackers like this one, until a CLDR Tech admin applies the changes under discussion.

The CLDR survey tool really needs a major cleanup of its client-side design: excessive use of JavaScript in browsers, event handlers that constantly run very long DOM reconstructions, and many quirks in how it handles input focus (a click anywhere frequently gets ignored, or is applied to another item because of the very long delays, creating havoc everywhere with submissions that vetters did not want to make at all). The tool is now almost unusable, even on the fastest PCs with 64-bit browsers, a fast 4-core CPU and a lot of RAM. So many vetters give up halfway.

comment:4 in reply to: ↑ 3 Changed 7 months ago by Marcel Schneider <charupdate@…>

Replying to verdy_p@…:

I totally agree. We’ll xref this there, and also open a new ticket about Charts display, where even if fixed in ST, whitespace disambiguation and certainly other confusables like curly close quote vs letter apostrophe need to be checkable at first sight to make lookup efficient. Already tooltips showing the code points would be helpful, even prior to sorting out how to visually represent invisibles and confusables.

comment:6 follow-up: ↓ 7 Changed 7 months ago by mark

  1. Tooling Mechanics

The way the tooling works, we have an input processor (aka DAIP) that we use to clean up the data. Where we are sure that a transform of data is correct we can automatically transform the data. That processor is also run over all data before a release.

The processor can be:

  • global to all locales
  • specific to given locales (or exclude some locales)
  • specific to given XML paths
  • etc.

  2. Policy

We need to be very clear when we add something to the processor that the choices we make are valid for the locales that are affected. Where we have any question about what would be best practice for a given locale, that requires querying vetters/linguists in that locale. If we get back a satisfactory answer, then we can add to the list of locales for that input processing.
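To make the scoping concrete, here is a minimal sketch in Python of a transform rule that can be global, restricted to given locales, or restricted to given XML paths. It is an illustration only: `Rule`, `process` and the rule list are hypothetical names and do not reflect the actual input processor's code or API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional

NBSP = "\u00a0"   # NO-BREAK SPACE
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


@dataclass(frozen=True)
class Rule:
    """One automatic transform; all names here are illustrative assumptions."""
    source: str
    target: str
    locales: Optional[FrozenSet[str]] = None  # None means: all locales
    path_suffix: str = ""                     # "" means: all XML paths

    def applies(self, locale: str, path: str) -> bool:
        locale_ok = self.locales is None or locale in self.locales
        return locale_ok and path.endswith(self.path_suffix)


def process(value: str, locale: str, path: str, rules: Iterable[Rule]) -> str:
    """Apply every rule whose scope matches the given locale and path."""
    for rule in rules:
        if rule.applies(locale, path):
            value = value.replace(rule.source, rule.target)
    return value


# Only locales confirmed by vetters are listed; others stay untouched.
RULES = [Rule(NBSP, NNBSP, locales=frozenset({"fr"}), path_suffix="/group")]

if __name__ == "__main__":
    group_path = '//ldml/numbers/symbols[@numberSystem="latn"]/group'
    print(repr(process(NBSP, "fr", group_path, RULES)))  # transformed
    print(repr(process(NBSP, "ru", group_path, RULES)))  # left as-is
```

Under this sketch, the policy described above amounts to growing the `locales` set one confirmed locale at a time, or setting it to `None` once the choice is known to be valid everywhere.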

comment:7 in reply to: ↑ 6 Changed 7 months ago by Marcel Schneider <charupdate@…>

Replying to mark:

That processor is also run over all data before a release.

The processor can be […] specific to given locales

We need to be very clear when we add something to the processor that the choices we make are valid for the locales

From these three points and the above, it follows that prior to release of v34, all locales using space as a group separator will be updated to use NNBSP.

That’s fine. Thanks.

To be very clear:

  • Even before U+202F was encoded, U+00A0 was the wrong choice. It should have been U+2007 as suggested in UAX#14 (“2007   FIGURE SPACE   This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.”)
  • If U+2008 had had its line break property set to GL like what was done for U+2007, then the group separator should have been set to U+2008, because the International System of Units prescribes a narrow space.

National Institute of Standards and Technology: NIST Special Publication 811, 2008 Edition: Guide for the Use of the International System of Units (SI):
“The digits of numerical values having more than four digits on either side of the decimal
marker are separated into groups of three using a thin, fixed space counting from both the
left and right of the decimal marker. For example, 15 739.012 53 is highly preferred to
15739.01253. Commas are not used to separate digits into groups of three. (See Sec. 10.5.3.)”
https://physics.nist.gov/cuu/pdf/sp811.pdf
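
The quoted rule can be sketched in a few lines of Python. This is an illustration of the NIST wording only, not CLDR behavior; the function name is made up, and using U+202F as the "thin, fixed space" is an assumption in line with this ticket.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE, assumed here as the thin fixed space


def group_digits(number: str, sep: str = NNBSP) -> str:
    """Group digits in threes on both sides of a '.' decimal marker.

    A side is grouped only if it has more than four digits; groups are
    counted outward from the decimal marker, as in the quoted passage.
    """
    integer, marker, fraction = number.partition(".")

    if len(integer) > 4:
        head = len(integer) % 3  # length of the short leading group, if any
        parts = [integer[:head]] if head else []
        parts += [integer[i:i + 3] for i in range(head, len(integer), 3)]
        integer = sep.join(parts)

    if len(fraction) > 4:
        fraction = sep.join(
            fraction[i:i + 3] for i in range(0, len(fraction), 3)
        )

    return integer + marker + fraction


if __name__ == "__main__":
    # An ordinary space is passed so the grouping is visible in any terminal.
    print(group_digits("15739.01253", sep=" "))
```

With the sample value from the quotation, the integer part splits as 15 and 739, and the fraction as 012 and 53.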

The following Canadian source is particularly useful as it additionally encourages overriding the actual setting in CLDR:
http://canada.justice.gc.ca/eng/rp-pr/csj-sjc/legis-redact/legistics/p1p34.html
“Such a triad separator should be a small space […].”

comment:8 follow-up: ↓ 9 Changed 7 months ago by mark

From these three points and the above, it follows that prior to release of v34, all locales using space as a group separator will be updated to use NNBSP.

That is a misunderstanding. "If we get back a satisfactory answer, then we can add to the list of locales for that input processing." That is, we'd have to either query native speakers for each locale or have authoritative sources, before making such a change. The recommendations of organizations such as NIST are valuable, but not determinant.

For example, they recommend using thin spaces after the decimal also. That is not customary usage (for English), even when an NNBSP is used as a grouping separator. That is, it may be done for technical documentation, but we are looking at more customary usage, such as what is in major newspaper/journal style guides.

comment:9 in reply to: ↑ 8 ; follow-up: ↓ 23 Changed 7 months ago by Marcel Schneider <charupdate@…>

Replying to mark:

"If we get back a satisfactory answer, then we can add to the list of locales for that input processing." That is, we'd have to either query native speakers for each locale or have authoritative sources, before making such a change.

At first sight, that seems to hold for everything but the use of a justifying space for grouping digits into triads in numbers.

The recommendations of organizations such as NIST are valuable, but not determinant.

For example, they recommend using thin spaces after the decimal also. That is not customary usage (for English), even when an NNBSP is used as a grouping separator.

Indeed we should not use grouping after the decimal, consistently with the way of spelling decimal fractions reading digits one by one, not “[…] dot five hundred three thousandths six hundred four millionths” which is not customary at all. I understand that not every recommendation can be relied upon.

It might indeed be a proof of good diplomacy to ask representatives of each locale, and thus that might be a point for survey 35, if communities feel better served this way.

That is, it may be done for technical documentation, but we are looking at more customary usage, such as what is in major newspaper/journal style guides.

I think that this may work with The Economist style guide, but not with CMOS whose recommendations WRT use of NBSP I’ve challenged as not “informing the editorial canon with sound, definitive advice” but encountered unresponsiveness.

Hence sorting out whose advice to follow is definitely up to Unicode.

comment:10 Changed 6 months ago by mark

  • Component changed from main to other

comment:11 Changed 5 months ago by mark

  • Milestone changed from UNSCH to to-assess

comment:12 Changed 5 months ago by mark

  • Owner changed from anybody to mark
  • Component changed from other to exemplars-etc

comment:13 follow-ups: ↓ 15 ↓ 19 Changed 5 months ago by mark

  • Phase changed from dsub to dvet
  • Milestone changed from to-assess to assessed

Needs discussion. Main issues are:

  1. Confirmation from native speakers for each affected locale that the narrow form is preferred*
  2. Looking at downstream testing impact.
  * I suspect that in many cases people are not aware of the option of a narrower space, since it is generally not on keyboards. I also suspect that if it is the preferred form for the cldr targeted locales, we could make the change globally.

The affected locales would be:
·af· ·agq· ·bas· ·be· ·bg· ·br· ·cs· ·cu· ·de_AT· ·dje· ·dua· ·dyo· ·en_FI· ·en_SE· ·en_ZA· ·eo· ·es_CR· ·et· ·ewo· ·ff· ·fi· ·fr_CA· ·hu· ·hy· ·ka· ·kab· ·kea· ·khq· ·kk· ·ksf· ·ksh· ·ky· ·lt· ·lv· ·mfe· ·nb· ·nmg· ·nn· ·os· ·pl· ·prg· ·pt_PT· ·ru· ·sah· ·se· ·ses· ·shi· ·shi_Latn· ·sk· ·smn· ·sq· ·sv· ·tg· ·tk· ·tt· ·twq· ·tzm· ·uk· ·uz· ·uz_Cyrl· ·xh· ·yav· ·zgh·

comment:14 follow-ups: ↓ 16 ↓ 27 Changed 3 months ago by mark

  • Owner changed from mark to kristi
  • Status changed from new to accepted
  • Milestone changed from assessed to discuss

Agreed to reassess after v35 to fully gauge impact; we also need more justification for the individual languages affected.

comment:15 in reply to: ↑ 13 Changed 3 months ago by marcel schneider <charupdate@…>

Replying to mark:
[…]

The affected locales would be:
·af· ·agq· ·bas· ·be· ·bg· ·br· ·cs· ·cu· ·de_AT· ·dje· ·dua· ·dyo· ·en_FI· ·en_SE· ·en_ZA· ·eo· ·es_CR· ·et· ·ewo· ·ff· ·fi· ·fr_CA· ·hu· ·hy· ·ka· ·kab· ·kea· ·khq· ·kk· ·ksf· ·ksh· ·ky· ·lt· ·lv· ·mfe· ·nb· ·nmg· ·nn· ·os· ·pl· ·prg· ·pt_PT· ·ru· ·sah· ·se· ·ses· ·shi· ·shi_Latn· ·sk· ·smn· ·sq· ·sv· ·tg· ·tk· ·tt· ·twq· ·tzm· ·uk· ·uz· ·uz_Cyrl· ·xh· ·yav· ·zgh·

·de· is affected too, since it does not use the period as grouping separator, except when writing amounts of money mechanically or by hand “for security reasons”, which is almost surely an outdated legacy practice and does not justify assuming the period as grouping separator in CLDR for ·de·, nor the apostrophe for de_CH (see the Wikipedia entry mentioned below, which notes the practice of the federal chancellery as opposed to the externally and even internally inconsistent practice in some states [cantons], which has always been discouraged by reliable authorities and therefore should have no place in CLDR).

Please read through the Wikipedia article in German Wikipedia about writing numbers:

https://de.wikipedia.org/wiki/Schreibweise_von_Zahlen

I suspect the issue is an educational one, especially on the vetters’ side, given that the ·de· locale has messed up its grouping separator value, which should have been <NNBSP> for a long time. Surely this locale is not the only one that has wrong data in CLDR and isn’t about to get it fixed. I’d suggest that all vetters participate in an online training set up by CLDR on how to assess locale data. Everybody, including me, believes he or she knows how to write their locale. But a close look reveals that not everybody (including me) has had the opportunity to get enough training, and/or to assimilate enough reading, and/or to look up the relevant article(s) in Wikipedia(s), since not every locale’s Wikipedia has a good article on everything.

But I still cannot help seeing Unicode as the main culprit in that it did not encode U+2008 PUNCTUATION SPACE as a non-breaking space, as it did for U+2007 FIGURE SPACE, which is non-breaking. The standards specifying the use of a non-breaking thin space turn out to be fairly numerous, and even a single one would have been sufficient for Unicode to encode a non-breakable thin space. That isn’t even about the thin space, given that the punctuation space is a sort of duplicate encoding of the thin space. Since figure dash, figure space and punctuation space were all encoded for typesetting tables, there was really no point in not making the punctuation space non-breakable while the figure space was made non-breakable. Given that the result of that flaw was that correct typesetting with correct non-breakable thin spaces was possible only in DTP applications, Hanlon’s razor does not apply.

Is there any evidence that the line break property of U+2008 was set to BA instead of GL inadvertently, as opposed to being set so by design? And if it was by design, to what end exactly?

This bug tracker is not the place to discuss UAXes, but just in this specific context there is a need to point out that the following statement in UAX #14 is unresponsive, and isn’t even applied in CLDR where the group separator was U+00A0 instead of the “preferred” U+2007:

2007 FIGURE SPACE
This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.

The fact that “[i]t has the same width as a digit” is precisely the reason why the group separator should not be that space but a thin space; see the security concern mentioned above.
Also this statement in UAX #14 is misusing the alphanumeric character identifier (aka name) in making us believe that this space was encoded for use in numbers. The reality was AFAIU that this space was encoded to fill up lines in tables in order to align vertical box drawing characters horizontally to generate vertical borders between table columns, in legacy hot-metal style typesetting figure tables (please correct me if I’m wrong). The punctuation space was encoded for that same purpose for use in table cells where the number did not have a decimal separator, while in other table cells in the same column it did.

Obviously, since nobody is doing table layout this way any longer, the relevant three characters U+2007, U+2008, U+2012 are nowadays hijacked for other purposes, or at least they should be fit for that, but in fact they are scarcely useful, mainly because U+2008 is breakable (and because Unicode did not specify the vertical alignment of dashes, and did not recommend U+2012 for interval notation).

As a result, we’ve lost 20 years in fixing the digital representation of the world’s languages, and are today committed to spending our time poking around in issues that should have been part of the settled basics for two decades.

comment:16 in reply to: ↑ 14 Changed 3 months ago by marcel schneider <charupdate@…>

Replying to mark:

Agreed to reassess after v35 to fully gauge impact; we also need more justification for the individual languages affected.

Gauging the impact is biased by the bad font called “Courier New”, which, in persistently failing to support <NNBSP> even in the latest distributions of Android, seems to position itself as an instrument of a supposed anti-NNBSP lobby that is supposed to include some top actors of the DTP industry. I cannot see any other reason emboldening vendors not to update that widely used typeface to a decent level of Unicode support.

That messy font (and perhaps some others, don’t know any though) is actually misleading software engineers in the United States (name of exemplar individual provided off-line on request) to discourage French keyboard layout developers (idem) from using <NNBSP> for proper punctuation spacing. I don’t understand why Courier New is set as default in Android browser and no means is provided to set default to another font, user reported. This UI is annihilating and destroying many efforts to digitally represent the French language on the internet.

As a side note: the poor legibility of Courier New wouldn’t even make the font worth updating. Instead it should be discarded for good. Alternative monospace fonts with decent to very good legibility are available on the marketplace. Users dislike Courier New and prefer Consolas. Microsoft’s VS Code text editor defaults to 'Droid Sans Mono'.

I kindly request that this ticket be forwarded to the relevant vendors prior to assessing the impact of the group separator migration. Then please wait until Unicode has sorted out what is malicious bug, unlawful lobbying, and destructive misuse of DTP software.

comment:17 Changed 3 months ago by marcel schneider <charupdate@…>

Quotation from https://www.unicode.org/mail-arch/unicode-ml/y2019-m01/0125.html

[…] since every Unicode implementation must rely on the character
properties, and given keeping this library up-to-date is straightforward
and easy, there is really no point in displaying a .notdef box in lieu
of whatever whitespace.

As a consequence, prior to assessing the impact of the group separator
migration from (wrong) <NBSP> to (correct) <NNBSP> on implementations
and interoperability, Unicode would be well advised to start assessing
the impact of implementations (and, of course, the backing vendors) on
correct rendering of <NNBSP>, and on the related usability and
interoperability of the digital representation of those many locales
that should rely on <NNBSP>.

comment:18 Changed 3 months ago by marcel schneider <charupdate@…>

Corrigendum

Font support for NARROW NO-BREAK SPACE

According to:
https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

Courier New does have <NNBSP>.

My apologies for previous posts. We must have been using old copies of the typeface.

Also about the somewhat peculiar history of <NNBSP> please refer to my reply on Unicode Public:
https://www.unicode.org/mail-arch/unicode-ml/y2019-m01/0146.html

comment:19 in reply to: ↑ 13 ; follow-up: ↓ 20 Changed 3 months ago by marcel schneider <charupdate@…>

Replying to mark:

Needs discussion. Main issues are:

  1. Confirmation from native speakers for each affected locale that the narrow form is preferred*
  2. Looking at downstream testing impact.
  * I suspect that in many cases people are not aware of the option of a narrower space, since it is generally not on keyboards. I also suspect that if it is the preferred form for the cldr targeted locales, we could make the change globally.

The affected locales would be:
·af· ·agq· ·bas· ·be· ·bg· ·br· ·cs· ·cu· ·de_AT· ·dje· ·dua· ·dyo· ·en_FI· ·en_SE· ·en_ZA· ·eo· ·es_CR· ·et· ·ewo· ·ff· ·fi· ·fr_CA· ·hu· ·hy· ·ka· ·kab· ·kea· ·khq· ·kk· ·ksf· ·ksh· ·ky· ·lt· ·lv· ·mfe· ·nb· ·nmg· ·nn· ·os· ·pl· ·prg· ·pt_PT· ·ru· ·sah· ·se· ·ses· ·shi· ·shi_Latn· ·sk· ·smn· ·sq· ·sv· ·tg· ·tk· ·tt· ·twq· ·tzm· ·uk· ·uz· ·uz_Cyrl· ·xh· ·yav· ·zgh·


Many of the scripts used are neither Mongolian nor Latin.

Also one needs to assess what languages NNBSP is actually used in. Statements like the following:

The only other widely noted use for U+202F NNBSP is for representation of the thin non-breaking space (espace fine insécable) regularly seen next to certain punctuation marks in French style typography. […]

http://www.unicode.org/review/pri308/pri308-background.html

…are actually untrue. One example of non-French Latin usage is in German "z.<NNBSP>B." according to:

https://www.mrunix.de/forums/showthread.php?50696-Gesperrte-Leerzeichen

After running Google translate on this page, we read:

“how can you create a locked space in Tex?
For example, between abbreviations such as "for example," that no line break occurs here. Or also with "§ 1".
[…]
a protected space with the tilde (~), but for spaces in abbreviations a space should be used, so a horizontal distance a little less than the normal interword space. This you generate with \,
[…]
Even worse is the separation between numbers and units! Or in dates.”

[Marcel who has posted on this forum is not me.]

I note an intense lobbying effort starting after release 34 of CLDR featuring the group separator migration from U+00A0 to U+202F, directed against the use of NNBSP. The completely useless addition of Script_Extensions={Latn Mong} seems to be part of that effort aiming at rolling back or at least at containing, halting, damming and curtailing the use of NNBSP, while that character is just the regular non-breaking thin space, that should have been encoded from the beginning on as PUNCTUATION SPACE with line-break class GL like FIGURE SPACE (that is too wide for most purposes).

comment:20 in reply to: ↑ 19 Changed 3 months ago by marcel schneider <charupdate@…>

Trying to be more specific than marcel schneider <charupdate@…>:
[…] Statements like the following:

The only other widely noted use for U+202F NNBSP is for representation of the thin non-breaking space (espace fine insécable) regularly seen next to certain punctuation marks in French style typography. […]

http://www.unicode.org/review/pri308/pri308-background.html

…are actually untrue. One example of non-French Latin usage is in German […]


The statement quoted above, found on the PRI #308 background information page, is a compound of two true parts and one wrong part:

  1. Another “use for U+202F NNBSP is for representation of the thin non-breaking space (espace fine insécable) regularly seen next to certain punctuation marks in French style typography.” TRUE
  2. The use of NNBSP in French is “widely noted.” TRUE
  3. The use of NNBSP in French “[is t]he only other” one. FALSE

comment:21 Changed 3 months ago by marcel schneider <charupdate@…>

comment:22 follow-up: ↓ 24 Changed 3 months ago by jameskasskrv@…

Mark wrote (message 13),

I also suspect that if it is the preferred form for the cldr targeted
locales, we could make the change globally.

Quoting from,
https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html
“... Likewise, while the U.K. and U.S. use a comma to separate groups of thousands, many other countries use a period instead, and some countries separate thousands groups with a thin space. ...”

No mention of a normal space. It’s hard to imagine any alphabetic writing system preferring a normal space. Moving the group separator from U+00A0 to U+202F seems like the right thing to do.

Some might prefer to await locale-by-locale confirmation. A desire to err on the side of caution is understandable, but I wonder if it applies here. Are there any reasonably official sources in any locales using spaces as group separators which say that normal spaces are preferred? If so, they would probably be in the minority. If not, ...

comment:23 in reply to: ↑ 9 Changed 3 months ago by marcel schneider <charupdate@…>

Updating comment:9:
[…]

I think that this may work with The Economist style guide, but not with CMOS whose recommendations WRT use of NBSP I’ve challenged as not “informing the editorial canon with sound, definitive advice” but encountered unresponsiveness.

My apologies. A relaunch one month later was readily answered (CMOS hadn’t realized that I was waiting for feedback). Various suggestions are being taken into account. (Posting further details is not appropriate in this bug report that is focused on CLDR.) I did not see this incoming e‑mail until now, although it was received several days before I posted comment:9.

comment:24 in reply to: ↑ 22 ; follow-up: ↓ 25 Changed 3 months ago by marcel schneider <charupdate@…>

Replying to jameskasskrv@…:

Mark wrote (message 13),

I also suspect that if it is the preferred form for the cldr targeted
locales, we could make the change globally.

Quoting from,
https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html
“... Likewise, while the U.K. and U.S. use a comma to separate groups of thousands, many other countries use a period instead, and some countries separate thousands groups with a thin space. ...”

No mention of a normal space. It’s hard to imagine any alphabetic writing system preferring a normal space. Moving the group separator from U+00A0 to U+202F seems like the right thing to do.


Indeed it is, and the need to do so is simple to assess since the group separator must never have the width of a digit or more. The only non-breakable space left is NNBSP.
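
The elimination can be written down as a small filter, sketched here in Python. The line break classes are those assigned in UAX #14; the width column is only a qualitative summary of how the spaces are characterized in this ticket (an assumption, since actual widths depend on the font), and the table and function names are made up.

```python
import unicodedata

# (character, UAX #14 line break class, width as characterized in this ticket)
CANDIDATES = [
    ("\u0020", "SP", "justifying"),  # SPACE
    ("\u00a0", "GL", "justifying"),  # NO-BREAK SPACE
    ("\u2007", "GL", "digit"),       # FIGURE SPACE
    ("\u2008", "BA", "narrow"),      # PUNCTUATION SPACE
    ("\u2009", "BA", "narrow"),      # THIN SPACE
    ("\u202f", "GL", "narrow"),      # NARROW NO-BREAK SPACE
]


def suitable_group_separators(candidates=CANDIDATES) -> list:
    """Keep spaces that are non-breaking (GL) and narrower than a digit."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch, line_break, width in candidates
        if line_break == "GL" and width == "narrow"
    ]


if __name__ == "__main__":
    for code_point, name in suitable_group_separators():
        print(code_point, name)
```

U+2008 and U+2009 drop out on the line break class, U+00A0 and U+2007 on the width, which leaves a single entry.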

In function tables as in Meyers Rechenduden from 1960, with values computed on an IBM 650, the group separators have either the width of a punctuation (decimal comma) or less than the width of a punctuation, but never the width of a digit.

In running text typeset with proportional advance widths, the interword space may be significantly narrower than a digit. When numbers are grouped using FIGURE SPACE as recommended in UAX #14, the grouping spaces are wider than the surrounding spaces, which is of course totally inappropriate. Documents like the Unicode Line Breaking Algorithm specification (UAX #14), which state that the preferred space for use in numbers is FIGURE SPACE, were either written without checking how this works in practice, or constructed in an attempt to hide the fact that the Unicode Standard lacked a narrow fixed-width non-breaking space, as required in normal typography with most if not all scripts.

As for NO-BREAK SPACE, it may be significantly wider than a digit, depending on justification. Therefore, NBSP is inappropriate as a group separator. It may possibly be used as a fallback, provided that justification is turned off. One example of such a low-end layout is the Wikimedia layout engine, which is deeply biased in that it does not prompt Basic Latin script users to update their fonts and settings, as is done for Extended Latin and many if not all other scripts. It’s a form of hypocrisy to be so lenient about lack of support for the standard non-breaking thin space, although that space is the only true alternative to the justifying NO-BREAK SPACE and the digit-wide FIGURE SPACE. There is really no point in sticking with Latin‑1 and the ISO/IEC 8859 series (a policy that only makes governments, economies and individuals more vulnerable to cybercrime by keeping old systems in use).

comment:25 in reply to: ↑ 24 ; follow-up: ↓ 26 Changed 3 months ago by marcel schneider <charupdate@…>

Note to comment:24 :

the group separator must never have the width of a digit or more.

Except in monospace rendering. But referring to that is pointless in this discussion, except when considering the edge case where justification is turned on while the font is monospaced. Sometimes justification is emulated in a fixed-width context by adding extra spaces to the last interword spaces of a line, or better, by interspersing additional spaces among some interword spaces across the line. Then the algorithm needs to spare the NBSP used in numbers. However, that is not something that should be considered in a focussed discussion of the topic.
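The fixed-width emulation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the behaviour of any particular editor: the function name `justify_monospace` is hypothetical, every character is assumed to occupy exactly one column, and the extra spaces are distributed as evenly as possible with any remainder going to the leftmost gaps.

```python
def justify_monospace(line: str, width: int) -> str:
    """Pad `line` to `width` columns by widening only U+0020 gaps."""
    # Splitting on U+0020 alone keeps "1\u00a0234" together as one chunk,
    # so the NBSP inside a number is never stretched.
    chunks = line.split(" ")
    gaps = len(chunks) - 1
    extra = width - len(line)
    if gaps == 0 or extra <= 0:
        return line  # nothing to stretch, or the line is already wide enough
    base, remainder = divmod(extra, gaps)
    out = []
    for i, chunk in enumerate(chunks[:-1]):
        # Each gap keeps its original space, gains `base` more, and the
        # first `remainder` gaps gain one additional space.
        out.append(chunk + " " * (1 + base + (1 if i < remainder else 0)))
    out.append(chunks[-1])
    return "".join(out)
```

With this approach the grouping space keeps the width of one character while the ordinary interword spaces absorb all the padding.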

comment:26 in reply to: ↑ 25 Changed 3 months ago by marcel schneider <charupdate@…>

Edit to comment:25:

Note to comment:24 :

the group separator must never have the width of a digit or more.

Except in monospace rendering.

Where it will have the width of a character, but must never be wider. It does become wider when NBSP is used and real justification is turned on. Word processors other than Word 2013 are an exception: they typically tailor the NBSP to make it fixed-width, creating an illusion of “acceptable” layout that is destroyed once such a text is posted on the internet in HTML with text-align:justify.

Clearly NBSP as a group separator is not what could be called a robust solution. It is less than an acceptable fallback. (It is actually a blunder.)

comment:27 in reply to: ↑ 14 Changed 3 months ago by marcel schneider <charupdate@…>

Replying to mark:

Agreed to reassess after v35 to fully gauge impact;

Crossing with Unicode Public:

Parsing the missing-font-support argument

I didn’t think of it as a real concern, until I was kindly informed offlist that a well-known member of a small European NB was significantly less than eager to adopt the Unicode Standard. From the fact that such people could stay in office I infer that this surprising and technically inexplicable behavior was widespread. After cross-checking against two written examples from another country, I now understand that missing font support is merely a symptom of a deeply embedded mindset that diverts people from upgrading to full Unicode support.

Obviously those people don’t deserve to get their language digitally represented in an accurate and interoperable way.

It’s up to the locales to curb the influence of those people, and to spread Unicode education among the innocent youth and other benevolent end-users. One of the very first steps is to provide appropriate keyboard layouts.

It would be nice, though, if end-users saw their language displayed in user interfaces in a standard way, not downgraded to ugly fallbacks.

I’m now likely to step out of this discussion, as I’m unable to contribute much more apart from submitting the requested proposal. If things are not fixed and I have to write about it in end-user documentation, I hope I may link to this ticket.
