CLDR Ticket #11423 (new)

Opened 2 months ago

Last modified 4 days ago

Group separator migration from U+00A0 to U+202F

Reported by: charupdate@…
Owned by: anybody
Component: other
Data Locale:
Phase: dsub
Review:
Weeks:
Data Xpath:
Xref:

Description

To be cost-effective, locales using a space as the number group separator should migrate at once from the wrong U+00A0 to the correct U+202F. I did not aim at making French stand out, but at correcting an error in CLDR. Having even the Canadian French sublocale stick with the wrong value makes no sense, and is mainly due to opaque inheritance relationships and to severe constraints on vetters, who apply for fr-FR and are then reduced to looking on helplessly from the sidelines while sublocales go unfixed.

Change History

comment:1 Changed 2 months ago by charupdate@…

After having painstakingly caught up on support for the narrow fixed-width no-break space (U+202F), the industry is now ready to migrate from U+00A0 to U+202F. Doing it in a single pass is far more cost-effective than migrating one locale this time, another locale next time, and a handful of locales the time after, possibly splitting them up into sublocales with different migration schedules. I really believed that, now that Unicode proves ready to adopt the real group separator in French, all relevant locales would be consistently pushed to correct that value in release 34. The v34 alpha overview makes clear that they are not.

http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration

I aimed at correcting an error in CLDR, not at making French stand out. Having many locales and sublocales stick with the wrong value no longer makes any sense.

https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d

The only effect is that implementers skip the migration for fr-FR while waiting for the others to catch up, then do it for all locales at once.

There seems to be a misunderstanding: the locale setting is whether to use a period, comma, space, apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic. Whether "space" is NO-BREAK SPACE or NARROW NO-BREAK SPACE is not a locale setting; it is entirely a matter of Unicode design and Unicode implementation. I really thought that was clear and that there was no need to insist heavily on the ST "French" forum. When referring to the "French thousands separator" I only meant that, unlike comma- or period-using locales, the French locale uses a space, and that this group separator space should be the correct one. It did not mean that French should use a different space than the other locales using space.
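
This distinction can be checked in practice. Here is a minimal JavaScript sketch that inspects which code point a runtime actually uses as the fr-FR group separator; the result depends on the CLDR version the runtime ships with, so treat the output as illustrative:

    // Ask the formatter which character it uses to group digits in fr-FR.
    const parts = new Intl.NumberFormat('fr-FR').formatToParts(1234567);
    const group = parts.find(p => p.type === 'group');
    // Prints "202f" on runtimes built on CLDR 34+, "a0" on older data.
    console.log(group.value.codePointAt(0).toString(16));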

comment:2 Changed 2 months ago by Marcel Schneider <charupdate@…>

I have to confess that I focused on French and only applied for fr-FR, but there was a lot of work (see http://cldr.unicode.org/index/downloads/cldr-34#TOC-Growth) waiting for very few vetters. Nevertheless I also took care of English (see various tickets), and I also posted on CLDR-users, in a belated P.S., that fr-CA had not yet caught up with the group separator correction: https://unicode.org/pipermail/cldr-users/2018-August/000825.html

Also, I am sorry for failing to provide appropriate feedback after the beta release, and for failing to post upstream messages urging that all locales using space for the group separator be kept in sync.

I think the point about not splitting up all the data into locales is a very good one.

There should be a common pool, so that all locales using the Arabic script automatically get the group separator set to ARABIC THOUSANDS SEPARATOR (provided it actually fits all of them), and locales using space should only need to specify "space" to automatically get the correct one, i.e. NARROW NO-BREAK SPACE, as soon as Unicode is ready to give it currency in that role.
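
As a purely hypothetical sketch of that "common pool" idea (none of these names exist in CLDR; the point is that locale data would name the separator symbolically and one shared table would resolve it to a code point):

    // Hypothetical shared pool: symbolic separator names resolved in one place.
    const SEPARATORS = {
      period: '.',
      comma: ',',
      arabic: '\u066C',  // ARABIC THOUSANDS SEPARATOR
      space: '\u202F',   // NARROW NO-BREAK SPACE once adopted; '\u00A0' before
    };

    // A locale would only declare e.g. "space"; the pool supplies the code point.
    function groupSeparator(symbolicName) {
      return SEPARATORS[symbolicName] ?? symbolicName;
    }

Updating the single pool entry would then migrate every space-using locale at once, which is exactly the consistency argued for above.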

Also, there is a display issue in the charts, where whitespace characters show up as what they are: blanks, regardless of whether they are wide or narrow, justifying or fixed-width. Non-breaking behavior may be inferred from context, but we can see that other correct behavior cannot be: numbers were supposed to be grouped using a justifying space, so grouping only works halfway, namely where justification is turned off (e.g. on Wikipedia).

comment:3 follow-up: ↓ 4 Changed 2 months ago by verdy_p@…

Note that there is another open ticket about making the CLDR survey tool display the actual code point used for whitespace, just as it already does, but only for some controls.

While the submission phase was open, I discussed how to get at the actual code points by using the browser's developer console: select what is displayed on the page, then enter a JavaScript expression in the console to convert the UTF-16 code units to hexadecimal (not necessarily the code points in general, but all the whitespace code points in question here are in the BMP, so each is encoded in UTF-16 as a single code unit).
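
A minimal version of that console expression might look like this (assuming the separator has been selected in the page first; for the BMP whitespace discussed here, code points and UTF-16 code units coincide):

    // Run in the browser console after selecting the text to inspect.
    const s = window.getSelection().toString();
    // Print each character of the selection as a 4-digit hex code point.
    console.log(Array.from(s, c =>
      c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')).join(' '));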

Due to this lack of differentiation, it is simply too difficult and too time-consuming to comment on each entry and then vote for changes consistently. So the locale data after the submission and vetting phases is completely incoherent: even though the change was largely agreed on in a few locales, many locales were left behind (and this causes lots of inconsistencies in many other locales, due to fallback mechanisms).

So we now need to check these whitespace differences everywhere in the CLDR data: we will need a bot action to restore at least data consistency, and to gather statistics on the agreed changes per locale, in order to see which locales should have their whitespace changed consistently.
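
As a rough illustration of what such a bot pass could scan for, here is a minimal Node.js sketch (the common/main path assumes a local checkout of the CLDR data, and the regex is a simplification of proper XML parsing):

    // List locales whose <group> separator is still U+00A0 (NO-BREAK SPACE).
    const fs = require('fs');
    const path = require('path');

    const MAIN = 'common/main';  // assumed path to a local CLDR checkout
    for (const file of fs.readdirSync(MAIN)) {
      if (!file.endsWith('.xml')) continue;
      const xml = fs.readFileSync(path.join(MAIN, file), 'utf8');
      const m = xml.match(/<group>([^<]*)<\/group>/);
      if (m && m[1] === '\u00A0') {
        console.log(path.basename(file, '.xml') + ': group separator is U+00A0');
      }
    }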

For now the vetting process has largely failed: it is too inefficient, it takes far too much of the vetters' working time, and the submission and vetting periods are much too short to get all changes submitted and vetted correctly. Vetting only works item by item and locale by locale, and works best only for evident terminology or orthography. It does not work for casting consistent votes across groups of related items (and item groups in CLDR are much too large; we have no filter to build custom groups on which we could vote globally): we can never reach the expected consensus with enough vetters, because the thresholds are too high and this requires far too much work from all of them. In addition, this takes a lot of server resources, and many vetters cannot vote because the CLDR vetting tool now consumes far too much browser memory and its response time has become dramatically slow!

So all we can do is discuss these consistency issues in trackers like this one, until a CLDR tech admin applies the changes under discussion.

The CLDR survey tool really needs a major cleanup of its client-side design: really excessive use of JavaScript in the browser, event handlers that constantly run very long DOM reconstructions, and many quirks in how it handles input focus (a click anywhere frequently gets ignored, or gets applied to another item because of the very long delays, creating havoc everywhere with submissions that vetters did not even want to make). The tool is now almost unusable, even on the fastest PCs with 64-bit browsers, a fast 4-core CPU, and lots of RAM. So many vetters abandon the work halfway.

comment:4 in reply to: ↑ 3 Changed 2 months ago by Marcel Schneider <charupdate@…>

Replying to verdy_p@…:

Note that there is another open ticket about making the CLDR survey tool display the actual code point used for whitespace […] So many vetters abandon the work halfway.

I totally agree. We'll xref this there, and also open a new ticket about chart display: even once fixed in ST, whitespace disambiguation (and certainly other confusables, like the curly close quote vs. the letter apostrophe) needs to be checkable at first sight to make lookup efficient. Even tooltips showing the code points would already be helpful, prior to sorting out how to visually represent invisibles and confusables.

comment:6 follow-up: ↓ 7 Changed 2 months ago by mark

  1. Tooling Mechanics

The way the tooling works, we have an input processor (aka DAIP) that we use to clean up the data. Where we are sure that a transform of the data is correct, we can apply it automatically. That processor is also run over all data before a release.

The processor can be:

  • global to all locales
  • specific to given locales (or exclude some locales)
  • specific to given XML paths
  • etc.
  2. Policy


We need to be very clear when we add something to the processor that the choices we make are valid for the locales that are affected. Where we have any question about what would be best practice for a given locale, that requires querying vetters/linguists in that locale. If we get back a satisfactory answer, then we can add to the list of locales for that input processing.
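
To make that scoping concrete, here is a hypothetical sketch of such a scoped clean-up rule; this is not the actual DAIP code or API, only an illustration of the mechanics described above:

    // Hypothetical scoped rule: replace U+00A0 with U+202F in the
    // group-separator path, for explicitly vetted locales only.
    const rule = {
      locales: ['fr'],                         // extend only after vetters confirm
      pathMatch: /numbers\/symbols.*\/group/,  // XML path filter
      transform: value => value.replace(/\u00A0/g, '\u202F'),
    };

    function process(locale, xpath, value) {
      if (rule.locales.includes(locale) && rule.pathMatch.test(xpath)) {
        return rule.transform(value);
      }
      return value;  // everything out of scope passes through unchanged
    }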

comment:7 in reply to: ↑ 6 Changed 2 months ago by Marcel Schneider <charupdate@…>

Replying to mark:

That processor is also run over all data before a release.

The processor can be […] specific to given locales

We need to be very clear when we add something to the processor that the choices we make are valid for the locales

From these three points and the above, it follows that prior to the release of v34, all locales using space as a group separator will be updated to use NNBSP.

That’s fine. Thanks.

To be very clear:

  • Even before U+202F was encoded, U+00A0 was the wrong choice. It should have been U+2007, as suggested in UAX #14 (“2007   FIGURE SPACE   This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.”)
  • If U+2008 had had its line break property set to GL, as was done for U+2007, then the group separator should have been set to U+2008, because the International System of Units prescribes a narrow space.

National Institute of Standards and Technology: NIST Special Publication 811, 2008 Edition: Guide for the Use of the International System of Units (SI):
“The digits of numerical values having more than four digits on either side of the decimal
marker are separated into groups of three using a thin, fixed space counting from both the
left and right of the decimal marker. For example, 15 739.012 53 is highly preferred to
15739.01253. Commas are not used to separate digits into groups of three. (See Sec. 10.5.3.)”
https://physics.nist.gov/cuu/pdf/sp811.pdf
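
For quick reference, here are the candidate "space" characters discussed in this thread, with their line break classes from UAX #14 (printable from any JavaScript console):

    // The four spaces in question and their UAX #14 line break classes.
    console.table([
      { code: 'U+00A0', name: 'NO-BREAK SPACE',        lineBreak: 'GL' },
      { code: 'U+2007', name: 'FIGURE SPACE',          lineBreak: 'GL' },
      { code: 'U+2008', name: 'PUNCTUATION SPACE',     lineBreak: 'BA' },
      { code: 'U+202F', name: 'NARROW NO-BREAK SPACE', lineBreak: 'GL' },
    ]);

Only the GL (glue) characters keep the number together across line breaks; U+2008 would have had the narrow width but, with class BA (break after), not the required non-breaking behavior.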

The following Canadian source is particularly useful, as it additionally encourages overriding the current setting in CLDR:
http://canada.justice.gc.ca/eng/rp-pr/csj-sjc/legis-redact/legistics/p1p34.html
“Such a triad separator should be a small space […].”

comment:8 follow-up: ↓ 9 Changed 2 months ago by mark

From these three points and the above, it follows that prior to the release of v34, all locales using space as a group separator will be updated to use NNBSP.

That is a misunderstanding. "If we get back a satisfactory answer, then we can add to the list of locales for that input processing." That is, we'd have to either query native speakers for each locale or have authoritative sources, before making such a change. The recommendations of organizations such as NIST are valuable, but not determinative.

For example, they recommend using thin spaces after the decimal as well. That is not customary usage (for English), even when an NNBSP is used as a grouping separator. That is, it may be done for technical documentation, but we are looking at more customary usage, such as what is in major newspaper/journal style guides.

comment:9 in reply to: ↑ 8 Changed 2 months ago by Marcel Schneider <charupdate@…>

Replying to mark:

"If we get back a satisfactory answer, then we can add to the list of locales for that input processing." That is, we'd have to either query native speakers for each locale or have authoritative sources, before making such a change.

At first sight, that seems to hold for everything but the use of a justifying space for grouping digits into triads in numbers.

The recommendations of organizations such as NIST are valuable, but not determinative.

For example, they recommend using thin spaces after the decimal as well. That is not customary usage (for English), even when an NNBSP is used as a grouping separator.

Indeed, we should not use grouping after the decimal, consistent with the way decimal fractions are spelled out, reading the digits one by one, not “[…] dot five hundred three thousandths six hundred four millionths”, which is not customary at all. I understand that not every recommendation can be relied upon.

It might indeed be good diplomacy to ask representatives of each locale, and thus that might be a point for the v35 survey, if communities feel better served this way.

That is, it may be done for technical documentation, but we are looking at more customary usage, such as what is in major newspaper/journal style guides.

I think this may work with The Economist style guide, but not with CMOS, whose recommendations with respect to the use of NBSP I have challenged as not “informing the editorial canon with sound, definitive advice”, but I encountered unresponsiveness.

Hence sorting out whose advice to follow is definitely up to Unicode.

comment:10 Changed 3 weeks ago by mark

  • Component changed from main to other

comment:11 Changed 4 days ago by mark

  • Milestone changed from UNSCH to to-assess