
CLDR Ticket #11046 (closed: fixed)

Opened 8 months ago

Last modified 8 weeks ago

Investigate data needed for number range formatting

Reported by: shane
Owned by: mark
Component: numbers
Data Locale:
Phase: dsub
Review: emmons
Weeks:
Data Xpath:
Xref:

Description (last modified by shane) (diff)

CLDR already has one field for range formatting data:

https://www.unicode.org/cldr/charts/latest/by_type/numbers.number_formatting_patterns.html#Miscellaneous_Patterns_

However, that data point does not answer questions such as the following:

  1. When formatting currency ranges, do you show the currency sign once ($3-5) or twice ($3-$5)?
  2. How do ranges work for negative numbers, like -3 to -5?
  3. How about data for word ranges, like "3 to 5" instead of "3-5"?
  4. How are scientific notation, compact notation, and measure units affected? (1E2-1E2 or 1-2E2? 1K-3K or 1-3K? 1 foot-3 feet, 1 foot - 3 feet, or 1-3 feet?)
  5. What happens when the range bounds become the same after rounding? This could happen if you have integer rounding and ask for the range 4.9-5.1, for example. ICU should not give an API that prints "$5-5" as the default behavior; we will need some sensible fallback, perhaps an "approximate" pattern like "~$5".

These questions should be answered in order to build a robust number range formatter in ICU.

GoogleIssue:16511905

Change History

comment:1 Changed 8 months ago by shane

  • Description modified (diff)

comment:2 Changed 7 months ago by shane

  • Description modified (diff)

comment:3 Changed 7 months ago by mark

Those are good questions: I'll add some comments. However, what we should also do to get a sense of the variation is see what various style guides have, both for publications (The Economist) and academic (CMoS).

  1. When formatting currency ranges, do you show the currency sign once ($3-5) or twice ($3-$5)?
    • I suspect a good default for currencies and units is to show the currency (or unit) only once. We could address that either by adding boolean flags duplicateUnitInRange and duplicateCurrencyInRange, or by adding patterns unitRange and currencyRange, like "{2}{0}{1}". However, I suspect that the range pattern for English with currencies or units might need spaces around the en dash, e.g. 1 – 5kg rather than 1–5kg.
  2. How do ranges work for negative numbers, like -3 to -5?
    • I don't think we care too much what these are formatted as, because such a range will be very infrequent. (I don't have a particularly good intuition as to what they should be formatted like in English: "-50 – -40kg"? "-50..-40kg"?) It might be better to fall back to including the unit/currency. We could add an additional pattern, but I think it is a lower priority.
  3. How about data for word ranges, like "3 to 5" instead of "3-5"?
    • Those get really ugly, since they often require inflections. Probably best to avoid.
  4. How are scientific notation, compact notation, and measure units affected? (1E2-1E2 or 1-2E2? 1K-3K or 1-3K? 1 foot-3 feet, 1 foot - 3 feet, or 1-3 feet?)
    • I don't think we care about "programmer scientific notation". I suspect real scientific notation would be ok to just use the range. Compact numbers are a problem, since 1–5K could mean 1–5000 or 1000–5000. Units I think should work like currencies.
  5. What happens when the range bounds become the same after rounding? This could happen if you have integer rounding and ask for the range 4.9-5.1, for example. ICU should not give an API that prints "$5-5" as the default behavior; we will need some sensible fallback, perhaps an "approximate" pattern like "~$5".
    • Very interesting case. I think your suggestion of an "approximate" pattern might be the best answer.

comment:4 Changed 7 months ago by shane

On each point:

  1. Are we sure all locales are going to prefer one currency sign instead of two? Is there no variation between locales? I would be more comfortable hearing from some l10n experts.
  2. Agreed that this is lower priority for now.
  3. I think we should eventually get this data added to CLDR. So many other items have both the "narrow" form and the "long" form. But it might not be required for the first version of this feature.
  4. Like with 1, I would like to see feedback from l10n experts.
  5. Cool; how do we go about acquiring the data for the approximate pattern? Can we get it with the summer survey tool cycle?

comment:5 Changed 7 months ago by shane

  • Cc mark added

Adding mark as CC; see my replies above.

comment:6 Changed 7 months ago by mark

We do have enough time, if we get agreement soon enough.

For #4 we need to survey linguists/vetters to get a sense of what should be done, and it might result in adding a separate pattern.

For #5 we need to establish what the English would be. From https://www.safaribooksonline.com/library/view/the-economist-style/9781846686061/xhtml/h3_id_28.html, it appears that we have 3 choices:

≈ or ≅ — approximately equal to,

~ — of the order of or similar to

comment:8 Changed 7 months ago by shane

For #1 and #4, we could give users an option to either force the duplicate affix in all locales, omit the duplicate affix in all locales, or use some locale-dependent default value, similar to what we ended up doing with grouping digits. Maybe this question is peculiar enough that there isn't already an established standard, and we can just pick something that makes sense, and if someone raises the issue later, we can start customizing locales.
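The three-way option described above could be sketched roughly as follows. This is only an illustration: the names `AffixRepeat`, `LOCALE_REPEATS_AFFIX`, and `format_currency_range` are hypothetical, the per-locale table is placeholder data rather than anything vetted, and repeating the symbol for unknown locales is an assumption of the sketch.

```python
from enum import Enum


class AffixRepeat(Enum):
    """Hypothetical option: should the affix appear on both range bounds?"""
    ALWAYS = "always"          # force "$3-$5" style in every locale
    NEVER = "never"            # force "$3-5" style in every locale
    LOCALE_DEFAULT = "locale"  # defer to per-locale data


# Placeholder per-locale defaults (assumption); real values would have to
# come from CLDR vetters, which is exactly what this ticket asks about.
LOCALE_REPEATS_AFFIX = {"en": False}


def format_currency_range(low, high, symbol="$", locale="en",
                          repeat=AffixRepeat.LOCALE_DEFAULT):
    """Format a prefix-currency range, repeating the symbol if requested."""
    if repeat is AffixRepeat.LOCALE_DEFAULT:
        # Assumption: locales without data repeat the symbol.
        repeat_affix = LOCALE_REPEATS_AFFIX.get(locale, True)
    else:
        repeat_affix = repeat is AffixRepeat.ALWAYS
    upper = f"{symbol}{high}" if repeat_affix else f"{high}"
    return f"{symbol}{low}\u2013{upper}"  # en dash between the bounds


print(format_currency_range(3, 5))
print(format_currency_range(3, 5, repeat=AffixRepeat.ALWAYS))
```

Starting with a single global default and adding entries to the locale table only when someone raises an issue would match the "customize later" approach suggested in this comment.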

For #5 (approximate pattern), in my opinion, the English pattern should be ~{0}. I think the other symbols, ≈ and ≅, are binary operators, like x ≈ y, whereas ~ is a unary operator.

What are the next steps on these two items?

comment:9 Changed 7 months ago by markus

  • Cc markus added

> What happens when the range bounds become the same after rounding? This could happen if you have integer rounding and ask for the range 4.9-5.1, for example. ICU should not give an API that prints "$5-5" as the default behavior; we will need some sensible fallback, perhaps an "approximate" pattern like "~$5".

I disagree. I would expect to see just "$5". We don't otherwise indicate that rounding happened, and if a range devolves to a single point, then that's what I want to see.

comment:10 Changed 7 months ago by shane

I think the API definitely needs to have a toggle switch so that the user can choose the behavior they want. I see a few viable behaviors that we could let users choose between:

  1. Use an approximation pattern, like ~$5
  2. Fall back to the non-range format, like $5
  3. A programmatic failure, like an exception or null return value

We can do (2) and (3) already right now. So I think the question is whether we think (1) is important enough to gather data. Personally, I think the approximation pattern definitely is worthy of data acquisition. We can discuss what we think should be the default behavior when this gets to ICU land, but I would like to see an approximation pattern either way, because I think it has uses outside of just range formatting.
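To make the three behaviors concrete, here is a minimal sketch of such a toggle. The names `IdentityFallback` and `format_range` are hypothetical and do not refer to any existing ICU API; integer rounding via Python's built-in `round` and the hard-coded "~" pattern are assumptions of the sketch.

```python
from enum import Enum


class IdentityFallback(Enum):
    """Hypothetical toggle for ranges whose bounds round to the same value."""
    APPROXIMATELY = 1  # use an approximation pattern, like ~$5
    SINGLE_VALUE = 2   # fall back to the non-range format, like $5
    ERROR = 3          # programmatic failure


def format_range(low, high, fallback=IdentityFallback.APPROXIMATELY, symbol="$"):
    """Format a currency range after integer rounding."""
    lo, hi = round(low), round(high)
    if lo != hi:
        return f"{symbol}{lo}\u2013{hi}"  # ordinary range with an en dash
    if fallback is IdentityFallback.APPROXIMATELY:
        return f"~{symbol}{lo}"  # assumed English approximation pattern
    if fallback is IdentityFallback.SINGLE_VALUE:
        return f"{symbol}{lo}"
    raise ValueError(f"range {low}..{high} collapses to {lo} after rounding")


print(format_range(4.9, 5.1))
print(format_range(4.9, 5.1, fallback=IdentityFallback.SINGLE_VALUE))
```

Only the first branch needs new locale data; the other two can be built from existing single-number formatting.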

comment:11 Changed 7 months ago by shane

On issues #1 and #4, if we are going to Survey Tool now I would actually like to see the preferences enumerated.

I want to see the following strings in all locales:

  • ~5.00
  • ~$5.00

In addition, I want to see each locale choose a preference among the following choices:

  • Integer range
    • $3-5 or $3-$5?
  • Fraction range
    • $3.00-5.00 or $3.00-$5.00?
  • Approximate value (range with same lower and upper bounds)
    • $5 or ~$5?

All of these can be options in the API once this gets to ICU land, but we should have defaults for each locale.

comment:12 Changed 7 months ago by mark

  • Keywords google added

Add google keyword: should also add GoogleIssue:XXX if available

comment:13 Changed 7 months ago by shane

  • Description modified (diff)

GoogleIssue added.

comment:14 Changed 6 months ago by shane

I asked the Google Geo team to clarify which range formats they were looking for, and the following outputs were suggested by robspoel:

• (5, 10) -> "5 – 10 USD" or "5 USD – 10 USD"
• (5, 5) -> "5 USD" or "~5 USD"
• (5, _) -> "5 USD" or "From 5 USD"
• (_, 5) -> "5 USD" or "Up to 5 USD"

We already have the pattern for the first case, so we just need the following three additional ones:

  • Approximate pattern: ~{0}
  • Lower-bound pattern: From {0}
  • Upper-bound pattern: Up to {0}
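As a rough illustration of how the four cases above might map onto patterns, the sketch below picks a pattern based on which bounds are present. The function name, dictionary keys, and the use of `None` for a missing bound are assumptions made for the example; the pattern strings are the English examples from this comment, with the unit repeated on both bounds of a full range.

```python
# Example English patterns taken from this comment; the keys are hypothetical.
PATTERNS = {
    "range": "{0} \u2013 {1}",
    "approximate": "~{0}",
    "lower_bound": "From {0}",
    "upper_bound": "Up to {0}",
}


def format_bounds(low, high, unit="USD", patterns=PATTERNS):
    """Pick a pattern depending on which bounds are given (None = missing)."""
    def amount(value):
        return f"{value} {unit}"

    if low is None and high is None:
        raise ValueError("at least one bound is required")
    if low is None:
        return patterns["upper_bound"].format(amount(high))
    if high is None:
        return patterns["lower_bound"].format(amount(low))
    if low == high:
        return patterns["approximate"].format(amount(low))
    return patterns["range"].format(amount(low), amount(high))


for bounds in [(5, 10), (5, 5), (5, None), (None, 5)]:
    print(bounds, "->", format_bounds(*bounds))
```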

comment:15 Changed 6 months ago by shane

The information in comment:11 would also be useful to have, but I think is easier to add on later.

comment:16 Changed 6 months ago by mark

We already have the lower-bound pattern.

<miscPatterns numberSystem="latn">

<pattern type="atLeast">⩾{0}</pattern>
<pattern type="range">{0}{1}</pattern>

</miscPatterns>

So I suggest

<pattern type="atMost">≤{0}</pattern>
<pattern type="approximately">~{0}</pattern>

Now, I don't know if we need the "word version". It is more complicated, since we would at least need plural support, so my inclination would be to hold off for this release.
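For illustration, a consumer of this data could load the existing and suggested patterns into a lookup table along these lines. This is a sketch, not how ICU reads CLDR data: the XML fragment combines the two snippets from this comment, the en dash in the range pattern is an assumption of the sketch, and Python's `str.format` is used only because simple `{0}`/`{1}` placeholders happen to be compatible with it.

```python
import xml.etree.ElementTree as ET

# Existing and suggested patterns combined into one fragment.
# Assumption: the range pattern separates its bounds with an en dash.
MISC_PATTERNS_XML = """
<miscPatterns numberSystem="latn">
  <pattern type="atLeast">\u2A7E{0}</pattern>
  <pattern type="range">{0}\u2013{1}</pattern>
  <pattern type="atMost">\u2264{0}</pattern>
  <pattern type="approximately">~{0}</pattern>
</miscPatterns>
"""


def load_misc_patterns(xml_text):
    """Return a dict mapping each pattern's type attribute to its text."""
    root = ET.fromstring(xml_text)
    return {p.get("type"): p.text for p in root.iter("pattern")}


patterns = load_misc_patterns(MISC_PATTERNS_XML)
print(patterns["atLeast"].format("5"))
print(patterns["atMost"].format("5"))
print(patterns["approximately"].format("$5"))
print(patterns["range"].format("3", "5"))
```

The already-formatted number (including any currency symbol) is substituted as a string, so these patterns stay independent of how the individual bounds are formatted.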

comment:17 Changed 6 months ago by shane

comment:16 SGTM, thanks!

comment:18 Changed 6 months ago by mark

  • Owner changed from anybody to mark
  • Status changed from new to accepted
  • Milestone changed from UNSCH to 34

taking, as per email

comment:19 Changed 6 months ago by mark

  • Cc shane added
  • Status changed from accepted to reviewing
  • Review set to emmons

comment:20 Changed 5 months ago by mark

  • Priority changed from assess to major

Had to fix coverage level ...

comment:21 Changed 8 weeks ago by emmons

  • Status changed from reviewing to closed
  • Resolution set to fixed