[Unicode] Common Locale Data Repository : Bug Tracking

CLDR Ticket #6269 (closed enhancement: fixed)

Opened 5 years ago

Last modified 4 years ago


Reported by: mark
Owned by: mark
Component: xxx-spec
Data Locale:
Phase:
Review: pedberg
Weeks:
Data Xpath:

Description (last modified by mark)

The RBNF rules are not tracked and vetted in the same way as other locale data. As a result,

  1. there are many forms that are effectively unusable by general purpose software, while
  2. other forms are missing.

For example, for German we have spellout-cardinal-masculine and spellout-cardinal-feminine, so there is some attempt to account for context (in this case, gender). But we don't have masculine/feminine/neuter ordinals, nor accusative, dative, genitive, nor the number, nor the determiner; yet we give no indication that the rules are incomplete. (Nor, if we had them, would someone be able to apply them without a good model of German grammar, plus a mechanism (and database) for German noun genders.)

So I suggest that we do both of the following:

  1. we add some strong caveats to the data for RBNF.
  2. either we document what ruleset names like the following mean, or we make them private:


I put together a spreadsheet below.



Change History

comment:1 Changed 5 years ago by mark

  • Description modified

comment:2 Changed 5 years ago by kent.karlsson14@…

For now (2013-06-10), many of the ruleset names you list in that spreadsheet are already private. I think those should not be listed in that spreadsheet: they exist only to express the public rules in a concise manner, and should never be used directly (indeed, ICU does not expose an API for calling them; that is what it means for them to be "private").

You also list items like "MISSING-spellout-cardinal", where there are (e.g.) "spellout-cardinal-feminine" and "spellout-cardinal-masculine". Well, that is because no "spellout-cardinal" should be there. The "spellout-cardinal-*" rulesets are spellouts that are to be used in conjunction with a noun. And in several languages you *need* to know the grammatical gender of the noun to pick the *correct* number spellout. Sometimes the difference is slight, sometimes greater. But if the number spellouts differ at all between the grammatical genders, there should be no "spellout-cardinal" (without qualification) ruleset. The grammatical gender of the noun is used in the name of the spellout ruleset (that is a current error in the Arabic rules, as far as I can tell). True, picking the wrong gender is unlikely to impede understanding, but it will look funny.

It's true that often the "spellout-ordinal-*" rulesets are missing. But that is only because it (at the time at least) was hard to find sufficient data to make such rulesets for all languages covered. Note that ordinal numbers have an (implicit or explicit) noun.

For the case that there is no noun, there are two rulesets in each covered locale: "spellout-numbering" (for the general case; often equal to the masculine variant in languages with grammatical gender) and "spellout-numbering-year" for the case that one is spelling out a calendar year number. The word corresponding to "year" should not occur in the rules (that is a current error in the Ewe rules...).
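A minimal sketch of the selection logic described above, assuming a hypothetical helper `pick_ruleset` and a hand-written `AVAILABLE` set containing only the German ruleset names mentioned in this ticket. It is not an ICU or CLDR API; it only illustrates that the caller must supply the noun's gender, and that the no-noun case maps to "spellout-numbering".

```python
# Hypothetical helper: pick a public spellout ruleset name from context.
# AVAILABLE lists only the names mentioned in this ticket; a real locale
# may expose a different set.
AVAILABLE = {
    "spellout-numbering",
    "spellout-numbering-year",
    "spellout-cardinal-masculine",
    "spellout-cardinal-feminine",
}


def pick_ruleset(noun_gender=None, is_year=False):
    """Return a ruleset name for the given context.

    noun_gender: grammatical gender of the counted noun, or None when
    the number stands alone (no noun).
    is_year: True when spelling out a calendar year number.
    """
    if is_year:
        return "spellout-numbering-year"
    if noun_gender is None:
        # No noun: use the general-purpose numbering ruleset.
        return "spellout-numbering"
    name = f"spellout-cardinal-{noun_gender}"
    if name not in AVAILABLE:
        # Refuse rather than silently pick a wrong gender.
        raise LookupError(f"no ruleset {name!r} in this sketch")
    return name


print(pick_ruleset())                # no noun
print(pick_ruleset("feminine"))      # counting a feminine noun
print(pick_ruleset(is_year=True))    # calendar year
```

Raising on an unknown gender, instead of falling back to another ruleset, is a design choice in this sketch: a wrong gender would still be understood but would look odd.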

As for further grammatical variation (e.g. grammatical case): 1) it was even harder to find any data to build upon, and 2) such forms are rarely used, and when used it is only in rather special contexts.

I agree that the documentation should list (somewhere) the public RBNF ruleset names for each RBNF-covered locale. The documentation should also note that "spellout-numbering" and "spellout-numbering-year" exist in all RBNF-covered locales.

comment:3 Changed 5 years ago by emmons

  • Status changed from new to assigned
  • Component changed from unknown to spec
  • Priority changed from assess to medium
  • Milestone changed from UNSCH to 24final
  • Owner changed from anybody to mark
  • Type changed from unknown to enhancement

comment:4 Changed 5 years ago by emmons

  • Cc grhoten added

comment:5 Changed 5 years ago by grhoten

I mostly agree with Kent. I also agree that the rules for German are insufficient. There has been some discussion about expanding the German rules to account for additional forms. The same issue of insufficient variants applies to Finnish and some other Nordic and Germanic languages. The increased number of variants also makes the rules more confusing to vet and use.

There are 3 main numbering types to consider:

  1. Numbering
  2. Cardinal
  3. Ordinal

A common subtype for numbering is numbering-year. Unfortunately, some people misunderstand the purpose of the numbering-year type. So some documentation would be helpful.

The private rules are all language specific, and they should not be used in a public manner. Most of them do not handle complete ranges, and they are usually subordinate to the public rules. They tend to handle a subrange of the public rules. In generating the report of existing rules, it's important to leave out the private rules that are currently included in the spreadsheet. They should not be publicized. They can arbitrarily change. I try to be a little more consistent for the public rule names. The Number Format Tester has a better way to retrieve these public rules: http://st.unicode.org/cldr-apps/numbers.jsp
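A minimal sketch of leaving private rules out of such a report, assuming the names are given in ICU rule-text style, where a "%%" prefix marks a private ruleset and a single "%" marks a public one. The function name and the private names in the sample list are made up for illustration.

```python
def public_rulesets(names):
    """Keep only public ruleset names.

    Assumes ICU-style names: "%%name" is private, "%name" is public.
    """
    return [n for n in names if n.startswith("%") and not n.startswith("%%")]


# Illustrative input; the two private names are invented.
names = [
    "%spellout-numbering-year",
    "%spellout-numbering",
    "%%example-private-helper",
    "%spellout-cardinal-masculine",
    "%%another-private-helper",
]

for name in public_rulesets(names):
    print(name)
```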

I do agree that some form of vetting completeness would be helpful. I've seen a few contributed rules that are obviously incomplete once you get a native speaker to vet the rules.

comment:6 Changed 5 years ago by grhoten

To be a little more explicit, these rules in question are all private:


comment:7 Changed 5 years ago by mark

  • Milestone changed from 24final to 25design

comment:8 Changed 5 years ago by emmons

  • Milestone changed from 25design to 25rc

Moving all 25dsub and 25design tickets to 25rc. If you plan to complete items in the 25M1 time frame, please move those tickets to 25M1.

comment:9 Changed 4 years ago by mark

  • Milestone changed from 25rc to 25final

comment:10 Changed 4 years ago by mark

  • Status changed from assigned to reviewing
  • Review set to pedberg

comment:11 Changed 4 years ago by grhoten

Should the documentation at least try to explain the difference between numbering, numbering-year, cardinal and ordinal? The German example explains one of the important problems, but it's not the common problem. The common problem is that most people don't understand the difference between the 4 common types, and this is what we need to explain. The documentation should also describe cases where RBNF is not appropriate, such as durations. Some people want to contribute duration data for RBNF, and we don't want to accept that because durations are handled with other APIs and data. Including non-number words, like durations, is not the best use of RBNF in CLDR.

comment:12 Changed 4 years ago by grhoten

Instead of only describing how RBNF can't be used, here is some additional information that would help with how it can be used and translation guidelines.

There are 4 common spellout rules to consider. Some languages may provide more than 4 types.

  • numbering: This is the default used when there is no context for the number. For many languages, this may also be used for enumeration of objects, as when pronouncing "table number one" and "table number two". It can also be used for pronouncing a math equation, like "2 - 3 = -1".
  • numbering-year: This is used for cases where years are pronounced or written a certain way. An example in English is the year 1999, which comes out as "nineteen ninety-nine" instead of the numbering value "one thousand nine hundred ninety-nine". The rules for this type have undefined behavior for non-integer numbers, and values less than 1.
  • cardinal: This is used when stating a quantity of objects. For many languages, there may not be a default cardinal type. Many languages require the notion of the gender and other grammatical properties so that the number and the objects being referenced are in grammatical agreement. An example of its usage is "one e-mail", "two people" or "three kilometers". Some languages may not have dedicated words for 0 or negative numbers for cardinals. In those cases, the words from the numbering type can be reused.
  • ordinal: This is used when stating the position of an object in an ordering. For many languages, there may not be a default ordinal type. Many languages also require the notion of the gender for ordinal so that the ordinal number and the objects being referenced are in grammatical agreement. An example of its usage is "first place", "second e-mail" or "third house on the right". The rules for this type have undefined behavior for non-integer numbers, and values less than 1.
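To make the numbering versus numbering-year distinction concrete, here is a hand-written toy in Python for English. It is not the CLDR rule data and does not call ICU; the function names are hypothetical, the range is limited to 0..9999, and the year variant only reads the number as two pairs when the last two digits are 10 or more, falling back to plain numbering otherwise.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def below_100(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def numbering(n):
    """Toy 'numbering' spellout for integers 0..9999."""
    if not 0 <= n <= 9999:
        raise ValueError("toy example covers 0..9999 only")
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(below_100(rest))
    return " ".join(parts)


def numbering_year(n):
    """Toy 'numbering-year' spellout: read the year as two pairs."""
    if 1010 <= n <= 9999 and n % 100 >= 10:
        return below_100(n // 100) + " " + below_100(n % 100)
    # Simplification: everything else uses the plain numbering form.
    return numbering(n)


print(numbering(1999))       # one thousand nine hundred ninety-nine
print(numbering_year(1999))  # nineteen ninety-nine
```

The fallback in `numbering_year` is a simplification of this sketch; real rule data may treat years such as 1905 differently.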

In addition to the spellout rules, there are also numbering system rules. Even though they may be derived from a specific culture, they are typically not translated and the rules remain in root. An example of these rules is the Roman numerals, where the value 8 comes out as VIII.

With regard to the number range supported for all these number types, the rules try to support the largest possible range, but some languages may not have words for large numbers. For example, the old Roman numbering system can't support values of 5000 and beyond. For those unsupported cases, the default number format from CLDR is used.
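A minimal sketch of a numbering-system rule with an out-of-range fallback, assuming the 5000 limit stated above. The function name is hypothetical, and `str(n)` is only a stand-in for the default number format from CLDR; the actual root rule data may differ.

```python
# Value/symbol pairs in descending order, including subtractive forms.
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def roman_or_default(n):
    """Format n as upper-case Roman numerals, or fall back to digits."""
    if not 1 <= n < 5000:
        # Unsupported value: stand-in for the default number format.
        return str(n)
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


print(roman_or_default(8))     # VIII
print(roman_or_default(5000))  # 5000
```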

Any rules marked as private should never be referenced externally. Frequently they only support a subrange of numbers that are used in the public rules.

comment:13 Changed 4 years ago by kent.karlsson14@…

I agree with grhoten that the checked-in text is too negative and not very helpful. The text suggested by grhoten is indeed more appropriate, as it covers the aspects of number spellout that are intended to be covered. And those are, I believe, common cases, whereas requiring coverage of all odd corners of number spellout is not very helpful.

comment:14 Changed 4 years ago by pedberg

  • Status changed from reviewing to accepted

I also favor the text from George as being more useful than the current text, perhaps with the additional comment that RBNF language coverage may be less complete than many other areas of CLDR. I am happy to roll this in myself if that is okay.

comment:15 Changed 4 years ago by pedberg

  • Cc mark, pedberg added

comment:16 Changed 4 years ago by grhoten

I'm fine with adding a qualifying statement that the vetting and coverage is incomplete compared to the rest of CLDR.

comment:17 Changed 4 years ago by pedberg

The Modifications entry was also wrong, I fixed that along with fixes for cldrbug 7071:

comment:18 Changed 4 years ago by mark

  • Status changed from accepted to reviewing

Added the text from George (with minor tweaks)

comment:19 Changed 4 years ago by grhoten

I think the recent changes look good.

comment:20 Changed 4 years ago by pedberg

  • Status changed from reviewing to closed
  • Resolution set to fixed
