CLDR Ticket #4736 (closed defect: fixed)

Opened 6 years ago

Last modified 5 years ago

Vet the RBNF data

Reported by: grhoten@…
Owned by: grhoten
Component: main
Data Locale: many
Phase:
Review: emmons
Weeks:
Data Xpath:
Xref:

Description

The Rule Based Number Format data needs vetting. I have attached the revised data. I used flat files to develop the revisions.

It would also help if the charts presented this data in some form. I have attached an example of how to format the data for review. Here is one way to invoke the program:

javac -cp icu4j-4_8_1_1.jar:. ListNumbers.java && java -cp icu4j-4_8_1_1.jar:. -Dfile.encoding=UTF8 ListNumbers > output.html

I will be providing more data, but here is what I have vetted so far. This seemed like a good checkpoint for contributing the data. I can answer questions as needed. I changed a lot of data.

Attachments

rbnf.zip (22.0 KB) - added by grhoten@… 6 years ago.
flat RBNF rules
ListNumbers.java (10.9 KB) - added by grhoten@… 6 years ago.
Source code to generate tables of numbers for review
rbnf.patch (249.0 KB) - added by grhoten@… 6 years ago.
Example ICU4C patch

Change History

Changed 6 years ago by grhoten@…

flat RBNF rules

Changed 6 years ago by grhoten@…

Source code to generate tables of numbers for review

Changed 6 years ago by grhoten@…

Example ICU4C patch

comment:1 Changed 6 years ago by grhoten@…

Also here is a way to produce a table for a single language with the provided source code.

javac -cp icu4j-4_8_1_1.jar:. ListNumbers.java && java -cp icu4j-4_8_1_1.jar:. -Dfile.encoding=UTF8 ListNumbers en en.txt > output.html
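
For readers without the attachment, here is a minimal sketch of the kind of review table such a program could write to output.html. It is not the attached ListNumbers.java, whose contents are not shown in this ticket. The names ReviewTableSketch, toHtmlTable and escapeHtml are hypothetical, and the speller is a placeholder where a real run would call a number formatter.

```java
import java.util.function.LongFunction;

class ReviewTableSketch {
    // Builds a two-column HTML table: the number and its spelled-out form.
    static String toHtmlTable(long[] samples, LongFunction<String> speller) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (long n : samples) {
            sb.append("<tr><td>").append(n).append("</td><td>")
              .append(escapeHtml(speller.apply(n))).append("</td></tr>\n");
        }
        return sb.append("</table>\n").toString();
    }

    // Escapes the three characters that matter inside HTML element content.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Hypothetical stand-in speller; a real run would call a number formatter here.
        LongFunction<String> placeholder = n -> "<" + n + ">";
        System.out.print(toHtmlTable(new long[] {1, 21, 1000}, placeholder));
    }
}
```

Passing the speller in as a function keeps the table layout independent of whichever rule set is under review.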

comment:2 Changed 6 years ago by kent.karlsson14@…

Firstly, I usually welcome feedback on the RBNF rules (even though the feedback is
often very incomplete and sometimes hard to interpret).

However, this ticket includes a very large chunk of changes, and while they are at an
unusually detailed level (proposing changes directly on the rules level), they are also
without explicit motivation. Some of them may be fine, some I'm more suspicious about.

I understand you want a review of the RBNF, very good. So far I haven't tried the ListNumbers
program (for generating chart web pages), just looked at the "example patch". Not sure how
much of it is example and how much is proposal. One problem with the "example patch" is
the needless change from uppercase to lowercase in "\u" character escapes. (UTF-8 for both
would be an option, as ICU can take UTF-8 files for this, as long as there is a BOM at the
beginning of the file. But UTF-8 has other masking and display problems, so it is not ideal.)

As mentioned, I think some of the fixes in the "example" are just fine (like replacing "..."
with an actual spellout for decimal mark). But others seem to miss one point or another.
For instance: "pt.txt" is actually "pt_BR.txt" and hence there should be no "pt_BR.txt";
a "%spellout-ordinal-masculine-plural" has been added for "Latin American Spanish", I've
stayed away from "ordinal plurals" as they largely don't make any sense ("the first ones",
ok (fuzzy lead part), even "the first ten ones", fine, but "the fifteenth ones", nah);
also some "%%spellout-fraction" have been added. I used those only on explicit request; in
particular I avoided them for Swedish (while one does, for 3.14, say "tre komma fjorton", it
gets more and more odd the more decimals you have, and one quickly goes for digit by digit
in the decimal part). And I'm sure there is more to remark on if one looks closer. So I'd
say that the "example patch" (which I've largely taken to be "some proposed changes") needs
quite a bit of review itself, and should not be taken as a starting point.

comment:3 Changed 6 years ago by grhoten@…

I don't have the ability to use the LDML2ICUConverter, but I can use the native2ascii converter in Java. That is why the case of the escaped notation changed. The patch is helpful more for showing the aliases and fallbacks that are needed. You can ignore the overall ICU patch. The source rules in the zip file are the more important patch.
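
To illustrate the escape-case point, here is a minimal sketch of a native2ascii-style escaper with a switch for the case of the hex digits. It is not the native2ascii tool itself; the class and method names are hypothetical. Both outputs denote the same character, so the difference only shows up as noise in a diff.

```java
class UnicodeEscapeSketch {
    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape.
    static String escape(String s, boolean upperHex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                String hex = String.format(upperHex ? "%04X" : "%04x", (int) c);
                sb.append('\\').append('u').append(hex);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "sk\u00e5lpund"; // contains U+00E5
        System.out.println(escape(word, false)); // lowercase hex digits
        System.out.println(escape(word, true));  // uppercase hex digits
    }
}
```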

Unfortunately, Portuguese did not have a clear fallback behavior. It had some significant spelling mistakes and it could not be determined to be Iberian or Brazilian Portuguese. You can pivot the data as appropriate.

The %%spellout-fraction was added in cases where certain types of numbers need to be used after the decimal point. For example, the gender before the decimal point may not be the same after the decimal point. I did not focus on the complete way to format the fractions right now. For example, it's common to have 0.21 formatted as zero point twenty one, and I have tried to go with a general formatting example instead of handling specific cases like this. This type of scenario may only appear when referring to the temperature or other types of measurements.

Apple has specific use cases where the new types are needed to get the numbers inflected correctly, and the translators know the context where the numbers are used. The Spanish plurals need to stay. Unlike English and most other languages, ordinal plurals make sense depending on where in the grammar it is used. I realize this stance may explode the German rules in order to get case, declension, gender and plurality involved.

comment:4 Changed 6 years ago by grhoten@…

Actually, the plurals are not strictly required for Spanish. An s can be added to an ordinal to get the same effect. This solution probably won't work in all languages though.

This does bring up a common problem that I had with getting this data vetted. UTS 35 ( http://unicode.org/reports/tr35/#Rule-Based_Number_Formatting ) provides no usage or vetting guidance on the available rule types. Guidance should be provided on which types are good to include in CLDR and which are unhelpful. Describing the difference between a cardinal number and a numbering number sometimes takes a while to explain. This guidance would help to determine when a number inflection should be included and vetted through CLDR, and when a set of private rules should be defined separately from CLDR.

I will submit a separate ticket to provide a draft proposal on RBNF usage and vetting guidance. It can then be reviewed by the CLDR Technical Committee.

comment:5 Changed 6 years ago by mark

  • Cc emmons, pedberg added
  • Owner changed from anybody to chrish
  • Priority changed from assess to major
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 22

comment:6 Changed 6 years ago by kent.karlsson14@…

The patch is helpful more for showing the aliases and fallbacks that are needed.

...

Unfortunately, Portuguese did not have a clear fallback behavior.

Not sure exactly what you mean by "fallback behaviour" here. (I've used "fallbacks"
(i.e. simplifications) at a meta-level. But *technically* there are no fallbacks in the RBNF rules...)

It had some significant spelling mistakes and it could not be determined to be Iberian or
Brazilian Portuguese. You can pivot the data as appropriate.

See also ticket:4655.

The %%spellout-fraction was added in cases where certain types of numbers need to be used after the
decimal point. For example, the gender before the decimal point may not be the same after the
decimal point.

Aha. Looking closer at your sv.txt file:

1.1 "ett komma ett" (current: fine, like in "ett komma ett skålpund", ok nobody uses skålpund anymore, but most other units are reale (=utrum))

1.1 "en komma en" (current: fine, like in "en komma en grader")
1.1 "en komma ett" (your proposal: strange, can't be used for anything)

I realise that there may be personal opinions about this (there are no actual rules for when to use
neutre and when to use reale). But in my opinion, your proposal would definitely be a change for
the worse, and indeed plainly wrong.
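
The agreement described in this comment can be sketched as follows: whichever gender is chosen for "one" applies on both sides of "komma". This is a toy limited to one digit on each side, not the CLDR rule set, and the class and method names are hypothetical.

```java
class SwedishDecimalSketch {
    // Index 1 is unused: "one" is handled separately because it inflects for gender.
    private static final String[] DIGITS = {
        "noll", null, "tv\u00e5", "tre", "fyra", "fem", "sex", "sju", "\u00e5tta", "nio"
    };

    // "ett" is the neuter form of one, "en" the common-gender (reale) form.
    static String digit(int d, boolean neuter) {
        return d == 1 ? (neuter ? "ett" : "en") : DIGITS[d];
    }

    // Spells a single-digit integer part and a single-digit fraction,
    // using the same gender on both sides of the decimal mark.
    static String spell(int integerDigit, int fractionDigit, boolean neuter) {
        return digit(integerDigit, neuter) + " komma " + digit(fractionDigit, neuter);
    }

    public static void main(String[] args) {
        System.out.println(spell(1, 1, true));  // neuter on both sides
        System.out.println(spell(1, 1, false)); // common gender on both sides
    }
}
```

Because a single flag drives both sides, the mixed form "en komma ett" cannot be produced.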

At some point (earlier) something went wrong with the thousands: it's "ettusen", but *not*
"tjugoettusen", it's "tjugoentusen". Ok, YMMV. And maybe I'm guilty of saying ok, in a weak moment,
to this erroneous change, which of course I shouldn't have. Not as bad as your proposal here,
but still not ideal.

I did not focus on the complete way to format the fractions right now. For example, it's common to
have 0.21 formatted as zero point twenty one, and I have tried to go with a general formatting example
instead of handling specific cases like this. This type of scenario may only appear when referring
to the temperature or other types of measurements.

That's fine.

Apple has specific use cases where the new types are needed to get the numbers inflected correctly,
and the translators know the context where the numbers are used.

I don't mind adding all inflections. Just that one should take the most commonly occurring first, since some
languages have quite a lot of inflection also for numbers.

I'm glad to hear that the inflected variants are usable. It's been claimed (by someone...) before that
they are not usable.

The Spanish plurals need to stay. Unlike English and most other languages, ordinal plurals make sense
depending on where in the grammar it is used. I realize this stance may explode the German rules in
order to get case, declension, gender and plurality involved.

...

Actually, the plurals are not strictly required for Spanish. An s can be added to an ordinal to get
the same effect. This solution probably won't work in all languages though.

Tricks that work for just one language... No.

I don't actually oppose the addition, just that use cases might be very few. However, when trying to
find examples (not that easy...), it seems that the use cases are "nonce plurals", a (singular) group
referred to in plural just because it is a group even though it is singular (since it is a single group).

I don't think German has that many inflections for number spell-out. But Slavic languages do.
And Icelandic...

I haven't yet looked through the rest of the proposed changes. Even so, your proposal also contains
some significant errors (more than I've mentioned here). So, I'm all for vetting the RBNF rules,
but please don't use your attached proposal as a starting point. Some changes are for the better,
granted, but there are also changes for the worse.

comment:7 Changed 6 years ago by grhoten

  • Owner changed from chrish to grhoten

comment:8 Changed 5 years ago by grhoten

  • Review set to emmons

comment:9 Changed 5 years ago by emmons

  • Review emmons deleted

George,

Instead of putting @noparse in the rule name in the LDML, you should use the attribute allowsParsing="false" instead. Please change the DTD back to its original state and use the attribute. LDML2ICUConverter will add the @noparse to the ICU ruleset automatically.

Kent had some problems with the changes to Swedish - I'm going to have to look at these carefully as well, probably next week.

His comments below:

Hi John!

I'm strongly opposing the changes to Swedish RBNF.

"en komma ett" is plain wrong. "en komma en" and "ett komma ett" are just
fine.

"hundra" instead of "etthundra" and "tusen" instead of "ettusen" is just
sloppy language.

Indeed "tjugoettusen" is wrong, it's "tjugoentusen" (despite "ettusen"),
so an earlier "fix" was wrong in trying to "correct" that. It was ok
the way I originally submitted it.

Minor: a ":e" is wrongly dropped. To be very strict, in some
circumstances it should be ":a", but ":e" is not wrong, and better
than dropping it. But this is just for an "overflow" case.

Kind regards
/Kent K
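
The thousands point in Kent's mail can be sketched as follows: a lone one before "tusen" gives "ettusen", while a compound ending in one keeps "en", as in "tjugoentusen". Applying the same pattern to the other tens is an assumption made for this sketch. It is a toy for multipliers 1 to 99, not the CLDR rules, and the names are hypothetical.

```java
class SwedishThousandsSketch {
    private static final String[] ONES = {
        "", "en", "tv\u00e5", "tre", "fyra", "fem", "sex", "sju", "\u00e5tta", "nio"
    };
    private static final String[] TEENS = {
        "tio", "elva", "tolv", "tretton", "fjorton", "femton", "sexton", "sjutton", "arton", "nitton"
    };
    private static final String[] TENS = {
        "", "", "tjugo", "trettio", "fyrtio", "femtio", "sextio", "sjuttio", "\u00e5ttio", "nittio"
    };

    // Spells n * 1000 for 1 <= n <= 99.
    static String thousands(int n) {
        if (n < 1 || n > 99) {
            throw new IllegalArgumentException("n must be in 1..99");
        }
        if (n == 1) {
            return "ettusen"; // lone one: "ett" + "tusen", written with two t's
        }
        String head;
        if (n < 10) {
            head = ONES[n];
        } else if (n < 20) {
            head = TEENS[n - 10];
        } else {
            head = TENS[n / 10] + ONES[n % 10]; // a compound one stays "en"
        }
        return head + "tusen";
    }

    public static void main(String[] args) {
        System.out.println(thousands(1));
        System.out.println(thousands(21));
    }
}
```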

comment:10 Changed 5 years ago by grhoten

I've modified the @noparse usage to use allowsParsing="false" instead. Let me get back to our Swedish expert with Kent's comments.

comment:11 Changed 5 years ago by grhoten

  • Review set to emmons

I have not heard any objections to my last e-mail to Kent about the latest Swedish RBNF data. So I am marking this one for review again.

comment:12 Changed 5 years ago by emmons

  • Status changed from assigned to closed
  • Resolution set to fixed