CLDR Ticket #10327(accepted)

Opened 19 months ago

Last modified 19 months ago

RBNF inflections

Reported by: mark
Owned by: grhoten
Component: numbers-rbnf
Data Locale:
Phase: rc
Review:
Weeks:
Data Xpath:


Consider whether to name the inflected RBNF forms based on http://universaldependencies.org/u/feat/all.html (if they aren't already).


Change History

comment:1 Changed 19 months ago by mark

Also, we don't have a good test for completeness of inflections. I was talking to Sascha, and here's an idea.

There is a bunch of treebank data for different languages, such as https://github.com/UniversalDependencies/UD_Czech. Pull down the data for the available languages, and write a little program to extract all the inflected forms of cardinals and ordinals. (The data is easily parsable.)
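A minimal sketch of such an extraction program, assuming the treebank files are in the standard 10-column CoNLL-U format and that numerals carry a NumType=Card or NumType=Ord feature. The choice of which features count as nominal (NOMINAL_FEATURES) and the sample rows are illustrative assumptions, not real treebank lines.

```python
from collections import defaultdict

# Assumption: these are the nominal features worth tracking for numerals.
NOMINAL_FEATURES = ("Case", "Gender", "Number", "Definite", "Animacy")


def parse_feats(feats):
    """Turn a CoNLL-U FEATS cell such as 'Case=Gen|Gender=Fem' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def collect_numeral_features(lines):
    """Map each NumType (Card/Ord) to the set of feature combinations seen."""
    seen = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token row
        feats = parse_feats(cols[5])  # column 6 is FEATS
        num_type = feats.get("NumType")
        if num_type not in ("Card", "Ord"):
            continue
        combo = tuple((k, feats[k]) for k in NOMINAL_FEATURES if k in feats)
        seen[num_type].add(combo)
    return seen


# Made-up rows in CoNLL-U shape, for illustration only.
sample = [
    "# text = (illustrative)",
    "1\tdvou\tdva\tNUM\t_\tCase=Gen|Gender=Fem|NumType=Card|Number=Plur\t2\tnummod\t_\t_",
    "2\tžen\tžena\tNOUN\t_\tCase=Gen|Gender=Fem|Number=Plur\t0\troot\t_\t_",
    "3\tdruhého\tdruhý\tADJ\t_\tCase=Gen|Gender=Masc|NumType=Ord|Number=Sing\t4\tamod\t_\t_",
    "",
]

result = collect_numeral_features(sample)
for num_type in sorted(result):
    for combo in sorted(result[num_type]):
        print(num_type, "-".join(value for _, value in combo))
```

For a real run, the same function could be fed an open file handle for each .conllu file in a treebank repository.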

The set of all Nominal features that are in the data could provide information we can use to test for completeness of RBNF data. For example, if there is an {ablative neuter} form of some number, then we can test for the existence of an RBNF label for that set. So the test could generate a message like

de.xml — missing RBNF form for ablative-neuter.

That does not mean that all forms need to be distinct. There is a lot of redundancy in natural language. So if (say) the ablative-neuter forms for all of the ordinals are the same as the genitive-feminine forms, we can just have an alias from one to the other. In that case, the test could generate an error message that indicates that, e.g.

de.xml — missing RBNF form for ablative-neuter. Consider adding an alias to the genitive-feminine rules.
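The test described in this comment could look roughly like the sketch below. It assumes the attested forms have already been grouped by a feature label, and that the set of labels covered by the RBNF data is known; both inputs, the label spelling, and the toy German data are hypothetical. The message wording follows the examples above, with a colon as separator.

```python
def check_completeness(locale, attested, rbnf_labels):
    """Report attested feature labels that have no RBNF rule set.

    attested:    label -> {number: spelled-out form} taken from a treebank
    rbnf_labels: labels that already have an RBNF rule set
    """
    messages = []
    for label in sorted(attested):
        if label in rbnf_labels:
            continue
        message = f"{locale}.xml: missing RBNF form for {label}."
        for other in sorted(rbnf_labels):
            other_forms = attested.get(other)
            if other_forms is None:
                continue  # nothing to compare against
            shared = attested[label].keys() & other_forms.keys()
            # Suggest an alias only if every number seen under both labels agrees.
            if shared and all(attested[label][n] == other_forms[n] for n in shared):
                message += f" Consider adding an alias to the {other} rules."
                break
        messages.append(message)
    return messages


# Toy data: German ordinals after a definite article (illustrative only).
attested = {
    "genitive-feminine": {1: "ersten", 2: "zweiten"},
    "dative-neuter": {1: "ersten", 2: "zweiten"},
    "nominative-masculine": {1: "erste"},
}
messages = check_completeness("de", attested, {"genitive-feminine"})
for message in messages:
    print(message)
```

Agreement on the attested numbers is only evidence for an alias, not proof, so a suggestion like this would still need human review.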

comment:2 Changed 19 months ago by mark

  • Cc sascha added

comment:3 Changed 19 months ago by mark

  • Cc grhoten added

comment:4 Changed 19 months ago by grhoten

It's a nice idea. It's one of the reasons why I did a presentation about it here: https://www.youtube.com/watch?v=KclVxxHX26k

Personally, it's one of the reasons why I think that most POS taggers are useless. You really need a grammeme tagger. I even wrote a bitfield-based grammeme tagger for this type of topic. I'm even open to collaborating on such a topic through ICU or CLDR. The code that I have does lemmatization and word inflection.

In any case, numbers are only affected by a subset of grammemes (grammatical properties as stated on that web site). If the grammatical category modifies the surface form, I try to include the variant in RBNF. If the grammatical category is an unbound morpheme from the number (regardless of how the order changes based on the value), I tend to exclude it from the RBNF rules. For example, Hebrew has the construct definiteness in numbers, but outside of the Semitic languages, that just isn't needed.

Now as far as German is concerned, that language only has 4 grammatical cases and 3 genders that affect the number spelling. German has the nominative, accusative, dative and genitive grammatical cases, and it has the masculine, feminine and neuter genders. German doesn't have an ablative form. Cluttering up the rules to make it as complicated as the cross product of Polish, Russian and Finnish is insane. You wouldn't be able to get anyone to accurately review the RBNF rules.

German could have gone 2 different ways about this rule naming. The current set of rules was the easiest to get reviewers to look at, and it lets translators select the appropriate rules by only creating 3 additional rules instead of an additional 9 rules. This has the downside that you can't programmatically know from the rule names which grammatical properties are supported. Personally, I think such a mapping of grammatical names to RBNF rule names would be more appropriately handled outside of these RBNF rule names. Some of these rule names are just spelling variants, like the English verbose type, which doesn't map to any grammatical categories.

That universal feature list does accurately list some number types that are not currently supported in RBNF. The multiplicative and collective numbers have been requested for RBNF, but I've been on the fence about whether this is a good idea.

So in summary, I think the following are interesting to investigate further.

1) Create infrastructure to map language specific grammemes to RBNF rule names.
2) Investigate holes in the data for additional number types, like the collective and multiplicative numbers.
3) Provide a way to get a number and a noun into grammatical agreement. This is probably the ultimate goal, and it would provide focus for the 2 items above.

comment:5 Changed 19 months ago by mark

Taking this offline, because "German doesn't have an ablative form. Cluttering up the rules to make it as complicated as the cross product of Polish, Russian and Finnish is insane. You wouldn't be able to get anyone to accurately review the RBNF rules." is a complete misreading of what is being suggested, which is to look just at those categories that are used in the treebank for each language.

So German would only have what is used in German:
case={nominative, accusative, dative, genitive}
gender={masculine, feminine, neuter}
definite={definite, indefinite, mixed}
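As a rough illustration of the size of that per-language label space, the three German categories listed here can be expanded into every combination a completeness test might look for. The label format (hyphen-joined, case then gender then definiteness) is an assumption made for this sketch.

```python
from itertools import product

# Category values exactly as listed in this comment.
GERMAN_CATEGORIES = {
    "case": ["nominative", "accusative", "dative", "genitive"],
    "gender": ["masculine", "feminine", "neuter"],
    "definite": ["definite", "indefinite", "mixed"],
}

# Assumed label format: values joined with hyphens in the order above.
german_labels = ["-".join(combo) for combo in product(*GERMAN_CATEGORIES.values())]

print(len(german_labels))
print(german_labels[0])
```

Many of these combinations share the same surface forms, which is where the aliasing idea from comment:1 would keep the actual rule count well below the full product.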

comment:6 Changed 19 months ago by sascha

I believe Mark’s proposal could be rephrased as: use Universal Feature identifiers as a controlled vocabulary, similar to how CLDR has imported various other identifier systems. E.g., use one of {Masc, Com, Fem, Neut} to identify Gender (http://universaldependencies.org/u/feat/all.html#al-u-feat/Gender), and likewise for other linguistic properties (if they matter for a language). For example, a Polish RBNF rule might be invoked by specifying Card-Ins-Neut.

comment:7 Changed 19 months ago by mark

Right, I should have given an example. Take http://universaldependencies.org/u/feat/all.html#al-u-feat/Definite, which is needed for German ordinals.

We could use 'ind' for "indefinite" (like our other Ids, we'd probably want to lowercase for a normalized form). As in other cases, we can grandfather in the current ids, which typically match one of the long-form ids on http://universaldependencies.org/u/feat/all.html

It needs further investigation, since we need the Ids to be unique and stable. We also need to have a deterministic order among features, for uniqueness.
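One way to get unique, normalized Ids is to fix the feature order once and lowercase the values, as in the sketch below. The particular order in FEATURE_ORDER and the hyphen separator are assumptions for illustration; nothing in this ticket has settled them.

```python
# Assumed canonical order of Universal Dependencies features within an Id.
FEATURE_ORDER = ("NumType", "Case", "Gender", "Definite")


def rbnf_id(features):
    """Build a lowercase Id from UD feature values in a fixed order."""
    unknown = set(features) - set(FEATURE_ORDER)
    if unknown:
        raise ValueError(f"no canonical position for: {sorted(unknown)}")
    return "-".join(features[k].lower() for k in FEATURE_ORDER if k in features)


# The same features in any input order produce the same Id.
first = rbnf_id({"NumType": "Card", "Case": "Ins", "Gender": "Neut"})
second = rbnf_id({"Gender": "Neut", "Case": "Ins", "NumType": "Card"})
print(first, second)
```

Rejecting features that have no assigned position keeps two different spellings of the same request from silently producing different Ids.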

comment:8 Changed 19 months ago by mark

  • Status changed from new to accepted
  • Component changed from unknown to rbnf
  • Phase changed from dsub to rc
  • Milestone changed from UNSCH to 32
  • Owner changed from anybody to grhoten
  • type changed from unknown to dtd

comment:9 Changed 19 months ago by grhoten

  • Priority changed from assess to medium
  • Milestone changed from 32 to UNSCH

I really dislike those abbreviations. I don't think they add much value, especially to people new to RBNF. I prefer to use verbose names. I am fine with keeping the rule names rooted in common linguistic terms, which are mentioned on that web site. With that said, I don't think there is anything to do for the current rule names.

Now if you want a way to map combinations of Universal Dependencies abbreviations to RBNF names, that's a much more interesting request. Then you don't have to worry about the ordering of the grammatical category values, and it becomes easier to react to incomplete requests. This would still depend on infrastructure that isn't defined in CLDR or ICU, which would go back to the video that I originally mentioned. I sometimes use a bit field for such requests instead of cumbersome strings.
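A minimal sketch of such a mapping layer, using a bit field so that the order of the requested values does not matter and an incomplete request falls back to a less specific rule. This is not the code referred to above; the rule names in RULE_TABLE are hypothetical placeholders rather than names from the CLDR data, and the most-specific-match policy is an assumption.

```python
from enum import IntFlag


class Grammeme(IntFlag):
    """One bit per grammatical value; only German case and gender here."""
    NOM = 1 << 0
    ACC = 1 << 1
    DAT = 1 << 2
    GEN = 1 << 3
    MASC = 1 << 4
    FEM = 1 << 5
    NEUT = 1 << 6


# Universal Dependencies value abbreviations mapped to bits.
UD_VALUE_TO_BIT = {
    "Nom": Grammeme.NOM, "Acc": Grammeme.ACC,
    "Dat": Grammeme.DAT, "Gen": Grammeme.GEN,
    "Masc": Grammeme.MASC, "Fem": Grammeme.FEM, "Neut": Grammeme.NEUT,
}

# Hypothetical rule names; each rule lists the bits a request must contain.
RULE_TABLE = [
    (Grammeme(0), "ordinal-default"),
    (Grammeme.DAT, "ordinal-dative"),
    (Grammeme.GEN | Grammeme.FEM, "ordinal-genitive-feminine"),
]


def to_bits(ud_values):
    """Combine UD value abbreviations into one bit field."""
    bits = Grammeme(0)
    for value in ud_values:
        bits |= UD_VALUE_TO_BIT[value]
    return bits


def select_rule(ud_values):
    """Pick the most specific rule whose required bits are all requested."""
    request = to_bits(ud_values)
    candidates = [(mask, name) for mask, name in RULE_TABLE
                  if (mask & request) == mask]
    # The default rule has an empty mask, so candidates is never empty.
    _, name = max(candidates, key=lambda c: bin(int(c[0])).count("1"))
    return name


print(select_rule(["Fem", "Gen"]))  # value order does not matter
print(select_rule(["Gen"]))         # incomplete request uses the default
```

A request is a single integer, so matching it against a rule is one mask comparison rather than string parsing.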

In any case, this is not a high-priority topic for me. I'm still open to a meeting to discuss this topic further.

