
CLDR Ticket #1079 (accepted)

Opened 13 years ago

Last modified 3 months ago

Language-sensitive titlecasing

Reported by: mark.davis(at)icu-project.org
Owned by: mark
Component: to-assess
Data Locale:
Phase:
Review:
Weeks: 1
Data Xpath:




Description (last modified by pedberg)

Different languages may have different conventions for titlecasing, such as
Dutch "IJzer", or even English cases such as "Buffy the Vampire Slayer" (where
"the" is not capitalized). Consider adding structure that allows implementations
to do better titlecasing. (In general, there will still be cases that can't be
handled algorithmically, but this would allow much better treatment than
currently available.)

One possibility: use the same structure as segmentation, whereby the text is
segmented immediately prior to any point with initial capitals. Thus, for
example, it would return "|I|Jzer" or "|Buffy the |Vampire |Slayer". That would
allow us to use the same mechanisms as already exist.
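
A minimal sketch of how an implementation might consume such boundaries, assuming they are delivered as character offsets. The function names `parse_marked` and `titlecase_at_boundaries` are hypothetical and are not part of CLDR or ICU; the sketch only illustrates the idea of uppercasing the character that follows each boundary.

```python
def parse_marked(marked: str) -> tuple[str, list[int]]:
    """Split a string such as "|I|Jzer" into plain text and boundary offsets."""
    text: list[str] = []
    boundaries: list[int] = []
    for ch in marked:
        if ch == "|":
            boundaries.append(len(text))  # boundary sits before the next character
        else:
            text.append(ch)
    return "".join(text), boundaries


def titlecase_at_boundaries(text: str, boundaries: list[int]) -> str:
    """Lowercase everything, then uppercase the character after each boundary."""
    # Lowercasing per character keeps the offsets stable even when a
    # case mapping changes the length of an individual character.
    chars = [c.lower() for c in text]
    for i in boundaries:
        if 0 <= i < len(chars):
            chars[i] = chars[i].upper()
    return "".join(chars)


print(titlecase_at_boundaries(*parse_marked("|i|jzer")))
# IJzer
print(titlecase_at_boundaries(*parse_marked("|buffy the |vampire |slayer")))
# Buffy the Vampire Slayer
```

The language-specific knowledge (that Dutch "ij" takes two capitals, or that English "the" takes none) would live entirely in the rules that produce the boundaries, which is what lets the existing segmentation machinery be reused.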


Change History

comment:1 Changed 13 years ago by mark

moved from incoming to future

comment:2 Changed 13 years ago by deborah

changed notes2

comment:3 Changed 13 years ago by deborah

moved from future to dtd

comment:4 Changed 12 years ago by mark

changed notes2

comment:5 Changed 11 years ago by verdy_p(at)wanadoo.fr

(Guest Reply)

Titlecasing raises more complex issues than capitalizing the initial letter (or letters) of some words. If you rely on segmentation alone, you may wrongly assume that the first word always requires a capital, or that this word cannot change form.

This would be wrong for French when the first word(s) are in fact just an article (or one of a few prepositions) that may be contracted with the preceding words in a sentence. Such leading word(s) are generally ignored when sorting titles (more exactly, they are sorted as if they were at the end of the title).

So if you capitalize the title: "Les Aristochats"
(a famous old Walt Disney animated film)
this is correct only at the beginning of a sentence or in isolated form.
But you would sort it as if it were: "Aristochats, les" (note the inversion and the lowercased leading article, moved to the end).
In a sentence: "Dans les « Aristochats » ..."

And in a sentence where it occurs after a preposition like "de", the leading article contracts with it: "Dans le film des « Aristochats » ...". Note how the leading article is extracted from the title and merged into the surrounding sentence.

CLDR currently has no easy way to manage data about the possible contractions of localized items. There are several ways to address this:

  • Provide a locale-specific list of segments that require special handling when they are combined in complex format strings. This requires parsing both the constant text of the format string and the text substituted for the variable part, and it may be quite difficult to specify.
  • Alternatively, allow a resource translated into a language to optionally specify multiple forms of the same text, selected by an external grammatical case selector (this resembles the approach GNU gettext .po files use for singular/plural forms). For example, a resource could specify a variant for the genitive form, instead of assuming that it is valid to append a translated string to fixed text as you do in English in C with format strings like "of %s". The same title, without the English "of", would have an optional genitive variant in which "of " is part of the translated text; in another language another word could be used, or contractions would become possible.
  • Some complex cases may occur if a translation contextually requires inversion of words. For example, in French it is not always correct to place a qualifying adjective after the noun (it may then take on another possible meaning, or seem strange or "lourd" to native speakers); in some cases it is better placed before the noun. If the noun and the adjective are in different resources, combining them in a format string can be tricky. The same thing happens if you blindly place the adjective after or before an automatically titlecased work title.
  • In addition, it will sometimes be necessary to use information coming from the translated text, such as:
      • is it a plural?
      • is it a full sentence requiring quotation marks?
      • is it a feminine or plural noun phrase requiring a different ending on the other part of the text?

More generally, when translating a resource, there is no way to add optional metadata (such as gender, number, or grammatical case) that could be used to adapt the surrounding text where the resource is used.

comment:6 Changed 11 years ago by guest

sent reply 1

comment:7 Changed 11 years ago by verdy_p(at)wanadoo.fr

(Guest Reply)


Include some way to specify such locale-dependent metadata along with a translated resource text, together with a way to test the value of such contextual metadata in order to return the appropriate text and metadata.

For example in French, some syntax similar to:
"{test:genitive?des:les} Aristochats{set:plural}{set:masculine}{unset:sentence}"
And in English:
"{test:genitive?of} The Aristocats{set:plural}{unset:sentence}"

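One possible reading of the syntax suggested above, sketched as a small interpreter. The semantics are an assumption, since the comment does not define them: `{test:NAME?A:B}` is taken to expand to A when NAME is present in the caller's context and to B (or nothing) otherwise, while `{set:NAME}` and `{unset:NAME}` are taken to produce no text and to report a flag back to the caller. The function name `render` is hypothetical.

```python
import re

# {test:NAME?A:B}, {test:NAME?A}, {set:NAME} or {unset:NAME}
TOKEN = re.compile(r"\{(test|set|unset):([^{}?]+)(?:\?([^{}:]*)(?::([^{}]*))?)?\}")


def render(template: str, context: set[str]) -> tuple[str, dict[str, bool]]:
    """Expand a template; return the text and the flags it sets or unsets."""
    flags: dict[str, bool] = {}

    def repl(m: re.Match) -> str:
        kind, name, if_true, if_false = m.groups()
        if kind == "test":
            # Pick the first alternative when the context requests NAME.
            return (if_true or "") if name in context else (if_false or "")
        flags[name] = (kind == "set")  # set -> True, unset -> False
        return ""

    text = TOKEN.sub(repl, template).strip()
    return text, flags


FR = "{test:genitive?des:les} Aristochats{set:plural}{set:masculine}{unset:sentence}"
EN = "{test:genitive?of} The Aristocats{set:plural}{unset:sentence}"

print(render(FR, {"genitive"}))
# ('des Aristochats', {'plural': True, 'masculine': True, 'sentence': False})
print(render(FR, set())[0])         # les Aristochats
print(render(EN, {"genitive"})[0])  # of The Aristocats
print(render(EN, set())[0])         # The Aristocats
```

The returned flags are what would let the surrounding text adapt to the resource, for instance by choosing plural agreement or by deciding whether quotation marks are needed.
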
comment:8 Changed 11 years ago by guest

sent reply 2

comment:9 Changed 11 years ago by pedberg

changed notes2

comment:10 Changed 11 years ago by mark

changed notes2

comment:11 Changed 9 years ago by markus

  • Weeks set to 1

comment:12 Changed 8 years ago by pedberg

  • Milestone set to UNSCH

Blank milestone -> UNSCH per cldrbug 3400:

comment:13 Changed 8 years ago by pedberg

  • Cc markus added
  • Xref changed from 1138 to 1138 1493
  • type changed from defect to enhancement
  • Description modified (diff)

There was a duplicate cldrbug 1138: "Add Language-Sensitive titlecasing", also originated by Mark, which has some additional ideas.

comment:14 Changed 4 years ago by srl

  • Xref changed from 1138 1493 to 1138 1493 2645

ticket:2645 did this for Dutch - close this as dup?

comment:15 Changed 4 years ago by markus

  • type changed from enhancement to dtd
  • Component changed from dtd to unknown

comment:16 Changed 4 years ago by srl

  • Status changed from new to accepted

comment:17 Changed 3 months ago by mark

  • Component changed from unknown to to-assess
