Unicode Common Locale Data Repository: Bug Tracking

CLDR Ticket #10099 (accepted data)

Opened 4 months ago

Last modified 2 months ago

Territory-Language Information wildly inaccurate

Reported by: fios@…
Owned by: rick
Component: supplemental
Data Locale:
Phase: rc
Review:
Weeks:
Data Xpath:


The Territory-Language Information chart (/cldr/charts/latest/supplemental/territory_language_information.html) is wildly inaccurate in the way its figures are given and formatted. It appears to apply the *national* literacy rate to every language spoken in a country, which does not reflect reality. 99% of Bavarian speakers are NOT literate in Bavarian. I would be surprised if the figure even approached 10%, since Bavarian is not officially recognized as a language and is taught virtually nowhere.

In our case, this was noticed when Mozilla used CLDR data to give literacy statistics for Scottish Gaelic. Looking at the UK, the data has the following serious issues:

1) It lists English, Irish and Scottish Gaelic as official, which is factually wrong. Oddly enough, English is NOT legally the official language, though it is the de facto official language. Technically, through the Welsh Language Act, Welsh is the only official language in the UK. Irish has no legal status in Northern Ireland, and Scottish Gaelic is not official in Scotland either: it has the oddly meaningless legal status of a "language enjoying equal respect".

2) 99% literacy for every language other than English is inaccurate. Sylheti is notorious for not being taught despite its prevalence in the Bangladeshi community. For Scottish Gaelic, the literacy figure is at BEST 37%: the 2011 census (I'd post a link, but apparently that's spam...) has no separate category for native-speaker literacy, but the closest measure, "speak, read and write Gaelic", is 37.2%.

Glancing through the table, there is also the issue that it seems to conflate writing systems with languages. Simplified and Traditional Chinese are scripts, not populations of Mandarin speakers: both are used to write Mandarin, Cantonese, Wu, etc.

This list, however well-meant, needs either to be marked as a beta version or to be taken off the public site until it is fixed. For starters, I suggest immediately changing the cell formatting so that the national literacy rate applies only to the (de facto) national language unless specific data is available for the other languages (I would imagine such data exists for some languages, such as Basque or Catalan).
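
To make the suggested rule concrete, here is a rough sketch in Python. Everything in it is hypothetical: the function name, the parameters and the data layout are made up for illustration and are not CLDR's actual format or tooling. The only figures used are the ones quoted in this ticket (the 99% national rate and the 37.2% census measure for Gaelic).

```python
from typing import Dict, Optional


def literacy_for_display(
    language: str,
    national_language: str,
    national_rate: float,
    per_language_rates: Dict[str, float],
) -> Optional[float]:
    """Return the literacy rate to show for one language row, or None if unknown.

    All names here are hypothetical; this only illustrates the proposed rule.
    """
    # Prefer a figure measured for this specific language, when one exists.
    if language in per_language_rates:
        return per_language_rates[language]
    # The national rate is only used for the (de facto) national language.
    if language == national_language:
        return national_rate
    # Otherwise leave the cell empty instead of inheriting the national figure.
    return None


# Illustrative UK rows, using only the figures quoted in this ticket.
uk_specific = {"Scottish Gaelic": 37.2}
for lang in ("English", "Scottish Gaelic", "Sylheti"):
    print(lang, literacy_for_display(lang, "English", 99.0, uk_specific))
```

With this rule, Sylheti would show no literacy figure at all until someone supplies one, rather than silently inheriting the national 99%.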


Change History

comment:1 Changed 4 months ago by fios@…

The Gaelic census data is here

comment:2 Changed 2 months ago by emmons

  • Owner changed from anybody to rick
  • Phase changed from dsub to rc
  • Status changed from new to accepted
  • Component changed from unknown to supplemental
  • Milestone changed from UNSCH to 32

comment:3 Changed 2 months ago by mark

Rick, transfer back to me once you are done, so that I can do the documentation clarification.

comment:4 Changed 2 months ago by rick

Rick to fix what can be done based on the info here, then pass along to Mark for documentation of literacy info in the charts. (FWIW, I don't think projects like Mozilla should be using CLDR lang/pop data as if CLDR were a primary source.)

