Unicode Common Locale Data Repository: Bug Tracking

CLDR Ticket #10099 (accepted)

Opened 23 months ago

Last modified 7 weeks ago

Territory-Language Information wildly inaccurate

Reported by: fios@… Owned by: rick
Component: locale-demographics Data Locale:
Phase: rc Review:
Weeks: Data Xpath:


The Territory-Language Information (/cldr/charts/latest/supplemental/territory_language_information.html) is wildly inaccurate the way it's given/formatted. It appears to simply apply the *national* literacy rate across all languages spoken in that country, which is not a reflection of reality. 99% of Bavarian speakers simply are NOT literate in Bavarian. I would be surprised if this figure even approached 10%, as Bavarian is not recognized as a language and is taught virtually nowhere.

Specifically in our case, this was noticed when Mozilla used CLDR data to give stats on literacy rates on Scottish Gaelic. Looking at the UK, the data has the following serious issues:

1) It lists English, Irish and Scottish Gaelic as {0} which is factually wrong. Oddly enough, English is NOT legally the official language though it is the de-facto official language. Technically, through the Welsh Language Act, Welsh is the only official language in the UK. Irish has no legal status in Northern Ireland and Scottish Gaelic is not official in Scotland either. It has the oddly meaningless legal status of a "language enjoying equal respect".

2) 99% literacy for all the languages other than English is inaccurate. Sylheti is notorious for not being taught despite its prevalence in the Bangladeshi community. For Scottish Gaelic, the literacy figure is at BEST 37%: the 2011 census (I'd post a link but apparently that's spam...) has no separate category for native speaker literacy, but the closest measure is "speak, read and write Gaelic", which is 37.2%.

Glancing through the table, there is also the issue that it seems to conflate writing systems and languages. Simplified/Traditional Chinese are not the same thing as Mandarin, since Simplified/Traditional Chinese are equally used to write Mandarin, Cantonese, Wu etc.

This list, however well-meant, needs either to be marked as a beta version or to be taken off the public site until fixed. I suggest, for starters, immediately changing the cell formatting so the national literacy rate only applies to the (de facto) national language unless specific data is available for the other languages (I would imagine this data exists for some languages such as Basque or Catalan).
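The fallback rule suggested above can be sketched roughly as follows. This is only an illustration: the function name, its parameters and the idea of returning "unknown" as `None` are assumptions made for this sketch, not CLDR's data model or chart code.

```python
from typing import Optional


def displayed_literacy(
    is_national_language: bool,
    national_rate: float,
    language_rate: Optional[float] = None,
) -> Optional[float]:
    """Pick the literacy rate to show for one language in one territory.

    Hypothetical helper: rates are fractions between 0 and 1,
    and None means "no figure available".
    """
    # Language-specific data always wins when it exists.
    if language_rate is not None:
        return language_rate
    # Only the (de facto) national language inherits the national rate.
    if is_national_language:
        return national_rate
    # Minority or regional language without its own data: leave it unknown.
    return None


# English in the UK: no separate figure, so the national rate applies.
print(displayed_literacy(True, 0.99))
# Scottish Gaelic with the census-based figure quoted in this ticket.
print(displayed_literacy(False, 0.99, 0.372))
# A minority language with no data of its own stays unknown.
print(displayed_literacy(False, 0.99))
```

The point of the sketch is that a missing figure is reported as missing rather than silently replaced by the national rate.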


Change History

comment:1 Changed 23 months ago by fios@…

The Gaelic census data is here

comment:2 Changed 22 months ago by emmons

  • Owner changed from anybody to rick
  • Phase changed from dsub to rc
  • Status changed from new to accepted
  • Component changed from unknown to supplemental
  • Milestone changed from UNSCH to 32

comment:3 Changed 22 months ago by mark

Rick, transfer back to me once you are done, so that I can do the documentation clarification.

comment:4 Changed 22 months ago by rick

Rick to fix what can be done based on the info here, then pass along to Mark for documentation of literacy info in the charts. (FWIW, I don't think projects like Mozilla should be using CLDR lang/pop data as if CLDR were a primary source.)

comment:5 Changed 19 months ago by rick

Link to Gaelic census data is dead (404). For the other suggested numbers, there are no sources. Based on what's reported here, the only change would be to set Gaelic literacy to approximately 37% of the Gaelic-speaking population of Scotland. That is: 37% of 60k = 22,200 or so, for Scottish Gaelic literacy in the UK. This is pending a source, for example a working link to the census data.
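The estimate above is a single multiplication. A minimal sketch of the arithmetic, using the approximate figures quoted in this ticket (about 60,000 speakers, 37% literacy) and a hypothetical helper name:

```python
def literate_population(speakers: int, literacy_rate: float) -> int:
    """Estimate literate speakers as speakers * rate, rounded to a whole person.

    Hypothetical helper for illustration; literacy_rate is a fraction (0 to 1).
    """
    if not 0.0 <= literacy_rate <= 1.0:
        raise ValueError("literacy_rate must be a fraction between 0 and 1")
    return round(speakers * literacy_rate)


gaelic_speakers = 60_000  # approximate figure quoted in this ticket

# Census-based rate suggested in this ticket.
print(literate_population(gaelic_speakers, 0.37))
# National rate applied to the same speaker count, for comparison.
print(literate_population(gaelic_speakers, 0.99))
```

The comparison shows how far apart the two estimates are when the national rate is applied to a minority language.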


comment:6 Changed 19 months ago by rick

I also wonder about this statement: "when Mozilla used CLDR data to give stats on literacy rates on Scottish Gaelic" -- What is the context for that?

comment:7 Changed 19 months ago by fios@…

Mozilla used data from CLDR on the number of literate speakers in one of their subprojects (Pontoon, https://pontoon.mozilla.org/teams/; the data has since been removed as it was agreed to be unreliable) in order to help devs get a general idea of the relative "importance" of each locale (I believe the main idea was to make sure that no large locale falls behind in its localization efforts).

If there are no reliable sources, then the data shouldn't be in CLDR, no?

And your suggestion on Gaelic is way off. 37% of speakers are literate and the number of speakers is less than 1% of the Scottish population. So it's about 37% of 60,000 speakers, give or take.

I've asked scotlandcensus about the PDF that has disappeared.

comment:8 Changed 19 months ago by rick

The purpose of the data in CLDR is fairly circumscribed, not general purpose. We've been trying to put together documentation to better describe it.

comment:9 Changed 19 months ago by fios@…

Documentation won't fix the issue at hand. A lot of the data is simply so wrong that no amount of documentation or circumscription will fix it. It's like saying there are 757 million people in North Dakota and somehow concluding from that that there are 757 million speakers of Dakota. You cannot equate the national literacy rate with literacy in a minority or regional language unless ALL the education in the territory takes place through the medium of the regional/minority language and has done so since the time the region's oldest people were schooled. Even in Catalonia, where all education has been in Catalan for almost 40 years, the literacy rate in Catalan is at about 40%, give or take, whereas literacy in Spanish in Spain is at about 98%.
I'm a great fan of CLDR, but on this occasion there seems to be a logical knot in the underlying thinking behind the literacy rate for non-national languages.

comment:10 Changed 19 months ago by rick

Split out the Scottish Gaelic question into cldrbug 10474.

comment:11 Changed 19 months ago by fios@…

Hang on, if the Gaelic question is pending a reliable source (which is fair enough), then I'd *really* like to see the sources supporting 98% literacy rates for Bavarian...

comment:12 Changed 19 months ago by c933103 <c933103@…>

About one to two years ago, Wikipedia started using the territory-language data to shortlist the quick external language links shown to the left of its articles. That generated a lot of complaints, as the results seem to contradict people's understanding in many cases [there are also other problems associated with the function itself]. As a result, the module had to be turned off for some Wikipedias and the ordering adjusted manually for some wikis. Users are also told to report inaccuracies to CLDR, but CLDR does not seem to handle such reports very quickly.

One problem related to using CLDR for Wikipedia language links, which may or may not be within the scope of CLDR, is intelligibility among different languages. The CLDR Territory-Language Information page refers to "the population that is able to read and write each language, and is comfortable enough to use it with computers.". Currently, intelligibility among different languages is not taken into account when creating the CLDR list. However, as exemplified in ticket #9870, speakers of many languages can also read other languages, and are "comfortable enough to use it with computers", without ever having learned those other languages, and they are usually not counted in statistics. Would it be suitable for CLDR to consider those percentages in the territory-language information list?

This is probably a list of unresolved tickets submitted by users who are requesting changes to the CLDR territory-language information: http://unicode.org/cldr/trac/query?owner=rick&status=accepted&status=design&status=new&status=reviewfeedback&status=reviewing&col=id&col=summary&col=status&col=type&col=priority&col=milestone&col=component&col=time&desc=1&order=id

comment:13 Changed 17 months ago by rick

  • Milestone changed from 32 to 33

comment:14 Changed 11 months ago by rick

  • Milestone changed from 33 to upcoming

Bulk move to next rel.

comment:15 Changed 3 months ago by pedberg

  • Milestone changed from upcoming to UNSCH

CLDR 34 BRS closing item, move all upcoming → UNSCH

comment:16 Changed 7 weeks ago by mark

  • Component changed from other-supplemental to locale-demographics
