Dataset for all ISO639 code sorted by country/territory?

Mats Blakstad mats.gbproject at
Thu Nov 10 16:54:21 CST 2016

I'm continuing the discussion I started on unicode at here;
Sorry for posting to the wrong mailing list!

On 10 November 2016 at 20:34, Shawn Steele <Shawn.Steele at>

> I didn't really say anything because this is kinda a hopeless task, but it
> seems like some realities are being overlooked.  I'm as curious about
> cataloguing everything as the next OCD guy, but a general solution doesn't
> seem practical.
Maybe in addition to the number of speakers we could give each language
different values for the different territories, like official / unofficial,
lingua franca / home language, recognized / not recognized, etc.
Maybe we could manage to work out some more objective categories?
Then the dataset could cover more needs: those who want to use it could
extract the list they want. For example, they could make a list of
only the official languages in the world sorted by country/territory, or
maybe a list of all non-recognized languages in different countries.
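
To make the idea a bit more concrete, here is a rough sketch in Python of what
one record per language/territory pair could look like, and how someone could
extract their own list from it. The field names, the helper by_territory and
all speaker counts and flags are my own placeholders for illustration, not
CLDR data or real statistics.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageInTerritory:
    language: str    # ISO 639-3 code
    territory: str   # ISO 3166-1 alpha-2 code
    speakers: int    # placeholder number, NOT a real statistic
    official: bool   # placeholder flag for illustration
    recognized: bool # placeholder flag for illustration


# Hypothetical sample records; every value here is made up for the sketch.
RECORDS = [
    LanguageInTerritory("nob", "NO", 1000, official=True, recognized=True),
    LanguageInTerritory("urd", "NO", 10, official=False, recognized=False),
    LanguageInTerritory("deu", "DE", 2000, official=True, recognized=True),
    LanguageInTerritory("tur", "DE", 50, official=False, recognized=False),
]


def by_territory(records, predicate):
    """Group the languages that satisfy `predicate` by territory.

    Territories are ordered by code; languages by speaker count, largest first.
    """
    result = defaultdict(list)
    for rec in sorted(records, key=lambda r: (r.territory, -r.speakers)):
        if predicate(rec):
            result[rec.territory].append(rec.language)
    return dict(result)


official = by_territory(RECORDS, lambda r: r.official)
not_recognized = by_territory(RECORDS, lambda r: not r.recognized)
```

More categories (lingua franca / home language and so on) would just be more
fields on the record, and each user of the data would write their own filter.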

> * There are a *lot* of languages
Yes :) We would not get them all at the start, but if we could start adding data
for all the languages, it can be done little by little.
I have many contacts who I think could be interested in helping to
add information.

> * Many countries have speakers of several languages.
>         * In the US it's "obvious" that a list of languages for the US
> should include "English"
For sure! The number of speakers, and the fact that it is the primary language
used, speak for it.
Besides, isn't "US English" considered a variant of English?

>         * Spanish in the US is less obvious, however it is often
> considered important.
It is an interesting issue. Wasn't Spanish the primary language in the southern
US while it was a part of Mexico?
And aren't there a lot of Spanish newspapers/media in the US?

>         * However, that's a slippery slope as there are many other
> languages with large groups of speakers in the US.  If such a list includes
> Spanish, should it not include some of the others?  San Francisco requires
> documents in 4 languages but provides telephone help for 200 languages.
> Where's the line?
> * Some languages happen in many places.  There are a disproportionate # of
> Englishes in CLDR, however Chinese is also spoken in lots of the countries
> that have English available in CLDR.  Yet CLDR doesn't provide data for
> those.
Could you elaborate a little bit on this?

> * Some language/region combinations could encounter geopolitical issues.
> Like "it's not legal for that language to be spoken in XX" (but it
> happens).  Or "that language isn't YY country's language, it's ours!!!"
We could add the documented number of speakers and tag it as "not recognized".

> * The requirement "where the language has been spoken traditionally" is
> really, really subjective.  "Traditionally" the US is an English speaking
> country.  However, "Traditionally", there are hundreds of languages that
> have been spoken in the US.  What could be more "traditional" than the
> native American languages?  Yet those often have low numbers of speakers in
> the modern world, many are even dying languages.  There are also a number
> of "traditional" languages spoken by the original settlers.  Which differ
> than the set of languages spoken by modern immigrants.  So your data is
> going to be very skewed depending on the person collecting the data's
> definition of "traditional".
I agree "traditional" is not a good way to collect the data.
Native American languages should of course be mapped to territories
despite having few speakers. The point is to map all languages.
We could also map languages with years: English has then been spoken in what
is the USA today since 1607.
Urdu has been spoken in what is today Norway since the 1970s.

> Ethnologue has done a decent job of identifying languages and the number
> of speakers in various areas, but it would be very difficult to draw a line
> that selected "English and Spanish in the US" and was consistent with
> similar real-life impacts across the other languages.  Do you pick the top
> n languages for each country?  Languages with > x million speakers (that
> would be very different in small and big countries).  Languages with > y%
> of the speakers in the different countries?
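
The three possible cut-offs mentioned above (top n languages, more than x
speakers, more than y% of the population) can be compared with a small sketch.
The function name select_languages, its parameters and all the numbers are
assumptions made up for illustration; none of them are real figures.

```python
def select_languages(speakers, population, top_n=None,
                     min_speakers=0, min_share=0.0):
    """Return language labels for one territory, largest first.

    speakers:     dict mapping a language label to its speaker count
    population:   total population of the territory
    top_n:        keep at most this many languages (None means no limit)
    min_speakers: keep only languages with more than this many speakers
    min_share:    keep only languages spoken by more than this fraction
    """
    ranked = sorted(speakers.items(), key=lambda kv: kv[1], reverse=True)
    kept = [lang for lang, count in ranked
            if count > min_speakers and count / population > min_share]
    return kept if top_n is None else kept[:top_n]


# Made-up territories: one small, one big (placeholder numbers only).
small = {"lang_a": 300_000, "lang_b": 40_000}
big = {"lang_a": 200_000_000, "lang_b": 40_000_000, "lang_c": 2_000_000}

# An absolute threshold drops every language of the small territory ...
small_abs = select_languages(small, 400_000, min_speakers=1_000_000)
big_abs = select_languages(big, 300_000_000, min_speakers=1_000_000)

# ... while a 5% share keeps both of them and drops lang_c in the big one.
small_rel = select_languages(small, 400_000, min_share=0.05)
big_rel = select_languages(big, 300_000_000, min_share=0.05)
```

With these placeholder numbers the absolute and the relative thresholds give
different lists for the same territory, which is the skew described above.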

If Ethnologue has done it, I guess it should also be possible for CLDR.
However, they operate with a category "Immigrant Languages", and I'm not sure
what that means. As an example, Turkish, the second most spoken language of
Germany, is marked as an "Immigrant Language"; I'm not sure how they make
that distinction.

> And then you end up with each application having to figure out it's own
> bar.  Applications will have different market considerations and other
> reasons to target different regions/languages.  That would skew any list
> for their purposes.

Okay, at least it should be possible to add the number of speakers for the other
"6,300 lesser-known living languages", or why do we cut the list to 675

More information about the CLDR-Users mailing list