Dataset for all ISO639 code sorted by country/territory?

Hugh Paterson hugh_paterson at
Wed Nov 16 11:42:28 CST 2016

Also, after thinking about this some more, take the stated case of San
Francisco:
"San Francisco requires documents in 4 languages but provides telephone
help for 200 languages.  Where's the line?"

How would you propose that Unicode database maintainers de-list
institutional support for languages when institutional support is withdrawn?

i.e., let's say that San Francisco falls on some hard times financially and
cannot afford to operate in 4 languages, and reduces its support to two
languages. How is this to be reflected in this proposal?

- Hugh Paterson III

On Thu, Nov 10, 2016 at 2:54 PM, Mats Blakstad <mats.gbproject at>

> I'm continuing the discussion I started on unicode at here;
> Sorry for posting in wrong email list!
> On 10 November 2016 at 20:34, Shawn Steele <Shawn.Steele at>
> wrote:
>> I didn't really say anything because this is kinda a hopeless task, but
>> it seems like some realities are being overlooked.  I'm as curious about
>> cataloguing everything as the next OCD guy, but a general solution doesn't
>> seem practical.
> Maybe in addition to the number of speakers we could give each language
> different values for the different territories, like official / unofficial,
> lingua franca / home language, recognized / not recognized, etc.
> Maybe we could manage to work out some more objective categories?
> Then the dataset could cover more needs, since those who want to use it
> could extract the list they want. For example, they could make a list of
> only the official languages in the world sorted by country/territory, or
> maybe a list of all non-recognized languages in different countries.
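The kind of dataset described above could be sketched roughly as follows. This is only an illustration, not an existing CLDR structure: the `LanguageInTerritory` record, its field names, and the `by_territory` helper are all hypothetical, and the flag values are placeholders chosen to exercise the filters, not verified data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class LanguageInTerritory:
    """Hypothetical record: one language as used in one territory."""
    language: str                    # ISO 639 code, e.g. "en"
    territory: str                   # ISO 3166 code, e.g. "US"
    speakers: Optional[int] = None   # None when no documented figure exists
    official: bool = False
    recognized: bool = False
    usage: Optional[str] = None      # e.g. "lingua franca" or "home language"


# Placeholder records: the flags are illustrative assumptions, not real data.
RECORDS = [
    LanguageInTerritory("en", "US", recognized=True),
    LanguageInTerritory("es", "US"),
    LanguageInTerritory("nb", "NO", official=True, recognized=True),
    LanguageInTerritory("ur", "NO"),
]


def by_territory(
    records: List[LanguageInTerritory],
    predicate: Callable[[LanguageInTerritory], bool],
) -> Dict[str, List[str]]:
    """Group the languages matching `predicate` by territory, sorted by code."""
    result: Dict[str, List[str]] = {}
    for record in sorted(records, key=lambda r: (r.territory, r.language)):
        if predicate(record):
            result.setdefault(record.territory, []).append(record.language)
    return result


# The two example lists mentioned in the message.
official = by_territory(RECORDS, lambda r: r.official)
unrecognized = by_territory(RECORDS, lambda r: not r.recognized)
```

Keeping each attribute as a separate field means every consumer can apply its own filter instead of relying on one fixed list.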
>> * There are a *lot* of languages
> Yes :) We would not get them all at the start, but if we could start adding
> data for all the languages, it can be done little by little.
> For myself I have many contacts that I think could be interested to help
> add information.
>> * Many countries have speakers of several languages.
>>         * In the US it's "obvious" that a list of languages for the US
>> should include "English"
> For sure! The number of speakers and the fact that it is the primary
> language used speak for it.
> Besides, isn't "US English" considered a variant of English?
>>         * Spanish in the US is less obvious, however it is often
>> considered important.
> It is an interesting issue. Wasn't Spanish the primary language in the
> southern US while it was a part of Mexico?
> And aren't there a lot of Spanish newspapers/media in the US?
>>         * However, that's a slippery slope as there are many other
>> languages with large groups of speakers in the US.  If such a list includes
>> Spanish, should it not include some of the others?  San Francisco requires
>> documents in 4 languages but provides telephone help for 200 languages.
>> Where's the line?
>> * Some languages happen in many places.  There are a disproportionate #
>> of Englishes in CLDR, however Chinese is also spoken in lots of the
>> countries that have English available in CLDR.  Yet CLDR doesn't provide
>> data for those.
> Could you elaborate a little bit on this?
>> * Some language/region combinations could encounter geopolitical issues.
>> Like "it's not legal for that language to be spoken in XX" (but it
>> happens).  Or "that language isn't YY country's language, it's ours!!!"
> We could add the documented number of speakers and tag it as "not recognized".
>> * The requirement "where the language has been spoken traditionally" is
>> really, really subjective.  "Traditionally" the US is an English speaking
>> country.  However, "Traditionally", there are hundreds of languages that
>> have been spoken in the US.  What could be more "traditional" than the
>> native American languages?  Yet those often have low numbers of speakers in
>> the modern world, many are even dying languages.  There are also a number
>> of "traditional" languages spoken by the original settlers, which differ
>> from the set of languages spoken by modern immigrants.  So your data is
>> going to be very skewed depending on the person collecting the data's
>> definition of "traditional".
> I agree "traditional" is not a good way to collect the data.
> Native American languages should of course be mapped with territories
> despite having few speakers. The point is to map all languages.
> We could also map languages with years: English has then been spoken in
> what is the USA today since 1607.
> Urdu has been spoken in what is today Norway since the 1970s.
>> Ethnologue has done a decent job of identifying languages and the number
>> of speakers in various areas, but it would be very difficult to draw a line
>> that selected "English and Spanish in the US" and was consistent with
>> similar real-life impacts across the other languages.  Do you pick the top
>> n languages for each country?  Languages with > x million speakers (that
>> would be very different in small and big countries).  Languages with > y%
>> of the speakers in the different countries?
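The three candidate cut-offs named above (top n languages, more than x speakers, more than y% of the population) can be compared with a small sketch. Everything here is assumed for illustration: the function names are hypothetical, the speaker counts are invented, and the language codes come from the ISO 639 range reserved for local use (qaa to qtz) so that they do not refer to any real language.

```python
from typing import Dict, List


def top_n(speakers: Dict[str, int], n: int) -> List[str]:
    """The n languages with the most speakers (ties broken by code)."""
    ranked = sorted(speakers.items(), key=lambda kv: (-kv[1], kv[0]))
    return [lang for lang, _ in ranked[:n]]


def more_than(speakers: Dict[str, int], minimum: int) -> List[str]:
    """Languages with more than `minimum` speakers."""
    return sorted(lang for lang, count in speakers.items() if count > minimum)


def more_than_share(speakers: Dict[str, int], population: int, share: float) -> List[str]:
    """Languages spoken by more than `share` (0..1) of the population."""
    return sorted(
        lang for lang, count in speakers.items() if count / population > share
    )


# Invented figures for one large and one small territory.
BIG_POPULATION = 300_000_000
BIG = {"qaa": 250_000_000, "qab": 40_000_000, "qac": 2_000_000}

SMALL_POPULATION = 500_000
SMALL = {"qad": 400_000, "qae": 90_000}

big_top2 = top_n(BIG, 2)                                        # qaa, qab
big_abs = more_than(BIG, 1_000_000)                             # all three
small_abs = more_than(SMALL, 1_000_000)                         # nothing qualifies
big_share = more_than_share(BIG, BIG_POPULATION, 0.10)          # qaa, qab
small_share = more_than_share(SMALL, SMALL_POPULATION, 0.10)    # qad, qae
```

With these made-up numbers, the absolute threshold keeps a language spoken by under 1% of the large territory while dropping every language of the small one, which is the skew between small and big countries described above.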
> If Ethnologue has done it, I guess it should be possible for CLDR
> too?
> However, they operate with a category "Immigrant Languages"; I'm not sure
> what that means. As an example, Turkish, the second most spoken language of
> Germany, is marked as an "Immigrant Language"; I'm not sure how they make
> that distinction.
>> And then you end up with each application having to figure out its own
>> bar.  Applications will have different market considerations and other
>> reasons to target different regions/languages.  That would skew any list
>> for their purposes.
> Okay, at least it could be possible to add the number of speakers for the
> other "6,300 lesser-known living languages", or why do we cut the list to
> 675 languages?
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at