Dataset for all ISO639 code sorted by country/territory?

Hugh Paterson hugh_paterson at
Tue Nov 22 02:05:01 CST 2016


Just a thought,

What do you gain by using the Ethnologue tables (ISO 8859-1 encoded tables)
over just using the open licensed ISO 639-3 tables (in UTF-8)? I have noticed some differences
in the names of languages in these two files. I would stick with the UTF-8
tables. The UTF-8 tables are the source of the Ethnologue data, not the
other way round.
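
To see where the names differ, one could decode each table with its own encoding and compare names code by code. Below is a minimal Python sketch of that idea. The column names (Id and Ref_Name for the ISO 639-3 table, LangID and Name for the Ethnologue table) are my assumption about the file layouts, and the rows are made-up placeholders using private-use codes, not real data.

```python
import csv
import io


def read_names(raw: bytes, encoding: str, id_col: str, name_col: str) -> dict[str, str]:
    """Decode a tab-separated table and return {language code: name}."""
    text = raw.decode(encoding)
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[id_col]: row[name_col] for row in reader}


def name_differences(iso: dict[str, str], eth: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {code: (iso_name, eth_name)} for shared codes whose names differ."""
    shared = sorted(iso.keys() & eth.keys())
    return {code: (iso[code], eth[code]) for code in shared if iso[code] != eth[code]}


# Made-up rows with private-use codes (qaa, qab); column names are assumptions.
iso_raw = "Id\tRef_Name\nqaa\tEx\u00e4mple A\nqab\tExample B\n".encode("utf-8")
eth_raw = "LangID\tName\nqaa\tEx\u00e4mple A\nqab\tExample, B\n".encode("iso-8859-1")

iso_names = read_names(iso_raw, "utf-8", "Id", "Ref_Name")
eth_names = read_names(eth_raw, "iso-8859-1", "LangID", "Name")
print(name_differences(iso_names, eth_names))
```

Decoding each file with its own encoding first means that names with accented characters compare equal when they really are the same, so only genuine naming differences are reported.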

The Ethnologue does provide a language-country correspondence, but its
license does not necessarily allow that data to be changed. However, there
is another project called Glottolog, which proposes a GPS coordinate for
most languages (their definition of a "language" differs from ISO 639-3's,
but their data includes the ISO 639-3 set of language distinctions).
Glottolog data is a bit more open than the Ethnologue data: Glottolog 2.7
is licensed under Creative Commons Attribution-ShareAlike 3.0 and is
available on GitHub.

Now we can't just go out and build upon the Ethnologue's data tables, but
with a GPS coordinate in an open data table, a query of the GeoHack API
would return a country code and a secondary administrative unit for the
political entity containing that coordinate. Here is an example of using the
coordinates for Frankfurt a. M., Germany.
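
The overall shape of that lookup could be sketched in Python as follows. This is only a sketch under stated assumptions: I am not reproducing any real service's API here, so the reverse geocoder is passed in as a plain function, and toy_lookup is a hypothetical stand-in that knows a single crude box around Hesse (not real boundaries). The language codes are private-use placeholders.

```python
from typing import Callable, NamedTuple, Optional


class Place(NamedTuple):
    country: str                # ISO 3166-1 alpha-2 code, e.g. "DE"
    subdivision: Optional[str]  # ISO 3166-2 code, e.g. "DE-HE"


# Any function taking (latitude, longitude) and returning a Place or None.
ReverseGeocoder = Callable[[float, float], Optional[Place]]


def places_for_languages(
    coords: dict[str, tuple[float, float]], lookup: ReverseGeocoder
) -> dict[str, Optional[Place]]:
    """Map each language code to the place containing its coordinate."""
    return {code: lookup(lat, lon) for code, (lat, lon) in coords.items()}


def toy_lookup(lat: float, lon: float) -> Optional[Place]:
    """Hypothetical stand-in for a real service: one rough box around Hesse."""
    if 49.4 <= lat <= 51.7 and 7.8 <= lon <= 10.2:
        return Place("DE", "DE-HE")
    return None


# Placeholder codes; the first coordinate is roughly Frankfurt a. M.
coords = {"qaa": (50.11, 8.68), "qab": (0.0, 0.0)}
result = places_for_languages(coords, toy_lookup)
print(result)
```

Keeping the geocoder as a parameter means the same loop works with whichever service or offline boundary data one ends up using.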

Now, the accessible Ethnologue tables could be used to verify GPS point
data obtained from Glottolog. If there were a discrepancy between the two
data sets, one would have to make an editorial choice between the two
sources. Essentially, however, the functionality of the language-country
correspondence would be replicated, albeit from different sources, and
merely verified to be congruent with the Ethnologue data tables.
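
The verification step could look something like the Python sketch below. It assumes each language has one country derived from its coordinate and a set of countries listed in the reference table. The function name, the report categories, and the sample rows are all hypothetical.

```python
def compare_sources(
    derived: dict[str, str], reference: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Sort language codes by whether the derived country matches the reference."""
    report: dict[str, list[str]] = {"agree": [], "conflict": [], "unverified": []}
    for code in sorted(derived):
        if code not in reference:
            # Nothing to check against; leave for manual review.
            report["unverified"].append(code)
        elif derived[code] in reference[code]:
            report["agree"].append(code)
        else:
            # The two sources disagree; an editorial choice is needed.
            report["conflict"].append(code)
    return report


# Made-up rows with private-use language codes; not real data.
derived = {"qaa": "DE", "qab": "SE", "qac": "FR"}
reference = {"qaa": {"DE", "AT"}, "qab": {"NO"}}
report = compare_sources(derived, reference)
print(report)
```

Only the codes under "conflict" would need a decision between the two sources.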

I agree with you that there is great value in open data sets.

all the best,

Hugh Paterson III

On Mon, Nov 21, 2016 at 7:06 PM, Mats Blakstad <mats.gbproject at>

> Thanks for the reply, Steven!
> Also thanks to Mark Davis for explaining more about calculation of
> language speakers within a territory.
> I'm interested to help provide data - however to me it is not clear if it
> is possible or what the criteria are.
> I initially wanted to use a language-country dataset from the Ethnologue:
> I wanted to try playing with this data, for example filtering out only
> living languages, merging it with data from the IANA subtag registry and
> CLDR locales to also map different variants and standards of languages,
> and seeing if I could make some infographics or compile it with data from
> other sources.
> However, even though this data is free to download, it is licensed, you
> can't change it and you can't make it available for others to download.
> I contacted the Ethnologue to hear if I could use the data. After one month
> I got an answer that they want to see an example of the new dataset and
> then they can give me a price for it.
> As I see it this puts a lot of constraints on me. I don't have money to buy
> that dataset from the Ethnologue and I don't want to go and ask them every
> time I want to make changes or try something new (and maybe need to wait
> one month every time for their answer). I guess this is also one of the
> advertised benefits of open source data: you can simply adapt and use it
> for your own purposes without needing to ask anyone.
> Then I asked here on the list if we could maybe manage to make a full
> language-territory mapping within CLDR, but the answers on this list so
> far are that such a mapping would be very subjective (even though it is
> also stated that it is not needed as Ethnologue made a good dataset already).
> So I suggested that, if so, we could go for purely objective criteria: we
> map languages to territories based on evidence of the number of people
> speaking the language in the territory. With this approach it doesn't
> matter how big or small the population is, and anyone using the data can
> extract the data they need based on their own criteria (e.g. only use
> languages with more than 5% of speakers within a territory). Then it's
> been said that the data for the smaller languages is not useful and that it
> is unrealistic as not all languages have locale data, but of course these
> subjective comments don't clarify what the objective criteria are.
> I understand that it is not just a 1-2-3 to collect a full dataset, but
> some clear criteria that apply to all languages should be developed, so
> that data can be structured in a way that lets it be done in the long run:
> - What is the minimum of data needed to add support for languages in CLDR?
> - Can any language be included? And if not, what are the criteria we
> operate with? As an example, I would like to add Elfdalian; it is pretty
> straightforward: 2000 speakers in Sweden in Dalarna (subdivision SE-W).
> Can I just open a ticket and get this data added to CLDR once it's been
> reviewed?
> - What criteria are applied for language-territory mapping? For instance,
> in the Ethnologue there is a notion of "immigrant" languages. Should
> objective or subjective criteria be used?
> The way I see it, starting with some language-territory mapping,
> especially including mapping with subdivisions, before we have reliable
> sources of accurate population, could also help generate more data in the
> long run, as it is much easier to try to collect the data once it has been
> geographically mapped.
> About language status I would be happy to start adding data, but maybe it
> should be clarified exactly which categories are most feasible?
> Mats
> On 22 November 2016 at 01:00, Steven R. Loomis <srl at>
> wrote:
>> Mats,
>>  I replied to your tickets and
>> – thank you for the good ideas
>> (as far as completeness goes), but it’s not really clear what the purpose
>> of the ticket should be.
>> El 11/20/16 11:35 AM, "CLDR-Users en nombre de Mats Blakstad" <
>> cldr-users-bounces at en nombre de mats.gbproject at>
>> escribió:
>> I understand it would take a lot of time to collect the full data, but it
>> also depends on how much engagement you manage to create for the work.
>> On the other side: simply allowing users to start providing the data is
>> the first step in the process, and it would take very little time to do!
>> It’s not clear how users are hindered from providing data now?  At
>> present, the data is very meticulously collected from a number of sources,
>> including feedback comments.
>> Steven
>> On 20 November 2016 at 19:54, Doug Ewell <doug at> wrote:
>>> Mats,
>>> I think you are genuinely underestimating the time and effort that this
>>> project would take.