Dataset for all ISO639 code sorted by country/territory?
mats.gbproject at gmail.com
Wed Nov 23 17:24:19 CST 2016
On 22 November 2016 at 20:24, Steven R. Loomis <srl at icu-project.org> wrote:
> On 11/21/16 7:06 PM, "Mats Blakstad" <mats.gbproject at gmail.com> wrote:
> Thanks for the reply, Steven!
> Also thanks to Mark Davis for explaining more about the calculation of
> language speakers within a territory.
> I'm interested in helping to provide data; however, it is not clear to me
> whether that is possible or what the criteria are.
> If you are talking about locale data – the criteria are here
Thanks for the info! It seems that several languages listed in
supplementalData.xml do not have locales, so it looks like we can easily add
new supplemental data for languages without locales. It also looks like there
is support for languages that have 0.0031% of the speakers, so several small
languages seem to be supported already.
> If you are talking about supplemental data (such as population figures,
> etc) it would be important to know what you are actually trying to do with
> the data, and where it is insufficient. Adding more data to add more data
> is not a sufficient reason.
Yes, I'm talking about the supplemental data. I don't want to add data only
"to add more data", even though I do think that building data that can
help generate more data about, and support, more languages is definitely a
I want to use the data for many things: to more easily identify the likely
second language of speakers of "lesser known languages" based on the HTTP
Accept-Language header and the territory or subdivision they are located in;
to present information in these languages, and a language switcher for them,
depending on which territory/subdivision the user is from; and to offer
users the chance to help translate into local languages depending on their
territory/subdivision. The bottom line is to be able to give a better user
experience to people speaking "lesser known languages". With a
language-territory mapping, developers will also be able to use this data in
new, creative ways to better support multilingualism.
> I do want to see better support for all languages, certainly. But that is
> a time consuming process, involving individual people and languages— not
> bulk datasets.
I do not really understand why bulk datasets should not be accepted; to me
it seems that data is added based on evidence. So whether the data is
added should depend on whether it comes from a reliable source.
Besides, I'm an individual person and I'm ready to be involved!
> Then I asked here on the list whether we could maybe manage to make a full
> language-territory mapping within CLDR, but the answers on this list so far
> have been that such a mapping would be very subjective (even though it is
> also stated that it is not needed, as Ethnologue already made a good
> dataset).
> All of this is more of a discussion to have with the Ethnologue. I browse
> the Ethnologue somewhat frequently, but I do not see the benefit in simply
> importing it into the CLDR supplemental data.
> So I suggested that, in that case, we could go for purely objective
> criteria: we map languages to territories based on evidence of the number
> of people speaking the language in the territory. With this approach it
> doesn't matter how big or small the population is, and anyone using the
> data can extract what they need based on their own criteria (e.g. only use
> languages with more than 5% of speakers within a territory). Then it's
> been said that the data for the smaller languages is not useful and that it
> is unrealistic, as not all languages have locale data, but of course these
> subjective comments don't clarify what the objective criteria are.
> What are your objective criteria?
I would say we map any language to a territory based on evidence: where
we can document a number of speakers, we add the language no matter what
status it has.
If we can't accurately give a number of speakers, but know that the
territory is the primary place where the language is spoken, we map it even
without an accurate language population. As an example, from Glottolog we
can see that the language Tem is spoken in Benin, Ghana and Togo; this
information can easily be verified by comparing with the data from the
Ethnologue:
We can't copy the Ethnologue's data for population, but at least we know
that 2 reliable sources are saying that this is the correct
Based on this evidence we can now map the Tem language to Benin, Ghana and
Togo even though we do not have exact population data.
I guess in many cases the mapping in itself is enough to do many things to
support "lesser known languages".
Those not interested in this mapping can of course easily extract only the
territory-language mappings that have an indication of language population.
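
To illustrate, here is a minimal sketch in Python of how such a mapping could
be held and filtered. It is not CLDR's actual data format: the record type
LanguageTerritory and the helpers with_population and languages_in are
hypothetical names made up for this example. The entries are only the ones
mentioned in this thread (Tem without a speaker count, Elfdalian with about
2000 speakers in Sweden).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LanguageTerritory:
    """One language-territory mapping (hypothetical record, not a CLDR type)."""
    language: str                   # language name as used in this thread
    territory: str                  # ISO 3166-1 alpha-2 territory code
    speakers: Optional[int] = None  # None when no documented figure exists


# Example entries taken from this thread only.
MAPPINGS = [
    LanguageTerritory("Tem", "BJ"),   # Benin, no population figure
    LanguageTerritory("Tem", "GH"),   # Ghana, no population figure
    LanguageTerritory("Tem", "TG"),   # Togo, no population figure
    LanguageTerritory("Elfdalian", "SE", speakers=2000),  # Sweden
]


def with_population(mappings: List[LanguageTerritory]) -> List[LanguageTerritory]:
    """Keep only the mappings that carry a documented speaker count."""
    return [m for m in mappings if m.speakers is not None]


def languages_in(mappings: List[LanguageTerritory], territory: str) -> List[str]:
    """List all languages mapped to a territory, with or without a count."""
    return sorted({m.language for m in mappings if m.territory == territory})
```

A consumer who only trusts entries with population figures would call
with_population(MAPPINGS), while one building a language switcher for Togo
could call languages_in(MAPPINGS, "TG") and still find Tem.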
> I understand that collecting a full dataset is not as easy as 1-2-3, but
> some clear criteria that apply to all languages should be developed, so
> that data can be structured in a way that makes this possible in the long
> run:
> - What is the minimum of data needed to add support for languages in CLDR?
> That information is at http://cldr.unicode.org/
> - Can any language be included?
> Theoretically, yes.
> And if not, what are the criteria we operate with? As an example, I would
> like to add Elfdalian <https://en.wikipedia.org/wiki/Elfdalian>; it is
> pretty straightforward: 2000 speakers in Sweden, in Dalarna (subdivision
> SE-W). Can I just open a ticket and get this data added to CLDR once it's
> been reviewed?
> But, just as with ancient Latin, it’s all just an interesting thought
> exercise, unless a ticket is opened.