Dataset for all ISO639 code sorted by country/territory?
Steven R. Loomis
srl at icu-project.org
Tue Nov 22 13:24:52 CST 2016
On 11/21/16 7:06 PM, "Mats Blakstad" <mats.gbproject at gmail.com> wrote:
Thanks for the reply, Steven!
Also thanks to Mark Davis for explaining more about the calculation of language speakers within a territory.
I'm interested in helping to provide data; however, it is not clear to me whether that is possible or what the criteria are.
If you are talking about locale data, the criteria are here: http://cldr.unicode.org/index/bug-reports#New_Locales
If you are talking about supplemental data (such as population figures, etc.), it would be important to know what you are actually trying to do with the data, and where it is insufficient. Adding data for its own sake is not a sufficient reason. I do want to see better support for all languages, certainly. But that is a time-consuming process, involving individual people and languages, not bulk datasets.
Then I asked here on the list whether we could maybe manage to make a full language-territory mapping within CLDR, but the answers on this list so far are that such a mapping would be very subjective (even though it has also been said that it is not needed, since Ethnologue has already made a good dataset).
All of this is more of a discussion to have with the Ethnologue. I browse the Ethnologue somewhat frequently, but I do not see the benefit in simply importing it into the CLDR supplemental data.
So I suggested that, in that case, we could go for purely objective criteria: we map languages to territories based on evidence of the number of people speaking the language in the territory. With this approach it doesn't matter how big or small the population is, and anyone using the data can extract what they need based on their own criteria (e.g. only use languages spoken by more than 5% of the people within a territory). Then it was said that the data for the smaller languages is not useful, and that the idea is unrealistic because not all languages have locale data, but of course these subjective comments don't clarify what the objective criteria are.
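To make the threshold idea above concrete, here is a minimal Python sketch of how a data consumer might apply their own cutoff to a complete language-territory table. Everything in it is an assumption made for illustration: the record layout, the function name languages_above_threshold, the placeholder codes ("aaa", "XX") and the figures are hypothetical, and none of it reflects how CLDR actually stores or filters its supplemental data.

```python
from typing import NamedTuple


class SpeakerRecord(NamedTuple):
    """One hypothetical row: speakers of a language within a territory."""
    language: str   # placeholder language code, not a real ISO 639 entry
    territory: str  # placeholder territory code
    speakers: int   # estimated number of speakers in that territory


def languages_above_threshold(records, territory_population, min_share=0.05):
    """Group languages by territory, keeping those whose share of the
    territory's population is strictly greater than min_share."""
    result = {}
    for rec in records:
        population = territory_population.get(rec.territory)
        if not population:
            # Without a population figure no share can be computed; skip.
            continue
        share = rec.speakers / population
        if share > min_share:
            result.setdefault(rec.territory, []).append((rec.language, share))
    # List the most widely spoken language first within each territory.
    for langs in result.values():
        langs.sort(key=lambda item: item[1], reverse=True)
    return result


# Hypothetical figures, for illustration only.
records = [
    SpeakerRecord("aaa", "XX", 600_000),
    SpeakerRecord("bbb", "XX", 30_000),
    SpeakerRecord("ccc", "XX", 2_000),
]
population = {"XX": 1_000_000}

selected = languages_above_threshold(records, population)
everything = languages_above_threshold(records, population, min_share=0.0)
```

With the default 5% cutoff only "aaa" is kept for "XX", while a cutoff of 0 returns all three languages. The point of the sketch is that the cutoff is chosen by whoever consumes the data rather than by whoever collects it.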
What are your objective criteria?
I understand that collecting a full dataset is not as easy as 1-2-3, but some clear criteria that apply to all languages should be developed, so that the data can be structured in a way that makes this possible in the long run:
- What is the minimum data needed to add support for a language in CLDR?
That information is at http://cldr.unicode.org/index/bug-reports#New_Locales
- Can any language be included?
And if not, what criteria do we operate with? As an example, I would like to add Elfdalian. It is pretty straightforward: 2000 speakers in Dalarna, Sweden (subdivision SE-W). Can I just open a ticket and get this data added to CLDR once it has been reviewed?
But, just as with ancient Latin, it’s all just an interesting thought exercise, unless a ticket is opened.
- What criteria are applied for language-territory mapping? For instance, in the Ethnologue there is a notion of "immigrant" languages. Should objective or subjective criteria be used?
See http://cldr.unicode.org/translation/default-content and http://cldr.unicode.org/index/cldr-spec/minimaldata . The mapping is used to determine, for example, what territory is default for de (German) - is it Germany? Switzerland? The US? Malta? All of these are possible. Which one is chosen is a judgement call in the context of locale data.
I see Ethnologue’s term defined at https://www.ethnologue.com/about/country-info – I don’t think it’s relevant to CLDR.
The way I see it, starting with some language-territory mapping, especially one that includes subdivisions, before we have reliable sources of accurate population figures could also help generate more data in the long run, as it is much easier to collect the data once it has been geographically mapped.
I’ll ask again though, what is your use case? Is it to duplicate Ethnologue? It’s hard to see the data collection mentioned here or in the other thread (geolocation data) as being relevant to locale data – which is the purpose of CLDR.
As for language status, I would be happy to start adding data, but maybe it should be clarified exactly which categories are most feasible?
I think this might be best answered when the tickets are reviewed.
On 22 November 2016 at 01:00, Steven R. Loomis <srl at icu-project.org> wrote:
I replied to your tickets http://unicode.org/cldr/trac/ticket/9915 and http://unicode.org/cldr/trac/ticket/9916 – thank you for the good ideas (as far as completeness goes), but it’s not really clear what the purpose of the ticket should be.
On 11/20/16 11:35 AM, "CLDR-Users on behalf of Mats Blakstad" <cldr-users-bounces at unicode.org on behalf of mats.gbproject at gmail.com> wrote:
I understand it would take a lot of time to collect the full data, but it also depends on how much engagement you manage to create for the work.
On the other hand, simply allowing users to start providing the data is the first step in the process, and doing so would take very little time!
It’s not clear how users are hindered from providing data now. At present, the data is very meticulously collected from a number of sources, including feedback comments.
On 20 November 2016 at 19:54, Doug Ewell <doug at ewellic.org> wrote:
I think you are genuinely underestimating the time and effort that this project would take.