Dataset for all ISO639 code sorted by country/territory?
hugh_paterson at sil.org
Wed Nov 16 11:42:28 CST 2016
Also, after thinking about this some more: If, as is the stated case with San
Francisco:
"San Francisco requires documents in 4 languages but provides telephone
help for 200 languages. Where's the line?"
How would you propose that Unicode database maintainers
de-list institutional support for languages when institutional support
i.e., let's say that San Francisco falls on some hard times financially and
cannot afford to operate in 4 languages, and reduces their support to two
languages. How is this to be reflected in this proposal?
- Hugh Paterson III
On Thu, Nov 10, 2016 at 2:54 PM, Mats Blakstad <mats.gbproject at gmail.com>
> I'm continuing the discussion I started on unicode at unicode.org here;
> Sorry for posting in the wrong email list!
> On 10 November 2016 at 20:34, Shawn Steele <Shawn.Steele at microsoft.com>
>> I didn't really say anything because this is kinda a hopeless task, but
>> it seems like some realities are being overlooked. I'm as curious about
>> cataloguing everything as the next OCD guy, but a general solution doesn't
>> seem practical.
> Maybe in addition to the number of speakers we could give each language
> different values for the different territories, like official / unofficial,
> lingua franca / home language, recognized / not recognized, etc.
> Maybe we could manage to work out some more objective categories?
> Then the dataset could cover more of the different needs of those who want to
> use it to extract the list they want. For example, they could make a list of
> only the official languages in the world sorted by country/territory, or
> maybe a list of all non-recognized languages in different countries.
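To make the idea of per-territory values concrete, here is a minimal sketch of what such a dataset and its extraction could look like. Everything in it is an assumption for illustration: the record layout, the field names and the helper languages_by_territory are hypothetical, the language codes are ISO 639 private-use codes, the territory codes are ISO 3166 user-assigned codes, and the speaker counts are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageInTerritory:
    """One language/territory pair with hypothetical status tags."""
    language: str     # ISO 639 code (private-use codes used as placeholders)
    territory: str    # ISO 3166 code (user-assigned codes used as placeholders)
    speakers: int     # made-up numbers, not real statistics
    official: bool
    recognized: bool


# Placeholder records; none of these describe a real language or country.
RECORDS = [
    LanguageInTerritory("qaa", "XA", 900_000, official=True, recognized=True),
    LanguageInTerritory("qab", "XA", 50_000, official=False, recognized=True),
    LanguageInTerritory("qac", "XA", 2_000, official=False, recognized=False),
    LanguageInTerritory("qaa", "XB", 10_000, official=False, recognized=False),
    LanguageInTerritory("qab", "XB", 400_000, official=True, recognized=True),
]


def languages_by_territory(records, keep=lambda record: True):
    """Group the language codes of matching records by territory, sorted."""
    grouped = defaultdict(list)
    for record in records:
        if keep(record):
            grouped[record.territory].append(record.language)
    return {territory: sorted(langs) for territory, langs in sorted(grouped.items())}


# Two of the lists mentioned above, extracted from the same records.
official = languages_by_territory(RECORDS, lambda r: r.official)
unrecognized = languages_by_territory(RECORDS, lambda r: not r.recognized)
print(official)
print(unrecognized)
```

The point of the sketch is only that each consumer applies its own filter to one shared set of records, rather than the dataset deciding in advance which languages "count" for a territory.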
>> * There are a *lot* of languages
> Yes :) We would not get all at the start, but if we could start adding data
> for all the languages it can be done little by little.
> I myself have many contacts who I think could be interested in helping to
> add information.
>> * Many countries have speakers of several languages.
>> * In the US it's "obvious" that a list of languages for the US
>> should include "English"
> For sure! The number of speakers and the fact that it is the primary language
> used speak for it.
> Besides, isn't "US English" considered a variant of English?
>> * Spanish in the US is less obvious, however it is often
>> considered important.
> It is an interesting issue. Wasn't Spanish the primary language in the southern
> US while it was a part of Mexico?
> And are there not a lot of Spanish newspapers/media in the US?
>> * However, that's a slippery slope as there are many other
>> languages with large groups of speakers in the US. If such a list includes
>> Spanish, should it not include some of the others? San Francisco requires
>> documents in 4 languages but provides telephone help for 200 languages.
>> Where's the line?
>> * Some languages happen in many places. There are a disproportionate #
>> of Englishes in CLDR, however Chinese is also spoken in lots of the
>> countries that have English available in CLDR. Yet CLDR doesn't provide
>> data for those.
> Could you elaborate a little bit on this?
>> * Some language/region combinations could encounter geopolitical issues.
>> Like "it's not legal for that language to be spoken in XX" (but it
>> happens). Or "that language isn't YY country's language, it's ours!!!"
> We could add the documented number of speakers and tag it as "not recognized".
>> * The requirement "where the language has been spoken traditionally" is
>> really, really subjective. "Traditionally" the US is an English speaking
>> country. However, "Traditionally", there are hundreds of languages that
>> have been spoken in the US. What could be more "traditional" than the
>> native American languages? Yet those often have low numbers of speakers in
>> the modern world, many are even dying languages. There are also a number
>> of "traditional" languages spoken by the original settlers, which differ
>> from the set of languages spoken by modern immigrants. So your data is
>> going to be very skewed depending on the person collecting the data's
>> definition of "traditional".
> I agree "traditional" is not a good way to collect the data.
> Native American languages should of course be mapped with territories
> despite having few speakers. The point is to map all languages.
> We could also map languages with years: English has then been spoken in what
> is the USA today since 1607.
> Urdu has been spoken in what is today Norway since the 1970s.
>> Ethnologue has done a decent job of identifying languages and the number
>> of speakers in various areas, but it would be very difficult to draw a line
>> that selected "English and Spanish in the US" and was consistent with
>> similar real-life impacts across the other languages. Do you pick the top
>> n languages for each country? Languages with > x million speakers (that
>> would be very different in small and big countries). Languages with > y%
>> of the speakers in the different countries?
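The three candidate cut-offs above (top n, more than x speakers, more than y% of speakers) can be compared side by side in a small sketch. All of it is assumed for illustration: the function names are hypothetical, the codes are private-use placeholders, the counts are invented, and "share" is taken to mean a language's fraction of the speakers listed for that territory.

```python
# Made-up speaker counts for one large and one small placeholder territory.
SPEAKERS = {
    "XA": {"qaa": 9_000_000, "qab": 2_000_000, "qac": 500_000},
    "XB": {"qaa": 40_000, "qab": 30_000, "qac": 20_000, "qad": 10_000},
}


def top_n(speakers, n):
    """The n languages with the most speakers (ties broken by code)."""
    return sorted(speakers, key=lambda lang: (-speakers[lang], lang))[:n]


def above_count(speakers, minimum):
    """Languages with strictly more than `minimum` speakers."""
    return sorted(lang for lang, count in speakers.items() if count > minimum)


def above_share(speakers, fraction):
    """Languages spoken by more than `fraction` of the listed speakers."""
    total = sum(speakers.values())
    return sorted(lang for lang, count in speakers.items() if count > fraction * total)


for territory, counts in SPEAKERS.items():
    print(territory,
          top_n(counts, 2),
          above_count(counts, 1_000_000),
          above_share(counts, 0.15))
```

With these invented numbers the three rules agree for the large territory, while for the small one the absolute threshold selects nothing and the percentage rule selects more languages than the top-2 rule, which is the inconsistency described above.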
> If Ethnologue has done it, I guess it should also be possible for CLDR.
> However, they operate with a category "Immigrant Languages", and I'm not sure
> what that means. As an example, Turkish, the second most spoken language of
> Germany, is marked as an "Immigrant Language"; I'm not sure how they make
> that distinction.
>> And then you end up with each application having to figure out its own
>> bar. Applications will have different market considerations and other
>> reasons to target different regions/languages. That would skew any list
>> for their purposes.
> Okay, at least it could be possible to add the number of speakers for the other
> "6,300 lesser-known living languages", or why do we cut the list to 675
> CLDR-Users mailing list
> CLDR-Users at unicode.org