Dataset for all ISO639 code sorted by country/territory?

Philippe Verdy verdy_p at
Mon Nov 21 13:08:29 CST 2016

2016-11-21 1:50 GMT+01:00 Richard Wordingham <
richard.wordingham at>:

> On Sun, 20 Nov 2016 22:53:54 +0100
> Philippe Verdy <verdy_p at> wrote:
> > I think it just requires a minimal dataset: ask for it, submit the
> > data, it will be made available for vetting, and if vetting makes it
> > suitable for publication with the minimal core set of properties, it
> > will be added to the published list.
> The minimal data set can be difficult to collect, and may actually be
> impossible.  There may be technical issues - can one actually specify
> that today's date is "a.d. XI Kal. Dec. a.u.c. MMDCCLXIX" in
> Classical Latin?

If you speak about Classical Latin, "a.d. XI Kal. Dec. a.u.c. MMDCCLXIX"
is the most accurate (but historic) form. But it has long since fallen out
of use in modern Latin (e.g. at the Vatican, which now uses the Gregorian
calendar). In fact, when Latin was official in many countries of Europe, the
Gregorian calendar was already in use and the Roman Republican calendar had
already been abandoned (it predates the effective Christian era in Rome). The
Roman Empire was Christianized in the 4th century under Emperor Constantine,
when Latin was not only an official administrative language but still a
living language. How long it took in the Middle Ages for the Julian calendar
(and, 14 centuries later, the Gregorian calendar, when the Latin language
was no longer a living or administrative language except in the Papal
States) to replace the Roman Republican calendar is another question.
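
As an aside on the technical point, the Classical form quoted above can be derived mechanically from a Gregorian date. The following is only a minimal sketch under stated assumptions: it counts inclusively towards the next Kalends, Nones or Ides, uses modern Gregorian month lengths, takes the a.u.c. year as the Gregorian year plus 753, and ignores the special doubling of the sixth day before the Kalends of March in leap years. The names `roman_date` and `to_roman` are made up for this illustration and are not part of CLDR or any library.

```python
import calendar
from datetime import date

# Abbreviated Latin month names, as used after Kal./Non./Id.
MONTH_ABBR = ["Ian.", "Feb.", "Mart.", "Apr.", "Mai.", "Iun.",
              "Iul.", "Aug.", "Sept.", "Oct.", "Nov.", "Dec."]


def to_roman(n):
    """Convert a positive integer to a Roman numeral (subtractive form)."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def roman_date(d):
    """Format a Gregorian date in the Classical Roman style (simplified)."""
    # Nones fall on the 7th and Ides on the 15th in March, May, July and
    # October; on the 5th and 13th in every other month.
    long_month = d.month in (3, 5, 7, 10)
    nones = 7 if long_month else 5
    ides = 15 if long_month else 13
    month = MONTH_ABBR[d.month - 1]

    if d.day == 1:
        marker, count = "Kal.", 1
    elif d.day <= nones:
        marker, count = "Non.", nones - d.day + 1
    elif d.day <= ides:
        marker, count = "Id.", ides - d.day + 1
    else:
        # After the Ides, count inclusively to the Kalends of the next month.
        last_day = calendar.monthrange(d.year, d.month)[1]
        marker, count = "Kal.", last_day - d.day + 2
        month = MONTH_ABBR[d.month % 12]

    if count == 1:
        day = f"{marker} {month}"
    elif count == 2:
        day = f"prid. {marker} {month}"
    else:
        day = f"a.d. {to_roman(count)} {marker} {month}"

    # Assumption: a.u.c. year = Gregorian year + 753.
    return f"{day} a.u.c. {to_roman(d.year + 753)}"


print(roman_date(date(2016, 11, 21)))
```

For 21 November 2016 this reproduces the string from the quoted message, since November has 30 days and the inclusive count to 1 December is 11.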

So the question for the Latin language would be to identify which calendar
is official, not how we can provide relevant and accurate calendar
translations in Latin for the three calendars. If you consider the
"la" locale, it should be bound by default to the current modern epoch, so
using the Gregorian calendar by default. For other historic periods, you'd
need at least other sublocales: one for the Roman Republic; another for the
Roman Empire starting with Julius Caesar, bound to the early Julian
calendar; another after Emperor Augustus (who introduced changes in month
lengths to create the month of August), bound to the modern Julian calendar;
and another for the introduction of the Gregorian calendar. That means 4
distinct locales in Latin. And you'd probably need further distinctions at
the linguistic level for the introduction of lowercase letters in the
Middle Ages (early Classical Latin was unicameral): 5 distinct locale
variants for this language alone, in the same script! You could as well
extend this to earlier periods when Latin was not yet the language of
the whole Roman Empire and had various regional "Italic" variants, some of
them still exhibiting Classical Greek features.

These language variants still persist today in modern Greek (polytonic or
monotonic): monotonic Greek is a very recent introduction and is now the
official form for administrative purposes, but many Greek people still love
their polytonic features. But Classical Greek did not have these
distinctions (and early Classical Greek was also unicameral, and had
various regional variants, including variants in how numbers were written
or simply in the alphabet, which had additional letters now extinct in
modern Greek...). Here again, how many variants will we encode in CLDR for
Greek?

And in fact, is Classical Greek really the same language? Classical Chinese,
for example, uses another language code, "lzh", distinguished from modern
Mandarin, and the "zh" code is no longer a single language but a
collection of languages that behaves as a "macrolanguage" only in its
written form. For the oral form there's a clear need for distinction,
notably for Cantonese, Taiwanese and other Southern Chinese languages,
even if they are unified in their written form by a script variant under
"zh-hant", whereas "Standard" Mandarin uses "zh-hans", and the "zh" language
code maps by default to this implied "hans" form, which is also used outside
China in Singapore, and by large minorities around the Indian Ocean or even
in the US. There's no doubt, however, that the "hant" script variant is
the only one relevant for Classical Chinese ("lzh"), even if it also has
multiple important variants which are very difficult to unify with the
modern "Traditional" variant.
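
To make the "implied script" idea concrete, here is a small sketch of how a default script could be filled in for a language tag. The table is hand-written for illustration only and simply mirrors the claims in the paragraph above (including "lzh" mapping to "Hant"); it is not the actual CLDR likely-subtags data, and the function name `add_default_script` is hypothetical. Variants and extensions are deliberately dropped to keep the sketch short.

```python
# Illustrative defaults only, NOT taken from CLDR data.
# Keys are (language, region); region None means "any region not listed".
DEFAULT_SCRIPT = {
    ("zh", None): "Hans",
    ("zh", "SG"): "Hans",
    ("zh", "TW"): "Hant",
    ("zh", "HK"): "Hant",
    ("lzh", None): "Hant",  # assumption following the message above
}


def add_default_script(tag):
    """Return language[-Script][-REGION], filling in an implied script."""
    parts = tag.replace("_", "-").split("-")
    lang = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            # Four letters: script subtag, written in title case.
            script = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            # Two letters or three digits: region subtag, upper case.
            region = part.upper()
    if script is None:
        # Prefer a region-specific default, then the language-wide one.
        script = DEFAULT_SCRIPT.get((lang, region)) or DEFAULT_SCRIPT.get((lang, None))
    out = [lang]
    if script:
        out.append(script)
    if region:
        out.append(region)
    return "-".join(out)


for t in ("zh", "zh-TW", "zh-SG", "zh-hant", "lzh", "fr"):
    print(t, "->", add_default_script(t))
```

A language with no entry in the table (such as "fr" here) is returned without a script, which is the conservative choice when no default is known.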

For now let's remain in scope: CLDR must first address the needs of
current modern variants, as they are used today. Many other locales (or
sublocales) are possible in data but will never reach CLDR standardization
unless there's an active community and an authority still using the historic
forms (e.g. for "nearly official" religious or ceremonial usage, which is
IMHO a legitimate reason to encode them since, effectively, these historic
forms are not really extinct). This remark applies as well to Biblical
Greek, Biblical/Masoretic Hebrew, Biblical Geez (in Ethiopia), Biblical
Georgian, and Quranic Arabic, which have significant and important
differences from the vernacular modern "standard" languages for Greek,
Hebrew, Geez, Georgian, and Arabic: these **living** religious variants
should IMHO be encoded in CLDR.

More information about the CLDR-Users mailing list