CLDR-UTC Liaison Report

L2/04-422

Re: Liaison Report

To: Unicode Technical Committee

From: CLDR Technical Committee

Date: 2004-11-16

CLDR 1.2. The main news is the release of CLDR 1.2. This new release contains data for 232 locales, covering 72 languages and 108 territories. There are also 63 draft locales in the process of being developed, covering an additional 27 languages and 28 territories. (At the end of this document is a list of languages and territories covered.) In this release, the major additions to CLDR are:

Data: Many names for languages, territories, and scripts have been added, as well as for time zones, calendars, and other named items such as collation; the data has been compacted by removal of inherited or aliased data; plus other fixes and additions.

Structure: The LDML specification has been enhanced substantially. It now has self-contained descriptions of date / number / choice format patterns, inheritance and validity, time zone fallbacks, and provides lists of all valid attribute values. The XML format adds structure to assist in vetting, to allow for multiple sets of exemplar characters (indicating characters in use in particular locales), to represent relative dates and times, to provide better support for time zone names, and to strengthen the alias mechanism.

Implementation: The new collation tests allow implementations to verify correct use of locale data. For comparison of data, by-type charts and vetting charts were added. The CLDR source no longer requires ICU for generation.
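The inheritance and compaction mentioned above can be illustrated with a minimal sketch. It assumes a simplified model in which a locale ID such as de_AT falls back by truncation to de and finally to root, and in which compaction drops any value that a parent would already supply. The function names and the sample data are hypothetical and are not taken from the CLDR tools or repository.

```python
def fallback_chain(locale_id):
    """Return the lookup order for a locale ID, ending at 'root'.

    Simplified assumption: parents are found by truncating the ID
    at underscores, e.g. de_AT -> de -> root.
    """
    if locale_id == "root":
        return ["root"]
    parts = locale_id.split("_")
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.append("root")
    return chain


def lookup(data, locale_id, key):
    """Resolve a key by walking the fallback chain until a value is found."""
    for loc in fallback_chain(locale_id):
        entries = data.get(loc, {})
        if key in entries:
            return entries[key]
    raise KeyError(key)


def compact(data):
    """Remove entries whose value is identical to the inherited value."""
    result = {}
    for loc, entries in data.items():
        parents = fallback_chain(loc)[1:]  # everything after the locale itself
        kept = {}
        for key, value in entries.items():
            # Value supplied by the nearest parent that defines the key, if any.
            inherited = next(
                (data[p][key] for p in parents if key in data.get(p, {})),
                None,
            )
            if inherited != value:
                kept[key] = value
        result[loc] = kept
    return result


# Made-up sample values, for illustration only.
data = {
    "root": {"decimal": ".", "group": ","},
    "de": {"decimal": ",", "group": "."},
    "de_AT": {"decimal": ",", "group": "."},
    "de_CH": {"decimal": ".", "group": "'"},
}

compacted = compact(data)
print(fallback_chain("de_AT"))
print(compacted["de_AT"])
print(lookup(compacted, "de_AT", "decimal"))
```

In this sketch the de_AT entries disappear after compaction because they match de, yet lookups against the compacted data resolve to the same values as before.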

This is a stable release and may be used as reference material or cited as a normative reference by other specifications.

To see the list of all bug fixes and additions for this release, go to Release Changes. This is a view into the feature/bug database that presents only those items closed in this release.

Process. The CLDR process differs somewhat from that of the UTC, so it is worth reviewing. Instead of an action item list, we manage the CLDR process using a bug database to report defects, request features, and manage changes. A source-code repository system contains all the data, so snapshots are available at any time. The meetings are also conducted differently: we have a weekly teleconference to make decisions; other business is done via email and/or bug reports.

CLDR 1.3. This release is targeted at April 1, 2005. All new data or defect reports for CLDR 1.3 must be submitted no later than January 15, 2005 (for schedule, see http://www.unicode.org/cldr/). For the CLDR 1.3 release, we are setting an earlier freeze date to allow us to manage the vetting process. Our major deliverables have yet to be fully determined, but they are likely to include:

Survey Tool. A web-based tool that presents an easier-to-understand interface for translators, allowing them to see what is in the repository (and its status) and to request changes. For pattern-based information, examples will show the effect of different patterns. The survey tool will pre-format requested changes as XML, reducing the work involved in incorporating them. Work is already underway on this.

POSIX conversion tool. A tool that generates POSIX data from CLDR. Work is already underway on this as well.

Tests. Add more 'native' consistency testing, using a library of Java tools.

Site Migration. Move CVS and the bug database to the Unicode site.

Additional Mechanisms. Provide mechanisms for: more lenient date/time/number parsing; different combinations of date fields; names for dialects; translated names for measurement systems.

Data Additions/Corrections. And of course, additions of data for various locales. Some of these changes will be coordinated with UTC actions, such as CGJ in German sorting and changes in UCA. From particular countries, such as Finland (see letter below), we are expecting sizable additions in the near future.

Bugs currently targeted at 1.3 can be seen at CLDR 1.3 Bugs. Note: being targeted at a release does not mean that the requested change will be incorporated as stated -- the CLDR committee will review and assess any proposed change, and may change the target.

Letter from Erkki I. Kolehmainen, Liaison to Unicode from RILF

(This letter was addressed to a different topic, but contains useful information for this liaison report.)

...The solution that the emergence of CLDR has brought up in Finland might work well for other smaller language and cultural environments, too. (In fact, I already made a comment to that effect at the San Jose IUC.)

After the Unicode announcement of CLDR, the Ministry of Education concluded in June 2004 that providing data for the CLDR is a logical, pragmatic expansion of the language development activities that are the responsibility of the Research Institute for the Languages of Finland ("RILF" aka "Kotus"). The language environments to be covered in Finland are Finnish (language lead country: Finland), Swedish (lead: Sweden), Northern Sámi (lead: Norway), Inari Sámi (lead: Finland), Skolt Sámi (lead: Finland), and Romani (lead: Council of Europe?). For us, the means to document the national preferences in the open is of at least the same importance as the goal of being able to use the mechanism directly for implementations.

In order to set up an orderly interface for the submissions, RILF became a Liaison member of Unicode in July. In order to facilitate an open and transparent process, a fully open (and free of charge) national group on language and cultural requirements on ICT was set up at the end of September. This group currently represents over 30 different parties, including both commercial and non-commercial organizations, and individuals. The steering group that was set up in early November has kicked off a number of task-oriented working groups. The first results will be submitted for public comment in mid-December. Comments will be accepted from both the established group (by invitation) and the general public via web pages to be set up at a new site, kotoistus.fi (a new Finnish term for localization).

This consensus-seeking process is, admittedly, rather slow, but we already know that some topics are likely to cause considerable debate. We will document the rationale for resolving any controversial issues as well as deviations from any national standard. The results, where applicable, are also expected to lead to new or revised national standards....

CLDR 1.2 Languages and Territories

The following lists show the currently available languages and territories. The amount of data present varies between locales, especially for translated names of languages, territories, currencies, and time zones.

Languages: Afrikaans, Bahasa Indonesia, Bahasa Melayu, Català, Čeština, Cymraeg, Dansk, Deutsch, Eesti, English, Español, Esperanto, Euskara, Føroyskt, Français, Gaeilge, Gaelg, Galego, Hrvatski, Íslenska, Italiano, Kalaallisut, Kernewek, Kiswahili, Latviešu, Lietuvių, Magyar, Malti, Nederlands, Norsk Bokmål, Norsk Nynorsk, Oromoo, Polski, Português, Română, Shqipe, Slovenščina, Slovenský, Soomaali, Srpski, Srpsko-Hrvatski, Suomi, Svenska, Tiếng Việt, Türkçe, Ελληνικά, Беларускі, Български, Қазақ, Македонски, Русский, Српски, Українська, Հայերէն, ‎עברית‎, ‎العربية‎, ‎پښتو‎, ‎دری‎, ‎فارسی‎, ትግርኛ, አማርኛ, कोंकणी, मराठी, हिंदी, বাংলা, ਪੰਜਾਬੀ, ગુજરાતી, தமிழ், తెలుగు, ಕನ್ನಡ, ไทย, 한국어, 中文, 日本語

Territories: Argentina, Australia, België, Belgien, Belgique, Belgium, Bolivia, Botswana, Brasil, Brunei, Canada, Česká Republika, Chile, Colombia, Costa Rica, Danmark, Deutschland, Ecuador, Eesti, Éire, El Salvador, Espainia, España, Espanya, Estados Unidos, Finland, Føroyar, France, Guatemala, Honduras, Hong Kong S.A.R., China, Hrvatska, India, Indonesia, Ireland, Ísland, Italia, Itoobiya, Itoophiyaa, Jabuuti, Kalaallit Nunaat, Keeniyaa, Kenya, Kiiniya, Latvija, Lietuva, Luxembourg, Luxemburg, Magyarország, Malaysia, Malta, México, Nederland, New Zealand, Nicaragua, Noreg, Norge, Österreich, Panamá, Paraguay, Perú, Philippines, Polska, Portugal, Prydain Fawr, Puerto Rico, República Dominicana, România, Rywvaneth Unys, Schweiz, Shqipëria, Singapore, Slovenija, Slovenská Republika, Soomaaliya, South Africa, Srbija I Crna Gora, Suid-Afrika, Suisse, Suomi, Sverige, Svizzera, Tanzania, Türkiye, U.S. Virgin Islands, United Kingdom, United States, Uruguay, Venezuela, Việt Nam, Zimbabwe, Ελλάδα, Беларусь, България, Қазақстан, Македонија, Россия, Србија И Црна Гора, Украина, Україна, Հայաստանի Հանրապետութիւն, ‎ישראל‎, ‎افغانستان‎, ‎الاردن‎, ‎الامارات العربية المتحدة‎, ‎البحرين‎, ‎الجزائر‎, ‎السودان‎, ‎العراق‎, ‎العربية السعودية‎, ‎الكويت‎, ‎المغرب‎, ‎الهند‎, ‎اليمن‎, ‎ایران‎, ‎تونس‎, ‎سورية‎, ‎عمان‎, ‎قطر‎, ‎لبنان‎, ‎ليبيا‎, ‎مصر‎, ኢትዮጵያ, ኤርትራ, भारत, ভারত, ਭਾਰਤ, ભારત, இந்தியா, భారత దేళ౦, ಭಾರತ, ประเทศไทย, 대한민국, 中国, 中華人民共和國香港特別行政區, 新加坡, 日本, 澳門特別行政區, 臺灣

Draft languages: Afar, Assamese, Azerbaijani, Blin, Divehi, Dzongkha, Geez, Georgian, Hawaiian, Inuktitut, Khmer, Kirghiz, Lao, Mongolian, Sanskrit, Sidamo, Syriac, Tatar, Tigre, Urdu, Uzbek, Walamo, ଓଡ଼ିଆ, മലയാളം

Notes:

Territories with multiple locales may appear more than once.

Tooltips will show the language/territory code, English name, and Latin transliteration. (To set your tooltip font, right click on the desktop, pick Properties>Appearance>Advanced>Item: ToolTip, then set the font to Arial Unicode MS or other large font.)

If the above doesn't display in your browser, see Display Problems?
