Unicode Locales and CLDR
Q. What are Unicode Locales?
A. The Unicode Locale project provides an XML data format (LDML) for
representing locale data, plus a repository of data in that format (CLDR). For more
information, see http://unicode.org/cldr/.
Q. What products use Unicode Locales?
A. We don't track the users, but we know that the data and format are used
directly by IBM's AIX, Google, ICU, Java, and OpenOffice, while Microsoft uses the format.
There are many indirect users as well. For example, many other companies and
organizations use Unicode Locales through ICU, including: Adobe, Apple (Mac OS X), Apache,
BEA, Boost, Business Objects, CERN, Debian Linux, Eclipse, eBay, FreeBSD, Gentoo Linux,
Google, HP, Hyperion, Inktomi, Intel, Mozilla, Mandrake Linux, Progress Software, Python,
SAP, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, Symantec, Teradata (NCR),
Virage, Wine, and Yahoo! (see http://icu-project.org/).
If you know of other direct or indirect users that would be useful to add to
this list, let us know by filing a report.
Q. What is the difference between a locale and a language?
A. The distinction in practice is very fuzzy. A locale is best thought of as
being a language plus some additional information. For more information, see
"What is a Locale?" and "Language and Locale IDs".
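To make "a language plus some additional information" concrete, here is a minimal Python sketch that splits a simple locale ID into a language, an optional script, an optional region, and whatever follows. The names split_locale_id and LocaleParts are hypothetical, and the parsing is deliberately simplified: it does not validate subtags against any registry and does not implement the full BCP 47 or LDML grammar.

```python
from typing import NamedTuple, Optional, Tuple


class LocaleParts(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    rest: Tuple[str, ...]  # variants, extensions, and so on, left unparsed


def split_locale_id(locale_id: str) -> LocaleParts:
    """Split a simple locale ID such as 'sr-Latn-RS' into its leading fields.

    Simplified on purpose: no registry validation, and everything after
    the region is returned untouched in `rest`.
    """
    subtags = locale_id.replace("_", "-").split("-")
    language = subtags[0].lower()
    script = None
    region = None
    index = 1

    # A script subtag is exactly four letters, for example 'Latn'.
    if index < len(subtags):
        candidate = subtags[index]
        if len(candidate) == 4 and candidate.isalpha():
            script = candidate.title()
            index += 1

    # A region subtag is two letters ('RS') or three digits ('419').
    if index < len(subtags):
        candidate = subtags[index]
        is_alpha2 = len(candidate) == 2 and candidate.isalpha()
        is_digit3 = len(candidate) == 3 and candidate.isdigit()
        if is_alpha2 or is_digit3:
            region = candidate.upper()
            index += 1

    return LocaleParts(language, script, region, tuple(subtags[index:]))
```

With this sketch, split_locale_id("en") carries only a language, while split_locale_id("sr-Latn-RS") adds the script Latn and the region RS to the language sr.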
Q. Does Unicode CLDR define special
values used to indicate an Unknown value for language, script, region (country/territory),
timezone, and currency?
A. Yes, see the table in the section "Unknown or Invalid Identifiers" of UTS #35.
Q. Are these in the source standards used by
CLDR?
A. Some of them are, and others use private use codes defined by
CLDR. For more information, see "Identifiers" in UTS #35.
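As an illustration, the special values can be collected into a small lookup table. The sketch below is in Python; the codes are the ones listed in the LDML specification as we understand it, so check the specification itself before relying on them. The names UNKNOWN_CODES and or_unknown are hypothetical and not part of any CLDR or ICU API.

```python
from typing import Optional

# Codes for "unknown" values, as listed in the LDML specification
# (assumption: verify against the current version of the table).
UNKNOWN_CODES = {
    "language": "und",
    "script": "Zzzz",
    "region": "ZZ",
    "currency": "XXX",
    "timezone": "Etc/Unknown",
}


def or_unknown(field: str, value: Optional[str]) -> str:
    """Return `value`, or the Unknown code for `field` when value is missing."""
    if value:
        return value
    return UNKNOWN_CODES[field]
```

An API that cannot determine a region could then return or_unknown("region", None), which gives "ZZ", rather than returning an empty string.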
Q. Why is the Unknown value for script "Zzzz" and not "Zyyy"?
A. The IANA registry for BCP 47 includes these special values for scripts:
Type: script
Subtag: Zyyy
Description: Code for undetermined script
Added: 2005-10-16
%%
Type: script
Subtag: Zzzz
Description: Code for uncoded script
Added: 2005-10-16
However, in the Unicode Standard, Zyyy marks characters that are common across
a number of scripts (such as 1, 2, 3, . ; ? and so on). Because of that, in a Unicode
context it is viewed as meaning something more like "in multiple scripts", and is not really
appropriate for locales. Zzzz marks reserved characters, which may end up belonging to any
script once encoded (see "Values" in UAX #24). It is also the value used for Unknown script
in CLDR. In practice, it is rare to know the language without the script (en-Zzzz would be
an odd case, for example), so this value is used mostly for API return values.
Q. But what would I use for an audio tape in English?
A. The code Zxxx is for unwritten (e.g., spoken) material. Thus en-Zxxx could be
used for an audio tape, if you wanted to make clear that there was no written English
content.
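The three special script codes discussed above can be summarized in a short Python sketch. The descriptions paraphrase the answers above, and the helper name language_tag is hypothetical. The helper only concatenates subtags and does not validate them.

```python
from typing import Optional

# Special script codes and how they are viewed in a Unicode context,
# paraphrased from the answers above.
SPECIAL_SCRIPT_CODES = {
    "Zyyy": "characters common across a number of scripts",
    "Zzzz": "unknown script (the CLDR Unknown value)",
    "Zxxx": "unwritten (for example, spoken) material",
}


def language_tag(language: str, script: Optional[str] = None,
                 written: bool = True) -> str:
    """Build a language tag, marking unwritten content with Zxxx."""
    if not written:
        # No written content at all, such as an audio tape.
        return f"{language}-Zxxx"
    if script:
        return f"{language}-{script}"
    return language
```

For an English audio tape, language_tag("en", written=False) gives en-Zxxx, while ordinary written English is simply en.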
Q. What capitalization rule should I use when translating display names?
A. Many LDML elements are display names. Display names describe languages,
scripts, countries, variants, units, and many other items. A display name is a translated
name for use in user interfaces, typically when displaying lists. The translator should
adopt the capitalization rules for menu lists that are appropriate for the target language.
Currencies may also have a display name. With the introduction of pluralized
units, it is recognized that currency names may also appear in flowing text in user
interfaces. For currency names, the translator may adopt a capitalization rule suitable for
use in both menu lists and flowing text, although we recognize that this strategy may have
limitations at this time.