
Unicode Technical Standard #35

Locale Data Markup Language (LDML)

Version 1.4.1
Authors Mark Davis (mark.davis@google.com)
Date 2006-11-03
This Version http://unicode.org/reports/tr35/tr35-7.html
Previous Version http://unicode.org/reports/tr35/tr35-6.html
Latest Version http://unicode.org/reports/tr35/
Corrigenda http://unicode.org/cldr/corrigenda.html
Latest Working Draft http://unicode.org/draft/reports/tr35/tr35.html
Namespace: http://unicode.org/cldr/
DTDs: http://unicode.org/cldr/dtd/1.4.1/ldml.dtd
http://unicode.org/cldr/dtd/1.4.1/ldmlSupplemental.dtd
Revision 7


Summary

This document describes an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Common Locale Data Repository maintained by the Unicode Consortium.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For possible errata for this document, see [Errata].

1. Introduction

Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. But there remain differences in the locale data used by different systems.

Common, recommended practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.

But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous: each variant is within acceptable limits for human beings, but the variants nevertheless produce different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [Comparisons].)

Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.

This document specifies an XML format for the communication of locale data: the Locale Data Markup Language (LDML). This provides a common format for systems to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings (see [UCA]). LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain, or create a customized locale for a minority language. LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.

For more information, see the Common Locale Data Repository project page [LocaleProject].

2. What is a locale?

Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use the data, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.

The first issue is basic: what is a locale? In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for timezones, languages, countries, and scripts. They can also include text boundaries (character, word, line, and sentence), text transformations (including transliterations), and support for other services.

Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format to 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's timezone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, etc.), music preference, religion, party affiliation, favorite charity, etc.

Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.

In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, etc.). The format in this document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more information, see Appendix D: Language and Locale IDs.

We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a resource or field, and a tag indicating the key of the resource is called a resource tag.

3. Identifiers

LDML uses stable identifiers for distinguishing among locales, regions, currencies, timezones, transforms, and so on. Within each type of entity, such as locales or currencies, the identifiers are unique. However, across types the identifiers may not be unique: thus a currency identifier may be the same as a locale identifier (especially since identifiers are compared caselessly).

There are many systems for identifiers for these entities. The LDML identifiers may not match the identifiers used on a particular target system. If so, some process of identifier translation may be required when using LDML data.

An LDML locale identifier is either "root", or has the following format:

locale_id := base_locale_id options?

base_locale_id := extended_RFC3066bis_identifiers

options := "@" key "=" type ("," key "=" type )*

As usual, x? means that x is optional; x* means that x occurs zero or more times.

For historical reasons, this is called a locale ID. However, it really functions (with few exceptions) as a language ID, and accesses language-based data. There used to be some information that was improperly included in the language-based data, like default currency and weekend ranges, but over time that information has been moved to supplemental files. Those supplemental data files represent not so much "locale" data as non-language data. However, except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data.

A locale ID is an extension of a language ID, and thus the structure and field values are based on the successor to RFC 3066, known as RFC3066bis, which has been approved, but not yet published. However, the registry of data for that successor is now being maintained by IANA. For that registry, and the editor's draft of the standard, see [RFC3066bis]. The canonical form of a locale ID uses "_" instead of the "-" used in RFC3066bis; however, implementations providing APIs for CLDR locale IDs should treat "-" as equivalent to "_" on input. The most common format for the base_locale_id is a series of one or more fields of the form:

language_code ("_" script_code)? ("_" territory_code)? ("_" variant_code)?

The field values are given in the following table. All field values are case-insensitive, except for the type, which is case-sensitive. However, customarily the language code is lowercase, the territory and variant codes are uppercase, and the script code is titlecase (that is, first character uppercase and other characters lowercase). This convention is used in the file names, which may be case-sensitive depending on the operating system. Customarily the currency IDs are uppercase and timezone IDs are titlecase by field (as defined in the timezone database); other key and type codes are lowercase. The type may also be referred to as a key-value, for clarity.

Note that some private use field values may be given specific values when used with LDML.

Locale Field Definitions
Field Allowable Characters Allowable values
language_code ASCII letters [RFC3066bis] subtag values marked as Type: language

Extensions: In some exceptional cases, draft [ISO639] codes may be used in CLDR, if in the judgment of the technical committee they are essentially assured of being added. These currently include:

cch Atsam
kaj Jju
kcg Tyap
kfo Koro

Users should however be aware that if these codes are not accepted into [RFC3066bis], they will be replaced by whatever codes are used, or by private use codes.

The private use codes from qfz..qtz will never be used by CLDR, and are thus safe for use for other purposes by applications using CLDR data.

script_code ASCII letters [RFC3066bis] subtag values marked as Type: script

In most cases the script is not necessary, since the language is only customarily written in a single script. Examples of cases where it is used are:

az_Arab Azerbaijani in Arabic script
az_Cyrl Azerbaijani in Cyrillic script
az_Latn Azerbaijani in Latin script
zh_Hans Chinese, in simplified script
zh_Hant Chinese, in traditional script

CLDR allows for the use of the Unicode Script values [UAX24]:

Common Zyyy
Inherited Qaai
Unknown Zzzz

The private use codes from Qaaq..Qabx will never be used by CLDR, and are thus safe for use for other purposes by applications using CLDR data.

territory_code ASCII letters, numbers [RFC3066bis] subtag values marked as Type: region, or any UN M.49 code that doesn't correspond to a [RFC3066bis] region subtag.

There are three private use codes defined in LDML:

QO Outlying Oceania
QU European Union
ZZ Unknown or Invalid Territory

The private use codes from XA..XZ will never be used by CLDR, and are thus safe for use for other purposes by applications using CLDR data.

variant_code ASCII letters Values used in CLDR are discussed below. For information on the process for adding new standard variants or element/type pairs, see [LocaleProject].
key ASCII letters and digits
type ASCII letters, digits, and "-"

Examples:

en
fr_BE
de_DE@collation=phonebook,currency=DDM
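
The grammar and the customary casing described above can be illustrated with a small parser. This is a minimal sketch in Python, not part of LDML: the function name `parse_locale_id` and the layout of the returned dictionary are assumptions made for illustration. It also assumes that a script code is exactly 4 letters and a territory code is 2 letters or 3 digits, and it treats anything left over as a single variant.

```python
import re

def parse_locale_id(locale_id):
    """Split a locale ID into fields; a hypothetical helper, not a CLDR API."""
    base, _, option_text = locale_id.partition("@")
    # "-" is treated as equivalent to "_" on input, in the base part only
    # (a type such as "islamic-civil" may itself contain "-").
    fields = base.replace("-", "_").split("_")
    parsed = {"language": fields[0].lower(), "script": None,
              "territory": None, "variant": None, "options": {}}
    rest = fields[1:]
    # Assumption: 4 letters is a script, 2 letters or 3 digits is a territory.
    if rest and re.fullmatch(r"[A-Za-z]{4}", rest[0]):
        parsed["script"] = rest.pop(0).title()
    if rest and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", rest[0]):
        parsed["territory"] = rest.pop(0).upper()
    if rest:
        parsed["variant"] = "_".join(rest).upper()
    if option_text:
        for pair in option_text.split(","):
            key, _, type_value = pair.partition("=")
            parsed["options"][key] = type_value  # the type is case-sensitive
    return parsed

print(parse_locale_id("de_DE@collation=phonebook,currency=DDM"))
```

The sketch does no validation against the registries; a real implementation would check each field against the allowable values in the table above.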

The locale id format generally follows the description in the OpenI18N Locale Naming Guideline [NamingGuideline], with some enhancements. The main differences from those guidelines are that the locale id:

  1. does not include a charset (the data in LDML format always provides a representation of all Unicode characters; the repository is stored in UTF-8, although that can be transcoded to other encodings as well),
  2. adds the ability to have a variant, as in Java,
  3. adds the ability to discriminate the written language by script (or script variant),
  4. is a superset of [RFC3066bis] codes.

Note: The language + script + territory code combination can itself be considered simply a language code: For more information, see Appendix D: Language and Locale IDs.

A locale that only has a language code (and possibly a script code) is called a language locale; one with both language and territory code as well is called a territory locale (or country locale).

The variant codes specify particular variants of the locale, typically with special options. They cannot overlap with script or territory codes, so they must have either one letter or more than 4 letters. The currently defined variants include:

Variant Definitions
variant Description
<RFC 3066bis variants> As defined in [RFC3066bis], plus:
BOKMAL Bokmål, variant of Norwegian (deprecated: use nb)
NYNORSK Nynorsk, variant of Norwegian (deprecated: use nn)
AALAND Åland, variant of Swedish used in Finland (deprecated: use AX)
POSIX A POSIX-style invariant locale.
REVISED For revised orthography
SAAHO The Saaho variant of Afar

Note: The first two of the above variants are for backwards compatibility. Typically the entire contents of these are defined by an <alias> element pointing at nb_NO (Norwegian Bokmål) and nn_NO (Norwegian Nynorsk) locale IDs. See also Appendix K: Valid Attribute Values.

The locale IDs corresponding to grandfathered [RFC3066bis] language tags are permitted, but not recommended.

The currently defined optional key/type combinations include the following. Additional type values are defined in the detail sections of this document or in Appendix K: Valid Attribute Values. The assignment of values needs to ensure that they are unique if truncated to 8 letters.

Key/Type Definitions
key type Description
collation phonebook For a phonebook-style ordering (used in German).
collation pinyin Pinyin ordering for Latin and for CJK characters (that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin)
collation traditional For a traditional-style sort (as in Spanish)
collation stroke Pinyin ordering for Latin, stroke order for CJK characters
collation direct Hindi variant
collation posix A "C"-based locale.
collation big5han Pinyin ordering for Latin, big5 charset ordering for CJK characters.
collation gb2312han Pinyin ordering for Latin, gb2312han charset ordering for CJK characters.
calendar* gregorian (default)
calendar* islamic (alias: arabic) Astronomical Arabic
calendar* chinese Traditional Chinese calendar
calendar* islamic-civil (alias: civil-arabic) Civil (algorithmic) Arabic calendar
calendar* hebrew Traditional Hebrew Calendar
calendar* japanese Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor)
calendar* buddhist (alias: thai-buddhist) Thai Buddhist Calendar (same as Gregorian except for the year)
calendar* persian Persian Calendar
calendar* coptic Coptic Calendar
calendar* ethiopic Ethiopic Calendar
*For information on the calendar algorithms associated with the data used with these types, see [Calendars].
collation parameters (colStrength, colAlternate, colBackwards, colNormalization, colCaseLevel, colCaseFirst, colHiraganaQuaternary, colNumeric, variableTop) Associated values as defined in 5.13.1 <collation>. Semantics as defined in 5.13.1 <collation>.
currency ISO 4217 code Currency value identified by ISO code, plus others in common use. See Appendix K: Valid Attribute Values and also [Data Formats]
timezone TZID Identification for timezone according to the TZ Database. See [Data Formats].

For more information on the allowed attribute values, see the specific elements below, and Appendix K: Valid Attribute Values.

CLDR Locale IDs can be converted to valid RFC 3066bis language tags by performing the following transformation.

Thus for example, we get the following conversion:

CLDR en_US_POSIX@calendar=islamic,collation=traditional,colStrength=secondary
RFC3066bis en-US-x-ldml-POSIX-k-calendar-islamic-k-collation-traditio-k-colStren-secondar

3.1 Unknown or Invalid Identifiers

The following identifiers are used to indicate an unknown or invalid code in CLDR. The Region and Timezone code are additional codes provided by CLDR; the others are defined by the relevant standards. When these codes are used in APIs connected with CLDR, the meaning is that either there was no identifier available, or that at some point an input identifier value was determined to be invalid or ill-formed.

Code Type Value Description in Referenced Standards
Language und Undetermined language
Script Zzzz Code for uncoded script, Unknown [UAX24]
Region ZZ Unknown or Invalid Territory
Currency XXX The codes assigned for transactions where no currency is involved
Timezone Etc/Unknown Unknown or Invalid Timezone

When only the script or region is known, then a locale ID will use "und" as the language subtag portion. Thus the locale tag "und_Grek" represents the Greek script; "und_US" represents the US territory.

For region codes, ISO and the UN establish a mapping to three-letter codes and numeric codes. However, this does not extend to the private use codes, which are the numeric codes 900-999 (total: 100) and the three-letter codes AAA-AAZ, QMA-QZZ, XAA-XZZ, and ZZA-ZZZ (total: 1092). CLDR supplies standard mappings to these: for the numeric codes, it uses the top of the numeric private use range; for the 3-letter codes it doubles the final letter. These are the resulting mappings for all of the private use region codes:

Region UN/ISO Numeric ISO 3-Letter
AA 958 AAA
QM..QZ 959..972 QMM..QZZ
XA..XZ 973..998 XAA..XZZ
ZZ 999 ZZZ
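
The table can be reproduced arithmetically. The following Python sketch is illustrative only; the function name `private_use_region_mappings` is an assumption, and the offsets are read directly from the rows above (AA at 958, QM at 959, XA at 973, ZZ at 999).

```python
def private_use_region_mappings(region):
    """Return (numeric, three_letter) for a private use region code.

    Hypothetical helper derived from the mapping table; not a CLDR API.
    """
    code = region.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError("expected a two-letter region code: %r" % region)
    first, second = code
    if code == "AA":
        numeric = 958
    elif first == "Q" and "M" <= second <= "Z":
        numeric = 959 + ord(second) - ord("M")   # QM..QZ -> 959..972
    elif first == "X":
        numeric = 973 + ord(second) - ord("A")   # XA..XZ -> 973..998
    elif code == "ZZ":
        numeric = 999
    else:
        raise ValueError("not a private use region code: %r" % region)
    # The 3-letter form doubles the final letter.
    return numeric, code + second

print(private_use_region_mappings("QO"))  # (961, 'QOO')
```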

For script codes, ISO 15924 supplies a mapping (however, the numeric codes are not in common use):

Script Numeric
Qaaa..Qabx 900..949
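
As a rough illustration of the script row above, the numeric value can be computed by counting positions from Qaaa. This Python sketch assumes the codes run consecutively in alphabetical order from Qaaa (900) to Qabx (949), which is consistent with the endpoints shown; the function name is hypothetical.

```python
import re

def private_use_script_numeric(script):
    """Map a private use script code in Qaaa..Qabx to 900..949 (sketch)."""
    code = script.title()
    if not re.fullmatch(r"Qa[ab][a-z]", code):
        raise ValueError("not in Qaaa..Qabx: %r" % script)
    # Position of the code counted from Qaaa, treating the last two
    # letters as a base-26 number.
    offset = (ord(code[2]) - ord("a")) * 26 + (ord(code[3]) - ord("a"))
    if offset > 49:
        raise ValueError("not in Qaaa..Qabx: %r" % script)
    return 900 + offset
```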

4. Locale Inheritance

The XML format relies on an inheritance model, whereby the resources are collected into bundles, and the bundles organized into a tree. Data for the many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as root. Wherever possible, the resources in the root are language & territory neutral. For example, the collation (sorting) order in the root is the default Unicode Collation Algorithm order (see [UCA]). Since English language collation has the same ordering, the 'en' locale data does not need to supply any collation data, nor does either the 'en_US' or the 'en_IE' locale data.

Given a particular locale id "en_US_someVariant", the search chain for a particular resource is the following.

en_US_someVariant
en_US
en
root

If a type and key are supplied in the locale id, then logically the chain from that id to the root is searched for a resource tag with a given type, all the way up to root. If no resource is found with that tag and type, then the chain is searched again without the type.
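
The search chain and the two-pass lookup just described can be sketched as follows. This is an illustrative Python sketch, not an API defined by LDML: the names `search_chain` and `find_resource` are hypothetical, and the bundles are modeled (as an assumption) as a dictionary from locale ID to a dictionary keyed by (resource tag, type) pairs.

```python
def search_chain(locale_id):
    """List the locales to search, most specific first, ending at root."""
    base = locale_id.split("@", 1)[0]
    fields = base.split("_")
    chain = ["_".join(fields[:n]) for n in range(len(fields), 0, -1)]
    if chain[-1] != "root":
        chain.append("root")
    return chain

def find_resource(bundles, locale_id, tag, type_value=None):
    """Search the chain with the type first, then again without it."""
    chain = search_chain(locale_id)
    passes = [type_value, None] if type_value is not None else [None]
    for wanted_type in passes:
        for locale in chain:
            resource = bundles.get(locale, {}).get((tag, wanted_type))
            if resource is not None:
                return resource
    return None

print(search_chain("en_US_someVariant"))
# ['en_US_someVariant', 'en_US', 'en', 'root']
```

With this model, a request for phonebook collation in a locale whose chain has no phonebook resource falls through to the untyped collation found on the second pass.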

Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their data from the language locale: "en" will contain the bulk of the data; "en_US" will only contain a few items like currency. All data that is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less error-prone) maintenance.

Where this inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format for use by that system, by adding all inherited data to each locale data set.

For a more complete description of how inheritance applies to data, and the use of keywords, see Appendix I: Inheritance and Validity.

The locale data does not contain general character properties that are derived from the Unicode Character Database [UCD]. That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the CLDR data requires use of UCD data. In addition, POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.

Warning: If a locale has a different script than its parent (eg sr_Latn), then special attention must be paid to make sure that all inheritance is covered. For example, auxiliary exemplar characters may need to be empty ("[]") to block inheritance.

4.1 Multiple Inheritance

In clearly specified instances, resources may inherit from within the same locale. For example, currency format symbols inherit from the number format symbols; the Buddhist calendar inherits from the Gregorian calendar. This only happens where documented in this specification. In these special cases, the inheritance functions as normal, up to the root. If the data is not found along that path, then a second search is made, logically changing the element/attribute to the alternate values.

For example, for the locale "en_US" the month data in <calendar type="buddhist"> inherits first from <calendar type="buddhist"> in "en", then in "root". If not found there, then it inherits from <calendar type="gregorian"> in "en_US", then "en", then in "root".
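
The two-stage search in this example can be sketched in Python. The data model (a dictionary from locale ID to calendar type to item), the `ALTERNATES` table, and the function names are assumptions for illustration; only the buddhist-to-gregorian relationship comes from the text above.

```python
# Alternate to fall back to once the normal chain is exhausted.
ALTERNATES = {"buddhist": "gregorian"}

def search_chain(locale_id):
    """List the locales to search, most specific first, ending at root."""
    fields = locale_id.split("@", 1)[0].split("_")
    chain = ["_".join(fields[:n]) for n in range(len(fields), 0, -1)]
    if chain[-1] != "root":
        chain.append("root")
    return chain

def find_calendar_item(bundles, locale_id, calendar_type, item):
    """Search the whole chain for one calendar, then for its alternate."""
    calendar = calendar_type
    while calendar is not None:
        for locale in search_chain(locale_id):
            value = bundles.get(locale, {}).get(calendar, {}).get(item)
            if value is not None:
                return value
        calendar = ALTERNATES.get(calendar)  # second search, if documented
    return None
```

Note that the full chain up to root is searched for the original calendar before any alternate is consulted, so a Buddhist value in root takes precedence over a Gregorian value in en_US.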

5 XML Format

There are two kinds of data that can be expressed in LDML: language-dependent data and supplementary data. In either case, data can be split across multiple files, which can be in multiple directory trees.

For example, the language-dependent data for Japanese in CLDR is present in the following files:

The status of the data is the same, whether or not data is split. That is, for the purpose of validation and lookup, all of the data for the above ja.xml files is treated as if it was in a single file.

Supplemental data relating to Japan or the Japanese writing system can be found in:

The following sections describe the structure of the XML format for language-dependent data. The more precise syntax is in the DTD, listed at the top of this document; however, the DTD does not describe all the constraints on the structure.

To start with, the root element is <ldml>, with the following DTD entry:

<!ELEMENT ldml (identity, (alias |(localeDisplayNames?, layout?, characters?, delimiters?, measurement?, dates?, numbers?, collations?, posix?, special*))) >

That element contains the following elements:

The structure of each of these elements and their contents will be described below. The first few elements have little structure, while dates, numbers, and collations are more involved.

The XML structure is stable over releases. Elements and attributes may be deprecated: they are retained in the DTD but their usage is strongly discouraged. In most cases, an alternate structure is provided for expressing the information.

In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information (such as numbers or dates). The reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be significant in translated material.

There are two kinds of elements in LDML: rule elements and structure elements. For structure elements, there are restrictions to allow for effective inheritance and processing:

  1. There is no "mixed" content: if an element has textual content, then it cannot contain any elements.
  2. The XPath leading to the content is unique; no two different pieces of textual content have the same XPath.

Rule elements do not have this restriction, but also do not inherit, except as an entire block. The rule elements are listed in serialElements in the supplemental metadata. See also Appendix I: Inheritance and Validity.

Note that the data in examples given below is purely illustrative, and doesn't match any particular language. For a more detailed example of this format, see [Example]. There is also a DTD for this format, but remember that the DTD alone is not sufficient to understand the semantics, the constraints, nor the interrelationships between the different elements and attributes. You may wish to have copies of each of these to hand as you proceed through the rest of this document.

In particular, all elements allow for draft versions to coexist in the file at the same time. Thus most elements are marked in the DTD as allowing multiple instances. However, unless an element is listed as a serialElement, or has a distinguishing attribute, it can only occur once as a subelement of a given element. Thus, for example, the following is illegal even though allowed by the DTD:

<languages>
  <language type="aa">...</language>
  <language type="aa">...</language>
</languages>

There must be only one instance of these per parent, unless there are other distinguishing attributes (such as an alt attribute).

In general, data should be in NFC format. Exceptions to this include transforms, segmentations, and pc/sc/tc/qc/ic rules in collation. Thus LDML documents must not be normalized as a whole. To prevent problems with normalization, no element value can start with a combining slash (U+0338 COMBINING LONG SOLIDUS OVERLAY).

Lists, such as singleCountries are space-delimited. That means that they are separated by one or more XML whitespace characters, and that leading and trailing spaces are to be ignored (that is, they behave like NMTOKENS). These include:

5.1 Common Elements

At any level in any element, two special elements are allowed.

<special xmlns:yyy="xxx">

This element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute, which specifies the XML namespace of the special data. For example, the following uses the version 1.0 POSIX special element.

<!DOCTYPE ldml SYSTEM "http://unicode.org/cldr/dtd/1.0/ldml.dtd" [
    <!ENTITY % posix SYSTEM "http://unicode.org/cldr/dtd/1.0/ldmlPOSIX.dtd">
%posix;
]>
<ldml>
...
    <special xmlns:posix="http://www.opengroup.org/regproducts/xu.htm">
        <!-- old abbreviations for pre-GUI days -->
        <posix:messages>
            <posix:yesstr>Yes</posix:yesstr>
            <posix:nostr>No</posix:nostr>
            <posix:yesexpr>^[Yy].*</posix:yesexpr>
            <posix:noexpr>^[Nn].*</posix:noexpr>
        </posix:messages>
    </special>
</ldml>

<alias source="<locale_ID>" path="..."/>

The contents of any element can be replaced by an alias, which points to another source for the data. The elements in that source are to be fetched from the corresponding location in the other source. Normal resource searching is to be used; take the following example:

<ldml>
  <collations>
    <collation type="phonebook">
      <alias source="de_DE"/>
    </collation>
  </collations>
</ldml>

The resource bundle at "de_DE" will be searched for a resource element at the same position in the tree with type "collation". If not found there, then the resource bundle at "de" will be searched, etc. For an example of how this works with inheritance, look at the following table (where green indicates inherited items). Note in particular that an alias "reroutes" the inheritance; nothing in the parent affects the contents of an item with an alias. Thus the red item below is blocked.

Inheritance with Aliases
en:
<x>
  <a>01</a>
  <b>02</b>
  <c>03</c>
</x>
en_US:
<x>
  <b>12</b>
</x>
Resolved en_US:
<x>
  <a>01</a>
  <b>12</b>
  <c>03</c>
</x>

de:
<x>
  <a>21</a>
  <b>22</b>
  <c>23</c>
  <d>23</d>
</x>
de_DE:
<x>
  <alias source="en_US"/>
</x>
Resolved de_DE:
<x>
  <a>01</a>
  <b>12</b>
  <c>03</c>
</x>
de_DE_1901:
<x>
  <a>41</a>
</x>
Resolved de_DE_1901:
<x>
  <a>41</a>
  <b>12</b>
  <c>03</c>
</x>
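
The rerouting behavior in the table above can be sketched in Python. This is a simplified model, not the full resolution algorithm of Appendix I: each bundle maps an element name to a dictionary of child values, and the hypothetical marker key `"@alias"` stands in for an `<alias source="..."/>` child. The `path` attribute and detection of circular aliases are deliberately left out.

```python
ALIAS = "@alias"  # hypothetical marker standing in for <alias source="..."/>

def search_chain(locale_id):
    """List the locales to search, most specific first, ending at root."""
    fields = locale_id.split("_")
    chain = ["_".join(fields[:n]) for n in range(len(fields), 0, -1)]
    if chain[-1] != "root":
        chain.append("root")
    return chain

def resolve(bundles, locale_id, element):
    """Resolve the children of an element, following aliases."""
    resolved = {}
    for locale in search_chain(locale_id):
        data = bundles.get(locale, {}).get(element)
        if data is None:
            continue
        if ALIAS in data:
            # The alias reroutes inheritance: take the resolved target
            # and ignore everything in the remaining parents.
            target = resolve(bundles, data[ALIAS], element)
            for key, value in target.items():
                resolved.setdefault(key, value)
            break
        for key, value in data.items():
            resolved.setdefault(key, value)  # nearer locales win
    return resolved

bundles = {
    "en": {"x": {"a": "01", "b": "02", "c": "03"}},
    "en_US": {"x": {"b": "12"}},
    "de": {"x": {"a": "21", "b": "22", "c": "23", "d": "23"}},
    "de_DE": {"x": {ALIAS: "en_US"}},
    "de_DE_1901": {"x": {"a": "41"}},
}
print(resolve(bundles, "de_DE_1901", "x"))
```

With the table's data, the value of d in de never reaches de_DE or de_DE_1901, because the alias in de_DE stops the walk up the German chain.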

If the path attribute is present, then its value is an XPath that points to a different node in the tree. For example:

<alias source="root" path="../monthWidth[@type='wide']"/>

The default value if the path is not present is the same position in the tree. All of the attributes in the XPath must be distinguishing attributes. For more details, see Appendix I: Inheritance and Validity.

There is a special value for the source attribute, the constant source="locale", which is the default value. This special value is equivalent to the locale being resolved. Consider the following example, where locale data for 'de' is being resolved:

Inheritance with source="locale"
Root:
<x>
  <a>1</a>
  <b>2</b>
  <c>3</c>
</x>
de:
<x>
  <a>11</a>
  <b>12</b>
  <d>14</d>
</x>
Resolved de:
<x>
  <a>11</a>
  <b>12</b>
  <c>3</c>
  <d>14</d>
</x>

Root:
<y>
  <alias path="../x"/>
</y>
de:
<y>
  <b>22</b>
  <e>25</e>
</y>
Resolved de:
<y>
  <a>11</a>
  <b>22</b>
  <c>3</c>
  <d>14</d>
  <e>25</e>
</y>

The first row shows the inheritance within the <x> element, whereby <c> is inherited from root. The second shows the inheritance within the <y> element, whereby <a>, <c>, and <d> are inherited also from root, but from an alias there. The alias in root is logically replaced not by the elements in root itself, but by elements in the 'target' locale.

For more details on data resolution, see Appendix I: Inheritance and Validity.

It is an error to have a circular chain of aliases. That is, a collection of LDML XML documents must not have situations where a sequence of alias lookups (including inheritance and multiple inheritance) can be followed indefinitely without terminating.

<displayName>

Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number format, used to format numbers using the conventions of that locale, can have a translated name for presentation in GUIs.

  <numberFormat>
    <displayName>Prozentformat</displayName>
    ...
  </numberFormat>

Where present, the display names must be unique; that is, two distinct codes must not get the same display name. (There is one exception to this: in timezones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different timezone IDs.) Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].

<default type="someID"/>

In some cases, a number of elements are present. The default element can be used to indicate which of them is the default, in the absence of other information. The value of the type attribute is to match the value of the type attribute for the selected item.

<timeFormats>
  <default type="medium" /> 
  <timeFormatLength type="full">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern> 
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="long">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern> 
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="medium">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a</pattern> 
    </timeFormat>
  </timeFormatLength>
...

Like all other elements, the <default> element is inherited. Thus, it can also refer to inherited resources. For example, suppose that the above resources are present in fr, and that in fr_BE we have the following:

<timeFormats>
  <default type="long"/>
</timeFormats>

In that case, the default time format for fr_BE would be the inherited "long" resource from fr. Now suppose that we had in fr_CA:

  <timeFormatLength type="medium">
    <timeFormat type="standard">
      <pattern type="standard">...</pattern> 
    </timeFormat>
  </timeFormatLength>

In this case, the <default> is inherited from fr, and has the value "medium". It thus refers to this new "medium" pattern in this resource bundle.
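The two cases above can be traced with a small sketch. It assumes the locale data has already been parsed into plain dictionaries; the `PARENT` and `DATA` tables, the function names, and the placeholder string standing in for the elided fr_CA pattern are all illustrative assumptions.

```python
# Hypothetical, pre-parsed data. Only the simple parent chain is modeled.
PARENT = {"fr_BE": "fr", "fr_CA": "fr", "fr": "root", "root": None}
DATA = {
    "fr": {"default": "medium",
           "patterns": {"full": "h:mm:ss a z", "long": "h:mm:ss a z",
                        "medium": "h:mm:ss a"}},
    "fr_BE": {"default": "long", "patterns": {}},
    # Placeholder for the pattern elided in the fr_CA example.
    "fr_CA": {"default": None, "patterns": {"medium": "FR_CA_MEDIUM"}},
}


def inherited(locale, getter):
    """Return the first non-None value found walking up the parent chain."""
    while locale is not None:
        value = getter(DATA.get(locale, {}))
        if value is not None:
            return value
        locale = PARENT[locale]
    return None


def default_time_pattern(locale):
    """Resolve the <default> type, then the pattern it refers to."""
    length = inherited(locale, lambda d: d.get("default"))
    return inherited(locale, lambda d: d.get("patterns", {}).get(length))
```

Here `default_time_pattern("fr_BE")` resolves to the "long" pattern inherited from fr, while `default_time_pattern("fr_CA")` resolves to the locally defined "medium" entry.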

5.1.1 Escaping Characters

Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, in certain instances extra syntax is required to represent those code points that cannot be otherwise represented in element content. These escapes are only allowed in certain elements, according to the DTD.

Escaping Characters
Code Point XML Example
U+0000 <cp hex="0">

5.2 Common Attributes

<... type="stroke" ...>

The attribute type is also used to indicate an alternate resource that can be selected with a matching type=option in the locale id modifiers, or be referenced by a default element. For example:

<ldml>
  ...
  <currencies>
    <currency>...</currency>
    <currency type="preEuro">...</currency>
  </currencies>
</ldml>

<... draft="unconfirmed" ...>

If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary draft value), as per the following:

Normally draft attributes should only occur on "leaf" elements. For a more formal description of how elements are inherited, and what their draft status is, see Appendix I: Inheritance and Validity.

<... alt="descriptor" ...>

This attribute labels an alternative value for an element. The descriptor indicates what kind of alternative it is, and takes one of the following forms:

"proposed" should only be present if the draft status is not "approved". It indicates that the data is proposed replacement data that has been added provisionally until the differences between it and the other data can be vetted. For example, suppose that the translation for September for some language is "Settembru", and a bug report is filed that that should be "Settembro". The new data can be entered in, but marked as alt="proposed" until it is vetted.

...
<month type="9">Settembru</month>
<month type="9" draft="unconfirmed" alt="proposed">Settembro</month>
<month type="10">...

Now assume another bug report comes in, saying that the correct form is actually "Settembre". Another alternative can be added:

...
<month type="9" draft="unconfirmed" alt="proposed2">Settembre</month>
...

The allowable values for variantname at this time are "variant", "list", "email", "www", and "secondary". This may be expanded in the future.

<... validSubLocales="de_AT de_CH de_DE" ...>

The attribute validSubLocales allows sublocales in a given tree to be treated as though a file for them were present when there isn't one. It can be applied to any element. It only has an effect for locales that inherit from the current file where a file is missing, and the elements wouldn't otherwise be draft.

For a more complete description of how draft applies to data, see Appendix I: Inheritance and Validity.

<... standard="..." ...>

Note: This attribute is deprecated. Instead, use a reference element with the attribute standard="true". See Section 5.12 <references>.

The value of this attribute is a list of strings representing standards: international, national, organization, or vendor standards. The presence of this attribute indicates that the data in this element is compliant with the indicated standards. Where possible, for uniqueness, the string should be a URL that represents that standard. The strings are separated by commas; leading or trailing spaces on each string are not significant. Examples:

<collation standard="MSA 200:2002">
...
<dateFormatStyle standard="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26780&amp;ICS1=1&amp;ICS2=140&amp;ICS3=30">

<... references="..." ...>

The value of this attribute is a list of strings, separated by spaces, each representing a reference for the information in the element, including standards that it may conform to. The best format is a series of tokens, where each token corresponds to a reference element. See Section 5.12 <references>. (In older versions of CLDR, the value of the attribute was freeform text. That format is deprecated.)

Example:

<territory type="UM" references="R1 R2">USAs yttre öar</territory>

The reference element may be inherited. Thus, for example, R2 may be used in sv_SE.xml even though it is not defined there, if it is defined in sv.xml.

5.3 Identity Elements

<!ELEMENT identity (alias | (version, generation, language, script?, territory?, variant?, special*) ) >

The identity element contains information identifying the target locale for this data, and general information about the version of this data.

<version number="$Revision: 1.212 $">

The version element provides, in an attribute, the version of this file.  The contents of the element can contain textual notes about the changes between this version and the last. For example:

<version number="1.1">Various notes and changes in version 1.1</version>

This is not to be confused with the version attribute on the ldml element, which tracks the DTD version.

<generation date="$Date: 2006/10/28 03:19:43 $" />

The generation element contains the last modified date for the data. This can be in two formats: ISO 8601 format, or CVS format (illustrated by the example above).

<language type="en"/>

The language code is the primary part of the specification of the locale id, with values as described above.

<script type="Latn" />

The script field may be used in the identification of written languages, with values described above.

<territory type="US"/>

The territory code is a common part of the specification of the locale id, with values as described above.

<variant type="NYNORSK"/>

The variant code is the tertiary part of the specification of the locale id, with values as described above.

5.4 Display Name Elements

<!ELEMENT localeDisplayNames (alias | (languages?, scripts?, territories?, variants?, keys?, types?, measurementSystemNames?, special*)) >

Display names for scripts, languages, countries, and variants in this locale are supplied by this element. These supply localized names for these items for use in user-interfaces for displaying lists of locales and scripts. Examples are given below.

Note: The "en" locale may contain translated names for deprecated codes for debugging purposes. Translation of deprecated codes into other languages is discouraged.

Where present, the display names must be unique; that is, two distinct codes would not get the same display name. (There is one exception to this: in timezones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different timezone IDs.)

Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].

<languages>

This contains a list of elements that provide the user-translated names for language codes, as described in Section 3, Identifiers.

<language type="ab">Abkhazian</language>
<language type="aa">Afar</language>
<language type="af">Afrikaans</language>
<language type="sq">Albanian</language>

The type can actually be any locale ID as specified above. The set of locale IDs that receive their own translation is not fixed, and depends on the locale. For example, in one language one could translate the following locale IDs, and in another, fall back on the normal composition.

type translation composition
nl_BE Flemish Dutch (Belgium)
zh_Hans Simplified Chinese Chinese (Simplified Han)
en_GB British English English (United Kingdom)

Thus when a complete locale ID is formed by composition, the longest match in the language type is used, and the remaining fields (if any) added using composition.
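The longest-match rule can be sketched as follows. This is an illustration under stated assumptions, not a normative algorithm: the name tables are hypothetical sample data, the "name (field, field)" composition format is taken from the table above, a four-letter field is treated as a script code, and variants and keys are ignored.

```python
def display_name(locale_id, language_names, script_names, territory_names):
    """Compose a display name, preferring the longest translated prefix."""
    fields = locale_id.split("_")
    # Try the longest prefix first: e.g. "en_GB", then "en".
    for n in range(len(fields), 0, -1):
        prefix = "_".join(fields[:n])
        if prefix in language_names:
            base, rest = language_names[prefix], fields[n:]
            break
    else:
        base, rest = fields[0], fields[1:]  # no translation: use the code
    # Remaining fields are added by composition.
    extras = [(script_names if len(f) == 4 else territory_names).get(f, f)
              for f in rest]
    return f"{base} ({', '.join(extras)})" if extras else base


# Hypothetical sample data for one locale.
LANGS = {"en": "English", "nl": "Dutch", "zh": "Chinese",
         "en_GB": "British English"}
SCRIPTS = {"Hans": "Simplified Han"}
TERRITORIES = {"GB": "United Kingdom", "BE": "Belgium"}
```

With this data, en_GB uses its own translation, while nl_BE and zh_Hans fall back on composition.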

<scripts>

This element can contain any number of script elements. Each script element provides the localized name for a script code, as described in Section 3, Identifiers (see also UAX #24: Script Names [Scripts]). For example, in the language of this locale, the name for the Latin script might be "Romana", and the name for the Cyrillic script might be "Kyrillica". That would be expressed with the following.

<script type="Latn">Romana</script>
<script type="Cyrl">Kyrillica</script>

<territories>

This contains a list of elements that provide the user-translated names for territory codes, as described in Section 3, Identifiers.

<territory type="AF">Afghanistan</territory>
<territory type="AL">Albania</territory>
<territory type="DZ">Algeria</territory>
<territory type="AD">Andorra</territory>
<territory type="AO">Angola</territory>
<territory type="US">United States</territory>

<variants>

This contains a list of elements that provide the user-translated names for the variant_code values described in Section 3, Identifiers.

<variant type="nynorsk">Nynorsk</variant>

<keys>

This contains a list of elements that provide the user-translated names for the key values described in Section 3, Identifiers.

<key type="collation">Sortierung</key>

<types>

This contains a list of elements that provide the user-translated names for the type values described in Section 3, Identifiers. Since the translation of an option name may depend on the key it is used with, the latter is optionally supplied.

<type type="phonebook" key="collation">Telefonbuch</type>

<measurementSystemNames>

This contains a list of elements that provide the user-translated names for systems of measurement. The types currently supported are "US", "metric", and "UK".

<measurementSystemName type="US">U.S.</measurementSystemName>

 

Note: In the future, we may need to add display names for the particular measurement units (millimeter vs millimetre vs whatever the Greek, Russian, etc are), and a message format for positioning those with respect to numbers. E.g. "{number} {unitName}" in some languages, but "{unitName} {number}" in others.

5.5 Layout Elements

<!ELEMENT layout ( alias | (orientation?, inList*, special*) ) >

This top-level element specifies general layout features. It currently has two possible elements, orientation and inList (other than <special>, which is always permitted).

<orientation lines="top-to-bottom" characters="left-to-right" />

The lines and characters attributes specify the default general ordering of lines within a page, and characters within a line. The values are:

Orientation Attributes
Vertical    top-to-bottom, bottom-to-top
Horizontal  left-to-right, right-to-left

If the lines value is one of the vertical attributes, then the characters value must be one of the horizontal attributes, and vice versa. For example, for English the lines are top-to-bottom, and the characters are left-to-right. For Mongolian (in the Mongolian script) the lines are right-to-left, and the characters are top-to-bottom. This does not override the ordering behavior of bidirectional text; it does, however, supply the paragraph direction for that text (for more information, see UAX #9: The Bidirectional Algorithm [BIDI]).
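The constraint between the two attributes is simple to check mechanically. A minimal sketch, with a hypothetical function name:

```python
VERTICAL = {"top-to-bottom", "bottom-to-top"}
HORIZONTAL = {"left-to-right", "right-to-left"}


def is_valid_orientation(lines, characters):
    """True if one value is vertical and the other is horizontal."""
    return ((lines in VERTICAL and characters in HORIZONTAL) or
            (lines in HORIZONTAL and characters in VERTICAL))
```

Both the English and the Mongolian combinations described above pass this check; a pair of two vertical values does not.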

For dates, times, and other data to appear in the right order, the display for them should be set to the orientation of the locale.

<inList>

The following element controls whether display names (language, territory, etc) are titlecased in GUI menu lists and the like. It is only used in languages where the normal display is lowercase, but titlecase is used in lists. There are two options:

<inList casing="titlecase-words">
<inList casing="titlecase-firstword">

In both cases, the titlecase operation is the default titlecase function defined by Chapter 3 of [Unicode]. In the second case, only the first word (using the word boundaries for that locale) will be titlecased. The results can be fine-tuned by using alt="list" on any element where titlecasing as defined by the Unicode Standard will produce the wrong value. For example, suppose that "turc de Crimée" is a value, and the titlecase should be "Turc de Crimée". Then that can be expressed using the alt="list" value.

5.6 Character Elements

<!ELEMENT characters (alias | (exemplarCharacters*, mapping*, special*)) >

The <characters> element provides optional information about characters that are in common use in the locale, and information that can be helpful in picking resources or data appropriate for the locale, such as when choosing among character encodings that are typically used to transmit data in the language of the locale. It typically only occurs in a language locale, not in a language/territory locale.

<exemplarCharacters>[a-zåæø]</exemplarCharacters>

The exemplar character set contains the commonly used letters for a given modern form of a language, which can be used for testing and for determining the appropriate repertoire of letters for charset conversion or collation. ("Letter" is interpreted broadly, as anything having the property Alphabetic in the [UCD], which also includes syllabaries and ideographs.) It is not a complete set of letters used for a language, nor should it be considered to apply to multiple languages in a particular country. Punctuation and other symbols should not be included.

There are two sets: the main set should contain the minimal set required for users of the language, while the auxiliary exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. So, for example, if Irish newspapers and magazines commonly have Danish names using å, then it would be appropriate to include å in the auxiliary exemplar characters, but not in the main exemplar set. Major style guidelines are good references for the auxiliary set. Thus for English we have [a-z] in the main set, and [á à ă â å ä ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ß ú ù ŭ û ü ū ÿ] in the auxiliary set.

In general, the test to see whether or not a letter belongs in the main set is based on whether it is acceptable in that language to always use spellings that avoid that character. For example, the exemplar character set for en (English) is the set [a-z]. This set does not contain the accented letters that are sometimes seen in words like "résumé" or "naïve", because it is acceptable in common practice to spell those words without the accents. The exemplar character set for fr (French), on the other hand, must contain those characters: [a-z é è ù ç à â ê î ô û æ œ ë ï ÿ]. The main set typically includes those letters commonly taught in schools as the "alphabet".

The list of characters is in the Unicode Set format, which allows boolean combinations of sets of letters, including those specified by Unicode properties.

Sequences of characters that act like a single letter in the language — especially in collation — are included within braces, such as [a-z á é í ó ú ö ü ő ű {cs} {dz} {dzs} {gy} ...]. The characters should be in normalized form (NFC). Where combining marks are used generatively, and apply to a large number of base characters (such as in Indic scripts), the individual combining marks should be included. Where they are used with only a few base characters, the specific combinations should be included. Wherever there is not a precomposed character (e.g. single codepoint) for a given combination, that must be included within braces. For example, to include sequences from the Where is my Character? page on the Unicode site, one would write: [{ch} {tʰ} {x̣} {ƛ̓} {ą́} {i̇́} {ト゚}], but for French one would just write [a-z é è ù ...]. When in doubt use braces, since it does no harm to included them around single code points: e.g. [a-z {é} {è} {ù} ...].

If the letter 'z' were only ever used in the combination 'tz', then we might have [a-y {tz}] in the main set. (The language would probably have plain 'z' in the auxiliary set, for use in foreign words.) If combining characters can be used productively in combination with a large number of others (such as say Indic matras), then they are not listed in all the possible combinations, but separately, such as:

[‌ ‍ ॐ ०-९ ऄ-ऋ ॠ ऌ ॡ ऍ-क क़ ख ख़ ग ग़ घ-ज ज़ झ-ड ड़ ढ ढ़ ण-फ फ़ ब-य य़ र-ह ़ ँ-ः ॑-॔ ऽ ् ॽ ा-ॄ ॢ ॣ ॅ-ौ]

The exemplar character set for Han characters is composed somewhat differently. It is even harder to draw a clear line for Han characters, since usage is more like a frequency curve that slowly trails off to the right in terms of decreasing frequency. So for this case, the exemplar characters simply contain a set of reasonably frequent characters for the language.

The ordering of the characters in the set is irrelevant, but for readability in the XML file the characters should be in sorted order according to the locale's conventions. The set should only contain lower case characters (except for the special case of Turkish and similar languages, where the dotted capital I should be included); the uppercase letters are to be mechanically added when the set is used. For more information, see [Data Formats] and the discussion of Special Casing in the Unicode Character Database.

5.6.1. Restrictions

  1. The sets are normally restricted to those letters with a specific Script character property (that is, not the values Common or Inherited) or required Default_Ignorable_Code_Point characters (such as a non-joiner), or combining marks, or the Word_Break properties Katakana, ALetter, or MidLetter.
  2. The auxiliary set should not overlap with the main set. There is one exception to this: Hangul Syllables and CJK Ideographs can overlap between the sets.
  3. Any Default_Ignorable_Code_Points should be in the auxiliary set.
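Restriction 2 lends itself to an automated check. The sketch below assumes the two exemplar sets have already been expanded into Python sets of strings, with a multi-character string for each braced sequence; parsing the Unicode Set syntax itself is out of scope. It uses character names from Python's `unicodedata` module to recognize the exempt Hangul syllables and CJK ideographs, which is an implementation shortcut rather than anything LDML prescribes.

```python
import unicodedata


def _may_overlap(item):
    """Hangul syllables and CJK ideographs are exempt from the overlap rule."""
    if len(item) != 1:
        return False
    name = unicodedata.name(item, "")
    return name.startswith(("HANGUL SYLLABLE", "CJK UNIFIED IDEOGRAPH"))


def overlap_violations(main, auxiliary):
    """Return the non-exempt items present in both sets, sorted."""
    return sorted(x for x in main & auxiliary if not _may_overlap(x))
```

For example, `overlap_violations({"a", "b", "cs"}, {"b", "å", "cs"})` reports "b" and "cs", while an ideograph shared by both sets is not reported.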

<mapping registry="iana" type="iso-2022-jp utf-8" alt="email" />

The mapping element describes character conversion mapping tables that are commonly used to encode data in the language of this locale for a particular purpose. Each encoding is identified by a name from the specified registry. If more than one encoding is used for a particular purpose, the encodings are listed in the type attribute in order, from most preferred to least. An alt tag is used to indicate the purpose ("email" or "www" being the most frequent); if it is absent, then the encoding(s) may be used for all purposes not explicitly specified.

Each locale may have at most one mapping element tagged with a particular purpose, and at most one general-purpose mapping element. Inheritance is on an element basis; an element in a sub-locale overrides an inherited element with the same purpose.
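A consumer of this data might select encodings for a given purpose as in the sketch below. It uses Python's standard `xml.etree.ElementTree` module; the function name and the second, general-purpose `mapping` element in the sample are illustrative assumptions, and inheritance from parent locales is assumed to have been resolved beforehand.

```python
import xml.etree.ElementTree as ET


def preferred_encodings(characters_xml, purpose=None):
    """Return the encodings for `purpose`, most preferred first.

    Falls back to the general-purpose mapping (the one without an alt
    attribute) when no mapping is tagged with the requested purpose.
    """
    root = ET.fromstring(characters_xml)
    by_purpose = {m.get("alt"): m.get("type", "").split()
                  for m in root.findall("mapping")}
    return by_purpose.get(purpose, by_purpose.get(None, []))


# Sample data; the second mapping is made up for illustration.
SAMPLE = """<characters>
  <mapping registry="iana" type="iso-2022-jp utf-8" alt="email"/>
  <mapping registry="iana" type="utf-8"/>
</characters>"""
```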

Currently the only registry that can be used is "iana", which specifies use of an IANA name.

Note: While IANA names are not precise for conversion (see UTR #22: Character Mapping Tables [CharMapML]), they are sufficient for this purpose.

5.7 Delimiter Elements

<!ELEMENT delimiters (alias | (quotationStart*, quotationEnd*, alternateQuotationStart*, alternateQuotationEnd*, special*)) >

The delimiters supply common delimiters for bracketing quotations. The quotation marks are used with simple quoted text, such as:

He said, “Don’t be absurd!”

When quotations are nested, the quotation marks and alternate marks are used in an alternating fashion:

He said, “Remember what the Mad Hatter said: ‘Not the same thing a bit! Why you might just as well say that “I see what I eat” is the same thing as “I eat what I see”!’”

<quotationStart>“</quotationStart>
<quotationEnd>”</quotationEnd>
<alternateQuotationStart>‘</alternateQuotationStart>
<alternateQuotationEnd>’</alternateQuotationEnd>
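The alternation between the two pairs of marks can be sketched as follows. The function name and the depth parameter are illustrative; the default marks are those used in the English examples above and would in practice be read from the delimiters element of the locale.

```python
def quote(text, depth=0, marks=("“", "”"), alt_marks=("‘", "’")):
    """Wrap `text` in the quotation marks appropriate for the nesting depth.

    Depth 0 (outermost) uses the main marks, depth 1 the alternate marks,
    depth 2 the main marks again, and so on.
    """
    start, end = marks if depth % 2 == 0 else alt_marks
    return f"{start}{text}{end}"


inner = quote("I see what I eat", depth=2)
middle = quote(f"you might just as well say that {inner}", depth=1)
outer = quote(f"Remember what the Mad Hatter said: {middle}", depth=0)
```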

5.8 Measurement Elements (deprecated)

<!ELEMENT measurement (alias | (measurementSystem?, paperSize?, special*)) >

The measurement element is deprecated in the main LDML files, because the data is more appropriately organized as connected to territories, not to linguistic data. Instead, the similar element in the supplemental data file should be used.

5.9 Date Elements

<!ELEMENT dates (alias | (localizedPatternChars*, calendars?, timeZoneNames?, special*)) >

This top-level element contains information regarding the format and parsing of dates and times. The data format is based on the Java/ICU format. Most of these are fairly self-explanatory, except the week elements, localizedPatternChars, and the meaning of the pattern characters. For information on this, and more information on other elements and attributes, see Appendix F: Date Format Patterns.

5.9.1 Calendar Elements

<!ELEMENT calendar (alias | (months?, monthNames?, monthAbbr?, days?, dayNames?, dayAbbr?, quarters?, week?, am?, pm?, eras?, dateFormats?, timeFormats?, dateTimeFormats?, fields*, special*))>

This element contains multiple <calendar> elements, each of which specifies the fields used for formatting and parsing dates and times according to the given calendar. The month and quarter names are identified numerically, starting at 1. The day (of the week) names are identified with short strings, since there is no universally-accepted numeric designation.

Many calendars will only differ from the Gregorian Calendar in the year and era values. For example, the Japanese calendar will have many more eras (one for each Emperor), and the years will be numbered within that era. All calendar data inherits from the Gregorian calendar in the same locale data (if not present in the chain up to root), so only the differing data will be present. See Section 4.1 Multiple Inheritance.

<!ELEMENT months ( alias | (default?, monthContext*, special*)) >
<!ELEMENT monthContext ( alias | (default?, monthWidth*, special*)) >
<!ELEMENT monthWidth ( alias | (month*, special*)) >

<!ELEMENT days ( alias | (default?, dayContext*, special*)) >
<!ELEMENT dayContext ( alias | (default?, dayWidth*, special*)) >
<!ELEMENT dayWidth ( alias | (day*, special*)) >

<!ELEMENT quarters ( alias | (default?, quarterContext*, special*)) >
<!ELEMENT quarterContext ( alias | (default?, quarterWidth*, special*)) >
<!ELEMENT quarterWidth ( alias | (quarter*, special*)) >

Month, day, and quarter names may vary along two axes: the width and the context. The context is either format (the default), the form used within a date format string (such as "Saturday, November 12th"), or stand-alone, the form used independently, such as in calendar headers. The width can be wide (the default), abbreviated, or narrow. The format values must be distinct; that is, "S" could not be used both for Saturday and for Sunday. The same is not true for stand-alone values; they might only be distinguished by context, especially in the narrow format. That format is typically used in calendar headers; it must be the shortest possible width, no more than one character (or grapheme cluster) in stand-alone values, and the shortest possible widths (in terms of grapheme clusters) in format values.

If the stand-alone form does not exist (in the chain up to root), then it inherits from the format form. See Section 4.1 Multiple Inheritance. If the narrow format does not exist, it inherits from the abbreviated form; if the abbreviated format does not exist, it inherits from the wide format.
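One way to read these fallback rules is sketched below. The data layout (a dictionary keyed by context and width) and the function name are assumptions, and so is the exact order in which the context and width fallbacks are combined: here a missing stand-alone form first falls back to the format form of the same width, and only then to wider format forms. Inheritance up to root is assumed to have been applied before the lookup.

```python
WIDTH_FALLBACK = {"narrow": "abbreviated", "abbreviated": "wide"}


def lookup_name(names, context, width, key):
    """Look up a month, day, or quarter name with context and width fallback.

    `names` maps (context, width) to a dict of key -> name.
    """
    candidates = [(context, width)]
    if context == "stand-alone":
        candidates.append(("format", width))
    w = width
    while w in WIDTH_FALLBACK:
        w = WIDTH_FALLBACK[w]
        candidates.append(("format", w))
    for cand in candidates:
        value = names.get(cand, {}).get(key)
        if value is not None:
            return value
    return None


# Illustrative data only.
NAMES = {
    ("format", "wide"): {"1": "January"},
    ("format", "abbreviated"): {"1": "Jan"},
    ("stand-alone", "narrow"): {"1": "J"},
}
```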

The older monthNames, dayNames, and monthAbbr, dayAbbr are maintained for backwards compatibility. They are equivalent to: using the months element with the context type="format" and the width type="wide" (for ...Names) and type="narrow" (for ...Abbr), respectively. The minDays, firstDay, weekendStart, and weekendEnd elements are also deprecated; there are new elements in supplemental data for this data.

Example:

  <calendar type="gregorian">
    <months>
      <default type="format"/>
      <monthContext type="format">
         <default type="wide"/>
         <monthWidth type="wide">
            <month type="1">January</month>
            <month type="2">February</month>
...
            <month type="11">November</month>
            <month type="12">December</month>
        </monthWidth>
        <monthWidth type="abbreviated">
            <month type="1">Jan</month>
            <month type="2">Feb</month>
...
            <month type="11">Nov</month>
            <month type="12">Dec</month>
        </monthWidth>
       </monthContext>
       <monthContext type="stand-alone">
         <default type="wide"/>
         <monthWidth type="wide">
            <month type="1">Januaria</month>
            <month type="2">Februaria</month>
...
            <month type="11">Novembria</month>
            <month type="12">Decembria</month>
        </monthWidth>
        <monthWidth type="narrow">
            <month type="1">J</month>
            <month type="2">F</month>
...
            <month type="11">N</month>
            <month type="12">D</month>
        </monthWidth>
       </monthContext>
    </months>

    <days>
      <default type="format"/>
      <dayContext type="format">
         <default type="wide"/>
         <dayWidth type="wide">
            <day type="sun">Sunday</day>
            <day type="mon">Monday</day>
...
            <day type="fri">Friday</day>
            <day type="sat">Saturday</day>
        </dayWidth>
        <dayWidth type="abbreviated">
            <day type="sun">Sun</day>
            <day type="mon">Mon</day>
...
            <day type="fri">Fri</day>
            <day type="sat">Sat</day>
        </dayWidth>
        <dayWidth type="narrow">
            <day type="sun">Su</day>
            <day type="mon">M</day>
...
            <day type="fri">F</day>
            <day type="sat">Sa</day>
        </dayWidth>
      </dayContext>
      <dayContext type="stand-alone">
        <dayWidth type="narrow">
            <day type="sun">S</day>
            <day type="mon">M</day>
...
            <day type="fri">F</day>
            <day type="sat">S</day>
        </dayWidth>
      </dayContext>
    </days>

    <quarters>
      <default type="format"/>
      <quarterContext type="format">
         <default type="abbreviated"/>
         <quarterWidth type="abbreviated">
            <quarter type="1">Q1</quarter>
            <quarter type="2">Q2</quarter>
            <quarter type="3">Q3</quarter>
            <quarter type="4">Q4</quarter>
        </quarterWidth>
        <quarterWidth type="wide">
            <quarter type="1">1st quarter</quarter>
            <quarter type="2">2nd quarter</quarter>
            <quarter type="3">3rd quarter</quarter>
            <quarter type="4">4th quarter</quarter>
        </quarterWidth>
      </quarterContext>
    </quarters>

    <am>AM</am>
    <pm>PM</pm>

    <eras>
       <eraAbbr>
        <era type="0">BC</era>
        <era type="1">AD</era>
       </eraAbbr>
       <eraNames>
        <era type="0">Before Christ</era>
        <era type="1">Anno Domini</era>
       </eraNames>
       <eraNarrow>
        <era type="0">B</era>
        <era type="1">A</era>
       </eraNarrow>
    </eras>

<dateFormats>

<!ELEMENT dateFormats (alias | (default?, dateFormatLength*, special*)) >
<!ELEMENT dateFormatLength (alias | (default?, dateFormat*, special*)) >
<!ELEMENT dateFormat (alias | (pattern*, displayName?, special*)) >

Date formats have the following form:

    <dateFormats>
      <default type="medium"/>
      <dateFormatLength type="full">
        <dateFormat>
          <pattern>EEEE, MMMM d, yyyy</pattern>
        </dateFormat>
      </dateFormatLength>
      <dateFormatLength type="medium">
        <default type="DateFormatsKey2"/>
        <dateFormat type="DateFormatsKey2">
          <pattern>MMM d, yyyy</pattern>
        </dateFormat>
        <dateFormat type="DateFormatsKey3">
          <pattern>MMM dd, yyyy</pattern>
        </dateFormat>
      </dateFormatLength>
    </dateFormats>

<timeFormats>

<!ELEMENT timeFormats (alias | (default?, timeFormatLength*, special*)) >
<!ELEMENT timeFormatLength (alias | (default?, timeFormat*, special*)) >
<!ELEMENT timeFormat (alias | (pattern*, displayName?, special*)) >

Time formats have the following form:

     <timeFormats>
       <default type="medium"/>
       <timeFormatLength type="full">
         <timeFormat>
           <displayName>DIN 5008 (EN 28601)</displayName>
           <pattern>h:mm:ss a z</pattern>
         </timeFormat>
       </timeFormatLength>
       <timeFormatLength type="medium">
         <timeFormat>
           <pattern>h:mm:ss a</pattern>
         </timeFormat>
       </timeFormatLength>
     </timeFormats>

The preference of 12 hour vs 24 hour for the locale should be derived from the short timeFormat. If the hour symbol is "h" or "K" (of various lengths) then the format is 12 hour; otherwise it is 24 hour.
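This rule can be applied to a pattern string roughly as follows. The sketch skips text enclosed in single quotes, which is literal text in date format patterns, and treats a pattern with no hour symbol as 24 hour, following the "otherwise" in the rule above. The function name is illustrative.

```python
def uses_12_hour_clock(pattern):
    """True if the pattern's hour symbol is 'h' or 'K'; quoted text is skipped."""
    in_quote = False
    for ch in pattern:
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote and ch in "hKHk":
            return ch in "hK"
    return False  # no hour symbol found: treated as 24 hour
```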

Date/Time formats have the following form:

     <dateTimeFormats>
       <default type="medium"/>
       <dateTimeFormatLength type="full">
         <dateTimeFormat>
            <pattern>{0} {1}</pattern>
         </dateTimeFormat>
       </dateTimeFormatLength>
       <availableFormats>
         <dateFormatItem>d. MMM yy</dateFormatItem>
         <dateFormatItem>hh:mm:ss a</dateFormatItem>
         <dateFormatItem>MMMM yyyy</dateFormatItem>
         <dateFormatItem>MMM yy</dateFormatItem>
         . . .
       </availableFormats>
       <appendItems>
         <appendItem request="G">{0} {1}</appendItem>
         <appendItem request="w">{0} ({2}: {1})</appendItem>
         . . .
       </appendItems>
     </dateTimeFormats>
  </calendar>

  <calendar type="buddhist">
    <eras>
      <eraAbbr>
        <era type="0">BE</era>
      </eraAbbr>
    </eras>
  </calendar>

<dateTimeFormats>

<!ELEMENT dateTimeFormats (alias | (default?, dateTimeFormatLength*, availableFormats*, appendItems*, special*)) >
<!ELEMENT dateTimeFormatLength (alias | (dateTimeFormat*, special*))>
<!ELEMENT dateTimeFormat (alias | (pattern*, special*))>
<!ELEMENT availableFormats (alias | (dateFormatItem*, special*))>
<!ELEMENT appendItems (alias | (appendItem*, special*))>
<!ATTLIST appendItem request CDATA >

These formats allow for date and time formats to be composed in various ways. The dateTimeFormat element works like the dateFormats and timeFormats, except that the pattern is of the form "{1} {0}", where {0} is replaced by the time format, and {1} is replaced by the date format, with results such as "8/27/06 7:31 AM".
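The substitution can be sketched in a few lines. The function name is illustrative, and the inputs are assumed to be the already-formatted date and time strings.

```python
import re


def combine_date_time(pattern, date_text, time_text):
    """Replace {0} with the formatted time and {1} with the formatted date."""
    parts = {"0": time_text, "1": date_text}
    return re.sub(r"\{([01])\}", lambda m: parts[m.group(1)], pattern)
```

With the pattern "{1} {0}", the date "8/27/06" and the time "7:31 AM", this yields the result quoted above.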

The availableFormats element and its subelements provide a more flexible formatting mechanism than the predefined list of patterns represented by dateFormatLength, timeFormatLength, and dateTimeFormatLength. Instead, there is an open-ended list of patterns (represented by dateFormatItem elements as well as the predefined patterns mentioned above) that can be matched against a requested set of calendar fields and field lengths. Software can look through the list and find the pattern that best matches the original request, based on the desired calendar fields and lengths. For example, the full month and year may be needed for a calendar application; the request is MMMMyyyy, but the best match may be "yyyy MMMM" or even "G yy MMMM", depending on the locale and calendar.

The id attribute is a so-called "skeleton", containing only field information, and in a canonical order. Examples are "yyyyMMMM" for year + full month, or "MMMd" for abbreviated month + day.

In case the best match does not include all the requested calendar fields, the appendItems element describes how to append needed fields to one of the existing formats. Each appendItem element covers a single calendar field. In the pattern, {0} represents the format string, {1} the data content of the field, and {2} the display name of the field (see Calendar Fields).

<week>

<!ELEMENT week (alias | (minDays?, firstDay?, weekendStart?, weekendEnd?, special*))>

The week element is deprecated in the main LDML files, because the data is more appropriately organized as connected to territories, not to linguistic data. Instead, the similar element in the supplemental data file should be used.

Calendar Fields

<!ELEMENT fields ( alias | (field*, special*)) >
<!ELEMENT field ( alias | (displayName?, relative*, special*)) >

Translations may be supplied for names of calendar fields (elements of a calendar, such as Day, Month, Year, Hour, etc.), and for relative values for those fields (for example, the day with relative value -1 is "Yesterday"). Where there is not a convenient, customary word or phrase in a particular language for a relative value, it should be omitted.

Here are examples for English and German. Notice that the German has more relative values than the English does.

<calendar>
  <fields>
...
   <field type='day'>
    <displayName>Day</displayName>
    <relative type='-1'>Yesterday</relative>
    <relative type='0'>Today</relative>
    <relative type='1'>Tomorrow</relative>
   </field>
...
  </fields>
</calendar>
<calendar>
  <fields>
...
   <field type='day'>
    <displayName>Tag</displayName>
    <relative type='-2'>Vorgestern</relative>
    <relative type='-1'>Gestern</relative>
    <relative type='0'>Heute</relative>
    <relative type='1'>Morgen</relative>
    <relative type='2'>Übermorgen</relative>
   </field>
...
  </fields>
</calendar>
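
Because a relative value may be omitted for a language, a consumer of this data needs some fallback. The sketch below assumes, purely for illustration, that the fallback is an ordinary formatted date supplied by the caller; the function name and the dictionaries are hypothetical.

```python
# Relative day names keyed by the value of the 'type' attribute.
ENGLISH_DAY = {-1: "Yesterday", 0: "Today", 1: "Tomorrow"}
GERMAN_DAY = {-2: "Vorgestern", -1: "Gestern", 0: "Heute",
              1: "Morgen", 2: "Übermorgen"}

def relative_day_name(names, offset, fallback):
    """Return the relative name for offset, or the fallback if none is supplied."""
    return names.get(offset, fallback)

print(relative_day_name(GERMAN_DAY, 2, "29.08.06"))   # Übermorgen
print(relative_day_name(ENGLISH_DAY, 2, "8/29/06"))   # 8/29/06
```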

5.9.2 Timezone Names

<!ELEMENT timeZoneNames (alias | (hourFormat*, hoursFormat*, gmtFormat*, regionFormat*, fallbackFormat*, abbreviationFallback*, preferenceOrdering*, singleCountries*, default*, zone*, special*)) >
<!ELEMENT zone (alias | ( long*, short*, exemplarCity*, special*)) >

The timezone IDs (tzid) are language-independent, and follow the TZ timezone database [Olson]. However, the display names for those IDs can vary by locale. The generic time is the so-called wall time: what clocks show when they are correctly switched from standard to daylight time at the mandated time of the year.

Unfortunately, the canonical tzids (those in zone.tab) are not stable: they may change in each release of the TZ Timezone database. In CLDR, however, stability of identifiers is very important. So the canonical IDs in CLDR are kept stable as described in Appendix L: Canonical Form.

The following is an example of timezone data. Although this is an example of possible data, in most cases only the exemplarCity needs translation, and even that does not need to be present if a country has only a single timezone. As always, the type field for each zone is the identification of that zone. It is not to be translated.

<zone type="America/Los_Angeles" >
    <long>
        <generic>Pacific Time</generic>
        <standard>Pacific Standard Time</standard>
        <daylight>Pacific Daylight Time</daylight>
    </long>
    <short>
        <generic>PT</generic>
        <standard>PST</standard>
        <daylight>PDT</daylight>
    </short>
    <exemplarCity>San Francisco</exemplarCity>
</zone>

<zone type="Europe/London">
     <long>
        <generic>British Time</generic>
        <standard>British Standard Time</standard>
        <daylight>British Daylight Time</daylight>
    </long>
    <exemplarCity>York</exemplarCity>
</zone>

Note: Transmitting "14:30" with no other context is incomplete unless it contains information about the time zone. Ideally one would transmit neutral-format date/time information, commonly in UTC, and localize as close to the user as possible. (For more about UTC, see [UTCInfo].)

The conversion from local time into UTC depends on the particular time zone rules, which will vary by location. The standard data used for converting local time (sometimes called wall time) to UTC and back is the TZ Data [Olson], used by Linux, UNIX, Java, ICU, and others. The data includes rules for matching the laws for time changes in different countries. For example, for the US it is:

"During the period commencing at 2 o'clock antemeridian on the first Sunday of April of each year and ending at 2 o'clock antemeridian on the last Sunday of October of each year, the standard time of each zone established by sections 261 to 264 of this title, as modified by section 265 of this title, shall be advanced one hour..." (United States Law - 15 U.S.C. §6(IX)(260-7)).

Each region that has a different timezone or daylight savings time rules, either now or at any time back to 1970, is given a unique internal ID, such as Europe/Paris. (Some IDs are also distinguished on the basis of differences before 1970.) As with currency codes, these are internal codes. A localized string associated with these is provided for users (such as in the Windows Control Panels>Date/Time>Time Zone).

Unfortunately, laws change over time, and will continue to change in the future, both for the boundaries of timezone regions and the rules for daylight savings. Thus the TZ data is continually being augmented. Any two implementations using the same version of the TZ data will get the same results for the same IDs (assuming a correct implementation). However, if implementations use different versions of the data they may get different results. So if precise results are required then both the TZ ID and the TZ data version must be transmitted between the different implementations.

For more information, see [Data Formats].

The following subelements of timeZoneNames are used to control the fallback process described in Appendix J: Time Zone Display Names.

Element Name | Data Examples | Results/Comment
hourFormat | "+HHmm;-HHmm" | "+1200", "-1200"
hoursFormat | "{0}/{1}" | "-0800/-0700"
gmtFormat | "GMT{0}" | "GMT-0800"
gmtFormat | "{0}ВпГ" | "-0800ВпГ"
regionFormat | "{0} Time" | "Japan Time"
regionFormat | "Tiempo de {0}" | "Tiempo de Japón"
fallbackFormat | "Tiempo de «{0}»" | "Tiempo de «Tokyo»"
abbreviationFallback | type="GMT" | causes any "long" match to be skipped in Timezone fallbacks
preferenceOrdering | type="America/Mexico_City America/Chihuahua America/New_York" | a preference ordering among modern zones
singleCountries | list="America/Godthab America/Santiago America/Guayaquil Europe/Madrid Pacific/Auckland Pacific/Tahiti Europe/Lisbon..." | uses country name alone
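
As an illustration of how hourFormat and gmtFormat combine, the following sketch formats a UTC offset. It assumes that the two halves of hourFormat are the positive and negative patterns and that only the HH and mm fields occur; the function name is hypothetical, and the full fallback process is the one described in Appendix J.

```python
def format_gmt_offset(offset_minutes, hour_format="+HHmm;-HHmm", gmt_format="GMT{0}"):
    """Format an offset from UTC, given in minutes, using the two patterns."""
    positive, negative = hour_format.split(";")
    pattern = positive if offset_minutes >= 0 else negative
    hours, minutes = divmod(abs(offset_minutes), 60)
    text = pattern.replace("HH", f"{hours:02d}").replace("mm", f"{minutes:02d}")
    return gmt_format.replace("{0}", text)

print(format_gmt_offset(-480))                         # GMT-0800
print(format_gmt_offset(-480, gmt_format="{0}ВпГ"))    # -0800ВпГ
```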

5.10 Number Elements

<!ELEMENT numbers (alias | (symbols?, decimalFormats?, scientificFormats?, percentFormats?, currencyFormats?, currencies?, special*)) >

The numbers element supplies information for formatting and parsing numbers and currencies. It has the following sub-elements: <symbols>, <decimalFormats>, <scientificFormats>, <percentFormats>, <currencyFormats>, and <currencies>. The currency IDs are from [ISO4217] (plus some additional common-use codes). For more information, including the pattern structure, see Appendix G: Number Pattern Format.

5.10.1 Number Symbols

<!ELEMENT symbols (alias | (decimal?, group?, list?, percentSign?, nativeZeroDigit?, patternDigit?, plusSign?, minusSign?, exponential?, perMille?, infinity?, nan?, special*)) >

<symbols>
      <decimal>.</decimal>
      <group>,</group>
      <list>;</list>
      <percentSign>%</percentSign>
      <nativeZeroDigit>0</nativeZeroDigit>
      <patternDigit>#</patternDigit>
      <plusSign>+</plusSign>
      <minusSign>-</minusSign>
      <exponential>E</exponential>
      <perMille>‰</perMille>
      <infinity>∞</infinity>
      <nan>NaN</nan>
</symbols>

<!ELEMENT decimalFormats (alias | (default?, decimalFormatLength*, special*))>
<!ELEMENT decimalFormatLength (alias | (default?, decimalFormat*, special*))>
<!ELEMENT decimalFormat (alias | (pattern*, special*)) >
(scientificFormats, percentFormats, and currencyFormats have the same structure)

<decimalFormats>
  <decimalFormatLength type="long">
    <decimalFormat>
      <pattern>#,##0.###</pattern>
    </decimalFormat>
  </decimalFormatLength>
</decimalFormats>
<scientificFormats>
  <default type="long"/>
  <scientificFormatLength type="long">
    <scientificFormat>
      <pattern>0.000###E+00</pattern>
    </scientificFormat>
  </scientificFormatLength>
  <scientificFormatLength type="medium">
    <scientificFormat>
      <pattern>0.00##E+00</pattern>
    </scientificFormat>
  </scientificFormatLength>
</scientificFormats>
<percentFormats>
  <percentFormatLength type="long">
    <percentFormat>
      <pattern>#,##0%</pattern>
    </percentFormat>
  </percentFormatLength>
</percentFormats>
<currencyFormats>
  <currencyFormatLength type="long">
    <currencyFormat>
      <pattern>¤ #,##0.00;(¤ #,##0.00)</pattern>
    </currencyFormat>
  </currencyFormatLength>
</currencyFormats>

5.10.2 Currencies

<!ELEMENT currency (alias | (pattern*, displayName*, symbol*, pattern*, decimal*, group*, special*)) >

Note: pattern appears twice in the above. The first is for consistency with all other cases of pattern + displayName; the second is for backwards compatibility.

<currencies>
    <currency type="USD">
        <displayName>Dollar</displayName>
        <symbol>$</symbol>
    </currency>
    <currency type ="JPY">
        <displayName>Yen</displayName>
        <symbol>¥</symbol>
    </currency>
    <currency type ="INR">
        <displayName>Rupee</displayName>
        <symbol choice="true">0≤Rf|1≤Ru|1&lt;Rf</symbol>
    </currency>
    <currency type="PTE">
        <displayName>Escudo</displayName>
        <symbol>$</symbol>
    </currency>
</currencies>

In formatting currencies, the currency number format is used with the appropriate symbol from <currencies>, according to the currency code. The <currencies> list can contain codes that are no longer in current use, such as PTE. The choice attribute can be used to indicate that the value uses a pattern interpreted as in Appendix H: Choice Patterns.

When the currency symbol is substituted into a pattern, there may be some further modifications, according to the following.

<currencySpacing>
  <beforeCurrency>
    <currencyMatch>[:letter:]</currencyMatch>
    <surroundingMatch>[:digit:]</surroundingMatch>
    <insertBetween>&#x00a0;</insertBetween>
  </beforeCurrency>
  <afterCurrency>
    <currencyMatch>[:letter:]</currencyMatch>
    <surroundingMatch>[:digit:]</surroundingMatch>
    <insertBetween>&#x00a0;</insertBetween>
  </afterCurrency>
</currencySpacing>

This element controls whether additional characters are inserted on the boundary between the symbol and the pattern. For example, in the above, inserting the symbol "US$" into the pattern "#,##0.00¤" would result in an extra no-break space inserted before the symbol, e.g. "#,##0.00 US$", while inserting into the pattern "¤#,##0.00" would not, e.g. "US$#,##0.00". That is because the beforeCurrency condition matches (the symbol starts with the letter "U" and the adjacent pattern character is a digit), while the afterCurrency condition does not (the symbol ends with "$", which is not a letter). For more information on the matching used in the currencyMatch and surroundingMatch elements, see Appendix E: Unicode Sets.
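
The boundary check can be sketched as follows. This is only an approximation under stated assumptions: [:letter:] is taken to be General_Category L and [:digit:] to be General_Category Nd, beforeCurrency is read as the rule for text that precedes the symbol, the input is an already formatted number with "¤" as a placeholder, and the function name insert_symbol is hypothetical.

```python
import unicodedata

NBSP = "\u00a0"  # the insertBetween value from the example data

def is_letter(ch):
    return unicodedata.category(ch).startswith("L")

def is_digit(ch):
    return unicodedata.category(ch) == "Nd"

def insert_symbol(formatted, symbol, insert_between=NBSP):
    """Replace the placeholder with the symbol, adding spacing where both matches hold."""
    index = formatted.index("¤")
    before, after = formatted[:index], formatted[index + 1:]
    # beforeCurrency: digit just before the symbol, letter at the symbol's start.
    if before and is_digit(before[-1]) and is_letter(symbol[0]):
        before += insert_between
    # afterCurrency: digit just after the symbol, letter at the symbol's end.
    if after and is_digit(after[0]) and is_letter(symbol[-1]):
        after = insert_between + after
    return before + symbol + after

print(insert_symbol("1,234.57¤", "US$"))  # no-break space inserted before US$
print(insert_symbol("¤1,234.57", "US$"))  # nothing inserted: US$1,234.57
```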

Currencies can also contain optional grouping, decimal data, and pattern elements. This data is inherited from the <symbols> in the same locale data (if not present in the chain up to root), so only the differing data will be present. See Section 4.1 Multiple Inheritance.

Note: Currency values should never be interchanged without a known currency code. You never want the number 3.5 interpreted as $3.5 by one user and ¥3.5 by another. Locale data contains localization information for currencies, not a currency value for a country. A currency amount logically consists of a numeric value, plus an accompanying currency code (or equivalent). The currency code may be implicit in a protocol, such as where USD is implicit. But if the raw numeric value is transmitted without any context, then it has no definitive interpretation.

Notice that the currency code is completely independent of the end-user's language or locale. For example, RUR is the code for Russian Rubles. A currency amount of <RUR, 1.23457×10³> would be localized for a Russian user into "1 234,57р." (using U+0440 (р) cyrillic small letter er). For an English user it would be localized into the string "Rub 1,234.57". The end-user's language is needed for doing this last localization step; but that language is completely orthogonal to the currency code needed in the data. After all, the same English user could be working with dozens of currencies. Notice also that the currency code is independent of whether currency values are inter-converted, which requires more interesting financial processing: the rate of conversion may depend on a variety of factors.

Thus, once a currency amount is entered into a system, it should be accompanied by a currency code in all processing. This currency code is independent of whatever the user's original locale was. Only in badly-designed software is the currency code (or equivalent) not present, so that the software has to "guess" at the currency code based on the user's locale.
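
A minimal sketch of this principle, keeping the code attached to the value at all times, might look like the following. The class name CurrencyAmount and its behavior are illustrative assumptions, not anything defined by LDML.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class CurrencyAmount:
    """A numeric value that always travels with its currency code."""
    code: str        # e.g. "RUR"; independent of the user's locale
    value: Decimal

    def __add__(self, other):
        # Refuse to combine amounts in different currencies: that would
        # require a conversion rate, which is outside locale data.
        if self.code != other.code:
            raise ValueError(f"cannot add {self.code} and {other.code}")
        return CurrencyAmount(self.code, self.value + other.value)

total = CurrencyAmount("RUR", Decimal("1234.57")) + CurrencyAmount("RUR", Decimal("1"))
print(total.code, total.value)  # RUR 1235.57
```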

Note: The number of decimal places and the rounding for each currency are not locale-specific data, and are not contained in the Locale Data Markup Language format. Those values override whatever is given in the currency numberFormat. For more information, see Appendix C: Supplemental Data.

For background information on currency names, see [CurrencyInfo].

5.11 POSIX Elements

<!ELEMENT posix (alias | (messages*, special*)) >
<!ELEMENT messages (alias | ( yesstr?, nostr?)) >

The following are included for compatibility with POSIX.

 <posix>
       <posix:messages>
            <posix:yesstr>ja</posix:yesstr>
            <posix:nostr>nein</posix:nostr>
        </posix:messages>
 </posix>

  1. The values for yesstr and nostr contain a colon-separated list of strings that would normally be recognized as "yes" and "no" responses. For cased languages, this shall include only the lowercase version. POSIX locale generation tools must generate the uppercase equivalents, and the abbreviated versions, and add the English words wherever they do not conflict. Examples:
    • ja → ja:Ja:j:J:yes:Yes:y:Y
    • ja → ja:Ja:j:J:yes:Yes // exclude y:Y if it conflicts with the native "no".
  2. The older elements yesexpr and noexpr are deprecated. They should instead be generated from yesstr and nostr so that they match all the responses.
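
The expansion described in item 1 can be sketched as follows. The sketch assumes that "uppercase equivalents" means the capitalized form and that "abbreviated versions" means the first letter; how conflicts with the native "no" are detected is left to the caller, and the function name is hypothetical.

```python
def expand_responses(native, english, conflicts=()):
    """Expand a colon-separated list such as 'ja' into the generated POSIX list.

    conflicts holds lowercase strings that must not be added, for example
    an abbreviation that collides with the native word for "no".
    """
    result = []
    for word in native.split(":") + english.split(":"):
        for form in (word, word.capitalize(), word[0], word[0].upper()):
            if form not in result and form.lower() not in conflicts:
                result.append(form)
    return ":".join(result)

print(expand_responses("ja", "yes"))                   # ja:Ja:j:J:yes:Yes:y:Y
print(expand_responses("ja", "yes", conflicts={"y"}))  # ja:Ja:j:J:yes:Yes
```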

So for English, the appropriate strings and expressions would be as follows:

yesstr "yes:y"
nostr "no:n"

The generated yesexpr and noexpr would be:

yesexpr "^([yY]([eE][sS])?)"
This would match y,Y,yes,yeS,yEs,yES,Yes,YeS,YEs,YES.

noexpr "^([nN][oO]?)"
This would match n,N,no,nO,No,NO.
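
One way to generate such expressions mechanically is sketched below. The output is less compact than the hand-written expressions above but matches the same responses. It is checked here with Python's re module, on the assumption that this subset of the syntax behaves the same as a POSIX extended regular expression; the function name build_expr is hypothetical.

```python
import re

def any_case(word):
    """Turn 'yes' into '[yY][eE][sS]'; uncased characters are escaped as-is."""
    pieces = []
    for ch in word:
        if ch.lower() != ch.upper():
            pieces.append(f"[{ch.lower()}{ch.upper()}]")
        else:
            pieces.append(re.escape(ch))
    return "".join(pieces)

def build_expr(strings):
    """Build an expression from a colon-separated list, longest alternative first."""
    words = sorted(strings.split(":"), key=len, reverse=True)
    return "^(" + "|".join(any_case(w) for w in words) + ")"

yesexpr = build_expr("yes:y")
print(yesexpr)  # ^([yY][eE][sS]|[yY])
print(all(re.match(yesexpr, s) for s in ["y", "Y", "yes", "YES", "yEs"]))  # True
```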

5.12 Reference Element

<!ELEMENT references ( reference* ) >
<!ELEMENT reference ( #PCDATA ) >
<!ATTLIST reference type NMTOKEN #REQUIRED>
<!ATTLIST reference standard ( true | false ) #IMPLIED >
<!ATTLIST reference uri CDATA #IMPLIED >

The references section supplies a central location for specifying references and standards. The uri should be supplied if at all possible. If the reference is not online, then an ISBN should be supplied, such as in the following example:

<reference type="R2" uri="http://www.ur.se/nyhetsjournalistik/3lan.html">Landskoder på Internet</reference>
<reference type="R3" uri="URN:ISBN:91-47-04974-X">Svenska skrivregler</reference>

5.13 Collation Elements

<!ELEMENT collations (alias | (default?, collation*, special*)) >

This section contains one or more collation elements, distinguished by type. Each collation contains rules that specify a certain sort-order, as a tailoring of the UCA table defined in UTS #10: Unicode Collation Algorithm [UCA]. (For a chart view of the UCA, see Collation Chart [UCAChart].) This syntax is an XMLized version of the Java/ICU syntax. For illustration, the rules are accompanied by the corresponding basic ICU rule syntax [ICUCollation] (used in ICU and Java) and/or the ICU parameterizations, and the basic syntax may be used in examples.

Note: ICU provides a concise format for specifying orderings, based on tailorings to the UCA. For example, to specify that k and q follow 'c', one can use the rule: "& c < k < q". The rules also allow people to set default general parameter values, such as whether uppercase is before lowercase or not. (Java contains an earlier version of ICU, and has not been updated recently. It does not support any of the basic syntax marked with [...], and its default table is not the UCA.)

However, it is not necessary for ICU to be used in the underlying implementation. The features are simply related to the ICU capabilities, since that supplies more detailed examples.

Note: there is an on-line demonstration of collation at [LocaleExplorer] (pick the locale and scroll to "Collation Rules").

5.13.1 Version

The version attribute is used in case a specific version of the UCA is to be specified. It is optional, and is specified if the results are to be identical on different systems. If it is not supplied, then the version is assumed to be the same as the Unicode version for the system as a whole. In general, tailorings should be defined so as to minimize dependence on the underlying UCA version, by explicitly specifying the behavior of all characters used to write the language in question.

Note: For version 3.1.1 of the UCA, the version of Unicode must also be specified with any versioning information; an example would be "3.1.1/3.2" for version 3.1.1 of the UCA, for version 3.2 of Unicode. This has been changed by decision of the UTC, so that it will no longer be necessary as of UCA 4.0. So for 4.0 and beyond, the version just has a single number.

5.13.2 Collation Element

<!ELEMENT collation (alias | (base?, settings?, suppress_contractions?, optimize?, rules?, special*)) >

Like the ICU rules, the tailoring syntax is designed to be independent of the actual weights used in any particular UCA table. That way the same rules can be applied to UCA versions over time, even if the underlying weights change. The following describes the overall document structure of a collation:

<collation>
 <settings caseLevel="on"/>
 <rules>
  <!-- rules go here -->
 </rules>
</collation>

The optional base element, <base>...</base>, contains an alias element that points to another data source that defines a base collation. If present, it indicates that the settings and rules in the collation are modifications applied on top of the respective elements in the base collation. That is, any successive settings, where present, override what is in the base as described in Setting Options. Any successive rules are concatenated to the end of the rules in the base. The results of multiple rules applying to the same characters are covered in Orderings.

5.13.3 Setting Options

In XML, these are attributes of <settings>. For example, <settings strength="secondary"/> will only compare strings based on their primary and secondary weights.

If the attribute is not present, the default is used (or, if there is a base collation, the value of that attribute in the base). The default is listed in italics.

Collation Settings
Attribute | Options | Basic Example | XML Example | Description
strength | primary (1), secondary (2), tertiary (3), quaternary (4), identical (5) | [strength 1] | strength = "primary" | Sets the default strength for comparison, as described in the UCA.
alternate | non-ignorable, shifted | [alternate non-ignorable] | alternate = "non-ignorable" | Sets alternate handling for variable weights, as described in the UCA.
backwards | on, off | [backwards 2] | backwards = "on" | Sets the comparison for the second level to be backwards ("French"), as described in the UCA.
normalization | on, off | [normalization on] | normalization = "off" | If on, then the normal UCA algorithm is used. If off, then all strings that are in [FCD] will sort correctly, but others won't necessarily sort correctly. So it should only be set to off if the strings to be compared are in FCD.
caseLevel | on, off | [caseLevel on] | caseLevel = "off" | If set to on, a level consisting only of case characteristics will be inserted in front of the tertiary level. To ignore accents but take case into account, set strength to primary and caseLevel to on.
caseFirst | upper, lower, off | [caseFirst off] | caseFirst = "off" | If set to upper, causes upper case to sort before lower case. If set to lower, lower case will sort before upper case. Useful for locales that have already supported ordering but require different order of cases. Affects case and tertiary levels.
hiraganaQuaternary | on, off | [hiraganaQ on] | hiraganaQuaternary = "on" | Controls special treatment of Hiragana code points on the quaternary level. If turned on, Hiragana code points will get lower values than all the other non-variable code points. The strength must be greater than or equal to quaternary for this attribute to take effect.
numeric | on, off | [numeric on] | numeric = "on" | If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UCD]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123".
variableTop | uXXuYYYY | & \u00XX\uYYYY < [variable top] | variableTop = "uXXuYYYY" | The parameter value is an encoded Unicode string, with code points in hex, leading zeros removed, and 'u' inserted between successive elements. Sets the default value for the variable top. All the code points with primary strengths less than variable top will be considered variable, and thus affected by the alternate handling.
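
The effect of the numeric setting can be illustrated with a simple sort key. This is only a sketch of the idea, not the UCA: non-digit text is compared by code point rather than by collation weights, and the function name numeric_key is hypothetical.

```python
import re

def numeric_key(text):
    """Split text into digit and non-digit runs; digit runs compare by value."""
    parts = re.split(r"(\d+)", text)  # \d matches General_Category Nd for str input
    # Tag each part so that numbers and text are never compared with each other.
    return [(1, int(p)) if p.isdecimal() else (0, p) for p in parts]

items = ["A-123", "A-21"]
print(sorted(items))                   # ['A-123', 'A-21']  (plain code point order)
print(sorted(items, key=numeric_key))  # ['A-21', 'A-123']  (numeric = "on")
```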

 

5.13.4 Collation Rule Syntax

<!ELEMENT rules (alias | ( reset, ( reset | p | pc | s | sc | t | tc | q | qc | i | ic | x)* )) >

The goal for the collation rule syntax is to have clearly expressed rules with a concise format that parallels the Basic syntax as much as possible. The rule syntax uses abbreviated element names for primary (level 1), secondary (level 2), tertiary (level 3), and identical, to be as short as possible. The reason is that the tailorings for CJK characters are quite large (tens of thousands of elements), and the extra overhead would have been considerable. Other elements and attributes do not occur as frequently, and have longer names.

Note: The rules are stated in terms of actions that cause characters to change their ordering relative to other characters. This is for stability; assigning characters specific weights would not work, since the exact weight assignment in UCA (or ISO 14651) is not required for conformance — only the relative ordering of the weights. In addition, stating rules in terms of relative order is much less sensitive to changes over time in the UCA itself.

5.13.5 Orderings

The following are the normal ordering actions used for the bulk of characters. Each rule contains a string of ordered characters that starts with an anchor point or a reset value. The reset value is an absolute point in the UCA that determines the order of other characters. For example, the rule & a < g places "g" after "a" in a tailored UCA: the "a" does not change place. Logically, each subsequent rule after a reset indicates a change to the ordering (and comparison strength) of the characters in the UCA. For example, the UCA has the following sequence (abbreviated for illustration):

... a <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 á <3 Á <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

Whenever a character is inserted into the UCA sequence, it is inserted at the first point where the strength difference will not disturb the other characters in the UCA. For example, & a < g puts g in the above sequence with a strength of L1. Thus the g must go in after any lower strengths,  as follows:

... a <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 á <3 Á <1 g <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

The rule & a << g, which uses a level-2 strength, would produce the following sequence:

... a <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 g <2 á <3 Á <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

And the rule & a <<< g, which uses a level-3 strength, would produce the following sequence:

... a <3 g <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 á <3 Á <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

Since resets always work on the existing state, the rule entries must be in the proper order. A character or sequence may occur multiple times; each subsequent occurrence causes a different change. The following shows the result of serially applying three rules.

  Rules   Result Comment  
1 & a < g ... a <1 g ... Put g after a.
2 & a < h < k ... a <1 h <1 k <1 g ... Now put h and k after a (inserting before the g).
3 & h << g ... a <1 h <2 g <1 k ... Now put g after h (inserting before k).

Notice that characters can occur multiple times, and thus override previous rules.
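
The insertion behavior described above can be modeled on a simplified sequence. The sketch below is an illustration under assumptions, not the UCA algorithm: the sequence is a plain list of (character, level) pairs, where level is the strength of the difference from the preceding entry, and the function names place, render, and order are hypothetical.

```python
def place(sequence, anchor, char, level):
    """Return a new sequence with char placed after anchor at the given level."""
    items = list(sequence)
    # A character that already occurs is moved: drop the old entry, and let
    # its follower keep the stronger of the two differences.
    for index, (existing, existing_level) in enumerate(items):
        if existing == char:
            del items[index]
            if index < len(items):
                follower, follower_level = items[index]
                items[index] = (follower, min(existing_level, follower_level))
            break
    position = [c for c, _ in items].index(anchor) + 1
    # Skip entries that differ from their predecessor more weakly than level.
    while position < len(items) and items[position][1] > level:
        position += 1
    items.insert(position, (char, level))
    return items

def render(items):
    head, *rest = items
    return head[0] + "".join(f" <{lvl} {c}" for c, lvl in rest)

def order(items):
    return " ".join(c for c, _ in items)

base = [("a", 1), ("A", 3), ("á", 2), ("Á", 3), ("æ", 1), ("b", 1)]
print(render(place(base, "a", "g", 1)))  # a <3 A <2 á <3 Á <1 g <1 æ <1 b
print(render(place(base, "a", "g", 2)))  # a <3 A <2 g <2 á <3 Á <1 æ <1 b

# Serially applying the three rules from the table, starting from "a" alone.
state = place([("a", 1)], "a", "g", 1)                        # & a < g
state = place(place(state, "a", "h", 1), "h", "k", 1)         # & a < h < k
state = place(state, "h", "g", 2)                             # & h << g
print(order(state))  # a h g k
```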

Except for the case of expansion sequence syntax, every sequence after a reset is equivalent in action to breaking up the sequence into an atomic rule: a reset + relation pair. The tailoring is then equivalent to applying each of the atomic rules to the UCA in order, according to the above description.

Example:

Rules | Equivalent Atomic Rules
& b < q <<< Q | & b < q
 | & q <<< Q
& a < x <<< X << q <<< Q < z | & a < x
 | & x <<< X
 | & X << q
 | & q <<< Q
 | & Q < z
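
The decomposition into atomic rules can be sketched as follows, assuming the basic syntax is written with whitespace between every token and contains no expansions. The function name atomic_rules is hypothetical.

```python
def atomic_rules(rule):
    """Split '& a < x <<< X' into reset + relation pairs: ['& a < x', '& x <<< X']."""
    tokens = rule.split()
    if len(tokens) < 2 or tokens[0] != "&":
        raise ValueError("a rule must start with a reset (&)")
    anchor = tokens[1]
    result = []
    # Tokens after the reset alternate between a relation and an item.
    for relation, item in zip(tokens[2::2], tokens[3::2]):
        result.append(f"& {anchor} {relation} {item}")
        anchor = item  # the next relation is relative to the item just placed
    return result

for line in atomic_rules("& a < x <<< X << q <<< Q < z"):
    print(line)
```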

In the case of expansion sequence syntax, the equivalent atomic sequence can be derived by first transforming the expansion sequence syntax into normal expansion syntax. (See Expansions.)

<!ELEMENT reset ( #PCDATA | cp | ... )* >
<!ELEMENT p ( #PCDATA | cp | last_variable )* >
(Elements pc, s, sc, t, tc, q, qc, i, and ic have the same structure as p.)

Specifying Collation Ordering
Basic Symbol | Basic Example | XML Symbol | XML Example | Description
& | & Z | <reset> | <reset>Z</reset> | Don't change the ordering of Z, but place subsequent characters relative to it.
< | & a < b | <p> | <reset>a</reset><p>b</p> | Make 'b' sort after 'a', as a primary (base-character) difference
<< | & a << ä | <s> | <reset>a</reset><s>ä</s> | Make 'ä' sort after 'a' as a secondary (accent) difference
<<< | & a <<< A | <t> | <reset>a</reset><t>A</t> | Make 'A' sort after 'a' as a tertiary (case/variant) difference
= | & v = w | <i> | <reset>v</reset><i>w</i> | Make 'w' sort identically to 'v'

Resets only need to be at the start of a sequence, to position the characters relative to a character that is in the UCA (or has already occurred in the tailoring). For example: <reset>z</reset><p>a</p><p>b</p><p>c</p><p>d</p>.

Some additional elements are provided to save space with large tailorings. The addition of a 'c' to the element name indicates that each of the characters in the contents of that element is to be handled as if it were a separate element with the corresponding strength:

Abbreviating Ordering Specifications
XML Symbol | XML Example | Equivalent
<pc> | <pc>bcd</pc> | <p>b</p><p>c</p><p>d</p>
<sc> | <sc>àáâã</sc> | <s>à</s><s>á</s><s>â</s><s>ã</s>
<tc> | <tc>PpP</tc> | <t>P</t><t>p</t><t>P</t>
<ic> | <ic>VwW</ic> | <i>V</i><i>w</i><i>W</i>
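
The expansion of an abbreviated element can be sketched as follows. The sketch treats each code point of the content as one character and handles plain text only (no cp elements); the function name is hypothetical.

```python
from xml.sax.saxutils import escape

def expand_abbreviation(tag, text):
    """Expand e.g. ('pc', 'bcd') into '<p>b</p><p>c</p><p>d</p>'."""
    if tag not in ("pc", "sc", "tc", "qc", "ic"):
        raise ValueError(f"not an abbreviated element: {tag}")
    single = tag[0]
    # One element per code point, escaping characters that are special in XML.
    return "".join(f"<{single}>{escape(ch)}</{single}>" for ch in text)

print(expand_abbreviation("pc", "bcd"))  # <p>b</p><p>c</p><p>d</p>
```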

5.13.6 Contractions

To sort a sequence as a single item (contraction), just use the sequence, e.g.

Specifying Contractions
Basic Example | XML Example | Description
& k < ch | <reset>k</reset><p>ch</p> | Make the sequence 'ch' sort after 'k', as a primary (base-character) difference

5.13.7 Expansions

<!ELEMENT x (context?, ( p | pc | s | sc | t | tc | q | qc | i | ic )*, extend? ) >

There are two ways to handle expansions (where a character sorts as a sequence) with both the basic syntax and the XML syntax. The first method is to reset to the sequence of characters. This is called sequence expansion syntax. The second is to use the extension sequence. This is called normal expansion syntax. Both are equivalent in practice (unless the reset sequence happens to be a contraction).

Specifying Expansions
Basic | XML | Description
& c << k / h | <reset>c</reset><x><s>k</s> <extend>h</extend></x> | normal expansion syntax: Make 'k' sort after the sequence 'ch'; thus 'k' will behave as if it expands to a character after 'c' followed by an 'h'.
& ch << k | <reset>ch</reset><s>k</s> | sequence expansion syntax: Make 'k' sort after the sequence 'ch'; thus 'k' will behave as if it expands to a character after 'c' followed by an 'h' (unless 'ch' is defined beforehand as a contraction).

If an <extend> element is necessary, it requires the rule to be embedded in an <x> element.

The sequence expansion syntax can be quite tricky, so it should be avoided where possible. In particular:

Each extension replaces the one before it; it does not append to it. So

& ab << c
& cd << e

is equivalent to:

& a << c / b << e / d

and produces the following weights (where p(x) is the primary weight of x and s(x) is its secondary weight):

Character Weights
c p(a), p(b); s(a)+1, s(b); ...
e p(a), p(d); s(a)+2, s(d); ...

When expressing rules as atomic rules, the sequences must first be transformed into normal expansion syntax:

Expansion Sequence | Normal Expansion | Equivalent Atomic Rules
& ab << q <<< Q | & a << q / b <<< Q / b | & a << q / b
 | | & q <<< Q / b
& ad <<< AD < x <<< X | & a <<< AD / d < x <<< X | & a <<< AD / d
 | | & AD < x
 | | & x <<< X

5.13.8 Context Before

The context before a character can affect how it is ordered, such as in Japanese. This could be expressed with a combination of contractions and expansions, but is faster using a context. (The actual weights produced are different, but the resulting string comparisons are the same.) If a context element occurs, it must be the first item in the rule, and requires an <x> element.

For example, suppose that "-" is sorted like the previous vowel. Then one could have rules that take "a-", "e-", and so on. However, that means that every time a very common character (a, e, ...) is encountered, a system will slow down as it looks for possible contractions. An alternative is to indicate that when "-" is encountered, and it comes after an 'a', it sorts like an 'a', etc.

Specifying Previous Context
Basic XML
& a <<< a | - 
& e <<< e | -  
...
<reset>a</reset><x><context>a</context><t>-</t></x>
<reset>e</reset><x><context>e</context><t>-</t></x>
...

Both the context and extend elements can occur in an <x> element. For example, the following are allowed:

5.13.9 Placing Characters Before Others

There are certain circumstances where characters need to be placed before a given character, rather than after. This is the case with Pinyin, for example, where certain accented letters are positioned before the base letter. That is accomplished with the following syntax.

Placing Characters Before Others
Item Options Basic Example   XML Example
before  primary
secondary
tertiary
& [before 2] a
<< à
<reset before="secondary">a</reset>
<s>à</s>

It is an error if the strength of the before relation is not identical to the relation after the reset. Thus the following are errors, since the value of the before attribute does not agree with the relation <s>.

Basic Example   XML Example
& [before 2] a
< à
<reset before="primary">a</reset>
<s>à</s>
Error
& [before 2] a
<<< à
<reset before="tertiary">a</reset>
<s>à</s>
Error
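
A check for this error condition can be sketched with a standard XML parser. The sketch assumes that the rules are wrapped in a <rules> element and that only a relation directly following the reset (or the first relation inside an <x> element) needs to be inspected; the function name before_errors is hypothetical.

```python
import xml.etree.ElementTree as ET

# Relation element expected for each value of the before attribute.
STRENGTH_TAG = {"primary": "p", "secondary": "s", "tertiary": "t"}

def before_errors(rules_xml):
    """Return messages for resets whose before attribute disagrees with the next relation."""
    children = list(ET.fromstring(rules_xml))
    errors = []
    for reset, relation in zip(children, children[1:]):
        before = reset.get("before") if reset.tag == "reset" else None
        if before is None:
            continue
        if relation.tag == "x":
            # Look inside <x> for the relation, skipping context and extend.
            relation = next((c for c in relation
                             if c.tag not in ("context", "extend")), relation)
        expected = STRENGTH_TAG.get(before)
        if expected is None or relation.tag not in (expected, expected + "c"):
            errors.append(f'before="{before}" does not agree with <{relation.tag}>')
    return errors

good = '<rules><reset before="secondary">a</reset><s>à</s></rules>'
bad = '<rules><reset before="primary">a</reset><s>à</s></rules>'
print(before_errors(good))  # []
print(before_errors(bad))   # ['before="primary" does not agree with <s>']
```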

5.13.10 Logical Reset Positions

<!ELEMENT reset ( ... | first_variable| last_variable | first_tertiary_ignorable | last_tertiary_ignorable | first_secondary_ignorable | last_secondary_ignorable | first_primary_ignorable | last_primary_ignorable | first_non_ignorable | last_non_ignorable | first_trailing | last_trailing )* >

The UCA has the following overall structure for weights, going from low to high.

Specifying Logical Positions
Name | Description | UCA Examples
first tertiary ignorable ... last tertiary ignorable | p, s, t = ignore | Control Codes, Format Characters, Hebrew Points, Tibetan Signs, ...
first secondary ignorable ... last secondary ignorable | p, s = ignore | None in UCA
first primary ignorable ... last primary ignorable | p = ignore | Most combining marks
first variable ... last variable | if alternate = non-ignorable: p != ignore; if alternate = shifted: p, s, t = ignore | Whitespace, Punctuation, Symbols
first non-ignorable ... last non-ignorable | p != ignore | Small number of exceptional symbols [e.g. U+02D0 MODIFIER LETTER TRIANGULAR COLON], Numbers, Latin, Greek, ...
implicits | p != ignore, assigned automatically | CJK, CJK compatibility (those that are not decomposed), CJK Extension A, B, Unassigned
first trailing ... last trailing | p != ignore, used for trailing syllable components | Jamo Trailing, Jamo Leading

Each of the above Names (except implicits) can be used with a reset to position characters relative to that logical position. That allows characters to be ordered before or after a logical position rather than a specific character.

Note: The reason for this is so that tailorings can be more stable. A future version of the UCA might add characters at any point in the above list. Suppose that you set character X to be after Y. It could be that you want X to come after Y, no matter what future characters are added; or it could be that you just want X to come after a given logical position, e.g. after the last primary ignorable.

Here is an example of the syntax:

Sample Logical Position
Basic | XML
& [first tertiary ignorable] << à | <reset><first_tertiary_ignorable/></reset> <s>à</s>
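
The correspondence between the basic and XML forms of a logical position is mechanical, as this Python sketch suggests. It is an illustration only: the function name is hypothetical, the element names are taken from the DTD fragment above, and it assumes the basic form spells each name with spaces (and a hyphen in "non-ignorable") as in the table.

```python
# Element names allowed inside <reset>, from the DTD fragment above.
LOGICAL_POSITIONS = {
    "first_variable", "last_variable",
    "first_tertiary_ignorable", "last_tertiary_ignorable",
    "first_secondary_ignorable", "last_secondary_ignorable",
    "first_primary_ignorable", "last_primary_ignorable",
    "first_non_ignorable", "last_non_ignorable",
    "first_trailing", "last_trailing",
}


def logical_position_to_xml(basic):
    """Convert e.g. "[first tertiary ignorable]" to its <reset> form."""
    name = basic.strip().strip("[]").strip()
    tag = name.replace(" ", "_").replace("-", "_")
    if tag not in LOGICAL_POSITIONS:
        # "implicits", for example, cannot be used with a reset.
        raise ValueError(f"not a logical reset position: {basic!r}")
    return f"<reset><{tag}/></reset>"
```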

For example, to make a character be a secondary ignorable, one can make it be immediately after (at a secondary level) a specific character (like a combining dieresis), or one can make it be immediately after the last secondary ignorable.

The last-variable element indicates the "highest" character that is treated as punctuation with alternate handling. Unlike the other logical positions, it can be reset as well as referenced. For example, it can be reset to be just above spaces if all visible punctuation is to be treated as having distinct primary values.

Specifying Last-Variable
Attribute | Options | Basic Example | XML Example
variableTop | at | & x = [last variable] | <reset>x</reset> <i><last_variable/></i>
variableTop | after | & x < [last variable] | <reset>x</reset> <p><last_variable/></p>
variableTop | before | & [before 1] x < [last variable] | <reset before="primary">x</reset> <p><last_variable/></p>

The default value for variable-top depends on the UCA setting. For example, in 3.1.1, the value is at:

U+1D7C3 MATHEMATICAL SANS-SERIF BOLD ITALIC PARTIAL DIFFERENTIAL

The <last_variable/> cannot occur inside an <x> element, nor can there be any element content. Thus there can be no <context> or <extend> or text data in the rule. For example, the following are all disallowed:

5.13.11 Special-Purpose Commands

The suppress contractions tailoring command turns off any existing contractions that begin with those characters. It is typically used to turn off the Cyrillic contractions in the UCA, since they are not used in many languages and have a considerable performance penalty. The argument is a Unicode Set.

The optimize tailoring command is purely for performance. It indicates that those characters are sufficiently common in the target language for the tailoring that their performance should be enhanced.

Special-Purpose Commands
Basic XML
[suppress contractions [Љ-ґ]] <suppress_contractions>[Љ-ґ]</suppress_contractions>
[optimize [Ά-ώ]] <optimize>[Ά-ώ]</optimize>


The reason that these are not settings is so that their contents can be arbitrary characters.


Example Collation

The following is a simple example that takes portions of the Swedish tailoring plus part of a Japanese tailoring, for illustration. For more complete examples, see the actual locale data: Japanese, Chinese, Swedish, Traditional German are particularly illustrative.

<collation version="3.1.1">
  <settings caseLevel="on"/>
  <rules>
        <reset>Z</reset>
        <p>æ</p>
        <t>Æ</t>
        <p>å</p>
        <t>Å</t>
        <t>aa</t>
        <t>aA</t>
        <t>Aa</t>
        <t>AA</t>
        <p>ä</p>
        <t>Ä</t>
        <p>ö</p>
        <t>Ö</t>
        <s>ű</s>
        <t>Ű</t>
        <p>ő</p>
        <t>Ő</t>
        <s>ø</s>
        <t>Ø</t>
        <reset>V</reset>
        <tc>wW</tc>
        <reset>Y</reset>
        <tc>üÜ</tc>
        <reset><last_non_ignorable/></reset>
        <!-- following is equivalent to <p>亜</p><p>唖</p><p>娃</p>... -->
        <pc>亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦</pc>
        <pc>鯵梓圧斡扱</pc>
  </rules>
</collation>
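
As the comment in the example notes, a compressed element such as <pc> is equivalent to a sequence of single-character relations. The following Python sketch shows that expansion under simplifying assumptions: the function name is hypothetical, only text content is handled (a reset containing a logical position element yields empty text), and the sc and ic entries are assumed by analogy with pc and tc.

```python
import xml.etree.ElementTree as ET

# Compressed relation element -> the single-character relation it stands for.
# Only pc and tc appear in the example; sc and ic are assumed by analogy.
COMPRESSED = {"pc": "p", "sc": "s", "tc": "t", "ic": "i"}


def expand_compressed(rules_xml):
    """Return (tag, text) pairs with compressed elements expanded per character."""
    expanded = []
    for elem in ET.fromstring(rules_xml):
        text = elem.text or ""
        if elem.tag in COMPRESSED:
            # One relation per code point; a Python str iterates by code point.
            expanded.extend((COMPRESSED[elem.tag], ch) for ch in text)
        else:
            expanded.append((elem.tag, text))
    return expanded
```

For instance, <reset>V</reset><tc>wW</tc> expands to a reset on V followed by two tertiary relations, one for w and one for W.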

5.14 Segmentations

The segmentations element provides for segmentation of text into words, lines, or other segments. The structure is based on [UAX29] notation, but adapted to be machine-readable. It uses a list of variables (representing character classes) and a list of rules. Each must have an id attribute.

The rules in root implement the segmentations found in [UAX29] and [UAX14], for grapheme clusters, words, sentences, and lines. They can be overridden by rules in child locales.

Here is an example:

<segmentations>
  <segmentation type="GraphemeClusterBreak">
    <variables>
      <variable id="$CR">\p{Grapheme_Cluster_Break=CR}</variable>
      <variable id="$LF">\p{Grapheme_Cluster_Break=LF}</variable>
      <variable id="$Control">\p{Grapheme_Cluster_Break=Control}</variable>
      <variable id="$Extend">\p{Grapheme_Cluster_Break=Extend}</variable>
      <variable id="$L">\p{Grapheme_Cluster_Break=L}</variable>
      <variable id="$V">\p{Grapheme_Cluster_Break=V}</variable>
      <variable id="$T">\p{Grapheme_Cluster_Break=T}</variable>
      <variable id="$LV">\p{Grapheme_Cluster_Break=LV}</variable>
      <variable id="$LVT">\p{Grapheme_Cluster_Break=LVT}</variable>
    </variables>
    <segmentRules>
      <rule id="3"> $CR × $LF </rule>
      <rule id="4"> ( $Control | $CR | $LF ) ÷ </rule>
      <rule id="5"> ÷ ( $Control | $CR | $LF ) </rule>
      <rule id="6"> $L × ( $L | $V | $LV | $LVT ) </rule>
      <rule id="7"> ( $LV | $V ) × ( $V | $T ) </rule>
      <rule id="8"> ( $LVT | $T) × $T </rule>
      <rule id="9"> × $Extend </rule>
    </segmentRules>
  </segmentation>
...

Variables: All variable ids must start with a $, and otherwise be valid identifiers according to the Unicode definitions in [UAX31]. The contents of a variable is a regular expression using variables and UnicodeSets. The ordering of variables is important; they are evaluated in order from first to last (see Section 5.14.1 Segmentation Inheritance). It is an error to use a variable before it is defined.

Rules: The contents of a rule use the syntax of [UAX29]. The rules are evaluated in numeric id order (which may not be the order in which they appear in the file). The first rule that matches determines the status of a boundary position, that is, whether it breaks or not. Thus ÷ means a break is allowed; × means a break is forbidden. It is an error if the rule does not contain exactly one of these characters (except where a rule has no contents at all), or if the rule uses a variable that has not been defined.
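
The ordering and the "exactly one of ÷ or ×" check can be sketched in Python as follows. This is an illustration under stated assumptions, not a conformant implementation: the function name is hypothetical, rules are passed as a plain mapping from id to contents, ids are assumed to parse as decimal numbers, and the check for undefined variables is omitted.

```python
BREAK, NO_BREAK = "÷", "×"


def ordered_rules(rules):
    """Sort {id: contents} by numeric id and validate each rule.

    A rule with no contents at all is allowed; any other rule must
    contain exactly one of the two boundary characters.
    """
    result = []
    for rule_id in sorted(rules, key=float):
        contents = rules[rule_id].strip()
        if contents:
            count = contents.count(BREAK) + contents.count(NO_BREAK)
            if count != 1:
                raise ValueError(f"rule {rule_id}: expected exactly one of ÷ or ×")
        result.append((rule_id, contents))
    return result
```

Sorting with key=float rather than by string keeps id="10" after id="9", which a plain string sort would not.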

There are some implicit rules:

Note: A rule like X Format* -> X in [UAX29] and [UAX14] is not supported. Instead, this needs to be expressed as normal regular expressions. The normal way to support this is to modify the variables, such as in the following example:

<variable id="$Format">\p{Word_Break=Format}</variable>
<variable id="$Katakana">\p{Word_Break=Katakana}</variable>
...
<!-- In place of rule 3, add format and extend to everything -->
<variable id="$X">[$Format $Extend]*</variable>
<variable id="$Katakana">($Katakana $X)</variable>
<variable id="$ALetter">($ALetter $X)</variable>
...

5.14.1 Segmentation Inheritance

Variables and rules both inherit from the parent.

Variables: The child's variable list is logically appended to the parent's, and evaluated in that order. For example:

// in parent
<variable id="$AL">[:linebreak=AL:]</variable>
<variable id="$YY">[[:linebreak=XX:]$AL]</variable> // adds $AL

// in child
<variable id="$AL">[$AL && [^a-z]]</variable> // changes $AL, doesn't affect $YY
<variable id="$ABC">[abc]</variable> // adds new variable
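
The effect of evaluating the combined list in order can be sketched in Python. This is only an illustration: the function name is hypothetical, variable references are substituted as plain text rather than parsed as UnicodeSets, and a reference is assumed to match the pattern $ followed by word characters.

```python
import re


def resolve_variables(definitions):
    """Evaluate (name, expression) pairs in order, parent's first, then child's.

    Each reference is replaced by the value the variable has at that
    point, so a later redefinition does not affect earlier variables.
    """
    values = {}
    for name, expression in definitions:
        def lookup(match):
            ref = match.group(0)
            if ref not in values:
                raise ValueError(f"{ref} is used before it is defined")
            return values[ref]

        values[name] = re.sub(r"\$\w+", lookup, expression)
    return values


definitions = [
    ("$AL", "[:linebreak=AL:]"),             # parent
    ("$YY", "[[:linebreak=XX:]$AL]"),        # parent
    ("$AL", "[$AL && [^a-z]]"),              # child: redefines $AL
    ("$ABC", "[abc]"),                       # child: new variable
]
```

After resolve_variables(definitions), $YY still contains the parent's value of $AL, while $AL itself has the child's narrowed value.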

Rules: The rules are also logically appended to the parent's. Because rules are evaluated in numeric id order, to insert a rule in between others just requires using an intermediate number. For example, to insert a rule after id="10.1" and before id="10.2", just use id="10.15". To delete a rule, use empty contents, such as:

<rule id="3"/> // deletes rule 3
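
A minimal Python sketch of rule inheritance follows. It is an assumption-laden illustration: the function name is hypothetical, rules are plain mappings from id to contents, a child rule is assumed to replace a parent rule whose id is spelled identically, and ids are assumed to parse as decimal numbers.

```python
def merge_rules(parent, child):
    """Merge {id: contents} rule maps; child entries win, empty contents delete."""
    merged = dict(parent)
    merged.update(child)
    # Compare ids numerically so "10.15" lands between "10.1" and "10.2".
    ordered = sorted(merged.items(), key=lambda item: float(item[0]))
    return [(rule_id, text) for rule_id, text in ordered if text.strip()]


parent_rules = {"3": "$CR × $LF", "10.1": "$A × $B", "10.2": "$C ÷ $D"}
child_rules = {"3": "", "10.15": "$E × $F"}
```

With these inputs, merge_rules(parent_rules, child_rules) drops rule 3 and yields rules 10.1, 10.15, and 10.2 in that order.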

5.15 Transforms

Transforms provide a set of rules for transforming text via a specialized set of context-sensitive matching rules. They are commonly used for transliterations or transcriptions, but also other transformations such as full-width to half-width (for katakana characters). The rules can be simple one-to-one relationships between characters, or involve more complicated mappings. Here is an example:

<transform source="Greek" target="Latin" variant="UNGEGN" direction="both">
...
  <comment>Useful variables</comment>
  <tRule>$gammaLike = [ΓΚΞΧγκξχϰ] ;</tRule>
  <tRule>$egammaLike = [GKXCgkxc] ;</tRule>
...
  <comment>Rules are predicated on running NFD first, and NFC afterwards</comment>
  <tRule>::NFD (NFC) ;</tRule>
...
  <tRule>λ ↔ l ;</tRule>
  <tRule>Λ ↔ L ;</tRule>
...
  <tRule>γ } $gammaLike ↔ n } $egammaLike ;</tRule>
  <tRule>γ ↔ g ;</tRule>
...
  <tRule>::NFC (NFD) ;</tRule>
...
</transform>

The source and target values are valid locale identifiers, where 'und' means an unspecified language, plus some additional extensions.

There is currently one variant used in CLDR: UNGEGN. There is an additional attribute private="true" which is used to indicate that the transform is used in other transforms, but should not be listed when presented to users.

There are many different systems of transliteration. The goals for the "unqualified" script transliterations are:

  1. to be lossless when going to Latin and back
  2. to be as lossless as possible when going to other scripts
  3. to abide by a common standard as much as possible (possibly supplemented to meet goals 1 and 2).

Additional transliterations may also be defined, such as customized language-specific transliterations (such as between Russian and French), or those that match a particular transliteration standard, such as

The rules for transforms are described in Appendix N: Transform Rules.

Appendix A: Sample Special Elements

The elements in this section are not part of the Locale Data Markup Language 1.0 specification. Instead, they are special elements used for application-specific data to be stored in the Common Locale Data Repository. They may change or be removed in future versions of this document, and are present here more as examples of how to extend the format. (Some of these items may move into a future version of the Locale Data Markup Language specification.)

The above examples are old versions: consult the documentation for the specific application to see which should be used.

These DTDs use namespaces and the special element. To include one or more, use the following pattern to import the special DTDs that are used in the file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "http://unicode.org/cldr/dtd/1.1/ldml.dtd" [
    <!ENTITY % icu SYSTEM "http://unicode.org/cldr/dtd/1.1/ldmlICU.dtd">
    <!ENTITY % openOffice SYSTEM "http://unicode.org/cldr/dtd/1.1/ldmlOpenOffice.dtd">
%icu;
%openOffice;
]>

Thus to include just the ICU DTD, one uses:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "http://unicode.org/cldr/dtd/1.1/ldml.dtd" [
    <!ENTITY % icu SYSTEM "http://unicode.org/cldr/dtd/1.1/ldmlICU.dtd">
%icu;
]>

Note: A previous version of this document contained a special element for ISO TR 14652 compatibility data. That element has been withdrawn, pending further investigation, since 14652 is a Type 1 TR: "when the required support cannot be obtained for the publication of an International Standard, despite repeated effort". See the ballot comments on 14652 Comments for details on the 14652 defects. For example, most of these patterns make little provision for substantial changes in format when elements are empty, so are not particularly useful in practice. Compare, for example, the mail-merge capabilities of production software such as Microsoft Word or OpenOffice.

Note: While the CLDR specification guarantees backwards compatibility, the definition of specials is up to other organizations. Any assurance of backwards compatibility is up to those organizations.

A.1 ICU

There is one main area where ICU has capabilities that go beyond what is shown above.

A.1.1 Rule-Based Number Formats

The rule-based number format (RBNF) encapsulates a set of rules for mapping binary numbers to and from a readable representation. They are typically used for spelling out numbers, but can also be used for other number systems like roman numerals, or for ordinal numbers (1st, 2nd, 3rd,...). The rules are fairly sophisticated; for details see Rule-Based Number Formatter [RBNF].

Example:

    <special xmlns:icu="http://ibm.com/software/globalization/icu/">
        <icu:ruleBasedNumberFormats>
            <icu:ruleBasedNumberFormat type="spellout">
                %%and:
                    and =%default=;
                    100: =%default=;
                %%commas:
                    ' and =%default=;
                    100: , =%default=;
                    1000: ,
            </icu:ruleBasedNumberFormat>
            <icu:ruleBasedNumberFormat type="ordinal">
                %main:
                    =#,##0==%%abbrev=;
                %%abbrev:
                    th; st; nd; rd; th;
                    20: &gt;&gt;;
                    100: &gt;&gt;;
            </icu:ruleBasedNumberFormat>
            <icu:ruleBasedNumberFormat type="duration">
            %with-words:
                0 seconds; 1 second; =0= seconds;
                60/60:
            </icu:ruleBasedNumberFormat>
        </icu:ruleBasedNumberFormats>
    </special>

A.2 openoffice.org

A number of the elements above can have extra information for openoffice.org, such as the following example:

    <special xmlns:openOffice="http://www.openoffice.org">
        <openOffice:search>
            <openOffice:searchOptions>
                <openOffice:transliterationModules>IGNORE_CASE</openOffice:transliterationModules>
            </openOffice:searchOptions>
        </openOffice:search>
    </special>

Appendix B: Transmitting Locale Information

In a world of on-demand software components, with arbitrary connections between those components, it is important to get a sense of where localization should be done, and how to transmit enough information so that it can be done at that appropriate place. End-users need to get messages localized to their languages, messages that not only contain a translation of text, but also contain variables such as date, time, number formats, and currencies formatted according to the users' conventions. The strategy for doing the so-called JIT localization is made up of two parts:

  1. Store and transmit neutral-format data wherever possible.
    • Neutral-format data is data that is kept in a standard format, no matter what the local user's environment is. Neutral-format is also (loosely) called binary data, even though it actually could be represented in many different ways, including a textual representation such as in XML.
    • Such data should use accepted standards where possible, such as for currency codes.
    • Textual data should also be in a uniform character set (Unicode/10646) to avoid possible data corruption problems when converting between encodings.
  2. Localize that data as "close" to the end-user as possible.

There are a number of advantages to this strategy. The longer the data is kept in a neutral format, the more flexible the entire system is. On a practical level, if transmitted data is neutral-format, then it is much easier to manipulate the data, debug the processing of the data, and maintain the software connections between components.

Once data has been localized into a given language, it can be quite difficult to programmatically convert that data into another format, if required. This is especially true if the data contains a mixture of translated text and formatted variables. Once information has been localized into, say, Romanian, it is much more difficult to localize that data into, say, French. Parsing is more difficult than formatting, and may run up against different ambiguities in interpreting text that has been localized, even if the original translated message text is available (which it may not be).

Moreover, the closer we are to the end-user, the more we know about that user's preferred formats. If we format dates, for example, at the user's machine, then it can easily take into account any customizations that the user has specified. If the formatting is done elsewhere, either we have to transmit whatever user customizations are in play, or we only transmit the user's locale code, which may only approximate the desired format. Thus the closer the localization is to the end user, the less we need to ship all of the user's preferences around to all the places that localization could possibly need to be done.

Even though localization should be done as close to the end-user as possible, there will be cases where different components need to be aware of whatever settings are appropriate for doing the localization. Thus information such as a locale code or timezone needs to be communicated between different components.

B.1 Message Formatting and Exceptions

Windows (FormatMessage, String.Format), Java (MessageFormat) and ICU (MessageFormat, umsg) all provide methods of formatting variables (dates, times, etc) and inserting them at arbitrary positions in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this? It is especially important since the original code site that originates a particular message may be far down in the bowels of a component, and passed up to the top of the component with an exception. So we will take that case as representative of this class of issues.

There are circumstances where the message can be communicated with a language-neutral code, such as a numeric error code or mnemonic string key, that is understood outside of the component. If there are arguments that need to accompany that message, such as a number of files or a datetime, those need to accompany the numeric code so that when the localization is finally done at some point, the full information can be presented to the end-user. This is the best case for localization.

More often, the exact messages that could originate from within the component are not known outside of the component itself; or at least they may not be known by the component that is finally displaying text to the user. In such a case, the information as to the user's locale needs to be communicated in some way to the component that is doing the localization. That locale information does not necessarily need to be communicated deep within the component; ideally, any exceptions should bundle up some language-neutral message ID, plus the arguments needed to format the message (e.g. datetime), but not do the localization at the throw site. This approach has the advantages noted above for JIT localization.

In addition, exceptions are often caught at a higher level; they don't end up being displayed to any end-user at all. By avoiding the localization at the throw site, one avoids the cost of doing formatting when that formatting is not really necessary. In fact, in many running programs most of the exceptions that are thrown at a low level never end up being presented to an end-user, so this can have considerable performance benefits.
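
The pattern described in this section can be sketched in Python. Everything here is hypothetical and for illustration only: the LocalizableError class, the message ID "files.missing", and the tiny in-memory catalogs stand in for whatever exception type and resource bundles an application actually uses.

```python
class LocalizableError(Exception):
    """Carries a language-neutral message ID plus its arguments."""

    def __init__(self, message_id, **arguments):
        super().__init__(message_id)
        self.message_id = message_id
        self.arguments = arguments


# Hypothetical message catalogs keyed by locale; real data would come
# from resource bundles.
CATALOGS = {
    "en": {"files.missing": "{count} files could not be found."},
    "fr": {"files.missing": "{count} fichiers sont introuvables."},
}


def localize(error, locale):
    """Format the message only when it is about to be shown to a user."""
    template = CATALOGS[locale][error.message_id]
    return template.format(**error.arguments)


def low_level_operation():
    # The throw site does no formatting and needs no locale.
    raise LocalizableError("files.missing", count=3)
```

If the exception is caught and handled at a higher level without ever being displayed, localize is never called and no formatting cost is incurred.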

Appendix C: Supplemental Data

The following represents the format for supplemental information. This is information that is important for proper formatting, but is not contained in the locale hierarchy. It is not localizable, nor is it overridden by locale data. It uses the following format, where the data here is solely for illustration:

<supplementalData>
  <currencyData>
    <fractions>
      ...
      <info iso4217="CHF" digits="2" rounding="5"/>
      ...
      <info iso4217="ITL" digits="0"/>
      ...
    </fractions>
    ...
    <region iso3166="IT">
      <currency iso4217="EUR" from="1999-01-01"/>
      <currency iso4217="ITL" from="1862-8-24" to="2002-02-28"/>
    </region>
    ...
    <region iso3166="CS">
      <currency iso4217="EUR" from="2003-02-04"/>
      <currency iso4217="CSD" from="2002-05-15"/>
      <currency iso4217="YUM" from="1994-01-24" to="2002-05-15"/>
    </region>
    ...
  </currencyData>
</supplementalData>

Each currencyData element contains one fractions element followed by one or more region elements. The fractions element contains any number of info elements, with the following attributes:

Each region element contains one attribute:

And can have any number of currency elements, with the ordered subelements.

    <region iso3166="IT"> <!-- Italy -->
      <currency iso4217="EUR" from="2002-01-01"/>
      <currency iso4217="ITL" to="2001-12-31"/>
    </region>

That is, each currency element will list an interval in which it was valid. The ordering of the elements in the list tells us which was the primary currency during any period in time. Here is an example of such an overlap:

<currency iso4217="CSD" to="2002-05-15"/>
<currency iso4217="YUD" from="1994-01-24" to="2002-05-15"/>
<currency iso4217="YUN" from="1994-01-01" to="1994-07-22"/>

If the from attribute is missing, it is assumed to be as far backwards in time as we have data for; if the to attribute is missing, then it is from this point onwards. The from attribute is also limited by the fact that ISO 4217 does not go very far back in time, so there may be no ISO code for the previous currency.
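
A lookup of the primary currency for a given date can be sketched as follows in Python. The function names are hypothetical, the entries are plain dictionaries mirroring the currency attributes, and the sketch assumes that both endpoints of an interval are inclusive, which the text above does not state.

```python
from datetime import date


def parse_date(text):
    """Parse YYYY-MM-DD, tolerating unpadded fields such as 1862-8-24."""
    year, month, day = (int(part) for part in text.split("-"))
    return date(year, month, day)


def primary_currency(currencies, when):
    """Return the first listed currency whose interval contains `when`.

    `currencies` is an ordered list of dicts with an "iso4217" key and
    optional "from" and "to" keys; a missing bound is treated as open.
    """
    for entry in currencies:
        start = parse_date(entry["from"]) if "from" in entry else date.min
        end = parse_date(entry["to"]) if "to" in entry else date.max
        if start <= when <= end:
            return entry["iso4217"]
    return None


italy = [
    {"iso4217": "EUR", "from": "2002-01-01"},
    {"iso4217": "ITL", "to": "2001-12-31"},
]
```

Because the list is scanned in order, an earlier entry takes precedence wherever two intervals overlap.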

<languageData>

The following is used for consistency checking and testing. The coverage will improve over time. At this point, the territories and scripts are limited to those that are official languages of the region as a whole, or are major commercial languages.

	<languageData>
		<language type="af" scripts="Latn" territories="ZA"/>
		<language type="am" scripts="Ethi" territories="ET"/>
		<language type="ar" scripts="Arab" territories="AE BH DZ EG IN IQ JO KW LB
LY MA OM PS QA SA SD SY TN YE"/>
                ...

This element can also be used to indicate secondary languages and/or scripts used in a territory.

		<language type="fr" scripts="Latn" territories="IT US" alt="secondary" />
                ...

<timezoneData>

The following is data that can be used to get a single timezone id from a set of modern equivalents.

<timezoneData>
	<size ordering="America/New_York America/Detroit America/Louisville
America/Kentucky/Monticello">
		...

The following subelement of <timezoneData> supplies information used by Appendix J: Time Zone Display Names.

<zoneFormatting multizone="001 AQ AR AU BR CA CD CL CN EC ES FM GB GL ID KI KZ MH ML MN MX MY NZ PF PT RU SJ UA UM US UZ">
  <zoneItem type="Africa/Abidjan" territory="CI"/>
  <zoneItem type="Africa/Accra" territory="GH"/>
  <zoneItem type="America/Adak" territory="US" aliases="America/Atka US/Aleutian"/>
  <zoneItem type="Africa/Addis_Ababa" territory="ET"/>
  <zoneItem type="Australia/Adelaide" territory="AU" aliases="Australia/South"/>

The multizone attribute lists the territories that have multiple TZIDs, which is used in step #5 of Appendix J: Time Zone Display Names. The zoneItem type is the canonical ID for CLDR. The aliases map to that canonical ID; this is used in step #1 in Appendix J: Time Zone Display Names. The territory is also used in step #5.
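
The three lookups described above (multizone territories, alias to canonical ID, and zone to territory) can be built from this data as in the following Python sketch. The function name is hypothetical, and the sample is abbreviated from the example above, including a shortened multizone list.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample; the multizone list is shortened for illustration.
SAMPLE = """
<zoneFormatting multizone="001 AQ AR AU US">
  <zoneItem type="Africa/Abidjan" territory="CI"/>
  <zoneItem type="America/Adak" territory="US" aliases="America/Atka US/Aleutian"/>
  <zoneItem type="Australia/Adelaide" territory="AU" aliases="Australia/South"/>
</zoneFormatting>
"""


def load_zone_formatting(xml_text):
    """Return (multizone territories, alias-to-canonical map, zone-to-territory map)."""
    root = ET.fromstring(xml_text)
    multizone = set(root.get("multizone", "").split())
    canonical = {}
    territory = {}
    for item in root.iter("zoneItem"):
        zone = item.get("type")
        canonical[zone] = zone  # a canonical ID maps to itself
        territory[zone] = item.get("territory")
        for alias in item.get("aliases", "").split():
            canonical[alias] = zone
    return multizone, canonical, territory
```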

C.1 Territory Containment

The following data provides information that allows GUIs to break up a very long list of country names into a progressive list. The data is based on the information found at [UNM49]. There is one special code, QO, which is used for outlying areas that are typically uninhabited.

<territoryContainment>
<group type="001" contains="002 009 019 142 150"/> <!--World -->
<group type="011" contains="BF BJ CI CV GH GM GN GW LR ML MR NE NG SH SL SN TG"/> <!--Western Africa -->
<group type="013" contains="BZ CR GT HN MX NI PA SV"/> <!--Central America -->
<group type="014" contains="BI DJ ER ET KE KM MG MU MW MZ RE RW SC SO TZ UG YT ZM ZW"/> <!--Eastern Africa -->
<group type="142" contains="030 035 062 145"/> <!--Asia -->
<group type="145" contains="AE AM AZ BH CY GE IL IQ JO KW LB OM PS QA SA SY TR YE"/> <!--Western Asia -->
<group type="015" contains="DZ EG EH LY MA SD TN"/> <!--Northern Africa -->
...
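
To present a progressive list, a GUI needs to expand a group into its members one level at a time, or all the way down. The following Python sketch shows the full expansion. The function name is hypothetical, and the table holds only two of the groups from the sample above, so codes without an entry are returned unchanged.

```python
# Two groups copied from the sample above; the real data has many more.
CONTAINS = {
    "142": "030 035 062 145".split(),
    "145": "AE AM AZ BH CY GE IL IQ JO KW LB OM PS QA SA SY TR YE".split(),
}


def leaf_territories(code, contains=CONTAINS):
    """Recursively expand a group code into the codes it ultimately contains.

    Codes with no entry in `contains` are returned as they are, which
    covers both countries and groups missing from this partial sample.
    """
    if code not in contains:
        return [code]
    leaves = []
    for child in contains[code]:
        leaves.extend(leaf_territories(child, contains))
    return leaves
```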

<mapTimezones>

The following data can be used to provide mappings between TZ IDs and other platforms. The purpose is to assist with migration and vetting.

<mapTimezones type="windows">
			<mapZone other="Dateline" type="Etc/GMT+12"/>
			<mapZone other="Samoa" type="Pacific/Midway"/>
			<mapZone other="Hawaiian" type="Pacific/Honolulu"/>
...

<alias>

This element provides information as to parts of locale IDs that should be substituted when accessing CLDR data. This logical substitution should be done to both the locale id, and to any lookup for display names of languages, territories, etc. As with the display names, the language type and replacement may be any prefix of a valid locale id, such as "no_NO".

<alias>
  <language type="in" replacement="id"/>
  <language type="sh" replacement="sr"/>
  <language type="sh_YU" replacement="sr_Latn_YU"/>
...
  <territory type="BU" replacement="MM"/>
...
</alias>
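
The substitution on locale IDs can be sketched in Python as follows. This is an illustration under assumptions the text does not settle: the function name is hypothetical, only the language entries from the example are included, and when several aliased prefixes match, the longest one is assumed to win. Territory replacements such as BU are not handled.

```python
# Language entries copied from the example above.
ALIASES = {"in": "id", "sh": "sr", "sh_YU": "sr_Latn_YU"}


def apply_alias(locale_id, aliases=ALIASES):
    """Replace the longest aliased prefix of a locale ID, if any."""
    fields = locale_id.replace("-", "_").split("_")
    # Try the whole ID first, then successively shorter prefixes.
    for end in range(len(fields), 0, -1):
        prefix = "_".join(fields[:end])
        if prefix in aliases:
            return "_".join([aliases[prefix]] + fields[end:])
    return locale_id
```
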
<!ELEMENT deprecated ( deprecatedItems* ) >
<!ATTLIST deprecated draft ( true | false ) #IMPLIED >

<!ELEMENT deprecatedItems EMPTY >
<!ATTLIST deprecatedItems draft ( true | false ) #IMPLIED >
<!ATTLIST deprecatedItems type ( standard | supplemental ) #IMPLIED >
<!ATTLIST deprecatedItems elements NMTOKENS #IMPLIED >
<!ATTLIST deprecatedItems attributes NMTOKENS #IMPLIED >
<!ATTLIST deprecatedItems values CDATA #IMPLIED >

The deprecated items can be used to indicate elements, attributes, and attribute values that are deprecated. This means that the items are valid, but that their usage is strongly discouraged. When the same deprecatedItems element contains combinations of elements, attributes, and values, then the "least significant" items are only deprecated if they occur with the "more significant" items. For example:

Deprecated Items
<deprecatedItems elements="A B"> | A and B are deprecated
<deprecatedItems attributes="C D"> | C and D are deprecated on all elements
<deprecatedItems elements="A B" attributes="C D"> | C and D are deprecated, but only if they occur on elements A or B.
<deprecatedItems elements="A B" attributes="C D" values="E"> | E is deprecated, but only if it is a value of C in an element A or B

In each case, multiple items are space-delimited.
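
One way to read the combination rule is that a usage is flagged only when it matches every facet that the deprecatedItems element specifies. The following Python sketch implements that reading; the function name and the dictionary representation of a deprecatedItems element are hypothetical.

```python
def is_deprecated_usage(item, element, attribute=None, value=None):
    """Check one usage against one deprecatedItems entry.

    `item` maps "elements", "attributes", and "values" to space-delimited
    strings; a facet that is absent places no constraint on the usage.
    """
    usage = {"elements": element, "attributes": attribute, "values": value}
    specified = {key: text.split() for key, text in item.items() if text}
    return bool(specified) and all(usage[key] in specified[key] for key in specified)


elements_only = {"elements": "A B"}
attributes_only = {"attributes": "C D"}
scoped_attributes = {"elements": "A B", "attributes": "C D"}
scoped_value = {"elements": "A B", "attributes": "C D", "values": "E"}
```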

C.2 <characters>

The characters element provides a way for non-Unicode systems, or systems that only support a subset of Unicode characters, to transform CLDR data. It gives a list of characters with alternative values that can be used if the main value is not available. For example:

<characters>
     <character-fallback>
	<character value = "ß">
		<substitute>ss</substitute>
	</character>
	<character value = "Ø">
		<substitute>Ö</substitute>
		<substitute>O</substitute>
	</character>
	<character value = "₧">
		<substitute>Pts</substitute>
	</character>
	<character value = "₣">
		<substitute>Fr.</substitute>
	</character>
     </character-fallback> 
</characters>

The ordering of the substitute elements indicates the preference among them.
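
Applying the fallbacks can be sketched in Python as follows. The function name is hypothetical, only two entries from the example are included, and the sketch assumes a substitute is usable only if every one of its characters is supported; unsupported characters with no usable substitute are left unchanged.

```python
# Two entries copied from the example above, substitutes in preference order.
FALLBACKS = {"ß": ["ss"], "Ø": ["Ö", "O"]}


def apply_fallback(text, supported, fallbacks=FALLBACKS):
    """Replace unsupported characters by their first fully supported substitute."""
    out = []
    for ch in text:
        if ch in supported:
            out.append(ch)
            continue
        for substitute in fallbacks.get(ch, []):
            if all(c in supported for c in substitute):
                out.append(substitute)
                break
        else:
            out.append(ch)  # no usable substitute: leave unchanged
    return "".join(out)


ASCII = {chr(code) for code in range(128)}
```

With an ASCII-only target, Ø falls through its first substitute Ö to the second, O; if the target also supports Ö, the first substitute is used.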

C.3 Calendar Data

<calendarData>
  <!-- gregorian is assumed, so these are all in addition -->
  <calendar type="japanese" territories="JP"/>
  <calendar type="islamic-civil" territories="AE BH DJ DZ EG EH ER IL IQ JO KM KW
     LB LY MA MR OM PS QA SA SD SY TD TN YE AF IR"/>
  ...

The common values provide a list of the calendars that are in common use, and thus should be shown in UIs that provide choice of calendars. (An 'Other...' button could give access to the other available calendars.)

<weekData>
  <minDays count="1" territories="001"/>
  <minDays count="4" territories="AT BE CA CH DE DK FI FR IT LI LT LU MC MT NL NO SE SK"/>
  <minDays count="4" territories="CD" draft="true"/>
  <firstDay day="mon" territories="001"/>
...

These values provide information on how a calendar is used in a particular territory. It may also be used in computing week boundaries for other purposes. The default is provided by the element with territories="001".

The minDays indicates the minimum number of days to count as the first week (of a month or year). The first day of the week is typically used for calendar presentation.
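
The fallback to the entry with territories="001" can be sketched in Python. The function name is hypothetical and the table holds only the non-draft minDays entries from the sample above.

```python
# (count, territories) pairs copied from the non-draft sample entries.
MIN_DAYS = [
    (1, "001"),
    (4, "AT BE CA CH DE DK FI FR IT LI LT LU MC MT NL NO SE SK"),
]


def min_days(territory, entries=MIN_DAYS):
    """Return minDays for a territory, falling back to the 001 default."""
    default = None
    for count, territories in entries:
        codes = territories.split()
        if territory in codes:
            return count
        if "001" in codes:
            default = count
    return default
```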

What is meant by the weekend varies from country to country. It is typically when most non-retail businesses are closed. The time should not be specified unless it is a well-recognized part of the day.

The weekendStart day defaults to "sat", and weekendEnd day defaults to "sun".

The weekendStart time defaults to "00:00:00" (midnight at the start of the day). The weekendEnd time defaults to "24:00:00" (midnight at the end of the day). (That is, Friday at 24:00:00 is the same time as Saturday at 00:00:00.) Thus the following are equivalent:

<weekendStart day="sat"/>
<weekendEnd day="sun"/>
<weekendStart day="sat" time="00:00"/>
<weekendEnd day="sun" time="24:00"/>
<weekendStart day="fri" time="24:00"/>
<weekendEnd day="mon" time="00:00"/>
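
The equivalence of the three pairs can be verified by converting each day and time to an offset within the week. The Python sketch below does this; the function names are hypothetical, and the choice of Sunday 00:00 as the origin is arbitrary.

```python
DAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY


def week_minute(day, time):
    """Minutes since Sunday 00:00, wrapped to a single week."""
    hours, minutes = (int(part) for part in time.split(":")[:2])
    total = DAYS.index(day) * MINUTES_PER_DAY + hours * 60 + minutes
    return total % MINUTES_PER_WEEK


def weekend(start_day="sat", start_time="00:00", end_day="sun", end_time="24:00"):
    """Return the (start, end) offsets, applying the defaults described above."""
    return week_minute(start_day, start_time), week_minute(end_day, end_time)
```

All three forms above produce the same pair of offsets, since 24:00 on one day and 00:00 on the next map to the same minute.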

The week information was formerly in the main LDML file.

C.4 Measurement System

<measurementData>
  <measurementSystem type="metric" territories="001"/>
  <measurementSystem type="US" territories="US"/>
  <paperSize type="A4" territories="001"/>
  <paperSize type="US-Letter" territories="US"/>
</measurementData>

The measurement system is the normal measurement system in common everyday use (except for date/time). The values are "metric" (= ISO 1000), "US", or "UK"; others may be added over time. The "US" value indicates the customary system of measurement with feet, inches, pints, quarts, etc. as used in the United States. The "UK" value indicates the customary system of measurement with feet, inches, pints, quarts, etc. as used in the United Kingdom. It is also called the Imperial system: the pint, quart, etc. are different sizes than in "US".

The paperSize element gives the height and width of paper used for normal business letters. The values are "A4" and "US-Letter".

The measurement information was formerly in the main LDML file, and had a somewhat different format.

Appendix D: Language and Locale IDs

People have very slippery notions of what distinguishes a language code vs. a locale code. The problem is that both are somewhat nebulous concepts.

In practice, many people use [RFC3066] codes to mean locale codes instead of strictly language codes. It is easy to see why this came about; because [RFC3066] includes an explicit region (territory) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives an [RFC3066] code, it will use it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably. Some people recommend distinguishing on the basis of "-" vs "_" (e.g. zh-TW for language code, zh_TW for locale code), but in practice that does not work because of the free variation out in the world in the use of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically one is forced to treat "-" and "_" as equivalent when interpreting either one on input.

Another reason for the conflation of these codes is that very little data in most systems is distinguished by region alone; currency codes and measurement systems being some of the few. Sometimes date or number formats are mentioned as regional, but that really doesn't make much sense. If people see the sentence "You will have to adjust the value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), they would say that sentence is simply not English. Number format is far more closely associated with language than it is with region. The same is true for date formats: people would never expect to see intermixed a date in the format "2003年4月1日" (using Kanji) in text purporting to be purely English. There are regional differences in date and number format — differences which can be important — but those are different in kind than other language differences between regions.

As far as we are concerned — as a completely practical matter — two languages are different if they require substantially different localized resources. Distinctions according to spoken form are important in some contexts, but the written form is by far and away the most important issue for data interchange. Unfortunately, this is not the principle used in [ISO639], which has the fairly unproductive notion (for data interchange) that only spoken language matters (it is also not completely consistent about this, however).

[RFC3066] can express a difference if the use of written languages happens to correspond to region boundaries expressed as [ISO3166] region codes, and has recently added codes that allow it to express some important cases that are not distinguished by [ISO3166] codes. These written languages include simplified and traditional Chinese (both used in Hong Kong S.A.R.); Serbian in Latin script; Azerbaijani in Arabic script, and so on.

Notice also that currency codes are different than currency localizations. The currency localizations should largely be in the language-based resource bundles, not in the territory-based resource bundles. Thus, the resource bundle en contains the localized mappings in English for a range of different currency codes: USD → US$, RUR → Rub, AUD → $A etc. Of course, some currency symbols are used for more than one currency, and in such cases specializations appear in the territory-based bundles. Continuing the example, en_US would have USD → $, while en_AU would have AUD → $. (In protocols, the currency codes should always accompany any currency amounts; otherwise the data is ambiguous, and software is forced to use the user's territory to guess at the currency. For some informal discussion of this, see JIT Localization.)
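
The split between language-based and territory-based bundles can be illustrated with a small lookup that falls back from the territory bundle to the language bundle. This is only a sketch: the BUNDLES dictionary and the currency_symbol function are hypothetical names, the data is limited to the examples in the paragraph above, and returning the bare currency code when nothing is found is an assumption.

```python
# Example data taken from the text above; a real system would load this
# from locale resource bundles.
BUNDLES = {
    "en": {"USD": "US$", "RUR": "Rub", "AUD": "$A"},
    "en_US": {"USD": "$"},
    "en_AU": {"AUD": "$"},
}


def currency_symbol(locale, code):
    """Look up a localized currency symbol, falling back en_AU -> en."""
    current = locale
    while current:
        symbol = BUNDLES.get(current, {}).get(code)
        if symbol is not None:
            return symbol
        # Drop the last subtag; "en" becomes "" and ends the loop.
        current = current.rpartition("_")[0]
    # Assumption for this sketch: fall back to the currency code itself.
    return code
```

For example, currency_symbol("en_US", "USD") gives "$", while currency_symbol("en_US", "AUD") falls back to the en bundle and gives "$A".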

D.1 Written Language

Criteria for what makes a written language should be purely pragmatic; what would copy-editors say? If one gave them text like the following, they would respond that is far from acceptable English for publication, and ask for it to be redone:

  A. "Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."

So one would change it to either B or C below, depending on which orthographic variant of English was the target for the publication:

  B. "Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
  C. "Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."

Clearly there are many acceptable variations on this text. For example, copy editors might still quibble with the use of first vs. last name sorting in the list, but clearly the first list was not in acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", one may leave it in the source orthography even if it differs from the publication target orthography. And so on. However, just as clearly, there are limits on what is acceptable English, and "2003年3月20日", for example, is not.

Note that the language of locale data may differ from the language of localized software or web sites, when those latter are not localized into the user's preferred language. In such cases, the kind of incongruous juxtapositions described above may well appear, but this situation is usually preferable to forcing unfamiliar date or number formats on the user as well.

Appendix E: Unicode Sets

A UnicodeSet is a set of Unicode characters (and possibly strings) determined by a pattern, following UTS #18: Unicode Regular Expressions [URegex], Level 1 and RL2.5, including the syntax where given. For an example of a concrete implementation of this, see [ICUUnicodeSet].

Patterns are a series of characters bounded by square brackets that contain lists of characters and Unicode property sets. Lists are a sequence of characters that may have ranges indicated by a '-' between two characters, as in "a-z". The sequence specifies the range of all characters from the left to the right, in Unicode order. For example, [a c d-f m] is equivalent to [a c d e f m]. Whitespace can be freely used for clarity, as [a c d-f m] means the same as [acd-fm].
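
As a rough illustration of the list syntax just described, the following sketch expands a flat list such as [a c d-f m] into its member characters. It is an assumption-laden simplification: the function name expand_list is hypothetical, and the sketch handles only plain characters, ranges, and whitespace (no property sets, escapes, quoting, nested sets, or strings).

```python
def expand_list(pattern):
    """Expand a flat list like "[a c d-f m]" into a set of characters."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("pattern must be bounded by square brackets")
    # Whitespace is insignificant, so drop it before scanning.
    body = [ch for ch in pattern[1:-1] if not ch.isspace()]
    result = set()
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range: all characters from left to right, in Unicode order.
            low, high = ord(body[i]), ord(body[i + 2])
            if low > high:
                raise ValueError("range end precedes range start")
            result.update(chr(cp) for cp in range(low, high + 1))
            i += 3
        else:
            result.add(body[i])
            i += 1
    return result
```

Both expand_list("[a c d-f m]") and expand_list("[acd-fm]") produce the set containing a, c, d, e, f, and m.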

Unicode property sets are specified by any Unicode property and a value of that property, such as [:General_Category=Letter:]. The property names are defined by the PropertyAliases.txt file and the property values by the PropertyValueAliases.txt file. For more information, see [UCD]. The syntax for specifying the property sets is an extension of either POSIX or Perl syntax, by the addition of "=<value>". For example, you can match letters by using the POSIX-style syntax:

[:General_Category=Letter:]

or by using the Perl-style syntax

\p{General_Category=Letter}.

Property names and values are case-insensitive, and whitespace, "-", and "_" are ignored. The property name can be omitted for the Category and Script properties, but is required for other properties. If the property value is omitted, it is assumed to represent a boolean property with the value "true". Thus [:Letter:] is equivalent to [:General_Category=Letter:], and [:Wh-ite-s pa_ce:] is equivalent to [:Whitespace=true:].
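
The loose matching and defaulting rules above can be sketched as follows. The value tables here are hypothetical and deliberately tiny (a real implementation would load them from PropertyValueAliases.txt), and the function names are illustrative only. The sketch returns normalized (property, value) keys and does not resolve aliases.

```python
import re

# Hypothetical, deliberately incomplete value tables for this sketch.
GENERAL_CATEGORY_VALUES = {"letter", "l", "uppercaseletter", "lu"}
SCRIPT_VALUES = {"latin", "latn", "greek", "grek"}


def loose_key(name):
    """Lowercase and strip whitespace, '-' and '_' for loose matching."""
    return re.sub(r"[\s_-]", "", name).lower()


def parse_property(expr):
    """Parse the text between [: and :] into a (property, value) pair."""
    if "=" in expr:
        prop, value = expr.split("=", 1)
        return loose_key(prop), loose_key(value)
    key = loose_key(expr)
    # The property name may be omitted for Category and Script.
    if key in GENERAL_CATEGORY_VALUES:
        return "generalcategory", key
    if key in SCRIPT_VALUES:
        return "script", key
    # Otherwise an omitted value means a boolean property set to true.
    return key, "true"
```

Under these assumptions, parse_property("Letter") and parse_property("General_Category=Letter") yield the same pair, and parse_property("Wh-ite-s pa_ce") yields ("whitespace", "true").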

The table below shows the two kinds of syntax: POSIX and Perl style. Also, the table shows the "Negative", which is a property that excludes all characters of a given kind. For example, [:^Letter:] matches all characters that are not [:Letter:].

  Positive  Negative 
POSIX-style Syntax  [:type=value:]  [:^type=value:] 
Perl-style Syntax  \p{type=value}  \P{type=value} 

These low-level lists or properties can then be freely combined with the normal set operations (union, inverse, difference, and intersection).

The binary operators '&', '-', and the implicit union have equal precedence and bind left-to-right. Thus [[:letter:]-[a-z]-[\u0100-\u01FF]] is equal to [[[:letter:]-[a-z]]-[\u0100-\u01FF]]. Another example is the set [[ace][bdf] - [abc][def]], which is not the empty set, but instead equal to [[[[ace] [bdf]] - [abc]] [def]], which equals [[[abcdef] - [abc]] [def]], which equals [[def] [def]], which equals [def].
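
The equal-precedence, left-to-right rule can be checked with ordinary Python sets. In this sketch the operands are already-parsed sets, the implicit union is written explicitly as "|" (a notation chosen only for this illustration), and the function name evaluate is hypothetical.

```python
def evaluate(sequence):
    """Evaluate [set, op, set, op, set, ...] strictly left to right.

    op is '&' (intersection), '-' (difference), or '|', which stands in
    here for the implicit union between adjacent sets.
    """
    items = iter(sequence)
    result = set(next(items))
    for op in items:
        operand = set(next(items))
        if op == "&":
            result &= operand
        elif op == "-":
            result -= operand
        elif op == "|":
            result |= operand
        else:
            raise ValueError("unknown operator: " + repr(op))
    return result
```

Evaluating ["ace", "|", "bdf", "-", "abc", "|", "def"] reproduces the example above: the result is the set containing d, e, and f, not the empty set.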

One caution: the '&' and '-' operators operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern [[:Lu:]-A] is illegal, since it is interpreted as the set [:Lu:] followed by the incomplete range -A. To specify the set of uppercase letters except for 'A', enclose the 'A' in a set: [[:Lu:]-[A]].

A multi-character string can be in a Unicode set, to represent a tailored grapheme cluster for a particular language. The syntax uses curly braces for that case.

In Unicode Sets, there are two ways to quote syntax characters and whitespace:

E.1 Single Quote

Two single quotes represent a single quote, either inside or outside single quotes. Text within single quotes is not interpreted in any way (except for two adjacent single quotes). It is taken as literal text (special characters become non-special).

E.2 Backslash Escapes

Outside of single quotes, certain backslashed characters have special meaning:

\uhhhh  Exactly 4 hex digits; h in [0-9A-Fa-f] 
\Uhhhhhhhh  Exactly 8 hex digits 
\xhh  1-2 hex digits 
\ooo  1-3 octal digits; o in [0-7]  
\a  U+0007 (BELL) 
\b  U+0008 (BACKSPACE) 
\t  U+0009 (HORIZONTAL TAB) 
\n  U+000A (LINE FEED) 
\v  U+000B (VERTICAL TAB) 
\f  U+000C (FORM FEED) 
\r  U+000D (CARRIAGE RETURN) 
\\  U+005C (BACKSLASH) 
\N{name} The Unicode character named "name".

Anything else following a backslash is mapped to itself, except in an environment where it is defined to have some special meaning. For example, \p{uppercase} is the set of uppercase letters in Unicode.

Any character formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \u and \U escapes create literal characters. (In contrast, Java treats Unicode escapes as just a way to represent arbitrary characters in an ASCII source file, and any resulting characters are not tagged as literals.)
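
A minimal sketch of the backslash escapes in the table above follows. It is not a full UnicodeSet parser: single-quote quoting is ignored, the property escapes \p and \P are left untouched for a later stage, and an incomplete \u, \U, or \x sequence simply falls through to the "maps to itself" rule, which is a simplification. The names unescape, SIMPLE, and ESCAPE are hypothetical.

```python
import re
import unicodedata

# Single-letter escapes from the table above.
SIMPLE = {"a": "\a", "b": "\b", "t": "\t", "n": "\n",
          "v": "\v", "f": "\f", "r": "\r", "\\": "\\"}

ESCAPE = re.compile(
    r"\\(?:u([0-9A-Fa-f]{4})"      # \uhhhh
    r"|U([0-9A-Fa-f]{8})"          # \Uhhhhhhhh
    r"|x([0-9A-Fa-f]{1,2})"        # \xhh
    r"|([0-7]{1,3})"               # \ooo
    r"|N\{([^}]+)\}"               # \N{name}
    r"|(.))",                      # anything else
    re.DOTALL,
)


def unescape(text):
    """Replace backslash escapes with the characters they denote."""
    def replace(match):
        u4, u8, hex2, octal, name, other = match.groups()
        if u4:
            return chr(int(u4, 16))
        if u8:
            return chr(int(u8, 16))
        if hex2:
            return chr(int(hex2, 16))
        if octal:
            return chr(int(octal, 8))
        if name:
            return unicodedata.lookup(name)
        if other in "pP":
            # Property escapes have a special meaning; leave them alone.
            return match.group(0)
        # Known single-letter escape, or the character maps to itself.
        return SIMPLE.get(other, other)
    return ESCAPE.sub(replace, text)
```

A complete implementation would additionally record that each resulting character is a literal, as the paragraph above requires; this sketch only performs the character substitution.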

The following table summarizes the syntax that can be used.

Example Description
[a]  The set containing 'a' alone 
[a-z]  The set containing 'a' through 'z' and all letters in between, in Unicode order.
Thus it is the same as [\u0061-\u007A].
[^a-z]  The set containing all characters but 'a' through 'z'.
Thus it is the same as [\u0000-\u0060 \u007B-\U0010FFFF].
[[pat1][pat2]]  The union of sets specified by pat1 and pat2 
[[pat1]&[pat2]]  The intersection of sets specified by pat1 and pat2 
[[pat1]-[pat2]]  The asymmetric difference of sets specified by pat1 and pat2 
[a {ab} {ac}] The character 'a' and the multi-character strings "ab" and "ac"
[:Lu:]  The set of characters with a given property value, as defined by PropertyValueAliases.txt. In this case, these are the Unicode uppercase letters. The long form for this is [:General_Category=Uppercase_Letter:]
[:L:]  The set of characters belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:General_Category=Letter:]

Appendix F: Date Format Patterns

A date pattern is a string of characters in which specific strings of characters are replaced with date and time data from a calendar when formatting, or used to generate data for a calendar when parsing. The Date Field Symbol Table below lists the characters used in patterns to show the appropriate formats for a given locale. The following are examples:

Pattern  Result (in a particular locale)
yyyy.MM.dd G 'at' HH:mm:ss zzz  1996.07.10 AD at 15:08:56 PDT
EEE, MMM d, ''yy  Wed, July 10, '96
h:mm a  12:08 PM
hh 'o''clock' a, zzzz  12 o'clock PM, Pacific Daylight Time
K:mm a, z  0:00 PM, PST
yyyyy.MMMM.dd GGG hh:mm aaa  01996.July.10 AD 12:08 PM

Characters may be used multiple times. For example, if y is used for the year, 'yy' might produce '99', whereas 'yyyy' produces '1999'. For most numerical fields, the number of characters specifies the field width. For example, if h is the hour, 'h' might produce '5', but 'hh' produces '05'. For some characters, the count specifies whether an abbreviated or full form should be used, but may have other choices, as given below.

Two single quotes represent a literal single quote, either inside or outside single quotes. Text within single quotes is not interpreted in any way (except for two adjacent single quotes). Otherwise all ASCII letters from a to z and A to Z are reserved as syntax characters, and require quoting if they are to represent literal characters. In addition, certain ASCII punctuation characters may become variable in the future (e.g. ":" being interpreted as the time separator and '/' as a date separator, and replaced by respective locale-sensitive characters in display).
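
The quoting and field rules above can be sketched as a small tokenizer that splits a pattern into fields (runs of the same ASCII letter) and literal text. The function name tokenize and the tuple representation are assumptions made for this illustration; the sketch does not validate the field letters or their counts.

```python
def tokenize(pattern):
    """Split a date pattern into ('field', run) and ('literal', text) items."""
    tokens = []
    literal = []
    in_quotes = False

    def flush():
        # Emit any pending literal text as a single token.
        if literal:
            tokens.append(("literal", "".join(literal)))
            literal.clear()

    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "'":
            if i + 1 < len(pattern) and pattern[i + 1] == "'":
                # Two adjacent single quotes are one literal quote.
                literal.append("'")
                i += 2
            else:
                in_quotes = not in_quotes
                i += 1
        elif not in_quotes and ch.isascii() and ch.isalpha():
            # Unquoted ASCII letters are syntax: collect the repeated run.
            flush()
            j = i
            while j < len(pattern) and pattern[j] == ch:
                j += 1
            tokens.append(("field", pattern[i:j]))
            i = j
        else:
            literal.append(ch)
            i += 1
    flush()
    return tokens
```

For the pattern hh 'o''clock' a, zzzz this yields the field hh, the literal " o'clock ", the field a, the literal ", ", and the field zzzz.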

Note: the counter-intuitive use of 5 letters for the narrow form of weekdays and months is forced by backwards compatibility.

Date Field Symbol Table
Field Sym. No. Example Description
era G 1..3 AD Era - Replaced with the Era string for the current date. One to three letters for the abbreviated form, four letters for the long form, five for the narrow form.
4 Anno Domini
5 A
year y 1..n 1996 Year. Normally the length specifies the padding, but for two letters it also specifies the maximum length. Example:
Year y yy yyy yyyy yyyyy
AD 1 1 01 001 0001 00001
AD 12 12 12 012 0012 00012
AD 123 123 23 123 0123 00123
AD 1234 1234 34 1234 1234 01234
AD 12345 12345 45 12345 12345 12345
Y 1..n 1997 Year (of "Week of Year"), used in ISO year-week calendar. May differ from calendar year.
u 1..n 4601 Extended year. This is a single number designating the year of this calendar system, encompassing all supra-year fields. For example, for the Julian calendar system, year numbers are positive, with an era of BCE or CE. An extended year value for the Julian calendar system assigns positive values to CE years and negative values to BCE years, with 1 BCE being year 0.
quarter Q 1..2 02 Quarter - Use one or two for the numerical quarter, three for the abbreviation, or four for the full name.
3 Q2
4 2nd quarter
q 1..2 02 Stand-Alone Quarter - Use one or two for the numerical quarter, three for the abbreviation, or four for the full name.
3 Q2
4 2nd quarter
month M 1..2 09 Month - Use one or two for the numerical month, three for the abbreviation, or four for the full name, or five for the narrow name.
3 Sept
4 September
5 S
L 1..2 09 Stand-Alone Month - Use one or two for the numerical month, three for the abbreviation, or four for the full name, or 5 for the narrow name.
3 Sept
4 September
5 S
week w 1..2 27 Week of Year.
W 1 3 Week of Month
day d 1..2 1 Date - Day of the month
D 1..3 345 Day of year
F 1 2 Day of Week in Month. The example is for the 2nd Wed in July
g 1..n 2451334 Modified Julian day. This is different from the conventional Julian day number in two regards. First, it demarcates days at local zone midnight, rather than noon GMT. Second, it is a local number; that is, it depends on the local time zone. It can be thought of as a single number that encompasses all the date-related fields.
week day E 1..3 Tues Day of week - Use one through three letters for the short day, or four for the full name, or five for the narrow name.
4 Tuesday
5 T
e 1..2 2 Local day of week. Same as E except adds a numeric value that will depend on the local starting day of the week, using one or two letters. For this example, Monday is the first day of the week.
3 Tues
4 Tuesday
5 T
c 1 2 Stand-Alone local day of week - Use one letter for the local numeric value (same as 'e'), three for the short day, or four for the full name, or five for the narrow name.
3 Tues
4 Tuesday
5 T
period a 1 AM AM or PM
hour h 1..2 11 Hour [1-12].
H 1..2 13 Hour [0-23].
K 1..2 0 Hour [0-11].
k 1..2 24 Hour [1-24].
minute m 1..2 59 Minute. Use one or two for zero padding.
second s 1..2 12 Second. Use one or two for zero padding.
S 1..n 3457 Fractional Second - rounds to the count of letters. (example is for 12.34567)
A 1..n 69540000 Milliseconds in day. This field behaves exactly like a composite of all time-related fields, not including the zone fields. As such, it also reflects discontinuities of those fields on DST transition days. On a day of DST onset, it will jump forward. On a day of DST cessation, it will jump backward. This reflects the fact that it must be combined with the offset field to obtain a unique local time value.
zone z 1..3 PDT Timezone - Use one to three letters for the short timezone or four for the full name. For more information, see Appendix J: Time Zone Display Names
4 Pacific Daylight Time
Z 1..3 -0800 Use one to three letters for RFC 822, four letters for GMT format.
4 GMT-08:00
v 1 PT Use one letter for short wall (generic) time, four for long wall time. For more information, see Appendix J: Time Zone Display Names
4 Pacific Time

All non-letter characters represent themselves in a pattern, except for the single quote. It is used to 'escape' letters. Two single quotes in a row, whether inside or outside a quoted sequence, represent a 'real' single quote.
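
The year rows of the Date Field Symbol Table (padding by letter count, with two letters also truncating to the last two digits) can be reproduced with a few lines. This sketch covers only the 'y' field for positive years, as in the table; the function name format_year is hypothetical.

```python
def format_year(year, count):
    """Format a positive year for a run of `count` 'y' letters."""
    digits = str(year)
    if count == 2:
        # Two letters also set the maximum length: keep the last two digits.
        return digits[-2:].zfill(2)
    # Otherwise the count only specifies the minimum (zero-padded) width.
    return digits.zfill(count)
```

For instance, format_year(12345, 2) gives "45" and format_year(123, 4) gives "0123", matching the corresponding table cells.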

F.1 Localized Pattern Characters (deprecated)

These are characters that can be used when displaying a date pattern to an end user. This can occur, for example, when a spreadsheet allows users to specify date patterns. Whatever is in the string is substituted one-for-one with the characters "GyMdkHmsSEDFwWahKzYe", with the above meanings. Thus, for example, if "J" is to be used instead of "Y" to mean Year, then the string would be: "GyMdkHmsSEDFwWahKzJe".
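
The one-for-one substitution can be sketched as below. The names CANONICAL and localize_pattern are hypothetical, and leaving quoted literal text unchanged is an assumption based on the quoting rules for date patterns rather than something stated for this element.

```python
CANONICAL = "GyMdkHmsSEDFwWahKzYe"


def localize_pattern(pattern, localized_chars):
    """Replace canonical pattern letters by their localized counterparts."""
    if len(localized_chars) != len(CANONICAL):
        raise ValueError("localized string must have the same length")
    table = dict(zip(CANONICAL, localized_chars))
    out = []
    in_quotes = False
    for ch in pattern:
        if ch == "'":
            # A doubled quote toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes:
            out.append(ch)  # quoted text is literal
        else:
            out.append(table.get(ch, ch))
    return "".join(out)
```

With the localized string "GyMdkHmsSEDFwWahKzJe" from the example above, the pattern YYYY 'Y' ww would be displayed as JJJJ 'Y' ww.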

This element is deprecated. It is recommended instead that a more sophisticated UI be used for localization, such as using icons to represent the different formats (and lengths) in the Date Field Symbol Table.

F.2 AM / PM

Even for countries where the customary date format only has a 24 hour format, both the am and pm localized strings must be present and must be distinct from one another. Note that as long as the 24 hour format is used, these strings will normally never be used, but for testing and unusual circumstances they must be present.

F.3 Eras

There are only two values for an era in a Gregorian calendar, "BC" and "AD". These values can be translated into other languages, like "a.C." and "d.C." for Spanish, but there are no other eras in the Gregorian calendar. Other calendars have different numbers of eras. Care should be taken when translating the era names for a specific calendar.

F.4 Week of Year

Values calculated for the Week of Year field range from 1 to 53 for the Gregorian calendar (they may have different ranges for other calendars). Week 1 for a year is the first week that contains at least the specified minimum number of days from that year. Weeks between week 1 of one year and week 1 of the following year are numbered sequentially from 2 to 52 or 53 (if needed). For example, January 1, 1998 was a Thursday. If the first day of the week is MONDAY and the minimum days in a week is 4 (these are the values reflecting ISO 8601 and many national standards), then week 1 of 1998 starts on December 29, 1997, and ends on January 4, 1998. However, if the first day of the week is SUNDAY, then week 1 of 1998 starts on January 4, 1998, and ends on January 10, 1998. The first three days of 1998 are then part of week 53 of 1997.

Values are similarly calculated for the Week of Month.
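
The Week of Year rule above can be sketched for the Gregorian calendar as follows. The function names are hypothetical, and the day numbering (Monday = 0 through Sunday = 6) follows Python's date.weekday() rather than anything in LDML. The second function returns the year to which the week belongs together with the week number, since the first days of January can fall in the last week of the previous year.

```python
from datetime import date, timedelta

MONDAY, SUNDAY = 0, 6  # date.weekday() numbering


def week1_start(year, first_day, min_days):
    """Return the date on which week 1 of `year` starts."""
    jan1 = date(year, 1, 1)
    # Days between the start of the week containing January 1 and January 1.
    offset = (jan1.weekday() - first_day) % 7
    start = jan1 - timedelta(days=offset)
    # Number of days of that week that fall in `year`.
    if 7 - offset < min_days:
        start += timedelta(days=7)
    return start


def week_of_year(d, first_day, min_days):
    """Return (week_year, week_number) for the date d."""
    year = d.year
    start = week1_start(year, first_day, min_days)
    if d < start:
        # Before week 1: the date belongs to the last week of the prior year.
        year -= 1
        start = week1_start(year, first_day, min_days)
    else:
        next_start = week1_start(year + 1, first_day, min_days)
        if d >= next_start:
            # Late December days may already be in week 1 of the next year.
            year, start = year + 1, next_start
    return year, (d - start).days // 7 + 1
```

With first_day = MONDAY and min_days = 4, week 1 of 1998 starts on December 29, 1997; with first_day = SUNDAY it starts on January 4, 1998, and January 1, 1998 falls in week 53 of 1997, as in the example above.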

F.5 Week Elements

firstDay
Indicates which day of the week is considered the 'first' day, for calendar purposes. Because the ordering of days may vary between calendars, keywords are used for this value, such as sun, mon,... These values will be replaced by the localized name when they are actually used.
minDays (Minimal Days in First Week)
Minimal days required in the first week of a month or year. For example, if the first week is defined as one that contains at least one day, this value will be 1. If it must contain a full seven days before it counts as the first week, then the value would be 7.
weekendStart, weekendEnd
Indicates the day and time that the weekend starts or ends. As with firstDay, keywords are used instead of numbers.

Appendix G: Number Format Patterns

G.1 Number Patterns

The NumberElements resource affects how these patterns are interpreted in a localized context. Here are some examples, based on the French locale. The "." shows where the decimal point should go. The "," shows where the thousands separator should go. A "0" indicates zero-padding: if the number is too short, a zero (in the locale's numeric set) will go there. A "#" indicates no padding: if the number is too short, nothing goes there. A "¤" shows where the currency sign will go. The following illustrates the effects of different patterns for the French locale, with the number "1234.567". Notice how the pattern characters ',' and '.' are replaced by the characters appropriate for the locale.

Pattern Currency Text
#,##0.## n/a 1 234,57
#,##0.### n/a 1 234,567
###0.##### n/a 1234,567
###0.0000# n/a 1234,5670
00000.0000 n/a 01234,5670
#,##0.00 ¤ EUR 1 234,57 €
JPY 1 235 ¥

The number of # placeholder characters before the decimal does not matter, since no limit is placed on the maximum number of digits. There should, however, be at least one zero someplace in the pattern. In currency formats, the number of digits after the decimal also does not matter, since the information in the supplemental data (see Appendix C: Supplemental Data) is used to override the number of decimal places (and the rounding) according to the currency that is being formatted. That can be seen in the above chart, with the difference between Yen and Euro formatting.

G.2 Special Pattern Characters

Many characters in a pattern are taken literally; they are matched during parsing and output unchanged during formatting. Special characters, on the other hand, stand for other characters, strings, or classes of characters. For example, the '#' character is replaced by a localized digit. Often the replacement character is the same as the pattern character; in the U.S. locale, the ',' grouping character is replaced by ','. However, the replacement is still happening, and if the symbols are modified, the grouping character changes. Some special characters affect the behavior of the formatter by their presence; for example, if the percent character is seen, then the value is multiplied by 100 before being displayed.

To insert a special character in a pattern as a literal, that is, without any special meaning, the character must be quoted. There are some exceptions to this which are noted below.

Symbol Location Localized? Meaning
0 Number Yes Digit
1-9 Number Yes '1' through '9' indicate rounding.
@ Number No Significant digit
# Number Yes Digit, zero shows as absent
. Number Yes Decimal separator or monetary decimal separator
- Number Yes Minus sign
, Number Yes Grouping separator
E Number Yes Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
+ Exponent Yes Prefix positive exponents with localized plus sign. Need not be quoted in prefix or suffix.
; Subpattern boundary Yes Separates positive and negative subpatterns
% Prefix or suffix Yes Multiply by 100 and show as percentage
‰ (\u2030) Prefix or suffix Yes Multiply by 1000 and show as per mille
¤ (\u00A4) Prefix or suffix No Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If tripled, uses the long form of the decimal symbol. If present in a pattern, the monetary decimal separator and grouping separators (if available) are used instead of the numeric ones.
' Prefix or suffix No Used to quote special characters in a prefix or suffix, for example, "'#'#" formats 123 to "#123". To create a single quote itself, use two in a row: "# o''clock".
* Prefix or suffix boundary Yes Pad escape, precedes pad character

A pattern contains a positive and may contain a negative subpattern, for example, "#,##0.00;(#,##0.00)". Each subpattern has a prefix, a numeric part, and a suffix. If there is no explicit negative subpattern, the negative subpattern is the localized minus sign prefixed to the positive subpattern. That is, "0.00" alone is equivalent to "0.00;-0.00". If there is an explicit negative subpattern, it serves only to specify the negative prefix and suffix; the number of digits, minimal digits, and other characteristics are ignored in the negative subpattern. That means that "#,##0.0#;(#)" has precisely the same result as "#,##0.0#;(#,##0.0#)".
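
The subpattern rules above can be sketched as follows. This is a simplified illustration: the names NUMERIC, parts, and effective_subpatterns are hypothetical, quoted special characters are not handled, and the numeric part is approximated as the span between the first and last digit, '#', '@', '.', or ',' character.

```python
# Characters treated as belonging to the numeric part in this sketch.
NUMERIC = set("#0123456789@.,")


def parts(subpattern):
    """Split a subpattern into (prefix, numeric part, suffix)."""
    positions = [i for i, ch in enumerate(subpattern) if ch in NUMERIC]
    start, end = positions[0], positions[-1] + 1
    return subpattern[:start], subpattern[start:end], subpattern[end:]


def effective_subpatterns(pattern):
    """Return the (positive, negative) subpatterns actually in effect."""
    positive, separator, negative = pattern.partition(";")
    if not separator:
        # No explicit negative subpattern: prefix the minus sign.
        return positive, "-" + positive
    # Only the negative prefix and suffix are used; the numeric part
    # always comes from the positive subpattern.
    negative_prefix, _, negative_suffix = parts(negative)
    _, numeric, _ = parts(positive)
    return positive, negative_prefix + numeric + negative_suffix
```

Under these assumptions, "0.00" yields ("0.00", "-0.00"), and "#,##0.0#;(#)" yields the same pair as "#,##0.0#;(#,##0.0#)".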

Note: The thousands separator and decimal separator in this pattern are always ',' and '.'. They are substituted by the code with the correct local values according to other fields in CLDR.

The prefixes, suffixes, and various symbols used for infinity, digits, thousands separators, decimal separators, etc. may be set to arbitrary values, and they will appear properly during formatting. However, care must be taken that the symbols and strings do not conflict, or parsing will be unreliable. For example, either the positive and negative prefixes or the suffixes must be distinct for any parser using this data to be able to distinguish positive from negative values. Another example is that the decimal separator and thousands separator should be distinct characters, or parsing will be impossible.

The grouping separator is a character that separates clusters of integer digits to make large numbers more legible. It is commonly used for thousands, but in some locales it separates ten-thousands. The grouping size is the number of digits between the grouping separators, such as 3 for "100,000,000" or 4 for "1 0000 0000". There are actually two different grouping sizes: One used for the least significant integer digits, the primary grouping size, and one used for all others, the secondary grouping size. In most locales these are the same, but sometimes they are different. For example, if the primary grouping interval is 3, and the secondary is 2, then this corresponds to the pattern "#,##,##0", and the number 123456789 is formatted as "12,34,56,789". If a pattern contains multiple grouping separators, the interval between the last one and the end of the integer defines the primary grouping size, and the interval between the last two defines the secondary grouping size. All others are ignored, so "#,##,###,####" == "###,###,####" == "##,#,###,####".
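
The derivation of the primary and secondary grouping sizes, and their application to a digit string, can be sketched as follows. The function names are hypothetical, and the sketch assumes a bare numeric pattern with no prefix, suffix, or quoting.

```python
def grouping_sizes(pattern):
    """Return (primary, secondary) grouping sizes for a numeric pattern."""
    integer = pattern.split(".")[0]
    groups = integer.split(",")
    if len(groups) == 1:
        return 0, 0  # no grouping separator in the pattern
    primary = len(groups[-1])
    # With a single separator, the secondary size equals the primary size.
    secondary = len(groups[-2]) if len(groups) > 2 else primary
    return primary, secondary


def apply_grouping(digits, primary, secondary, separator=","):
    """Insert grouping separators into a string of integer digits."""
    if primary <= 0 or len(digits) <= primary:
        return digits
    head, tail = digits[:-primary], digits[-primary:]
    chunks = [tail]
    while len(head) > secondary:
        chunks.append(head[-secondary:])
        head = head[:-secondary]
    chunks.append(head)
    return separator.join(reversed(chunks))
```

For the pattern "#,##,##0" the sizes are (3, 2), and applying them to 123456789 gives "12,34,56,789".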

When parsing using a number format, a more lenient parse should be used where possible. In particular, it should implement at least the following rules.

For more on parsing, see Lenient Parsing.

For consistency in the CLDR data, the following conventions should be observed so as to have a canonical representation:

G.3 Formatting

Formatting is guided by several parameters, all of which can be specified either using a pattern or using the API. The following description applies to formats that do not use scientific notation or significant digits.

Special Values

NaN is represented as a single character, typically U+FFFD REPLACEMENT CHARACTER (\uFFFD). This character is determined by the localized number symbols. This is the only value for which the prefixes and suffixes are not used.

Infinity is represented as a single character, typically ∞ (\u221E), with the positive or negative prefixes and suffixes applied. The infinity character is determined by the localized number symbols.

G.4 Scientific Notation

Numbers in scientific notation are expressed as the product of a mantissa and a power of ten, for example, 1234 can be expressed as 1.234 x 10^3. The mantissa is typically in the half-open interval [1.0, 10.0) or sometimes [0.0, 1.0), but it need not be. In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation. Example: "0.###E0" formats the number 1234 as "1.234E3".

G.5 Significant Digits

There are two ways of controlling how many digits are shown: (a) significant digits counts, or (b) integer and fraction digit counts. Integer and fraction digit counts are described above. When a formatter is using significant digits counts, the number of integer and fraction digits is not specified directly, and the formatter settings for these counts are ignored. Instead, the formatter uses however many integer and fraction digits are required to display the specified number of significant digits. Examples:

Pattern Minimum significant digits Maximum significant digits Number Output
@@@ 3 3 12345 12300
@@@ 3 3 0.12345 0.123
@@## 2 4 3.14159 3.142
@@## 2 4 1.23004 1.23
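
The table rows above can be reproduced with Python's decimal module. This is a sketch under stated assumptions: the function name format_significant is hypothetical, values are passed as strings to avoid binary floating-point artifacts, and the rounding mode (half-even) is an assumption, since this section does not specify one.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def format_significant(value, min_sig, max_sig):
    """Format `value` with between min_sig and max_sig significant digits."""
    number = Decimal(value)
    # Round to max_sig significant digits. The second pass handles a carry
    # into a new leading digit (for example 9.9996 becoming 10.00).
    for _ in range(2):
        quantum = Decimal(1).scaleb(number.adjusted() - max_sig + 1)
        number = number.quantize(quantum, rounding=ROUND_HALF_EVEN)
    text = format(number, "f")
    # Fraction digits that must be kept to show min_sig significant digits.
    min_fraction = max(0, min_sig - 1 - number.adjusted())
    if "." in text:
        integer, fraction = text.split(".")
        while len(fraction) > min_fraction and fraction.endswith("0"):
            fraction = fraction[:-1]
        text = integer + ("." + fraction if fraction else "")
    return text
```

For example, format_significant("1.23004", 2, 4) gives "1.23", and format_significant("12345", 3, 3) gives "12300".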

G.6 Padding

Patterns support padding the result to a specific width. In a pattern the pad escape character, followed by a single pad character, causes padding to be parsed and formatted. The pad escape character is '*'. For example, "$*x#,##0.00" formats 123 to "$xx123.00", and 1234 to "$1,234.00".

Rounding

Patterns support rounding to a specific increment. For example, 1230 rounded to the nearest 50 is 1250. Mathematically, rounding to specific increments is performed by dividing by the increment, rounding to an integer, then multiplying by the increment. To take a more bizarre example, 1.234 rounded to the nearest 0.65 is 1.3, as follows:

Original: 1.234
Divide by increment (0.65): 1.89846...
Round: 2
Multiply by increment (0.65): 1.3

To specify a rounding increment in a pattern, include the increment in the pattern itself. "#,#50" specifies a rounding increment of 50. "#,##0.05" specifies a rounding increment of 0.05.
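
The divide, round, multiply procedure shown in the worked example can be written directly with decimal arithmetic. The function name round_to_increment is hypothetical, and half-even rounding is an assumption; this section does not state which rounding mode applies to ties.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_to_increment(value, increment):
    """Round `value` to the nearest multiple of `increment`."""
    value, increment = Decimal(value), Decimal(increment)
    # Divide by the increment and round the quotient to an integer ...
    quotient = (value / increment).quantize(Decimal(1), rounding=ROUND_HALF_EVEN)
    # ... then multiply back by the increment.
    return quotient * increment
```

Here round_to_increment("1.234", "0.65") equals 1.3 and round_to_increment("1230", "50") equals 1250, matching the examples above.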

 
decimalFormats
The normal locale specific way to write a base 10 number.
currencyFormats
Use \u00A4 where the local currency symbol should be. Doubling the currency symbol (\u00A4\u00A4) will output the international currency symbol (a 3-letter code).
percentFormats
Pattern for use with percentage formatting
scientificFormats
Pattern for use with scientific (exponent) formatting.

G.7 Quoting rules

Single quotes, ('), enclose bits of the pattern that should be treated literally. Inside a quoted string, two single quotes ('') are replaced with a single one ('). For example: 'X '#' Q ' -> X 1939 Q (Literal strings underlined.)

G.8 Number Elements

Localized symbols used in number formatting and parsing.

decimal
- separates the integer and fractional part of the number.
group
- groups (for example) units of thousands: 10^6 = 1,000,000. The grouping separator is commonly used for thousands, but in some countries for ten-thousands. The interval is a constant number of digits between the grouping characters, such as 100,000,000 or 1,0000,0000. If you supply a pattern with multiple grouping characters, the interval between the last one and the end of the integer is the one that is used. So "#,##,###,####" == "######,####" == "##,####,####".
list
- separates lists of numbers
percentSign
- symbol used to indicate a percentage (1/100th) amount. (If present, the value is also multiplied by 100 before formatting. That way 1.23 → 123%)
nativeZeroDigit
- Symbol used to indicate a digit in the pattern, or zero if that place would otherwise be empty. For example, with the digit of '0', the pattern "000" would format "34" as "034", but the pattern "0" would format "34" as just "34". As well, the digits 1-9 are expected to follow the code point of this specified 0 value.
patternDigit
- Symbol used to indicate any digit value, typically #. When that digit is zero, then it is not shown.
minusSign
- Symbol used to denote negative value.
plusSign
- Symbol used to denote positive value.
exponential
- Symbol separating the mantissa and exponent values.
perMille
- symbol used to indicate a per-mille (1/1000th) amount. (If present, the value is also multiplied by 1000 before formatting. That way 1.23 → 1230 [1/000])
infinity
- The infinity sign. Corresponds to the IEEE infinity bit pattern.
nan
- The NaN (not a number) sign. Corresponds to the IEEE NaN bit pattern.
currencySeparator
This is used as the decimal separator in currency formatting/parsing, instead of the DecimalSeparator from the NumberElements list. This item is optional in the CLDR.
currencyGroup
This is used as the grouping separator in currency formatting/parsing, instead of the grouping separator (group) from the NumberElements list. This item is optional in the CLDR.

Appendix H: Choice Patterns

A choice pattern is a string that chooses among a number of strings, based on numeric value. It has the following form:

<choice_pattern> = <choice> ( '|' <choice> )*
<choice> = <number><relation><string>
<number> = ('+' | '-')? ('∞' | [0-9]+ ('.' [0-9]+)?)
<relation> = '<' | '≤'

The interpretation of a choice pattern is that given a number N, the pattern is scanned from right to left, for each choice evaluating <number> <relation> N. The first choice that matches results in the corresponding string. If no match is found, then the first string is used. For example:

Pattern N Result
0≤Rf|1≤Ru|1<Re -∞, -3, -1, -0.000001 Rf (defaulted to first string)
0, 0.01, 0.9999 Rf
1 Ru
1.00001, 5, 99, Re

Quoting is done using ' characters, as in date or number formats.
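
The right-to-left evaluation described above can be sketched as follows. The names CHOICE, parse_choices, and choose are hypothetical, and quoting with ' characters is not handled in this simplified version.

```python
import re

# <number><relation><string>, following the grammar above.
CHOICE = re.compile(r"([+-]?(?:∞|[0-9]+(?:\.[0-9]+)?))([<≤])(.*)", re.DOTALL)


def parse_choices(pattern):
    """Parse a choice pattern into (limit, relation, string) triples."""
    choices = []
    for part in pattern.split("|"):
        match = CHOICE.fullmatch(part)
        if match is None:
            raise ValueError("malformed choice: " + repr(part))
        number, relation, text = match.groups()
        choices.append((float(number.replace("∞", "inf")), relation, text))
    return choices


def choose(pattern, n):
    """Select the string for the number n."""
    choices = parse_choices(pattern)
    # Scan from right to left; the first matching choice wins.
    for limit, relation, text in reversed(choices):
        matched = limit < n if relation == "<" else limit <= n
        if matched:
            return text
    # No match: default to the first string.
    return choices[0][2]
```

With the pattern 0≤Rf|1≤Ru|1<Re, the value 1 selects Ru, the value 5 selects Re, and the value -3 defaults to Rf, as in the table above.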

Appendix I: Inheritance and Validity

The following describes in more detail how to determine the exact inheritance of elements, and the validity of a given element in LDML.

I.1 Definitions

Attributes that serve to distinguish multiple elements at the same level are called distinguishing attributes. These currently consist of the following:

Note: the type attribute on the following elements was not distinguishing: abbreviationFallback, default, mapping, measurementSystem, preferenceOrdering. That usage of the type attribute has been deprecated in favor of the choice attribute.

Blocking elements are those whose subelements do not inherit from parent locales. For example, a <collation> element is a blocking element: everything in a <collation> element is treated as a single lump of data, as far as inheritance is concerned.

Certain elements are called attribute-information elements. They do not have element content; their information is carried in their attribute values. This is unlike the other elements, whose attributes are used to distinguish different types of data.

A list of blocking and attribute-information elements is found in Appendix K: Valid Attribute Values.

For any element in an XML file, an element chain is a resolved XPath leading from the root to an element, with attributes on each element in alphabetical order. So in, say, http://unicode.org/cldr/data/common/main/el.xml we may have:

<ldml version="1.1">
  <identity>
    <version number="1.1" />
    <generation date="2004-06-04" />
    <language type="el" />
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="ar">Αραβικά</language>
...

Which gives the following element chains (among others):

An element chain A is an extension of an element chain B if B is equivalent to an initial portion of A. For example, #2 below is an extension of #1. (Equivalent, depending on the tree, may not be "identical to". See below for an example.)

  1. //ldml[@version="1.1"]/localeDisplayNames
  2. //ldml[@version="1.1"]/localeDisplayNames/languages/language[@type="ar"]

An LDML file can be thought of as an ordered list of element pairs: <element chain, data>, where the element chains are all the chains for the end-nodes. (This works because of restrictions on the structure of LDML, including that it doesn't allow mixed content.) The ordering is the ordering that the element chains are found in the file, and thus determined by the DTD.

For example, some of those pairs would be the following. Notice that the first has the null string as element contents.

Note: There are two exceptions to this:

  1. Blocking nodes and their contents are treated as a single end node.
  2. For attribute-information elements, in terms of computing inheritance, the element pair consists of the element chain minus the attributes in the final element and the value is the list of attributes for that final element.

Thus instead of the element pair being #1 below, it is #2:

  1. <//ldml[@version="1.1"]/dates/calendars/calendar[@type='gregorian']/week/weekendStart[@day='sun'][@time='00:00'],
    "">
  2. <//ldml[@version="1.1"]/dates/calendars/calendar[@type='gregorian']/week/weekendStart,
    [@day='sun'][@time='00:00']>

Two LDML element chains are equivalent when they would be identical if all attributes and their values were removed except for distinguishing attributes. Thus the following are equivalent:

For any locale ID, a locale chain is an ordered list starting with the root and leading down to the ID. For example:

<root, de, de_DE, de_DE_xxx>
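
As an illustration, a locale chain can be computed by repeatedly taking prefixes of the locale ID. The sketch below assumes that the fields of the ID are separated by underscores and that each parent is obtained by simple truncation; the function name locale_chain is hypothetical.

```python
def locale_chain(locale_id):
    """Return the locale chain for locale_id, from root down to the ID itself."""
    if locale_id == "root":
        return ["root"]
    fields = locale_id.split("_")
    # Each prefix of the underscore-separated fields is one step in the chain.
    return ["root"] + ["_".join(fields[:n]) for n in range(1, len(fields) + 1)]

print(locale_chain("de_DE_xxx"))  # ['root', 'de', 'de_DE', 'de_DE_xxx']
```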

I.2 Resolved Data File

To produce a fully resolved locale data file from CLDR for a locale ID L, you start with L, and successively add unique items from the parent locales until you get up to root. More formally, this can be expressed as the following procedure.

  1. Let Result be initially empty.
  2. For each Li in the locale chain for L, starting at L and going up to root:
    1. Let Temp be a copy of the pairs in the LDML file for Li
    2. Replace each alias in Temp by the list of pairs it points to.
      1. That alias now blocks any inheritance from the parent. (See Section 5.1 Common Elements for an example.)
    3. For each element pair P in Temp:
      1. If P does not contain a blocking element, and Result does not have an element pair Q with an equivalent element chain, add P to Result.

Note: when adding an element pair to a result, it has to go in the right order for it to be valid according to the DTD.
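
The core of this procedure can be sketched as follows. This is a simplified illustration under several assumptions: each file is modeled as an ordered list of (element chain, data) pairs whose chains have already been reduced to their distinguishing attributes (so equivalence is plain string equality), aliases are assumed to be already expanded, and blocking elements and DTD ordering are not modeled. The names resolve, locale_chain_up, and SAMPLE_FILES are hypothetical.

```python
def locale_chain_up(locale_id):
    """Locale chain walked from the locale itself up to root."""
    fields = [] if locale_id == "root" else locale_id.split("_")
    return ["_".join(fields[:n]) for n in range(len(fields), 0, -1)] + ["root"]

def resolve(locale_id, files):
    """Merge element pairs from locale_id and its parents; the most specific wins."""
    result = {}
    for loc in locale_chain_up(locale_id):
        temp = list(files.get(loc, []))  # copy of the pairs for this locale
        for chain, data in temp:
            # Only add a pair if no equivalent element chain is present yet.
            if chain not in result:
                result[chain] = data
    return result

# Made-up sample data, purely for illustration.
SAMPLE_FILES = {
    "root": [("//ldml/a", "root-a"), ("//ldml/b", "root-b")],
    "de": [("//ldml/a", "de-a")],
    "de_DE": [("//ldml/c", "de_DE-c")],
}
```

Here resolve("de_DE", SAMPLE_FILES) takes //ldml/c from de_DE, //ldml/a from de, and //ldml/b from root.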

I.3 Valid Data

The attribute draft="unconfirmed" or draft="provisional" in LDML means that the data has not been approved for release. However, some data that is not explicitly marked as unconfirmed or provisional may be implicitly unconfirmed or provisional, either because it inherits it from a parent, or from an enclosing element.

Example 2. Suppose that new locale data is added for af (Afrikaans). To indicate that all of the data is unconfirmed, the attribute can be added to the top level.

<ldml version="1.1" draft="unconfirmed">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="af" />
 </identity>
 <characters>...</characters>
 <localeDisplayNames>...</localeDisplayNames>
</ldml>

Any data can be added to that file, and the status will all be draft="unconfirmed". Once an item is vetted -- whether it is inherited or explicitly in the file -- then its status can be changed to approved. This can be done by leaving draft="unconfirmed" on the enclosing element and marking the child with draft="approved", such as:

<ldml version="1.1" draft="unconfirmed">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="af" />
 </identity>
 <characters draft="approved">...</characters>
 <localeDisplayNames>...</localeDisplayNames>
 <dates/>
 <numbers/>
 <collations/>
</ldml>

However, normally the draft status should be canonicalized, which means it is pushed down to leaf nodes: see Appendix L: Canonical Form.

Note: A missing draft attribute is not the same as either a true or false value. A missing attribute means instead: inherit the draft status from enclosing elements and parent locales.

The attribute validSubLocales allows sublocales in a given tree to be treated as though a file for them were present when there isn't one. It can be applied to any element. It only has an effect for locales that inherit from the current file where a file is missing, and the elements wouldn't otherwise be draft.

Example 1. Suppose that in a particular LDML tree, there are no region locales for German, e.g. there is a de.xml file, but no files for de_AT.xml, de_CH.xml, or de_DE.xml. Then no elements are valid for any of those region locales. If we want to mark one of those files as having valid elements, then we introduce an empty file, such as the following.

<ldml version="1.1">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="de" />
  <territory type="AT" />
 </identity>
</ldml>

With the validSubLocales attribute, instead of adding the empty files for de_AT.xml, de_CH.xml, and de_DE.xml, in the de file we can add to the parent locale a list of the child locales that should behave as if files were present.

<ldml version="1.1" validSubLocales="de_AT de_CH de_DE">
 <identity>
  <version number="1.1" />
  <generation date="2004-06-04" />
  <language type="de" />
 </identity>
...
</ldml>

More formally, here is how to determine whether data for an element chain E is implicitly or explicitly draft, given a locale L. Items 1, 2, and 4 are simply formalizations of what is in LDML already. Item 3 adds the new element.

I.4 Checking for Draft Status:

  1. Parent Locale Inheritance
    1. Walk through the locale chain until you find a locale ID L' with a data file D. (L' may equal L).
    2. Produce the fully resolved data file D' for D.
    3. In D', find the first element pair whose element chain E' is either equivalent to or an extension of E.
    4. If there is no such E', return true
    5. If E' is not equivalent to E, truncate E' to the length of E.
  2. Enclosing Element Inheritance
    1. Walk through the elements in E', from back to front.
      1. If you ever encounter draft=x, return x
    2. If L' = L, return false
  3. Missing File Inheritance
    1. Otherwise, walk again through the elements in E', from back to front.
      1. If you encounter a validSubLocales attribute:
        1. If L is in the attribute value, return false
        2. Otherwise return true
  4. Otherwise
    1. Return true

The validSubLocales in the most specific (farthest from root file) locale file "wins" through the full resolution step (data from more specific files replacing data from less specific ones).
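
The check in I.4 can be sketched as below. This is an illustration under stated assumptions, not a definitive implementation: an element chain is modeled as a list of (element name, attributes) pairs, chain equivalence is reduced to comparing element names, the fully resolved data for each available locale is supplied precomputed, and a draft value of "approved" is treated as "not draft" while any other draft value counts as draft. The names is_draft and SAMPLE_RESOLVED are hypothetical.

```python
def is_draft(e, locale, chain_up, resolved):
    """e: list of element names; chain_up: [locale, ..., 'root'];
    resolved: locale -> list of chains, each a list of (name, attrs) pairs."""
    # 1. Parent locale inheritance: first locale in the chain that has data.
    l_prime = next((loc for loc in chain_up if loc in resolved), None)
    if l_prime is None:
        return True  # assumption: no data file at all means draft
    e_prime = next((c for c in resolved[l_prime]
                    if [name for name, _ in c][:len(e)] == e), None)
    if e_prime is None:
        return True
    e_prime = e_prime[:len(e)]  # truncate an extension to the length of E
    # 2. Enclosing element inheritance, from back to front.
    for _, attrs in reversed(e_prime):
        if "draft" in attrs:
            return attrs["draft"] != "approved"
    if l_prime == locale:
        return False
    # 3. Missing file inheritance.
    for _, attrs in reversed(e_prime):
        if "validSubLocales" in attrs:
            return locale not in attrs["validSubLocales"].split()
    # 4. Otherwise.
    return True

# Made-up sample data modeled on the examples in I.3.
SAMPLE_RESOLVED = {
    "de": [[("ldml", {"validSubLocales": "de_AT de_CH"}), ("characters", {})]],
    "af": [
        [("ldml", {"draft": "unconfirmed"}), ("characters", {"draft": "approved"})],
        [("ldml", {"draft": "unconfirmed"}), ("localeDisplayNames", {})],
    ],
}
```

With this data, the characters element is not draft for de_AT (listed in validSubLocales), but is draft for a hypothetical de_LU, which has no file and is not listed.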

I.5 Keyword and Default Resolution

When accessing data based on keywords, the following process is used. Consider the following example:

The locale 'de' has collation types A, B, C, and no <default> element
The locale 'de_CH' has <default type='B'>

Here are the searches for various combinations.

  1. de_CH not found
     de not found
     root not found: so get the default type in de_CH
     de@collation=B found
  2. de not found
     root not found: so get the default type in de, which itself falls back to root
     de@collation=standard not found
     root@collation=standard found
  3. de@collation=A found
  4. de@collation=standard not found
     root@collation=standard found
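
The outcome of these searches can be reproduced with a small sketch. It is a simplification under stated assumptions: each locale is modeled as a dictionary with an optional "default" entry and a set of "types", root is assumed to carry the default type "standard", and the initial typeless lookups are folded into the search for a default. The names resolve_keyword and SAMPLE_COLLATIONS are hypothetical.

```python
def resolve_keyword(locale_id, data, requested=None):
    """Return (locale, type) where the data is found, or None if it is absent."""
    fields = [] if locale_id == "root" else locale_id.split("_")
    chain = ["_".join(fields[:n]) for n in range(len(fields), 0, -1)] + ["root"]
    if requested is None:
        # No type requested: take the first <default> found going up to root.
        requested = next(
            (data[loc]["default"] for loc in chain
             if data.get(loc, {}).get("default")), None)
    for loc in chain:
        if requested in data.get(loc, {}).get("types", ()):
            return loc, requested
    return None

# Sample data matching the example above (root contents are an assumption).
SAMPLE_COLLATIONS = {
    "root": {"default": "standard", "types": {"standard"}},
    "de": {"types": {"A", "B", "C"}},
    "de_CH": {"default": "B", "types": set()},
}
```

For instance, resolve_keyword("de_CH", SAMPLE_COLLATIONS) picks up the default B from de_CH and finds the data in de, while resolve_keyword("de", SAMPLE_COLLATIONS) ends at the standard type in root.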

Note: It is an invariant that the default in root for a given element must always be a value that exists in root. So you can't have the following in root:

<someElements>
  <default type='a'/>
  <someElement type='b'>...</someElement>
  <someElement type='c'>...</someElement>
  <!-- no 'a' -->
</someElements>

For identifiers, such as language codes, script codes, region codes, variant codes, types, keywords, currency symbols or currency display names, the default value is the identifier itself whenever no value is found in the root. Thus if there is no display name for the region code 'QA' in root, then the display name is simply 'QA'.

Appendix J: Time Zone Display Names

There are three types of formats for zone identifiers: GMT, generic (wall time), and standard/daylight. Standard and daylight are equivalent to a particular offset from GMT, and can be represented by a GMT offset as a fallback. In general, this is not true for the generic format, which is used for picking timezones or for conveying a timezone for specifying a recurring time (such as a meeting in a calendar). For either purpose, a GMT offset would lose information.

When a timezone is to be displayed, the following process is used. It uses explicit display names where they are available, and otherwise uses a fallback to GMT for non-wall time (standard and daylight). For generic, it falls back to the exemplar city if available, otherwise the country if possible, and otherwise the last field of the zone ID. Only the generic time (or its fallback) should be used in menus, in order to avoid possible collisions in the display names of standard and daylight time.

Each step is followed until a "return" is reached. Some of the examples are drawn from real data, while for illustration the region format is "Tampo de {0}". The fallback format is "{0} ({1})", which is what is in root.

  1. Canonicalize the TZ ID according to the <timezoneData> table in supplemental data. Use that canonical TZID in each of the following steps.
    • America/Atka → America/Adak
    • Australia/ACT → Australia/Sydney
  2. For RFC 822 format ("Z") return the results according to the RFC.
    • America/Los_Angeles → "-0800"

    Note: The digits in this case are always from the western digits, 0..9.

  3. If there is an explicit translation for the TZID according to type (generic, standard, or daylight) in the resolved locale, return it.
    • America/Los_Angeles → "Heure du Pacifique (ÉUA)" // generic
    • America/Los_Angeles → 太平洋標準時 // standard
    • America/Los_Angeles → Yhdysvaltain Tyynenmeren kesäaika // daylight

    Note: This translation may not at all be literal: it would be what is most recognizable for people using the target language.

  4. For non-wall-time (ie, GMT, daylight, or standard) or where there is no country for the TZID (eg, Etc/GMT+3), use the localized GMT format.
    • America/Los_Angeles → "GMT-08:00" //  standard time
    • America/Los_Angeles → "HMG-07:00" //  daylight time
    • Etc/GMT+3 → "GMT-03:00" // note that TZ tzids have inverse polarity!

    Note: The digits should be whatever are appropriate for the locale used to format the time zone, not necessarily from the western digits, 0..9. For example, they might be from ०..९.

  5. Thus the remaining steps are only applicable to the generic format. In these steps, use as the country name an explicitly localized country name if available, otherwise the raw country code. If the localized exemplar city is not available, use as the exemplar city the last field of the raw TZID, stripping off the prefix and turning _ into space.
    • CU → "CU" // no localized country name for Cuba
    • America/Los_Angeles → "Los Angeles" // no localized exemplar city
  6. From <timezoneData> get the country code for the zone, and determine whether there is only one timezone in the country. If there is only one timezone or the zone id is in the singleCountries list, format the country name with the region format, and return it.
    • Africa/Monrovia → LR → "Tampo de Liberja"
    • America/Havana → CU → "Tampo de CU" // if CU is not localized

    Note: If a language does require grammatical changes when composing strings, then it should either use a neutral format such as what is in root, or put all exceptional cases in explicitly translated strings.

  7. Get the exemplar city and country name, and format them with the fallback format (as parameters 0 and 1, respectively).
    • America/Buenos_Aires → "Буэнос-Айрес (Аргентина)"
    • America/Buenos_Aires → "Буэнос-Айрес (AR)" // if Argentina isn't translated
    • America/Buenos_Aires → "Buenos Aires (Аргентина)" // if Buenos Aires isn't
    • America/Buenos_Aires → "Buenos Aires (AR)" // if both aren't

    Note: As with the region format, exceptional cases need to be explicitly translated.
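
Steps 5 through 7 can be sketched as follows. This is an illustration only: the function generic_fallback_name and its parameters are hypothetical, the number of zones in the country is passed in rather than read from <timezoneData>, and the formats default to the illustrative values used in this appendix.

```python
def generic_fallback_name(tzid, country_code, zones_in_country, single_countries,
                          country_names, city_names,
                          region_format="Tampo de {0}",
                          fallback_format="{0} ({1})"):
    """Build the generic-format fallback name for a canonical TZID."""
    # Step 5: localized country name if available, otherwise the raw code.
    country = country_names.get(country_code, country_code)
    # Step 5: localized exemplar city, otherwise the last field of the TZID
    # with underscores turned into spaces.
    city = city_names.get(tzid, tzid.rsplit("/", 1)[-1].replace("_", " "))
    # Step 6: a single zone in the country (or a singleCountries entry)
    # uses the region format.
    if zones_in_country == 1 or tzid in single_countries:
        return region_format.format(country)
    # Step 7: otherwise combine city and country with the fallback format.
    return fallback_format.format(city, country)

print(generic_fallback_name("America/Havana", "CU", 1, set(), {}, {}))
# Tampo de CU
```

Passing a localized country name, for example {"AR": "Аргентина"} for America/Buenos_Aires with more than one zone in the country, yields "Buenos Aires (Аргентина)".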

In parsing, an implementation will be able to either determine the zone id, or a simple offset from GMT for anything formatted according to the above process. The following process should be used, stopping in the first step that matches.

  1. Check for explicitly localized strings.
    • "Tampo de Pacifica" → America/Los_Angeles
  2. Check for RFC 822 and localized GMT formats
    • "-0800" → Etc/GMT+8
    • "GMT-03:00" → Etc/GMT+3
  3. Check for <city, country> using the fallback format. Remember to check for fallback localizations (last field of zone id and the raw country code).
    • “Sydney (Australia)” → Australia/Sydney
  4. Check for localized <country> using the region format. Remember to check for fallback localizations (raw country code).
    • "Tampo de CU" → America/Havana
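
Step 2 of the parsing process can be illustrated with a short sketch that maps an offset string to an Etc zone ID, taking the inverse polarity of those IDs into account. It assumes the literal prefix "GMT" (a localized GMT format may use a different prefix, such as "HMG") and western digits; the function name offset_to_etc is hypothetical.

```python
import re

def offset_to_etc(text):
    """Map '-0800' or 'GMT-03:00' to an Etc/GMT zone ID, or None if unsupported."""
    m = re.fullmatch(r"(?:GMT)?([+-])([0-9]{2}):?([0-9]{2})", text)
    if m is None:
        return None
    sign, hours, minutes = m.groups()
    if minutes != "00":
        return None  # Etc/GMT IDs only exist for whole-hour offsets
    if int(hours) == 0:
        return "Etc/GMT"
    # Etc IDs have inverse polarity: an offset of -08:00 is Etc/GMT+8.
    flipped = "+" if sign == "-" else "-"
    return f"Etc/GMT{flipped}{int(hours)}"
```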

Using this process, a correct parse will roundtrip the generic format (v and vvvv) back to the canonical zoneid.

The GMT formats (Z and ZZZZ) will return back an offset, and thus lose the original canonical zone id.

The daylight and standard time formats (z and zzzz) may either roundtrip back to the original canonical zone id, or to just an offset, depending on the available translation data. Thus:

Parsing can be more lenient than the above, allowing for different spacing, punctuation, or other variation.

Many time zone IDs only represent differences that are important historically, but do not make any difference in modern times. The preferenceOrdering element can be used to select the preferred modern IDs when desired, either in presenting a list of localized timezone names in a user interface, or in formatting. (The choice of the period to use as "modern" when determining when two time zone IDs are equivalent is left to the implementation.)

Whenever two timezone IDs are equivalent in effect and are in the same country, the preference ordering list is examined according to the following process. When used in formatting, this process is used to add additional canonicalization in Step 1 above.

  1. If x and y are both in the list, then the earlier one in the list is preferred.
  2. Else if x is in the list and y isn't, then x is preferred
  3. Else if not in root, repeat #1 and #2 using the parent locale's list
  4. If all else fails, use a case-insensitive comparison of the timezone IDs.
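
The comparison can be sketched as follows. Two points are assumptions of this sketch rather than statements of the specification: step 2 is applied symmetrically (if only y is in the list, y is preferred), and in step 4 the ID that sorts first in the case-insensitive comparison is taken as the preferred one. The names preferred and pick are hypothetical.

```python
from functools import reduce

def preferred(x, y, chain_up, preference_lists):
    """Choose between two equivalent timezone IDs.
    chain_up: [locale, ..., 'root']; preference_lists: locale -> ordered list."""
    for loc in chain_up:
        order = preference_lists.get(loc, [])
        if x in order and y in order:
            return x if order.index(x) <= order.index(y) else y
        if x in order:
            return x
        if y in order:
            return y
        # Neither is listed here: repeat with the parent locale's list.
    # If all else fails, fall back to a case-insensitive comparison.
    return min(x, y, key=str.lower)

def pick(equivalents, chain_up, preference_lists):
    """Reduce a group of equivalent IDs to the single preferred one."""
    return reduce(lambda a, b: preferred(a, b, chain_up, preference_lists),
                  equivalents)
```

For example, with a preference list of ["America/Mexico_City"] in root, pick selects America/Mexico_City from a group that contains it, and selects America/Chihuahua over America/Mazatlan by the final comparison.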

For example, the following table lists the modern equivalents for Mexico on separate rows. If the preference ordering has one element, "America/Mexico_City", then the items marked with * would be chosen as the preferred timezone IDs.

America/Merida, *America/Mexico_City, America/Monterrey, America/Cancun
*America/Chihuahua, America/Mazatlan
*America/Hermosillo
*America/Tijuana

Note: The hoursFormat and abbreviationFallback used in earlier versions of this appendix are deprecated.

Appendix K: Valid Attribute Values

The valid attribute values, as well as other validity information, are contained in the metadata.xml file. (Some, but not all, of this information could have been represented in XML Schema or a DTD.)

The following specify the ordering of elements and attributes in the file:

  <elementOrder>ldml identity alias localeDisplayNames layout ...</elementOrder>
  <attributeOrder>type key registry alt source path day date...</attributeOrder>

The suppress elements are those that are suppressed in canonicalization.

The serialElements are those that do not inherit, and may have ordering:

<serialElements>variable comment tRule reset p pc s sc t tc q qc i ic x extend first_variable last_variable first_tertiary_ignorable last_tertiary_ignorable first_secondary_ignorable last_secondary_ignorable first_primary_ignorable last_primary_ignorable first_non_ignorable last_non_ignorable first_trailing last_trailing</serialElements>

The validity elements give the possible attribute values. They are in the format of a series of variables, followed by attributeValues.

<variable id="$calendar" type="choice">
buddhist coptic ethiopic chinese gregorian hebrew islamic islamic-civil japanese arabic civil-arabic thai-buddhist persian
</variable>

The types indicate the style of match:

If the attribute order="given" is supplied, it indicates the order of elements when canonicalizing (see below).

The <deprecated> element lists elements, attributes, and attribute values that are deprecated. If any deprecatedItems element contains more than one attribute, then only the listed combinations are deprecated. Thus the following means not that the draft attribute is deprecated, but that the true and false values for that attribute are:

<deprecatedItems attributes="draft" values="true false"/> 

 Similarly, the following means that the type attribute is deprecated, but only for the listed elements:

<deprecatedItems elements="abbreviationFallback default ... preferenceOrdering" attributes="type"/> 

Appendix L: Canonical Form

The following are restrictions on the format of LDML files to allow for easier parsing and comparison of files.

Peer elements have consistent order. That is, if the DTD or this specification requires the following order in an element foo:

<foo>
  <pattern>
  <somethingElse>
</foo>

It can never require the reverse order in a different element bar.

<bar>
  <somethingElse>
  <pattern>
</bar>

Note that there was one case that had to be corrected in order to make this true. For that reason, pattern occurs twice under currency:

<!ELEMENT currency (alias | (pattern*, displayName?, symbol?, pattern*,
decimal?, group?, special*)) >

XML files can have a wide variation in textual form, while representing precisely the same data. By putting the LDML files in the repository into a canonical form, this allows us to use the simple diff tools used widely (and in CVS) to detect differences when vetting changes, without those tools being confused. This is not a requirement on other uses of LDML; just simply a way to manage repository data more easily.

L.1 Content

  1. All start elements are on their own line, indented by depth tabs.
  2. All end elements (except for leaf nodes) are on their own line, indented by depth tabs.
  3. Any leaf node with empty content is in the form <foo/>.
  4. There are no blank lines except within comments or content.
  5. Spaces are used within a start element. There are no extra spaces within elements.
    • <version number="1.2"/>, not <version  number = "1.2" />
    • </identity>, not </identity >
  6. All attribute values use double quote ("), not single (').
  7. There are no CDATA sections, and no escapes except those absolutely required.
    • no &apos; since it is not necessary
    • no '&#x61;', it would be just 'a'
  8. All attributes with defaulted values are suppressed. See L.8 Defaulted Values Table.
  9. The draft and alt="proposed.*" attributes are only on leaf elements.
  10. The tzid are canonicalized in the following way:
    1. All tzids as of CLDR 1.1 (2004.06.08) in zone.tab are canonical.
    2. After that point, the first time a tzid is introduced, that is the canonical form.

    That is, new IDs are added, but existing ones keep the original form. The TZ timezone database keeps a set of equivalences in the "backward" file. These are used to map other tzids to the canonical form. For example, when America/Argentina/Catamarca was introduced as the new name for the previous America/Catamarca, a link was added in the backward file.

    Link America/Argentina/Catamarca America/Catamarca

Example:

<ldml draft="unconfirmed">
	<identity>
		<version number="1.2"/>
		<generation date="2004-06-04"/>
		<language type="en"/>
		<territory type="AS"/>
	</identity>
	<numbers>
		<currencyFormats>
			<currencyFormatLength>
				<currencyFormat>
					<pattern>¤#,##0.00;(¤#,##0.00)</pattern>
				</currencyFormat>
			</currencyFormatLength>
		</currencyFormats>
	</numbers>
</ldml>

L.2 Ordering

  1. Element names are ordered by the Element Order Table
  2. Attribute names are ordered by the Attribute Order Table
  3. Attribute value comparison is a bit more complicated, and may depend on the attribute and type. Compare two values by using the following steps:
    1. If two values are in the Value Order Table, compare according to the order in the table. Otherwise if just one is, it goes first.
    2. If two values are numeric [0-9], compare numerically (2 < 12). Otherwise if just one is numeric, it goes first.
    3. Otherwise values are ordered alphabetically
  4. An attribute-value pair is ordered first by attribute name, and then if the attribute names are identical, by the value.
  5. An element is ordered first by the element name, and then if the element names are identical, by the sorted set of attribute-value pairs (sorted by #4). For the latter, compare the first pair in each (in sorted order by attribute pair). If not identical, go to the second pair, etc.
  6. Any future additions to the DTD must be structured so as to allow compatibility with this ordering.
  7. See also Appendix K: Valid Attribute Values
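
The attribute value comparison in step 3 can be sketched as a comparison function. This is an illustration under assumptions: the Value Order Table is passed in as a plain list, "numeric" means a string consisting only of the digits 0-9, and "alphabetically" is approximated by code point order. The name compare_values is hypothetical.

```python
import re
from functools import cmp_to_key

def compare_values(a, b, value_order=()):
    """Return a negative, zero, or positive number, as for a classic comparator."""
    in_a, in_b = a in value_order, b in value_order
    if in_a and in_b:
        return value_order.index(a) - value_order.index(b)
    if in_a != in_b:
        return -1 if in_a else 1  # the value found in the table goes first
    num_a = re.fullmatch(r"[0-9]+", a) is not None
    num_b = re.fullmatch(r"[0-9]+", b) is not None
    if num_a and num_b:
        return int(a) - int(b)  # numeric comparison, so 2 < 12
    if num_a != num_b:
        return -1 if num_a else 1  # the numeric value goes first
    return (a > b) - (a < b)  # otherwise compare the strings

table = ["wide", "abbreviated"]  # made-up table order for illustration
values = ["12", "2", "b", "a", "abbreviated", "wide"]
print(sorted(values, key=cmp_to_key(lambda x, y: compare_values(x, y, table))))
# ['wide', 'abbreviated', '2', '12', 'a', 'b']
```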

L.3 Comments

  1. Comments are of the form <!-- stuff -->.
  2. They are logically attached to a node. There are 4 kinds:
    1. Inline always appear after a leaf node, on the same line at the end. These are a single line.
    2. Preblock comments always precede the attachment node, and are indented on the same level.
    3. Postblock comments always follow the attachment node, and are indented on the same level.
    4. Final comment, after </ldml>
  3. Multiline comments (except the final comment) have each line after the first indented to one deeper level.

Examples:

<eraAbbr>
	<era type="0">BC</era> <!-- might add alternate BDE in the future -->
...
<timeZoneNames>
	<!-- Note: zones that don't use daylight time need further work --> 
	<zone type="America/Los_Angeles">
	...
	<!-- Note: the following is known to be sparse,
		and needs to be improved in the future -->
	<zone type="Asia/Jerusalem">

L.4 Canonicalization

The process of canonicalization is fairly straightforward, except for comments. Inline comments will have any linebreaks replaced by a space. There may be cases where the attachment node is not permitted, such as the following.

		</dayWidth>
		<!-- some comment -->
	</dayContext>
</days>

In those cases, the comment will be made into a block comment on the last previous leaf node, if it is at that level or deeper. (If there is one already, it will be appended, with a line-break between.) If there is no place to attach the node (for example, as a result of processing that removes the attachment node), the comment and its node's xpath will be appended to the final comment in the document.

Multiline comments will have leading tabs stripped, so any indentation should be done with spaces.


L.5 Element Order Table

The order of elements is given by the elementOrder table in the supplemental metadata.

L.6 Attribute Order Table

The order of attributes is given by the attributeOrder table in the supplemental metadata.

L.7 Value Order Table

The order of attribute values is given by the order of the values in the attributeValues elements that have the attribute order="given". Numeric values are sorted in numeric order, while tzids are ordered by country, then longitude, then latitude.

L.8 Defaulted Values Table

The defaulted attributes are given by the suppress table in the supplemental metadata. There is one special value _q; that is used on serial elements internally to preserve ordering.

Appendix M: Coverage Levels

The following defines the coverage levels:

Level                      Description
level 100 comprehensive    Has complete localizations (or valid inheritance) for every possible field
level 80 modern            Localizations (or valid inheritance) as given below
level 60 moderate
level 40 basic
level 20 posix             Only what is required for POSIX generation; for example, only one country name, only one currency symbol, etc.
level 0 rudimentary        Doesn't meet any of the above levels. (default, if nothing specified)

Levels 40 and 60 are based on the following definitions and specifications.

M.1 Definitions

Machine-readable information for this is contained in the <coverageAdditions> element in the supplemental metadata file.

M.2 Data Requirements

The required data to qualify for the level is then the following.

  1. identity
  2. localeDisplayNames
    1. languages: localized names for all languages in Language-List.
    2. scripts: localized names for all scripts in Script-List.
    3. territories: localized names for all territories in Territory-List.
    4. variants, keys, types: localized names for any in use in Target-Territories; e.g. a translation for PHONEBOOK in a German locale.
  3. layout, orientation
  4. exemplarCharacters
  5. measurementSystem, paperSize
  6. dates: all of the following for each calendar in Calendar-List.
    1. calendars: localized names
    2. monthNames & dayNames
      • context=format and width=narrow, wide, & abbreviated
      • plus context=standAlone and width=narrow, wide, & abbreviated, if the grammatical forms of these are different than for context=format.
    3. week: minDays, firstDay, weekendStart, weekendEnd
      • if some of these vary in territories in Territory-List, include territory locales for those that do.
    4. am, pm, eraNames, eraAbbr
    5. dateFormat, timeFormat: full, long, medium, short
  7. timeZoneNames:
    1. exemplar cities for the timezones in Timezone-List
    2. hourFormat, hoursFormat, gmtFormat, regionFormat, fallbackFormat
  8. numbers: symbols, decimalFormats, scientificFormats, percentFormats, currencyFormats
  9. currencies: displayName and symbol for all currencies in Currency-List
  10. collation sequence
  11. yesstr, nostr
  12. transforms:
    • basic: none
    • moderate: transliteration between Latin and each other script in Target-Scripts.

M.3 Default Values

Items should only be included if they are not the same as the default, which is:

Appendix N: Transform Rules

The transform rules are similar to regular-expression substitutions, but adapted to the specific domain of text transformations. The rules and comments in this discussion will be intermixed, with # marking the comments. In the XML format these are in separate elements: comment and tRule. The simplest rule is a conversion rule, which replaces one string of characters with another. The conversion rule takes the following form:

xy → z ;

This converts any substring "xy" into "z". Rules are executed in order; consider the following rules:

sch → sh ;
ss → z ;

These conversion rules transform "bass school" into "baz shool". The transform walks through the string from start to finish. Thus given the rules above "bassch" will convert to "bazch", because the "ss" rule is found before the "sch" rule in the string (later, we'll see a way to override this behavior). If two rules can both apply at a given point in the string, then the transform applies the first rule in the list.
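
The left-to-right walk can be sketched in a few lines. This is only an illustration of simple conversion rules, with rules modeled as (source, target) pairs of plain strings; it does not handle quoting, context, variables, or revisiting, and it assumes every source string is non-empty. The name apply_rules is hypothetical.

```python
def apply_rules(text, rules):
    """Walk through text from start to finish, applying the first rule
    in the list that matches at the current position."""
    out = []
    i = 0
    while i < len(text):
        for source, target in rules:
            if text.startswith(source, i):
                out.append(target)
                i += len(source)  # continue after the replaced text
                break
        else:
            out.append(text[i])  # no rule applies: copy the character
            i += 1
    return "".join(out)

rules = [("sch", "sh"), ("ss", "z")]
print(apply_rules("bass school", rules))  # baz shool
print(apply_rules("bassch", rules))       # bazch
```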

All of the ASCII characters except numbers and letters are reserved for use in the rule syntax, as are the characters →, ←, ↔. Normally, these characters do not need to be converted. However, to convert them use either a pair of single quotes or a backslash. The pair of single quotes can be used to surround a whole string of text. The backslash affects only the character immediately after it. For example, to convert an arrow sign to the words "arrow sign", use one of the following rules:

\←   →  arrow\ sign ;
'←'   →   'arrow sign' ;
'←'   →   arrow' 'sign ;

Spaces may be inserted anywhere without any effect on the rules. Use extra space to separate items out for clarity without worrying about the effects. This feature is particularly useful with combining marks; it is handy to put some spaces around it to separate it from the surrounding text. The following is an example:

 → i ; # an iota-subscript diacritic turns into an i.

For a real space in the rules, place quotes around it. For a real backslash, either double it \\, or quote it '\'. For a real single quote, double it '', or place a backslash before it \'.

Any text that starts with a hash mark and concludes a line is a comment. Comments help document how the rules work. The following shows a comment in a rule:

x → ks ; # change every x into ks

The "\u" notation can be used instead of any letter. For instance, instead of using the Greek π, one could write:

\u03C0 → p ;

One can also define and use variables, such as:

$pi = \u03C0 ;
$pi → p ;

N.1 Dual Rules

Rules can also specify what happens when an inverse transform is formed. To do this, we reverse the direction of the "→" sign. Thus the above example becomes:

$pi ← p ;

With the inverse transform, "p" will convert to the Greek π. These two directions can be combined together into a dual conversion rule by using the "↔" operator, yielding:

$pi ↔ p ;

N.2 Context

Context can be used to have the results of a transformation be different depending on the characters before or after. The following means "Remove hyphens, but only when they follow lowercase letters":

[:lowercase letter:] { '-' → '' ;

The context itself ([:lowercase letter:]) is unaffected by the replacement; only the text between the curly braces is changed.
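
The effect of this context rule, removing hyphens only when they follow lowercase letters, can be approximated in Python. This sketch assumes that [:lowercase letter:] corresponds to the Unicode general category Ll, and that the context is checked against the text already produced; the function name remove_hyphen_after_lowercase is hypothetical.

```python
import unicodedata

def remove_hyphen_after_lowercase(text):
    """Drop each '-' whose preceding character is a lowercase letter (Ll)."""
    out = []
    for ch in text:
        if ch == "-" and out and unicodedata.category(out[-1]) == "Ll":
            continue  # the context (the preceding letter) is left unchanged
        out.append(ch)
    return "".join(out)

print(remove_hyphen_after_lowercase("co-operate"))  # cooperate
print(remove_hyphen_after_lowercase("A-1"))         # A-1
```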

N.3 Revisiting

If the resulting text contains a vertical bar "|", then that means that processing will proceed from that point and that the transform will revisit part of the resulting text. Thus the | marks a "cursor" position. For example, if we have the following, then the string "xa" will convert to "yw".

x → y | z ;
z a → w;

First, "xa" is converted to "yza". Then the processing will continue from after the character "y", pick up the "za", and convert it. Had we not had the "|", the result would have been simply "yza". The '@' character can be used as a filler character to place the revisiting point off the start or end of the string. Thus the following causes x to be replaced, and the cursor to be backed up by two characters.

x → |@@y;
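
The cursor behavior can be illustrated with a small sketch that traces the first example: "xa" first becomes "yza", then processing resumes after "y", so "za" is picked up and converted, giving "yw". The sketch models rules as (source, replacement) string pairs in which a single "|" in the replacement marks the cursor; it does not handle the '@' filler, context, or quoting, and rules that keep re-matching their own output would loop forever. The name apply_with_cursor is hypothetical.

```python
def apply_with_cursor(text, rules):
    """Apply conversion rules, resuming at the '|' cursor when one is given."""
    i = 0
    while i < len(text):
        for source, replacement in rules:
            if text.startswith(source, i):
                before, cursor, after = replacement.partition("|")
                new = before + after
                text = text[:i] + new + text[i + len(source):]
                # With a cursor, revisit the text after it; otherwise skip it all.
                i += len(before) if cursor else len(new)
                break
        else:
            i += 1  # no rule applies at this position
    return text

print(apply_with_cursor("xa", [("x", "y|z"), ("za", "w")]))  # yw
print(apply_with_cursor("xa", [("x", "yz"), ("za", "w")]))   # yza
```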

N.4 Example

The following shows how these features are combined together in the Transliterator "Any-Publishing". This transform converts the ASCII typewriter conventions into text more suitable for desktop publishing (in English). It turns straight quotation marks or UNIX style quotation marks into curly quotation marks, fixes multiple spaces, and converts double-hyphens into a dash.

# Variables

$single = \' ;
$space = ' ' ;
$double = \" ;
$back = \` ;
$tab = '\u0009' ;

# the following is for spaces, line ends, (, [, {, ...
$makeRight = [[:separator:][:start punctuation:][:initial punctuation:]] ;

# fix UNIX quotes

$back $back → “ ; # generate left d.q.m. (double quotation mark)
$back → ‘ ;

# fix typewriter quotes, by context

$makeRight { $double ↔ “ ; # convert a double to left d.q.m. after certain chars
^ { $double → “ ; # convert a double at the start of the line.
$double ↔ ” ; # otherwise convert to a right d.q.m.

$makeRight {$single} ↔ ‘ ; # do the same for s.q.m.s
^ {$single} → ‘ ;
$single ↔ ’;

# fix multiple spaces and hyphens

$space {$space} → ; # collapse multiple spaces
'--' ↔ — ; # convert fake dash into real one

N.5 Rule Syntax

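The following describes the full format of the list of rules used to create a transform. Each rule in the list is terminated by a semicolon. The list consists of the following kinds of rules, each described in the sections below: an optional filter rule, transform rules, variable definition rules, conversion rules, and an optional inverse filter rule.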

The filter rule, if present, must appear at the beginning of the list, before any of the other rules.  The inverse filter rule, if present, must appear at the end of the list, after all of the other rules.  The other rules may occur in any order and be freely intermixed.

The rule list can also generate the inverse of the transform. In that case, the inverse of each of the rules is used, as described below.

N.6 Transform Rules

Each transform rule consists of two colons followed by a transform name, which is of the form source-target. For example:

:: NFD ;
:: und_Latn-und_Greek ;
:: Latin-Greek; # alternate form

If either the source or target is 'und', it can be omitted; thus 'und_NFC' is equivalent to 'NFC'. For compatibility, the English names for scripts can be used instead of the und_Latn locale name, and "Any" can be used instead of "und". Case is not significant.

The following transforms are defined not by rules, but by the operations in the Unicode Standard, and may be used in building any other transform:

Any-NFC, Any-NFD, Any-NFKD, Any-NFKC - the normalization forms defined by [UAX15]

Any-Lower, Any-Upper, Any-Title - full case transformations, defined by [Unicode] Chapter 3.

In addition, the following special cases are defined:

Any-Null - has no effect; that is, each character is left alone.
Any-Remove - maps each character to the empty string; that is, removes each character.

The inverse of a transform rule uses parentheses to indicate what should be done when the inverse transform is used. For example:

:: lower () ; # only executed for the normal transform
:: (lower) ; # only executed for the inverse
:: lower ; # executed for both the normal and the inverse

N.7 Variable Definition Rules

Each variable definition is of the following form:

$variableName = contents ;

The variable name can contain letters and digits, but must start with a letter. More precisely, the variable names use Unicode identifiers as defined by [UAX31]. The identifier properties allow for the use of foreign letters and numbers.

The contents of a variable definition is any sequence of Unicode sets and characters. For example:

$mac = M [aA] [cC] ;

Variables are only replaced within other variable definition rules and within conversion rules. They have no effect on transform rules.

N.8 Filter Rules

A filter rule consists of two colons followed by a UnicodeSet. This filter is global in that only the characters matching the filter will be affected by any transform rules or conversion rules. The inverse filter rule consists of two colons followed by a UnicodeSet in parentheses. This filter is also global for the inverse transform.

For example, the Hiragana-Latin transform can be implemented by "pivoting" through the Katakana converter, as follows:

:: [:^Katakana:] ; # don't touch any katakana that was in the text!
:: Hiragana-Katakana;
:: Katakana-Latin;
:: ([:^Katakana:]) ; # don't touch any katakana that was in the text
                     # for the inverse either!

The filters keep the transform from mistakenly converting any of the "pivot" characters. Note that this is a case where a rule list contains no conversion rules at all, just transform rules and filters.
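
The effect of a global filter can be sketched in Python as follows. This is an illustration under assumptions, not a definition: the filter is modeled as a predicate on single characters, the transform as a function on strings, and the transform is applied separately to each maximal run of characters that pass the filter. The function names, and the use of Unicode character names to approximate [:^Katakana:], are hypothetical choices for the example.

```python
import unicodedata

def apply_with_filter(text, in_filter, transform):
    """Apply `transform` only to runs of characters accepted by `in_filter`."""
    out, run = [], []
    for ch in text:
        if in_filter(ch):
            run.append(ch)
        else:
            if run:                       # flush the pending filtered run
                out.append(transform("".join(run)))
                run = []
            out.append(ch)                # filtered-out text is left untouched
    if run:
        out.append(transform("".join(run)))
    return "".join(out)

def not_katakana(ch):
    # Rough stand-in for [:^Katakana:], based on the character name.
    return "KATAKANA" not in unicodedata.name(ch, "")

# A stand-in transform that visibly changes everything it is given.
mask = lambda s: "*" * len(s)
print(apply_with_filter("ab カナ", not_katakana, mask))  # ***カナ
```

The katakana in the input survives even though the stand-in transform would otherwise have overwritten it.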

N.9 Conversion Rules

Conversion rules can be forward, backward, or double. The complete conversion rule syntax is described below:

N.9.1 Forward

A forward conversion rule is of the following form:

before_context { text_to_replace } after_context → completed_result | result_to_revisit ;

If there is no before_context, then the "{" can be omitted. If there is no after_context, then the "}" can be omitted. If there is no result_to_revisit, then the "|" can be omitted. A forward conversion rule is only executed for the normal transform and is ignored when generating the inverse transform.

N.9.2 Backward

A backward conversion rule is of the following form:

completed_result | result_to_revisit ← before_context { text_to_replace } after_context ;

The same omission rules apply as in the case of forward conversion rules. A backward conversion rule is only executed for the inverse transform and is ignored when generating the normal transform.

N.9.3 Dual

A dual conversion rule combines a forward conversion rule and a backward conversion rule into one, as discussed above. It is of the form:

a { b | c } d ↔ e { f | g } h ;

When generating the normal transform and the inverse, the revisit mark "|" and the before and after contexts are ignored on the sides where they don't belong. Thus, the above is exactly equivalent to the sequence of the following two rules:

a { b c } d  →  f | g  ;
b | c  ←  e { f g } h ;  
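
The equivalence above can be reproduced mechanically. The following Python sketch splits the text of a dual rule into its forward and backward rules. It is a simplified illustration: it assumes that "{", "}", "|" and "↔" do not occur inside quoted literals or sets, and all function names are hypothetical.

```python
def split_side(side):
    """Return (before_context, text, after_context) for one side of a rule."""
    before, brace, rest = side.partition("{")
    if not brace:                      # no "{": there is no before-context
        before, rest = "", side
    text, _, after = rest.partition("}")
    return before.strip(), text.strip(), after.strip()

def as_source(side):
    """Matching side: keep the contexts, drop the revisit mark."""
    before, text, after = split_side(side)
    parts = []
    if before:
        parts += [before, "{"]
    parts.append(" ".join(text.replace("|", " ").split()))
    if after:
        parts += ["}", after]
    return " ".join(parts)

def as_result(side):
    """Replacement side: drop the contexts, keep the revisit mark."""
    return split_side(side)[1]

def split_dual(rule):
    left, right = rule.rstrip("; \t").split("↔")
    forward = f"{as_source(left)} → {as_result(right)} ;"
    backward = f"{as_result(left)} ← {as_source(right)} ;"
    return forward, backward

for line in split_dual("a { b | c } d ↔ e { f | g } h ;"):
    print(line)
# a { b c } d → f | g ;
# b | c ← e { f g } h ;
```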

N.10 Intermixing Transform Rules and Conversion Rules

Transform rules and conversion rules may be freely intermixed. Inserting a transform rule into the middle of a set of conversion rules has an important side effect.

Normally, conversion rules are considered together as a group.  The only time their order in the rule set is important is when more than one rule matches at the same point in the string.  In that case, the one that occurs earlier in the rule set wins.  In all other situations, when multiple rules match overlapping parts of the string, the one that matches earlier wins.

Transform rules apply to the whole string.  If you have several transform rules in a row, the first one is applied to the whole string, then the second one is applied to the whole string, and so on.  To reconcile this behavior with the behavior of conversion rules, transform rules have the side effect of breaking a surrounding set of conversion rules into two groups: First all of the conversion rules before the transform rule are applied as a group to the whole string in the usual way, then the transform rule is applied to the whole string, and then the conversion rules after the transform rule are applied as a group to the whole string.  For example, consider the following rules:

abc → xyz;
xyz → def;
::Upper;

If you apply these rules to “abcxyz”, you get “XYZDEF”.  If you move the “::Upper;” to the middle of the rule set and change the cases accordingly, then applying this to “abcxyz” produces “DEFDEF”.

abc → xyz;
::Upper;
XYZ → DEF;

This is because “::Upper;” causes the transliterator to reset to the beginning of the string. The first rule turns the string into “xyzxyz”, the second rule uppercases the whole thing to “XYZXYZ”, and the third rule turns this into “DEFDEF”.
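
The grouping behavior can be modeled with a short Python sketch. It assumes literal conversion rules without contexts or revisiting, represents each group of conversion rules as a list of hypothetical (source, result) pairs, and represents a transform rule as an ordinary function on strings (str.upper stands in for "::Upper;").

```python
def run_group(text, rules):
    """Apply one group of literal conversion rules in a single pass."""
    pos = 0
    while pos < len(text):
        for source, result in rules:      # earlier rules win at the same point
            if text.startswith(source, pos):
                text = text[:pos] + result + text[pos + len(source):]
                pos += len(result)        # continue after the replacement
                break
        else:
            pos += 1
    return text

def run_passes(text, steps):
    """Each step is a group of conversion rules or a whole-string transform."""
    for step in steps:
        text = step(text) if callable(step) else run_group(text, step)
    return text

upper_last = [[("abc", "xyz"), ("xyz", "def")], str.upper]
upper_middle = [[("abc", "xyz")], str.upper, [("XYZ", "DEF")]]
print(run_passes("abcxyz", upper_last))    # XYZDEF
print(run_passes("abcxyz", upper_middle))  # DEFDEF
```

Each step starts again at the beginning of the string, which is why the position of the transform rule changes the result.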

This can be useful when a transform naturally occurs in multiple “passes.”  Consider this rule set:

[:Separator:]* → ' ';
'high school' → 'H.S.';
'middle school' → 'M.S.';
'elementary school' → 'E.S.';

If you apply these rules to “high school”, you get “H.S.”, but if you apply them to “high  school” (with two spaces), you just get “high school” (with one space).  To have “high  school” (with two spaces) turn into “H.S.”, you'd either have to have the first rule back up some arbitrary distance (far enough to see “elementary”, if you want all the rules to work), or you'd have to include the whole left-hand side of the first rule in the other rules, which can make them hard to read and maintain:

$space = [:Separator:]*;
high $space school → 'H.S.';
middle $space school → 'M.S.';
elementary $space school → 'E.S.';

Instead, you can simply insert “::Null;” in order to get things to work right:

[:Separator:]* → ' ';
::Null;
'high school' → 'H.S.';
'middle school' → 'M.S.';
'elementary school' → 'E.S.';

The “::Null;” has no effect of its own (the null transform, by definition, doesn't do anything), but it splits the other rules into two “passes”: The first rule is applied to the whole string, normalizing all runs of white space into single spaces, and then we start over at the beginning of the string to look for the phrases.  “high    school” (with four spaces) gets correctly converted to “H.S.”.
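
The difference between one pass and two passes can be reproduced with a small Python model. The sketch makes several assumptions: rules are given as hypothetical (regular expression, replacement) pairs, a run of ASCII spaces stands in for [:Separator:]*, and empty matches are skipped so that the loop always advances.

```python
import re

def run_group(text, rules):
    """Apply one group of (regex, replacement) rules in a single pass."""
    pos = 0
    while pos < len(text):
        for pattern, result in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:           # ignore empty matches
                text = text[:pos] + result + text[m.end():]
                pos += len(result)
                break
        else:
            pos += 1
    return text

SPACES = [(" +", " ")]
PHRASES = [("high school", "H.S."),
           ("middle school", "M.S."),
           ("elementary school", "E.S.")]

def one_pass(text):
    return run_group(text, SPACES + PHRASES)

def two_passes(text):
    # Corresponds to placing "::Null;" between the two groups of rules.
    return run_group(run_group(text, SPACES), PHRASES)

print(one_pass("high  school"))      # high school
print(two_passes("high    school"))  # H.S.
```

In the single pass, the cursor has already moved beyond "high" by the time the spaces are collapsed, so the phrase is never seen as a whole.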

This can also sometimes be useful with rules that have overlapping domains.  Consider this rule set from before:

sch → sh ;
ss → z ;

Applying these rules to “bassch” results in “bazch” because “ss” matches earlier in the string than “sch”.  If you really wanted “bassh”, that is, if you wanted the first rule to win even when the second rule matches earlier in the string, you'd either have to add another rule for this special case...

sch → sh ;
ssch → ssh;
ss → z ;

...or you could use a transform rule to apply the conversions in two passes:

sch → sh ;
::Null;
ss → z ;

N.11 Inverse Summary

The following table shows how the same rule list generates two different transforms, where the inverse is restated in terms of forward rules (this is a contrived example, simply to show the reordering):

Original Rules                Forward                       Inverse
:: [:Uppercase Letter:] ;     :: [:Uppercase Letter:] ;     :: [:Number:] ;
:: latin-greek ;              :: latin-greek ;              :: publishing-any ;
:: greek-japanese ;           :: greek-japanese ;           d → c ;
x ↔ y ;                       x → y ;                       :: lower ;
z → w ;                       z → w ;                       y → x ;
r ← m ;                       :: upper ;                    m → r ;
:: upper;                     a → b ;                       :: japanese-greek ;
a → b ;                       c → d ;                       :: greek-latin ;
c ↔ d ;                       :: any-publishing ;
:: any-publishing ;
:: ([:Number:]) ;

Note how the irrelevant rules (the inverse filter rule and the rules containing ←) are omitted (ignored, actually) in the forward direction, and notice how things are reversed: the transform rules are inverted and happen in the opposite order, and the groups of conversion rules are also executed in the opposite relative order (although the rules within each group are executed in the same order).
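
The reordering can be sketched in Python. This is a simplified illustration that handles only the rule shapes used in the contrived example: filters, transform rules given by name, and context-free conversion rules. The function names are hypothetical, and the table that pairs "upper" with "lower" is an assumption made so that the sketch reproduces the example.

```python
INVERSE_NAMES = {"upper": "lower", "lower": "upper"}  # assumed pairing

def invert_name(name):
    """Swap source and target; use the assumed table for bare names."""
    if "-" in name:
        source, target = name.split("-", 1)
        return f"{target}-{source}"
    return INVERSE_NAMES.get(name, name)

def split_directions(rule_text):
    forward, inverse_filter = [], []
    steps = []                          # ("T", name) or ("C", [inverse rules])
    for rule in (r.strip() for r in rule_text.split(";")):
        if not rule:
            continue
        if rule.startswith("::"):
            body = rule[2:].strip()
            if body.startswith("("):    # inverse filter rule
                inverse_filter.append(f":: {body[1:-1].strip()} ;")
            elif body.startswith("["):  # forward filter rule
                forward.append(f":: {body} ;")
            else:                       # transform rule: ends a rule group
                forward.append(f":: {body} ;")
                steps.append(("T", body))
            continue
        if not steps or steps[-1][0] != "C":
            steps.append(("C", []))     # start a new group of conversion rules
        group = steps[-1][1]
        if "↔" in rule:
            left, right = (s.strip() for s in rule.split("↔"))
            forward.append(f"{left} → {right} ;")
            group.append(f"{right} → {left} ;")
        elif "←" in rule:
            left, right = (s.strip() for s in rule.split("←"))
            group.append(f"{right} → {left} ;")
        else:                           # forward-only conversion rule
            forward.append(f"{rule} ;")
    inverse = list(inverse_filter)
    for kind, payload in reversed(steps):   # groups run in opposite order
        if kind == "T":
            inverse.append(f":: {invert_name(payload)} ;")
        else:
            inverse.extend(payload)         # order within a group is kept
    return forward, inverse

ORIGINAL = """:: [:Uppercase Letter:] ; :: latin-greek ; :: greek-japanese ;
x ↔ y ; z → w ; r ← m ; :: upper; a → b ; c ↔ d ;
:: any-publishing ; :: ([:Number:]) ;"""
forward, inverse = split_directions(ORIGINAL)
print("\n".join(inverse))
```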

Appendix O: Lenient Parsing

O.1 Motivation

User input is frequently messy. Attempting to parse it by matching it exactly against a pattern is likely to be unsuccessful, even when the meaning of the input is clear to a human being. For example, for a date pattern of "MM/dd/yy", the input "June 1, 2006" will fail.

The goal of lenient parsing is to accept user input whenever it is possible to decipher what the user intended. Doing so requires using patterns as data to guide the parsing process, rather than an exact template that must be matched. This informative section suggests some heuristics that may be useful for lenient parsing of dates, times, and numbers.

O.2 Loose Matching

Loose matching ignores attributes of the strings being compared that are not important to matching. It involves the following steps:

Loose matching involves (logically) applying the above transform to both the input text and to each of the field elements used in matching, before applying the specific heuristics below. For example, if the input number text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before processing. The currency signs are also transformed, so "NA f." is converted to "naf" for purposes of matching. As with other Unicode algorithms, this is a logical statement of the process; actual implementations can optimize, such as by applying the transform incrementally during matching.
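
A minimal Python sketch of this kind of normalization follows. The three steps it performs (dropping white space, folding case, and dropping a period that directly follows a letter) are assumptions chosen only to reproduce the example; they are not a statement of the complete transform, and the function name is hypothetical.

```python
import re

def loose(text):
    """Illustrative loose-matching key; the steps are assumptions."""
    text = "".join(ch for ch in text if not ch.isspace())  # drop white space
    text = text.casefold()                                 # ignore case
    # Drop a period that directly follows a letter, as in "f." of "NA f.".
    return re.sub(r"(?<=[^\W\d_])\.", "", text)

print(loose(" - NA f. 1,000.00"))  # -naf1,000.00
print(loose("NA f."))              # naf
```

Because the same function is applied to the input and to the currency symbol, the two can then be compared directly.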

O.3 Parsing Numbers

The following elements are relevant to determining the value of a parsed number:

Other characters should either be ignored, or indicate the end of input, depending on the application. The key point is to disambiguate the sets of characters that might serve in more than one position, based on context. For example, a period might be either the decimal separator, or part of a currency symbol (e.g., "NA f."). Similarly, an "E" could be an exponent indicator, or a currency symbol (the Swaziland Lilangeni uses "E" in the "en" locale). An apostrophe might be the decimal separator, or might be the grouping separator.

Here is a set of heuristic rules that may be helpful:

O.4 Parsing Dates and Times

Lenient parsing of date and time strings is more complicated, due to the large number of possible fields and formats. The fields fall into two categories: numeric fields (hour, day of month, year, numeric month, etc.) and symbolic fields (era, quarter, month, weekday, AM/PM, time zone). In addition, the user may type in a date or time in a form that is significantly different from the normal format for the locale, and the application must use the locale information to figure out what the user meant. Input may well consist of nothing but a string of numbers with separators, e.g. "09/05/02 09:57:33".

The input can be separated into tokens: numbers, symbols, and literal strings. Some care must be taken due to ambiguity, e.g. in the Japanese locale the symbol for March is "3 月", which looks like a number followed by a literal. To avoid these problems, symbols should be checked first, and spaces should be ignored (except to delimit the tokens of the input string).
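
The tokenization can be sketched in Python as follows. The sketch is illustrative: the symbol table, its (field, value) pairs, and the token tuples are hypothetical structures, symbols are assumed to be stored without internal spaces, and matching is exact rather than loose. Symbols are tried before numbers, longest first, so that "3月" is recognized as a month rather than as a number followed by a literal.

```python
def tokenize(text, symbols):
    """Split text into symbol, number, and literal tokens.

    symbols: dict mapping symbol text to a (field, value) pair.
    """
    by_length = sorted(symbols, key=len, reverse=True)  # longest match first
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():          # spaces only delimit tokens
            pos += 1
            continue
        match = next((s for s in by_length if text.startswith(s, pos)), None)
        if match is not None:            # symbols are checked first
            tokens.append(("symbol",) + symbols[match])
            pos += len(match)
        elif text[pos].isdecimal():
            end = pos
            while end < len(text) and text[end].isdecimal():
                end += 1
            tokens.append(("number", int(text[pos:end])))
            pos = end
        else:
            tokens.append(("literal", text[pos]))
            pos += 1
    return tokens

print(tokenize("3月 5", {"3月": ("month", 3)}))
# [('symbol', 'month', 3), ('number', 5)]
print(tokenize("June 1, 2006", {"June": ("month", 6)}))
# [('symbol', 'month', 6), ('number', 1), ('literal', ','), ('number', 2006)]
```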

The meaning of symbol fields should be easy to determine; the problem is determining the meaning of the numeric fields. Disambiguation will likely be most successful if it is based on heuristics. Here are some rules that can help:

References

Ancillary Information

To properly localize, parse, and format data requires ancillary information, which is not expressed in Locale Data Markup Language. Some of the formats for values used in Locale Data Markup Language are constructed according to external specifications. The sources for this data and/or formats include the following:
 
[Charts] The online code charts
http://unicode.org/charts/
An index to character names, with links to the corresponding chart:
http://unicode.org/charts/charindex.html
[DUCET] The Default Unicode Collation Element Table (DUCET)
For the base-level collation, of which all the collation tables in this document are tailorings.
http://unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
[FAQ] Unicode Frequently Asked Questions
http://unicode.org/faq/
For answers to common questions on technical issues.
[FCD] As defined in UTN #5 Canonical Equivalences in Applications
http://unicode.org/notes/tn5/
[Bugs] CLDR Bug Reporting form
http://www.unicode.org/cldr/filing_bug_reports.html
[Glossary] Unicode Glossary
http://unicode.org/glossary/
For explanations of terminology used in this and other documents.
[JavaChoice] Java ChoiceFormat
http://java.sun.com/j2se/1.4.2/docs/api/java/text/ChoiceFormat.html
[Olson] The TZID Database (aka Olson timezone database)
Timezone and daylight savings information.
ftp://elsie.nci.nih.gov/pub/

For archived data, see
ftp://munnari.oz.au/pub/oldtz/

For general information, see
http://www.twinsun.com/tz/tz-link.htm

[Reports] Unicode Technical Reports
http://unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[UCA] UTS #10: Unicode Collation Algorithm
http://unicode.org/reports/tr10/
[UCD] The Unicode Character Database (UCD)
For character properties, casing behavior, default line-, word-, cluster-breaking behavior, etc.
http://unicode.org/ucd/
[Unicode] The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1.
[Versions] Versions of the Unicode Standard
http://unicode.org/standard/versions
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.
Other Standards

Various standards define codes that are used as keys or values in Locale Data Markup Language. These include:
[ISO639] ISO Language Codes
http://lcweb.loc.gov/standards/iso639-2/
Actual List:
http://www.loc.gov/standards/iso639-2/langcodes.html
[ISO3166] ISO Region Codes
http://www.iso.org/iso/en/prods-services/iso3166ma/index.html
Actual List
http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html
[ISO4217] ISO Currency Codes
http://www.iso.org/iso/en/prods-services/popstds/currencycodeslist.html

(Note that as of this point, there are significant problems with this list. The supplemental data file contains the best compendium of currency information available.)

[ISO15924] ISO Script Codes
 http://www.evertype.com/standards/iso15924/
Older version with Actual List:
http://www.evertype.com/standards/iso15924/document/dis15924.pdf
[RFC3066] IETF Language Codes
http://www.ietf.org/rfc/rfc3066.txt
Registered Exception List (those not of the form language + region)
http://www.evertype.com/standards/iso639/iana-lang-assignments.html
[RFC3066bis] While RFC 3066bis was approved in 2005, at time of publication it has not yet been published or given a number. The following is the latest editor's draft:
http://inter-locale.com/ID/draft-ietf-ltru-registry-14.html

However, the registry is functioning, and located at:
http://www.iana.org/assignments/language-subtag-registry

[UNM49] UN M.49: UN Statistics Division

Country or area & region codes
http://unstats.un.org/unsd/methods/m49/m49.htm

Composition of macro geographical (continental) regions, geographical sub-regions, and selected economic and other groupings
http://unstats.un.org/unsd/methods/m49/m49regin.htm

[XML Schema] W3C XML Schema
http://www.w3.org/XML/Schema
General

The following are general references from the text:
[BIDI] UAX #9: The Bidirectional Algorithm
http://unicode.org/reports/tr9/
[ByType] CLDR Comparison Charts
http://www.unicode.org/cldr/comparison_charts.html
[Calendars] Calendrical Calculations: The Millennium Edition by Edward M. Reingold, Nachum Dershowitz; Cambridge University Press; Book and CD-ROM edition (July 1, 2001); ISBN: 0521777526. Note that the algorithms given in this book are copyrighted.
[CharMapML] UTR #22: Character Mapping Tables
http://unicode.org/reports/tr22/
[Comparisons] Comparisons between locale data from different sources
http://unicode.org/cldr/data/diff/
[CurrencyInfo] UNECE Currency Data
http://www.unece.org/etrades/unedocs/repository/codelists/xml/CurrencyCodeList.xml
[DataFormats] CLDR Data Formats
http://unicode.org/cldr/data_formats.html
[Example] A sample in Locale Data Markup Language
http://unicode.org/cldr/dtd/1.1/ldml-example.xml
[ICUCollation] ICU rule syntax:
http://icu.sourceforge.net/userguide/Collate_Customization.html
[ICUTransforms] Transforms
http://icu.sourceforge.net/userguide/Transformations.html
Transforms Demo
http://www.ibm.com/software/globalization/icu/demo/transform
[ICUUnicodeSet] ICU UnicodeSet
http://icu.sourceforge.net/userguide/unicodeSet.html
API:
http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html
[LocaleExplorer] ICU Locale Explorer
http://www.ibm.com/software/globalization/icu/demo/locales
[LocaleProject] Common Locale Data Repository Project
http://unicode.org/cldr/
[NamingGuideline] OpenI18N Locale Naming Guideline
http://www.li18nux.org/docs/text/LocNameGuide-V10.txt
[RBNF] Rule-Based Number Format
http://icu.sourceforge.net/apiref/icu4c/classRuleBasedNumberFormat.html#_details
[RBBI] Rule-Based Break Iterator
http://icu.sourceforge.net/userguide/boundaryAnalysis.html
[Scripts] UAX #24: Script Names
http://unicode.org/reports/tr24/
[UCAChart] Collation Chart
http://unicode.org/charts/collation/
[UAX14] UAX #14: Line Breaking Properties
http://www.unicode.org/reports/tr14/
[UAX24] UAX #24: Script Names
http://www.unicode.org/reports/tr24/
[UAX29] UAX #29: Text Boundaries
http://www.unicode.org/reports/tr29/
[UAX31] UAX #31: Identifier and Pattern Syntax
http://www.unicode.org/reports/tr31/
[URegex] UTR #18: Unicode Regular Expression Guidelines
http://unicode.org/reports/tr18/
[UTR36] UTR #36: Unicode Security Considerations
http://www.unicode.org/reports/tr36/
[UTCInfo] NIST Time and Frequency Division Home Page
http://www.boulder.nist.gov/timefreq/
U.S. Naval Observatory: What is Universal Time?
http://aa.usno.navy.mil/AA/faq/docs/UT.html
[WindowsCulture] Windows Culture Info (with  mappings from [RFC3066]-style codes to LCIDs)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfSystemGlobalizationCultureInfoClassTopic.asp

Acknowledgments

Special thanks to the following people for their continuing overall contributions to the CLDR project, and for their specific contributions in the following areas:

Other contributors to CLDR are listed on the CLDR Project Page.

Modifications

The following summarizes modifications from the previous revision of this document.

Revision 7

Revision 6

Revision 5