[Unicode]  Technical Reports
 

Unicode Technical Standard #35

Locale Data Markup Language (LDML)

Version 1.3R1
Authors Mark Davis
Date 2005-06-02
This Version http://unicode.org/reports/tr35/tr35-5.html
Previous Version http://unicode.org/reports/tr35/tr35-4.html
Latest Version http://unicode.org/reports/tr35/
Corrigenda http://unicode.org/cldr/corrigenda.html
Latest Working Draft http://unicode.org/cldr/data/docs/web/tr35.html
Namespace: http://unicode.org/cldr/
DTDs: http://unicode.org/cldr/dtd/1.3/ldml.dtd
http://unicode.org/cldr/dtd/1.3/ldmlSupplemental.dtd
Revision 5


Summary

This document describes an XML format (vocabulary) for the exchange of structured locale data.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. Each UTS specifies a base version of the Unicode Standard. Conformance to the UTS requires conformance to that version or higher.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For possible errata for this document, see [Errata].

Contents

1. Introduction

Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. But there remain differences in the locale data used by different systems.

Common, recommended practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.

But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous; all within acceptable limits for human beings, but resulting in different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation caused not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [Comparisons].)

There are a number of steps that can be taken to improve the situation. The first is to provide an XML format for locale data interchange. This provides a common format for systems to interchange data so that they can get the same results. The second is to gather up locale data from different systems, and compare that data to find any differences. The third is to provide an online repository for such data. The fourth is to have an open process for reconciling differences between the locale data used on different systems and validating the data, to come up with a useful, common, consistent base of locale data.

Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.

This document describes one of those pieces, an XML format for the communication of locale data. With it, for example, collation rules can be exchanged, allowing two implementations to exchange a specification of collation. Using the same specification, the two implementations will achieve the same results in comparing strings.

For more information, see the Common Locale Data Repository project page [LocaleProject].

2. What is a locale?

Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use the data, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.

The first issue is basic: what is a locale? In this document, a locale is an id that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for timezones, languages, countries, and scripts. They can also include text boundaries (character, word, line, and sentence), text transformations (including transliterations), and support for other services.

Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically set to override particular items, such as setting the date format for 2002.03.15, or using metric vs. Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's timezone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, etc.), music preference, religion, party affiliation, favorite charity, etc.

Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go: committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.

In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, etc.). The format in this document does not attempt to collect together all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, are rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data. For more information, see Appendix D: Language and Locale IDs.

We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a resource, and a tag indicating the key of resource is called a resource tag.

3. Locale IDs

A CLDR locale id consists of the following format:

locale_id := base_locale_id options?

base_locale_id := language_code ("_" script_code)? ("_" territory_code)? ("_" variant_code)?

options := "@" key "=" type ("," key "=" type )*

As usual, x? means that x is optional; x* means that x occurs zero or more times.

Note: The successor to RFC 3066 is being currently developed. Once that standard has been approved, the goal is to update this locale id definition to correspond to that. This would be a correspondence, not necessarily precisely the same syntax.

The latest draft text and registry for that successor is at [RFC3066bis].

The field values are given in the following table. All field values are case-insensitive, except for the type, which is case-sensitive. However, customarily the language code is lowercase, the territory and variant codes are uppercase, the script code is titlecase (that is, first character uppercase and other characters lowercase), and variants are uppercase. This convention is used in the file names, which may be case-sensitive depending on the operating system. Customarily the currency IDs are uppercase and timezone IDs are titlecase by field (as defined in the timezone database); other key and type codes are lowercase. The type may also be referred to as a key-value, for clarity.

Locale Field Definitions
Field Allowable Characters Allowable values
language_code ASCII letters [ISO639] 2-letter codes where they exist; otherwise 3-letter codes (the mapping between 2-letter codes and 3-letter codes is not part of this format.), or [RFC3066] codes that do not contain script / territory codes.
script_code ASCII letters [ISO15924] 4-letter codes. In most cases the script is not necessary, since the language is only customarily written in a single script. Examples of usage are:
az-Arab Azerbaijani in Arabic script
az-Cyrl Azerbaijani in Cyrillic script
az-Latn Azerbaijani in Latin script
zh-Hans Chinese, in simplified script
zh-Hant Chinese, in traditional script
territory_code ASCII letters [ISO3166] 2-letter codes. Also known as a country_code, although the territories may not be countries. The territory code can also be any of the numeric UN M.49 region codes, excluding Selected economic and other groupings. The data for the area codes is found at [UNM49].
variant_code ASCII letters Values used in CLDR are listed below. For information on the process for adding new standard variants or element/type pairs, see [LocaleProject].
key ASCII letters and digits
type ASCII letters, digits, and "-"

Examples:

en
fr_BE
de_DE@collation=phonebook,currency=DDM

The locale id format generally follows the description in the OpenI18N Locale Naming Guideline [NamingGuideline], with some enhancements. The main differences from the those guidelines are that the locale id:

  1. does not include a charset (since the data in LDML format always provides a representation of all Unicode characters. The repository is stored in UTF-8, although that can be transcoded to other encodings as well.),
  2. adds the ability to have a variant, as in Java
  3. adds the ability to discriminate the written language by script (or script variant).
  4. is a superset of [RFC3066] codes.

Note: The language + script + territory code combination can itself be considered simply a language code: For more information, see Appendix D: Language and Locale IDs.

A locale that only has a language code (and possibly a script code) is called a language locale; one with both language and territory codes is called a territory locale (or country locale).

The variant codes specify particular variants of the locale, typically with special options. They cannot overlap with script or territory codes, so they must have either one letter or have more than 4 letters. The currently defined variants include:

Variant Definitions
variant Description
<RFC 3066 variants> As defined in RFC 3066
BOKMAL Bokmål, variant of Norwegian (deprecated: use nb)
NYNORSK Nynorsk, variant of Norwegian (deprecated: use nn)
AALAND Åland, variant of Swedish used in Finland (deprecated: use AX)
POSIX A POSIX-style invariant locale.
REVISED For revised orthography
SAAHO The Saaho variant of Afar

Note: The first two of the above variants are for backwards compatibility. Typically the entire contents of these are defined by an <alias> element pointing at nb_NO (Norwegian Bokmål) and nn_NO(Norwegian Nynorsk) locale IDs. See also Appendix K: Valid Attribute Values.

The currently defined optional key/type combinations include the following. Additional type values are defined in the detail sections of this document or in Appendix K: Valid Attribute Values.

Key/Type Definitions
key type Description
collation phonebook For a phonebook-style ordering (used in German).
pinyin Pinyin ordering for Latin and for CJK characters (that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin)
traditional For a traditional-style sort (as in Spanish)
stroke Pinyin ordering for Latin, stroke order for CJK characters
direct Hindi variant
posix A "C"-based locale.
big5han Pinyin ordering for Latin, big5 charset ordering for CJK characters.
gb2312han Pinyin ordering for Latin, gb2312han charset ordering for CJK characters.
calendar* gregorian (default)
islamic

alias: arabic

Astronomical Arabic
chinese Traditional Chinese calendar
islamic-civil

alias: civil-arabic

Civil (algorithmic) Arabic calendar
hebrew Traditional Hebrew Calendar
japanese Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor)
buddhist

alias: thai-buddhist

Thai Buddhist Calendar (same as Gregorian except for the year)
persian Persian Calendar
coptic Coptic Calendar
*For information on the calendar algorithms associated with the data used with these types, see [Calendars].
currency ISO 4217 code Currency value identified by ISO code, plus others in common use. See Appendix K: Valid Attribute Values and also [Data Formats]
timezone Olson ID Identification for timezone according to the Olson Database ID. See [Data Formats].

For more information on the allowed attribute values, see the specific elements below, and Appendix K: Valid Attribute Values.

Note: There is no standard system (or rather, there are many standard systems) for defining locale ID syntax. This definition of Locale IDs may not match the locale ID syntax used on a particular system, so some process of ID translation may be required.

4. Locale Inheritance

The XML format relies on an inheritance model, whereby the resources are collected into bundles, and the bundles organized into a tree. Data for the many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as root. Wherever possible, the resources in the root are language & territory neutral. For example, the collation order in the root is the UCA (see UAX #10). Since English language collation has the same ordering, the 'en' locale data does not need to supply any collation data, nor does either the 'en_US' or the 'en_IE' locale data.

Given a particular locale id "en_US_someVariant", the search chain for a particular resource is the following.

en_US_someVariant
en_US
en
root

If a type and key are supplied in the locale id, then logically the chain from that id to the root is searched for a resource tag with a given type, all the way up to root. If no resource is found with that tag and type, then the chain is searched again without the type.

Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their data from the language locale: "en" will contain the bulk of the data: "en_US" will only contain a few items like currency. All data that is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less error-prone) maintenance.

Where this inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format for use by that system, by adding all inherited data to each locale data set.

For a more complete description of how inheritance applies to data, and the use of keywords, see Inheritance_and_Validity.

The locale data does not contain general character properties that are derived from the Unicode Character Database [UCD]. That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the following data requires use of UCD data. In addition, POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.

4.1 Multiple Inheritance

In clearly specified instances, resources may inherit from within the same locale. For example, currency format symbols inherit from the number format symbols; the Buddhist calendar inherits from the Gregorian calendar. This only happens where documented in this specification. In these special cases, the inheritance functions as normal, up to the root. If the data is not found along that path, then a second search is made, logically changing the element/attribute to the alternate values.

For example, for the locale "en_US" the month data in <calendar class="buddhist"> inherits first from <calendar class="buddhist"> in "en", then in "root". If not found there, then it inherits from <calendar type="gregorian"> in "en_US", then "en", then in "root".

5 XML Format

The following sections describe the structure of the XML format for locale data. The more precise syntax is in the DTD, listed at the top of this document. To start with, the root element is <ldml>, with the following DTD entry:

<!ELEMENT ldml (identity, (alias |(localeDisplayNames?, layout?, characters?, delimiters?, measurement?, dates?, numbers?, collations?, posix?, special*))) >

That element contains the following elements:

The structure of each of these elements and their contents will be described below. The first few elements have little structure, while dates, numbers, and collations are more involved.

In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information (such as numbers or dates). The reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be significant in translated material.

There are two kinds of elements in LDML: rule elements and structure elements. For structure elements, there are restrictions to allow for effective inheritance and processing:

  1. There is no "mixed" content: if an element has textual content, then it cannot contain any elements.
  2. The XPath leading to the content is unique; no two different pieces of textual content have the same XPath.

Structure elements do not have this restriction, but also do not inherit, except as an entire block. In this version of LDML, the collation <rules> elements is a rules element; all others are structure elements. See also Appendix I: Inheritance and Validity.

Note that the data in examples given below is purely illustrative, and doesn't match any particular language. For a more detailed example of this format, see [Example]. There is also a DTD for this format, but remember that the DTD alone is not sufficient to understand the semantics, the constraints, nor  the interrelationships between the different elements and attributes. You may wish to have copies of each of these to hand as you proceed through the rest of this document.

In particular, all elements allow for draft versions to coexist in the file at the same time. Thus certain elements, which otherwise must only occur once within their parent, are marked in the DTD as allowing multiple instances. These are:

There must be only one instance of these per parent that doesn't have an alternate attribute.

5.1 Common Elements

At any level in any element, two special elements are allowed.

<special xmlns:yyy="xxx">

This element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute, which specifies the XML namespace of the special data. For example, the following used the version 1.0 POSIX special element.

<!DOCTYPE ldml SYSTEM "http://unicode.org/cldr/dtd/1.0/ldml.dtd" [
    <!ENTITY % posix SYSTEM "http://unicode.org/cldr/dtd/1.0/ldmlPOSIX.dtd">
%posix;
]>
<ldml>
...
<special xmlns:posix="http://www.opengroup.org/regproducts/xu.htm">
        <!-- old abbreviations for pre-GUI days -->
        <posix:messages>
            <posix:yesstr>Yes</posix:yesstr>
            <posix:nostr>No</posix:nostr>
            <posix:yesexpr>^[Yy].*</posix:yesexpr>
            <posix:noexpr>^[Nn].*</posix:noexpr>
        </posix:messages>
    </special>
</ldml>

<alias source="<locale_ID>" path="..."/>

The contents of any element can be replaced by an alias, which points to another source for the data. The elements in that source are to be fetched from the corresponding location in the other source. Normal resource searching is to be used; take the following example:

<ldml>
  <collations>
    <collation type="phonebook">
      <alias source="de_DE">
    </collation>
  </collations>
</ldml>

The resource bundle at "de_DE" will be searched for a resource element at the same position in the tree with type "collation". If not found there, then the resource bundle at "de" will be searched, etc. For an example of how this works with inheritance, look at the following table (where green indicates inherited items). Note in particular that an alias "reroutes" the inheritance; nothing in the parent affects the contents of an item with an alias. Thus the red item below is blocked.

Inheritance with Aliases
en en_US Resolved
<x>
  <a>01</a1>
  <b>02</a2>
  <c>03</a3>
</x>
<x>

  <b>12</b>

</x>
<x>
  <a>01</a>
  <b>12</b>
  <c>03</c>
</x>
de de_DE Resolved de_DE_1901 Resolved
<x>
  <a>21</a>
  <b>22</b>
  <c>23</c>
  <d>23</d>
</x>
<x>
  <alias source="en_US">
</x>
<x>
  <a>01</a>
  <b>12</b>
  <c>03</c>
</x>
<x>
  <a>41</a>



</x>
<x>
  <a>41</a>
  <b>12</b>
  <c>03</c>
</x>

If the path attribute is present, then its value is an XPath that points to a different node in the tree. For example:

<alias source="root" path="../monthWidth[@type='wide']"/>

The default value if the path is not present is the same position in the tree. All of the attributes in the XPath must be distinguishing elements. For more details, see Appendix I: Inheritance and Validity.

There is a special value for the source attribute, the constant source="locale", which is the default value. This special value is equivalent to the locale being resolved. For example, consider the following example, where locale data for 'de' is being resolved:

Inheritance with source="locale"
Root de Resolved
<x>
  <a>1</a>
  <b>2</b>
  <c>3</c>
</x>
<x>
 <a>11</a>
 <b>12</b>
 <d>14</d>
</x>
<x>
 <a>11</a>
 <b>12</b>
 <c>3</c>
 <d>14</d>
</x>
<y>
 <alias path="../x">
</y>
<y>
 <b>22</b>
 <e>25</e>
</y>
<y>
 <a>11</a>
 <b>22</b>
 <c>3</c>
 <d>14</d>
 <e>25</e>
</y>

The first row shows the inheritance within the <x> element, whereby <c> is inherited from root. The second shows the inheritance within the <y> element, whereby <a>, <c>, and <d> are inherited also from root, but from an alias there. The alias in root is logically replaced not by the elements in root itself, but by elements in the 'target' locale.

For more details on data resolution, see Appendix I: Inheritance and Validity.

It is an error to have a circular chain of aliases. That is, a collection of LDML XML documents must not have situations where a sequence of alias lookups (including inheritance and multiple inheritance) can be followed indefinitely without terminating.

<displayName>

Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number format, used to format numbers using the conventions of that locale, can have translated name for presentation in GUIs.

  <numberFormat>
    <displayName>Prozentformat</displayName>
...
  <numberFormat>

Where present, the display names must be unique; that is, two distinct code would not get the same display name. Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].

<default type="someID"/>

In some cases, a number of elements are present. The default element can be used to indicate which of them is the default, in the absence of other information. The value of the type attribute is to match the value of the type attribute for the selected item.

<timeFormats>
  <default type="medium" /> 
  <timeFormatLength type="full">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern> 
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="long">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern> 
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="medium">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a</pattern> 
    </timeFormat>
  </timeFormatLength>
...

Like all other elements, the <default> element is inherited. Thus, it can also refer to inherited resources. For example, suppose that the above resources are present in fr, and that in fr_BE we have the following:

<timeFormats>
  <default type="long"/>
</timeFormats>

In that case, the default time format for fr_BE would be the inherited "long" resource from fr. Now suppose that we had in fr_CA:

  <timeFormatLength type="medium">
    <timeFormat type="standard">
      <pattern type="standard">...</pattern> 
    </timeFormat>
  </timeFormatLength>

In this case, the <default> is inherited from fr, and has the value "medium". It thus refers to this new "medium" pattern in this resource bundle.

5.1.1 Escaping Characters

Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, extra syntax is required to represent those code points that cannot be otherwise represented in element content. This also must be used where spaces are significant (otherwise they can be stripped).

Escaping Characters
Code Point XML Example
U+0000 <cp hex="0">

Note: This is not necessary in XML 1.0 — except for NULL (U+0000), which is typically never used. However, for backwards compatibility with XML 1.0 systems it is best for some time to come to use these special escapes. These escapes are only allowed in certain elements, according to the DTD.

5.2 Common Attributes

<... type="stroke" ...>

The attribute type is also used to indicate an alternate resource that can be selected with a matching type=option in the locale id modifiers, or be referenced by a default element. For example:

<ldml>
  ...
  <currencies>
    <currency>...</currency>
    <currency type="preEuro">...</currency>
  </currencies>
</ldml>

<... draft="true" ...>

If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary draft value).

If an element has the attribute draft="true", then the data is not known to be valid. ("Not known to be" because it may actually be valid, but it has not been vetted.). But it is a bit more complicated than that, since draft="true" is inherited by subelements, and sublocales. For a more formal description of how elements are inherited, and what their draft status is, see Inheritance_and_Validity.

<... alt="proposed" ...>

This attribute should only be present if draft="true". It indicates that the data is proposed replacement data that has been added provisionally until the differences between it and the other data can be vetted. For example, suppose that the translation for September for some language is "Settembru", and a bug report is filed that that should be "Settembro". The new data can be entered in, but marked as alt until it is vetted.

...
<month type="9">Settembru</month>
<month type="9" draft="true" alt="proposed">Settembro</month>
<month type="10">...

The allowable values for alt at this time are "proposed" and "variant". This may be expanded in the future.

<... validSubLocales="de_AT de_CH de_DE" ...>

The attribute validSubLocales allows sublocales in a given tree to be treated as though a file for them were present when there isn't one. It can be applied to any element. It only has an effect for locales that inherit from the current file where a file is missing, and the elements wouldn't otherwise be draft.

For a more complete description of how draft applies to data, see Inheritance_and_Validity.

<... standard="..." ...>

Note: this attribute is deprecated. Instead, use a reference element with the attribute standard="true". See Section 5.12 <references>.

The value of this attribute is a list of strings representing standards: international, national, organization, or vendor standards. The presence of this attribute indicates that the data in this element is compliant with the indicated standards. Where possible, for uniqueness, the string should be a URL that represents that standard. The strings are separated by commas; leading or trailing spaces on each string are not significant. Examples:

<collation standard="MSA 200:2002">
...
<dateFormatStyle standard=”http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26780&amp;ICS1=1&amp;ICS2=140&amp;ICS3=30”>

<... references="..." ...>

The value of this attribute is a list of strings representing a reference for the information in the element, including standards that it may conform to. The best format is a series of tokens, where each token corresponds to a reference element. See Section 5.12 <references>. (In older versions of CLDR, the value of the attribute was freeform text. That format is deprecated.)

Example:

<territory type="UM" references="R2">USAs yttre öar</territory>

The reference element may be inherited. Thus for example, even if R2 may be used in sv_SE.xml even though it is not defined there, if it is defined in sv.xml.


5.3 <identity>

<!ELEMENT identity (alias | (version, generation, language, script?, territory?, variant?, special*) ) >

The identity element contains information identifying the target locale for this data, and general information about the version of this data.

<version number="$Revision: 1.95 $">

The version element provides, in an attribute, the version of this file.  The contents of the element can contain textual notes about the changes between this version and the last. For example:

<version number="1.1">Various notes and changes in version 1.1</version>

This is not to be confused with the version attribute on the ldml element, which tracks the dtd version.

<generation date="$Date: 2005/06/30 21:30:12 $" />

The generation element contains the last modified date for the data. The data is in the format illustrated by the example above.

<language type="en"/>

The language code is the primary part of the specification of the locale id, with values as described above.

<script type="Latn" />

The script field may be used in the identification of written languages, with values described above.

<territory type="US"/>

The territory code is a common part of the specification of the locale id, with values as described above.

<variant type="nynorsk"/>

The variant code is the tertiary part of the specification of the locale id, with values as described above.

5.4 <localeDisplayNames>

<!ELEMENT localeDisplayNames (alias | (languages?, scripts?, territories?, variants?, keys?, types?, special*)) >

Display names for scripts, languages, countries, and variants in this locale are supplied by this element. These supply localized names for these items for use in user-interfaces for displaying lists of locales and scripts. Examples are given below.

Where present, the display names must be unique; that is, two distinct code would not get the same display name. Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].

<languages>

This contains a list of elements that provide the user-translated names for language codes from [ISO639], as described in Locale_IDs.

<language type="ab">Abkhazian</language>
<language type="aa">Afar</language>
<language type="af">Afrikaans</language>
<language type="sq">Albanian</language>

The type can actually be any locale ID as specified above. The set of which locale IDs is not fixed, and depends on the locale. For example, in one language one could translate the following locale IDs, and in another, fall back on the normal composition.

type translation composition
nl-BE Flemish Dutch (Belgium)
zh-Hans Simplified Chinese Chinese (Simplified Han)
en-GB British English English (United Kingdom)

Thus when a complete locale ID is formed by composition, the longest match in the language type is used, and the remaining fields (if any) added using composition.

<scripts>

This element can contain an number of script elements. Each script element provides the localized name for a script, given by the value of the type attribute. The script IDs can be either the long or short forms from Scripts.txt in the UCD. (See UAX #24: Script Names [Scripts] for more information.) For example, in the language of this locale, the name for the Latin script might be "Romana", and for the Cyrillic script is "Kyrillica". That would be expressed with the following.

<script type="Latn">Romana</script>
<script type="Cyrl">Kyrillica</script>

<territories>

This contains a list of elements that provide the user-translated names for territory codes from [ISO3166], as described in Locale_IDs.

<territory type="AF">Afghanistan</territory>
<territory type="AL">Albania</territory>
<territory type="DZ">Algeria</territory>
<territory type="AD">Andorra</territory>
<territory type="AO">Angola</territory>
<territory type="US">United States</territory>

The territory code can also be any of the numeric UN M.49 region codes, excluding Selected economic and other groupings. The data for the area codes is found at [UNM49].

<variants>

This contains a list of elements that provide the user-translated names for the variant_code values described in Locale_IDs.

<variant type="nynorsk">Nynorsk</variant>

<keys>

This contains a list of elements that provide the user-translated names for the key values described in Locale_IDs.

<key type="collation">Sortierung</key>

<types>

This contains a list of elements that provide the user-translated names  for the type values described in Locale_IDs. Since the translation of an option name may depend on the key it is used with, the latter is optionally supplied.

<type type="phonebook" key="collation">Telefonbuch</type>

5.5 <layout>

<!ELEMENT layout ( alias | (orientation?, special*) ) >

This top-level element specifies general layout features. It currently only has one possible element (other than <special>, which is always permitted).

<orientation lines="top-to-bottom" characters="left-to-right" />

The lines and characters attributes specify the default general ordering of lines within a page, and characters within a line. The values are:

Orientation Attributes
Vertical top-to-bottom
bottom-to-top
Horizontal left-to-right
right-to-left

If the lines value is one of the vertical attributes, then the characters value must be one of the horizontal attributes, and vice versa. For example, for English the lines are top-to-bottom, and the characters are left-to-right. For Mongolian the lines are right-to-left, and the characters are top to bottom. This does not override the ordering behavior of bidirectional text; it does, however, supply the paragraph direction for that text (for more information, see UAX #9: The Bidirectional Algorithm [BIDI]).

<inList>

The following element controls whether display names (language, territory, etc) are titlecased in GUI menu lists and the like. It is only used in languages where the normal display is lowercase, but titlecase is used in lists.

<inList casing="titlecase-words">

It can be fine-tuned by using alt="list" on any element where titlecasing as defined by the Unicode Standard will produce the wrong value. For example, suppose that "turc de Crimée" is a value, and the titlecase should be "Turc de Crimée". Then that can be expressed using the alt="list" value.

5.6 <characters>

<!ELEMENT characters (alias | (exemplarCharacters*, mapping*, special*)) >

The <characters> element provides optional information about characters that are in common use in the locale, and information that can be helpful in picking resources or data appropriate for the locale, such as when choosing among character encodings that are typically used to transmit data in the language of the locale. It typically only occurs in a language locale, not in a language/territory locale.

<exemplarCharacters>[a-zåæø]</exemplarCharacters>

The exemplar character set contains the commonly used letters for a given modern form of a language, which can be for testing and for determining the appropriate repertoire of letters for charset conversion or collation. ("Letter is interpreted broadly, as in the Unicode General Category, and also includes syllabaries and ideographs.) It is not a complete set of letters used for a language, nor should it be considered to apply to multiple languages in a particular country. Punctuation and other symbols should not be included. In general, the test to see whether or not a letter belongs in the set is based on whether it is acceptable in that language to always use spellings that avoid that character. For example, the exemplar character set for en (English) is the set [a-z]. This set does not contain the accented letters that are sometimes seen in words like "résumé" or "naïve", because it is acceptable in common practice to spell those words without the accents. The exemplar character set for fr (French), on the other hand, must contain those characters: [a-z é è ù ç à â ê î ô û æ œ ë ï ÿ].

The list of characters is in the Unicode Set format, which allows boolean combinations of sets of letters, including those specified by Unicode properties.

Sequences of characters that act like a single letter in the language — especially in collation — are included within braces, such as [a-z á é í ó ú ö ü ő ű {cs} {dz} {dzs} {gy} ...]. The characters should be in normalized form (NFC). Where combining marks are used generatively, and apply to a large number of base characters (such as in Indic scripts), the individual combining marks should be included. Where they are used with only a few base characters, the specific combinations should be included. Wherever there is not a precomposed character (e.g. single codepoint) for a given combination, that must be included within braces. For example, to include sequences from the Where is my Character? page on the Unicode site, one would write: [{ch} {tʰ} {x̣} {ƛ̓} {ą́} {i̇́} {ト゚}], but for French one would just write [a-z é è ù ...]. When in doubt use braces, since it does no harm to included them around single code points: e.g. [a-z {é} {è} {ù} ...].

The exemplar character set for Han characters is composed somewhat differently. It is even harder to draw a clear line for Han characters, since usage is more like a frequency curve that slowly trails off to the right in terms of decreasing frequency. So for this case, the exemplar characters simply contain a set of reasonably frequent characters for the language.

The letters do not necessarily form a complete set (especially for languages using large character sets, such as CJK). Nor does the list necessarily include letters that are used in common foreign words used in that language. This set is case insensitive; this means it only needs lower case characters. The ordering of these characters are irrelevant. For the special case of Turkish and similar languages, the dotted capital I should be included. For more information, see [Data Formats] and the discussion of Special Casing in the Unicode Character Database.

There can be more than two exemplarCharacters elements, with the second having the type "auxiliary". This element can be used for additional characters that are used in common foreign words, dictionaries, etc. used in the locale.

<characters>
    <exemplarCharacters>[a-zñç]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[ä ö ü ß]</exemplarCharacters> 
</characters>

<mapping registry="iana" type="windows-1252"/>

There can be multiple mapping elements. Each indicates the character conversion mapping table name for a character encoding that may be commonly used to encode data in the language of this locale. The version field of the mapping table is omitted. The ordering among the mapping elements is not significant. The mapping elements themselves are not inherited from parents.

The registry indicates the source of the encoding. Currently the only registry that can be used is "iana", which specifies use of  an IANA name.  Note: while IANA names are not precise for conversion (see UTR #22: Character Mapping Tables [CharMapML]), they are sufficient for this purpose.

5.7 <delimiters>

<!ELEMENT delimiters (alias | (quotationStart*, quotationEnd*, alternateQuotationStart*, alternateQuotationEnd*, special*)) >

The delimiters supply common delimiters for bracketing quotations. The quotation marks are used with simple quoted text, such as:

He said, “Don’t be absurd!”

The alternate marks are used with embedded quotations, such as:

He said, “Remember what the Mad Hatter said: ‘Not the same thing a bit! Why you might just as well say that “I see what I eat” is the same thing as “I eat what I see”!’”

<quotationStart></quotationStart>
<quotationEnd></quotationEnd>
<alternateQuotationStart></alternateQuotationStart>
<alternateQuotationEnd></alternateQuotationEnd>

5.8 <measurement>

<!ELEMENT measurement (alias | (measurementSystem?, paperSize?, special*)) >

<measurementSystem type="US"/>

The measurement system is the normal measurement system in common everyday use (except for date/time). The values are "metric" (= ISO 1000), "US", or "UK"; others may be added over time. The "US" value indicates the customary system of measurement with feet, inches, pints, quarts, etc. as used in the United States. The "UK" value indicates the customary system of measurement with feet, inches, pints, quarts, etc. as used in the United Kingdom. It is also called the Imperial system: the pint, quart, etc. are different sizes than in "US".

Note: In the future, we may need to add display names for the particular measurement units (millimeter vs millimetre vs whatever the Greek, Russian, etc are), and a message format for position those with respect to numbers. E.g. "{number} {unitName}" in some languages, but "{unitName} {number}" in others.

Note: Numbers indicating measurements should never be interchanged without known dimensions. You never want the number 3.51 interpreted as 3.51 feet by one user and 3.51 meters by another. However, this element can be used to convert dimensioned numbers into the user's desired notation: so the value of 3.51 meters can be formatted as 11.52 feet on a particular user's system.

<paperSize>

The paperSize element gives the height and width of paper used for normal business letters. The units for the numbers are always in millimeters. For example, the paperSize in the root (the default) is A4:

<paperSize>
  <height>297</height>
  <width>210</width>
</paperSize>

An example of locale data that differs from this would be en-US:

<paperSize>
  <height>279</height>
  <width>216</width>
</paperSize>

5.9 <dates>

<!ELEMENT dates (alias | (localizedPatternChars*, calendars?, timeZoneNames?, special*)) >

This top-level element contains information regarding the format and parsing of dates and times. The data is based on the Java/ICU format. Most of these are fairly self-explanatory, except the week elements, localizedPatternChars, and the meaning of the pattern characters. For information on this, and more information on other elements and attributes, see Appendix F: Date Format Patterns.

5.9.1 <calendars>

<!ELEMENT calendar (alias | (months?, monthNames?, monthAbbr?, days?, dayNames?, dayAbbr?, week?, am?, pm?, eras?, dateFormats?, timeFormats?, dateTimeFormats?, fields*, special*))>

This element contains multiple <calendar> elements, each of which specifies the fields used for formatting and parsing dates and times according to the given calendar. The month names are identified numerically, starting at 1. The day names are identified with short strings, since there is no universally-accepted numeric designation.

Many calendars will only differ from the Gregorian Calendar in the year and era values. For example, the Japanese calendar will have many more eras (one for each Emperor), and the years will be numbered within that era. All calendar data inherits from the Gregorian calendar in the same locale data (if not present in the chain up to root), so only the differing data will be present. See Multiple Inheritance.

<!ELEMENT months ( alias | (default?, monthContext*, special*)) >
<!ELEMENT monthContext ( alias | (default?, monthWidth*, special*)) >
<!ELEMENT monthWidth ( alias | (month*, special*)) >

<!ELEMENT days ( alias | (default?, dayContext*, special*)) >
<!ELEMENT dayContext ( alias | (default?, dayWidth*, special*)) >
<!ELEMENT dayWidth ( alias | (day*, special*)) >

Both month and day names may vary along two axes: the width and the context. The context is either format (the default), the form used within a date format string (such as "Saturday, November 12th", or stand-alone, the form used independently, such as in Calendar headers. The width can be wide (the default), abbreviated, or narrow. The latter is the shortest possible width, no more than one character (or grapheme cluster): it is typically used in calendar headers. The values in formats must be distinct; that is, "S" could not be used both for Saturday and for Sunday. The same is not true for stand-alone values; they might only be distinguished by context, especially in the narrow format.

If the stand-alone form does not exist (in the chain up to root), then it inherits from the format form. See Multiple Inheritance. If the narrow format does not exist, it inherits from the abbreviated form; if the abbreviated format does not exist, it inherits from the wide format.

The older monthNames, dayNames, and monthAbbr, dayAbbr are maintained for backwards compatibility. They are equivalent to: using the months element with the context type="format" and the width type="wide" (for ...Names) and type="narrow" (for ...Abbr), respectively.

Example:

  <calendar type="gregorian">
    <months>
      <default type="format"/>
      <monthContext type="format">
         <default type="wide"/>
         <monthWidth type="wide">
            <month type="1">January</month>
            <month type="2">February</month>
...
            <month type="11">November</month>
            <month type="12">December</month>
        </monthWidth>
        <monthWidth type="abbreviated">
            <month type="1">Jan</month>
            <month type="2">Feb</month>
...
            <month type="11">Nov</month>
            <month type="12">Dec</month>
        </monthWidth>
       <monthContext type="stand-alone">
         <default type="wide"/>
         <monthWidth type="wide">
            <month type="1">Januaria</month>
            <month type="2">Februaria</month>
...
            <month type="11">Novembria</month>
            <month type="12">Decembria</month>
        </monthWidth>
        <monthWidth type="narrow">
            <month type="1">J</month>
            <month type="2">F</month>
...
            <month type="11">N</month>
            <month type="12">D</month>
        </monthWidth>
       </monthContext>
    </months>

    <days>
      <default type="format"/>
      <dayContext type="format">
         <default type="wide"/>
         <dayWidth type="wide">
            <day type="sun">Sunday</day>
            <day type="mon">Monday</day>
...
            <day type="fri">Friday</day>
            <day type="sat">Saturday</day>
        </dayWidth>
        <dayWidth type="abbreviated">
            <day type="sun">Sun</day>
            <day type="mon">Mon</day>
...
            <day type="fri">Fri</day>
            <day type="sat">Sat</day>
        </dayWidth>
        <dayWidth type="narrow">
            <day type="sun">Su</day>
            <day type="mon">M</day>
...
            <day type="fri">F</day>
            <day type="sat">Sa</day>
        </dayWidth>
      </dayContext>
      <dayContext type="stand-alone">
        <dayWidth type="narrow">
            <day type="sun">S</day>
            <day type="mon">M</day>
...
            <day type="fri">F</day>
            <day type="sat">S</day>
        </dayWidth>
      </dayContext>
    </days>
    <week>
        <minDays count="1"/>
        <firstDay day="sun"/>
        <weekendStart day="fri" time="18:00"/>
        <weekendEnd day="sun" time="18:00"/>
    </week>

    <am>AM</am>
    <pm>PM</pm>

    <eras>
       <eraAbbr>
        <era type="0">BC</era>
        <era type="1">AD</era>
       </eraAbbr>
       <eraName>
        <era type="0">Before Christ</era>
        <era type="1">Anno Domini</era>
       </eraName>
    </eras>

<dateFormats>

<!ELEMENT dateFormats (alias | (default?, dateFormatLength*, special*)) >
<!ELEMENT dateFormatLength (alias | (default?, dateFormat*, special*)) >
<!ELEMENT dateFormat (alias | (pattern*, displayName?, special*)) >

Date formats have the following form:

    <dateFormats>
      <default type=”medium”/>
      <dateFormatLength type=”full”>
        <dateFormat>
          <pattern>EEEE, MMMM d, yyyy</pattern>
        </dateFormat>
       </dateFormatLength>
     <dateFormatLength type="medium">
       <default type="DateFormatsKey2">
       <dateFormat type="DateFormatsKey2">
        <pattern>MMM d, yyyy</pattern>
       </dateFormat>
       <dateFormat type="DateFormatsKey3">
         <pattern>MMM dd, yyyy</pattern>
        </dateFormat>
      </dateFormatLength>
    <dateFormats>

<timeFormats>

<!ELEMENT timeFormats (alias | (default?, timeFormatLength*, special*)) >
<!ELEMENT timeFormatLength (alias | (default?, timeFormat*, special*)) >
<!ELEMENT timeFormat (alias | (pattern*, displayName?, special*)) >

Time formats have the following form:

     <timeFormats>
       <default type="medium"/>
       <timeFormatLength type=”full”>
         <timeFormat>
           <displayName>DIN 5008 (EN 28601)</displayName>
           <pattern>h:mm:ss a z</pattern>
         </timeFormat>
       </timeFormatLength>
       <timeFormatLength type="medium">
         <timeFormat>
           <pattern>h:mm:ss a</pattern>
         </timeFormat>
       </timeFormatLength>
     </timeFormats>

     <dateTimeFormats>
       <default type="medium"/>
       <dateTimeFormatLength type=”full”>
         <dateTimeFormat>
            <pattern>{0} {1}</pattern>
         </dateTimeFormat>
       </dateTimeFormatLength>
     </dateTimeFormats>
  </calendar>

  <calendar class="buddhist">
    <eras>
        <era type="0">BE</era>
    </eras>
  </calendar>

<dateTimeFormats>

<!ELEMENT dateTimeFormats (alias | (default?, dateTimeFormatLength*, special*)) >
<!ATTLIST dateTimeFormats draft ( true | false ) #IMPLIED >
<!ATTLIST dateTimeFormats validSubLocales CDATA #IMPLIED >

These formats allow for date and time formats to be composed in various ways. The main elements work like the dateFormats and timeFormats, except that the pattern is of the form "{0} {1}", where {0} is replaced by the date format, and {1} is replaced by the time format.

<week>

<!ELEMENT week (alias | (minDays?, firstDay?, weekendStart?, weekendEnd?, special*))>

The weekendStart time defaults to "00:00:00" (midnight at the start of the day). The weekendEnd time defaults to "24:00:00" (midnight at the end of the day). (That is, Friday at 24:00:00 is the same time as Saturday at 00:00:00.) Thus the following are equivalent:

<weekendStart day="sat"/>
<weekendEnd day="sun"/>
<weekendStart day="sat" time="00:00"/>
<weekendEnd day="sun" time="24:00"/>
<weekendStart day="fri" time="24:00"/>
<weekendEnd day="mon" time="00:00"/>

What is meant by the weekend varies from country to country. It is typically when most non-retail businesses are closed. The time should not be specified unless it is a well-recognized part of the day.

For information on the other fields, see Appendix F: Date Format Patterns.


Calendar Fields

<!ELEMENT fields ( alias | (field*, special*)) >
<!ELEMENT field ( alias | (displayName?, relative*, special*)) >

Translations may be supplied for names of calendar fields (elements of a calendar, such as Day, Month, Year, Hour, etc.), and for relative values for those fields (for example, the day with
relative value -1 is "Yesterday"). Where there is not a convenient, customary word or phrase in a particular language for a relative value, it should be omitted.

Here are examples for English and German. Notice that the German has more fields than the English does.

<calendar>
  <fields>
...
   <field type='day'>
    <displayName>Day</displayName>
    <relative type='-1'>Yesterday</relative>
    <relative type='0'>Today</relative>
    <relative type='1'>Tomorrow</relative>
   </field>
...
  </fields>
</calendars>
<calendar>
  <fields>
...
   <field type='day'>
    <displayName>Tag</displayName>
    <relative type='-2'>Vorgestern</relative>
    <relative type='-1'>Gestern</relative>
    <relative type='0'>Heute</relative>
    <relative type='1'>Morgen</relative>
    <relative type='2'>Übermorgen</relative>
   </field>
...
  </fields>
</calendars>

5.9.2 <timeZoneNames>

<!ELEMENT timeZoneNames (alias | (hourFormat*, hoursFormat*, gmtFormat*, regionFormat*, fallbackFormat*, abbreviationFallback*, preferenceOrdering*, singleCountries*, default*, zone*, special*)) >
<!ELEMENT zone (alias | ( long*, short*, exemplarCity*, special*)) >

The timezone IDs (tzid) are language-independent, and follow the Olson timezone database [Olson]. However, the display names for those IDs can vary by locale. The generic time is so-called wall-time; what clocks use when they are correctly switched from standard to daylight time at the mandated time of the year.

Unfortunately, the canonical tzid's (those in zone.tab) are not stable: may change in each release of the Olson Timezone database. In CLDR, however, stability of identifiers is very important. So the canonical IDs in CLDR are kept stable as described in Appendix L: Canonical Form.

The following is an example of timezone data. Although this is an example of possible data, in most cases only the exemplarCity is needs translation. And that does not even need to be present, if a country only has a single timezone. As always, the type field for each zone is the identification of that zone. It is not to be translated.

<zone type="America/Los_Angeles" >
    <long>
        <generic>Pacific Time</generic>
        <standard>Pacific Standard Time</standard>
        <daylight>Pacific Daylight Time</daylight>
    </long>
    <short>
        <generic>PT</generic>
        <standard>PST</standard>
        <daylight>PDT</daylight>
    </short>
    <exemplarCity>San Francisco</exemplarCity>
</zone>

<zone type="Europe/London">
     <long>
        <generic>British Time</generic>
        <standard>British Standard Time</standard>
        <daylight>British Daylight Time</daylight>
    </long>
    <exemplarCity>York</exemplarCity>
</zone>

Note: Transmitting "14:30" with no other context is incomplete unless it contains information about the time zone. Ideally one would transmit neutral-format date/time information, commonly in UTC, and localize as close to the user as possible. (For more about UTC, see [UTCInfo].)

The conversion from local time into UTC depends on the particular time zone rules, which will vary by location. The standard data used for converting local time (sometimes called wall time) to UTC and back is the Olson Data [Olson], used by UNIX, Java, ICU, and others. The data includes rules for matching the laws for time changes in different countries. For example, for the US it is:

"During the period commencing at 2 o'clock antemeridian on the first Sunday of April of each year and ending at 2 o'clock antemeridian on the last Sunday of October of each year, the standard time of each zone established by sections 261 to 264 of this title, as modified by section 265 of this title, shall be advanced one hour..." (United States Law - 15 U.S.C. §6(IX)(260-7)).

Each region that has a different timezone or daylight savings time rules, either now or at any time back to 1970, is given a unique internal ID, such as Europe/Paris. (Some IDs are also distinguished on the basis of differences before 1970.) As with currency codes, these are internal codes. A localized string associated with these is provided for users (such as in the Windows Control Panels>Date/Time>Time Zone).

Unfortunately, laws change over time, and will continue to change in the future, both for the boundaries of timezone regions and the rules for daylight savings. Thus the Olson data is continually being augmented. Any two implementations using the same version of the Olson data will get the same results for the same IDs (assuming a correct implementation). However, if implementations use different versions of the data they may get different results. So if precise results are required then both the Olson ID and the Olson data version must be transmitted between the different implementations.

For more information, see [Data Formats].

The following subelements of timezoneNames are used to control the fallback process described in Appendix J: Time Zone Fallback.

Element Name Data Examples Results/Comment
hourFormat "+HHmm;-HHmm" "+1200"
"-1200"
hoursFormat "{0}/{1}" "-0800/-0700"
gmtFormat "GMT{0}" "GMT-0800"
"{0}ВпГ" "-0800ВпГ"
regionFormat "{0} Time" "Japan Time"
"Tiempo de {0}" "Tiempo de Japón"
fallbackFormat "Tiempo de «{0}»" "Tiempo de «Tokyo»"
abbreviationFallback type="GMT" causes any "long" match to be skipped in Timezone fallbacks
preferenceOrdering type="America/Mexico_City America/Chihuahua America/New_York" a preference ordering among modern zones
singleCountries list="America/Godthab America/Santiago America/Guayaquil Europe/Madrid Pacific/Auckland  Pacific/Tahiti Europe/Lisbon..." uses country name alone


5.10 <numbers>

<!ELEMENT numbers (alias | (symbols?, decimalFormats?, scientificFormats?, percentFormats?, currencyFormats?, currencies?, special*)) >

The numbers element supplies information for formatting and parsing numbers and currencies. It has the following sub-elements: <symbols>, <decimalFormats>, <scientificFormats>, <percentFormats>, <currencyFormats>, and <currencies>. The currency IDs are from [ISO4217] (plus some additional common-use codes). For more information, including the pattern structure, see Appendix G: Number Pattern Format.

<!ELEMENT symbols (alias | (decimal?, group?, list?, percentSign?, nativeZeroDigit?, patternDigit?, plusSign?, minusSign?, exponential?, perMille?, infinity?, nan?, special*)) >

<symbols>
      <decimal>.</decimal>
      <group>,</group>
      <list>;</list>
      <percentSign>%</percentSign>
      <nativeZeroDigit>0</nativeZeroDigit>
      <patternDigit>#</patternDigit>
      <plusSign>+</plusSign>
      <minusSign>-</minusSign>
      <exponential>E</exponential>
      <perMille></perMille>
      <infinity></infinity>
      <nan></nan>
</symbols>

<!ELEMENT decimalFormats (alias | (default?, decimalFormatLength*, special*))>
<!ELEMENT decimalFormatLength (alias | (default?, decimalFormat*, special*))>
<!ELEMENT decimalFormat (alias | (pattern*, special*)) >
(scientificFormats, percentFormats, and currencyFormats have the same structure)

<decimalFormats>
  <decimalFormatLength type="long">
    <decimalFormat>
      <pattern>#,##0.###</pattern>
    </decimalFormat>
  </decimalFormatLength>
</decimalFormats>
<scientificFormats>
  <default type="long"/>
  <scientificFormatLength type="long">
    <scientificFormat>
      <pattern>0.000###E+00</pattern>
    </scientificFormat>
  </scientificFormatLength>
  <scientificFormatLength type="medium">
    <scientificFormat>
      <pattern>0.00##E+00</pattern>
    </scientificFormat>
  </scientificFormatLength>
</scientificFormats>
<percentFormats>
  <percentFormatLength type="long">
    <percentFormat>
      <pattern>#,##0%</pattern>
    </percentFormat>
  </percentFormatLength>
</percentFormats>
<currencyFormats>
  <currencyFormatLength type="long">
    <currencyFormat>
      <pattern>¤ #,##0.00;(¤ #,##0.00)</pattern>
    </currencyFormat>
  </currencyFormatLength>
</currencyFormats>

<!ELEMENT currency (alias | (displayName?, symbol?, pattern*, decimal?, group?, special*)) >

<currencies>
    <currency type="USD">
        <displayName>Dollar</displayName>
        <symbol>$</symbol>
    </currency>
    <currency type ="JPY">
        <displayName>Yen</displayName>
        <symbol>¥</symbol>
    </currency>
    <currency type ="INR">
        <displayName>Rupee</displayName>
        <symbol choice="true">0≤Rf|1≤Ru|1&lt;Rf</symbol>
    </currency>
    <currency type="PTE">
        <displayName>Escudo</displayName>
        <symbol>$</symbol>
    </currency>
</currencies>

In formatting currencies, the currency number format is used with the appropriate symbol from <currencies>, according to the currency code. The <currencies> list can contain codes that are no longer in current use, such as PTE. The choice attribute can be used to indicate that the value uses a pattern interpreted as in Appendix H: Choice Patterns.

When the currency symbol is substituted into a pattern, there may be some further modifications, according to the following.

<currencySpacing>
  <beforeCurrency>
    <currencyMatch>[:letter:]</currencyMatch>
    <surroundingMatch>[:digit:]</surroundingMatch>
    <insertBetween> </insertBetween>
  </beforeCurrency>
  <afterCurrency>
    <currencyMatch>[:letter:]</currencyMatch>
    <surroundingMatch>[:digit:]</surroundingMatch>
    <insertBetween> </insertBetween>
  </afterCurrency>
</currencySpacing>

This element controls whether additional characters are inserted on the boundary between the symbol and the pattern. For example, in the above, inserting the symbol "US$" into the pattern "#,##0.00¤" would result in an extra space inserted before the symbol, eg "#,##0.00 US$", while inserting into the pattern "¤#,##0.00" would not, eg "US$#,##0.00". That is because the afterCurrency condition matches and the beforeCurrency condition doesn't. For more information on the matching used in the currencyMatch and surroundingMatch elements, see Appendix E: Unicode Sets.

Currencies can also contain optional grouping, decimal data, and pattern elements. This data is inherited from the <symbols> in the same locale data (if not present in the chain up to root), so only the differing data will be present. See Multiple Inheritance.

Note: Currency values should never be interchanged without a known currency code. You never want the number 3.5 interpreted as $3.5 by one user and ¥3.5 by another. Locale data contains localization information for currencies, not a currency value for a country. A currency amount logically consists of a numeric value, plus an accompanying currency code (or equivalent). The currency code may be implicit in a protocol, such as where USD is implicit. But if the raw numeric value is transmitted without any context, then it has no definitive interpretation.

Notice that the currency code is completely independent of the end-user's language or locale. For example, RUR is the code for Russian Rubles. A currency amount of <RUR, 1.23457×10³> would be localized for a Russian user into "1 234,57р." (using U+0440 (р) cyrillic small letter er). For an English user it would be localized into the string "Rub 1,234.57" The end-user's language is needed for doing this last localization step; but that language is completely orthogonal to the currency code needed in the data. After all, the same English user could be working with dozens of currencies.Notice also that the currency code is also independent of whether currency values are inter-converted, which requires more interesting financial processing: the rate of conversion may depend on a variety of factors.

Thus logically speaking, once a currency amount is entered into a system, it should be logically accompanied by a currency code in all processing. This currency code is independent of whatever the user's original locale was. Only in badly-designed software is the currency code (or equivalent) not present, so that the software has to "guess" at the currency code based on the user's locale.

Note: The number of decimal places and the rounding for each currency is not locale-specific data, and is not contained in the Locale Data Markup Language format. Those values override whatever is given in the currency numberFormat. For more information, see Supplemental Data.

For background information on currency names, see [CurrencyInfo].

5.11 <posix>

<!ELEMENT posix (alias | (messages*, special*)) >
<!ELEMENT messages (alias | ( yesstr?, nostr?, yesexpr?, noexpr?)) >

The following are included for compatibility with POSIX.

 <posix>
       <posix:messages>
            <posix:yesstr>Yes</posix:yesstr>
            <posix:nostr>No</posix:nostr>
            <posix:yesexpr>^[Yy].*</posix:yesexpr>
            <posix:noexpr>^[Nn].*</posix:noexpr>
        </posix:messages>
 <posix>

  1. The values for yesstr and nostr contain a colon-separated list of strings that would normally be recognized as "yes" and "no" responses. For cased languages, this shall include only the lowercase version. POSIX locale generation tools must generate the uppercase equivalents.
  2. Values for yesstr and nostr should include the complete word for "yes" or "no", as well any commonly used abbreviations for same. If a Latin script is used to render the language, this is usually the first letter of the word "yes" or "no".
  3. The values for yesexpr and noexpr contain a regular expression that matches only those strings that would be recognized as "yes" and "no" responses, in any case variation. The value of yesexpr and noexpr should match all the values in yesstr and nostr respectively.

So for English, the appropriate strings and expressions would be as follows:

yesstr "yes:y"
nostr "no:n"

yesexpr "^[yY]([eE][sS])?"
This would match y,Y,yes,yeS,yEs,yES,Yes,YeS,YEs,YES.

noexpr "^[nN][oO]?"
This would match n,N,no,nO,No,NO.

5.12 <references>

<!ELEMENT references ( reference* ) >
<!ELEMENT reference ( #PCDATA ) >
<!ATTLIST reference type NMTOKEN #REQUIRED>
<!ATTLIST reference standard ( true | false ) #IMPLIED >
<!ATTLIST reference uri CDATA #IMPLIED >

The references section supplies a central location for specifying references and standards. The uri should be supplied if at all possible. If not online, then a ISBN number should be supplied, such as in the following example:

<reference type="R2" uri="http://www.ur.se/nyhetsjournalistik/3lan.html">Landskoder på Internet</reference>
<reference type="R3" uri="URN:ISBN:ISBN 47-04974-X">Svenska skrivregler</reference>

5.13 <collations>

<!ELEMENT collations (alias | (default?, collation*, special*)) >

This section contains one or more collation elements, distinguished by type. Each collation contains rules that specify a certain sort-order, as a tailoring of the UCA table defined in UTS #10: Unicode Collation Algorithm [UCA]. (For a chart view of the UCA, see Collation Chart [UCAChart].) This syntax is an XMLized version of the Java/ICU syntax. For illustration, the rules are accompanied by the corresponding basic ICU rule syntax [ICUCollation] (used in ICU and Java) and/or the ICU parameterizations, and the basic syntax may be used in examples.

Note: ICU provides a concise format for specifying orderings, based on tailorings to the UCA. For example, to specify that k and q follow 'c', one can use the rule: "& c < k < q". The rules also allow people to set default general parameter values, such as whether uppercase is before lowercase or not. (Java contains an earlier version of ICU, and has not been updated recently. It does not support any of the basic syntax marked with [...], and its default table is not the UCA.)

However, it is not necessary for ICU to be used in the underlying implementation. The features are simply related to the ICU capabilities, since that supplies more detailed examples. Note: there is an on-line demonstration of collation at [LocaleExplorer] (pick the locale and scroll to "Collation Rules").

Version

The version attribute is used in case a specific version of the UCA is to be specified. It is optional, and is specified if the results are to be identical on different systems. If it is not supplied, then the version is assumed to be the same as the Unicode version for the system as a whole.

Note: For version 3.1.1 of the UCA, the version of Unicode must also be specified with any versioning information; an example would be "3.1.1/4.0" for version 3.1.1 of the UCA, for version 3.2 of Unicode. This has been changed by decision of the UTC, so that it will no longer be necessary as of UCA 4.0. So for 4.0 and beyond, the version just has a single number.

5.12.1 <collation>

<!ELEMENT collation (alias | (base?, settings?, suppress_contractions?, optimize?, rules?, special*)) >

Like the ICU rules, the tailoring syntax is designed to be independent of the actual weights used in any particular UCA table. That way the same rules can be applied to UCA versions over time, even if the underlying weights change. The following describes the overall document structure of a collation:

<collation>
 <settings caseLevel="on"/>
 <rules>
  <!-- rules go here -->
 </rules>
</collation>

The optional base element <base>...</base>, contains an alias element that points to another data source that defines a base collation. If present, it indicates that the settings and rules in the collation are modifications applied on top of the respective elements in the base collation. That is, any successive settings, where present, override what is in the base as described in Setting Options. Any successive rules are concatenated to the end of the rules in the base. The results of multiple rules applying to the same characters is covered in Orderings.

Setting Options

In XML, these are attributes of <settings>. For example, <setting strength="secondary"> will only compare strings based on their primary and secondary weights.

If the attribute is not present, the default (or for the base url's attribute, if there is one) is used. The default is listed in italics.

Collation Settings
Attribute Options Basic Example   XML Example Description
strength primary (1)
secondary (2)
tertiary (3)
quarternary (4)
identical (5)
[strength 1] strength = "primary" Sets the default strength for comparison, as described in the UCA.
alternate non-ignorable
shifted
[alternate non-ignorable] alternate = "non-ignorable" Sets alternate handling for variable weights, as described in UCA
backwards on
off
[backwards 2]   backwards = "on" Sets the comparison for the second level to be backwards ("French"), as described in UCA
normalization on
off
[normalization on]  normalization = "off" If on, then the normal UCA algorithm is used. If off, then all strings that are in [FCD] will sort correctly, but others won't. So should only be set off if the the strings to be compared are in FCD.
caseLevel on
off
[caseLevel on] caseLevel = "off" If set to on, a level consisting only of case characteristics will be inserted in front of tertiary level. To ignore accents but take cases into account, set strength to primary and case level to on
caseFirst upper
lower
off
[caseFirst off] caseFirst = "off" If set to upper, causes upper case to sort before lower case. If set to lower, lower case will sort before upper case. Useful for locales that have already supported ordering but require different order of cases. Affects case and tertiary levels.
hiraganaQ on
off
[hiraganaQ on] hiragana­Quarternary = "on" Controls special treatment of Hiragana code points on quaternary level. If turned on, Hiragana codepoints will get lower values than all the other non-variable code points. The strength must be greater or equal than quaternary if you want this attribute to take effect.
numeric on
off
[numeric on] numeric = "on" If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UCD]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123".

 

Collation Rule Syntax

<!ELEMENT rules (alias | ( reset, ( reset | p | pc | s | sc | t | tc | q | qc | i | ic | x)* )) >

The goal for the collation rule syntax is to have clearly expressed rules with a concise format, that parallels the Basic syntax as much as possible.  The rule syntax uses abbreviated element names for primary (level 1), secondary (level 2), tertiary (level 3), and identical, to be as short as possible. The reason for this is because the tailorings for CJK characters are quite large (tens of thousands of elements), and the extra overhead would have been considerable. Other elements and attributes do not occur as frequently, and have longer names.

Note: The rules are stated in terms of actions that cause characters to change their ordering relative to other characters. This is for stability; assigning characters specific weights would not work, since the exact weight assignment in UCA (or ISO 14651) is not required for conformance — only the relative ordering of the weights. In addition, stating rules in terms of relative order is much less sensitive to changes over time in the UCA itself.

Orderings

The following are the normal ordering actions used for the bulk of characters. Each rule contains a string of ordered characters that starts with an anchor point or a reset value. The reset value is an absolute point in the UCA that determines the order of other characters. For example, the rule & a < g, places "g" after "a" in a tailored UCA: the "a" does not change place. Logically, subsequent rule after a reset indicates a change to the ordering (and comparison strength) of the characters in the UCA. For example, the UCA has the following sequence (abbreviated for illustration):

... a <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 á <3 Á <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

Whenever a character is inserted into the UCA sequence, it is inserted at the first point where the strength difference will not disturb the other characters in the UCA. For example, & a < g puts g in the above sequence with a strength of L1. Thus the g must go in after any lower strengths,  as follows:

... a <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 á <3 Á <1 g <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

The rule & a << g, which uses a level-2 strength, would produce the following sequence:

... a <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 g <2 á <3 Á <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

And the rule & a <<< g, which uses a level-3 strength, would produce the following sequence:

... a <3 g <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 á <3 Á <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

Since resets always work on the existing state, the rule entries must be in the proper order. A character or sequence may occur multiple times; each subsequent occurrence causes a different change. The following shows the result of serially applying a three rules.

  Rules   Result Comment  
1 & a < g ... a <1 g ... Put g after a.
2 & a < h < k ... a <1 h <1 k