L2/05-020

Reasons for Enhancing RFC 3066

RFC 3066 and its predecessor, RFC 1766, define language tags for use on the Internet. Language tags are necessary for many applications, ranging from cataloging content to computer processing of text. The RFC 3066 standard for language tags has been widely adopted in various protocols and text formats, including HTML, XML, and CLDR, as the best means of identifying languages and language preferences.

This specification proposes enhancements to RFC 3066. Because revisions to RFC 3066 have such broad implications, it is important to understand the reasons for modifying the structure of language tags and the design implications of the proposed replacement.

Problems

This specification, the proposed successor to RFC 3066, addresses a number of issues that implementers of language tags have faced in recent years:

The stability, accessibility, and ambiguity issues are crucial. Currently, because of changes in underlying ISO standards, a valid RFC 3066 language tag may become invalid (or have its meaning change) at a later date. With much of the world's computing infrastructure dependent on language tags, this is simply unacceptable: it invalidates content that may have an extensive shelf-life. In this specification, once a language tag is valid, it remains valid forever.

RFC 3066 Language Tags: A brief survey

Tags defined by RFC 3066 take two forms. Most tags are formed using an ISO 639-1 (two-letter) or ISO 639-2 (three-letter) language code, optionally followed by an ISO 3166 country code. Tags formed in this manner are not individually registered, and anyone can use such a combination of codes to identify their language preferences or the language of some piece of content. Because this system allows a broad range of tags to be formed by reference to the underlying standards, these tags are referred to as generative in nature. The generative system is very powerful and allows content authors and others to form and use very expressive tags without the need to engage in a long and arduous registration process. Examples of such tags are:

While it is possible to generate tags that do not identify any likely real-world content, such as Aleut as used in Belgium, tags of this nature do not represent a serious problem. Consider the case of a database that can identify people by national origin and by hair color. It is not a problem that one could compose a query for blond Mongolians, even though no results would ever be returned.

There are problems with the RFC 3066 definition of generative tags, however. The ISO 639 and ISO 3166 standards are not freely available and evolve over time. For example, ISO 3166 has withdrawn codes in the past and, worse, then reassigned them to a different country altogether. As a result, it is difficult for implementers to obtain a correct list of codes and then ensure interoperability with other implementations of language tags.

The other way to form an RFC 3066 tag is via registration with IANA. Tags registered with IANA identify a specific language, dialect or variation. Unlike the generative tags, the registered values cannot be combined with other standard subtags to form additional tags that are more descriptive. Examples of such tags are:

Registration, besides being a long and arduous process, also presents a variety of problems for implementers. Although the tags are freely available, most implementations do not support these tags because they do not fit neatly into the generative system. Special logic is required to handle them, especially when performing language negotiation or fallback. In addition, many of the tags are deprecated because the registration process is less opaque and less time-consuming than registering a language with the ISO 639 MA/RA has historically been. Eventually ISO 639 does catch up and assign the language a code, resulting in overlapping tag choices. Implementations must also deal with the implications of multiple valid tags identifying what is essentially the same content.

But most problematic is the lack of a relationship to the generative mechanism. Since each variation of a tag must be separately registered, language variations with a broad range of valid uses require an enormous number of registrations. For example, there are 8 registrations to deal with minor spelling reforms in the German language and these registrations cover just three countries where German is commonly spoken--and no countries where it is not the major language. Variations in languages with a broader diffusion (such as Chinese) may require 20 or more registrations to gain full coverage, sometimes of important distinctions.

Solving the Problems

This specification addresses each of these issues with a simple, elegant design that is compatible with existing language tags and implementations.

This compatibility exists on several levels. All language tags, both generative and registered, that were valid under RFC 3066 are still valid under this specification. In addition, and very importantly, language tags that are newly defined by this specification are compatible with the ABNF syntax, matching, parsing, and other mechanisms defined by RFC 3066.

Thus for an implementation of RFC 3066, all of the new tags defined by this specification are still in the form of valid registered tags, and will simply be dealt with in whatever fashion the implementation used to handle future registrations, those that were added to the registry after the implementation was created. In other words, tags formed under this specification that are unfamiliar to RFC 3066 implementations will be treated by those implementations as if they were registered tags from a future version of the 3066 registry.
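The syntactic compatibility described above can be checked mechanically. The following is a minimal sketch, assuming the RFC 3066 grammar of a primary subtag of one to eight letters followed by any number of hyphen-separated subtags of one to eight letters or digits. The function name is_rfc3066_syntax is hypothetical and chosen for illustration only.

```python
import re

# RFC 3066 grammar: a primary subtag of 1-8 letters, then any number of
# hyphen-separated subtags of 1-8 letters or digits.
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_rfc3066_syntax(tag: str) -> bool:
    """Return True if the tag matches the RFC 3066 syntax (no registry check)."""
    return RFC3066_TAG.fullmatch(tag) is not None


# Tags that use the new script and private use subtags still fit the old
# grammar, so an RFC 3066 parser accepts them as well-formed.
for tag in ["en-US", "zh-Hans", "zh-Hans-CN", "sl-Latn-IT-x-xyzzy"]:
    print(tag, is_rfc3066_syntax(tag))
```

Each of the four sample tags passes the check, which is the sense in which an older implementation can treat them as unfamiliar registered tags.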

Subtags and the Registry

The largest change in the specification is that it modifies the structure of the language tag registry. Instead of having to obtain lists of codes from five separate external standards (not all of which are easily available), the IANA registry will maintain a comprehensive list of valid subtags that can be used in the generative mechanism in a machine-parseable text format. This registry will continue to track the existing core standards and will start with the current list of valid codes. As future codes are assigned, the IANA registry will be updated to reflect the changes.

Having a separate registry allows IANA language tags to resolve ambiguity and stability problems with the underlying standards. Language tags formed today will be guaranteed to maintain their validity and meaning essentially forever, something that is not true today.

In addition, switching to a subtag registry changes the nature of registrations themselves. Instead of registering complete tags and therefore potentially having to register a very large number of them (complicating life for implementers and discouraging support for the registry), a single subtag can be generatively combined to form many useful tags.

For example, one registered tag today is 'zh-Hans', which represents "Chinese written in the Simplified Chinese script". Only this tag is valid under RFC 3066. Useful tags such as 'zh-Hans-SG' (SG = Singapore) or 'zh-Hans-CN' are not valid. By switching to a registry in which 'Hans' is a registered subtag, any of these useful tags can be formed generatively and is valid.
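To make the effect of a single registered subtag concrete, here is a small sketch that combines subtags generatively. The function generate_tags and the language, script, region ordering are illustrative assumptions; the subtag values come from the example above.

```python
from itertools import product


def generate_tags(languages, scripts, regions):
    """Combine subtags generatively as language[-script][-region]."""
    tags = []
    # None stands for "subtag omitted", so shorter tags are produced too.
    for lang, script, region in product(languages, [None, *scripts], [None, *regions]):
        parts = [lang] + [p for p in (script, region) if p]
        tags.append("-".join(parts))
    return tags


# One script subtag ('Hans') yields several useful tags with no further
# registrations.
print(generate_tags(["zh"], ["Hans"], ["CN", "SG"]))
# ['zh', 'zh-CN', 'zh-SG', 'zh-Hans', 'zh-Hans-CN', 'zh-Hans-SG']
```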

In addition, the subtag registry will encourage implementers to support registered items, since the subtags will fit the generative mechanism and exception handling code will no longer be necessary.

To prevent the IANA language registry filling up with deprecated entries, rules have also been introduced to curb harmful registrations that should be handled by the various ISO maintenance and registration authorities (such as ISO 639).

The new structure and registry allow implementations to determine much more about tags, even in the absence of registry information. This is important because at any given point in time there will be a mixture of implementations that have different snapshots of the registry. The new structure allows these implementations to interoperate effectively. In particular, the category of every subtag (language, region, script, etc.) can be determined without reference to the particular registry snapshot held by the implementation. This allows for much more robust implementations, and greater compatibility over time.
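A sketch of how a subtag's category might be derived from its shape alone follows. The shape rules used here (four letters for a script, two letters or three digits for a region, three letters after the language for an extlang, a single character introducing private use or an extension) are assumptions based on the syntax later published as RFC 4646; they are not spelled out in this text. The sketch assumes a well-formed tag with an ordinary primary language subtag and does not enforce subtag ordering.

```python
def classify_subtags(tag: str) -> list[tuple[str, str]]:
    """Label each subtag by category using only its shape and position."""
    subtags = tag.split("-")
    labels = [(subtags[0], "language")]  # assumes an ordinary primary subtag
    in_private_use = False
    in_extension = False
    for subtag in subtags[1:]:
        if in_private_use:
            kind = "private use"
        elif len(subtag) == 1:
            # A singleton introduces private use ('x') or an extension.
            in_private_use = subtag.lower() == "x"
            in_extension = not in_private_use
            kind = "private use" if in_private_use else "extension"
        elif in_extension:
            kind = "extension"
        elif len(subtag) == 4 and subtag.isalpha():
            kind = "script"
        elif (len(subtag) == 2 and subtag.isalpha()) or (len(subtag) == 3 and subtag.isdigit()):
            kind = "region"
        elif len(subtag) == 3 and subtag.isalpha():
            kind = "extlang"
        else:
            kind = "variant"
        labels.append((subtag, kind))
    return labels


print(classify_subtags("zh-Hans-SG"))
# [('zh', 'language'), ('Hans', 'script'), ('SG', 'region')]
```

No registry lookup appears anywhere in the function, which is the point: two implementations with different registry snapshots would still agree on these categories.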

This specification also makes it possible, for the first time, to effectively test whether an implementation conforms to the specification. The problem with RFC 3066 is that to determine the status of an implementation produced at a given point, one has to reconstruct the historical contents of each of the ISO standards and the historical contents of the registry. This is a time-consuming and error-prone process. The new registry provides a complete, easily parseable file that gives the precise set of valid tags for any point in time.

Additional Subtag Sources

This specification introduces two additional international standards as sources for language tags.

ISO 15924 represents script codes. (The example above of 'Hans' is from ISO 15924.) Writing system variations are often crucial to communicate, especially when selecting content using language negotiation. Addition of this standard will allow these distinctions to be formed generatively, rather than via individual registration.

UN M.49 represents region and country codes. The UN M.49 standard is used by ISO 3166 to determine what a country is. The UN M.49 codes are used by this specification in two ways. First, if ISO 3166 reassigns a country code formerly associated with one country to another country (as it did in 2003 with the 'CS' code, formerly Czechoslovakia and now assigned to Serbia and Montenegro), then the UN M.49 code can be placed in the registry to preserve stability. Second, the UN M.49 standard defines regional codes for areas such as "Central and South America" which can be useful in forming language tags for larger regions.

Future-Proofing: Private Use and Extensions

Because of the widespread use of language tags, it is potentially disruptive to have periodic revisions of the core specification, despite demonstrated need. This specification addresses this problem by fully specifying the valid syntax of language tags, while providing for future, unforeseen requirements. One of these mechanisms is the extlang subtag, which allows for future extensions of ISO 639, in particular ISO 639-3.

Private use subtags are another of these mechanisms. In RFC 3066, any tag that was not registered or wholly made up of generative subtags had to be tagged entirely as private use. Recipients of such a tag are not allowed to infer any information from it, except by private agreement. Thus if any private-use information needed to be included in the tag, the entire tag had to be private use, making the entire tag uninterpretable to other implementations.

This specification allows for private use subtags in a particular, prescribed manner. Consider the IANA registered tag 'sl-nedis', which represents the Natisone dialect of Slovenian. The subtag 'sl' is a valid ISO 639-1 code for Slovenian. Prior to its registration with IANA, if users wished to tag content as being in the Natisone dialect, they had two choices for language tags: 'sl' and 'x-sl-nedis' (or similar). The first tag does not meet the need of distinguishing the text from other varieties of Slovenian, while the second one does not convey the relationship to Slovenian to outside processors (a human might look at the tag and infer Slovenian, but the 'sl' subtag doesn't necessarily represent that language).

Under this specification, if a new dialect of Slovenian were needed (let's call it the 'xyzzy' dialect), a tag such as 'sl-x-xyzzy' can be used. In fact, a quite comprehensive amount of information can be communicated: 'sl-Latn-IT-x-xyzzy' would represent Slovenian written using the Latin script as used in Italy with some additional private distinguishing information (which implementations of this specification can match algorithmically).
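The difference between the two private use styles can be sketched as a split at the 'x' singleton. The helper split_private_use is hypothetical; it assumes a well-formed tag and returns the publicly interpretable prefix together with the private subtags.

```python
def split_private_use(tag: str) -> tuple[str, list[str]]:
    """Split a tag into (interpretable prefix, private use subtags)."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if lowered[0] == "x":
        # RFC 3066 style: the whole tag is private, nothing may be inferred.
        return "", subtags[1:]
    if "x" in lowered[1:]:
        i = lowered.index("x", 1)
        return "-".join(subtags[:i]), subtags[i + 1:]
    return tag, []


print(split_private_use("x-sl-nedis"))          # ('', ['sl', 'nedis'])
print(split_private_use("sl-Latn-IT-x-xyzzy"))  # ('sl-Latn-IT', ['xyzzy'])
```

In the second case an outside processor can still act on 'sl-Latn-IT' while ignoring the private portion, which the first form does not permit.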

Note that RFC 3066 private use tags are still permitted and have the same information content and treatment as they did previously.

The extension mechanism also provides a way for independent RFCs to define extensions to language tags. These extensions have a very constrained, well-defined structure to prevent extensions from interfering with implementations of this specification (or RFC 3066).

Matching and Language Negotiation

Content tagging is only one of the applications for language tags. The other major applications are querying for matches and content negotiation. RFC 3066 defines "language ranges" for use in content negotiation and querying and describes a very simple matching algorithm. This specification maintains compatibility with this language negotiation scheme, while providing additional information on the implementation of language matching.
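As an illustration of the simple matching rule, the sketch below follows the RFC 3066 definition: a language range matches a tag if it equals the tag, or if it equals a prefix of the tag that is immediately followed by a hyphen, and the range "*" matches any tag. The function name range_matches is an illustrative choice, and comparison is done case-insensitively.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Return True if the language range matches the tag by prefix."""
    if language_range == "*":
        return True
    r, t = language_range.lower(), tag.lower()
    # The prefix must end on a subtag boundary, hence the trailing hyphen.
    return t == r or t.startswith(r + "-")


print(range_matches("zh", "zh-Hans-CN"))   # True
print(range_matches("zh-Han", "zh-Hans"))  # False: not on a subtag boundary
```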

Well-Formed vs. Validating

Existing language tag processors already fall into two categories. There are language tag processors that check if language tags have the proper, well-formed, syntax, but which do not validate their content, and there are language tag processors that in addition validate and reject unrecognized tags. Each of these categories is appropriate to different implementations. For example, to process incoming tags that may have been formed under a future registry, an implementation may restrict itself to only checking well-formedness. Another implementation that allows users to generate tags may fully validate.

This specification clearly distinguishes these two possible classes of conformance, and provides an explicit, testable definition of each one.
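The two classes of processor can be contrasted in a short sketch. The well-formedness check below uses the RFC 3066 syntax; the validating check additionally consults a registry snapshot. KNOWN_SUBTAGS is a hypothetical, deliberately tiny snapshot used only for illustration, and the validation rule (every subtag before any private use section must be known) is a simplifying assumption.

```python
import re

WELL_FORMED = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Hypothetical registry snapshot, stored in lowercase.
KNOWN_SUBTAGS = {"sl", "zh", "latn", "hans", "it", "cn", "sg"}


def is_well_formed(tag: str) -> bool:
    """Syntax check only; unknown subtags are accepted."""
    return WELL_FORMED.fullmatch(tag) is not None


def is_valid(tag: str, known=KNOWN_SUBTAGS) -> bool:
    """Syntax check plus a lookup of each subtag in the snapshot."""
    if not is_well_formed(tag):
        return False
    for subtag in tag.lower().split("-"):
        if subtag == "x":
            break  # private use subtags follow and are not looked up
        if subtag not in known:
            return False
    return True


# A tag from a newer registry is well-formed but fails validation here.
print(is_well_formed("zh-Hant-TW"), is_valid("zh-Hant-TW"))  # True False
print(is_well_formed("zh-Hans-SG"), is_valid("zh-Hans-SG"))  # True True
```

A processor of incoming content would typically stop at is_well_formed, while a tool that lets users compose tags might require is_valid.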

Impact of the New Design on Existing Implementations

One concern that is crucial to acceptance of the new language tag design is how it works with existing implementations of RFC 3066 and how existing implementations will interact with implementations of the newer language tags.

It is important to recognize that all language tags that were valid under the existing RFC 3066 will remain valid, with their meanings intact, under this specification. In fact, this specification stabilizes these meanings so that existing implementations can be carried forward for as long as is necessary. Content, regardless of its format, will remain valid, essentially forever.

As content and systems begin to make use of the new language tags by adopting the additional fields defined by this specification, there will be an impact on software and systems that expect only the older tags. The design of this specification was carefully created so that all of the new values that can be assigned fit the pattern for registered language tags under RFC 3066. Thus while existing implementations will not recognize the meaning in the tags, they will be able to process them as if they were unrecognized-but-well-formed registered tags.

In addition, although this specification acknowledges the possibility of alternate or advanced matching and negotiation strategies, it maintains the existing matching algorithm (by removing subtags from the right side of a language tag until a match is obtained), simply providing more detail on usage.
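The truncation strategy mentioned above can be sketched as follows. The function lookup and its arguments are hypothetical; it removes subtags from the right of the requested tag until one of the available tags matches, comparing case-insensitively. It makes no special provision for a truncated tag that ends in a singleton such as 'x'.

```python
def lookup(requested: str, available: list[str]):
    """Return the best available tag by progressive truncation, or None."""
    by_lowercase = {tag.lower(): tag for tag in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lowercase:
            return by_lowercase[candidate]
        subtags.pop()  # drop the rightmost subtag and try again
    return None


print(lookup("zh-Hans-SG", ["en", "zh", "zh-Hans"]))  # zh-Hans
print(lookup("sl-Latn-IT-x-xyzzy", ["sl"]))           # sl
```

Because the longest candidate is tried first, the most specific available tag is preferred over a shorter one.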

Summary

The authors of this specification have worked for the past year with a wide range of experts in the language tagging community to build consensus on a design for language tags that meets the needs and requirements of the user community. Language tags form a basic building block for natural language support in computer systems and content. The revision proposed in this specification addresses the needs of this community of users with a minimal impact on existing content and implementations, while providing a stable basis for future development, expansion, and improvement.