RFC 3066 tags vs. locales (was RE: Common Locale Data Repository Project)

From: Peter Constable (petercon@microsoft.com)
Date: Mon Apr 26 2004 - 12:25:08 EDT

    Mark:

    I really feel your usage of terminology here is unhelpful -- in very
    practical ways, unhelpful, because it makes it more difficult to get
    people to understand how to implement things in the right way.

    It may be that the application that most interests you is the naming of
    locales, but that does not change the fact that the notions of "locale"
    and "language" are different, and that the primary intent of RFC 1766
    and its successors has always been identification of "languages", as
    the title and introduction to RFC 3066 indicate:

    "Tags for the Identification of Languages"

    "One means of indicating the language used is by labeling the
    information content with an identifier for the language that is used in
    this information content."

    Whether in your broad or narrow sense, a locale is an operational mode
    of a software application or of a software operating environment to
    provide culture-dependent tailoring.

    "Language" in the sense used by RFC 1766/3066 is a
    linguistically-related attribute of content, and a language identifier
    is used to label content to indicate that attribute, or to label
    resources (e.g. spelling checkers) that can appropriately be applied to
    that content. I think that's stated reasonably clearly in RFC 1766/3066.

    One should also refer to RFC 2277, IETF Policy on Character Sets and
    Languages, which clearly distinguishes "language" tags and "locale"
    tags. In the IETF context, which is the context for RFC 1766/3066, those
    documents do *not* provide tags for locales; they provide tags for
    languages.

    > There is, as I have said, a perfectly reasonable, narrow sense of
    > locale which is essentially identical to what is captured by RFC 3066.

    But that does not mean that it's a good thing to refer to RFC 3066 tags
    as locale identifiers.

    > And in practice, RFC 3066 is often used with that meaning. I don't see
    > any need to deny reality (at least not in this area ;-)

    I think you overstate actual practice: For many years, various software
    implementations have used combinations of ISO 639-1 language identifiers
    and ISO 3166 country identifiers joined with an underscore to create
    locale identifiers; e.g. "en_US". It was not until Microsoft's .Net
    Framework that locales ('CultureInfo' in that context) have been named
    using strings that *resemble* RFC 3066 tags -- and it needs to be
    pointed out that the namespace for CultureInfo.Name is not the same as
    the RFC 3066 namespace.

    It may be that you and some others have come to refer to RFC 3066 tags
    as "locale" (in some unspecified sense) identifiers, but that
    terminology certainly is not used by all. Indeed, as mentioned above, it
    is counter to IETF practice as described in RFC 2277.

    My contention is that it's unhelpful to refer to RFC 3066 tags as "locale"
    tags. I have no problem with *using* RFC 3066 to name certain locales,
    or to control the operational mode of software processes in certain
    contexts. But saying that RFC 3066 tags are "locale" tags is decidedly
    unhelpful in getting people to understand what are appropriate
    requirements of implementations. While you may have a conceptualization
    that distinguishes between "narrow" and "broad" senses of "locale",
    there are at least some software implementers (and I suspect this
    applies to most) who only know of "locale", without any distinction of
    subtypes. As a result, people inevitably will end up confusing
    namespaces for locales with the RFC 3066 namespace. My concern is that
    this will lead to problems of interoperation, and will potentially
    undermine RFC 3066.

    Consider a couple of situations. First, someone needs to define in their
    software a locale for (say) US English but with a 24-hour time format.
    Yes, that falls in your broad rather than narrow sense of locale, but
    there are lots of software implementers out there that don't know the
    difference. All they know is that someone they consider knowledgeable in
    i18n/g11n issues has referred to RFC 3066 tags as "locale tags". So,
    they decide to name their locale "en-US-24hr". Then they write software,
    or document their system leading others to write software, that inserts
    this name into contexts like xml:lang. We know they shouldn't do it, but
    they don't know that; and referring to RFC 3066 as "locale" tagging only
    encouraged them to do this. And once they've done it, it can become a
    problem that all of us have to work around.
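
    The first situation can be sketched in code. This is only an
    illustration under stated assumptions: the AppLocale record, its
    field names, and the xml_lang_attribute helper are all hypothetical
    and not taken from any real library. The point it shows is that the
    24-hour preference lives in the locale, while only the language tag
    is ever written into xml:lang.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppLocale:
    """Hypothetical locale record: an internal name plus its settings."""
    name: str           # internal locale name; never leaves the application
    language_tag: str   # RFC 3066 tag describing the language of content
    hour_cycle: int     # 12 or 24; a locale setting, not a language attribute


def xml_lang_attribute(locale: AppLocale) -> str:
    """Build an xml:lang attribute from the language tag, not the locale name."""
    return f'xml:lang="{locale.language_tag}"'


# The locale may be named anything internally, including "en-US-24hr".
us_24h = AppLocale(name="en-US-24hr", language_tag="en-US", hour_cycle=24)
print(xml_lang_attribute(us_24h))
```

    Keeping the two fields separate means the locale namespace can grow
    without anything outside the RFC 3066 namespace leaking into content
    labels.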

    Secondly, consider Mongolian. Documents written in Mongolian using
    Mongolian script should be tagged (following the provisions of RFC
    3066bis) as "mn-Mong". There is no distinction to be made between
    whether these documents were written in Mongolia or in the PRC. Therefore,
    there's no need to tag the documents as "mn-Mong-CN" or "mn-Mong-MN".
    But for software locales, this country distinction *is* important.
    Suppose a software implementer names their locale "mn-Mong-MN" and then
    assumes they should insert that string into the Accept-Language header
    of an HTTP request. There's a better than fair chance that content will
    not be returned according to what the user would prefer: what they want
    is "mn-Mong", and that's how the content is tagged, but the request that
    was sent was overly specific, because the implementer didn't understand
    that the intent of RFC 3066 and the requirements for locales are not the
    same.
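
    The Mongolian case can be sketched as well, assuming the server
    applies the prefix-matching rule for Accept-Language given in RFC
    2616, section 14.4: a language-range matches a tag if it equals the
    tag, or equals a prefix of the tag that is followed by "-". The
    function name range_matches is hypothetical, and real servers may
    apply additional fallback logic that is not modelled here.

```python
def range_matches(language_range: str, content_tag: str) -> bool:
    """Return True if a language-range matches a content language tag.

    Sketch of RFC 2616 section 14.4 prefix matching; comparison is
    case-insensitive and "*" matches any tag.
    """
    wanted = language_range.lower()
    tag = content_tag.lower()
    return wanted == "*" or tag == wanted or tag.startswith(wanted + "-")


content_tag = "mn-Mong"  # how the documents are tagged

# Sending the locale name as the range is overly specific: no match.
print(range_matches("mn-Mong-MN", content_tag))

# Sending the language tag matches the content as tagged.
print(range_matches("mn-Mong", content_tag))
```

    Under this rule the range "mn-Mong" would also match content tagged
    "mn-Mong-MN" or "mn-Mong-CN", so the less specific request loses
    nothing.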

    So, I will persist in trying to get people to understand that RFC 3066
    tags are not "locale" tags, and ask that you not perpetuate confusion
    that is out there.

    Peter
     
    Peter Constable
    Globalization Infrastructure and Font Technologies
    Microsoft Windows Division


