From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 19 2003 - 18:51:03 EST
From: Addison Phillips [wM]
> Please note that there is a discussion list for this topic at:
> ietf-languages@iana.org
>
> While Mark and I welcome your comments here or privately, off-list, you
> can best be a part of the discussion by joining that list. Join the list
> by sending a request email to: ietf-languages-request@iana.org
I note that the language tags proposal includes the following EBNF
productions for extensions that may be appended after the language code,
script code, region code and variant code:
extensions = "-x" 1* ("-" key "=" value)
key = ALPHA *alphanum
value = 1* utf8uri
alphanum = (ALPHA / DIGIT)
utf8uri = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))
Under this new scheme, the following language tag may be valid:
"sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
which here would mean: {
language="sr"; // Serbian
script="Latn"; // Latin
region="SP"; // Serbia-Montenegro
variant="2003";
extensions="-x"; {
href="http://www.iana.org/"
version="1.0"
}
}
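To make the decomposition above concrete, here is a minimal sketch in Python of how a consumer might split off and decode the extensions. It assumes the productions quoted above are the complete grammar; the function name parse_extensions and the choice to split at the first "-x-" are my own illustration, not something the draft defines.

```python
import re
from urllib.parse import unquote

# One "-key=value" pair, following the quoted productions:
#   key   = ALPHA *alphanum
#   value = 1*(ALPHA / DIGIT / "%" 2HEXDIG)
_PAIR = re.compile(r"-([A-Za-z][A-Za-z0-9]*)=((?:[A-Za-z0-9]|%[0-9A-Fa-f]{2})+)")

def parse_extensions(tag):
    """Return (prefix, {key: decoded value}) for a tag with "-x" extensions.

    Hypothetical helper: splits at the first "-x-" and percent-decodes
    each value as UTF-8.
    """
    prefix, sep, rest = tag.partition("-x-")
    if not sep:
        return tag, {}          # no extensions present
    rest = "-" + rest           # restore the "-" that introduces the first pair
    pairs = {}
    pos = 0
    while pos < len(rest):
        match = _PAIR.match(rest, pos)
        if match is None:
            raise ValueError("malformed extension at offset %d" % pos)
        pairs[match.group(1)] = unquote(match.group(2))
        pos = match.end()
    return prefix, pairs

tag = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
prefix, extensions = parse_extensions(tag)
print(prefix)       # the language, script, region and variant subtags
print(extensions)   # the decoded key/value pairs
```

Note that the value production excludes "-", so a "-" inside a value always has to be escaped as %2D; that is what lets the sketch treat every unescaped "-" as the start of the next pair.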
However, the problem with that scheme is its new use of the characters "%"
and "=". Many applications expect nothing in this field other than
alphanumerics and "-", "_" or ".", so that the language tag can safely be
used without specific escaping within URIs (for example in HTTP GET URLs)
or as an option of a MIME type (the sample here may not correspond to an
existing option of the "text/plain" MIME type):
Content-Type: text/plain; charset=UTF-8;
lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
This may break compatibility with some parsers if such "extended language
tags" are found there, as there are two additional "=" signs within the
value of the "lang=" option.
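As an illustration of the kind of breakage I mean, here is a sketch in Python of a deliberately strict parameter parser. It is not taken from any real library; naive_params is a hypothetical name, and the "lang" option is the same made-up sample as above. It assumes each parameter contains exactly one "=".

```python
def naive_params(header_value):
    """Parse 'type; name=value; ...' assuming exactly one '=' per parameter.

    Hypothetical strict parser, written only to show the failure mode.
    """
    media_type, *params = [part.strip() for part in header_value.split(";")]
    result = {}
    for param in params:
        if not param:
            continue
        # Unpacking raises ValueError when the value itself contains "=".
        name, value = param.split("=")
        result[name] = value
    return media_type, result

old_style = "text/plain; charset=UTF-8; lang=sr-Latn-SP-2003"
extended = ("text/plain; charset=UTF-8; "
            "lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0")

print(naive_params(old_style))      # parses without trouble
try:
    naive_params(extended)
except ValueError as err:
    print("extended tag rejected:", err)
```

A parser that splits only on the first "=" (param.split("=", 1) in this sketch) would survive, but nothing guarantees that deployed parsers were written that way.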
For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
through correctly, as the following would become possible and prone to
generate form data parsing errors:
I think it's quite strange that these extensions do not use the existing
restricted character set to encode them, instead of relying on "%" and "=".
Why not use "_" instead of "=" and "." instead of "%", like this:
"sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0"
(same meaning as the first example above).
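A rough Python sketch of this alternative encoding follows. The function names are hypothetical, and I assume that every byte outside ASCII letters and digits (including a literal "." or "_") is escaped as "." plus two hex digits of its UTF-8 encoding, which is what keeps the scheme unambiguous.

```python
def encode_value(text):
    """Escape every non-alphanumeric byte of the UTF-8 form as '.' + 2 hex digits."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append(".%02X" % byte)
    return "".join(out)

def decode_value(encoded):
    """Reverse encode_value: '.' introduces two hex digits for one byte."""
    raw = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == ".":
            raw.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(encoded[i]))
            i += 1
    return raw.decode("utf-8")

def build_extensions(pairs):
    """Build the '-x-key_value-key_value' suffix from a dict of pairs."""
    return "-x" + "".join("-%s_%s" % (key, encode_value(value))
                          for key, value in pairs.items())

suffix = build_extensions({"href": "http://www.iana.org/", "version": "1.0"})
print("sr-Latn-SP-2003" + suffix)
print(decode_value("http.3A.2F.2Fwww.2Eiana.2Eorg.2F"))
```

Since keys are purely alphanumeric and "_" is always escaped inside values, the first "_" after a "-" still separates key from value without ambiguity.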
But at least this draft offers a good starting point to indicate locales
more precisely.
I fully support the new reference to the ISO-15924 standard for the script
code, and the documentation of the legal values of variant codes (either a
year with possible era, or a registered tag), as well as the clear
indication that language codes should be the shortest ISO-639 codes (is
that true for the few legacy languages which were previously coded with 3
letters and then upgraded to 2-letter codes, before the policy of no longer
doing so was adopted?)
I don't know where this affects Unicode, except in its possible normative
data tables, which may contain future language code conditions, or in
language tags inserted in Unicode-encoded texts. Does Unicode need these
extensions?