RE: New Locale Proposal

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon Sep 18 2000 - 11:33:01 EDT


Antoine Leca & Paul Deuter

I am sorry that my previous reply was so short; I was rushed. A bit of
background:

I ported ICU for the Apache web server using ICU 1.4. Part of the project
required locale validation in order to override some of the Apache MIME
handling.

ICU does not currently validate locales. The proposal allows users to have
locales in private resource bundles that are not implemented in ICU by
specifying the bundle path to use for validation. If a locale validated
against one path is then used with a different path, the normal fallback
mechanisms apply.
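
As a rough illustration of validating a locale against a bundle path, here
is a minimal sketch in Python rather than ICU's C API. The function name
and the sets of bundle names are hypothetical stand-ins, not ICU calls or
real ICU data.

```python
def validate_locale(requested, available):
    """Return the closest locale in `available` for `requested`, or None.

    `available` stands in for the bundle names found under one bundle
    path (hypothetical data, not read from ICU).
    """
    candidate = requested
    while candidate:
        if candidate in available:
            return candidate
        # Drop the last "_"-separated part: fr_FR_EURO -> fr_FR -> fr
        candidate = candidate.rpartition("_")[0]
    return None


# Two hypothetical bundle paths with different contents.
icu_bundles = {"fr", "fr_FR", "en", "en_US"}
private_bundles = {"fr", "fr_FR", "fr_FR_EURO"}

print(validate_locale("fr_FR_EURO", private_bundles))  # fr_FR_EURO
print(validate_locale("fr_FR_EURO", icu_bundles))      # fr_FR
print(validate_locale("de_DE", icu_bundles))           # None
```

A locale that validates exactly against the private path still resolves to
a usable parent when it is later used with the other path.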

I am upgrading this code to ICU 1.6 and working with the team to integrate
it into 1.7 as a pro bono project, because I think ICU support in web
servers needs to be done.

The validation is important for two reasons:

1) Predictable results

2) Improved performance

Part of this change involves combining the C and C++ locale structures to
improve performance. Currently ICU rebuilds the same C++ locale objects
over and over again. If we follow this standard to also economize on
resource definitions for language variants, the structure will simplify
locale processing.

>-----Original Message-----
>From: Antoine Leca [mailto:Antoine.Leca@renault.fr]
>Sent: Monday, September 18, 2000 3:02 AM
>To: Unicode List
>Subject: Re: New Locale Proposal

>I do not know if this proposal is good or evil.
>But in any case there are some points that need to be enhanced IMHO.

>Carl W. Brown wrote:
>>
>> The locale will consist of three parts:
>>
>> 1) A modified lower case RFC 1766bis language
>>
>> 2) An ISO 3166 country code

>Can you allow for areas that are a little bigger ?
>The first obvious case is the EU (but I believe it may soon become an
>ISO 3166 code). Problematic cases also include the Arabic countries
>and the Spanish America, where the unity of language conjugated with
>the differences in countries create a long list of almost completely
>virtual locales (that is, outside the need to tag monetary amounts,
>these locales are non-informative). Same problem for French in
>Africa and, to a lesser extent, English on wide areas on Earth.

Good point. A combined South American Spanish is also a good starting point
for a neutral Spanish dialect. I guess you can always use a 5-8 character
language variant.

es-soamer_CL Chile using the South American variant.
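
To make the shape of such a tag concrete, here is a minimal parsing sketch
in Python. The pattern is only a guess at the proposal's syntax: it assumes
language[-subtag][_COUNTRY[_VARIANT]], where the subtag is either a
2-letter language country or a 5-8 letter language variant. The names
TAG and parse_locale are hypothetical.

```python
import re

# Assumed layout: language[-subtag][_COUNTRY[_VARIANT]]
# The subtag is a 2-letter language country (as in fr-be) or a 5-8 letter
# language variant (as in es-soamer). This is an illustration, not a spec.
TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<lang_variant>[a-z]{2}|[a-z]{5,8}))?"
    r"(?:_(?P<country>[A-Z]{2}))?"
    r"(?:_(?P<variant>[A-Z0-9]+))?$"
)


def parse_locale(tag):
    """Split a locale tag into its parts, or return None if it does not match."""
    match = TAG.match(tag)
    return match.groupdict() if match else None


print(parse_locale("es-soamer_CL"))
print(parse_locale("fr_FR_EURO"))
```

Parts that are absent from a tag come back as None, so a caller can tell
"no language variant" apart from an empty one.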

On the other hand, for a language like Portuguese you might want to use
Brazilian Portuguese from Minas Gerais as a neutral form of the language.
This might be a case for your ISO 3166-2 codes. Brazil is the major producer
of T.V. and movies and influences the Portuguese language. I guess it is
like taking California English as a standard, maybe resented but generally
understood. But I am not sure that in both cases you would not make some
adjustments for local idioms and slang, making it a different variant anyway.

>> 3) A variant
>> The modifications to RFC 1766bis to make it better suited for locales are
>> as follows:
>>
>> 1) Normalize to single form when possible. Use ISO 639-1 code instead of
>> 639-2 if one exists.

>Are you forced to re-tag every bit of data when ISO 639/RA issues a
>new code?

ICU has an ALIAS mechanism that I am changing so that it can be used more
flexibly. Old codes point to the new ones.
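
The idea of old codes pointing to new ones can be sketched as a simple
lookup. This is an illustration in Python, not ICU's ALIAS implementation;
the table and function name are hypothetical. The "iw" to "he" and "in" to
"id" entries are real ISO 639 replacements, and "fra" to "fr" illustrates
preferring the ISO 639-1 code over the 639-2 one.

```python
# Hypothetical alias table: retired or long-form code -> current code.
ALIASES = {
    "iw": "he",   # Hebrew, retired ISO 639 code
    "in": "id",   # Indonesian, retired ISO 639 code
    "fra": "fr",  # ISO 639-2 form mapped to the ISO 639-1 form
}


def canonical_language(code):
    """Follow aliases until a code with no alias is reached."""
    seen = set()
    while code in ALIASES and code not in seen:
        seen.add(code)  # guard against accidental alias cycles
        code = ALIASES[code]
    return code


print(canonical_language("iw"))   # he
print(canonical_language("fra"))  # fr
print(canonical_language("fr"))   # fr
```

With a table like this, data tagged with an old code does not need to be
re-tagged; only the table changes when a new code is issued.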

>> 3) Variants that are not related to language are locale variants.
>> fr_FR_EURO

>Can *please* people avoid this abuse of the variant idea?

>We are at less than 16 months from the end of the use of FRF. So in
>16 months from now, the "fr_FR" locale will become completely
>indistinguishable from your example. Unless you want to force us to
>leave the "fr_FR" and reserve it for tagging obsolete data, but
>I can tell you this is an already lost battle.
>This is a big problem for a draft RFC that will take around,
>say, 15 months (;-)), to be completed.

It is a simple matter of changing fr_FR to use the Euro; fr_FR_EURO will
then fall back to fr_FR.
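
A small sketch of that behaviour, in Python with made-up resource data (the
function names and bundle contents are hypothetical, not ICU's): once fr_FR
carries the Euro and no separate fr_FR_EURO bundle exists, a lookup for
fr_FR_EURO finds the fr_FR value.

```python
def fallback_chain(locale):
    """Yield the locale, then each shorter parent: fr_FR_EURO, fr_FR, fr."""
    while locale:
        yield locale
        locale = locale.rpartition("_")[0]


def lookup(bundles, locale, key):
    """Return the first value for `key` found along the fallback chain."""
    for name in fallback_chain(locale):
        if key in bundles.get(name, {}):
            return bundles[name][key]
    return None


# Hypothetical data: fr_FR has been switched to the Euro, and no
# separate fr_FR_EURO bundle is shipped any more.
bundles = {
    "fr": {"language": "French"},
    "fr_FR": {"currency": "EUR"},
}

print(list(fallback_chain("fr_FR_EURO")))
print(lookup(bundles, "fr_FR_EURO", "currency"))
print(lookup(bundles, "fr_FR", "currency"))
```

Both lookups return the same currency, so existing data tagged fr_FR_EURO
keeps working after the change.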

>Now, if we try to be a bit more clever, the locale that speaks
>French and which labels monetary amounts in euros should be named
>"fr_EU", for anything except very peculiar and very rare uses.
>There are as many differences between France's French and Belgian
>French as between Scottish English and London English (the most
>notable being the use of "octante" instead of "quatre-vingt" for
>eighty); and I believe the few other similar cases like "de_EU"
>for "de_DE"/"de_AT", "nl_EU" for "nl_NL"/"nl_BE", and the perhaps
>more future "en_EU"/"en_IE"/"en_GB" or "sv_EU"/"sv_FI"/"sv_SE".
>Furthermore, the small countries and alike, as are "LU", "AD",
>"SM", "MC" or "VC", for which independent locales will be quite
>of jokes (I except "lb_LU"), will then be covered easily.

By combining language country and locale country only when it makes sense,
we can separate them when necessary. If we want to use EU as a locale
country but not a language country, then we can have: fr-fr_EU & fr-be_EU

>> 5) Convert all non-human locales "C" & "POSIX" to human locales, e.g.
>> en_US.

>There are BIG differences between "C"/"POSIX" and "en_US".
>If you do not see that, then I believe there are big holes
>in the intended uses of these new locales.

>A major one is that "POSIX" collates in the same order as ASCII;
>while I do not believe you are willing to impose this burden
>on every user of "en_US"!
>The whole point of "C" and "POSIX" (or its grand'brother "i18n"),
>as locales, are to provide surety in execution in an area where
>fuzziness is the rule. And yes, there are cases where this is
>much more important than displaying user-friendly dates...

>Furthermore, I am not sure at all that mapping "C" to "en_US" will
>be welcome everywhere (even if C99 now insists that the names
>used in full text dates are the English ones). I am not even sure
>this is conforming, even assuming the _classical_ "en_US" where
>accented characters are considered punctuation.
>In any ways, the modern, Unicode-conformant, definition of "en_US"
>will certainly not qualify.

Maybe the answer should be POSIX = en-posix_US

Carl


