From: Keutgen, Walter (walter.keutgen@be.unisys.com)
Date: Tue May 16 2006 - 12:58:48 CDT
Asmus,
you are right, there is enough fog around that for people who are not on the CLDR team. Moreover, applying your 'newspaper' criterion below, I felt that for German it would now be politically correct to include Polish characters. Space in the survey tool is limited, though, so I would have had to throw out the south European languages. So I left it as it was.
The tool uses the sets to flag textual data as erroneous if it contains letters outside of the sets, but the team has decided to disregard this 'error' because many exemplar character lists have stayed empty.
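To make the check concrete, here is a minimal sketch in Python of what such a test could look like. It is not the survey tool's actual code: the function name, the case folding, and the choice to skip the check when no exemplar data exists are all assumptions made for illustration.

```python
import unicodedata

def letters_outside_exemplars(text, exemplar, auxiliary=frozenset()):
    """Return the sorted letters in `text` that are in neither set.

    Hypothetical helper, not part of CLDR or the survey tool.
    Only characters in a Unicode letter category (L*) are checked;
    digits, punctuation and spaces are ignored.
    """
    allowed = {c.lower() for c in exemplar} | {c.lower() for c in auxiliary}
    if not allowed:
        # Assumption: with an empty exemplar list there is nothing to
        # check against, so nothing is flagged.
        return []
    return sorted({
        ch for ch in text.lower()
        if unicodedata.category(ch).startswith("L") and ch not in allowed
    })

# Illustrative German exemplar set (an assumption, not the CLDR data).
german = set("abcdefghijklmnopqrstuvwxyzäöüß")
flagged = letters_outside_exemplars("Wałęsa besucht Köln", german)
```

With a German set like the one above, a Polish name in otherwise German text is flagged for its two Polish letters, which is exactly the 'newspaper' case. Passing those letters as the auxiliary set makes the same text pass.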
Best regards
Walter
-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Asmus Freytag
Sent: Tuesday, 16 May 2006 15:57
To: Jukka K. Korpela
Cc: unicode@unicode.org
Subject: Re: CLDR
On 5/16/2006 12:32 AM, Jukka K. Korpela wrote:
> On Tue, 16 May 2006, Balasankar wrote:
>
>> Whether the union of Exemplar & auxiliary exemplar character set
>> should contain all the possible characters used in the particular
>> language?
>
> No. It is impossible to list down the characters used in a language;
> the set is very fuzzy, with membership ranging from core characters
> (such as "a" in English) through marginal characters (like "é", i.e.
> "e" with acute, in English) to characters that may appear in special words,
> typically borrowings, perhaps _very_ rarely.
At some point you run into the 'newspaper' issue: in some cultures,
newspapers will preserve more of the spelling of foreign names (if they
use the Latin script) than is common in US papers. While such names are
not exactly borrowed words, they do form part of widely disseminated
texts in that language. As a result, the set required to be able to
handle 'texts accessed by ordinary users' in these cultures is quite
large, and has lost any specificity towards a given *language*.
I ran into that problem a decade ago when I dabbled in language recognition.
> Moreover, these sets are currently supposed to list down _letters_
> only. The two sets make it possible to give a rather rough description
> of letters used in a language, and the choices made are often rather
> debatable.
>
> It isn't even clear what the intended _use_ of the sets is, or what
> the actual use will be. There is a large number of imaginable uses,
> with their own implications on what the grounds for defining the sets
> should really be. I'm afraid the (mostly implicit) criteria applied
> now make the sets incommensurable across languages.
>
That's been my feeling as well, but every time I mention this to people
who are at the core of the CLDR activity they assure me that there are
such criteria (including a clear specification of the intended use). If
that's the case, can anyone give a URL to them?
A./