From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue May 16 2006 - 08:57:07 CDT
On 5/16/2006 12:32 AM, Jukka K. Korpela wrote:
> On Tue, 16 May 2006, Balasankar wrote:
>
>> Should the union of the exemplar and auxiliary exemplar character
>> sets contain all the possible characters used in the particular
>> language?
>
> No. It is impossible to list the characters used in a language;
> the set is very fuzzy, with membership ranging from core characters
> (such as "a" in English) through marginal characters (like "é", i.e.
> "e" with acute, in English) to characters that may appear in special
> words, typically borrowings, perhaps _very_ rarely.
At some point you run into the 'newspaper' issue: in some cultures,
newspapers will preserve more of the spelling of foreign names (if they
use the Latin script) than is common in US papers. While such names are
not exactly borrowed words, they do form part of widely disseminated
texts in that language. As a result, the set required to be able to
handle 'texts accessed by ordinary users' in these cultures is quite
large, and has lost any specificity towards a given *language*.
I ran into that problem a decade ago when I dabbled in language recognition.
> Moreover, these sets are currently supposed to list _letters_
> only. The two sets make it possible to give a rather rough description
> of letters used in a language, and the choices made are often rather
> debatable.
>
> It isn't even clear what the intended _use_ of the sets is, or what
> the actual use will be. There is a large number of imaginable uses,
> with their own implications for what the grounds for defining the sets
> should really be. I'm afraid the (mostly implicit) criteria applied
> now make the sets incommensurable across languages.
>
That's been my feeling as well, but every time I mention this to people
who are at the core of the CLDR activity they assure me that there are
such criteria (including a clear specification of the intended use). If
that's the case, can anyone point me to a URL for them?
A./