Re: (iso639.193) the Ethnologue

From: Peter_Constable@sil.org
Date: Tue Sep 19 2000 - 17:55:53 EDT

Next message: Marco.Cimarosti@icl.com: "RE: [idn] nameprep forbidden characters"
Previous message: James E. Agenbroad: "Re: TATAP => TATAR"

I've got the revisions to the revisions on the paper sitting on Gary's desk
(was hoping we'd get this online today, but the day's getting old, so
tomorrow is looking more likely). So, I'll return to this discussion and
try to respond to some of the weekend's flurry of messages.

On 09/16/2000 08:21:04 AM Michael Everson wrote:

[snip]

>The Ethnologue lists six different Ancash Quechua, five different Huánaco
>Quechuas, and a lot of other Quechuas besides. It's got five kinds of
>Italian. How do we evaluate this? And I don't know how many Zapotecos,
>there are too many to count. Do we just accept that it's all been
>evaluated?
>
>Well, then we find errors, and we point them out. And we say, that's why
>we're worried about this database. But Peter says that's not good enough,
>it's only "anecdotal", and indeed the burden is placed on us to improve
>the Ethnologue by filing reports.

What I mean here, Michael, is this: in the first paragraph above, you
haven't demonstrated that problems exist; you've merely implied that
problems exist based on the assumption that there shouldn't be more than
one Ancash Quechua, etc. This is the kind of thing I'm referring to as
anecdotal: "it's wrong because I don't agree with it".

There is a reason why six different Ancash Quechuas, etc. are listed:
research has indicated that that is how many related but distinct,
mutually non-intelligible speech varieties have made use of the name
"Ancash Quechua".

>I've got Meillet and Cohen's 1924 _Les langues du monde_ here on my desk
>in front of me. Like the Ethnologue, it deals with the languages of the
>world. It has big lists in it. Would I accept those uncritically either?
>No.

This seems to me to be an important issue: can people involved in creating
standardized systems of language identifiers trust the judgements of
experts from the field of linguistics? I think the answer must be yes, for
two reasons:

1. People creating IT standards cannot be experts in all fields, and
certainly cannot all be experts in linguistics, especially of all different
languages and language families of the world. When dealing with something
outside their field of expertise, there must be a willingness to trust the
judgements of experts in that domain, and I think this applies in this
case.

2. The position that those controlling a system of language identifiers
must hold the expertise and be able to determine how to "tile the
plane" of language variations around the world is based on an invalid
assumption: that there is only one, correct way to tile the plane for use
in IT. There is not one single, correct categorization of languages. This
is one of the key points Gary and I have made in our paper.

>I recognize the need for more languages. My concern with the Ethnologue is
>with its classification.

This seems to argue in favour of the preceding point: there is no single
consensus on how to enumerate the world's languages, since different people
use different definitions for different purposes. The only solution to that
impossible situation is a system that allows for alternate namespaces, each
based on different particular definitions and maintained by different
authorities.

In various messages, it has sounded like you agree with us that the
international standards process could never cope with providing the
thousands of tags that some existing users need. We are in agreement that
the list of 6000+ Ethnologue codes can't serve as *the* international
standard; and we agree that you could never get everybody to agree on a
list that large - this is precisely our point about categorization. Thus if
you recognize the need for more language tags, then you must like our idea
of namespaces, since that gives us a way to have well-documented codes that
anybody can use to address the full scope of the world's languages, without
requiring that the whole world own the codes. It seems that, in the same
way that the XML community couldn't agree on a single worldwide tag set and
so adopted namespaces, so must the IT community do this for language
tagging.
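
To make the namespace idea a little more concrete, here is a minimal sketch
in Python. Everything specific in it is an assumption made for illustration
only: the namespace names "written" and "speech", the "namespace:code"
notation, and the authority descriptions are hypothetical, not a proposed
syntax and not actual registry content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Namespace:
    name: str        # hypothetical namespace identifier
    authority: str   # agency that maintains the codes
    definition: str  # operational definition behind the categorization


@dataclass(frozen=True)
class LanguageTag:
    namespace: str
    code: str

    def __str__(self):
        return f"{self.namespace}:{self.code}"


# Illustrative namespaces only; the names and wording are made up.
NAMESPACES = {
    "written": Namespace(
        "written", "a standards body",
        "varieties grouped by a common written form"),
    "speech": Namespace(
        "speech", "a linguistic research agency",
        "mutually non-intelligible speech varieties"),
}


def parse_tag(text, known):
    """Split 'namespace:code' and reject tags from unknown namespaces."""
    namespace, sep, code = text.partition(":")
    if not sep or not code:
        raise ValueError(f"expected 'namespace:code', got {text!r}")
    if namespace not in known:
        raise ValueError(f"unknown namespace {namespace!r}")
    return LanguageTag(namespace, code)
```

With this shape, a tag such as "written:tha" names a category under one
definition, and a code in the "speech" namespace names a category under
another; the two never need to be reconciled into a single worldwide list.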

>You know
>how much a fuss there was just because the code for Yiddish was changed
>from ji to yi? Well how much fuss is there going to be if we find out that
>Upper Kinauri and Lower Kinauri shouldn't really have been given two
>different codes? Because we DON'T want to change codes once they have been
>used in an RFC 1766 context.

This is somewhat overstated. Changing the code for a given meaning from
"abc" to "def" is a serious problem, and it is understandable that people
would be upset. And that is something that the Ethnologue staff is
committed never to do. This is different from changing the categorization
based on improved knowledge, such as merging two categories or splitting a
category into two. That is something that would only be done if it
conformed to the operational definition and was motivated by improved
knowledge. And given the operational definition, this is actually what
users would probably prefer to have happen - if they assume the categories
are defined in a certain way, and they gain a better understanding of the
real-world categories, they want the codes for the categories to reflect
the current best practices. This poses no problem for users or for existing
data, provided there is clear documentation of what the codes mean and
of which changes occurred when. This is exactly what we have argued
is needed to deal with the dynamic nature of language, and is precisely how
the Ethnologue will be maintained. Nor is this a problem for many
users, including business users, for two reasons: the languages that are
most likely to undergo such recategorization are of interest to a
relatively limited number of users, and the categorization used by many
users, including business users, will generally be based on a different
operational definition, such that the codes they are using (preferably from
a different namespace) would not be affected by such changes.

For instance, if a better understanding of regional Thai varieties results
in a revision to the categorization of those languages in a namespace
defined in terms of mutually non-intelligible speech varieties, that would
have no effect on the categorization (based on a different definition) that
is interested in only a single language variety based primarily on a common
written form, "Standard Thai", and business users and many other users
would continue to use "tha". But those users for whom the individual speech
varieties are important will get codes that reflect the improved
understanding, and for them, that is exactly what they want, provided that
they can also maintain their older data based on the earlier understanding.
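
The kind of documented change history described above might be sketched as
follows, again in Python. This is only an illustration under stated
assumptions: the codes "xxa", "xxb" and "xxc", the date, and the note are
invented placeholders, and the record layout is hypothetical rather than how
any existing database is actually maintained.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Change:
    when: date
    kind: str          # "split" or "merge"
    old_codes: tuple   # codes retired by this change
    new_codes: tuple   # codes introduced by this change
    note: str          # documentation of why the change was made


# Invented example: one speech variety recategorized as two.
CHANGES = [
    Change(date(2001, 1, 1), "split", ("xxa",), ("xxb", "xxc"),
           "placeholder note: later research found two distinct varieties"),
]


def current_codes(code, changes):
    """Follow recorded splits and merges forward from an older code.

    Returns the set of codes that cover the old category after all
    recorded changes. Retired codes are never reassigned a new meaning,
    so data tagged with an old code stays interpretable.
    """
    result = {code}
    for change in sorted(changes, key=lambda c: c.when):
        if result & set(change.old_codes):
            result -= set(change.old_codes)
            result |= set(change.new_codes)
    return result
```

Older data tagged "xxa" can then be read as "xxb or xxc" under the revised
categorization, while a code such as "tha" in a different namespace has no
entries in this log and is left untouched.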

>Therefore I am wary of such a huge list. Do you really find this so
>unreasonable?

Only inasmuch as I don't think all the issues have been considered. When
we really think about the entire set of problems involved in language
identification, the only real solutions seem to be:

- clarify what are the operational definitions on which categorizations are
based
- create distinct namespaces based upon distinct operational definitions
and maintained by agencies with expertise for the given domain (with an
assumption of some minimal criteria to be met for creating a distinct
namespace, including the need to ensure avoidance of synonyms with ISO
639-x in particular)
- have some mechanism to handle the dynamic nature of language and of our
knowledge of languages in a *controlled* manner that helps users rather
than creating problems for users
- provide adequate documentation as to the meaning of codes; this must
include some measure of encyclopedic information that is freely available
online, and maintained on an on-going basis (it's not about to go out of
print)

The Ethnologue is only a part of a broad solution that we are proposing,
though we feel it makes a valuable contribution to that solution
particularly because it conforms to the criteria described above and
because it immediately overcomes existing problems of scale.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:13 EDT