Re: the Ethnologue

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Wed Sep 13 2000 - 11:31:56 EDT


Peter Constable wrote:
>
> A tag that denotes a group of languages serves no useful
> purpose for most language-specific processes. For example, if all you know
> about the language of some information object is that it is an Athapascan
> language, you can't spell-check that information.

While I agree with you, there are problems anyway with the way languages
are distinguished.
For example, I know two languages quite well, three if we add English,
and their situations with respect to spell checking are quite different.

With French (seen from a French point of view; Canadians, Belgians and
Swiss may view things differently), spelling is quite uniform, or at least
uniform enough that a word list is useful for spell checking.

Valencian is viewed (by every part of ISO 639 and by Ethnologue alike)
as a dialect of Catalan. The problem is that the spelling of Standard
Valencian is clearly established (and I am *not* talking about the alternative
spellings that are sometimes in use in Valencia or even more in the Balearic
Islands), and it differs on some points from Catalan practice. These
points include: the ending of the 1st person of the present indicative,
the whole subjunctive, the ordinal adjectives and the feminine possessive.
As a result, any spell-checking operation with a Catalan word list leads to
quite a number of false positives. Here, the solution is quite easy: building
specific word lists for Valencian (these exist for some tools, particularly
for public domain software); however, there is no solution in sight for the
tagging of data. And Ethnologue does not seem to be of help, particularly
since it seems to lump the differences I mentioned above together with the
attempt by a (small) minority of Valencians to create a different spelling,
specific to Valencian concerns (as Ethnologue correctly notes, "The standard
dialect is a literary composite which no one speaks"; so specific local
'solutions' are easy to design, particularly when intermixed with political
problems of primacy and rivalry between Barcelona and Valencia).

With English, the spell-checking problem is quite different, and separate
word lists would not be as easy a solution: the en-US vs. en-GB
tagging does not seem to adequately cover the various differences such as
-ise vs. -ize, -our vs. -or, -re vs. -er, use of shall vs. will in the
1st person, ...
Or more precisely, if it does, that is, if "en-GB" is intended to always cover
the first form in each of the pairs above, then I believe it will be of less
use to people (this is as I understand things; certainly people much more
proficient in English will contradict me here; please allow for my lack of
knowledge in this field and try to extract the point from my explanations.
Thanks.)
So here the solution for spell checking is rather to allow "parametrisation"
of the checking process, according to the user's taste and practice. While
this is a feasible solution for English, it is not as easy for all languages.
And certainly this is a process that does not fit well with tagging...
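
To make the "parametrisation" idea concrete, here is a minimal sketch. It
assumes a tiny hand-made table of spelling pairs and a per-user choice for
each kind of variation; the names VARIANTS, COMMON_WORDS, build_lexicon and
misspelled are invented for the illustration and do not belong to any
existing spell checker.

```python
# Hypothetical variant table: each key is one kind of variation and each
# pair is (first form, second form). The words are illustrative only.
VARIANTS = {
    "ise_ize": [("organise", "organize"), ("realise", "realize")],
    "our_or": [("colour", "color"), ("honour", "honor")],
    "re_er": [("centre", "center"), ("theatre", "theater")],
}

# Words whose spelling does not vary along any of the axes above.
COMMON_WORDS = {"the", "of", "a", "is"}


def build_lexicon(preferences):
    """Return the set of accepted words for one user's preferences.

    `preferences` maps an axis name to 0 (first form) or 1 (second form).
    """
    lexicon = set(COMMON_WORDS)
    for axis, pairs in VARIANTS.items():
        choice = preferences[axis]
        lexicon.update(pair[choice] for pair in pairs)
    return lexicon


def misspelled(text, lexicon):
    """Return the words of `text` not found in `lexicon`, in order."""
    return [word for word in text.lower().split() if word not in lexicon]


# A writer who uses -ize together with -our and -re: a combination that a
# single "first form everywhere" or "second form everywhere" list misses.
mixed_user = build_lexicon({"ise_ize": 1, "our_or": 0, "re_er": 0})
print(misspelled("the colour of the center", mixed_user))  # ['center']
```

The point of the sketch is that the preferences live with the user, not with
the text, which is why a single tag on the data cannot carry them.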

I have no firm idea of what form a list of languages should take.

But I am _sure_ that any list will lead to problems, due to the fuzziness
of the borders between languages. And while this problem can more or less
be dealt with when it comes to the major languages with abundant
literature and standardized spelling, as soon as the list gets down to
lesser-used languages, problems will arise.

> Change is needed as the objects described change and as our knowledge of
> the objects change. This is no less true of several ISO standards: 10646,
> 3166,... It is especially true of 639: for example, currently if someone
> wants to tag a document containing Hopi text, they would need to use the
> tag nai "North American Indian (other)". Suppose in two years time there is
> a specific code for Hopi added to ISO 639-2; consider what happens to that
> existing data: it is now *incorrectly* tagged (not just sub-optimally
> tagged), because nai no longer includes Hopi since that now has its own
> code. Every time a new code is added to ISO 639, the meaning of some
> existing codes changes.

The problem you mention with the incorrect tagging of Hopi is inherent in
any persistent use of information that relies on a changing database.
If Ethnologue is merged with (or into) ISO 639, this problem won't fade away,
because the linguistic map of the planet is alive (not to mention political
pressures like those I described for Valencian above). So if CLN (I am sorry,
I do not know Hopi's situation, so I cannot comment on your specific example)
is split, with a special code created for Valencian, then from that very
day all literature in Valencian would *now* be incorrectly tagged. Exactly
the same case as you described above. The same, except for one point: the
number of documents that might be affected...
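
The effect of such a split can be sketched as follows. This is only an
illustration under stated assumptions: "CLN" is the code mentioned above,
while "VLN", the two registry snapshots and the sample documents are all
invented here. Real tagged data would also not record the true variety of
each document, which is precisely why the stale tags are hard to find.

```python
# Two snapshots of a hypothetical code list. "VLN" is an invented code
# used only for this illustration.
REGISTRY_BEFORE = {"CLN": {"Catalan", "Valencian", "Balearic"}}
REGISTRY_AFTER = {"CLN": {"Catalan", "Balearic"}, "VLN": {"Valencian"}}

# Each document records its tag and, for the sake of the example only,
# the variety it is really written in.
DOCUMENTS = [
    {"title": "doc-1", "tag": "CLN", "variety": "Valencian"},
    {"title": "doc-2", "tag": "CLN", "variety": "Catalan"},
]


def incorrectly_tagged(documents, registry):
    """Return titles of documents whose tag does not cover their variety."""
    return [
        doc["title"]
        for doc in documents
        if doc["variety"] not in registry.get(doc["tag"], set())
    ]


print(incorrectly_tagged(DOCUMENTS, REGISTRY_BEFORE))  # []
print(incorrectly_tagged(DOCUMENTS, REGISTRY_AFTER))   # ['doc-1']
```

Nothing in the documents changed between the two calls; only the meaning of
the code did.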

I do not expect this problem to have any cure.

Antoine


