Re: RFC 1766 language tags

From: pollarda@physc2.byu.edu
Date: Tue Jun 17 1997 - 17:49:01 EDT


In a previous post, which my VMS editor refused to pull up, it was asked:

"Why do we need language tags anyway?"

There are a number of reasons for having language tags. Primarily, they
help with the machine understanding and display of the text at hand.
Language tags are crucial for any robust information retrieval system.
(I consult, building custom retrieval engines/applications.)
English, for example, has a very simple word structure.
There is a limited set of inflections for a given word.

Run:
Run
Running
Ran

Belch:
Belch
Belching
Belched
Belches
etc.

Because of English's simple word structure, it is easy to search for all
forms of a word with a simple wildcard (e.g., "belch*").
However, not all languages are like English. Greek, on the other hand,
can have over 600 forms for the same verb, and Finnish can have anywhere from
10,000-40,000 forms for the same word depending on how you want to count
them. Generally the way this is handled is to pass the words to some
sort of normalization routine (some are more robust than others) which converts
the word to its root form or some other normalized form before it is indexed.
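
As a rough illustration of normalizing before indexing, here is a minimal
Python sketch. The suffix list and the function names (normalize_english,
build_index) are hypothetical and far cruder than a real stemmer or
morphological analyzer; they only show the shape of the idea.

```python
# Toy suffix-stripping normalizer for English. The suffix list is an
# illustrative assumption, not a real stemming algorithm.
ENGLISH_SUFFIXES = ("ing", "ed", "es", "s")


def normalize_english(word):
    """Lowercase the word and strip at most one common inflectional suffix."""
    word = word.lower()
    for suffix in ENGLISH_SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(words):
    """Map each normalized form to the set of surface forms seen for it."""
    index = {}
    for w in words:
        index.setdefault(normalize_english(w), set()).add(w)
    return index


index = build_index(["Belch", "Belching", "Belched", "Belches"])
```

All four forms of "belch" land under one index entry. Irregular forms such
as "ran" are not handled by suffix stripping at all, which is one reason
real routines vary so much in robustness.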

So, in this case, language tags would be crucial so that you know which
language processing routines you need to pass the word to. As electronic
documents and commerce become more and more common and robust, electronic
analysis of the text becomes increasingly important: not just analysis of
individual words, but computational analysis of whole documents.

In addition, many languages have different sort orders specified for the same
characters -- often sorting on whether two characters are next to each
other in addition to looking at the characters themselves. So even to
sort a simple list of 10 words, language tags could play a very crucial
part. (That way you know which sort routine to use.)
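
To make the sorting point concrete, here is a small Python sketch. It
assumes the traditional Spanish ordering in which the digraphs "ch" and
"ll" count as single letters sorting after "c" and "l". The tag names
("en", "es-traditional") and the SORT_KEYS table are made up for this
example, and accented letters and "ñ" are ignored for brevity.

```python
def spanish_traditional_key(word):
    """Collation key treating 'ch' and 'll' as single letters that sort
    after 'c' and 'l' respectively (traditional Spanish order)."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair == "ch":
            key.append(("c", 1))  # sorts after every plain 'c'
            i += 2
        elif pair == "ll":
            key.append(("l", 1))  # sorts after every plain 'l'
            i += 2
        else:
            key.append((word[i], 0))
            i += 1
    return key


# Hypothetical table: language tag -> sort-key routine.
SORT_KEYS = {
    "en": str.lower,
    "es-traditional": spanish_traditional_key,
}

words = ["cuna", "chico", "luz", "llama", "color"]
english_order = sorted(words, key=SORT_KEYS["en"])
spanish_order = sorted(words, key=SORT_KEYS["es-traditional"])
```

The same five words come out as chico, color, cuna, llama, luz under the
plain character-by-character key, and as color, cuna, chico, luz, llama
under the digraph-aware key. Without a tag there is no way to know which
result the reader expects.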

All in all, there are _many_ different reasons why it is important to
use/have language tags.

Personally, I am not too fond of what has been proposed in this current thread
as I believe that it makes far more sense to have some sort of hierarchical
system such as:
English/American/Southern
English/American/HawaiianPidgin
etc.
At least that way, if you don't happen to have an English/American/Southern
processor, you can back up and use an English/American processor, and if that
fails, back up one more to a simple English processor.
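
The back-up behavior could look something like the following Python
sketch. The function name find_processor, the slash-separated tag syntax
taken from my examples above, and the stand-in processors are all
assumptions for illustration, not part of any proposal in this thread.

```python
def find_processor(tag, processors):
    """Return (matched_tag, processor) for the most specific registered
    prefix of a slash-separated tag, or (None, None) if nothing matches."""
    parts = tag.split("/")
    while parts:
        candidate = "/".join(parts)
        if candidate in processors:
            return candidate, processors[candidate]
        parts.pop()  # back up one level in the hierarchy
    return None, None


# Hypothetical registry: no Southern-specific processor is installed.
processors = {
    "English": lambda text: "generic English: " + text,
    "English/American": lambda text: "American English: " + text,
}

matched, process = find_processor("English/American/Southern", processors)
```

Here the lookup settles on "English/American" because nothing is
registered for the full tag. A tag with no registered ancestor at all,
say "Finnish/Helsinki", yields (None, None), so the caller can decide what
to do with text it has no processor for.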

-Art
Art Pollard
Moderator Comp.Theory.Info-Retrieval/Consultant


