Re: transliterations (was Compelling Unicode demo)

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Nov 19 2001 - 13:43:21 EST


Lars Kristan wrote:

> Anyway, maybe I made a mistake by mixing the two aspects right from the
> start. Let's forget about the ë for a moment and think about Björk or
> Almodóvar. The most basic transliteration would be dropping all accents,
> and I did not find that in the http://oss.software.ibm.com/cgi-bin/icu/tr
> demo, the closest thing I got was Almodo<'>var.

You can write a short ICU 2.0 transliteration ID that decomposes the input (NFD) and then removes accents. Mark knows the syntax better...
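
To illustrate the decompose-then-remove idea without the ICU rule syntax, here is a minimal sketch using Python's standard unicodedata module rather than ICU. The function name strip_accents is made up for this example, and treating "accent" as "nonspacing mark (category Mn)" is an assumption:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop nonspacing marks (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("Almodóvar"))  # Almodovar
print(strip_accents("Björk"))      # Bjork
```

Letters that have no canonical decomposition (for example ø or ß) pass through unchanged, so this covers only the "drop the accents" case.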

> I think people will expect that searching for Almodovar will find both
> forms. And that means people searching the web (ok, you can say those have
> time to repeat the search) as well as people working for example in a bank
> searching for an account.

A locale-specific collator may serve this better than transliteration. With a collator, you can restrict the comparison to primary (letter-level) differences, so that accent and case differences are ignored.
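
As a rough sketch of what a primary-level search looks like from the caller's side, the following Python approximates a primary comparison key by removing combining marks and case. This is not a real collator and is not locale-aware (a Swedish collator, for instance, treats ö as a letter of its own); primary_key, find_accounts and the sample data are hypothetical names for illustration only:

```python
import unicodedata

def primary_key(text: str) -> str:
    """Crude stand-in for a primary-strength sort key: no accents, no case."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

def find_accounts(names, query):
    """Return every name that matches the query at the 'letter' level."""
    wanted = primary_key(query)
    return [name for name in names if primary_key(name) == wanted]

accounts = ["Almodóvar", "ALMODOVAR", "Almodovar", "Almeida"]
print(find_accounts(accounts, "almodovar"))
# ['Almodóvar', 'ALMODOVAR', 'Almodovar']
```

The stored names stay untouched; only the comparison ignores the differences, which is the main advantage over transliterating the data itself.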

> Once simple transliteration is covered, adding some transcriptions as well
> would be a plus. Providing both Bjork and Bjoerk as entries in the index may
> be neither always correct nor always complete, but - it's something,
> right?

You can do this in ICU with custom rules.
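
The sketch below shows the idea of producing several index entries per name, again in plain Python rather than ICU rules. The TRANSCRIPTIONS table is a small, German-style assumption made up for this example, not a complete or authoritative mapping, and index_entries is a hypothetical helper:

```python
import unicodedata

# Hypothetical transcription table; real rules would depend on the language.
TRANSCRIPTIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def _strip_marks(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_entries(name: str) -> list:
    """Return the name, its accent-stripped form, and its transcribed form."""
    composed = unicodedata.normalize("NFC", name)
    transcribed = "".join(TRANSCRIPTIONS.get(ch, ch) for ch in composed)
    candidates = [composed, _strip_marks(composed), _strip_marks(transcribed)]
    entries = []
    for candidate in candidates:  # de-duplicate, keeping first-seen order
        if candidate not in entries:
            entries.append(candidate)
    return entries

print(index_entries("Björk"))      # ['Björk', 'Bjork', 'Bjoerk']
print(index_entries("Almodóvar"))  # ['Almodóvar', 'Almodovar']
```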

> To sum it up - I was not thinking of exact transcription or transliteration,
> with both source and target language defined. All I am saying is that
> something generic would be handy.

More generic than an almost regexp-style rules syntax and means to concatenate arbitrary transliterator objects?

markus


