Lars Kristan wrote:
> Anyway, maybe I made a mistake by mixing the two aspects right from the
> start. If we forget about the ë for a moment and think about Björk or
> Almodóvar. The most basic transliteration would be dropping of all accents
> and I did not find that in the http://oss.software.ibm.com/cgi-bin/icu/tr
> demo, the closest thing I got was Almodo<'>var.
You can write a short ICU 2.0 transliteration ID that decomposes the input (NFD) and then removes accents. Mark knows the syntax better...
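To illustrate the decompose-then-strip idea outside of ICU, here is a minimal sketch using only Python's standard `unicodedata` module. It is not the ICU API, and the function name `strip_accents` is made up for this example. Later ICU documentation gives a compound ID along the lines of `NFD; [:Nonspacing Mark:] Remove; NFC` for the same job; whether ICU 2.0 accepts exactly that form is an assumption here.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, drop nonspacing marks, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    # General category "Mn" (Mark, nonspacing) covers the combining accents
    # that NFD splits off from base letters such as o + U+0308.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


print(strip_accents("Björk"))      # Bjork
print(strip_accents("Almodóvar"))  # Almodovar
```

Letters that have no canonical decomposition (for example ø or ß) pass through unchanged, so this covers only the "drop all accents" case from the question.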
> I think people will expect that searching for Almodovar will find both
> forms. And that means people searching the web (ok, you can say those have
> time to repeat the search) as well as people working for example in a bank
> searching for an account.
This is probably better done with a locale-specific collator than with transliteration. With a collator, you can make a search match on primary (letter-level) differences only, so that accent and case differences are ignored.
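As a rough picture of what a primary-level match does, the sketch below approximates it with the Python standard library. This is an assumption-laden stand-in, not a real collator: it ignores locale tailoring entirely (a Swedish collator, for instance, treats ö as a separate letter at the primary level), and the names `primary_key` and `find_accounts` are invented for the example.

```python
import unicodedata


def primary_key(text: str) -> str:
    """Approximate a primary-strength key: base letters only, case ignored."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is nonzero for combining marks such as accents.
    letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return letters.casefold()


def find_accounts(query: str, names: list[str]) -> list[str]:
    """Return every name that differs from the query only by accents or case."""
    wanted = primary_key(query)
    return [name for name in names if primary_key(name) == wanted]


accounts = ["Almodóvar", "Almodovar", "ALMODOVAR", "Almond"]
print(find_accounts("almodovar", accounts))
# ['Almodóvar', 'Almodovar', 'ALMODOVAR']
```

Unlike transliterating the stored text, this leaves the original names untouched and only changes how they are compared.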
> Once simple transliteration is covered, adding some transcriptions as well
> would be a plus. Providing both Bjork and Bjoerk as entries in the index may
> be neither always correct nor always complete, but - it's something,
> right?
You can do this in ICU with custom rules.
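The idea of indexing several spellings can be sketched with a small rule table. The table below is hypothetical and deliberately tiny (lowercase letters only), it is plain Python rather than ICU rule syntax, and `index_keys` is a made-up helper name.

```python
import unicodedata
from itertools import product

# Hypothetical rule table: each accented letter maps to the plain spellings
# that should also appear in the index.
RULES = {
    "ö": ("o", "oe"),
    "ä": ("a", "ae"),
    "ü": ("u", "ue"),
    "ó": ("o",),
}


def index_keys(name: str) -> set[str]:
    """Return the original name plus every spelling the rule table produces."""
    composed = unicodedata.normalize("NFC", name)
    # For each character, list its alternatives (or just the character itself).
    choices = [RULES.get(ch, (ch,)) for ch in composed]
    variants = {"".join(combo) for combo in product(*choices)}
    variants.add(name)
    return variants


print(sorted(index_keys("Björk")))      # ['Bjoerk', 'Bjork', 'Björk']
print(sorted(index_keys("Almodóvar")))  # ['Almodovar', 'Almodóvar']
```

The number of variants grows multiplicatively with the number of letters that have more than one alternative, so a real index would likely want to cap or prune them.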
> To sum it up - I was not thinking of exact transcription or transliteration,
> with both source and target language defined. All I am saying is that
> something generic would be handy.
More generic than an almost regexp-style rule syntax plus a means to concatenate arbitrary transliterator objects?
markus