Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Peter Westlake (peter@harlequin.co.uk)
Date: Fri Mar 13 1998 - 08:39:05 EST


At 04:46 1998-03-13 -0800, Kolbjørn Aambø@unicode.org wrote:
>Would not something like:
>
>Aa:á:Àà:â:Ãã:Ææ:Ää:Åå,Bb,Cc:Çç,Dd,Ee:Ééèêë,Ff,Gg,Hh,I:¡iíìîï,Jj,Kk,Ll,Mm,Nn:Ññ,O
>o:óòô:Õõ:‘¦:Øø:Öö,Pp,Qq,Rr,Ss,Tt,Uu:úùû,Vv,Ww,Xx,Yy:Üü,Zz.
>
>be appropriate for English searching?

Yes. In fact, that could be the value of the ordered equivalence
class for letters in English, except that I think you are including
extra information about how letters sort within each class.

It would be nice to have all common collation sequences built in
to the system, and to be able to define and name new ones using
your notation.
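
To make the idea concrete, here is a minimal Python sketch of how such a
notation could be turned into equivalence classes. It assumes one reading
of the notation: "," separates classes, ":" separates sort levels inside
a class, and the first character of each class stands for the whole class.
The names parse_classes, fold and equivalent are made up for illustration,
and the ENGLISH string is only a subset of the list quoted above.

```python
import unicodedata


def parse_classes(notation):
    """Map every character to the first character of its class.

    Assumed reading of the notation: ',' separates equivalence classes,
    ':' separates sort levels within a class, a trailing '.' ends the list.
    """
    notation = unicodedata.normalize("NFC", notation)
    table = {}
    for cls in notation.rstrip(".").split(","):
        members = cls.replace(":", "")  # ignore sort levels when matching
        if not members:
            continue
        for ch in members:
            table[ch] = members[0]
    return table


def fold(text, table):
    """Replace each character by its class representative."""
    text = unicodedata.normalize("NFC", text)
    return "".join(table.get(ch, ch) for ch in text)


def equivalent(a, b, table):
    """True if a and b are equal under the equivalence relation."""
    return fold(a, table) == fold(b, table)


# Subset of the quoted list, enough for the example word.
ENGLISH = parse_classes(
    "Aa:á:Àà:â:Ãã:Ää:Åå,Gg,Mm,Nn:Ññ,Oo:óòô:Õõ:Øø:Öö,Rr,Ss,Tt."
)

print(equivalent("Ångstrøm", "Angstrom", ENGLISH))
```

This throws away the within-class ordering, which is fine for searching
but not for sorting.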

>Then you would find Ångstrøm by searching for Angstrom.
>
>A little problem though: I have a problem matching
>KVÆRNER by searching for KVAERNER using the above relation, any suggestion?

Make the equivalence classes work for strings rather than
single characters. This might also help when searching for
characters that can be either decomposed or precomposed; or
maybe that is better handled at a lower level. String classes
would allow silly tricks like searching for kana by Roman
transcription too :-)
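
A rough Python sketch of string-level classes, under the assumption that
each class is a list of equivalent strings whose first member is the
representative, and that folding takes the longest match at each position.
The names build_table and fold_strings and the sample classes are
hypothetical, not part of any existing system.

```python
def build_table(classes):
    """Map every member string to the first string of its class."""
    table = {}
    for members in classes:
        representative = members[0]
        for member in members:
            table[member] = representative
    return table


def fold_strings(text, table):
    """Fold text by greedy longest match against the class members."""
    longest = max(map(len, table), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in table:
                out.append(table[piece])
                i += n
                break
        else:
            # No class member starts here: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


# Illustrative classes only: a ligature and two kana syllables.
TABLE = build_table([
    ["AE", "Æ"],
    ["ka", "か"],
    ["na", "な"],
])

print(fold_strings("KVÆRNER", TABLE) == fold_strings("KVAERNER", TABLE))
print(fold_strings("かな", TABLE) == fold_strings("kana", TABLE))
```

Greedy matching is the simplest choice; a real implementation would have
to decide what to do when class members overlap.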

Peter.

>By the way I have seen this way of putting relation among characters in
>several other people's work.
>
>
>Peter Westlake <peter@harlequin.co.uk> wrote:
>:
>>Now, if I want to find a word beginning with A in a list of
>>scientific words used in English, then I would hope to find
>>"Ångstrøm". But if I were searching for names beginning with
>>A in the Danish telephone directory, it would be a mistake to
>>find "Ångstrøm". So I need to say what I mean. If I want to
>>match A-F in English, I need a short way of saying whether to
>>include accents and case and of saying that I mean English.
>>Something like [A-F::u,a,uk] where u means upper case, a means
>>any accent, uk is from a standard list of codes. The range is
>>interpreted in the context of the UK collating sequence. To
>>omit Ångstrøms, I would ask for ^[A::u,a,dk]* meaning "a string
>>beginning with a letter that matches A in Danish". In this context,
>>"Danish" and "English" can be seen as equivalence relations that
>>partition the character set into equivalence classes. Kolbjørn
>>gave an example of such a relation.
>>
>:
>:
>
>
>
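
The quoted point that "Danish" and "English" are different equivalence
relations can be sketched in Python as below. This is only an illustration:
the RELATIONS tables are tiny made-up samples, not real collation data, and
starts_with with its upper_only flag is a hypothetical stand-in for a
pattern like ^[A::u,a,dk], not a parser for that syntax.

```python
import unicodedata

# Illustrative relations only: in "uk" Å falls in the class of A,
# in "dk" it is a letter in its own right.
RELATIONS = {
    "uk": {"A": "AaÁáÀàÂâÄäÅå"},
    "dk": {"A": "Aa", "Å": "Åå"},
}


def class_members(letter, locale, upper_only=False):
    """Return the characters equivalent to letter in the given locale."""
    members = RELATIONS[locale].get(letter, letter)
    members = unicodedata.normalize("NFC", members)
    if upper_only:
        return [m for m in members if m.isupper()]
    return list(members)


def starts_with(word, letter, locale, upper_only=False):
    """True if word begins with a character in the class of letter."""
    word = unicodedata.normalize("NFC", word)
    return bool(word) and word[0] in class_members(letter, locale, upper_only)


print(starts_with("Ångstrøm", "A", "uk", upper_only=True))
print(starts_with("Ångstrøm", "A", "dk", upper_only=True))
```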