From: Mike (mike-list@pobox.com)
Date: Tue Sep 25 2007 - 15:23:23 CDT
>>> I don't think it will ever really be feasible to define regular
>>> expressions in terms of specific languages, to the point of treating
>>> combinations of two or more base characters as a single matchable
>>> "character" on the basis that speakers of language X consider the
>>> combination to be a single "letter."
>>
>> It is feasible, and I already have working code.
>
> Sorry, I made two huge mistakes in my earlier post:
>
> 1. I should never have thrown down the gauntlet to the regex mavens in
> the first place.
I had to look up maven in the dictionary, and it means (according to
Princeton University), "someone who is dazzlingly skilled in any
field." So I guess I should be flattered, but when I first read it,
it sounded like an insult. In truth, I have been just a user of
regular expressions (using the excellent pcre project) until May of
this year, when I decided to try implementing them myself.
> Dinking around with regular expressions is a popular
> pastime; I'm sure lots of people really do think they have devised an
> elegant language-dependent solution.
I don't consider what I do to be "dinking around," just as I'm sure
you wouldn't say you dink around with language tags in describing your
own work.
> 2. I should have been much more clear: what I don't think is feasible
> is to specify regexes in a language-dependent way, such that a certain
> combination means different things depending on some sort of language
> "mode." An example would be treating the sequence "[ch]" as a choice
> between 'c' and 'h' in English, but as a single "letter" in Spanish or
> Slovak or what have you.
Nobody has suggested that "[ch]" would mean anything different in
any language. To specify that you want to treat "ch" as a single
character, you can use either [[.ch.]] or [\q{ch}]. The former
is POSIX syntax, and I don't know who invented the \q notation.
As I mentioned in a previous message, this functionality is
*required* for level 2 conformance.
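
To make the idea concrete without relying on either notation, here is
a rough sketch in Python. Python's standard re module implements
neither [[.ch.]] nor \q{...}, so the sketch emulates a class such as
[a-z\q{ch}] with an alternation that tries the multi-character string
first. The helper name char_class_with_strings is made up for this
illustration and is not part of any library or of my own code.

```python
import re

def char_class_with_strings(single_chars, strings):
    """Emulate a class like [a-z\\q{ch}] using plain alternation.

    single_chars: body of an ordinary character class, e.g. "a-z"
    strings:      multi-character elements, e.g. ["ch"]
    (Hypothetical helper, for illustration only.)
    """
    # Try longer strings first so "ch" wins over a lone "c".
    ordered = sorted(strings, key=len, reverse=True)
    parts = [re.escape(s) for s in ordered]
    parts.append("[" + single_chars + "]")
    return "(?:" + "|".join(parts) + ")"

unit = char_class_with_strings("a-z", ["ch"])
print(re.findall(unit, "chico"))   # ['ch', 'i', 'c', 'o']
print(re.findall("[ch]", "chico")) # ['c', 'h', 'c']
```

The second call shows that plain "[ch]" keeps its usual meaning. One
caveat: a backtracking engine may still fall back to matching "c" alone
if the rest of a larger pattern fails, which a native implementation
might or might not allow.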
> Note carefully that I used the word "feasible" and not the word
> "possible." By adding more and more hair to the syntax, it becomes
> "possible" to do just about anything imaginable with regexes, at
> significant cost to clarity and elegance.
I am very aware of the difference between "feasible" and "possible."
When I design, I prefer to go by what is "useful" and "usable," which
necessarily has to be clear and elegant.
>> The specification uses as an example, [a-z\q{x\u0323}], which allows
>> American Indians to treat x with an under dot as a single character
>> even though there is no precomposed character for it.
>
> I did say "two or more base characters." Combining characters are a
> different kettle of fish, and indeed your solution does make the most
> sense for combining characters.
Then I chose the wrong example from the spec. It also contains the
character class example, [a-z\q{aa}] which allows Danish users to
match "aa" as a single character.
If you can specify that a character class matches specific grapheme
clusters, I think that a natural extension of this is to be able to
specify grapheme clusters that should be matched by "." (which is
just a character class itself).
> Now on the other hand, Andy Heninger wrote:
>
>> POSIX has defined exactly that, see
>> http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_05
>>
>> "Collation Elements" are locale (language) specific multi-character
>> sequences that can appear as set elements in bracket expressions.
>> I'm not sure that it's a particularly good idea, but it has been done.
>
> It looks like this is defined in terms of *my* locale, which will
> probably conform to English rules and will probably not include the line:
>
> collating-element <ch-digraph> from "<c><h>"
>
> whereas someone with a Traditional Spanish Sort locale might have this
> line. This means the same text would match differently depending on who
> is grepping it.
I agree with you that behavior should be the same for all users.
Who would argue otherwise? But would you say there shouldn't be
a way to specify a locale to work with? I've thought about how
I would do it and came up with \l{locale}. A regular expression
without \l would behave in the normal language-independent mode,
but if your expression was /\l{es}./, the Spanish locale would
enable the . to match "ch", "ll", or "rr" as a single character.
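
As a rough illustration of how that could behave (not how my
implementation works), the following Python sketch treats \l{locale}
as a prefix and rewrites each unescaped "." outside a bracket
expression into an alternation. The \l syntax is only my proposal, and
the LOCALE_CLUSTERS table and translate function are hypothetical
names invented for this example; the table holds just the three
Spanish digraphs mentioned above.

```python
import re

# Hypothetical locale table, assumed for illustration only.
LOCALE_CLUSTERS = {"es": ["ch", "ll", "rr"]}

def translate(pattern):
    """Rewrite a pattern with a leading \\l{locale} into plain `re` syntax."""
    m = re.match(r"\\l\{([^}]*)\}", pattern)
    if not m:
        return pattern  # no \l prefix: language-independent mode
    clusters = LOCALE_CLUSTERS.get(m.group(1), [])
    rest = pattern[m.end():]
    if not clusters:
        return rest
    dot = "(?:" + "|".join(re.escape(c) for c in clusters) + "|.)"
    out, i, in_class = [], 0, False
    while i < len(rest):
        c = rest[i]
        if c == "\\" and i + 1 < len(rest):
            out.append(rest[i:i + 2])  # keep escapes such as \. intact
            i += 2
            continue
        if c == "[" and not in_class:
            in_class = True
        elif c == "]" and in_class:
            in_class = False
        # Only a bare "." outside a bracket expression is widened.
        out.append(dot if (c == "." and not in_class) else c)
        i += 1
    return "".join(out)

print(re.findall(translate(r"\l{es}."), "perro"))  # ['p', 'e', 'rr', 'o']
print(re.findall(translate("."), "perro"))         # ['p', 'e', 'r', 'r', 'o']
```

Because the locale is named in the expression itself, the result does
not depend on who runs it. The bracket tracking here is deliberately
simplistic and ignores cases such as a "]" placed first in a class.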
> What I had in mind as being infeasible was a way to specify the language
> mode *in the regex itself*, so I could use "[ch]" against English text
> with one meaning and use "[ch]" against traditional Spanish text with
> the other meaning.
I would agree that "[ch]" should never mean match "ch" as one
character. You need to use [.ch.] or \q{ch} for that.
Mike