From: Mike (mike-list@pobox.com)
Date: Thu Oct 04 2007 - 22:44:43 CDT
>>> In addition, the meaning of ranges in sets like [a-z] should also be
>>> consistent with the collation used...
>>
>> I disagree with this. I think that having [a-z] magically
>> mean all characters in a particular language is asking for
>> trouble. In French, would you say that [a-z] should match
>> C WITH CEDILLA or A + ACUTE?
> Having that kind of support allows regexes to be written that match, say,
> the top half of a list by using [a-k] etc. That's something that you can
> do in English today, but not in any other language. You need to decide
> whether extending regexes to other languages should allow such uses (in
> which case you think of collation elements and sorting order) or not.
>
> Depending on how many accented letters a language uses, writing the
> equivalent expression manually can be both tedious and error-prone.
The reason I think that [a-z] should only match the 26 code points
is that regular expressions are often used to match things like
domain name parts: [a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])? where
the allowed characters do not change depending on locale.
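
As a rough illustration of that point, here is a minimal Python sketch that wraps the pattern quoted above. The helper name is_label is made up for this example, and the sketch relies on Python's re module comparing ranges by code point when no case-insensitive flag is set.

```python
import re

# Pattern copied from the message; ranges are plain code point ranges here.
LABEL = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?")

def is_label(text: str) -> bool:
    """Return True if the whole string matches the domain-label pattern.

    is_label is a hypothetical helper name, not part of any library.
    """
    return LABEL.fullmatch(text) is not None

if __name__ == "__main__":
    for candidate in ["example", "xn--caf-dma", "-bad", "caf\u00e9"]:
        print(repr(candidate), is_label(candidate))
```

If [a-z] followed the collation of the current locale, the last candidate could start matching in some locales, which is the behavior change I would like to avoid.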
I agree that having an easy way to say "match any Swedish character",
or some range of the characters, would be useful; maybe this could be
done using something similar to the \p{} syntax for properties? I
don't want to propose anything since I haven't studied it enough yet.
>> It's my opinion that ranges inside [] should be simple binary
>> order. If you want to do anything fancier, there should be
>> new syntax for it.
> That, or an option?
I would be ok with it being an option.
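
To make the two behaviors concrete, here is a small Python sketch that contrasts a binary (code point) range test with a range test driven by an explicit ordering. The function names are invented for this example, and the Swedish alphabet string is only a stand-in for a real collation table, which would be far more involved.

```python
# Illustrative ordering only: the 26 basic letters followed by
# a-ring (U+00E5), a-diaeresis (U+00E4) and o-diaeresis (U+00F6).
SWEDISH = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def in_codepoint_range(ch: str, lo: str, hi: str) -> bool:
    """Binary order: compare raw code point values."""
    return ord(lo) <= ord(ch) <= ord(hi)

def in_ordered_range(ch: str, lo: str, hi: str, alphabet: str = SWEDISH) -> bool:
    """Optional behavior: compare positions in an explicit ordering.

    Characters missing from the ordering never match.
    """
    if len(ch) != 1:
        return False
    pos = alphabet.find(ch)
    return pos != -1 and alphabet.index(lo) <= pos <= alphabet.index(hi)

if __name__ == "__main__":
    a_ring = "\u00e5"
    print(in_codepoint_range(a_ring, "a", "z"))     # binary order
    print(in_ordered_range(a_ring, "a", "\u00f6"))  # explicit ordering
```

Keeping the second test behind an option or separate syntax means existing patterns written against binary order keep their meaning.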
> Now, other than for canonical decompositions (and conjoining Jamo), I've
> not seen an example that informs me of why it is useful for a regex
> package to be able to match 'ch' as if it were a single code point. Can
> somebody please present a simple example that shows an important use
> case that can't be realized if regexes are limited to a single character
> (plus *canonical* equivalents).
I don't know the reason -- I just implemented all the features
required for level 1 and level 2 conformance, and part of level 2
is being able to do this.
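
For readers unsure what matching 'ch' as a unit would look like, here is a hedged Python sketch that splits a word into elements, trying multi-letter units before single characters. The DIGRAPHS list and the elements function are assumptions made for illustration; this is not how my implementation or any particular regex package does it.

```python
import re

# Hypothetical list of multi-letter units; a real implementation would
# take these from collation data for the language in question.
DIGRAPHS = ("ch", "ll", "rr")

# Alternation is tried left to right, so digraphs win over the
# single-character fallback at each position.
ELEMENT = re.compile("|".join(DIGRAPHS) + "|.", re.DOTALL)

def elements(word: str) -> list[str]:
    """Split a word into units, keeping listed digraphs together."""
    return ELEMENT.findall(word)

if __name__ == "__main__":
    for word in ["chorro", "calle", "casa"]:
        print(word, elements(word))
```

With a split like this, a construct that matches "one element" would consume 'ch' in a single step instead of stopping after 'c'.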
> After all, the atomic elements for writing would be the 'c' and 'h', it
> is only for the purpose of some other text operations that 'ch' are
> (sometimes) considered a unit.
I used to be fluent in written Spanish, but despite that, I never
considered ch, ll, or rr to be single characters. I think I did
a Spanish crossword once where ch went into a single square.
Mike