From: Doug Ewell (dewell@roadrunner.com)
Date: Tue Sep 25 2007 - 01:23:21 CDT
"Mike" <mike dash list at pobox dot com> wrote:
>> I don't think it will ever really be feasible to define regular
>> expressions in terms of specific languages, to the point of treating
>> combinations of two or more base characters as a single matchable
>> "character" on the basis that speakers of language X consider the
>> combination to be a single "letter."
>
> It is feasible, and I already have working code.
Sorry, I made two huge mistakes in my earlier post:
1. I should never have thrown down the gauntlet to the regex mavens in
the first place. Dinking around with regular expressions is a popular
pastime; I'm sure lots of people really do think they have devised an
elegant language-dependent solution.
2. I should have been much more clear: what I don't think is feasible
is to specify regexes in a language-dependent way, such that a certain
combination means different things depending on some sort of language
"mode." An example would be treating the sequence "[ch]" as a choice
between 'c' and 'h' in English, but as a single "letter" in Spanish or
Slovak or what have you.
Note carefully that I used the word "feasible" and not the word
"possible." By adding more and more hair to the syntax, it becomes
"possible" to do just about anything imaginable with regexes, at
significant cost to clarity and elegance.
> There is no avoiding it. Consider: [\uAC00-\uD7A3] which should match
> any LV or LVT Hangul syllable. That character class needs to be able
> to match any of the precomposed characters listed in the range, but
> also must match any sequence of jamos that is canonically equivalent,
> such as <U+1103 U+1167 U+11AB>.
That solution would be specific to Korean, but would not be interpreted
differently in a Korean-language context vs. a non-Korean-language
context, which is how I should have phrased it.
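
To make the canonical-equivalence point concrete, here is a minimal Python sketch. It is not Mike's implementation: Python's standard `re` module does not apply canonical equivalence on its own, so this sketch swaps in a simpler technique and normalizes the input to NFC before matching. The function name `match_syllable` is made up for illustration. Whatever language the user has in mind, the result is the same.

```python
import re
import unicodedata

# The range quoted above: the precomposed Hangul syllables.
HANGUL_SYLLABLE = re.compile("[\uAC00-\uD7A3]")

def match_syllable(text):
    """Return the first Hangul syllable found after NFC normalization, or None.

    Assumption: normalizing to NFC first stands in for a regex engine
    that honors canonical equivalence natively.
    """
    composed = unicodedata.normalize("NFC", text)
    m = HANGUL_SYLLABLE.search(composed)
    return m.group(0) if m else None

# The jamo sequence from the example: <U+1103 U+1167 U+11AB>.
jamos = "\u1103\u1167\u11AB"

# Without normalization the character class does not match the jamos.
print(HANGUL_SYLLABLE.search(jamos))          # None
# After NFC the sequence composes to one precomposed syllable.
print(hex(ord(match_syllable(jamos))))        # 0xb390
```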
> The specification uses as an example, [a-z\q{x\u0323}], which allows
> American Indians to treat x with an under dot as a single character
> even though there is no precomposed character for it.
I did say "two or more base characters." Combining characters are a
different kettle of fish, and indeed your solution does make the most
sense for combining characters.
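
As a rough illustration of the x-with-underdot case: Python's `re` module has no `\q{...}` syntax, so the sketch below approximates `[a-z\q{x\u0323}]` with an alternation that tries the two-code-point sequence first. The names `LETTER` and `letters` are hypothetical, and this is only an approximation of what the specification describes.

```python
import re

# Approximation of [a-z\q{x\u0323}]: try the longer alternative first so
# that "x" followed by U+0323 COMBINING DOT BELOW is consumed as one unit.
LETTER = re.compile("x\u0323|[a-z]")

def letters(text):
    """Split text into matchable 'letters' under the pattern above."""
    return LETTER.findall(text)

# "x" + U+0323 comes back as one two-code-point item; a bare "x" is still
# matched by the [a-z] branch.
print([len(item) for item in letters("ax\u0323bx")])   # [1, 2, 1, 1]
```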
> I also allow you to put named character sequences in a character
> class: [\N{KATAKANA LETTER AINU P}] and they always consist of
> multiple code points, by definition.
But again, the behavior is not different for different languages, right?
Now on the other hand, Andy Heninger wrote:
> POSIX has defined exactly that, see
> http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_05
> "Collation Elements" are locale (language) specific multi-character
> sequences that can appear as set elements in bracket expressions.
> I'm not sure that it's a particularly good idea, but it has been done.
It looks like this is defined in terms of *my* locale, which will
probably conform to English rules and will probably not include the
line:
collating-element <ch-digraph> from "<c><h>"
whereas someone with a Traditional Spanish Sort locale might have this
line. This means the same text would match differently depending on who
is grepping it.
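
The locale dependence can be mimicked with a toy model in Python. This is not how any real POSIX regex library is implemented. It assumes a locale is just an ordered list of collating elements, and it uses a range expression such as `[a-d]` rather than `[ch]`, because ranges are where a multi-character collating element shows up most visibly. The names `ENGLISH`, `TRADITIONAL_SPANISH`, `leading_element`, and `match_range` are all made up for the sketch.

```python
# Toy collation sequences (assumption: plain a-z, with "ch" sorting
# between "c" and "d" in the traditional Spanish table).
ENGLISH = list("abcdefghijklmnopqrstuvwxyz")
TRADITIONAL_SPANISH = list("abc") + ["ch"] + list("defghijklmnopqrstuvwxyz")

def leading_element(text, order):
    """Return the longest collating element that text starts with, or None."""
    for elem in sorted(order, key=len, reverse=True):
        if text.startswith(elem):
            return elem
    return None

def match_range(text, lo, hi, order):
    """Match one collating element at the start of text against [lo-hi].

    The element matches if it falls between lo and hi, inclusive, in the
    given collation sequence.
    """
    elem = leading_element(text, order)
    if elem is None:
        return None
    if order.index(lo) <= order.index(elem) <= order.index(hi):
        return elem
    return None

# Same text, same range expression, different locale tables.
print(match_range("chico", "a", "d", ENGLISH))              # c
print(match_range("chico", "a", "d", TRADITIONAL_SPANISH))  # ch
```

In the toy Spanish table, `[a-c]` does not match at the start of "chico" at all, since the leading element "ch" sorts after "c".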
What I had in mind as being infeasible was a way to specify the language
mode *in the regex itself*, so I could use "[ch]" against English text
with one meaning and use "[ch]" against traditional Spanish text with
the other meaning.
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages