From: Mike (mike-list@pobox.com)
Date: Mon Sep 24 2007 - 10:25:37 CDT
> I don't think it will ever really be feasible to define regular
> expressions in terms of specific languages, to the point of treating
> combinations of two or more base characters as a single matchable
> "character" on the basis that speakers of language X consider the
> combination to be a single "letter."
It is feasible, and I already have working code.
There is no avoiding it. Consider [\uAC00-\uD7A3], which should
match any LV or LVT Hangul syllable. That character class needs
to be able to match any of the precomposed characters listed in
the range, but also must match any sequence of jamos that is
canonically equivalent, such as <U+1103 U+1167 U+11AB>.
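To make the Hangul case concrete, here is a minimal Python sketch. It is not the working code mentioned above. It assumes the simplest possible approach, normalizing the input to NFC before applying an ordinary code point range, and the helper name matches_hangul_syllable is hypothetical.

```python
import re
import unicodedata

# Plain code point range: on its own it only sees precomposed syllables.
HANGUL_SYLLABLE = re.compile("[\uAC00-\uD7A3]")

def matches_hangul_syllable(text: str) -> bool:
    """Return True if text is one Hangul syllable, precomposed or as jamos.

    Assumption: normalizing to NFC first is an acceptable stand-in for
    real canonical-equivalence matching inside the regex engine.
    """
    composed = unicodedata.normalize("NFC", text)
    return HANGUL_SYLLABLE.fullmatch(composed) is not None

jamo_sequence = "\u1103\u1167\u11AB"   # the jamo sequence from the example
precomposed = unicodedata.normalize("NFC", jamo_sequence)

# Without normalization the range cannot match the three-jamo sequence.
raw_match = HANGUL_SYLLABLE.fullmatch(jamo_sequence)
```

Here raw_match is None, while matches_hangul_syllable accepts both the jamo sequence and its precomposed form. A real engine would need to do this without rewriting the subject text, which is the harder part.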
The specification gives the example [a-z\q{x\u0323}], which
lets American Indians treat x with an underdot as a single
character even though there is no precomposed character for it.
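Python's re module has no \q{...} syntax, so the sketch below only emulates the idea with an alternation that tries the multi-code-point string before the single-code-point range. The names LETTER and letters are hypothetical, and this is an illustration rather than how any particular engine implements it.

```python
import re
import unicodedata

# Emulates [a-z\q{x\u0323}]: the two-code-point string is tried first,
# then the ordinary single-code-point range.
LETTER = re.compile("(?:x\u0323|[a-z])")

def letters(text: str) -> list[str]:
    """Split text into the 'characters' the emulated class would match."""
    return LETTER.findall(text)

sample = "ax\u0323b"

# A plain range splits the x from its combining dot below.
plain = re.findall("[a-z]", sample)

# The emulated class keeps x + U+0323 together as one matched unit.
grouped = letters(sample)

# NFC cannot help here: there is no precomposed x with dot below.
nfc_length = len(unicodedata.normalize("NFC", "x\u0323"))
```

With the plain range, sample yields three matches and the combining mark is skipped; with the emulated class it still yields three matches, but the middle one is the full two-code-point sequence.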
I also allow named character sequences in a character class,
such as [\N{KATAKANA LETTER AINU P}]; by definition these always
consist of multiple code points.
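As a small illustration of why a named sequence cannot be a single code point, the Python sketch below resolves the name with unicodedata.lookup, which accepts named sequences in Python 3.3 and later. Building a literal pattern from the result is an assumption made for the example, not the syntax described above.

```python
import re
import unicodedata

# unicodedata.lookup resolves named sequences as well as character names.
ainu_p = unicodedata.lookup("KATAKANA LETTER AINU P")

# The sequence is more than one code point, so a class that contains it
# has to be able to match a multi-code-point string.
code_points = [f"U+{ord(ch):04X}" for ch in ainu_p]

# Stand-in for [\N{KATAKANA LETTER AINU P}]: match the whole sequence literally.
pattern = re.compile(re.escape(ainu_p))
found = pattern.fullmatch(ainu_p) is not None
```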
Mike