From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Sep 24 2007 - 15:52:56 CDT
Having named character sequences in \N is an interesting idea. Would you
mind proposing that to the UTC using the online form? (That's the way to
raise issues to the UTC's attention.)
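To make the named-sequence idea concrete, here is a minimal sketch in Python. It assumes only the standard unicodedata module, whose lookup() accepts named sequences; named_class is a hypothetical helper made up for this illustration, and nothing here is what UTS#18 or any regex engine actually specifies for \N.

```python
import re
import unicodedata


def named_class(names, extra=""):
    """Hypothetical helper: emulate a character class whose members may be
    named sequences by expanding it into an alternation."""
    members = [unicodedata.lookup(name) for name in names]
    # Try longer members first so a sequence wins over a shorter member
    # that happens to be its prefix.
    members.sort(key=len, reverse=True)
    alternatives = [re.escape(member) for member in members]
    if extra:
        # 'extra' is ordinary class content such as "a-z".
        alternatives.append("[" + extra + "]")
    return re.compile("(?:" + "|".join(alternatives) + ")")


# The named sequence resolves to two code points: <U+31F7, U+309A>.
ainu_p = unicodedata.lookup("KATAKANA LETTER AINU P")
pattern = named_class(["KATAKANA LETTER AINU P"], extra="a-z")
```

With this sketch, pattern.fullmatch(ainu_p) succeeds, while the first code point of the sequence on its own does not match.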
BTW, Andy and I concluded that the really effective way to do canonical
equivalence in regex would be in a mode where grapheme cluster is the unit,
not code point.
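A rough Python sketch of why the unit matters, using the examples from the quoted message below. It assumes NFC normalization from the standard unicodedata module as a stand-in for real canonical-equivalence support. The clusters function is a hypothetical helper that only approximates grapheme clusters (it attaches combining marks to the preceding base character); it is not the UAX #29 algorithm.

```python
import re
import unicodedata

HANGUL_SYLLABLE = re.compile("[\uAC00-\uD7A3]")


def matches_hangul_syllable(text):
    """Normalize to NFC first so that a conjoining jamo sequence composes
    into a precomposed syllable before the class is applied."""
    composed = unicodedata.normalize("NFC", text)
    return HANGUL_SYLLABLE.fullmatch(composed) is not None


def clusters(text):
    """Approximate grapheme clusters: normalize to NFC, then glue every
    character with a non-zero canonical combining class onto the
    preceding character."""
    units = []
    for ch in unicodedata.normalize("NFC", text):
        if units and unicodedata.combining(ch):
            units[-1] += ch
        else:
            units.append(ch)
    return units


jamos = "\u1103\u1167\u11AB"   # <U+1103 U+1167 U+11AB>
x_dot = "x\u0323"              # x with dot below, no precomposed form

# Code-point matching misses the jamo sequence; normalization fixes that.
plain_hit = HANGUL_SYLLABLE.fullmatch(jamos) is not None   # False
nfc_hit = matches_hangul_syllable(jamos)                    # True

# Normalization alone does not help x + U+0323: it stays two code points,
# so a code-point "." cannot match it, but it is a single cluster.
dot_hit = re.fullmatch(".", unicodedata.normalize("NFC", x_dot)) is not None  # False
one_cluster = len(clusters(x_dot)) == 1                     # True
```

The last two lines are the point: normalization handles the Hangul case, but a sequence with no precomposed form only behaves as one "character" once the matcher works on clusters.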
On the comment on "feasible" -- I think the reference there was to
language/locale-sensitive regex. That involves a few things which are quite
tricky, and are thus listed under Level 3 in UTS#18.
- language-sensitive matching: "aa" matches a-ring in Danish
- language-sensitive ordering ranges: [a-z] doesn't include o-slash in
Danish
- language-sensitive grapheme clusters: a dot matches "ch" in Slovak
- ...
Few implementations other than POSIX try to handle locale sensitivity (and
POSIX has significant problems there). I wouldn't say that these features are
infeasible, but they are tricky.
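As a toy illustration of two of the items above, here is a sketch in Python. All names and tables in it are hypothetical and hard-coded for the example; a real implementation would take its tailorings from locale data, and this sketch ignores case, ordering ranges, and everything else that makes the problem tricky.

```python
def tailored_units(text, contractions):
    """Split text into match units, treating each listed contraction
    (e.g. "ch" for Slovak) as a single unit. Longest contraction wins."""
    ordered = sorted(contractions, key=len, reverse=True)
    units = []
    i = 0
    while i < len(text):
        for contraction in ordered:
            if text.startswith(contraction, i):
                units.append(contraction)
                i += len(contraction)
                break
        else:
            # No contraction starts here: the unit is one code point.
            units.append(text[i])
            i += 1
    return units


def dot_matches(text, contractions=()):
    """True if a tailored "." would consume the whole text as one unit."""
    return len(tailored_units(text, contractions)) == 1


def fold(text, equivalents):
    """Rewrite language-specific equivalents to one form before comparing."""
    for source, target in equivalents.items():
        text = text.replace(source, target)
    return text


# Assumed tables for illustration only, lowercase only.
SLOVAK_CONTRACTIONS = {"ch"}
DANISH_EQUIVALENTS = {"aa": "\u00E5"}   # "aa" is treated like a-ring

units = tailored_units("chata", SLOVAK_CONTRACTIONS)   # ['ch', 'a', 't', 'a']
same = fold("aalborg", DANISH_EQUIVALENTS) == fold("\u00E5lborg", DANISH_EQUIVALENTS)
```

Here dot_matches("ch", SLOVAK_CONTRACTIONS) is true while dot_matches("ch") is not, which is the language-sensitive grapheme cluster case in miniature.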
Mark
On 9/24/07, Mike <mike-list@pobox.com> wrote:
>
> > I don't think it will ever really be feasible to define regular
> > expressions in terms of specific languages, to the point of treating
> > combinations of two or more base characters as a single matchable
> > "character" on the basis that speakers of language X consider the
> > combination to be a single "letter."
>
> It is feasible, and I already have working code.
>
> There is no avoiding it. Consider: [\uAC00-\uD7A3] which should
> match any LV or LVT Hangul syllable. That character class needs
> to be able to match any of the precomposed characters listed in
> the range, but also must match any sequence of jamos that is
> canonically equivalent, such as <U+1103 U+1167 U+11AB>.
>
> The specification uses as an example, [a-z\q{x\u0323}], which
> allows American Indians to treat x with an under dot as a single
> character even though there is no precomposed character for it.
>
> I also allow you to put named character sequences in a character
> class: [\N{KATAKANA LETTER AINU P}] and they always consist of
> multiple code points, by definition.
>
> Mike
>
>