RE: regular expressions with unicode situation?

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Tue Apr 22 2003 - 16:29:17 EDT

    Hi Ben,

    Most regex engines can handle Unicode text in the trivial cases, such as
    exact matching. Writing regexes that are genuinely useful in a Unicode
    context is a non-trivial exercise: without engine support you either have
    to spell out huge numbers of code points by hand, or you cannot use the
    character properties defined in the Unicode Character Database. The
    guidelines for implementing Unicode regex are published as a Unicode
    Technical Report (not part of the standard), which you can find here:
    http://www.unicode.org/reports/tr18
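
    To make the code point versus property distinction concrete, here is a
    minimal sketch in Java (chosen only because java.util.regex comes up in
    this reply). The class name, helper method, and sample text are my own
    illustrations, not anything taken from TR18. It compares a hand-written
    ASCII range with the \p{L} (General_Category = Letter) property class:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; the name is not from any library.
public class PropertyVersusRange {
    // Hand-enumerated range: only covers the 52 ASCII letters.
    static final Pattern ASCII_WORD = Pattern.compile("[A-Za-z]+");

    // Property-based class: \p{L} matches any code point whose
    // General_Category is a Letter, in any script the engine knows about.
    static final Pattern UNICODE_WORD = Pattern.compile("\\p{L}+");

    // Collect every match of the pattern in the text, in order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "naive" with i-diaeresis (U+00EF), the Greek word "Delta",
        // and "cafe" with e-acute (U+00E9), all precomposed.
        String text = "na\u00EFve \u0394\u03B5\u03BB\u03C4\u03B1 caf\u00E9";
        System.out.println(findAll(ASCII_WORD, text));
        System.out.println(findAll(UNICODE_WORD, text));
    }
}
```

    With the ASCII range, "naïve" is split into "na" and "ve" and the Greek
    word is skipped entirely; with \p{L} each of the three words comes back
    whole. Note that this sketch assumes precomposed characters: text that
    uses combining marks would need more than \p{L} alone.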

    JDK 1.4 (and later) contains a reasonable implementation that supports a
    few of these guidelines. You can read the Javadoc here:

    http://java.sun.com/j2se/1.4.1/docs/api/java/util/regex/package-frame.html
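
    As a rough idea of what that package offers, the sketch below uses three
    features documented for java.util.regex.Pattern: general category classes
    such as \p{Lu}, block classes such as \p{InGreek}, and the UNICODE_CASE
    flag. The class and method names are hypothetical, and Pattern.LITERAL is
    an assumption about your JDK: it was added after 1.4, so on 1.4 itself you
    would have to escape the literal by hand.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; only the Pattern calls are real JDK API.
public class JavaUnicodeClasses {
    // True if the string begins with an uppercase letter in any script
    // (General_Category = Lu), not just A-Z.
    static boolean startsUpper(String s) {
        return Pattern.compile("^\\p{Lu}").matcher(s).find();
    }

    // True if every character lies in the Greek block (U+0370..U+03FF).
    static boolean allGreek(String s) {
        return Pattern.matches("\\p{InGreek}+", s);
    }

    // Case-insensitive comparison of a literal against the input.
    // CASE_INSENSITIVE alone only folds US-ASCII; adding UNICODE_CASE
    // makes the folding Unicode-aware.
    static boolean matchesIgnoringCase(String literal, String input) {
        int flags = Pattern.LITERAL
                  | Pattern.CASE_INSENSITIVE
                  | Pattern.UNICODE_CASE;
        return Pattern.compile(literal, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(startsUpper("\u00C9cole"));
        System.out.println(allGreek("\u0394\u03B5\u03BB\u03C4\u03B1"));
        System.out.println(matchesIgnoringCase("\u00E9cole", "\u00C9COLE"));
    }
}
```

    The design point is that none of these patterns lists code points
    explicitly; the engine's own property tables do the work.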

    Xerces-flavored (Java) regex contains similar functionality.

    Perl 5, I believe, supports the same level of functionality. I don't work
    with it much, so I haven't paid much attention to the details, but I am
    given to understand that the support is present.

    If what you need (as it appears) is a regex API to call from your own
    application, I would look at a good Unicode library such as ICU
    (http://oss.software.ibm.com/icu).

    Best Regards,

    Addison

    Addison P. Phillips
    Director, Globalization Architecture
    webMethods, Inc.

    +1 408.962.5487 (phone) +1 408.210.3569 (mobile)
    -------------------------------------------------
    Internationalization is an architecture.
    It is not a feature.

    Chair, W3C-I18N-WG Web Services Task Force
    To participate see http://www.w3.org/International/ws

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
    > Behalf Of Ben Dougall
    > Sent: Tuesday, April 22, 2003 12:39 PM
    > To: unicode@unicode.org
    > Subject: regular expressions with unicode situation?
    >
    >
    > i'm just wondering if anyone can tell me what the general state of play
    > is at the moment regarding using regular expressions with unicode?
    >
    > i'm not even completely sure if / how the two would fit together
    > completely or successfully? i've used regex in php, which was a version
    > of posix regex, and found it very useful. i'm now doing stuff on a mac
    > - os x (cocoa), and am starting work on an app that will analyse and
    > dissect text and am wondering if i can make use of regular expressions.
    > i want the app to work equally in all languages / character subsets. if
    > regex in general only covers small portions of unicode i don't think
    > it'll be so useful.
    >
    > any general info regarding regex in conjunction with unicode much
    > appreciated. thanks.
    >


