From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Tue Apr 22 2003 - 16:29:17 EDT
Hi Ben,
Most regex engines can handle Unicode text in the trivial cases, such as
exact matching. Building regex support that is genuinely useful in a Unicode
context is a non-trivial exercise: you want to match on the character
properties defined in the Unicode Character Database rather than spell out
huge numbers of code points by hand. The guidelines for implementing Unicode
regex are published as a Unicode Technical Report (not part of the
standard), which you can find here:
http://www.unicode.org/reports/tr18
JDK 1.4 (and later) contains a reasonable regex implementation
(java.util.regex) that supports a few of these guidelines. You can read the
Javadoc here:
http://java.sun.com/j2se/1.4.1/docs/api/java/util/regex/package-frame.html
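As a rough illustration of what property support buys you, here is a minimal
sketch against java.util.regex. The class name UnicodeRegexSketch, the helper
findAll and the sample text are made up for this example, and the code uses
generics, so it assumes a JDK newer than 1.4. It compares the property class
\p{L} (any letter) with a hand-written ASCII range:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or of any library named above.
class UnicodeRegexSketch {
    // \p{L} matches any code point whose General Category is "Letter",
    // so no script-specific ranges have to be listed by hand.
    public static final Pattern WORD = Pattern.compile("\\p{L}+");

    // ASCII-only class for comparison: it misses accented and non-Latin letters.
    public static final Pattern ASCII_WORD = Pattern.compile("[A-Za-z]+");

    // Collects every non-overlapping match of the pattern in the text.
    public static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "naïve Δοκιμή 文字 42", written with escapes to stay encoding-safe.
        String text = "na\u00EFve \u0394\u03BF\u03BA\u03B9\u03BC\u03AE \u6587\u5B57 42";
        System.out.println(findAll(WORD, text));       // three words, digits skipped
        System.out.println(findAll(ASCII_WORD, text)); // only the fragments "na" and "ve"
    }
}
```

The property-based pattern picks up the Latin, Greek and Han words alike,
while the ASCII range splits "naïve" at the accented letter and skips the
other two words entirely.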
Xerces-flavored (Java) regex contains similar functionality.
Perl 5, I believe, supports the same level of functionality. I don't work
with it much, so I haven't paid much attention to the details, but I am
given to understand that the support is present.
If what you are after (as it appears) is a regex API to call from your own
application, I would look at a good Unicode library such as ICU
(http://oss.software.ibm.com/icu).
Best Regards,
Addison
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
+1 408.962.5487 (phone) +1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture.
It is not a feature.
Chair, W3C-I18N-WG Web Services Task Force
To participate see http://www.w3.org/International/ws
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Ben Dougall
> Sent: Tuesday, April 22, 2003 12:39 PM
> To: unicode@unicode.org
> Subject: regular expressions with unicode situation?
>
>
> i'm just wondering if anyone can tell me what the general state of play
> is at the moment regarding using regular expressions with unicode?
>
> i'm not even completely sure if / how the two would fit together
> completely or successfully? i've used regex in php, which was a version
> of posix regex, and found it very useful. i'm now doing stuff on a mac
> - os x (cocoa), and am starting work on an app that will analyse and
> dissect text and am wondering if i can make use of regular expressions.
> i want the app to work equally in all languages / character subsets. if
> regex in general only covers small portions of unicode i don't think
> it'll be so useful.
>
> any general info regarding regex in conjunction with unicode much
> appreciated. thanks.
>