Pattern Matching with Multilingual Regular Expressions
Intended Audience: Managers, Software Engineers, Systems Analysts, Marketers,
Technical Writers, Testers, Web Administrators, Designers
Session Level: Intermediate, Advanced
Regular expressions have long been popular in most
computing environments as a powerful tool for text and data pattern
matching and manipulation. They offer considerable
processing power to a broad range of applications through a
versatile and concise syntax that can be used to solve large and
small problems alike. However, regular expression implementations
have traditionally been designed to support Western European data only,
so certain match concepts are not well defined when
extended to support multiple languages. It is therefore highly
desirable to have a universal regular expression model that
works with languages of differing linguistic characteristics
and performs pattern matching in a locale-sensitive
manner. The Unicode Regular Expression Guidelines (UTR#18)
documents the general guidelines for adapting regular expression
engines to support Unicode and describes the levels of support
possible. This paper explores the design and development of a multilingual
regular expression engine capable of handling an arbitrary number of
languages and character sets. We will cover support for
locale-sensitive features such as Unicode character support,
character properties, linguistic ranges, special collation
elements, and equivalence classes, as well as common optimization
techniques and performance considerations. We will survey the
multilingual capabilities of the major existing regular expression
packages and utilities, including Perl 5, Java, GNU, and XML. In
conclusion, we will illustrate the ideas discussed by introducing
the new multilingual regular expression features in the upcoming
Oracle release, which brings the power of complete multilingual
regular expression search to the Oracle database through native support
in SQL and PL/SQL environments.
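
As a rough illustration of why match concepts become ill-defined beyond Western European data, the Python sketch below shows two such cases: an ASCII letter range that splits an accented word, and a literal pattern that misses a canonically equivalent spelling. It uses Python's standard `re` and `unicodedata` modules only as a convenient stand-in and does not represent the engine or the Oracle features described in this paper; the helper name `nfc_search` is hypothetical, and normalizing to NFC before matching is just one simple way to approximate canonical equivalence.

```python
import re
import unicodedata

# Two spellings of the same word: precomposed U+00E9 versus
# "e" followed by the combining acute accent U+0301.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

pattern = re.compile("caf\u00e9")


def nfc_search(pat, text):
    """Hypothetical helper: normalize the text to NFC, then search.

    This only approximates canonical-equivalence matching; a real
    multilingual engine may handle equivalence inside the matcher.
    """
    return pat.search(unicodedata.normalize("NFC", text))


# A plain code-point comparison misses the decomposed spelling.
raw_hit = pattern.search(decomposed) is not None

# After normalization, the same pattern finds it.
normalized_hit = nfc_search(pattern, decomposed) is not None

# An ASCII range treats the accented letter as a word break,
# while the Unicode-aware \w class keeps the word whole.
ascii_letters = re.findall(r"[a-z]+", "na\u00efve")
unicode_letters = re.findall(r"\w+", "na\u00efve")

print(raw_hit, normalized_hit, ascii_letters, unicode_letters)
```

Locale-sensitive notions such as linguistic ranges and collation elements go further than this sketch, since they depend on the collation rules of a given language rather than on code points alone.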