Re: Support for non-BMP characters

From: Steve Slevinski <slevin_at_signpuddle.net>
Date: Wed, 25 Apr 2012 10:02:14 -0500

On 4/25/12 9:04 AM, Jukka K. Korpela wrote:
> This can be really awkward when you would need advanced tools like
> Unicode regular expressions (JavaScript has just Ascii regexps)
>
Regular expressions for Unicode in JavaScript are possible but very
awkward, because they require manipulating UTF-16 surrogate pairs.

The example in the previously linked PDF states:
"You will never be able to write regexes like [𝒜-𝒵] since that gets
misinterpreted as [\uD835\uDC9C-\uD835\uDCB5]"

This regex can be rewritten as \uD835[\uDC9C-\uDCB5], which will work
just fine.
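
A minimal sketch of that rewrite, assuming a plain JavaScript engine with
no Unicode-aware regex support. The variable names are only for
illustration, and the strings are built from code units so the example
does not depend on the source file's encoding:

```javascript
// Fixed lead surrogate \uD835, then a class over the trail surrogates only.
const scriptCapital = /\uD835[\uDC9C-\uDCB5]/;

const first = String.fromCharCode(0xD835, 0xDC9C);  // U+1D49C, start of range
const last = String.fromCharCode(0xD835, 0xDCB5);   // U+1D4B5, end of range
const beyond = String.fromCharCode(0xD835, 0xDCB6); // U+1D4B6, just past the range

console.log(scriptCapital.test(first));  // true
console.log(scriptCapital.test(last));   // true
console.log(scriptCapital.test(beyond)); // false
console.log(scriptCapital.test("A"));    // false
```

This works because both ends of the range share the same lead surrogate,
so only the trail surrogate has to vary inside the brackets.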

If you're searching for a range where the lead surrogate changes, you'd
need to use the pipe | to combine one pattern per lead surrogate. Each
lead surrogate covers 0x400 code points, so to match from U+10000 to
U+107FF, you'd need to combine the following 2 ranges:
1) \uD800[\uDC00-\uDFFF]
2) \uD801[\uDC00-\uDFFF]

The Regular Expression would be:
(\uD800[\uDC00-\uDFFF]|\uD801[\uDC00-\uDFFF])
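
Here is a rough sketch for checking that expression at the edges of the
range. The helper toSurrogatePair is hypothetical (it is not a built-in),
and the ^ and $ anchors are added only so the test covers the whole
string:

```javascript
// Hypothetical helper: turn a code point in U+10000..U+10FFFF into the
// two-code-unit UTF-16 string that JavaScript stores for it.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const lead = 0xD800 + (offset >> 10);    // top 10 bits pick the lead surrogate
  const trail = 0xDC00 + (offset & 0x3FF); // low 10 bits pick the trail surrogate
  return String.fromCharCode(lead, trail);
}

// The alternation from above, anchored for whole-string tests.
const range = /^(\uD800[\uDC00-\uDFFF]|\uD801[\uDC00-\uDFFF])$/;

console.log(range.test(toSurrogatePair(0x10000))); // true, lead \uD800
console.log(range.test(toSurrogatePair(0x103FF))); // true, last one under \uD800
console.log(range.test(toSurrogatePair(0x10400))); // true, first one under \uD801
console.log(range.test(toSurrogatePair(0x107FF))); // true, end of the range
console.log(range.test(toSurrogatePair(0x10800))); // false, lead is \uD802
```

Since both alternatives here use the full trail range, the same match
could also be written as [\uD800\uD801][\uDC00-\uDFFF]. The pipe form is
the general one, for when the first or last lead surrogate only needs
part of its trail range.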

It's ugly and roundabout, but it does work.

Just saying,
-Steve
Received on Wed Apr 25 2012 - 10:05:20 CDT
