On 4/25/12 9:04 AM, Jukka K. Korpela wrote:
> This can be really awkward when you would need advanced tools like
> Unicode regular expressions (JavaScript has just Ascii regexps)
>
Regular expressions for Unicode in JavaScript are possible but very
awkward, because they require manipulating UTF-16 surrogate pairs.
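
For reference, the surrogate arithmetic can be sketched roughly as follows. The helper names toSurrogatePair and toEscape are made up for illustration; they are not from the PDF or from any library.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 lead and trail surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  var offset = codePoint - 0x10000;       // 20-bit value
  var lead = 0xD800 + (offset >> 10);     // top 10 bits
  var trail = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [lead, trail];
}

// Hypothetical helper: format a surrogate code unit as a \uXXXX escape,
// ready to paste into a regex source.
function toEscape(unit) {
  return "\\u" + unit.toString(16).toUpperCase();
}

var pair = toSurrogatePair(0x1D49C);
var escaped = toEscape(pair[0]) + toEscape(pair[1]); // "\\uD835\\uDC9C"
```
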
The example in the previously linked PDF states:
"You will never be able to write regexes like [𝒜-𝒵] since that gets
misinterpreted as [\uD835\uDC9C-\uD835\uDCB5]"
This regex can be rewritten as \uD835[\uDC9C-\uDCB5], which will work
just fine.
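
A quick way to check this, as a sketch: the anchors ^ and $ and the helper name isScriptCapital are my additions for testing a single character, not part of the pattern above.

```javascript
// The rewritten pattern: one fixed lead surrogate, then a range over the
// trail surrogate. Anchored here so it tests exactly one character.
var scriptCapital = /^\uD835[\uDC9C-\uDCB5]$/;

// Hypothetical helper: true when `s` is exactly one character
// in U+1D49C..U+1D4B5.
function isScriptCapital(s) {
  return scriptCapital.test(s);
}

isScriptCapital("\uD835\uDC9C"); // U+1D49C, first in range: true
isScriptCapital("\uD835\uDCB5"); // U+1D4B5, last in range: true
isScriptCapital("\uD835\uDCB6"); // U+1D4B6, just past the range: false
isScriptCapital("A");            // plain ASCII: false
```
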
If you're searching for a range where the lead surrogate changes, you'd
need to use the pipe | to combine two different patterns. To match
U+10000 through U+107FF, you'd need to combine the following 2 ranges:
1) \uD800[\uDC00-\uDFFF]
2) \uD801[\uDC00-\uDFFF]
The Regular Expression would be:
(\uD800[\uDC00-\uDFFF]|\uD801[\uDC00-\uDFFF])
It's ugly and roundabout, but it does work.
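
Here is a small sketch that exercises the alternation at the edges of the range. The anchors and the names asWritten, compact and inRange are assumptions of mine. Since both alternatives share the same trail range, the two lead surrogates can also be folded into one character class, shown as compact.

```javascript
// The alternation from above, anchored to test a single character.
var asWritten = /^(\uD800[\uDC00-\uDFFF]|\uD801[\uDC00-\uDFFF])$/;

// Equivalent shorter form: both lead surrogates in one class.
var compact = /^[\uD800\uD801][\uDC00-\uDFFF]$/;

// Hypothetical helper: true when `s` is exactly one character
// in U+10000..U+107FF according to both patterns.
function inRange(s) {
  return asWritten.test(s) && compact.test(s);
}

inRange("\uD800\uDC00"); // U+10000, first in range: true
inRange("\uD801\uDFFF"); // U+107FF, last in range: true
inRange("\uD802\uDC00"); // U+10800, just past the range: false
inRange("\uFFFF");       // U+FFFF, not a surrogate pair: false
```
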
Just saying,
-Steve
Received on Wed Apr 25 2012 - 10:05:20 CDT