Re: Unicode Sets in 'Unicode Regular Expressions'

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 28 May 2014 01:19:26 +0100

On Wed, 28 May 2014 00:56:40 +0200
Charlie Ruland ☘ <ruland_at_luckymail.com> wrote:

> So I take “Unicode set” to mean “set of Unicode characters” with
> their respective codepoints, whether decomposable or not.

The decomposability issue arises when trying to follow RL2.1
"Canonical Equivalence". In a pattern such as "f\p{L}te", \p{L}
cannot be just a set of codepoints if the pattern is to match "fête"
when processing NFD strings. This is one reason I think Ken
is right when he says the ICU meaning is intended. I believe I have a
coherent resolution of RL2.1, but I'm currently wrestling with the
other requirements that an implementation satisfying the spirit of
RL2.1 ought to address.
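
To make the problem concrete, here is a minimal sketch in Python. It is
only an illustration, not the resolution mentioned above. The helper
names (is_letter, naive_match) are hypothetical, and since Python's
standard re module has no \p{L}, the property test is approximated with
unicodedata.category. The sketch treats \p{L} as a plain set of
codepoints and shows that the pattern then matches the NFC form of
"fête" but not the NFD form.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """Approximate \\p{L}: General_Category starts with 'L'."""
    return unicodedata.category(ch).startswith("L")


def naive_match(text: str) -> bool:
    """Hypothetical matcher for "f\\p{L}te" that reads \\p{L} as
    exactly one codepoint drawn from the set of letters."""
    return (
        len(text) == 4
        and text[0] == "f"
        and is_letter(text[1])
        and text[2:] == "te"
    )


nfc = unicodedata.normalize("NFC", "f\u00eate")  # f, U+00EA, t, e
nfd = unicodedata.normalize("NFD", nfc)          # f, e, U+0302, t, e

print(naive_match(nfc))  # True
print(naive_match(nfd))  # False

# U+0302 COMBINING CIRCUMFLEX ACCENT is a mark (Mn), not a letter,
# so no single codepoint of the NFD string can stand for "ê".
print(unicodedata.category("\u0302"))  # Mn
```

Normalizing the input to NFC before matching would rescue this
particular case, but that is just one workaround under the assumptions
above; it does not help for sequences that have no precomposed form.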

Richard.

Received on Tue May 27 2014 - 19:20:29 CDT
