The concept of \b in a regular expression, meaning a match at the boundary
between a word and a non-word, was invented by Larry Wall for the Perl
programming language. This was before Unicode, and a word was defined
as alphanumerics plus the underscore, which fit well with how
identifiers in that computer language (and many others) were defined.
Essentially, \b is defined to match at the break between a run of word
characters and a run of non-word characters.
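
To make the run-based behaviour concrete, here is a minimal sketch. It uses Python rather than Perl purely for illustration; Python's re module has the same classic \b. The re.ASCII flag is an assumption, used to restrict \w to ASCII alphanumerics plus underscore and so mimic the pre-Unicode definition. The helper name split_classic is hypothetical. Splitting on an empty match needs Python 3.7 or later.

```python
import re

# Classic \b: a zero-width match between a run of word characters and a
# run of non-word characters. re.ASCII limits \w to [A-Za-z0-9_], which
# approximates the pre-Unicode definition of a "word".
CLASSIC_BOUNDARY = re.compile(r"\b", re.ASCII)


def split_classic(text):
    """Split text at every classic word boundary, dropping empty pieces."""
    return [piece for piece in CLASSIC_BOUNDARY.split(text) if piece]


print(split_classic("foo_bar = baz1;"))  # ['foo_bar', ' = ', 'baz1', ';']
print(split_classic("stop.  Next"))      # ['stop', '.  ', 'Next']
```

Note that the full stop and the two spaces come back as a single non-word run.
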
The latest version of Perl 5 (recently released) has added \b{w} based
on Unicode's definition. The typical expectation of its programmers is
that it would be a drop-in replacement for the old \b, with much better
results in parsing natural languages.
But it isn't such a replacement, which has created some consternation. The
main reason is that, unlike \b, it treats the boundary between adjacent
white space characters as a breaking opportunity, so it doesn't create
runs of them. Thus if you have two spaces after a full stop, it treats
each as an individual word.
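
The difference can be illustrated with a rough sketch, written in Python for illustration only. This is not an implementation of UAX29. The pattern in segments_approx is a hypothetical approximation that only mimics the behaviour described above: word runs stay together, and every other character, including each space, becomes its own segment. It ignores the real rules for apostrophes, numbers, line endings and so on. Both function names are made up for this example.

```python
import re


def segments_classic(text):
    """Runs of word characters alternating with runs of non-word characters."""
    return re.findall(r"\w+|\W+", text, flags=re.ASCII)


def segments_approx(text):
    """Hypothetical stand-in for the \\b{w} behaviour described above.

    Word runs are kept whole; every non-word character (each space
    included) is returned as a separate segment. NOT the UAX29 algorithm.
    """
    return re.findall(r"\w+|\W", text, flags=re.ASCII)


text = "stop.  Next"
print(segments_classic(text))  # ['stop', '.  ', 'Next']
print(segments_approx(text))   # ['stop', '.', ' ', ' ', 'Next']
```

With the classic definition the two spaces stay inside one non-word run. In the approximation each space is a segment of its own, which is the surprise reported here.
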
My question is: "Was this intentional, and if so, why?"
TR18 says \b{w} is a "Zero-width match at a Unicode word boundary. Note
that this is different than \b alone, which corresponds to \w and \W."
And UAX29 says "adjacent spaces are collapsed to a single space" in
intelligent cut and paste using the WB property.
Received on Sat Aug 22 2015 - 15:09:51 CDT