From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Sep 23 2007 - 01:31:33 CDT
Andy Heninger wrote:
> Regarding the question of how to complement a [^set] that
> contains strings, or grapheme clusters, or collation elements,
> or whatever we want to call them, I am still struggling with
> what it means, and what makes sense.
The first intuitive approach to what [^set] means is that it should match
everywhere [set] does not match, and [set] should match everywhere [^set]
does not match, i.e. they should be perfect complements of each other.
But already, they aren't perfect complements because both will exclude
line terminators in multiline mode. This is solved by saying that, in
multiline mode, there's no line terminator in any content that is accessible
to searches (each line is treated as a separate text, whose line terminators
have been dropped), so that "." (the universe for matching classes) is the
union of [set] and [^set] (these two subsets create a partition of the "."
set).
Now if you accept digraphs or grapheme clusters in [set], you should accept
them also in [^set] and "." should also include all digraphs and grapheme
clusters, but this means that "." will need to include all possible texts,
because digraphs are not limited in size. As this seems unreasonable
(because it would make counting the number of matches with "." impossible to
perform), it seems reasonable to exclude the possibility of using digraphs
in [set].
There remains the possibility of including (default) grapheme clusters in
the "." universe (and as a consequence in [set] and [^set]) but the exact
definition of grapheme clusters that are counted as 1 unit or "." remains a
problem to specify; this is something that will depend on the implementation
of regexps (some implementations will allow you to specify which strings the
"." universe includes), but at least for Unicode, the minimum universe
should be the UCS, limited to single code points.
But this means a regexp engine using this minimum set will be able to find E
WITH ACUTE only if it is encoded as a single code point, and not when it is
encoded in NFD form, even though the two are canonically equivalent; such a
regexp engine will then not be a conforming Unicode process, because it will
generate different output (distinct sets of matches) from canonically
equivalent inputs.
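
As an illustration only, here is a minimal Python sketch of that
non-conformance, assuming an engine that compares code points with no
normalization (Python's own "re" module behaves this way). The helper name
find_code_point and the sample word are hypothetical.

```python
import re
import unicodedata

E_ACUTE = "\u00C9"  # LATIN CAPITAL LETTER E WITH ACUTE, a single code point

# Two canonically equivalent encodings of the same sample text.
nfc_text = unicodedata.normalize("NFC", "CAF\u00C9")
nfd_text = unicodedata.normalize("NFD", "CAF\u00C9")  # ends with <E, U+0301>


def find_code_point(pattern: str, text: str) -> bool:
    """Search code point by code point, without normalizing anything."""
    return re.search(re.escape(pattern), text) is not None


found_in_nfc = find_code_point(E_ACUTE, nfc_text)
found_in_nfd = find_code_point(E_ACUTE, nfd_text)

# Equivalent inputs, different sets of matches.
print(found_in_nfc, found_in_nfd)  # True False
```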
So the idea of implementing regexps by making them find matches in the NFD
transformation of the input text is good, as it creates a conforming process.
The bad thing is that E WITH ACUTE is no longer a single character, and is then
absent from the "." universe and can't be part of [set] and [^set].
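
The drawback can be seen in a short Python sketch (an illustration, again
assuming an engine whose "." and character classes work on single code
points, as Python's "re" does): once everything is in NFD, the letter is two
units, and a class written with it becomes a set of two unrelated code points.

```python
import re
import unicodedata

nfd_e_acute = unicodedata.normalize("NFD", "\u00C9")  # "E" followed by U+0301

# "." consumes one code point at a time, so the letter counts as two units.
units = re.findall(".", nfd_e_acute)

# A class written as [É], once the pattern itself is put in NFD, is the
# two-member set {E, U+0301}, not the single letter E WITH ACUTE.
nfd_class = "[" + nfd_e_acute + "]"
matches_plain_e = re.fullmatch(nfd_class, "E") is not None
matches_whole_letter = re.fullmatch(nfd_class, nfd_e_acute) is not None

print(len(units), matches_plain_e, matches_whole_letter)  # 2 True False
```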
Another possibility is to include in the "." universe the NFD transformation
of every code point of the UCS, in such a way that the sequence <C,
COMBINING ACUTE ACCENT> is still counted as 1 unit, but <C, COMBINING ACUTE
ACCENT, COMBINING CEDILLA > will be counted as 2 "." units (but then
remember that "." is sensitive to Unicode versions).
But then, a search for <C WITH COMBINING ACUTE ACCENT>, equivalent to
a search for <C, COMBINING ACUTE ACCENT> in NFD form, will easily match the
text <C WITH COMBINING ACUTE ACCENT, COMBINING CEDILLA>, but should it match
the text encoded as <C WITH CEDILLA, COMBINING ACUTE ACCENT>, which is
canonically equivalent? If the intent is to produce a Unicode conforming
process, the answer should be yes. So the match will cover two non-contiguous
substrings in the scanned text, excluding the CEDILLA part!
* (1) If it includes C WITH CEDILLA completely, then the matched substring
contains more than just C WITH COMBINING ACUTE ACCENT (and this will be a
problem if the matched string is used as a candidate for replacement by
another text, because the replacement will drop the CEDILLA even though it
was not looked for in the regexp).
* (2) If it does not include the CEDILLA part, how do we perform the
replacement of the two substrings in the text that match the regular
expression? The way to do it would be to reorder the second substring just
after the first one (allowed because the match was for any text canonically
equivalent to the searched regexp), so that the CEDILLA will be left just
after the replacement string. After the replacement has been made, the text
may eventually be reordered in NFD or NFC form (but I think this should not be
performed by default, as the user may want to delete this CEDILLA manually
when it does not make sense to keep it after the replacement made in a text
editor).
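
The equivalence at stake in (1) and (2) can be checked with a small Python
sketch. It is only an illustration: it uses G instead of C, because Unicode
also has a precomposed U+1E08 (C WITH CEDILLA AND ACUTE) that would blur the
example, and the variable names are hypothetical. Note that in strict NFD the
cedilla (combining class 202) sorts before the acute accent (class 230), so
the searched sequence <G, ACUTE> is not a contiguous substring there either.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition with canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


ACUTE, CEDILLA = "\u0301", "\u0327"    # combining marks
G_ACUTE, G_CEDILLA = "\u01F4", "\u0122"  # precomposed letters

text_a = G_ACUTE + CEDILLA   # <G WITH ACUTE, COMBINING CEDILLA>
text_b = G_CEDILLA + ACUTE   # <G WITH CEDILLA, COMBINING ACUTE ACCENT>
pattern = nfd(G_ACUTE)       # <G, COMBINING ACUTE ACCENT>

# Both texts normalize to <G, CEDILLA, ACUTE>: they are canonically equivalent.
equivalent = nfd(text_a) == nfd(text_b)

# The pattern is a plain substring of neither the raw nor the NFD form of
# text_b: the cedilla sits between the base letter and the acute accent.
contiguous_raw = pattern in text_b
contiguous_nfd = pattern in nfd(text_b)

print(equivalent, contiguous_raw, contiguous_nfd)  # True False False
```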
With this profile (2), the universe "." need not include all possible
grapheme clusters (because they are countless). The "." universe contains:
* all Unicode code points except those that have a canonical
decomposition mapping in the UCD,
* plus the strings in NFD form that result from decomposing the UCD
characters that have a canonical decomposition mapping in the UCD.
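
The two rules above can be sketched in Python as follows. This is a rough
illustration, not a reference implementation: the function names are
hypothetical, "has a canonical decomposition" is approximated by "differs
from its own NFD form" (which also covers Hangul syllables), and the result
depends on the Unicode version of the interpreter's unicodedata module.

```python
import unicodedata


def has_canonical_decomposition(cp: int) -> bool:
    """True if the code point is changed by NFD normalization."""
    ch = chr(cp)
    return unicodedata.normalize("NFD", ch) != ch


def build_decomposed_units() -> set:
    """Collect the NFD strings of every decomposable code point."""
    units = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
            continue
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:
            units.add(decomposed)
    return units


DECOMPOSED_UNITS = build_decomposed_units()


def in_dot_universe(s: str) -> bool:
    """Is s a single "." unit under the two rules of profile (2)?"""
    if len(s) == 1:
        # Rule 1: a code point with no canonical decomposition.
        return not has_canonical_decomposition(ord(s))
    # Rule 2: the NFD form of a decomposable code point.
    return s in DECOMPOSED_UNITS


print(in_dot_universe("C"))        # True: no decomposition
print(in_dot_universe("\u0106"))   # False: C WITH ACUTE is decomposable
print(in_dot_universe("C\u0301"))  # True: NFD form of C WITH ACUTE
```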
When looking for matches in the text, the regexp engine will scan the text
by converting it internally into NFD form before looking up elements in
the "." universe above. Matches found will possibly include a discontinuous
trailer (with combining characters) in the original text (not in NFD form),
but they will be continuous in the internal NFD representation of the text.
The trailer will necessarily be in the last characters of the corresponding
match in the original text (it may be reduced to an NFD part of the last
character of this source text).
To perform replacements, one has to look at the last character in the
matched source text, decompose it to NFD in an internal store, then go
back to the beginning and compare it to the last matched cluster in the
regexp. The internal trailer, not matched, will not be replaced, but will be
reinserted in the text after the replacement string.
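
A possible shape for this replacement step, limited to a single cluster (one
base character plus its combining marks), is sketched below in Python. The
names split_cluster and replace_cluster are hypothetical, and the strategy
(remove the pattern's marks from the cluster, then verify canonical
equivalence of pattern + trailer) is one assumption about how the trailer
could be isolated, not a description of any existing engine.

```python
import unicodedata


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def split_cluster(cluster_nfd: str, pattern_nfd: str):
    """Return the unmatched trailer if the cluster is canonically
    equivalent to pattern + trailer, otherwise None."""
    if not cluster_nfd or not pattern_nfd:
        return None
    if cluster_nfd[0] != pattern_nfd[0]:  # base characters must agree
        return None
    leftover = list(cluster_nfd[1:])
    for mark in pattern_nfd[1:]:
        if mark not in leftover:
            return None
        leftover.remove(mark)  # this mark is consumed by the match
    trailer = "".join(leftover)
    # Moving the trailer after the pattern is only allowed when the result
    # is canonically equivalent to the original cluster.
    if nfd(pattern_nfd + trailer) != cluster_nfd:
        return None
    return trailer


def replace_cluster(source: str, pattern: str, replacement: str) -> str:
    """Replace the matched part and reinsert the trailer after it."""
    trailer = split_cluster(nfd(source), nfd(pattern))
    if trailer is None:
        return source  # no match: leave the text untouched
    return replacement + trailer


# <G WITH CEDILLA, COMBINING ACUTE ACCENT> searched for G WITH ACUTE:
# the cedilla is kept and follows the replacement string.
result = replace_cluster("\u0122\u0301", "\u01F4", "X")
print(result == "X\u0327")  # True
```

Two marks of the same combining class cannot be reordered, so a search for
<a, ACUTE> does not match <a, GRAVE, ACUTE> in this sketch.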
But it is still a conforming process that produces equal sets of matches
from canonically equivalent inputs.