From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 01 2007 - 12:32:55 CST
Mike wrote:
> > [...] as soon as you are introducing collation elements
> > in regexps, these are sorted by collation, and collations are
> > locale-sensitive...
>
> I don't see why they need to be sorted. All that matters is
> that you find the longest match. [a-z\q{ch}] will match "ch"
> in "chinchilla" rather than just "c".
And what can you do with the negated class? How do you define it
consistently?
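As an illustration of both points, here is a minimal Python sketch (Python's re module does not implement the \q{...} syntax of UTS #18; the alternation below is only an emulation I am assuming for discussion):

    import re

    # [a-z\q{ch}] emulated by listing the multi-character collation
    # element before the single characters, so "ch" wins over "c":
    positive = re.compile(r"ch|[a-z]")
    print(positive.match("chinchilla").group())   # "ch", not just "c"

    # The negated class [^a-z\q{ch}] has no such direct translation:
    # one possible reading is a negative lookahead, but whether that is
    # the "consistent" definition is exactly the open question.
    negative = re.compile(r"(?!ch|[a-z]).")
    print(negative.match("Chinchilla").group())   # "C" is outside the class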
For what you are defining, you are not creating the support necessary for
correct handling of locales; i.e. you are restricting yourself to the "C"
locale, whose collation is defined strictly by the binary order of code
points and nothing else.
So in this restricted "C" locale:
* The classes of collation elements will contain only single code points
(and effectively, in that locale, there is no possible extension of the set
of collation elements, which is exactly the range \u0000 to \u10FFFF in that
order, all with only primary differences, so collation elements are equal to
their collation keys)
* You won't recognize any Unicode canonical equivalence in regexps. (But
then why are you recognizing it in scanned texts? This is inconsistent.)
* You won't be able to recognize case mappings consistently (for
case-insensitive searches), because collation elements will all be distinct,
with only primary differences and no further levels.
* Even if you restrict yourself to the set of primary differences, the only
case mappings you will be able to recognize are the simple one-to-one case
mappings defined in the main UCD file, excluding special mappings (like the
consistent distinction of dotted and undotted Turkic "I"... or finding
matches containing "SS" when a German sharp s is specified in the search
string... or allowing matches for final variants of Greek letters; see the
sketch after this list)
* and many other restrictions.
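To make the case-mapping points concrete, here is a small Python sketch
(Python is my choice here for illustration only): simple one-to-one mappings
fail exactly where full case folding succeeds:

    # Simple one-to-one lowercasing cannot equate "SS" with the sharp s:
    print("STRASSE".lower())                            # "strasse"
    print("Straße".lower())                             # "straße" (still distinct)

    # Full (one-to-many) case folding makes them comparable:
    print("STRASSE".casefold() == "Straße".casefold())  # True: both "strasse"

    # Final and non-final Greek sigma also fold together:
    print("ΣΟΦΟΣ".casefold() == "σοφος".casefold())     # True: both "σοφοσ"

    # The Turkic dotted/undotted "I" still needs locale-aware tailoring,
    # which str.casefold() does not provide.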
This may finally be consistent for the "C" locale (with binary order), but
you have not solved any of the linguistic needs, and even worse, your regexp
matcher cannot be a Unicode-conformant process (because it will return
distinct sets of matches depending on the encoding, normalization, or
non-normalization of the input text and input regexps).
What you have done for now is a partial mix, which is intrinsically
inconsistent as soon as you start converting input texts to NFD (i.e.
applying a normalization to them without applying the same rule to the
regexps):
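The inconsistency is easy to demonstrate with a Python sketch (assuming the
implementation under discussion normalizes input text the way
unicodedata.normalize does):

    import re
    import unicodedata

    text = unicodedata.normalize("NFD", "café")  # input text converted to NFD
    pattern = re.compile("caf\u00e9")            # regexp left in NFC

    print(pattern.search(text))                  # None: equivalent, yet no match

    # Normalizing BOTH sides the same way restores consistent results:
    pattern_nfd = re.compile(unicodedata.normalize("NFD", "caf\u00e9"))
    print(pattern_nfd.search(text))              # match found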
I'm not advocating that Unicode regexps should support all locales. They
should support at least the legacy "C" locale (with binary order) and a
basic Unicode-based "U" locale (one that is *reasonably* neutral across many
locales), based on the full set of Unicode properties and the DUCET
collation elements (you have partly implemented this by recognizing many
Unicode properties, but not all those needed for consistency).
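As a sketch of what such a "U" locale could look like in practice, the
DUCET-based root collator of ICU behaves this way (this assumes the
third-party PyICU binding is installed; the names below are PyICU's, not
part of any proposal here):

    import icu  # PyICU, assumed installed

    # The root-locale collator approximates the locale-neutral DUCET order.
    coll = icu.Collator.createInstance(icu.Locale("root"))
    coll.setStrength(icu.Collator.PRIMARY)     # compare primary differences only

    print(coll.compare("resume", "résumé"))    # 0: equal at primary strength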
Other locales could be defined by tailoring (i.e. by allowing the use of
special case mappings for case-insensitive searches and the use of tailored
collations): many locales could be supported by a locally implemented
database, or by external databases specified by the user. Some tailoring,
not depending on this preinstalled support for specific locales, could be
specified directly within regexps, working from the predefined "U" locale
implemented with character properties defined up to a given Unicode version
(later versions would also be supported by user-defined tailorings installed
on the system, or by specific tailoring directly within regexps).
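For example, Czech tailors its collation so that "ch" is a letter sorting
after "h"; a minimal sketch with Python's stdlib locale module (this assumes
a cs_CZ.UTF-8 locale is installed on the system):

    import locale

    words = ["chleba", "cukr", "drak"]
    print(sorted(words))                       # binary "C" order: "chleba" first

    # With the tailored Czech collation, "ch" sorts after "h":
    locale.setlocale(locale.LC_COLLATE, "cs_CZ.UTF-8")
    print(sorted(words, key=locale.strxfrm))   # ['cukr', 'drak', 'chleba']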
One could avoid the cost of handling complete collation (and then revert to
the binary encoding), without affecting Unicode conformance (canonically
equivalent texts and regexps will be recognized), by making support for
tailored collations a flag given to the regexp engine (in grep, for example,
you would use a dash option, but in sed or vi you would pass the flag after
the regexp as a final flag character, like C for enabling UCA collation and
matching by collation elements rather than characters).
You could also disable finding canonical equivalences with another flag, but
then you must do it consistently, by disabling it BOTH in the input texts
AND in the input regexp, NOT only in the texts as you have done and propose.
You could make the C locale the default if you want (but in the C locale, no
normalization should be performed on the input text), but there should exist
a way to specify at least the Unicode-neutral locale, where normalization of
input texts is possible (and whose collation is the DUCET, unless explicitly
tailored by the content of the regexp itself).
However, I don't think that normalizing input texts (to NFD in your
implementation) is the best way to handle the found matches: normalization
not only changes the input text before scanning, it also reorders parts of
it, which creates severe difficulties for using the discovered matches, for
example to apply replacements or other Unicode transforms.
My opinion is that input texts should not be altered, and that normalization
should only be performed on output, and only if it is explicitly part of the
transforms applied to matches; even then, normalization need not be
performed on the whole output text (if needed, a user can perform it
separately, or enable it with an optional flag that is off by default).
It's not up to regexps to perform normalization; it's up to regexps to be
able to recognize classes of canonically equivalent texts and find identical
sets of matches in that case, if they are to be Unicode-compliant processes.
A sketch of this approach follows.
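Here is a minimal Python sketch of that last point (equiv_pattern is a
hypothetical helper name; a complete implementation would have to enumerate
every canonical reordering of combining marks, not just the two normal forms
shown):

    import re
    import unicodedata

    def equiv_pattern(literal):
        # Match any canonical form of the literal WITHOUT altering the
        # input text: alternate its NFC and NFD forms. (Incomplete: other
        # canonically equivalent mark orderings are not covered here.)
        nfc = unicodedata.normalize("NFC", literal)
        nfd = unicodedata.normalize("NFD", literal)
        return re.compile("(?:%s|%s)" % (re.escape(nfc), re.escape(nfd)))

    pat = equiv_pattern("café")
    print(pat.search("caf\u00e9"))   # matches NFC input
    print(pat.search("cafe\u0301"))  # matches NFD input; offsets are preserved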
This archive was generated by hypermail 2.1.5 : Mon Oct 01 2007 - 12:37:31 CST