From: Mark Davis (mark.davis@icu-project.org)
Date: Tue Oct 02 2007 - 14:52:09 CST
As far as I can tell, you are saying that there needs to be no syntactic
difference between specifying an exact match and specifying with regex
matching. I am never sure, however, because your messages are often so
difficult for me to understand that I give up after the first paragraph.
Anyway, assuming that this is what you are saying, I disagree, since there
are two different operations.
1. the set of characters whose names are (exact match to) "LATIN CAPITAL
LETTER A". This is already provided for as \p{name=LATIN CAPITAL LETTER A},
and is the same as [\u0041]
2. the set of characters whose names contain "LATIN CAPITAL LETTER A" (so
will match names that continue with " WITH DIAERESIS" and so on). This needs
new syntax, because it has to be different than the syntax already used in #1.
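
The two operations can be compared side by side with a minimal sketch in
Python, using the standard unicodedata module as a stand-in for a regex
engine's name property. The helper named_chars and the 0x250 scan limit are
illustrative assumptions, not part of any proposal in this thread.

```python
import sys
import unicodedata

TARGET = "LATIN CAPITAL LETTER A"

def named_chars(limit=sys.maxunicode + 1):
    """Yield (character, name) for each code point below `limit` that has a name."""
    for cp in range(limit):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if name:
            yield ch, name

# Operation 1: characters whose name is exactly TARGET.
# The scan stops at U+0250 (an arbitrary limit) only to keep the example fast.
exact = {ch for ch, name in named_chars(0x250) if name == TARGET}

# Operation 2: characters whose name contains TARGET.
containing = {ch for ch, name in named_chars(0x250) if TARGET in name}

print(sorted(exact))
print(len(containing))
```

The first set holds only U+0041, while the second also picks up characters
such as U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS.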
Mark
On 10/2/07, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
>
> Mark Davis wrote:
> > Trying to parse your language, what I read you as saying is that a
> > different equivalence operator could be used instead of the slashes, like
> > propname~value instead of propname=/value/
>
> If I also try to parse your language (that introduces the new concept of
> "equivalence operator") I still don't see any difference you are seeing
> between propname=value (which I correctly termed using "equals" or
> "equality") and propname~value.
>
> What do you really mean when you write "propname=/value/"? Is it a
> containment relation (what you seem to call now "equivalence") or equality
> relation?
> If we accepted both your proposal 2 (multiple values, which is just a
> particular kind of matching a regexp) and 3 (matching a regexp), then the
> slashes in proposal 3 are superfluous around the regexp value (or become
> optional and just complicate the syntax without any change in what will be
> matched by the regexp).
>
> So I don't see any semantic difference between "propname=value" and your
> proposed "propname=/value/" as soon as regexp values are accepted. The only
> important question is: do property values need to be matched according to
> a regular expression, or must they be matched only by equality?
>
> In your example: \N{MARK} would match nothing because there's no Unicode
> character named "MARK". If you want to match characters whose name
> *contains* the word "MARK", then you just need to include a ".*" prefix
> and suffix: "\N{.*MARK.*}".
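
This reading, in which the value is always a regular expression matched
against the whole name, can be sketched in Python as follows. The function
chars_matching is a hypothetical helper, and the use of re.fullmatch is an
assumption standing in for the proposed \N{...} semantics.

```python
import re
import sys
import unicodedata

def chars_matching(pattern):
    """Return the characters whose entire name matches the regexp `pattern`."""
    rx = re.compile(pattern)
    result = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for code points without a name
        if name and rx.fullmatch(name):
            result.add(ch)
    return result

# A plain value behaves like equality: no character is named just "MARK".
only_mark = chars_matching("MARK")

# Adding ".*" on both sides turns the same operation into containment.
contains_mark = chars_matching(".*MARK.*")

print(len(only_mark), len(contains_mark))
```

Under this reading, equality is simply the special case of a pattern with no
metacharacters.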
>
> Note that the Unicode character names have known constraints (according to
> the stability rules for the assignment of unique names):
> * they use only the letters [A-Z], the digits [0-9] and the space and
> hyphen.
> * Letter case is not significant (so "\N{SPACE}" would match the same
> thing as "\N{space}")
> * leading and trailing spaces or multiple spaces or hyphens are not
> significant (so "\N{LATIN SMALL LETTER A}" would match the same thing as
> "\N{ latin small letter - a }")
> * the words "letter" or "digit" or "mark" are not significant.
> * Other spaces and hyphens are normally not significant, so they can be
> removed from the name (but there's one exception for one Hangul vowel
> whose name makes a distinction between "OE" and "O-E")
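
A minimal Python sketch of the name folding described in this list follows.
It covers only the case, space, and hyphen rules plus the Hangul exception
as stated here; the claim about the words "letter", "digit", and "mark" is
deliberately left out. The function loose_key is a hypothetical name, and
this is not presented as the normative Unicode loose-matching rule.

```python
import re

# The one name whose hyphen is kept, per the exception described above.
HANGUL_EXCEPTION = "HANGUL JUNGSEONG O-E"

def loose_key(name):
    """Fold a character name to a comparison key: ignore case, spaces, hyphens."""
    # Upper-case and collapse runs of whitespace, trimming both ends.
    upper = " ".join(name.upper().split())
    if upper == HANGUL_EXCEPTION:
        # Keep the hyphen so this name stays distinct from "HANGUL JUNGSEONG OE".
        return upper.replace(" ", "")
    # Otherwise drop every space and hyphen.
    return re.sub(r"[ \-]", "", upper)

print(loose_key(" latin small letter - a "))
print(loose_key("HANGUL JUNGSEONG O-E"))
print(loose_key("HANGUL JUNGSEONG OE"))
```

Two names would then be treated as the same when their keys compare equal,
so the folding happens once at compile time rather than through extra syntax.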
>
> So implicitly, when matching a name property value in a regexp character
> class, the subregexp for the value can be compiled using case-insensitive
> rules and possibly weaker rules (according to the Unicode constraints
> above). These are just global compilation behaviours, and we probably don't
> need to complicate the syntax for something that is already invariable.
> There's no need to introduce a new "equivalence" operator for the specific
> need of matching character names, given that the regexp already encodes
> the property name using "\N{...}" or "\p{name=...}", which clearly
> indicates that we are trying to match Unicode character names (or sequences).
>
> Suppose you want to look for all characters that contain the *words* acute
> accent. I would just encode it as:
> \N{<ACUTE>} or \p{name=<ACUTE>} or as well \p{name=<acute>}
> (the angle brackets here are part of the regexp value to match, and are
> representing here a word boundary, replace them by the appropriate syntax
> used in the regexp)
> But I won't need the extra superfluous delimiting slashes in:
> \N{/<ACUTE>/} or \p{name=/<ACUTE>/}
> (it will match not only the combining accent itself, but also precomposed
> characters with an acute accent and whose name contains the "ACUTE" word).
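
The word-boundary match can be sketched in Python, where "\b" plays the role
of the angle brackets used above. The helper name_has_acute is hypothetical,
and re.IGNORECASE stands in for the case-insensitive compilation suggested
earlier.

```python
import re
import unicodedata

# "\b" is Python's word-boundary assertion, standing in for "<" and ">".
ACUTE_WORD = re.compile(r"\bACUTE\b", re.IGNORECASE)

def name_has_acute(ch):
    """True if the name of `ch` contains ACUTE as a whole word."""
    return bool(ACUTE_WORD.search(unicodedata.name(ch, "")))

for ch in ["\u0301", "\u00e9", "\u00b4", "e"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), name_has_acute(ch))
```

The combining accent, the spacing accent, and the precomposed letter all
match, while the plain letter does not.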
>
> We can then create a negated character class matching all characters that
> don't contain the same *words* using simply:
> \N{!<ACUTE>} or \p{name!=<ACUTE>} or \p{^name=<ACUTE>}
> or \P{name=<ACUTE>}
> (the multiple possibilities come from the number of alternate notations
> you support for classes of character names, or for negated classes; I'm
> not saying which of them will be the preferred one.)
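
Whatever notation is chosen, the negated class is just the complement of the
matching set. A small Python sketch, assuming that the complement is taken
over named characters only and over an arbitrary range below U+0250;
split_by_name is a hypothetical helper.

```python
import re
import unicodedata

def split_by_name(pattern, limit=0x250):
    """Partition named characters below `limit` by whether their name matches."""
    rx = re.compile(pattern, re.IGNORECASE)
    matching, complement = set(), set()
    for cp in range(limit):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if not name:
            continue  # assumption: unnamed code points belong to neither set
        if rx.search(name):
            matching.add(ch)
        else:
            complement.add(ch)
    return matching, complement

with_acute, without_acute = split_by_name(r"\bACUTE\b")
print(len(with_acute), len(without_acute))
```

How unnamed code points should behave under negation is a design question
that the sketch sidesteps.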
>
> But I won't need any superfluous delimiting slashes around the regexp
> value as suggested for your proposal 3:
> \N{/!<ACUTE>/} or \p{name!=/<ACUTE>/} or \p{^name=/<ACUTE>/}
> or \P{name=/<ACUTE>/}
>
> So your proposal 3 to support regexp values is good; I just don't see the
> point of introducing slashes here when you don't need them in your
> proposal 2. Your argument about the complication for the case of multiple
> values supported by your proposal 2 is not relevant: we are already in the
> context of evaluating regular expressions, so the complications are already
> implemented elsewhere in the regexp parser and in the matching engine, and
> will need to be supported anyway to accept proposal 3, i.e. regexp
> values.
>
-- Mark