From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Sep 23 2007 - 02:19:46 CDT
________________________________________
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of Mark Davis
Sent: Friday, 21 September 2007 18:33
To: Mike; Andy Heninger
Cc: unicode@unicode.org; UTC
Subject: Re: New Public Review Issue: Proposed Update UTS #18
> allowing multiple values in a property definition such as \p{gc=L|M|N} or
> \p{nv>=10}.
Allowing multiple values is a nice way to compact the regex. Similarly, in
my implementation I actually allow a regex within the property value, so for
example have \p{name=/.*MARK.*/} to pick up all the Unicode characters with
"MARK" in their name. A bit squirrely, but very handy. We might mention some
of these techniques as possibilities.
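
As an illustration of the \p{name=/.*MARK.*/} idea quoted above, here is a
minimal Python sketch. It is not Mark's implementation: the helper name
chars_with_name_matching is hypothetical, and the result depends on the
Unicode version shipped with the Python interpreter.

```python
import re
import sys
import unicodedata

def chars_with_name_matching(pattern):
    """Return the set of characters whose Unicode name fully matches `pattern`."""
    name_re = re.compile(pattern)
    matched = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # unicodedata.name() returns the default "" for unnamed code points
        name = unicodedata.name(ch, "")
        if name and name_re.fullmatch(name):
            matched.add(ch)
    return matched

# Rough equivalent of \p{name=/.*MARK.*/}
marks = chars_with_name_matching(r".*MARK.*")
print(len(marks), "!" in marks, "a" in marks)
```

The set is computed once and can then be used like any other character set
in the engine.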
> As far as your other comments (copied below), the issue is as to what
> [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
> • The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
> "ñ", "ch", "ll", "rr"}.
> • The set inversion would be the set of all other strings. So that would
> include "0", "A", ... but also "New York", and "onomatopoeic", and so on. An
> infinite set.
Why do you assume such a huge extension of the input universe?
The only requirement is that the inverted set be the universe minus the
positive set, and that /./ include every possible positive set, so that
{/[set]/, /[^set]/} is an exact partition of the universe of acceptable
input units.
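
A minimal Python sketch of this definition follows. The tiny universe and
the function name invert are hypothetical, chosen only to keep the example
short; the multi-character units "ch", "ll" and "rr" are assumed to be
members of the universe because they can appear in a positive set.

```python
def invert(positive, universe):
    """Complement of `positive` relative to a finite `universe` of input units."""
    if not positive <= universe:
        raise ValueError("positive set must lie inside the universe")
    return universe - positive

# Hypothetical tiny universe: a few single characters plus multi-character units
universe = frozenset("abcxyzñ0A ") | {"ch", "ll", "rr"}
positive = frozenset("abcxyzñ") | {"ch", "ll", "rr"}
negative = invert(positive, universe)

# The two sets form an exact partition of the universe
assert positive | negative == universe
assert not (positive & negative)
print(sorted(negative))
```

Strings such as "New York" never enter the inverted set here, because they
were never units of the universe in the first place.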
For you, it should be enough to include in the /./ universe all the UCD
codepoints that your regexp engine will accept in source texts, converted to
one of their normalized forms (such as NFC or NFD).
You are not required to include every code point of the UCS in /./. You may
restrict /./ to the assigned, valid characters that you accept in valid
/simple text/ regexps; in that case you must also accept them in valid
/[set]/ and /[^set]/ regexps. (The syntax used in the regexp to reference
them does not matter: it may require escaping them, but escaping changes
neither the universe nor what they represent.)
As a consequence a non-empty file that does not contain any match for [set]
will not necessarily contain a match for [^set]: this will be the case if
the file cannot be read as a series of units containing only elements of
your /./ universe (for example if it contains unassigned characters and your
/./ universe contains only assigned characters).
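
The following Python sketch illustrates that case under one assumption: the
universe is taken to be every code point whose General_Category is not Cn,
as reported by the interpreter's unicodedata module. The function names are
hypothetical.

```python
import unicodedata

def in_universe(ch):
    """Assumed universe: every code point whose General_Category is not Cn."""
    return unicodedata.category(ch) != "Cn"

def classify(text, positive):
    """Label each character as 'set', 'not-set', or 'outside' the universe."""
    labels = []
    for ch in text:
        if not in_universe(ch):
            labels.append("outside")   # matches neither [set] nor [^set]
        elif ch in positive:
            labels.append("set")
        else:
            labels.append("not-set")
    return labels

positive = set("abc")
print(classify("a1", positive))
# U+FFFF is a noncharacter; its General_Category is Cn, so it is outside
# this assumed universe.
print(classify("\uffff", positive))
```

A non-empty input made only of "outside" units matches neither /[set]/ nor
/[^set]/.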
Users expect, first of all, that /[set]/ and /[^set]/ form a partition of
the "universe" { /./ union /\R/ } of input units (the "alphabet" in
lexers).
This remains true even in single-line mode, where line terminators are
members of /./, including the two-character sequence /\q{\u000D\u000A}/,
because /\R/ is a set such that:
* in multiline mode, /\R/ contains this sequence and all the other
single-character line terminators in /[\n\v\f\r\p{Zl}\p{Zp}]/ that your
engine accepts in input files, and its intersection with /./ is empty;
* in single-line mode, /\R/ is fully included in /./, so /./ is itself the
"universe" and also matches any line terminator;
* in both cases, the "universe" is { /./ union /\R/ }, and your lexer can be
built on this finite universe, even if it is based on bitsets with no
internal representation of negated sets.
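
The three cases can be sketched in Python as follows. This is only an
illustration under stated assumptions: the names build_alphabet and
to_bitset are hypothetical, the non-terminator units are a tiny sample, and
the terminator list is the one given above, with CR LF kept as a single
two-character unit.

```python
# Line terminators from the list above; \p{Zl} is U+2028, \p{Zp} is U+2029.
LINE_TERMINATORS = frozenset(
    {"\n", "\v", "\f", "\r", "\u2028", "\u2029", "\r\n"}
)

def build_alphabet(plain_units, single_line):
    """Return (dot, newline, universe) as frozensets of input units."""
    plain = frozenset(plain_units) - LINE_TERMINATORS
    newline = LINE_TERMINATORS
    # In single-line mode /./ also matches every line terminator.
    dot = plain | newline if single_line else plain
    return dot, newline, dot | newline

def to_bitset(subset, index):
    """Encode a subset of the universe as an integer bitset."""
    bits = 0
    for unit in subset:
        bits |= 1 << index[unit]
    return bits

dot_m, nl_m, universe_m = build_alphabet("abc 0", single_line=False)
dot_s, nl_s, universe_s = build_alphabet("abc 0", single_line=True)

index = {unit: i for i, unit in enumerate(sorted(universe_m))}
full = to_bitset(universe_m, index)
positive = to_bitset({"a", "b"}, index)
# Negation is a plain mask against the finite universe; no "negated" flag.
negative = full & ~positive
print(bin(positive), bin(negative))
```

Because the universe is finite, /[^set]/ is stored as an ordinary bitset,
exactly like /[set]/.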