Unclear text in the UBA (UAX#9) of Unicode 6.3
nospam-abuse at ilyaz.org
Wed Apr 23 02:35:02 CDT 2014
On Tue, Apr 22, 2014 at 09:06:27AM -0700, Asmus Freytag wrote:
> if you read UAX#9, the way the algorithm works is by pushing openers
> on a stack, then, on finding the first closer, going down the stack
> and attempting to locate a match, then, on finding a match,
> discarding any enclosed openers, on not finding a match, discarding
> the closer.
I think I LOVE this definition. Simple, beautiful, and IMO following
people’s expectations very closely.
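
The quoted procedure can be sketched in a few lines of Python. This is only a minimal illustration, not the UAX#9 rule itself: the table PAIRS and the function name match_brackets are made up for the example, and every other detail of the real algorithm is ignored.

```python
# Hypothetical delimiter table; the real algorithm uses the Unicode
# paired-bracket data rather than this three-entry stand-in.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def match_brackets(text):
    """Return the sorted list of (open_index, close_index) matched pairs."""
    stack = []    # (index, opener) for openers still waiting for a closer
    matched = []
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((i, ch))
        elif ch in CLOSERS:
            # Go down the stack looking for the nearest opener that fits.
            for depth in range(len(stack) - 1, -1, -1):
                j, opener = stack[depth]
                if PAIRS[opener] == ch:
                    matched.append((j, i))
                    # Discard the matched opener and every opener it encloses.
                    del stack[depth:]
                    break
            # No opener fits: the closer is discarded (stays non-matching).
    return sorted(matched)


print(match_brackets("a(b[c)d]"))   # the "[" and "]" end up non-matching
```

Openers still on the stack at the end of the string, and closers that found no partner, are exactly the delimiters left "non-matching".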
Here is what “theoretizing” gives:
A parsing is good if it satisfies all of the conditions below:
0) Some delimiters in the string are marked as “non-matching”; the rest
are partitioned into disjoint “matched” pairs;
MATCH) A “matched” pair consists of an open-delimiter and a matching
close-delimiter (in this order in the string).
NEST) “Matched” pairs are properly nested (meaning that 2 pairs cannot be
positioned as Open1 Open2 Close1 Close2 in the string order).
MINLEN) “Inside” a “matched” pair, every delimiter which could match elements
of the pair but is marked as “non-matching” must nest inside
some deeper-nested “matched” pair.
(I hope that the meaning of the word “inside” in MINLEN is clear.)
GREED) Given any close-delimiter marked as “non-matching”, its
pre-context does not contain any open-delimiter which could
match it.
Here the pre-context of a position is a concatenation of substrings of the
string, obtained as follows:
• Take the most deeply nested “matched pair” containing the position
(if none, the whole string);
• take the part of the string inside this pair AND before the position;
• remove all “matched” pairs completely contained inside this substring
together with what they enclose.
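
The three steps above, and the GREED condition that uses them, can be sketched as follows. This is a rough illustration under stated assumptions: a parsing is represented as a list of (open_index, close_index) pairs that already satisfies NEST, the delimiter table PAIRS is hypothetical, and the names pre_context and satisfies_greed are invented for the example.

```python
# Hypothetical delimiter table (an assumption, not the Unicode bracket data).
PAIRS = {"(": ")", "[": "]", "{": "}"}
OPENER_OF = {close: opener for opener, close in PAIRS.items()}


def pre_context(pairs, pos):
    """Return the string indices that make up the pre-context of pos.

    pairs is a list of (open_index, close_index) matched pairs,
    assumed to be properly nested.
    """
    # Step 1: the most deeply nested matched pair containing pos
    # (with proper nesting, that is the one with the largest opener index).
    enclosing = [(o, c) for o, c in pairs if o < pos < c]
    start = max(enclosing)[0] + 1 if enclosing else 0
    # Step 2: the part inside that pair and before pos.
    keep = set(range(start, pos))
    # Step 3: drop matched pairs lying wholly in that part, with contents.
    for o, c in pairs:
        if start <= o and c < pos:
            keep -= set(range(o, c + 1))
    return sorted(keep)


def satisfies_greed(text, pairs):
    """Check GREED for every close-delimiter marked as non-matching."""
    matched = {i for pair in pairs for i in pair}
    for pos, ch in enumerate(text):
        if ch in OPENER_OF and pos not in matched:
            context = pre_context(pairs, pos)
            if any(text[i] == OPENER_OF[ch] for i in context):
                return False   # an opener that could match was skipped
    return True
```

For instance, leaving both delimiters of "(a)" unmatched violates GREED, since the pre-context of the closer still contains the "(" that could match it.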
P.S. Judging by another message of yours, for you “theoretizing” is a
4-letter word… Oh well…