Unclear text in the UBA (UAX#9) of Unicode 6.3
asmusf at ix.netcom.com
Tue Apr 22 01:25:05 CDT 2014
On 4/21/2014 8:32 PM, Ilya Zakharevich wrote:
> On Mon, Apr 21, 2014 at 06:08:12PM -0700, Asmus Freytag wrote:
>> Here's the text I supplied, with numbers added for discussion. It
>> definitely needs some
>> editing, but the point of the exercise would be to see what:
>> 1. A bracket pair is a pair of characters consisting of an opening
>> paired bracket and a closing paired bracket such that the
>> Bidi_Paired_Bracket property value of the former equals the latter,
>> subject to the following constraints.
>> a - both characters of a pair occur in the same isolating run
>> b - the closing character of a pair follows the opening character
>> c - any bracket character can belong at most to one pair, the
>> earliest possible one
>> d - any bracket character not part of a pair is treated like an
>> ordinary character
>> e - pairs may nest properly, but their spans may not overlap
>> 2. Bracket characters with canonical decompositions are
>> supposed to be treated
>> as if they had been normalized, to allow normalized and
>> non-normalized text
>> to give the same result.
>> c) needs rewording, because it is not correct
>> The BD16 examples show
>> a ( b ) c ) d 2-4
>> a ( b ( c ) d 4-6
>> From that, it follows that it's not the earliest but the one with the smallest span.
> Sorry, I do not see any definition here. Just a collection of words
> which looks like a definition, but only locally…
Thank you for the high praise. :?
Now you deleted language which I will restore here, put into a
reasonable order, and complete with the suggested
edit on "c":
d) brackets are resolved at the earliest opportunity, starting from the beginning of the text.
c) if there are two possible ways to resolve a pair, the one spanning less text is used.
f) unpaired bracket characters remaining inside a resolved bracket pair are treated as
ordinary characters (get ignored for bracket matching purposes).
> And I think I can even invent an example which I cannot parse using
> your definition:
> 1( 2[ 3( 4] 5) 6)
> Is looking-at-1 forcing match of 3-and-5? Or what?
Let's see what the text gives (before we improve it further).
1. - 1( or 3( could match 5) or 6) , 2[ could only match 4]
a. - we have only one isolating run, so this is a no-op
b. - all closing characters follow their putative opening characters, so
this is a no-op
d. - at location 5 is the earliest opportunity to match a pair
(before we get to 5 we don't have an opening and a closing)
c. - we could match 1( or 3( but we use 3, because it spans less text
e. , f. - can probably combine these, but 4] is now inside a resolved
pair and is ignored.
Now, when we reach 6) we have another pair, and per d, it's the earliest
we can resolve it, so we match 1( and 6).
Now I add something to your example
1( 2[ 3( 4] 5) 6) 7]
even though 2[ and 7] properly surround 3( and 5), they can't match:
1( and 6) surround only 2[, which makes it unpaired and ignored (per f.).
If the example had been
1( 2[ 3( 4] 5) 6] 7)
then, on reaching 6] we could have matched it with 2[ and 7) with 1(
Eli's definition starts
A bracket pair is a pair of an opening paired bracket and a closing
paired bracket characters within the same isolating run sequence,
such that the Bidi_Paired_Bracket property value of the former
character or its canonical equivalent equals the latter character or
its canonical equivalent, ....
....and all the opening and closing bracket
characters in between these two are balanced.
That continuation we found out was incorrect, so we would need to fix it.
Here's an attempt:
... subject to the following conditions:
a. a match is attempted at the left-most closing bracket character
unmatched at this point
b. the closest earlier matching opening bracket, that is unmatched
at this point is used to form the pair
c. any unmatched bracket character enclosed in a pair is ignored
for further matching
d. matching ends when no more pairs can be formed
I believe with this, you can parse the examples in UAX#9 and the examples
we discussed here. If not, I'd appreciate it if you could help identify and
remedy any gaps.
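
To make conditions a. through d. easier to test, here is a minimal sketch of them as a single left-to-right scan. It is only one possible reading of the wording, not the UAX#9 reference implementation. The PAIRS table is a hypothetical stand-in for the Bidi_Paired_Bracket property, the function name is made up for illustration, canonical equivalents are not handled, and the whole string is assumed to be one isolating run sequence. Positions are 1-based, as in the examples above.

```python
# Hypothetical stand-in for the Bidi_Paired_Bracket property: opening -> closing.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def bracket_pairs(text):
    """Return (open, close) positions, 1-based, in the order pairs are formed.

    Assumes `text` is a single isolating run sequence (an assumption of
    this sketch, not something the function checks).
    """
    pending = []  # opening brackets still unmatched: (position, character)
    found = []
    for pos, ch in enumerate(text, start=1):
        if ch in PAIRS:
            pending.append((pos, ch))
        elif ch in CLOSERS:
            # a. scanning left to right, this is the left-most closing
            #    bracket that is unmatched at this point
            # b. look for the closest earlier unmatched matching opener
            for i in range(len(pending) - 1, -1, -1):
                open_pos, open_ch = pending[i]
                if PAIRS[open_ch] == ch:
                    found.append((open_pos, pos))
                    # c. openers enclosed by the new pair are ignored
                    #    for further matching
                    del pending[i:]
                    break
            # a closer with no matching opener simply stays unmatched
    # d. the scan ends when no more pairs can be formed
    return found


print(bracket_pairs("a(b)c)d"))  # the first BD16 example quoted above
print(bracket_pairs("a(b(c)d"))  # the second BD16 example quoted above
```

For the two BD16 examples this reports the pairs 2-4 and 4-6 respectively. Read this literally, the sketch pairs 2[ with 4] in the 1( 2[ 3( 4] 5) 6) example, because 4] is the left-most closing bracket with a matching opener; that differs from the walk-through of the earlier wording above, so it is worth checking which outcome is intended.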