Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Mon, 21 Apr 2014 07:32:15 -0700

On 4/21/2014 1:33 AM, Eli Zaretskii wrote:
>> Date: Sun, 20 Apr 2014 23:03:20 -0700
>> From: Asmus Freytag <asmusf_at_ix.netcom.com>
>> CC: Eli Zaretskii <eliz_at_gnu.org>, unicode_at_unicode.org,
>> Kenneth Whistler <ken_at_unicode.org>
>>
>>>> Note that the current embedding level is not changed by this rule.
>>>>
>>>> What does this last sentence mean by "the current embedding level"?
>>>> The first bullet of X6 mandates that "the current character’s
>>>> embedding level" _is_ changed by this rule, so what other "current
>>>> embedding level" is alluded to here?
>>> I'm punting on that one - can someone else answer this?
>>>
>>>
>>> I assume "current embedding level" here meant "the embedding level of
>>> the last entry on the directional status stack". (This is a natural
>>> slip to make if you think in terms of an optimized implementation that
>>> stores each component of the top of the directional status stack in a
>>> variable, as suggested in 3.3.2.)
>>>
>>> James
>>>
>> In general, I heartily dislike "specifications" that just narrate a
>> particular implementation...
> I cannot agree more.
>
> In fact, my main gripe about the UBA additions in 6.3 is that some of
> their crucial parts are not formally defined, except by an algorithm
> that narrates a specific implementation. The two worst examples of
> that are the "definitions" of the isolating run sequence and of the
> bracket pair. I didn't ask about those because I managed to figure
> them out, but it took many readings of the corresponding parts of the
> document. It is IMO a pity that the two main features added in 6.3
> are based on definitions that are so hard to penetrate, and which
> actually all but force you to use the specific implementation
> described by the document.
>
> My working definition that replaces BD13 is this:
>
> An isolating run sequence is the maximal sequence of level runs of
> the same embedding level that can be obtained by removing all the
> characters between an isolate initiator and its matching PDI (or
> paragraph end, if there is no matching PDI) within those level runs.
>
> As for bracket pair (BD16), I'm really amazed that a concept as easy
> and widely known/used as this would need such an obscure definition
> that must have an algorithm as its necessary part. How about this
> instead:
>
> A bracket pair is a pair consisting of an opening paired bracket and a
> closing paired bracket character within the same isolating run sequence,
> such that the Bidi_Paired_Bracket property value of the former
> character or its canonical equivalent equals the latter character or
> its canonical equivalent, and all the opening and closing bracket
> characters in between these two are balanced.
>
> Then we could use the algorithm to explain what it means for brackets
> to be balanced (for those readers who somehow don't already know
> that).
>
> Again, thanks for clarifying these subtle issues. I can now proceed
> to updating the Emacs bidirectional display with the changes in
> Unicode 6.3.
>
>
FWIW here is the restatement of BD16 that I used for myself (and that I put
into the source comments of the sample Java implementation):

     // The following is a restatement of BD 16 using non-algorithmic
     // language.
     //
     // A bracket pair is a pair of characters consisting of an opening
     // paired bracket and a closing paired bracket such that the
     // Bidi_Paired_Bracket property value of the former equals the latter,
     // subject to the following constraints.
     // - both characters of a pair occur in the same isolating run sequence
     // - the closing character of a pair follows the opening character
     // - any bracket character can belong at most to one pair, the
     //   earliest possible one
     // - any bracket character not part of a pair is treated like an
     //   ordinary character
     // - pairs may nest properly, but their spans may not overlap otherwise

     // Bracket characters with canonical decompositions are supposed to
     // be treated as if they had been normalized, to allow normalized and
     // non-normalized text to give the same result.

Your language is more concise, but you may compare for differences.
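
For anyone comparing the two wordings, here is a minimal sketch of how the
constraints above can play out in code. It is not the sample implementation:
the class and method names are made up for illustration, the bracket table is
a hypothetical stand-in that covers only ASCII brackets instead of the real
Bidi_Paired_Bracket data, canonical equivalents are ignored, no limit is put
on the stack depth, and the input string is assumed to be one complete
isolating run sequence.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Illustrative only: names and the tiny bracket table are assumptions.
class BracketPairSketch {
    // Stand-in for Bidi_Paired_Bracket: OPENERS.charAt(i) pairs with CLOSERS.charAt(i).
    private static final String OPENERS = "([{";
    private static final String CLOSERS = ")]}";

    /** Returns {openIndex, closeIndex} pairs, sorted by the opening position. */
    static List<int[]> findPairs(String isolatingRunSequence) {
        List<int[]> pairs = new ArrayList<>();
        // Each stack entry is {expected closing character, position of the opener}.
        Deque<int[]> stack = new ArrayDeque<>();
        for (int i = 0; i < isolatingRunSequence.length(); i++) {
            char c = isolatingRunSequence.charAt(i);
            int open = OPENERS.indexOf(c);
            if (open >= 0) {
                stack.push(new int[] {CLOSERS.charAt(open), i});
            } else if (CLOSERS.indexOf(c) >= 0) {
                // Search from the top of the stack for the nearest matching opener.
                int depth = 0;
                boolean found = false;
                for (int[] entry : stack) { // iterates from the most recent push
                    depth++;
                    if (entry[0] == c) {
                        pairs.add(new int[] {entry[1], i});
                        found = true;
                        break;
                    }
                }
                // Discard the matched opener and every opener above it; the
                // discarded openers end up as ordinary characters. An
                // unmatched closer is simply skipped.
                if (found) {
                    for (int k = 0; k < depth; k++) {
                        stack.pop();
                    }
                }
            }
        }
        pairs.sort(Comparator.comparingInt(p -> p[0]));
        return pairs;
    }
}
```

With this sketch, "(a[b]c)" yields the nested pairs at positions 0-6 and 2-4,
while "([)]" yields only the pair at 0-2: the "[" is dropped when ")" finds
its opener, so the later "]" has nothing left to match and both square
brackets are treated as ordinary characters.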

A./

Received on Mon Apr 21 2014 - 09:33:21 CDT
