Unclear text in the UBA (UAX#9) of Unicode 6.3
asmusf at ix.netcom.com
Mon Apr 21 09:32:15 CDT 2014
On 4/21/2014 1:33 AM, Eli Zaretskii wrote:
>> Date: Sun, 20 Apr 2014 23:03:20 -0700
>> From: Asmus Freytag <asmusf at ix.netcom.com>
>> CC: Eli Zaretskii <eliz at gnu.org>, unicode at unicode.org,
>> Kenneth Whistler <ken at unicode.org>
>>>> Note that the current embedding level is not changed by this rule.
>>>> What does this last sentence mean by "the current embedding level"?
>>>> The first bullet of X6 mandates that "the current character’s
>>>> embedding level" _is_ changed by this rule, so what other "current
>>>> embedding level" is alluded to here?
>>> I'm punting on that one - can someone else answer this?
>>> I assume "current embedding level" here meant "the embedding level of
>>> the last entry on the directional status stack". (This is a natural
>>> slip to make if you think in terms of an optimized implementation that
>>> stores each component of the top of the directional status stack in a
>>> variable, as suggested in 3.3.2.)
>> In general, I heartily dislike "specifications" that just narrate a
>> particular implementation...
> I cannot agree more.
> In fact, my main gripe about the UBA additions in 6.3 is that some of
> their crucial parts are not formally defined, except by an algorithm
> that narrates a specific implementation. The two worst examples of
> that are the "definitions" of the isolating run sequence and of the
> bracket pair. I didn't ask about those because I managed to figure
> them out, but it took many readings of the corresponding parts of the
> document. It is IMO a pity that the two main features added in 6.3
> are based on definitions that are so hard to penetrate, and which
> actually all but force you to use the specific implementation
> described by the document.
> My working definition that replaces BD13 is this:
> An isolating run sequence is the maximal sequence of level runs of
> the same embedding level that can be obtained by removing all the
> characters between an isolate initiator and its matching PDI (or
> paragraph end, if there is no matching PDI) within those level runs.
> As for bracket pair (BD16), I'm really amazed that a concept as easy
> and widely known/used as this would need such an obscure definition
> that must have an algorithm as its necessary part. How about this:
> A bracket pair is a pair of characters, an opening paired bracket and
> a closing paired bracket, within the same isolating run sequence,
> such that the Bidi_Paired_Bracket property value of the former
> character or its canonical equivalent equals the latter character or
> its canonical equivalent, and all the opening and closing bracket
> characters in between these two are balanced.
> Then we could use the algorithm to explain what it means for brackets
> to be balanced (for those readers who somehow don't already know).
> Again, thanks for clarifying these subtle issues. I can now proceed
> to updating the Emacs bidirectional display with the changes in
> Unicode 6.3.
FWIW here is the restatement of BD16 that I used for myself (and that I put
into the source comments of the sample Java implementation):
// The following is a restatement of BD 16 using non-algorithmic
// language.
//
// A bracket pair is a pair of characters consisting of an opening
// paired bracket and a closing paired bracket such that the
// Bidi_Paired_Bracket property value of the former equals the latter,
// subject to the following constraints.
// - both characters of a pair occur in the same isolating run sequence
// - the closing character of a pair follows the opening character
// - any bracket character can belong at most to one pair, the
//   earliest possible one
// - any bracket character not part of a pair is treated like an
//   ordinary character
// - pairs may nest properly, but their spans may not overlap otherwise
// Bracket characters with canonical decompositions are supposed to
// be treated as if they had been normalized, to allow normalized and
// non-normalized text to give the same result.
Your language is more concise, but you may compare for differences.
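
For comparison, the constraints in the restatement above can be turned into a minimal sketch. This is not the sample Java implementation: the class name BracketPairs, the method find, and the three ASCII pairs standing in for the Bidi_Paired_Bracket property are all assumptions made for illustration. It treats the whole string as a single isolating run sequence and ignores canonical equivalents.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Sketch only: finds bracket pairs in one isolating run sequence, after BD16. */
final class BracketPairs {
    // Hypothetical stand-in for the Bidi_Paired_Bracket data: three ASCII pairs.
    private static final String OPENERS = "([{";
    private static final String CLOSERS = ")]}";

    /** Returns {openIndex, closeIndex} pairs, sorted by openIndex. */
    static List<int[]> find(String run) {
        List<int[]> pairs = new ArrayList<>();
        // Each entry holds {expected closing character, position of the opener}.
        Deque<int[]> stack = new ArrayDeque<>();
        for (int i = 0; i < run.length(); i++) {
            char c = run.charAt(i);
            int open = OPENERS.indexOf(c);
            if (open >= 0) {
                stack.push(new int[] {CLOSERS.charAt(open), i});
            } else if (CLOSERS.indexOf(c) >= 0) {
                // Search from the top of the stack for the nearest opener
                // that expects this closing character.
                int depth = 0;
                boolean found = false;
                for (int[] entry : stack) {
                    depth++;
                    if (entry[0] == c) {
                        pairs.add(new int[] {entry[1], i});
                        found = true;
                        break;
                    }
                }
                // Pop the matched opener and every opener above it; those
                // skipped openers can no longer pair without overlapping.
                if (found) {
                    for (int k = 0; k < depth; k++) {
                        stack.pop();
                    }
                }
                // An unmatched closer is simply an ordinary character.
            }
        }
        pairs.sort((a, b) -> Integer.compare(a[0], b[0]));
        return pairs;
    }
}
```

With this sketch, "{[()]}" yields the pairs 0-5, 1-4 and 2-3, while "([)]" yields only 0-2: the square brackets would overlap the round pair, so they are left unpaired.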