L2/13-118
To: UTC
Re: 6.3 BIDI feedback review
From: Mark Davis
Live: http://goo.gl/5zNc8
I reviewed the Accumulated Feedback on PRI #232. The relevant editorial changes have been made, but I think the following issues still need consideration by the committee.
Note: I did not review P. Verdy’s or C. E. Whitehead’s comments; I leave that to committee members.
b. Implementers may want to include more items in each entry, thus maybe replace "consists of" by "includes".
MD: We don’t specify the implementation; we just supply a logical algorithm. People are free to modify as long as they get the same results, so we don’t have to (and shouldn’t) specify every case where it would be possible to extend the algorithm.
SUGGESTION: N0 will act on level runs (and not isolating run sequences as currently specified).
Rationale:
- cases b and c will behave similarly (no automatic pairing).
- if the author is smart enough to use embeddings (which justifies not pairing automatically in case b), he/she is no less smart in case c, so that he/she will take care of the level of parentheses if needed.
- simplified implementation of pairing
AG: Disagree, not changed. Isolates will be used frequently in dynamic text. The parenthesis algorithm needs to apply in such contexts. Embeddings could not be given parallel treatment since they do not form isolating run sequences. Solution is to use the isolates.
24) [...] Missing are the less-than and greater-than signs.
One important use case for pairing them is presenting XML or
HTML source code where tags and attributes are English, attribute
values may be anything, and the text between tags may also be of
any direction.
AL: I agree. Another important use case is email addresses like
"John Doe <john@doe.com>", which in RTL comes out with
the angle brackets mismatched.
While it is true that when used as less than and greater than
signs in math expressions, pairing these characters is inappropriate,
I think that it would be hard to come up with examples where not
only is a less than sign (used as such) followed by a greater
than sign, but applying the BPA to them would actually change the
display order.
AG: <> are not in the current version of BidiBrackets-6.3.0.txt, but we should discuss at UTC. I'm in favour of adding them.
MD: I am too, for reasons cited.
29) About the review note at the end of this section: I think that this is not the place to add more examples. In a normative document like this one, the role of the examples is to clarify the intent, not to justify it.
MD: I think examples always help, including examples that provide motivation. (I agree that we don’t want to be too “proposally” in the language, however.)
Since we are making such drastic changes to bidi, I suggest we also
bump up the 61 limit. I suggest either not specifying a limit, or
something like "at least 253" kind of wording.
One of my concerns is that if, for example, a web browser ends up
using isolates or embedding characters when converting a div to text
copied to clipboard, then the deeply nested div structures of today's
web sites will make it feasible to reach the current 61 limit in a
realistic use case.
Not a huge deal, but given the computing resources of this decade,
it costs essentially nothing to at least bump it up.
MD: The committee should consider.
It would be easier for implementers of TUS if a single uniform format were
adopted, and all new data files conformed to it. And, that format should
require a minimum of effort to add to implementations.
The format of BidiBrackets.txt, for example, requires one to teach the
implementation that column 2 is one property and column 3 is another. That is
extra work that could be avoided if the new files came in a format that didn't
require it. An existing file with such a syntax is DerivedCoreProperties.txt.
That format could easily be adapted for non-binary properties, and many other
formats are possible. But my point is that you should publish the files in
some such format to make it easier on implementers. We are stuck with the
format of already-published files, but we can do better for future files.
Similarly, the now machine-readable @missings lines are inconsistent.
In BidiBrackets.txt it is
# @missing: 0000..10FFFF; <none>; n
Compare that to an @missings line in PropertyValueAliases.txt
# @missing: 0000..10FFFF; Bidi_Mirroring_Glyph; <none>
There are three columns in each, but the meanings of column 2 are
inconsistent. One is a property name, and one is a property value. And the
third column in one gives the default value for the property in column 2. The
other gives the default for an unnamed property that has to be taught to the
implementation. If new @missings lines followed the syntax from
PropertyValueAliases.txt, no teaching would be necessary. In
BidiBrackets.txt, there would be two such lines, one for each property. My
implementation already deals with the possibility of multiple @missings lines
per file, as several existing files have them.
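To make the inconsistency concrete, here is a minimal Python sketch of reading the two @missing lines quoted above. The function names are hypothetical, and the column-to-property mapping passed to defaults_from_positional is exactly the per-file knowledge that has to be supplied by hand; this is an illustration, not a description of any existing parser.

```python
import re

# Matches "# @missing: 0000..10FFFF; field; field ..." comment lines.
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\s*;\s*(.*)$"
)


def parse_missing(line):
    """Return (start, end, fields) for an @missing line, or None."""
    m = MISSING_RE.match(line.strip())
    if m is None:
        return None
    start, end = int(m.group(1), 16), int(m.group(2), 16)
    fields = [f.strip() for f in m.group(3).split(";")]
    return start, end, fields


def defaults_from_named(fields):
    """PropertyValueAliases.txt style: property name, then its default.

    The line is self-describing, so no extra knowledge is needed.
    """
    prop, value = fields
    return {prop: value}


def defaults_from_positional(fields, column_properties):
    """BidiBrackets.txt style: one default per data column.

    column_properties is NOT in the file; the implementer must supply it.
    """
    return dict(zip(column_properties, fields))
```

With these helpers, the PropertyValueAliases.txt line yields {"Bidi_Mirroring_Glyph": "<none>"} directly, while the BidiBrackets.txt line yields defaults only after the caller passes in the property names for columns 2 and 3.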
[...]
The bottom line of what I was trying to say is that going forward, each new
data file should be in a form that doesn't require manual intervention to
specify to an implementor. This could be because the format of the file has
each line contain only values for a single property, and includes that
property name; or there could be machine-readable comments that describe the
format of each entry, so that the file becomes self-describing.
Currently, one has to know the file's format in order to interpret the
@missings (supposedly) machine-readable line in this file. In the past, I've
coped with this by using the @missings lines in PropertyValueAliases.txt, but
there is no @missings entry there for Bidi_Paired_Bracket_Type. I presume
that is an oversight that will be fixed before final publication. But, I
believe all @missings lines should look like those in
PropertyValueAliases.txt, with each containing full information, and not
depending on the format of the file they are contained in.
MD: As I recall, the committee made a conscious decision to present the values in multiple fields (as we do in some other files), so that they are exactly parallel. Our formats do have to balance multiple, sometimes competing, goals:
What we have ended up doing for these is to generate Derived files that have a simple format; we may want to consider that here as well.
I agree that the <none> is a problem. It is a pain to have to deal with data values that are either code points or sequences of code points, and have a <none> value. They don’t correspond nicely to APIs, especially for single code points, where you always want to return a primitive type.
As to the @missing, I think it should exactly mirror whatever data lines are in the file in terms of content. I have an action to look at that.
2. I agree with Aharon Lanin that it should be made clear that all characters
with Bidi_Paired_Bracket_Type values Open or Close have bidi class ON (the
note at the end of rule N0, bullet d implies that, but it should be mentioned
explicitly); in fact, I think it ought to be a Unicode Stability Policy.
3. It might be worth mentioning (in the Implementation Notes section, perhaps)
that Rule N0 and the associated definition BD16 can be implemented without
actually creating a stack or list that BD16 calls for; such an implementation
would be slower, but could require less memory, which can be important for
embedded systems with limited RAM.
One way to implement BD16 with minimal memory requirements might be as follows:
* For each character with Bidi_Paired_Bracket_Type other than None, assign a
status, one of: unresolved (initial value), resolved as paired, resolved as
unpaired. Note that if such characters are guaranteed to have bc=ON, the
Bidi_Paired_Bracket_Type property and the status can be encoded by creating
additional, ‘virtual’ bidi classes (which would behave as ON for all the
other purposes).
* For each unresolved closing bracket, search backward until either sos or an
unresolved opening bracket that forms a bracket pair with the closing
bracket is found. In the latter case, resolve both brackets as paired, and
if there are any unresolved opening brackets enclosed within the pair,
resolve them all as unpaired. [Note: This corresponds to the 5 steps listed
in BD16.]
* Once the previous step is complete, for each opening bracket resolved as
paired, the matching closing bracket can be found by the following
algorithm: Initialize a counter to 1. Scan forward through the isolating run
sequence, incrementing the counter for each opening bracket resolved as
paired, and decrementing it for each closing bracket resolved as paired; the
matching closing bracket is the first one that causes the counter to be
decremented to 0. (This would work because bracket pairs, as defined by
BD16, may be nested, but cannot otherwise overlap.)
* Note that closing brackets do not have to be resolved as unpaired; as long
as each is checked only once, those that are not resolved as paired can be
left in the unresolved state.
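The steps above can be sketched in Python as follows. This is a rough illustration under simplifying assumptions, not the reference algorithm: the input string stands for a single isolating run sequence, the PAIRS table is a tiny hypothetical stand-in for BidiBrackets.txt, and canonical equivalents, bidi classes, and any nesting limit in BD16 are ignored. The status list stands in for the per-character state that the proposal suggests encoding in 'virtual' bidi classes.

```python
# Hypothetical stand-in for BidiBrackets.txt: opening -> closing bracket.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}

UNRESOLVED, PAIRED, UNPAIRED = 0, 1, 2


def resolve_statuses(text):
    """Assign a status to every bracket without using a stack."""
    status = [UNRESOLVED if ch in PAIRS or ch in CLOSERS else None
              for ch in text]
    for i, ch in enumerate(text):
        if ch not in CLOSERS:
            continue
        # Search backward for an unresolved opener that pairs with ch.
        for j in range(i - 1, -1, -1):
            if text[j] == CLOSERS[ch] and status[j] == UNRESOLVED:
                status[i] = status[j] = PAIRED
                # Unresolved openers enclosed by the new pair can never match.
                for k in range(j + 1, i):
                    if text[k] in PAIRS and status[k] == UNRESOLVED:
                        status[k] = UNPAIRED
                break
        # If no opener was found, the closer simply stays unresolved.
    return status


def matching_closer(text, status, open_idx):
    """Find the closer for a paired opener by counting paired brackets."""
    depth = 1
    for i in range(open_idx + 1, len(text)):
        if status[i] != PAIRED:
            continue
        if text[i] in PAIRS:
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("opener at %d is not resolved as paired" % open_idx)


def bracket_pairs(text):
    """Return (opener, closer) index pairs, sorted by opener position."""
    status = resolve_statuses(text)
    return [(i, matching_closer(text, status, i))
            for i, ch in enumerate(text)
            if ch in PAIRS and status[i] == PAIRED]
```

For example, bracket_pairs("a(b[c)d]") pairs only the parentheses at indices 1 and 5: the "[" enclosed by that pair is resolved as unpaired, so the later "]" finds no opener and is left unresolved. The trade-off is the one noted above: the only extra storage is one small status value per bracket, at the cost of repeated backward and forward scans.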
MD: The committee should consider whether it is worth adding in the implementation guide section.
Reference Implementation. Given the complexity of the new algorithm, I think it is incumbent upon us to have two independent reference implementations before we can release U6.3. Moreover, these must be tested against one another in a thorough “monkey test”, and we should recommend that any production-level implementation do the same. Merely extending the BidiTest file will not be sufficient.
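As a rough illustration of what such a monkey test might look like, the Python sketch below feeds random strings to two implementations and reports the first input on which they disagree. Everything here is an assumption made for illustration: the function name, the alphabet, and the calling convention (each implementation takes a string and a paragraph level and returns some comparable result, such as a list of resolved levels). Unlike a test that works only on bidi classes, the alphabet uses real characters so that brackets and isolates are exercised.

```python
import random

# Hypothetical alphabet: Latin, Hebrew, a digit, a space, brackets,
# embedding controls (LRE, RLE, PDF) and isolate controls (LRI, RLI, FSI, PDI).
ALPHABET = (
    "ab\u05d0\u05d1" "1 " "()[]"
    "\u202a\u202b\u202c"
    "\u2066\u2067\u2068\u2069"
)


def monkey_test(impl_a, impl_b, rounds=1000, max_len=12, seed=0):
    """Compare two implementations on random inputs.

    Returns None if they agree on every input tried, otherwise the first
    (text, paragraph_level) on which their results differ.
    """
    rng = random.Random(seed)  # seeded so that failures are reproducible
    for _ in range(rounds):
        length = rng.randint(0, max_len)
        text = "".join(rng.choice(ALPHABET) for _ in range(length))
        paragraph_level = rng.choice([0, 1])
        if impl_a(text, paragraph_level) != impl_b(text, paragraph_level):
            return text, paragraph_level
    return None
```

A failing input returned by such a harness can then be reduced by hand and examined against the rules to decide which implementation, or the specification text itself, is at fault.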