Limits in UBA
eliz at gnu.org
Wed Oct 22 13:20:28 CDT 2014
> From: "Andrew Glass (WINDOWS)" <Andrew.Glass at microsoft.com>
> Date: Wed, 22 Oct 2014 17:57:52 +0000
Thanks for responding.
> Embeddings are common in generated text. The guiding principle, is seemingly, when in doubt wrap the string in an embedding. At the UTC, we heard, that this can lead to very deep stacks - but I've never actually seen one with more than 63 levels - but that is not my topic here.
I'd appreciate some pointers to such texts, if they are publicly
accessible. I'd be very interested to see why such deep embeddings
are needed.
In Emacs, we do use embeddings and overrides in a few places in text
we generate, for example, to make sure information about a character
displayed by a specialized command doesn't get jumbled due to that
character's bidi class. But we have never needed more than one, or at
most two, levels. Most cases can be resolved by using LRM or RLM.
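
To illustrate the LRM approach, here is a minimal sketch in Python. It
is not Emacs code; the function name describe_char and the output
format are made up for this example. Without the mark, the decimal
code that follows a right-to-left character would join that
character's right-to-left run (the digits take their direction from
the preceding strong character) and be displayed on the wrong side of
it. An LRM right after the character gives the digits a left-to-right
strong character to follow instead.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK

def describe_char(ch):
    """Describe one character (hypothetical helper, not an Emacs API).

    The LRM after CH keeps the decimal code that follows from being
    pulled into CH's right-to-left run when CH is a strong R or AL
    character.
    """
    name = unicodedata.name(ch, "UNNAMED")
    return f"{ch}{LRM} {ord(ch)} (U+{ord(ch):04X}, {name})"
```

For a left-to-right character the extra LRM is harmless, so the sketch
adds it unconditionally rather than testing the bidi class first.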
> The BPA is not as subject to the extremes of generated text, and therefore brackets should follow a natural limit such that it is possible for a human to parse and track the bracketed levels. As such, the max depth is going to be quite low in normal text. Most cases of the BPA involve one pair. Nested pairs beyond three become quite artificial - except in languages such as LISP. However, supporting correct display of Bidi LISP code is not a goal of the BPA. I'm not sure what the maximum depth used by the test data is - I think that is the best current guide unless we introduce something.
The test data doesn't have more than 3 nested levels, I think.
For Emacs, I limited the BPA stack to 1024 levels, which is probably
way too much, but it was cheap, so I saw no reason to force an
arbitrary lower limit.
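
A capped bracket stack of this kind could look roughly like the
following Python sketch. It is an illustration only, not the Emacs
implementation: the three-entry bracket table, the function name
find_bracket_pairs, and the choice to stop pairing once the stack is
full are all assumptions made for the example. A real implementation
would take its pairs from the Unicode bracket data and work on an
isolating run sequence rather than on a plain string.

```python
# Hypothetical, tiny bracket table; real code would use the Unicode data.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def find_bracket_pairs(text, max_depth=1024):
    """Return sorted (open_pos, close_pos) pairs, using a bounded stack."""
    stack = []  # entries: (expected closing bracket, position of opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in PAIRS:
            if len(stack) >= max_depth:
                # Assumed policy: no room left, so stop pairing here.
                break
            stack.append((PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack for a matching opener.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    pairs.append((stack[i][1], pos))
                    del stack[i:]  # drop the match and anything above it
                    break
    return sorted(pairs)
```

With the default limit, find_bracket_pairs("a(b[c]d)e") finds both
pairs; with max_depth=1 the inner bracket overflows the stack and,
under the policy assumed here, no pairs are reported at all.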
As for Lisp and similar languages, since the BPA in otherwise all-L2R
text is equivalent to "normal" resolution of neutrals per N1 and N2, I
simply bypass the BPA in that case -- because N1/N2 processing is much
cheaper in the Emacs case. So Lisp is not the case that worries me.
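
The bypass can be pictured as a cheap pre-check, sketched below in
Python. This is a guess at the shape of such a test, not what Emacs
actually does: the name can_skip_bpa and the exact set of bidi classes
that force the full algorithm are assumptions, chosen conservatively,
and the sketch also assumes a left-to-right paragraph direction.

```python
import unicodedata

# Assumed set of classes that can introduce right-to-left text.
RTL_TRIGGERS = {"R", "AL", "AN", "RLE", "RLO", "RLI", "FSI"}

def can_skip_bpa(text):
    """True if TEXT has nothing that could make bracket pairing matter.

    Assumes the paragraph direction is left-to-right, so brackets
    would resolve the same way under the ordinary neutral rules.
    """
    return not any(unicodedata.bidirectional(ch) in RTL_TRIGGERS
                   for ch in text)
```

Lisp source made up of ASCII text passes this check no matter how
deeply its parentheses nest, so the stack limit never comes into play
for it.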
But I do wonder why there's absolutely no guidance in the UBA
regarding this issue, which in practice every implementor will
probably bump into.