The UTC is considering proposals for two characters to help address various difficult issues in bidirectional text layout. These two characters are similar to the already-encoded LRM and RLM.
The Unicode Bidirectional Algorithm (UBA), specified in Unicode Standard Annex #9, supports three strong Bidi_Class property values (also referred to as direction categories) for general text: L (Left-to-Right), R (Right-to-Left), and AL (Right-to-Left Arabic). R and AL differ in how they affect the resolved direction of subsequent characters in numeric expressions, that is, characters with Bidi_Class values EN, ES, CS, and ET.
The UBA also currently provides two implicit directional marks: U+200E LEFT-TO-RIGHT MARK (LRM) and U+200F RIGHT-TO-LEFT MARK (RLM). These are invisible, zero-width characters that behave exactly like characters with Bidi_Class L and R, respectively. These characters are used to customize bidi text layout. They have no other semantic effect. As noted in UAX #9, “Their use is more convenient than using explicit embeddings or overrides because their scope is much more local.”
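The properties of the two existing marks can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is only illustrative.

```python
import unicodedata

LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK
RLM = "\u200f"  # U+200F RIGHT-TO-LEFT MARK

def describe(ch):
    """Return (name, Bidi_Class, General_Category) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.bidirectional(ch),
        unicodedata.category(ch),
    )

for ch in (LRM, RLM):
    print(describe(ch))
# ('LEFT-TO-RIGHT MARK', 'L', 'Cf')
# ('RIGHT-TO-LEFT MARK', 'R', 'Cf')
```

Both marks are format characters (Cf) whose only distinguishing property here is the strong Bidi_Class value they carry.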
However, this set of implicit directional marks is missing an AL MARK (ALM), which is like the other two except that it has Bidi_Class value AL. This is needed in order to address some problems in the layout of numeric expressions. For example, consider an isolated field that should display a numeric expression in a way that would match what its layout would be if it were in the middle of Arabic text. To produce this layout, the field could use an ALM at the beginning of the numeric expression.
If necessary, an ALM could be inserted right after an RLO or RLE to ensure that the override or embedding begins with an AL direction context.
Adding ALM does not require any new Bidi_Class values or any changes to the definitions or steps of the UBA.
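The reason no algorithm change is needed can be illustrated with existing rule W2 of UAX #9, which changes a European number (EN) to an Arabic number (AN) when the nearest preceding strong type is AL. The sketch below is a simplified model that operates on a list of Bidi_Class values for a single level run; the function name `apply_w2` and the list representation are assumptions made for illustration, and the rest of the UBA (including the later rule that turns AL into R) is not modeled.

```python
STRONG_TYPES = {"L", "R", "AL"}

def apply_w2(types, sos="L"):
    """Simplified UBA rule W2 over one level run.

    types: list of Bidi_Class values, in logical order.
    sos:   start-of-sequence type, used when no strong type precedes.
    """
    result = list(types)
    last_strong = sos
    for i, bidi_type in enumerate(types):
        if bidi_type in STRONG_TYPES:
            last_strong = bidi_type
        elif bidi_type == "EN" and last_strong == "AL":
            # EN preceded by AL is treated as an Arabic number.
            result[i] = "AN"
    return result

# "1/2" alone in a field: the digits stay EN.
print(apply_w2(["EN", "CS", "EN"]))        # ['EN', 'CS', 'EN']
# The same expression preceded by a character of class AL (the proposed ALM).
print(apply_w2(["AL", "EN", "CS", "EN"]))  # ['AL', 'AN', 'CS', 'AN']
# Preceded by a character of class R (such as RLM): the digits stay EN.
print(apply_w2(["R", "EN", "CS", "EN"]))   # ['R', 'EN', 'CS', 'EN']
```

The last two cases show the difference between R and AL that an RLM cannot reproduce: only a preceding AL makes the digits resolve as they would in the middle of Arabic text.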
The UTC would appreciate any feedback regarding this proposed addition and its possible impact on implementations.
There are many instances in which semi-structured text is composed of two or more fields separated by neutral or weak-directional characters, and the fields should be laid out in order of the paragraph direction (or more precisely, the current embedding direction). For example, numeric dates in Arabic often have a logical order of d/M/y:
Because ‘/’ has Bidi_Class value CS and the digits (whether EN or AN) are weakly left-to-right, such a sequence will always be laid out left-to-right. Adding RLM before each ‘/’ will force the date to always be laid out right-to-left, regardless of direction context. If the direction context is known in advance, then it is possible to insert RLM or not in order to generate appropriate behavior. However, it is impossible to create the correct behavior in all contexts. For example:
To handle situations of this sort, it is proposed to add a level direction mark (LDM): a character that behaves like LRM or RLM, but whose Bidi_Class value is dynamically reassigned based on the direction associated with the current embedding level. If the embedding direction is L, it would behave like LRM; if the embedding direction is R, it would behave like RLM.
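For comparison, the behavior of the proposed character can only be approximated today when the embedding level is already known at the time the text is generated. The following sketch shows that approximation; the function names `mark_for_level` and `join_fields` are hypothetical, and the example assumes the caller can supply the embedding level, which is exactly what is not possible in the general case.

```python
LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK
RLM = "\u200f"  # U+200F RIGHT-TO-LEFT MARK

def mark_for_level(embedding_level):
    """Pick the existing mark that matches the level's direction.

    Even levels are left-to-right, odd levels are right-to-left.
    """
    return LRM if embedding_level % 2 == 0 else RLM

def join_fields(fields, separator, embedding_level):
    """Join fields, placing a directional mark before each separator."""
    mark = mark_for_level(embedding_level)
    return (mark + separator).join(fields)

# A d/M/y date generated for a right-to-left context (level 1).
date = join_fields(["1", "2", "2010"], "/", embedding_level=1)
print(date.encode("unicode_escape").decode("ascii"))
# 1\u200f/2\u200f/2010
```

A string built this way is fixed to one direction once generated; the proposed character would instead resolve its direction at layout time from whatever embedding level it ends up in.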
To handle LDM, the optimum solution would normally be to define a corresponding new Bidi_Class value, and then update the UBA to handle this new category. It could then be used to override the Bidi_Class value of selected characters, which—in situations that permitted such overrides—could achieve the LDM behavior without insertion of extra mark characters.
However, per the Unicode Character Encoding Stability Policy, “The Bidi_Class property values will not be further subdivided”. There is no such restriction on changes to the bidi algorithm itself, though for implementation stability, changes that impact backwards compatibility should be avoided. This leaves several alternatives:
Define no new Bidi_Class value for LDM; instead, give LDM the Bidi_Class value ON (Other Neutral). Then define a new rule for UBA:
W0. Examine each level direction mark character (LDM) in the level run, and set the bidi type to L if the level is even, and R if the level is odd.
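Rule W0 as worded above can be sketched as a single pass over a level run. This is only an illustration under stated assumptions: no code point is given for LDM in this text, so the sketch uses a placeholder string `"<LDM>"`, and the function name `apply_w0` and the parallel-list representation are hypothetical.

```python
LDM = "<LDM>"  # placeholder only; no code point is assigned in this text

def apply_w0(chars, types, level):
    """Resolve each LDM in a level run to L (even level) or R (odd level).

    chars: the characters of the level run, in logical order.
    types: their current bidi types (LDM starts out as ON in this alternative).
    level: the embedding level of the run.
    """
    resolved = "L" if level % 2 == 0 else "R"
    return [resolved if ch == LDM else t for ch, t in zip(chars, types)]

chars = ["1", LDM, "/", "2"]
types = ["EN", "ON", "CS", "EN"]
print(apply_w0(chars, types, level=0))  # ['EN', 'L', 'CS', 'EN']
print(apply_w0(chars, types, level=1))  # ['EN', 'R', 'CS', 'EN']
```

Because LDM shares the value ON with other neutrals in this alternative, the sketch has to identify it by character rather than by Bidi_Class, which hints at why implementations might end up tracking it as a class of their own.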
This approach has some problems:
Not defining a separate Bidi_Class value for LDM will probably result in implementations effectively defining their own additional classes.
The UTC would like feedback on which (if any) of these approaches is preferred.