L2/99-050

Subject:	BIDI Ad Hoc proposal for Unicode 3.0
By:	Mark Davis
Date:	1999-02-04

The following proposal reflects the results of the Bidi subcommittee meetings in January, with a few additional changes proposed at the UTC meeting. This proposal is a delta to TR#9 and Unicode Data 2.1.8. The material was reviewed at the meeting, so I will not repeat the discussion of the individual items here. Overall, the effect on display by conformant implementations should be quite small; basically, edge conditions are cleared up.

Items marked with * or ¤ will improve the statement of the algorithm but will not change the results of conformant implementations, as long as they are all included together.
The items marked with ¤ are normative, even though they will not affect the results of the algorithm, since they do change the BIDI character properties in the Unicode Character Database.
Unmarked items are normative.
Other editorial changes, such as separating out definitions, making sure the same defined terms are used uniformly, renumbering rules for clarity, and adding more examples, are not included here, since they do not need a decision by the UTC at the next meeting.

BIDI Character Properties

A. New properties¤

a. BIDI properties AL, CM, LRO, RLO, LRE, RLE, PDF will be created¤
b. All characters with general category Me, Mn will be given BIDI property CM.¤
c. All characters of type R in the Arabic, Thaana, Syriac ranges (0600-07BF, FB50-FDFF, FE70-FEFF) will be given BIDI property AL.¤
d. The explicit embedding characters LRO, RLO, LRE, RLE, PDF will be given the corresponding property.¤

B. Related Algorithm Changes

a. Unassigned Hebrew characters (0590-05FF, FB1D-FB4F) will be given type R.
b. Unassigned Arabic, Thaana, Syriac characters (0600-07BF, FB50-FDFF, FE70-FEFF) will be given type AL.
c. All other unassigned characters will be given type L.
d. We will add notes that: (1) as characters are assigned, these values might change, (2) private use characters can be assigned different values by a conformant implementation, (3) a-c are exceptions to the conformance requirements with respect to unassigned characters.
e. Rules referring to combining marks will refer instead to CM.
f. Rules referring to characters in the Arabic Block will refer instead to AL.*
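
The defaults in B.a through B.c can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name default_bidi_type and the constant names are hypothetical, and the caller is assumed to have already established that the code point is unassigned.

```python
# Ranges copied from items B.a and B.b above (inclusive, hexadecimal).
HEBREW_RANGES = [(0x0590, 0x05FF), (0xFB1D, 0xFB4F)]
ARABIC_RANGES = [(0x0600, 0x07BF), (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]


def default_bidi_type(code_point: int) -> str:
    """Return the proposed default BIDI type for an unassigned code point.

    Hypothetical helper: assigned characters are assumed to take their
    type from the character database instead of from this function.
    """
    if any(lo <= code_point <= hi for lo, hi in HEBREW_RANGES):
        return "R"   # item B.a
    if any(lo <= code_point <= hi for lo, hi in ARABIC_RANGES):
        return "AL"  # item B.b
    return "L"       # item B.c
```

Per note B.d, an implementation would still be free to override the result for private use characters.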

C. Reset the following individual characters to a new type (columns: general category, current type, new type in parens, code point, name):¤

Lm ON (L) 3005 IDEOGRAPHIC ITERATION MARK
Lm ON (L) 3031 VERTICAL KANA REPEAT MARK
Lm ON (L) 3032 VERTICAL KANA REPEAT WITH VOICED SOUND MARK
Lm ON (L) 3033 VERTICAL KANA REPEAT MARK UPPER HALF
Lm ON (L) 3034 VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF
Lm ON (L) 3035 VERTICAL KANA REPEAT MARK LOWER HALF
Lm ON (L) FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
Lm ON (L) FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
Mc ON (L) 0F3E TIBETAN SIGN YAR TSHES
Mc ON (L) 0F3F TIBETAN SIGN MAR TSHES
Zs CS (WS) 2007 FIGURE SPACE
Zs WS (CS) 00A0 NO-BREAK SPACE
Zs WS (CS) 202F NARROW NO-BREAK SPACE
Zl B (WS) 2028 LINE SEPARATOR

D. Add

"For the purpose of the BIDI algorithm, inline objects (such as graphics) are treated as if they were a U+FFFC OBJECT REPLACEMENT CHARACTER."

E. Fix Table 3-1 to correspond to the new data tables, adding the new BIDI categories.

F. Fix Table 3-2 to remove AL, add explanations, and change sot/eot to start of block/end of block. Make it clearer that blocks are treated separately.*

F2. Fix Chapter 4 to reflect new BIDI categories.

Algorithm

G. Change the maximum embedding level set by explicit controls to 61 (i.e., within a 6-bit limit).

H. At the end of rules E2a, E3a, O1a, O2a, set RLE => R, LRE => L, RLO => R, LRO => L, respectively.*

I. Change T6 to read as follows, and change the example as appropriate.

"T6. If the character after a PDF is the same as the matching code for that PDF, set the PDF and that next character to BN.* Otherwise, if the PDF is immediately preceded by an embedding code, set that previous character and the PDF to BN."

J. Incorporate new types by changing the following rules. (P0a has a fix for ET).

"C0. A sequence of CM is given the type of the preceding character; at the start of a block, they are given the type ON."*

"P0. Search backwards from each instance of a European number until the first strong character (or block boundary) is found. If that first character is AL, change the type of the European number to Arabic number:"*

"P0+. Change all ALs to R."*

"P0a. Change any Boundary Neutrals adjacent to a European Number to a European Number; otherwise change any Boundary Neutrals adjacent to a European Terminator to a European Terminator; otherwise change any remaining Boundary Neutrals adjacent to an Arabic Number to an Arabic Number."
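
A minimal sketch of how C0, P0, and P0+ might be applied to one block follows. It assumes the block is already represented as a list of BIDI type abbreviations (strings such as "L", "AL", "EN", "CM"); the function names are hypothetical, and P0a is left out because "adjacent" needs a more careful definition than this sketch attempts.

```python
STRONG = {"L", "R", "AL"}


def apply_c0(types):
    """C0: each CM takes the type of the preceding character, or ON at block start."""
    out = []
    for t in types:
        if t == "CM":
            # out[-1] is already resolved, so a run of CMs inherits one type.
            out.append(out[-1] if out else "ON")
        else:
            out.append(t)
    return out


def apply_p0(types):
    """P0: an EN whose nearest preceding strong character is AL becomes AN."""
    out = list(types)
    last_strong = None  # None stands for the block boundary
    for i, t in enumerate(types):
        if t in STRONG:
            last_strong = t
        elif t == "EN" and last_strong == "AL":
            out[i] = "AN"
    return out


def apply_p0_plus(types):
    """P0+: change every AL to R."""
    return ["R" if t == "AL" else t for t in types]


def resolve_new_types(types):
    """Apply the three rules in the order they are listed above."""
    return apply_p0_plus(apply_p0(apply_c0(types)))
```

The forward scan in apply_p0 is equivalent to searching backwards from each European number, since it only remembers the most recent strong type seen.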

K. Change N3 to the following. The rules will be reordered or commented to make clear that N3 must be applied before N1.

"N3. For the purpose of resolving neutrals,
(a) European numbers are treated as though they were the type of the previous strong character. If this type is L, change the EN to L.
(b) If there is no previous strong character, European numbers are treated as though they had the base direction. If this type is L, change the EN to L.
(c) Arabic numbers are treated as though they were R.

The following are examples:

R N EN -> R R EN
L N EN -> L L EN
EN N R -> EN e R
EN N L -> EN e L

R N AN -> R R AN
L N AN -> L e AN
AN N R -> AN R R
AN N L -> AN e L"
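
The examples in N3 can be reproduced with a short sketch. It makes several assumptions that are not stated in this proposal: neutrals between two runs of the same direction take that direction and all other neutrals take the embedding direction (the existing neutral rules of TR#9, simplified), block boundaries count as the base direction, and the input contains only the types L, R, EN, AN, and N. The function names are hypothetical. To mirror the notation of the examples, the sketch leaves EN and AN labels in place and does not perform the "change the EN to L" step.

```python
def effective_types(types, base_dir):
    """N3: map each EN/AN to the strong type it is treated as for neutrals."""
    out = []
    prev_strong = None  # None means no strong character seen yet
    for t in types:
        if t in ("L", "R"):
            prev_strong = t
            out.append(t)
        elif t == "EN":
            # (a) previous strong type, or (b) the base direction
            out.append(prev_strong if prev_strong else base_dir)
        elif t == "AN":
            out.append("R")  # (c)
        else:
            out.append(t)
    return out


def resolve_neutrals(types, base_dir):
    """Resolve runs of N using the effective types on either side."""
    eff = effective_types(types, base_dir)
    out = list(types)
    i, n = 0, len(types)
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        # Assumption: a block boundary behaves like the base direction.
        before = eff[i - 1] if i > 0 else base_dir
        after = eff[j] if j < n else base_dir
        resolved = before if before == after else base_dir
        for k in range(i, j):
            out[k] = resolved
        i = j
    return out
```

For instance, resolve_neutrals(["L", "N", "AN"], "R") treats AN as R, finds mismatched neighbours, and falls back to the base direction, which corresponds to the "e" in the examples.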

L. Change I1 to drop the "unless" clause (handled by additions to N3a and N3b).*

"I1. If the embedding direction is even (left-to-right), then the right-to-left text goes up one level. Numeric text (AN) goes up two levels. A sequence of one or more numeric types (EN) goes up two levels."*
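
As a rough illustration of the revised I1, the level adjustment for an even embedding level might look like the following. The function name is hypothetical, the input is assumed to be a single resolved type string, and the odd-level case is not covered because it is not restated in this proposal.

```python
def implicit_level_even(embedding_level: int, bidi_type: str) -> int:
    """Apply the revised I1 to one character at an even embedding level."""
    if embedding_level % 2 != 0:
        raise ValueError("I1 covers even (left-to-right) embedding levels only")
    if bidi_type == "R":
        return embedding_level + 1  # right-to-left text goes up one level
    if bidi_type in ("AN", "EN"):
        return embedding_level + 2  # numeric text goes up two levels
    return embedding_level          # L keeps the embedding level
```

Because N3 (a) and (b) already turn an EN into L where the old "unless" clause applied, no special case for EN is needed here.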

M. Change in 3.1 the following to add "should":

"The directional formatting codes are used only to influence the display ordering of text. In all other respects they are ignored--they should have no effect on the comparison of text, nor on word breaks, parsing, or numeric analysis."

N. In "Terminating Embeddings and Overrides", delete:

"Higher level protocols may choose to interpret PDFs that occur when there is no pushed state. For example, a presentation engine may receive blocks of processed Unicode text divided into lines. If the complexity of the text is limited by the higher-level protocol, then PDF can be interpreted significantly."

O. In "Higher Level Protocols", change to something like:

"Override the number handling to use information provided by a broader context. For example, information from other paragraphs in a document could be used to conclude that the document was fundamentally Arabic, and that EN should generally be converted to AN."

Additional Changes

P. Add a section with examples of the kinds of differences from 2.0 that will occur (basically edge cases).

Q. Make an explicit conformance clause that reads something like the following (to be wordsmithed in the editorial committee):

Conformance: A process that displays text containing supported right-to-left characters or embedding codes shall display all visible representations of characters (excluding format characters) in the same order as if the Bidirectional Algorithm had been applied to the text, in the absence of higher level protocols.

The goal in marking a character as BN is that it have no effect on the rest of the algorithm. Since the precise ordering of format characters with respect to others is not required for conformance, implementations are free to handle them in different ways for efficiency as long as the ordering of the other characters is preserved.

R. Change B1/2 to be something like the following (to be wordsmithed in the editorial committee):

B1. In the block, find the first character that either is a strong directional character (L, AL, R) or is a directional code (RLE, LRE, RLO, LRO) not immediately followed by PDF. (Because block separators delimit text in this algorithm, this will generally be the first strong character after a block separator or at the very beginning of the text.)

B2. If a character is found in B1 and it is of type AL, R, RLE, or RLO, then set the base level to one; otherwise, set it to zero.
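
The proposed B1 and B2 could be sketched as below, assuming the block is given as a list of BIDI type abbreviations. The function name base_level is hypothetical, and the final wording of the rules is still subject to the editorial committee.

```python
STRONG_TYPES = ("L", "AL", "R")
DIRECTIONAL_CODES = ("RLE", "LRE", "RLO", "LRO")
RTL_TRIGGERS = ("AL", "R", "RLE", "RLO")


def base_level(types):
    """Return the base embedding level (0 or 1) of one block."""
    for i, t in enumerate(types):
        if t in STRONG_TYPES:
            found = t  # B1: first strong directional character
        elif t in DIRECTIONAL_CODES:
            # B1: a directional code counts only if PDF does not follow it.
            if i + 1 < len(types) and types[i + 1] == "PDF":
                continue
            found = t
        else:
            continue
        # B2: right-to-left types give level one, anything else level zero.
        return 1 if found in RTL_TRIGGERS else 0
    return 0  # B2: nothing found in B1


print(base_level(["RLE", "PDF", "ON", "L"]))  # the empty RLE/PDF pair is skipped
```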

S. In accordance with Q, change format and control characters that have type N to be of type BN. [Note: I added this after the Ad Hoc meeting after looking at the effects of Q and the statements about other format characters in the book.--MD.]

P.S. Having reviewed the algorithm, I think NSM would be a better abbreviation than CM.