Writing Direction and Bidirectional Text FAQ

Writing Direction

Q: What does "writing direction" refer to?

A: Individual writing systems make different default assumptions about how characters are arranged into lines and how lines of text are then arranged on a page or screen. Such assumptions are referred to as a writing system's directionality. For example, in writing systems based on the Latin script, characters are laid out horizontally from left to right to form lines, and lines of text are then laid out running from top to bottom on a page. Because the predominant direction of text flow for the Latin script is from left to right on the page, the Latin script is referred to as a "left-to-right script".

Q: Are there scripts that go from right to left?

A: Yes. Many scripts arrange characters from right to left into lines. For some historic Middle-Eastern scripts, all of the characters in text flow from right to left, so those scripts are referred to as "right-to-left scripts". But modern writing systems such as Arabic and Hebrew are more complicated, because in addition to their basic letters, which flow from right to left, they also use other characters, such as digits, which may display the other way. They are also often mixed on the same page with left-to-right scripts such as Latin, so that text needs to run both ways on the same line. When text runs both ways on the same line, we refer to it as bidirectional (or bidi) text.

Q: How does the Unicode Standard deal with writing that mixes directions?

A: Ordering characters into lines can be very complex when left-to-right and right-to-left scripts are used together. Proper display of Arabic, Hebrew, and similar scripts can require dealing with runs of text that have opposite directions on the same line. Also, the direction and location of punctuation characters is determined by the text that surrounds them, so the actual direction of a specific part of the line depends on context analysis. The Unicode Standard defines an algorithm to determine the layout of a line, including provision for overrides to handle situations that are ambiguous; see UAX #9, Unicode Bidirectional Algorithm, for more information.

Q: When is text written vertically?

A: It is quite common to see text written in vertical lines in East Asia. This practice is still widespread in modern Japanese writing, for example, and used to be standard typography for China and countries influenced by Chinese culture.

When Japanese or Chinese are written vertically, the lines run from top to bottom, and then are arranged in columns that run from right to left on the page. Traditional Mongolian is also written vertically, but for Mongolian the columns run from left to right on the page.

Q: Is vertical text also handled by the Unicode Bidirectional Algorithm?

A: No. Vertical text in Japanese or Chinese only runs in a single direction (from top to bottom), so it does not require dealing with two opposite directions on the same line. Unlike the bidirectional case, the choice of vertical layout is usually treated just as a formatting style. The Unicode Standard does not provide directionality controls designed to override such behavior.

Q: How does vertical text influence the orientation of characters?

A: Most characters use the same shape and orientation whether displayed horizontally or vertically, but many punctuation characters will change their shape when displayed vertically. Also, letters and words from other scripts are generally rotated by ninety degrees when mixed with vertical Japanese or Chinese writing, so that they, too, will read from top to bottom. Letters from left-to-right scripts will be rotated ninety degrees clockwise, while letters from right-to-left scripts will be rotated ninety degrees counterclockwise.

Some individual letters and digits, as well as short combinations of them, may remain upright, instead of being rotated in vertical text. In some cases, there exist compatibility characters specifically intended to have this upright orientation in East Asian typography.

The Unicode Standard provides a property, Vertical_Orientation, which specifies which characters rotate in a vertical context, and which stay upright in orientation by default. See UAX #50, Unicode Vertical Text Layout for more information.

Q: Are there any other script directions?

A: Other script directionalities are possible and are found in actual writing systems, mainly in historical ones. For example, some ancient Numidian texts are written bottom-to-top, and Egyptian hieroglyphics can be written with various directions for individual lines.

One prominent example is boustrophedon (literally, "ox-turning"), which is often found in ancient European writing systems such as early Greek. In boustrophedon writing, characters are arranged into horizontal lines, but the individual lines alternate between running right to left and running left to right, the way an ox goes back and forth when plowing a field. The letters themselves use mirrored images in accordance with each individual line's direction. [JJ]

Q: Do developers need to worry about these historical directions?

A: Not really. Boustrophedon writing is of interest almost exclusively to scholars intent on reproducing the exact visual content of ancient texts. The Unicode Standard does not provide formatting codes to signal boustrophedon text. Specialized word processors for ancient scripts might offer support for this. In the absence of that, fixed texts can be written in boustrophedon by using hard line breaks and directionality overrides. [JJ]
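The approach of combining hard line breaks with directionality overrides can be sketched in a few lines of Python. This is only an illustration: the function name `boustrophedon` and its `first` parameter are hypothetical, and the sketch assumes the text has already been split into lines. It wraps each line in LEFT-TO-RIGHT OVERRIDE (U+202D) or RIGHT-TO-LEFT OVERRIDE (U+202E), closed by POP DIRECTIONAL FORMATTING (U+202C). The overrides only change the order in which characters are laid out; they do not produce mirrored letter shapes.

```python
# Explicit directional formatting characters defined in UAX #9.
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def boustrophedon(lines, first="ltr"):
    """Wrap alternating lines in opposite directional overrides.

    `lines` is a list of strings, already broken into fixed lines.
    `first` (hypothetical parameter) is the direction of the first line.
    """
    wrapped = []
    for index, line in enumerate(lines):
        # Even-numbered lines take the starting direction, odd ones the opposite.
        is_ltr = (index % 2 == 0) == (first == "ltr")
        override = LRO if is_ltr else RLO
        wrapped.append(override + line + PDF)
    # Hard line breaks keep each override scoped to its own line.
    return "\n".join(wrapped)


text = boustrophedon(["first line", "second line", "third line"])
```

How the result appears depends on whether the displaying application honors the override characters.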

Bidirectional Text

Q: What is the Unicode Bidirectional Algorithm?

A: The Unicode Bidirectional Algorithm, often abbreviated to just "UBA", explains in detail how text should be laid out in lines whenever it consists of a mixture of left-to-right script characters and right-to-left script characters. The UBA is specified in UAX #9, Unicode Bidirectional Algorithm.

Q: How does the UBA tell which characters go left to right and which go right to left?

A: The UBA depends on a character property called Bidi_Class, which has property values defined for all Unicode characters. Note that in addition to left-to-right characters (for example, in the Latin script) and right-to-left characters (for example, in the Arabic script), there are also many characters with a neutral direction. Their behavior in bidirectional text layout depends on the details of their proximity to other characters of strong right-to-left or left-to-right direction. For example, most punctuation and symbol characters have neutral direction. For a complete listing of the Bidi_Class for all characters, see the data file DerivedBidiClass.txt in the Unicode Character Database.
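As a small illustration, Python's standard `unicodedata` module exposes the Bidi_Class value of a character through `unicodedata.bidirectional()`. The helper name `bidi_classes` and the sample characters below are chosen only for this sketch, and the values reported reflect the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, Bidi_Class) pairs for every character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


samples = [
    "A",       # Latin letter: strong left-to-right (L)
    "\u05D0",  # Hebrew letter alef: strong right-to-left (R)
    "\u0639",  # Arabic letter ain: Arabic letter (AL)
    "1",       # European digit (EN)
    "\u0661",  # Arabic-Indic digit one (AN)
    "!",       # Punctuation: other neutral (ON)
    " ",       # Space: whitespace (WS)
]

for ch, bidi_class in bidi_classes("".join(samples)):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {bidi_class}")
```

The strong classes (L, R, AL) set direction, while neutral classes such as ON and WS take their direction from context.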

Q: Does the Unicode Bidirectional Algorithm depend on giving default values for the Bidi_Class property to unassigned code points?

A: Yes, default values are defined for unassigned code points for all character properties. For a discussion of how this works and details about particular default values for the Bidi_Class property used in the Unicode Bidirectional Algorithm, see UAX #44, Unicode Character Database.

Q: Are there any issues with normalizing Arabic and/or Hebrew?

A: Yes, see the question "Isn't the canonical order for Arabic characters wrong?" for a clarification.

Q: Some Kannada characters seem to have inconsistent character properties. They have General_Category Mn but Bidi_Class L. Is this right?

A: Ordinarily, nonspacing combining marks (General_Category=Mn) also get the Bidi_Class NSM. There are exceptions, however. For two Kannada vowels, U+0CBF KANNADA VOWEL SIGN I and U+0CC6 KANNADA VOWEL SIGN E, the Unicode Technical Committee made an explicit decision to give these combining marks the Bidi_Class L. This choice serves to preserve canonical equivalence in handling bidirectional text formatting for the two-part Kannada vowels which have either of these two vowels as part of their canonical decompositions.
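These property values can be inspected with Python's standard `unicodedata` module. The sketch below is only illustrative: the helper names `property_summary` and `nfd_code_points` are hypothetical, and the results depend on the Unicode version bundled with the interpreter. It compares the two Kannada vowel signs with an ordinary nonspacing mark (U+0301 COMBINING ACUTE ACCENT) and shows a two-part vowel whose canonical decomposition starts with U+0CBF.

```python
import unicodedata


def property_summary(code_point):
    """Return (General_Category, Bidi_Class) for a code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


def nfd_code_points(text):
    """Return the canonical decomposition (NFD) as a list of U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


# The two Kannada exceptions, plus an ordinary nonspacing mark for contrast.
for cp in (0x0CBF, 0x0CC6, 0x0301):
    category, bidi_class = property_summary(cp)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: gc={category}, bc={bidi_class}")

# U+0CC0 KANNADA VOWEL SIGN II is a two-part vowel that decomposes
# canonically into U+0CBF followed by U+0CD5 KANNADA LENGTH MARK.
print(nfd_code_points("\u0CC0"))
```

Because both the composed vowel and the first character of its decomposition have Bidi_Class L, the composed and decomposed forms are treated the same way in bidirectional layout.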

Q: I have some mixed Arabic and English text. It seems to display incorrectly on my browser! Why?

A: There are several possible reasons. You may simply not have an appropriate Arabic font on your device. But when Arabic and English text are mixed together on a single line, the exact way they are formatted depends on application of the Unicode Bidirectional Algorithm, and the correct display is not always immediately obvious. In particular, the overall paragraph direction can change how mixed Arabic and English text appears on a line.

Q: How does overall paragraph direction change the display of mixed text?

A: For example, suppose your text is: "ما هو الترميز الموحد يونيكود؟ in Arabic". The logical representation of that text is shown in the table below, with uppercase letters standing for the Arabic and lowercase letters for the English. Based on your understanding of the Unicode Bidirectional Algorithm, you might expect this to be rendered as in the RTL column. That may be what your application does, and an Arabic speaker may have confirmed to you that it is indeed correct. Your browser, however, displays the text as in the LTR column. Your reasoning is that the example text is predominantly Arabic and should therefore be a right-to-left (RTL) paragraph: it should start with the Arabic at the right-hand side (reading towards the left) and then continue with the English text after that (reading towards the right).

Logical Order | Display Order: RTL (◀ ◀ ◀) | Display Order: LTR (▶ ▶ ▶)
WHAT IS UNICODE؟ in arabic | in arabic ؟EDOCINU SI TAHW | ؟EDOCINU SI TAHW in arabic
  | ما هو الترميز الموحد يونيكود؟ in Arabic | ما هو الترميز الموحد يونيكود؟ in Arabic

However, the rendering you actually get depends on the setting of the paragraph direction. The paragraph direction can be based on the first strongly directional character in the text. But as this is often an incorrect guess, one option is to override such a guess by making an explicit choice, whether by means of the document style or the user interface. Depending on the setting for paragraph direction, you would get either the RTL or the LTR display shown above.

So it is likely that your browser is not actually displaying the text incorrectly. Use of paragraph direction markup is illustrated in the last row of the table above, for which the table cell for the RTL column has an explicit dir="rtl" attribute set, while the table cell for the LTR column has an explicit dir="ltr" attribute set. When viewing this page, your browser should lay out the pieces of those examples as in the schematic row shown just above them.
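The "first strongly directional character" heuristic mentioned above can be sketched in Python using the standard `unicodedata` module. This is a simplified illustration, not a full implementation of the paragraph-level rules in UAX #9: the function name `first_strong_direction` and its `default` parameter are hypothetical, and the sketch does not skip characters inside directional isolates as the full algorithm does.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Guess paragraph direction from the first strong character.

    Returns "ltr" for Bidi_Class L, "rtl" for R or AL, and `default`
    when the text contains no strong character at all.
    Simplification: directional isolates are not treated specially.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


arabic_first = "ما هو الترميز الموحد يونيكود؟ in Arabic"
english_first = "In Arabic: ما هو الترميز الموحد يونيكود؟"

print(first_strong_direction(arabic_first))   # Arabic letter comes first
print(first_strong_direction(english_first))  # Latin letter comes first
```

The second string shows why the guess is often wrong: text that is mostly Arabic but happens to start with a Latin word is guessed as left-to-right, which is why an explicit setting can be preferable.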

Q: Can I see another paragraph direction example?

A: Sure. Here is a small Hebrew example. The paragraph direction is set to RTL by putting the dir="rtl" attribute on a "blockquote" element containing the text.

פרטים אודות הקונסורציום של יוניקוד (Unicode Consortium)

הקונסורציום של יוניקוד הוא ארגון ללא מטרת רווח שנוסד כדי לפתח, להרחיב ולקדם את השימוש בתקן יוניקוד, אשר מגדיר את ייצוג הטקסט במוצרי תוכנה ותקנים מודרניים.

Here it is again. In this case the paragraph direction is set instead to LTR.

פרטים אודות הקונסורציום של יוניקוד (Unicode Consortium)

הקונסורציום של יוניקוד הוא ארגון ללא מטרת רווח שנוסד כדי לפתח, להרחיב ולקדם את השימוש בתקן יוניקוד, אשר מגדיר את ייצוג הטקסט במוצרי תוכנה ותקנים מודרניים.

Explicit specification of the paragraph direction as either RTL or LTR by means of an attribute in the "blockquote" element is an example of application of a higher-level protocol. Both displays, if correctly handled by your browser, should be considered conformant with the Unicode Standard—it is not the case that one is correct and one is incorrect. The application of the higher-level protocol simply defines the directional context within which the UBA then determines the appropriate layout of the bidirectional text lines.
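As an illustration of applying such a higher-level protocol from a program, the following Python sketch produces a "blockquote" element with an explicit dir attribute. The function name `blockquote_with_direction` is hypothetical; only the standard library `html.escape` function is used. The sketch leaves the text itself in logical order and only declares the paragraph direction, so the layout is still left to the UBA implementation in the browser.

```python
from html import escape


def blockquote_with_direction(text, direction):
    """Wrap text in a blockquote element with an explicit dir attribute.

    `direction` must be "ltr" or "rtl". The text is kept in logical
    order; only markup-significant characters are escaped.
    """
    if direction not in ("ltr", "rtl"):
        raise ValueError(f"unsupported direction: {direction!r}")
    return f'<blockquote dir="{direction}">{escape(text)}</blockquote>'


heading = "פרטים אודות הקונסורציום של יוניקוד (Unicode Consortium)"
rtl_markup = blockquote_with_direction(heading, "rtl")
ltr_markup = blockquote_with_direction(heading, "ltr")
```

Both results contain the same characters in the same order; they differ only in the declared directional context.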

Q: What is a higher-level protocol?

A: The Unicode Standard defines a higher-level protocol as "any agreement on the interpretation of Unicode characters that extends beyond the scope of [the] standard." The Unicode Bidirectional Algorithm, in particular, allows some options to be set by higher-level protocols. See "Higher-Level Protocols" in UAX #9 for examples. Directional markup for HTML is just one example of a higher-level protocol which can be used with the UBA.

Q: Can a program itself constitute a higher-level protocol for bidirectional text?

A: A program, such as one implementing a terminal display window, is not generally considered a "protocol", per se. However, it is quite common for such programs to implicitly define an overall directional context for display, and that implicit definition of direction is itself an example of application of a higher-level protocol for the purposes of the UBA. For example, a terminal window may simply assume a left-to-right paragraph direction for display. That is the functional equivalent of an explicit HTML markup for dir="ltr" on a form input element.

A terminal display window might also allow a choice between an overall left-to-right paragraph direction and a right-to-left paragraph direction. Such a choice is not required for conformance to the Unicode Standard, although, of course, it might be a useful option for end users. In any case, a terminal display window is considered conformant to the Unicode Bidirectional Algorithm if it correctly displays bidirectional text, such as the Hebrew example above, in a left-to-right directional context, in a right-to-left directional context, or in both.