Technical Reports |
Version | 4.1.0 |
Authors | Asmus Freytag (asmus@unicode.org) |
Date | 2005-08-29 |
This Version | http://www.unicode.org/reports/tr14/tr14-17.html |
Previous Version | http://www.unicode.org/reports/tr14/tr14-15.html |
Latest Version | http://www.unicode.org/reports/tr14/ |
Revision | 17 |
This report presents the specification of line breaking properties for Unicode characters as well as a model algorithm for determining line break opportunities.
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References section. For the latest version of the Unicode Standard see [Unicode]. See [Reports] for a list of current Unicode Technical Reports. For more information about versions of the Unicode Standard, see [Versions].
The text of The Unicode Standard [Unicode] presents a limited description of some of the characters with specific function in line breaking, but does not give a complete specification of line breaking behavior. This Unicode Standard Annex provides more detailed information about default line breaking behavior reflecting best practices for the support of multilingual texts.
For most Unicode characters, considerable variation in line breaking behavior can be expected, including variation based on local or stylistic preferences. Therefore, the line breaking properties provided for these characters are informative. Some characters are intended to explicitly influence line breaking. Their line breaking behavior is therefore expected to be identical across all implementations. The Unicode Standard assigns normative line breaking properties to those characters. The Unicode Line Breaking Algorithm is a tailorable set of rules that uses these line breaking properties in context to determine line break opportunities.
This document opens with formal definitions, a summary of the line breaking task and a brief section on conformance requirements. Four main sections follow:
All terms not defined here shall be as defined in the Unicode Standard [Unicode]. The notation defined in this technical report differs somewhat from the notation defined elsewhere in the Unicode Standard. All other notation used here without an explicit definition shall be as defined in the Unicode Standard.
Line fitting — the process of determining how much text will fit on a line of text, given the available space between the margins and the actual display width of the text.
Line Break — the position in the text where one line ends and the next one starts.
Line Break Opportunity — a place where a line is allowed to end. Whether a given position in the text is a valid line break opportunity depends on context as well as the line breaking rules in force.
Line Breaking — the process of selecting one among several line break opportunities such that the resulting line is optimal or ends at a user-requested explicit line break.
Line Breaking Property — A character property with enumerated values, as listed in Table 1 and separated into normative and informative. Line breaking property values are used to classify characters, and taken in context, determine the type of break.
Line Breaking Class — a class of characters with the same line breaking property value.
Mandatory Break — a line must break following a character that has the mandatory break property. Such a break is also known as a forced break and is indicated in the rules as B !, where B is the character with the mandatory break property.
Direct Break — a line break opportunity exists between two adjacent characters of the given line breaking classes. This is indicated in the rules below as B ÷ A, where B is the character class of the character before and A is the character class of the character after the break. If they are separated by one or more space characters, a break opportunity also exists after the last space. In the pair table, the optional space characters are not shown.
Indirect Break — a line break opportunity exists between two characters of the given line breaking classes only if they are separated by one or more spaces. In this case, a break opportunity exists after the last space. No break opportunity exists if the characters are immediately adjacent. This is indicated in the pair table below as B % A, where B is the character class of the character before and A is the character class of the character after the break. Even though space characters are not shown in the pair table, an indirect break can only occur if one or more spaces follow B. In the notation of the rules in Section 6, Line Breaking Algorithm this would be represented as two rules: B × A and B SP+ ÷ A.
Prohibited Break — no line break opportunity exists between two characters of the given line breaking classes, even if they are separated by one or more space characters. This is indicated in the pair table below as B ^ A, where B is the character class of the character before and A is the character class of the character after the break and the optional space characters are not shown. In the notation of the rules in Section 6, Line Breaking Algorithm this would be expressed as a rule of the form: B SP* × A.
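The three pair-table outcomes defined above can be summarized in a small helper. This is a minimal sketch and not part of the specification: the function name `break_opportunity` and the use of the symbols ÷, % and ^ as string keys are assumptions made for illustration only.

```python
def break_opportunity(action: str, spaces_between: int) -> bool:
    """Return True if a line may break between characters B and A.

    action is the (hypothetical) pair-table entry for the classes of B and A:
    '÷' direct break, '%' indirect break, '^' prohibited break.
    spaces_between is the number of SP characters separating B and A.
    When a break exists and spaces are present, it falls after the last space.
    """
    if spaces_between < 0:
        raise ValueError("spaces_between must be non-negative")
    if action == "÷":   # direct: break with or without intervening spaces
        return True
    if action == "%":   # indirect: break only if one or more spaces intervene
        return spaces_between > 0
    if action == "^":   # prohibited: no break, even across spaces
        return False
    raise ValueError(f"unknown pair-table action: {action!r}")
```

For example, an indirect entry yields no opportunity for immediately adjacent characters, but does once a space separates them.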
Hyphenation — Hyphenation uses language-specific rules to provide additional line break opportunities within a word. Hyphenation improves the layout of narrow columns, especially for languages with many longer words, such as German or Finnish. For the purpose of this document, it is assumed that hyphenation is equivalent to inserting soft hyphen characters. All other aspects of hyphenation are outside the scope of this document.
Class | Descriptive Name | Examples | Characters with this property...
Normative Line Breaking Classes
BK | Mandatory Break | NL, PS | cause a line break (after)
CR | Carriage Return | CR | cause a line break (after), except between CR and LF
LF | Line Feed | LF | cause a line break (after)
CM | Attached Characters and Combining Marks | Combining Marks, control codes | prohibit a line break between the character and the preceding character
NL * | Next Line | NEL | cause a line break (after)
SG | Surrogates | Surrogates | should not occur in well-formed text
WJ * | Word Joiner | WJ | prohibit line breaks before or after
ZW | Zero Width Space | ZWSP | provide a break opportunity
GL | Non-breaking (“Glue”) | NBSP, ZWNBSP, CGJ | prohibit line breaks before or after
CB | Contingent Break Opportunity | Inline Objects | provide a line break opportunity contingent on additional information
SP | Space | Space | generally provide a line break opportunity after the character, enable indirect breaks
Break Opportunities
B2 | Break Opportunity Before and After | EM Dash | provide a line break opportunity before and after the character
BA | Break Opportunity After | Spaces, Hyphens | generally provide a line break opportunity after the character
BB | Break Opportunity Before | Punctuation used in dictionaries | generally provide a line break opportunity before the character
HY | Hyphen | Hyphen-Minus | provide a line break opportunity after the character, except in numeric context
Characters Prohibiting Certain Breaks
CL | Closing Punctuation | “)”, “]”, “}”, etc. | prohibit a line break before
EX | Exclamation/Interrogation | “!”, “?” etc. | prohibit line break before
IN | Inseparable | Leaders | allow only indirect line breaks between pairs
NS | Non Starter | small kana | allow only indirect line break before
OP | Opening Punctuation | “(”, “[”, “{”, etc. | prohibit a line break after
QU | Ambiguous Quotation | Quotation marks | act like they are both opening and closing
Numeric Context
IS | Infix Separator (Numeric) | . , | prevent breaks after any and before numeric
NU | Numeric | Digits | form numeric expressions for line breaking purposes
PO | Postfix (Numeric) | %, ¢ | do not break following a numeric expression
PR | Prefix (Numeric) | $, £, ¥, etc. | do not break in front of a numeric expression
SY | Symbols Allowing Breaks | / | prevent a break before, and allow a break after
Other Characters
AI | Ambiguous (Alphabetic or Ideographic) | Characters with Ambiguous East Asian Width |
AL | Ordinary Alphabetic and Symbol Characters | Alphabets and regular symbols | are alphabetic characters or symbols that are used with alphabetic characters
H2 | Hangul LV Syllable | Hangul | form Korean syllable blocks
H3 | Hangul LVT Syllable | Hangul | form Korean syllable blocks
ID | Ideographic | Ideographs | break before or after, except in some numeric context
JL | Hangul L Jamo | Conjoining Jamo | form Korean syllable blocks
JV | Hangul V Jamo | Conjoining Jamo | form Korean syllable blocks
JT | Hangul T Jamo | Conjoining Jamo | form Korean syllable blocks
SA | Complex Context (South East Asian) | South East Asian: Thai, Lao, Khmer | provide a line break opportunity contingent on additional, language specific context analysis
XX | Unknown | Unassigned, Private Use | have as yet unknown line breaking behavior or unassigned code positions
Lines are broken as result of one of two conditions. The first condition is the presence of an explicit line breaking character. The second condition results from a formatting algorithm having selected among available line break opportunities; ideally the chosen line break results in the optimal layout of the text.
Different formatting algorithms may use different methods to determine an optimal line break. For example, simple implementations consider a single line at a time, trying to find a locally optimal line break. A basic, yet widely used approach is to allow no compression or expansion of the inter-character and inter-word spaces and consider the longest line that fits. When compression or expansion is allowed, a locally optimal line break seeks to balance the relative merits of the resulting amounts of compression and expansion for different line break candidates.
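The basic approach described above, with no compression or expansion and the longest line that fits, can be sketched as follows. This is an illustrative simplification, not a definitive implementation: the function name `fit_lines`, the per-character width list, and the representation of break opportunities as a set of indices are all assumptions. Trailing spaces are measured like any other character, which a real layout system would normally avoid.

```python
def fit_lines(widths, opportunities, max_width):
    """Greedy line fitting: choose the longest line that fits, one line at a time.

    widths[i] is the advance width of character i.
    opportunities is a set of indices k (0 < k < len(widths)) meaning
    "a line may end immediately before character k".
    Returns the list of indices at which lines start.
    """
    if not widths:
        return [0]
    starts = [0]
    line_start = 0
    n = len(widths)
    while True:
        total = 0
        best = None  # farthest break opportunity that still fits
        for i in range(line_start, n):
            total += widths[i]
            if total > max_width:
                break
            if i + 1 == n:
                return starts  # the remaining text fits on this line
            if i + 1 in opportunities:
                best = i + 1
        if best is None:
            # No ordinary opportunity fits; a layout system would need
            # some fallback here.
            raise ValueError("no break opportunity fits on this line")
        starts.append(best)
        line_start = best
```

With eleven unit-width characters and opportunities before indices 4 and 8, a width limit of 8 starts the second line at index 8, while a limit of 7 starts it at index 4.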
When expanding or compressing inter-word space according to common typographical practice, only the spaces marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and U+3000 IDEOGRAPHIC SPACE are subject to compression, and only spaces marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and occasionally spaces marked by U+2009 THIN SPACE are subject to expansion. All other space characters normally have fixed width. When expanding or compressing inter-character space the presence of U+200B ZERO WIDTH SPACE or U+2060 WORD JOINER is always ignored.
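The inter-word rule above amounts to a lookup on the space character. The following sketch encodes only the characters named in the preceding paragraph; the function name `inter_word_adjustment` and the `allow_thin_space` flag, which models the "occasionally" case for THIN SPACE, are hypothetical.

```python
# Characters named in the text as subject to compression or expansion.
COMPRESSIBLE = {"\u0020", "\u00A0", "\u3000"}  # SPACE, NO-BREAK SPACE, IDEOGRAPHIC SPACE
EXPANDABLE = {"\u0020", "\u00A0"}              # SPACE, NO-BREAK SPACE
THIN_SPACE = "\u2009"                          # expandable only occasionally


def inter_word_adjustment(ch, allow_thin_space=False):
    """Return (can_compress, can_expand) for a single space character.

    All other space characters are treated as having fixed width.
    """
    can_compress = ch in COMPRESSIBLE
    can_expand = ch in EXPANDABLE or (allow_thin_space and ch == THIN_SPACE)
    return (can_compress, can_expand)
```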
Local custom or document style determines whether and to what degree expansion of inter-character space is allowed in justifying a line. In languages, such as German, where inter-character space is commonly used to mark e m p h a s i s (like this), allowing variable inter-character spacing would have the unintended effect of adding random emphasis, and should therefore be avoided.
In table headings that use Han ideographs, on the other hand, even extreme amounts of inter-character space commonly occur as short texts are spread out across the entire available space to distribute the characters evenly from end to end.
More complex formatting algorithms may take into account the interaction of line breaking decisions for the whole paragraph. The well-known text layout system [TEX] implements an example of such a globally optimal strategy that may make complex tradeoffs across an entire paragraph to avoid unnecessary hyphenation and other legal, but inferior breaks. For a description of this strategy, see [Knuth78].
The definition of optimal line breaks is outside the scope of this document, as are methods for their selection. For the purpose of this document, what is important is not so much what defines the optimal amount of text on the line, but how to determine all legal line break opportunities. Whether and how any given line break opportunity is actually used is up to the full layout system. Some layout systems will further evaluate the raw line break opportunities returned from the line breaking algorithm and apply additional rules. [TEX] for example, uses line break opportunities based on hyphens only as a last resort.
Finally, most text layout systems will support an emergency mode which handles the case of an unusual line that contains no ordinary line break opportunities. In such line layout emergencies line breaks are placed with no regard to the ordinary line breaking behavior of the characters involved.
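One way to picture the emergency mode is as a fallback applied only when no ordinary opportunity lies within the part of the line that fits. The sketch below is an assumption about how a layout system might structure this, not a description of any particular system; `choose_line_end` and its parameters are hypothetical names.

```python
def choose_line_end(opportunities, line_start, fit_limit):
    """Pick where the current line ends.

    opportunities: iterable of indices at which a line may ordinarily end.
    line_start: index of the first character of the line.
    fit_limit: largest index such that text[line_start:fit_limit] fits.
    Returns (index, emergency), where emergency is True if the break ignores
    the ordinary line breaking behavior of the characters involved.
    """
    if fit_limit <= line_start:
        raise ValueError("at least one character must fit on the line")
    candidates = [k for k in opportunities if line_start < k <= fit_limit]
    if candidates:
        return (max(candidates), False)
    # Emergency: no ordinary opportunity, so break wherever the text stops fitting.
    return (fit_limit, True)
```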
Three principal styles of context analysis determine line break opportunities.
The first, or Western, style is commonly used for scripts employing the space character. Hyphenation is often used with space-based line breaking to provide additional line break opportunities; however, it requires knowledge of the language, and it may also need user interaction or overrides.
The second style of context analysis is used with East Asian ideographic and syllabic scripts. In these scripts, lines can break anywhere, except before or after certain characters. The precise set of prohibited line breaks may depend on user preference or local custom and is commonly tailorable.
Korean makes use of both styles of line break. When Korean text is justified, the second style is commonly used, even for interspersed Latin letters. But when ragged margins are used, the Western style (relying on spaces) is commonly used instead, even for ideographs.
The third style is used for scripts such as Thai, which do not use spaces, but which restrict word-breaks to syllable boundaries, the determination of which requires knowledge of the language comparable to that required by a hyphenation algorithm. Such an algorithm is beyond the scope of the Unicode Standard.
For multilingual text, the Western and East Asian styles can be unified into a single set of specifications, based on the information in this report. Unicode characters have explicit line breaking properties assigned to them. These can be utilized with these two styles of context analysis for line break opportunities. Customization for user preferences or document style can then be achieved by tailoring that specification.
In bidirectional text, line breaks are determined before applying rule L1 of the Unicode Bidirectional Algorithm [Bidi]. However, line breaking is strictly independent of directional properties of the characters or of any auxiliary information determined by the application of rules of that algorithm.
There is no single method for determining line breaks; the rules may change based on user preference and document layout. Therefore the information in this annex, including the specification of the line breaking algorithm, is informative, rather than normative. However, some characters have been encoded explicitly for their effect on line breaking. Users adding such characters to a text expect that they will have the desired effect. For that reason, these characters have been given normative line breaking behavior.
Conformant implementations must not tailor characters with normative line breaking classes to any of the informative line breaking classes, but may tailor characters with informative line breaking classes to one of the normative line breaking classes.
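The tailoring restriction above can be checked mechanically. The following is a minimal sketch under stated assumptions: the set of normative class abbreviations is taken to be the classes listed under "Normative Line Breaking Classes" in Table 1, and the function name `invalid_tailorings` and the dictionary representation of class assignments are hypothetical.

```python
# Classes listed as normative in Table 1 (abbreviations as used in this annex).
NORMATIVE = {"BK", "CR", "LF", "CM", "NL", "SG", "WJ", "ZW", "GL", "CB", "SP"}


def invalid_tailorings(default_class, tailored_class):
    """Return the characters whose tailoring moves them from a normative
    class to an informative class.

    default_class: dict mapping character -> default line breaking class.
    tailored_class: dict mapping character -> tailored line breaking class.
    """
    bad = []
    for ch, new_class in tailored_class.items():
        old_class = default_class[ch]
        if old_class in NORMATIVE and new_class not in NORMATIVE:
            bad.append(ch)
    return bad
```

Tailoring an informative class to a normative one, for example AL to GL, is not reported.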
Higher-level protocols may further restrict, override, or extend the line breaking classes of certain characters in some contexts.
The specification of the Line Breaking Algorithm in this annex is informative. As stated in [Unicode] Section 3.2, Conformance Requirements, conformant implementations are not required to implement the Unicode Line Breaking Algorithm. The relationship between conformance to the Unicode Standard, and conformance to an individual Unicode Standard Annex (UAX) is described in more detail in the Unicode Standard in Section 3.2 Conformance Requirements.
There are many different ways to break lines of text, and the Unicode Standard does not restrict the ways in which implementations can do this. However, any Unicode-conformant implementation that purports to implement this specification must do so as described in the following clause. Implementations are free to deviate from this, as long as they do not purport to conform to this specification.
C1 | An implementation that claims conformance to the default Unicode Line Breaking Algorithm shall produce the same results as the algorithm published in this specification.
C2 | This specification defines default behavior, which is to be used in the absence of tailoring for particular languages and environments.
C3 | If tailoring is used by an implementation that claims conformance to the default Unicode Line Breaking Algorithm, the existence of such tailoring must be documented.
At times, this specification recommends best practice. These recommendations are not normative and conformance with this specification does not depend on their realization. These recommendations contain the expression "This specification recommends ...", or some similar wording.
This section provides detailed narrative descriptions of the line breaking behavior of many Unicode characters. In a few instances, the descriptions in this section provide additional detail about handling a given character at the end of a line, which goes beyond the simple determination of line breaks.
This section also summarizes the membership of character classes for each value of the line breaking property. Note that the mnemonic names for the line break classes are intended neither as exhaustive descriptions of their membership nor as indicators of their entire range of behaviors in the line breaking process. Instead their main purpose is to serve as unique, yet broadly mnemonic labels. In other words, as long as their line break behavior is identical, otherwise unrelated characters will be found grouped together in the same line break class.
The classification by property values defined in this section and in the data file is used as input into two algorithms defined in Section 6, Line Breaking Algorithm and Section 7, Pair-Table-based Implementation. These sections describe workable default line breaking methods. Section 8, Customization discusses how the default line breaking behavior can be tailored to the needs of particular languages for particular document styles and user preferences.
The full classification of all Unicode characters by their line breaking properties is available in the file LineBreak.txt [Data] in the Unicode Character Database [UCD]. This is a semicolon-delimited, two-column plain text file, with code position and line breaking class. A comment at the end of each line indicates the character name. Ideographic, Hangul, Surrogate, and Private Use ranges are collapsed by giving a range in the first column.
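A reader for a two-column file of this kind might look like the sketch below. It assumes that comments start with `#`, that ranges are written as `first..last`, and that the two fields are separated by a semicolon (as in the published data file) or by a tab; the names `parse_linebreak` and `lookup`, and the choice of XX as the fallback class, are hypothetical. The sample lines in the usage note are illustrative and are not copied from the data file.

```python
import re

# code point or range, then a separator (semicolon or tab), then the class
_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*[;\t]\s*(\w+)"
)


def parse_linebreak(lines):
    """Parse lines into a list of (first, last, line_breaking_class) tuples."""
    ranges = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        m = _LINE.match(line)
        if m is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        ranges.append((first, last, m.group(3)))
    return ranges


def lookup(ranges, code_point, default="XX"):
    """Return the class for a code point, or the default if it is not listed."""
    for first, last, lb_class in ranges:
        if first <= code_point <= last:
            return lb_class
    return default
```

For instance, parsing the illustrative lines `0020;SP # SPACE` and `D800..DFFF;SG # surrogates` gives two ranges, and `lookup` on U+D955 then returns SG. A production reader would use a sorted table with binary search rather than a linear scan.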
As more scripts are added to the Unicode Standard and become more widely implemented and used on computers, more line breaking classes may be added, or the assignment of line breaking class may be changed for some characters. Implementations should not make any assumptions to the contrary. Any future updates will be reflected in the latest version of the data file. (See the Unicode Character Database [UCD] for any specific version of the data file.)
Line breaking classes are listed alphabetically. Each line breaking class is marked with an annotation in parentheses with the following meanings:
(A) — the class allows a break opportunity after in specified contexts
(XA) — the class prevents a break opportunity after in specified contexts
(B) — the class allows a break opportunity before in specified contexts
(XB) — the class prevents a break opportunity before in specified contexts
(P) — the class allows a break opportunity for a pair of same characters
(XP) — the class prevents a break opportunity for a pair of same characters
NOTE: The use of the letters B and A in these annotations marks the position of the break opportunity relative to the character. It is not to be confused with the use of the same letters in the other parts of this document, where they indicate position of the characters relative to the break opportunity.
Ambiguous characters act either like alphabetic characters (that is, those with the AL line breaking class) or like ideographs (that is, characters with line breaking class ID), depending on context. In the absence of appropriate context information, they are treated as class AL.
As originally defined, this class contained all characters with East Asian Width property A (ambiguous width), and which would otherwise be AL in this classification. They take the AL line breaking class only when their resolved width is N (narrow) and take the line breaking class ID when their resolved width is W (wide). For more information on East Asian Width, and how to resolve it, see Unicode Standard Annex #11, East Asian Width [EAW].
The original definition included many Latin, Greek and Cyrillic characters for which a default assignment of the AL line breaking class better corresponds to modern practice. At the same time, the set of ambiguous characters has been extended to completely encompass the enclosed alphanumeric characters used for numbering of bullets.
As updated, with the exception of characters in the range U+0000..U+1FFF, this line breaking class includes all characters with East Asian Width A, plus the following characters:
24EA | CIRCLED DIGIT ZERO |
2780..2793 | DINGBAT CIRCLED SANS-SERIF DIGIT ONE..DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN |
The line breaking rules in Section 6, Line Breaking Algorithm and the pair table in Section 7, Pair Table-based Implementation, assume that all ambiguous characters have been resolved appropriately as part of assigning line breaking classes to the input characters.
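The resolution step assumed by the rules can be sketched as a small preprocessing function. This is an illustration only: the function name `resolve_ambiguous` and the encoding of the resolved East Asian Width as the strings "N" and "W" are assumptions, and determining the resolved width itself is left to [EAW].

```python
def resolve_ambiguous(lb_class, resolved_width=None):
    """Map class AI to AL or ID before the line breaking rules are applied.

    lb_class: the line breaking class assigned to the character.
    resolved_width: "N" (narrow), "W" (wide), or None when no context
    information is available.
    """
    if lb_class != "AI":
        return lb_class  # only ambiguous characters need resolving
    if resolved_width == "W":
        return "ID"
    # Narrow, or no context information: treated as AL.
    return "AL"
```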
Ordinary characters require other characters to provide break opportunities; otherwise, no line breaks are allowed between pairs of them. However, this behavior is tailorable. In some Far Eastern documents it may be desirable to allow breaking between pairs of ordinary characters, particularly Latin characters and symbols.
NOTE: Use ZWSP as a manual override to provide break opportunities around alphabetic or symbol characters.
Except as listed explicitly below as part of another line breaking class, and except as assigned class AI or ID based on East Asian Width, this class contains the following characters:
ALPHABETIC — all characters of General Categories Lu, Ll, Lt, Lm, Lo
SYMBOLS — all characters of General Categories Sm, Sk, So
NON-DECIMAL NUMBERS — all characters of General Categories Nl and No
PUNCTUATION — all characters of General Categories Pc, Pd and Po
plus these characters:
0600..0603 | ARABIC NUMBER SIGN..ARABIC SIGN SAFHA |
06DD | ARABIC END OF AYAH |
070F | SYRIAC ABBREVIATION MARK |
2061..2063 | FUNCTION APPLICATION..INVISIBLE SEPARATOR |
These characters occur in the middle or at the beginning of words or alphanumeric or symbol sequences. However, when alphabetic characters are tailored to allow breaks, these characters should not allow breaks after.
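The General Category portion of this membership rule can be approximated with Python's standard `unicodedata` module. The sketch below identifies candidates only: it does not apply the exceptions for characters listed under other line breaking classes or assigned AI or ID by East Asian Width, and the results reflect whatever Unicode version the Python runtime ships, not necessarily the version of this annex. The name `is_al_candidate` is hypothetical.

```python
import unicodedata

# General Categories named in the membership description above.
AL_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo",  # alphabetic
    "Sm", "Sk", "So",              # symbols
    "Nl", "No",                    # non-decimal numbers
    "Pc", "Pd", "Po",              # punctuation
}

# Additional characters listed explicitly above.
AL_EXTRA_RANGES = [(0x0600, 0x0603), (0x06DD, 0x06DD), (0x070F, 0x070F), (0x2061, 0x2063)]


def is_al_candidate(ch):
    """Return True if ch matches the category or explicit-range criteria for AL.

    Exceptions (membership in another class, or AI/ID by East Asian Width)
    are not handled here.
    """
    cp = ord(ch)
    if any(first <= cp <= last for first, last in AL_EXTRA_RANGES):
        return True
    return unicodedata.category(ch) in AL_CATEGORIES
```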
Like SPACE, the characters in this class provide a break opportunity, but unlike SPACE they do not take part in determining indirect breaks. They can be subdivided into several categories.
Breaking spaces are the following subset of characters with General Category Zs:
1680 | OGHAM SPACE MARK |
2000 | EN QUAD
2001 | EM QUAD
2002 | EN SPACE
2003 | EM SPACE
2004 | THREE-PER-EM SPACE
2005 | FOUR-PER-EM SPACE
2006 | SIX-PER-EM SPACE
2008 | PUNCTUATION SPACE
2009 | THIN SPACE
200A | HAIR SPACE
205F | MEDIUM MATHEMATICAL SPACE |
The preceding list of space characters all have a specific width, but otherwise behave as breaking spaces. In setting a justified line, none of these spaces normally changes in width, except for THIN SPACE when used in mathematical notation. See also the SP property.
The Ogham space mark is rendered visibly between words but should be elided at the end of a line.
See the ID property for U+3000 IDEOGRAPHIC SPACE. For a list of all space characters in the Unicode Standard, see Section 6.2, General Punctuation in [Unicode].
0009 | TAB
Except for the effect of the location of the tab stops, the tab character acts similarly to a space for the purpose of line breaking.
00AD | SOFT HYPHEN (SHY)
SHY marks an optional place where a line break may occur inside a word. It can be used with all scripts. SHY is rendered invisibly and has no width: it merely indicates an optional line break. The rendering of the optional line break depends on the script. For the Latin script, rendering the line break typically means displaying a hyphen at the end of the line; however, some languages require a change in spelling surrounding a line break. For examples, see Section 5.3 Use of Soft Hyphen.
Breaking hyphens establish explicit break opportunities immediately after each occurrence.
058A | ARMENIAN HYPHEN
2010 | HYPHEN
2012 | FIGURE DASH |
2013 | EN DASH |
Hyphens are graphic characters with width. Because, unlike spaces, they print, they are included in the measured part of the preceding line, except where the layout style allows hyphens to hang into the margins.
The following are other forms of visible word dividers that provide break opportunities:
0F0B | TIBETAN MARK INTERSYLLABIC TSHEG
1361 | ETHIOPIC WORDSPACE
17D5 | KHMER SIGN BARIYOOSAN
10100 | AEGEAN WORD SEPARATOR LINE |
10101 | AEGEAN WORD SEPARATOR DOT |
10102 | AEGEAN CHECK MARK |
1039F | UGARITIC WORD DIVIDER |
The Tibetan tsheg is a visible mark, but it functions effectively like a space to separate words (or other units) in Tibetan. It provides a break opportunity after itself, like space. For additional information, see Section 5.5 Tibetan Line Breaking.
The Ethiopian word space is a visible word delimiter and is kept on the previous line. In contrast, U+1360 ETHIOPIC SECTION MARK is typically used in a sequence of several such marks on a separate line, and separated by spaces. As such lines are typically marked with separate hard line breaks (BK), the section mark is treated like an ordinary symbol and given line break class AL.
2027 | HYPHENATION POINT
A hyphenation point is a raised dot, which is primarily used to visibly indicate syllabification of words. Syllable breaks are potential line break opportunities in the middle of words. It is mainly used in dictionaries and similar works. When an actual line break falls inside a word containing hyphenation point characters, the hyphenation point is rendered as a regular hyphen at the end of the line.
007C | VERTICAL LINE
In some dictionaries, a vertical bar is used instead of a hyphenation point. In this usage, U+0323 COMBINING DOT BELOW is used to mark stressed syllables, so all breaks are marked by the vertical bar. For an actual break opportunity, the vertical bar is rendered as a hyphen.
Historic texts, especially ancient ones, often do not use spaces, even for scripts where modern use of spaces is standard. Special punctuation was used to mark word boundaries in such texts. For modern text processing these should be treated as line break opportunities by default. WJ can be used to override this default, where necessary.
16EB | RUNIC SINGLE DOT PUNCTUATION |
16EC | RUNIC MULTIPLE DOT PUNCTUATION |
16ED | RUNIC CROSS PUNCTUATION |
2056 | THREE DOT PUNCTUATION |
2058 | FOUR DOT PUNCTUATION |
2059 | FIVE DOT PUNCTUATION |
205A | TWO DOT PUNCTUATION |
205B | FOUR DOT MARK |
205D | TRICOLON |
205E | VERTICAL FOUR DOTS |
DEVANAGARI DANDA is similar to a full stop. The danda or historically related symbols are used with several other Indic scripts. Unlike a full stop, the danda is not used in number formatting. DEVANAGARI DOUBLE DANDA marks the end of a verse. It also has analogues in other scripts.
0964 | DEVANAGARI DANDA |
0965 | DEVANAGARI DOUBLE DANDA |
0E5A | THAI CHARACTER ANGKHANKHU |
104A | MYANMAR SIGN LITTLE SECTION |
104B | MYANMAR SIGN SECTION |
1735 | PHILIPPINE SINGLE PUNCTUATION |
1736 | PHILIPPINE DOUBLE PUNCTUATION |
17D4 | KHMER SIGN KHAN |
17D5 | KHMER SIGN BARIYOOSAN |
17D8 | KHMER SIGN BEYYAL |
17DA | KHMER SIGN KOOMUUT |
10A56 | KHAROSHTHI PUNCTUATION DANDA |
10A57 | KHAROSHTHI PUNCTUATION DOUBLE DANDA |
0F85 | TIBETAN MARK PALUTA |
0F34 | TIBETAN MARK BSDUS RTAGS |
0F7F | TIBETAN SIGN RNAM BCAD |
0FBE | TIBETAN KU RU KHA |
0FBF | TIBETAN KU RU KHA BZHI MIG CAN |
For additional information, see Section 5.5 Tibetan Line Breaking.
Termination punctuation stays with the line, but otherwise allows a break after it. This is similar to EX, except that the latter may be separated by a space from the preceding word without allowing a break, whereas these marks are used without spaces.
1802 | MONGOLIAN COMMA |
1803 | MONGOLIAN FULL STOP |
1804 | MONGOLIAN COLON |
1805 | MONGOLIAN FOUR DOTS |
1808 | MONGOLIAN MANCHU COMMA |
1809 | MONGOLIAN MANCHU FULL STOP |
1A1E | BUGINESE PALLAWA |
2CF9 | COPTIC OLD NUBIAN FULL STOP |
2CFA | COPTIC OLD NUBIAN DIRECT QUESTION MARK |
2CFB | COPTIC OLD NUBIAN INDIRECT QUESTION MARK |
2CFC | COPTIC OLD NUBIAN VERSE DIVIDER |
2CFE | COPTIC FULL STOP |
2CFF | COPTIC MORPHOLOGICAL DIVIDER |
10A50 | KHAROSHTHI PUNCTUATION DOT |
10A51 | KHAROSHTHI PUNCTUATION SMALL CIRCLE |
10A52 | KHAROSHTHI PUNCTUATION CIRCLE |
10A53 | KHAROSHTHI PUNCTUATION CRESCENT BAR |
10A54 | KHAROSHTHI PUNCTUATION MANGALAM |
10A55 | KHAROSHTHI PUNCTUATION LOTUS |
00B4 | ACUTE ACCENT
In some dictionaries, stressed syllables are indicated with a spacing acute accent instead of the hyphenation point. In this case the accent moves to the next line, and the preceding line ends with a hyphen.
02C8 | MODIFIER LETTER VERTICAL LINE
02CC | MODIFIER LETTER LOW VERTICAL LINE
These characters are used in dictionaries to indicate stress and secondary stress when IPA is used. Both are prefixes to the stressed syllable in IPA. Breaking before them keeps them with the syllable.
NOTE: It is hard to find actual examples in most dictionaries because the pronunciation fields usually occur right after the headword, and the columns are wide enough to prevent line breaks in most pronunciations.
0F01 | TIBETAN MARK GTER YIG MGO TRUNCATED A |
0F02 | TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA |
0F03 | TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA |
0F04 | TIBETAN MARK INITIAL YIG MGO MDUN MA |
0F06 | TIBETAN MARK CARET YIG MGO PHUR SHAD MA |
0F07 | TIBETAN MARK YIG MGO TSHEG SHAD MA |
0F09 | TIBETAN MARK BSKUR YIG MGO |
0F0A | TIBETAN MARK BKA- SHOG YIG MGO |
0FD0 | TIBETAN MARK BSKA- SHOG GI MGO RGYAN |
0FD1 | TIBETAN MARK MNYAM YIG GI MGO RGYAN |
These characters are Tibetan head letters which allow a break before. For more information, see Section 5.5 Tibetan Line Breaking.
1806 | MONGOLIAN TODO SOFT HYPHEN
Despite its name, this Mongolian character is not an invisible control like SOFT HYPHEN, but rather a visible character like a regular hyphen. Unlike the hyphen, MONGOLIAN TODO SOFT HYPHEN stays with the following line. Whenever optional line breaks are to be marked, SOFT HYPHEN should be used instead.
2014 | EM DASH
The EM DASH is used to set off parenthetical text. Normally, it is used without spaces. However, this is language dependent. For example, in Swedish, spaces are used around the EM DASH. Line breaks can occur before and after an EM DASH, but not between a pair of them. Such pairs are sometimes used instead of a single quotation dash. For that reason, the line should not be broken between EM DASHes even though not all fonts use connecting glyphs for the EM DASH.
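The behavior of EM DASH pairs can be illustrated in isolation. The sketch below considers only the EM DASH rule described above and ignores every other line breaking class, so its output is not a complete list of break opportunities; the function name `em_dash_breaks` is hypothetical.

```python
EM_DASH = "\u2014"


def em_dash_breaks(text):
    """Return indices k such that a break before text[k] is allowed
    because text[k-1] or text[k] is an EM DASH.

    No break is reported between two adjacent EM DASHes.
    """
    positions = []
    for k in range(1, len(text)):
        before, after = text[k - 1], text[k]
        if before == EM_DASH and after == EM_DASH:
            continue  # keep a pair of EM DASHes together
        if before == EM_DASH or after == EM_DASH:
            positions.append(k)
    return positions
```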
Explicit breaks act independently of the surrounding characters.
000C |
FORM FEED |
FORM FEED separates pages. The text on the new page starts at the beginning of the line. No paragraph formatting is applied.
2028 |
LINE SEPARATOR |
The text after the Line Separator starts at the beginning of the line. No paragraph formatting is applied.
This is similar to HTML <BR>.
2029 |
PARAGRAPH SEPARATOR |
The text of the new paragraph starts at the beginning of the line. Paragraph formatting is applied.
“NEW LINE FUNCTION (NLF)”
New line functions provide additional explicit breaks. They are not individual characters, but are expressed as sequences of the control characters NEL, LF, and CR. What particular sequence(s) form a NLF depends on the implementation and other circumstances as described in [Unicode] Section 5.8, Newline Guidelines.
If a character sequence for a new line function contains more than one character, it is kept together. The default behavior is to break after LF or CR, but not between CR and LF. Two additional line breaking classes have been added for convenience in this operation.
FFFC |
OBJECT REPLACEMENT CHARACTER |
By default there is a break opportunity both before and after the object. Object-specific line breaking behavior is implemented in the associated object itself, and where available can override the default to prevent either or both of the break opportunities. Note that this is best implemented by querying the object itself, not by replacing the CB line breaking class by another class.
The closing character of any set of paired punctuation must be kept with the preceding character, and the same applies to all forms of wide comma and full stop. This line break class contains the following characters:
3001..3002 | IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP |
FE11 | PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA |
FE12 | PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP |
FE50 | SMALL COMMA |
FE52 | SMALL FULL STOP |
FF0C | FULLWIDTH COMMA |
FF0E | FULLWIDTH FULL STOP |
FF61 | HALFWIDTH IDEOGRAPHIC FULL STOP |
FF64 | HALFWIDTH IDEOGRAPHIC COMMA |
plus any characters of General Category Pe in the Unicode Character Database.
Combining character sequences are treated as units for the purpose of line breaking. The line breaking behavior of the sequence is that of the base character.
The preferred base character for showing combining marks in isolation is U+00A0 NO-BREAK SPACE. If a line break before or after the combining sequence is desired, U+200B ZERO WIDTH SPACE can be used. The use of U+0020 SPACE as a base character is deprecated.
The CM line break class includes all combining characters with General Category Mc, Me, and Mn, unless listed explicitly elsewhere. This includes viramas.
Most control and formatting characters are ignored in line breaking and do not contribute to the line width. By giving them class CM, the line breaking behavior of the last preceding character that is not of class CM affects the line breaking behavior.
NOTE: When control codes and format characters are rendered visibly during editing, more graceful layout might be achieved by assigning them the AL or ID class instead.
The CM line break class includes all characters of General Category Cc and Cf, unless listed explicitly elsewhere.
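The category-based membership of class CM can be illustrated with a minimal sketch. It assumes Python's standard unicodedata module, whose data reflects a later Unicode version than this annex, and a hypothetical, deliberately incomplete exception set standing in for the characters "listed explicitly elsewhere". The function name is_cm_by_default is likewise only illustrative.

```python
import unicodedata

# Code points with a CM-eligible General Category that this section assigns
# to other classes. Hypothetical and incomplete: a real implementation
# would read the assignments from the Unicode Character Database.
EXPLICIT_ELSEWHERE = {
    0x000A,  # LINE FEED (class LF)
    0x000C,  # FORM FEED (class BK)
    0x000D,  # CARRIAGE RETURN (class CR)
    0x0085,  # NEXT LINE (class NL)
    0x00AD,  # SOFT HYPHEN
    0x034F,  # COMBINING GRAPHEME JOINER (class GL)
    0x200B,  # ZERO WIDTH SPACE (class ZW)
    0x2060,  # WORD JOINER (class WJ)
    0xFEFF,  # ZERO WIDTH NO-BREAK SPACE (class WJ)
}

CM_CATEGORIES = {"Mc", "Me", "Mn", "Cc", "Cf"}


def is_cm_by_default(ch):
    """Return True if ch would fall into class CM by General Category alone."""
    cp = ord(ch)
    # 035D..0362 are the double diacritics listed under class GL.
    if cp in EXPLICIT_ELSEWHERE or 0x035D <= cp <= 0x0362:
        return False
    return unicodedata.category(ch) in CM_CATEGORIES
```

With these assumptions, U+0301 COMBINING ACUTE ACCENT and U+200D ZERO WIDTH JOINER are reported as CM, while U+200B ZERO WIDTH SPACE and the line-ending controls are not.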
000D |
CARRIAGE RETURN (CR) |
A CR indicates a mandatory break after, unless followed by a LF. See also the discussion under BK.
NOTE: On some platforms the sequence CR, CR, LF is used to indicate the location of actual line breaks, whereas CR LF is treated like a hard line break. As soon as a user edits the text, the location of all the CR CR LF may change as the new text breaks differently, while the relative position of the CR LF to the surrounding text stays the same. This convention allows an editor to return a buffer and the client is able to tell which text is displayed on which line, by counting CR CR LFs and CR LFs.
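The counting described in this note can be sketched as follows. The function name count_line_ends is hypothetical, and the sketch assumes the buffer uses only the two sequences named above.

```python
def count_line_ends(buffer):
    """Return (soft, hard): the number of CR CR LF sequences (actual display
    line breaks) and of CR LF sequences not preceded by CR (hard breaks)."""
    soft = 0
    hard = 0
    i = 0
    while i < len(buffer):
        if buffer.startswith("\r\r\n", i):
            soft += 1      # CR CR LF: where the editor wrapped the line
            i += 3
        elif buffer.startswith("\r\n", i):
            hard += 1      # CR LF: hard line break entered in the text
            i += 2
        else:
            i += 1
    return soft, hard
```

A client that receives "one two\r\r\nthree\r\nfour" can tell from the counts that the text occupies three display lines, of which only the second ends in a hard line break.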
Characters in this line break class behave like closing characters, except in relation to postfix and ‘non-starter’ characters. They include:
0021 | EXCLAMATION MARK |
003F | QUESTION MARK |
05C6 | HEBREW PUNCTUATION NUN HAFUKHA |
060C | ARABIC COMMA |
061B | ARABIC SEMICOLON |
061E | ARABIC TRIPLE DOT PUNCTUATION MARK |
061F | ARABIC QUESTION MARK |
066A | ARABIC PERCENT SIGN |
06D4 | ARABIC FULL STOP |
0F0D | TIBETAN MARK SHAD |
0F0E | TIBETAN MARK NYIS SHAD |
0F0F | TIBETAN MARK TSHEG SHAD |
0F10 | TIBETAN MARK NYIS TSHEG SHAD |
0F11 | TIBETAN MARK RIN CHEN SPUNGS SHAD |
0F14 | TIBETAN MARK GTER TSHEG |
1944 | LIMBU EXCLAMATION MARK |
1945 | LIMBU QUESTION MARK |
2762 | HEAVY EXCLAMATION MARK ORNAMENT |
2763 | HEAVY HEART EXCLAMATION MARK ORNAMENT |
FE56..FE57 | SMALL QUESTION MARK..SMALL EXCLAMATION MARK |
FF01 | FULLWIDTH EXCLAMATION MARK |
FF1F | FULLWIDTH QUESTION MARK |
FE15 | PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK |
FE16 | PRESENTATION FORM FOR VERTICAL QUESTION MARK |
Non-breaking characters prohibit breaks on either side, but that prohibition can be overridden by SP or ZW. In particular, when NBSP follows SPACE, there is a break opportunity after the SPACE and NBSP will go as visible space onto the next line. See also WJ. The following lists the characters of line break class GL with additional description.
00A0 | NO-BREAK SPACE (NBSP) |
202F | NARROW NO-BREAK SPACE (NNBSP) |
180E | MONGOLIAN VOWEL SEPARATOR (MVS) |
NO-BREAK SPACE is the preferred character to use where two words should be visually separated but kept on the same line, as in the case of a title and a name “Dr.<NBSP>Joseph Becker”. When SPACE follows NBSP, there is no break, because there never is a break in front of SPACE. NARROW NO-BREAK SPACE is used in Mongolian. The MONGOLIAN VOWEL SEPARATOR acts like a NNBSP in its line breaking behavior. It additionally affects the shaping of certain vowel characters as described in [Unicode] Section 12.3, Mongolian.
034F |
COMBINING GRAPHEME JOINER |
This character has no visible glyph and its presence indicates that adjoining characters are to be treated as a graphemic unit, therefore preventing line breaks between them.
2007 |
FIGURE SPACE |
This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.
2011 |
NON-BREAKING HYPHEN (NBHY) |
This is the preferred character to use where words must be hyphenated but may not be broken at the hyphen.
0F08 | TIBETAN MARK SBRUL SHAD |
0F0C | TIBETAN MARK DELIMITER TSHEG BSTAR |
0F12 | TIBETAN MARK RGYA GRAM SHAD |
The TSHEG BSTAR looks exactly like a Tibetan tsheg, but can be used to prevent a break like no-break space. It inhibits breaking on either side. For more information see Section 5.5 Tibetan Line Breaking.
035D..0362 | COMBINING DOUBLE BREVE..COMBINING DOUBLE RIGHTWARDS ARROW BELOW |
These diacritics span two characters, thus no word or line breaks are possible on either side.
Some dictionaries use a character that looks like a vertical series of four dots to indicate places where there is a syllable, but no allowable break. This can be represented by a sequence of 205E VERTICAL FOUR DOTS followed by 2060 WORD JOINER.
All characters of Hangul Syllable Type LV.
Together with conjoining jamos, Hangul syllables form Korean Syllable Blocks which are kept together; see [Boundaries]. Korean uses space-based line breaking in many styles of documents. In that case Hangul syllables and conjoining jamo are tailored to use class AL but the default is class ID. See also JL, JT, JV and H3.
All characters of Hangul Syllable Type LVT. See also JL, JT, JV and H2.
002D |
HYPHEN-MINUS |
Some additional context analysis is required to distinguish usage of this character as a hyphen from the use as minus sign (or indicator of numerical range). If used as hyphen, it acts like hyphen.
NOTE: Some typescript conventions use runs of HYPHEN-MINUS to stand in for longer dashes or horizontal rules. If actual character code conversion is not performed and it is desired to treat them like the characters or layout elements they stand for, line breaking needs to support these runs explicitly.
NOTE: The actual set of characters in this class includes characters other than Han ideographs.
Characters with this property do not require other characters to provide break opportunities; lines can ordinarily break before and after and between pairs of ideographic characters. The ID line break class consists of:
2E80..2FFF |
CJK, KANGXI RADICALS, DESCRIPTION SYMBOLS |
3000 |
IDEOGRAPHIC SPACE |
|
Hiragana (except small characters) |
|
Katakana (except small characters) |
3400..4DBF |
CJK UNIFIED IDEOGRAPHS EXTENSION A |
4E00..9FAF |
CJK UNIFIED IDEOGRAPHS |
F900..FAFF |
CJK COMPATIBILITY IDEOGRAPHS |
A000..A48F |
YI SYLLABLES |
A490..A4CF |
YI RADICALS |
FE62..FE66 |
SMALL PLUS SIGN to SMALL EQUALS SIGN |
FF10..FF19 |
WIDE DIGITS |
20000..2A6D6 | CJK UNIFIED IDEOGRAPHS EXTENSION B |
2F800..2FA1D | CJK COMPATIBILITY IDEOGRAPHS SUPPLEMENT |
plus all of the FULLWIDTH LATIN letters and all of the 3000-33FF blocks not covered elsewhere.
NOTE: Use 2060 WORD JOINER as a manual override to prevent break opportunities around characters of class ID.
U+3000 IDEOGRAPHIC SPACE may be subject to expansion or compression during line justification.
Korean is encoded with conjoining jamo, Hangul syllables or both. See also JL, JT, JV, H2 and H3. The following set of compatibility jamo are treated as ID by default.
3130..318F |
HANGUL COMPATIBILITY JAMO |
These characters are intended to be used in consecutive sequence. There is never a line break between two characters of this class.
2024 | ONE DOT LEADER |
2025 | TWO DOT LEADER |
2026 | HORIZONTAL ELLIPSIS |
FE19 | PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS |
Horizontal ellipsis can be used as a three-dot leader.
002C | COMMA |
002E | FULL STOP |
003A | COLON |
003B | SEMICOLON |
037E | GREEK QUESTION MARK (canonically equivalent to 003B) |
0589 | ARMENIAN FULL STOP |
060D | ARABIC DATE SEPARATOR |
2044 | FRACTION SLASH |
FE10 | PRESENTATION FORM FOR VERTICAL COMMA |
FE13 | PRESENTATION FORM FOR VERTICAL COLON |
FE14 | PRESENTATION FORM FOR VERTICAL SEMICOLON |
Characters that usually occur inside a numerical expression may not be separated from the numeric characters that follow, unless a space character intervenes. For example, there is no break in “100.00” or “10,000”, nor in “12:59”.
Infix separators are sentence ending punctuation when not used in a numeric context. Therefore they always prevent breaks before.
The JL line break class consists of all characters of Hangul Syllable Type L.
Conjoining Jamos form Korean Syllable Blocks which are kept together; see [Boundaries]. Korean uses space-based line breaking in many styles of documents. In that case Hangul Syllables and Conjoining Jamo are tailored to use class AL but the default is class ID. See Section 8.1, Types of Tailoring. See also JT, JV, H2 and H3.
The JT line break class consists of all characters of Hangul Syllable Type T. See also JL, JV, H2 and H3.
The JV line break class consists of all characters of Hangul Syllable Type V. See also JL, JT, H2 and H3.
000A |
LINE FEED (LF) |
There is a mandatory break after any LF character, but see the discussion under BK.
0085 |
NEXT LINE (NEL) |
There is a mandatory break after any NEL character, but see the discussion under BK.
Non-starter characters cannot start a line, but unlike CL they may allow a break in some contexts when they follow one or more space characters. Non-starters include:
0E5A..0E5B | THAI CHARACTER ANGKHANKHU..THAI CHARACTER KHOMUT |
17D4 | KHMER SIGN KHAN |
17D6..17DA | KHMER SIGN CAMNUC PII KUUH..KHMER SIGN KOOMUUT |
203C | DOUBLE EXCLAMATION MARK |
3005 | IDEOGRAPHIC ITERATION MARK |
301C | WAVE DASH |
303C | MASU MARK |
303B | VERTICAL IDEOGRAPHIC ITERATION MARK |
309B..309E | KATAKANA-HIRAGANA VOICED SOUND MARK..HIRAGANA VOICED ITERATION MARK |
30A0 | KATAKANA-HIRAGANA DOUBLE HYPHEN |
30FB..30FE | KATAKANA MIDDLE DOT..KATAKANA VOICED ITERATION MARK |
A015 | YI SYLLABLE WU |
FE54..FE55 | SMALL SEMICOLON..SMALL COLON |
FF1A..FF1B | FULLWIDTH COLON..FULLWIDTH SEMICOLON |
FF65 | HALFWIDTH KATAKANA MIDDLE DOT |
FF70 | HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK |
FF9E..FF9F | HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK |
plus all Hiragana, Katakana, and Halfwidth Katakana “small” characters
NOTE: Optionally, the NS restriction may be relaxed and characters treated like ID, to achieve a more permissive style of line breaking, particularly in some East Asian document styles.
These characters behave like ordinary characters in the context of ordinary characters but activate the prefix and postfix behavior of prefix and postfix characters.
Numeric characters consist of DECIMAL DIGITS (all characters of General Category Nd, except FULLWIDTH) plus these characters:
066B | ARABIC DECIMAL SEPARATOR |
066C | ARABIC THOUSANDS SEPARATOR |
Unlike IS, the Arabic numeric punctuation does not occur as sentence terminal punctuation outside numbers.
The opening character of any set of paired punctuation must be kept with the following character.
The OP line break class consists of all characters of General Category Ps in the Unicode Character Database.
Characters that usually follow a numerical expression may not be separated from preceding numeric characters or preceding closing characters, even if one or more space characters intervene. For example, there is no break in “(12.00) %”.
The list of post-fix characters is:
0025 |
PERCENT SIGN |
00A2 |
CENT SIGN |
00B0 |
DEGREE SIGN |
060B |
AFGHANI SIGN |
2030 |
PER MILLE SIGN |
2031 |
PER TEN THOUSAND SIGN |
2032..2037 |
PRIME..REVERSED TRIPLE PRIME |
20A7 |
PESETA SIGN |
2103 |
DEGREE CELSIUS |
2109 |
DEGREE FAHRENHEIT |
FDFC |
RIAL SIGN |
FE6A |
SMALL PERCENT SIGN |
FF05 |
FULLWIDTH PERCENT SIGN |
FFE0 |
FULLWIDTH CENT SIGN |
Alphabetic characters are also widely used as unit designators in a post-fix position. For purposes of line breaking, their classification as alphabetic is sufficient to keep them together with the preceding number.
Characters that usually precede a numerical expression may not be separated from following numeric characters or following opening characters, even if a space character intervenes. For example, there is no break in “$ (100.00)”.
The PR line break class consists of all currency symbols (General Category Sc) except as listed explicitly in PO as well as the following:
002B |
PLUS SIGN |
005C |
REVERSE SOLIDUS |
00B1 |
PLUS-MINUS SIGN |
2116 |
NUMERO SIGN |
2212 |
MINUS SIGN |
2213 |
MINUS-OR-PLUS SIGN |
NOTE: Many currency symbols may be used either as prefix or as postfix, depending on local convention. When used in that way, these currency symbols should be treated as if they have line breaking class PO.
Some paired characters can be either opening or closing depending on usage. The default is to treat them as both opening and closing.
NOTE: If language information is available, it can be used to determine which character is used as opening and which as closing quote. See the information in [Unicode] Section 6.2, General Punctuation.
The QU line break class consists of characters of General Category Pf or Pi in the Unicode Character Database as well as:
0022 | QUOTATION MARK |
0027 | APOSTROPHE |
23B6 | BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET |
275B | HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT |
275C | HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT |
275D | HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT |
275E | HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT |
U+23B6 BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET is subtly different from the others in this class, in that it is both an opening and a closing punctuation character at the same time. However, its use is limited to certain vertical text modes in terminal emulation. Instead of creating a one of a kind class for this rarely used character, assigning it to the QU class approximates the intended behavior.
Runs of these characters require morphological analysis to determine break opportunities. This is similar to e.g. a hyphenation algorithm. For the characters that have this property, no line breaks will be found otherwise, therefore complex context analysis is mandatory.
NOTE: These characters can be mapped into their equivalent line breaking classes as result of dictionary lookup, thus permitting a logical separation of this algorithm from the morphological analysis.
If dictionary lookup is not available they should be treated as XX.
All characters of General Category Cf, Lo or Lm in these ranges:
0E00..0EFF |
THAI / LAO |
1000..109F |
MYANMAR |
1780..17FF |
KHMER |
Line break class SG comprises all code points with General Category Cs. The line breaking behavior of isolated surrogates is undefined.
NOTE: The use of this line breaking class is deprecated. It was of limited usefulness for UTF-16 implementations that are not supporting characters beyond the BMP. The correct implementation is to resolve a pair of surrogates into a supplementary character before line breaking.
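Resolving surrogate pairs before line breaking can be sketched as below. The function name resolve_surrogates is hypothetical, and the input is assumed to be a list of UTF-16 code units given as integers. Isolated surrogates are passed through unchanged, since their line breaking behavior is undefined.

```python
def resolve_surrogates(units):
    """Combine UTF-16 surrogate pairs in `units` into supplementary code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 decoding of a high/low surrogate pair.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an isolated surrogate left as is.
            code_points.append(unit)
            i += 1
    return code_points
```

Line breaking classes are then assigned to the resulting code points, so that class SG is only ever needed for isolated surrogates.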
0020 |
SPACE (SP) |
The space characters are explicit break opportunities; however, spaces at the end of a line are not measured for fit. If there is a sequence of space characters, and breaking after any of the space characters would result in the same visible line, the line breaking position after the last space character in the sequence is the locally optimal one. In other words, because the last character measured for fit is before the space character, any number of space characters are kept together invisibly on the previous line and the first non-space character starts the next line.
NOTE: SPACE, but none of the other breaking spaces, is used in determining an indirect break.
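The treatment of trailing spaces when measuring a line for fit can be sketched as follows. The sketch assumes, purely for illustration, a layout in which every character occupies one column. The names measured_width and fits are hypothetical.

```python
def measured_width(line):
    """Width of `line` in columns, ignoring any trailing U+0020 SPACE."""
    return len(line.rstrip("\u0020"))


def fits(line, max_columns):
    """True if `line` fits, given that trailing spaces are not measured."""
    return measured_width(line) <= max_columns
```

Under this assumption a five-letter word followed by three spaces still fits a five-column line, so the break falls after the last space and the next word starts the following line.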
URLs are now so common in regular plain text, that they must be taken into account when assigning general-purpose line breaking properties. The SY line breaking property is intended to provide a break after, but not in front of digits so as to not break “1/2” or “06/07/99”.
002F |
SOLIDUS |
Slash (SOLIDUS) is allowed as an additional, limited break opportunity to improve layout of web addresses. As a side effect, some common abbreviations such as "w/o" or "A/S" which normally would not be broken, acquire a line break opportunity. The recommendation in this case is for the layout system not to utilize a line break opportunity allowed by SY unless the distance between it and the next line break opportunity exceeds an implementation defined minimal distance.
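The recommendation above might be sketched as a filter over candidate break offsets. The names are hypothetical, and the sketch makes several assumptions. Distance is counted in characters rather than in rendered width. The "next" opportunity is taken to be the next entry in the unfiltered list. An SY opportunity with no following opportunity is kept.

```python
def filter_sy_breaks(opportunities, sy_offsets, min_distance):
    """Drop break opportunities contributed by class SY that are followed
    too closely by another break opportunity.

    opportunities: sorted offsets of all break opportunities
    sy_offsets:    the subset of offsets allowed only because of SY
    min_distance:  implementation-defined minimal distance
    """
    kept = []
    for index, offset in enumerate(opportunities):
        if offset in sy_offsets and index + 1 < len(opportunities):
            distance = opportunities[index + 1] - offset
            if distance <= min_distance:
                continue  # the next opportunity is close enough; skip SY
        kept.append(offset)
    return kept
```

For the text "w/o a break", with opportunities at offsets 2 (after the slash), 4, 6 and 11, a minimal distance of 3 suppresses the opportunity inside "w/o" while leaving the others in place.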
NOTE: Normally, symbols are treated as AL. However, additional symbols can be added to this line breaking class, or classes BA, BB, B2 by tailoring. This can be used to allow additional line breaks, for example after “=”. Mathematics requires additional specifications for line breaking, which are outside the scope of this document.
These characters glue together both left and right neighbor characters such that they are kept on the same line.
2060 |
WORD JOINER (WJ) |
FEFF |
ZERO WIDTH NO-BREAK SPACE (ZWNBSP) |
The word joiner character is the preferred choice for an invisible character to keep other characters together that would otherwise be split across the line at a direct break. The character FEFF has the same effect, but because it is also used in an unrelated way as a byte order mark, the use of the WJ as the preferred interword glue simplifies the handling of FEFF.
By definition WJ and ZWNBSP take precedence over the action of SP, but not ZW.
The XX line break class consists of all characters with General Category Co and all code points with General Category Cn.
Unassigned code positions, private use characters and characters for which reliable line breaking information is not available are assigned this default line breaking property. The default behavior for this class is identical to class AL. Users can manually insert ZWSP or word joiner around characters of class XX to allow or prevent breaks as needed.
In addition, implementations can override or tailor this default behavior, for example by assigning characters the property ID or another class. Doing so may give better default behavior for their users. There are other possible means of determining the desired behavior of private use characters. For example one implementation might treat any private use character in ideographic context as ID, while another implementation might support a method for assigning specific properties to specific definitions of private use characters. The details of such use of private use characters are outside the scope of this standard.
For supplementary characters, a useful default is to treat characters in the range 0x10000 to 0x1FFFD as AL and characters in the range 0x20000 to 0x2FFFD, and 0x30000 to 0x3FFFD as ID, until the implementation can be revised to take into account the actual line breaking properties for these characters.
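The interim default for supplementary characters can be written as a small lookup. The function name is hypothetical, and returning None for code points outside the stated ranges is a choice made for this sketch.

```python
def default_supplementary_class(cp):
    """Interim line breaking class for a supplementary code point, or None
    if the code point lies outside the ranges given in the text."""
    if 0x10000 <= cp <= 0x1FFFD:
        return "AL"
    if 0x20000 <= cp <= 0x2FFFD or 0x30000 <= cp <= 0x3FFFD:
        return "ID"
    return None
```

An implementation would consult this fallback only for code points that have no entry in its own property data.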
For more information on handling default property values for unassigned characters, see the discussion on default property values in Section 5.3, Unknown and Missing Characters of [Unicode].
The line breaking rules in Section 6, Line Breaking Algorithm and the pair table in Section 7, Pair Table-based Implementation, assume that all unknown characters have been assigned one of the other line breaking classes, such as AL, as part of assigning line breaking classes to the input characters.
200B |
ZERO WIDTH SPACE (ZWSP) |
This character is used to enable additional (invisible) break opportunities wherever SPACE cannot be used. As its name implies, it normally has no width. However, its presence between two characters does not prevent increased letter spacing in justification.
Dictionaries follow specific conventions that guide their use of special characters to indicate features of the terms they list. Marks used for some of these conventions may occur near line break opportunities and therefore interact with line breaking. For example, in one dictionary a natural hyphen in a word becomes a tilde dash when the word is split.
Examples of conventions used in several dictionaries are briefly described in this subsection. Where possible, the default line breaking properties for characters commonly used in dictionaries have been assigned so as to accommodate these and similar conventions. However, implementing the full conventions in dictionaries requires special support.
Looking up the noun “syllable” in eight dictionaries yields eight different conventions:
Dictionary of the English Language, Samuel Johnson, 1843 SY´LLABLE where ´ is an oversized U+02B9 and follows the vowel of the main syllable (not the syllable itself).
Oxford English Dictionary (1st Edition) si·lă'bl where · is a slightly raised middle dot indicating the vowel of the stressed syllable (similar to Johnson's acute). The letter ă is U+0103. The ' is an apostrophe.
Oxford English Dictionary (2nd Edition) has gone to IPA 'sIləb(ə)l where ' is U+02C8, I is U+026A, ə is U+0259 (both times). The ' comes before the stressed syllable. The () indicate the schwa may be omitted.
Chambers English Dictionary (7th Edition) sil´ə-bl where the stressed syllable is followed by ´ U+02B9, ə is U+0259, and - is a hyphen. When splitting a word like abate´- ment the stress mark ´ goes after stressed syllable followed by the hyphen. No special convention is used, when splitting at hyphen.
BBC English Dictionary sIləbl where I is <U+026A, U+0332>, ə is U+0259. The vowel of the stressed syllable is underlined.
Collins Cobuild English Language Dictionary sIləbə°l where I is <U+026A, U+0332> and has the same meaning as in the BBC English dictionary. The ə is U+0259 (both times). The ° is a U+2070 and indicates the schwa may be omitted.
Readers Digest Great Illustrated Dictionary. syl·la·ble (sílləb'l) The spelling of the word has hyphenation points (· is a U+2027) followed by phonetic spelling. The vowel of the stressed syllable is given an accent, rather than being followed by an accent. The ' is an apostrophe.
Webster's 3rd New International Dictionary. syl·la·ble /'siləbəl/ The spelling of the word has hyphenation points (· is a U+2027) and is followed by phonetic spelling. The stressed syllable is preceded by ' U+02C8. The ə's are schwas as usual. Webster splits words at the end of a line with a normal hyphen. A U+2E17 DOUBLE OBLIQUE HYPHEN indicates that a hyphenated word is split at the hyphen.
Unlike U+2010 HYPHEN, which always has a visible rendition, the character U+00AD SOFT HYPHEN (SHY) is an invisible format character that merely indicates a preferred intra-word line-break position. If the line is broken at that point, then whatever mechanism is appropriate for intra-word line-breaks should be invoked, just as if the line break had been triggered by another mechanism, such as a dictionary lookup. Depending on the language and the word, that may produce different visible results, such as:
Here are a few examples of spelling changes:
Each example shows the line break as “ / ” and any inserted hyphens. There are many other cases. The inserted hyphen glyph can take a wide variety of shapes, as appropriate for the situation. Examples include shapes like U+2010 HYPHEN, U+058A ARMENIAN HYPHEN, or U+180A MONGOLIAN NIRUGU, or U+1806 MONGOLIAN TODO SOFT HYPHEN.
When a SHY is used to represent a possible hyphenation location, the spelling is that of the word without hyphenation: “tug<SHY>gummi”. It is up to the line breaking implementation to make any necessary spelling changes when such a possible hyphenation is actually used.
Sometimes it is desirable to encode text that includes line breaking decisions and will not be further broken into lines. If such text includes hyphenations, the spelling must reflect the changes due to hyphenation: “tugg<U+2010>/ gummi”, including the appropriate character for any inserted hyphen. For a list of dash-like characters in Unicode see Section 6.2, General Punctuation in [Unicode].
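The relationship between the stored spelling with SHY and the broken spelling can be sketched as follows. The function break_at_shy and its respell hook are hypothetical names. The double_consonant rule is a toy stand-in, not an actual hyphenation rule for any language, included only to reproduce the “tug<SHY>gummi” example. U+2010 HYPHEN is assumed as the inserted hyphen.

```python
SHY = "\u00AD"     # SOFT HYPHEN
HYPHEN = "\u2010"  # HYPHEN, assumed here as the inserted visible hyphen


def break_at_shy(word, which, respell=None):
    """Break `word` at its `which`-th SHY (0-based). Returns the text that
    ends the first line, including the visible hyphen, and the text that
    starts the next line. All SHYs are removed from the output."""
    parts = word.split(SHY)
    if not 0 <= which < len(parts) - 1:
        raise ValueError("no such SOFT HYPHEN in word")
    head = "".join(parts[: which + 1])
    tail = "".join(parts[which + 1 :])
    if respell is not None:
        # Language-specific spelling change, supplied by the caller.
        head, tail = respell(head, tail)
    return head + HYPHEN, tail


def double_consonant(head, tail):
    """Toy rule: repeat the consonant when both halves meet on the same letter."""
    if head and tail and head[-1] == tail[0]:
        return head + tail[0], tail
    return head, tail
```

Calling break_at_shy("tug\u00ADgummi", 0, double_consonant) yields the two line fragments "tugg" followed by U+2010, and "gummi", matching the spelling required for text that already includes its line breaking decisions.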
There are three types of hyphens: explicit hyphens, conditional hyphens, and dictionary-inserted hyphens resulting from a hyphenation process. There is no character code for the third kind of hyphen; therefore if a distinction is desired, the fact that a hyphen is dictionary-inserted must be represented out of band, or by using another control code instead of SHY.
The action of a hyphenation algorithm is equivalent to the insertion of a SHY. However, when a word contains an explicit SHY it is customarily treated as overriding the action of the hyphenator for that word.
In some fonts, notably Fraktur fonts, it is customary to use a double-stroke form of the hyphen, usually oblique. Such use is merely a font-based glyph variation and does not affect line breaking in any way. In texts using such a font, automatic hyphenation or SHY would also result in the display of a double-stroke, oblique hyphen.
In some dictionaries, such as Webster's 3rd New International Dictionary, double-stroke, oblique hyphens are used to indicate a hyphen at the end of the line that should be retained when the term shown is not line wrapped. It is not necessary to store a special character in the data, merely to substitute the glyph of any ordinary hyphen that ends up at the end of a line. In such convention, automatic hyphenation or SHY would result in the display of an ordinary hyphen without further substitution.
Certain linguistic notations make use of a double-stroke, oblique hyphen to indicate specific features. The U+2E17 DOUBLE OBLIQUE HYPHEN character used in this case is not a hyphen and does not represent a line break opportunity. Automatic hyphenation or SHY would result in the display of an ordinary hyphen.
The Tibetan script uses spaces sparingly, relying instead on the tsheg. There is no punctuation equivalent to a period in Tibetan; Tibetan shad characters indicate the end of a "phrase" not a sentence. "Phrases" are often metrical, that is, written after every N syllables, and a new sentence can often start within the middle of a phrase. Sentence boundaries need to be determined grammatically rather than by punctuation.
Traditionally there is nothing akin to a paragraph in Tibetan text. It is typical to have many pages of text without a paragraph break, that is, without an explicit line break. The closest thing to a paragraph in Tibetan is a new section or topic starting with U+0F12 or U+0F08. However, these occur in-line: one section ends and a new one starts on the same line and the new section is marked only by the presence of one of these characters.
Some modern books, newspapers, and magazines format text more like English with a break before each section or topic - and (often) the title of the section on a separate line. Where this is done, authors do insert an explicit line break. Western punctuation (full stop, question mark, exclamation mark, comma, colon, semi colon, quotes) is starting to appear in Tibetan documents, particularly those published in India, Bhutan and Nepal. Because there are no formal rules for their use in Tibetan they get treated generically by default. In Tibetan documents published in China, CJK bracket and punctuation characters occur frequently; these should be treated as in Chinese written horizontally.
NOTE: The detailed rules for formatting Tibetan texts are complex, and the original assignment of line break classes was found to be wholly insufficient for the purpose. In Unicode 4.1.0 the assignment of line break classes for Tibetan has been revised significantly in an attempt to better model Tibetan line breaking behavior. No new rules or line break classes were added. As yet there is limited practical experience with the revised assignment of line break classes. As more experience is gained, some modifications, possibly including new rules or additional line break classes, can be expected. Nevertheless the current set of line break classes should provide a good starting point.
It is the stated intention of the Unicode Consortium to review these assignments in a future version and to furnish a more detailed and complete description of Tibetan line breaking and line formatting behavior.
UAX#29 Text Boundaries, [Boundaries], describes a particular method for boundary detection. It is based on a set of hierarchical rules and character classifications. That method is well suited for implementation of some of the advanced heuristics for line breaking.
A slightly simplified implementation of such an algorithm can be devised that uses a two dimensional table to resolve break opportunities between pairs of characters. It is described in Section 7, Pair Table-based Implementation.
The line breaking algorithm presented in this section can be expressed in a series of rules which take line breaking classes as input. The line breaking rules are stated in terms of regular expressions over the line breaking classes defined in Section 5.2, Description of Line Breaking Properties and three special symbols indicating the type of line break opportunity.
! Mandatory break at the indicated position
× No break allowed at the indicated position
÷ Break allowed at the indicated position
The rules are applied in order. That is, there is an implicit “otherwise” at the front of each rule following the first. It is possible to construct alternate sets of such rules that are fully equivalent. To be equivalent, an alternate set of rules must have the same effect.
The distinction between a direct and an indirect break is handled by explicitly considering the effect of SP in rule LB 12. Because rules are applied in order, rule LB 12 implies that a prohibited break in rules LB 13–LB 19 is equivalent to an indirect break.
The examples for each rule use representative characters, where ‘H’ stands for an ideograph, ‘h’ for small kana, and ‘9’ for digits. Except where a rule contains no expressions, the italicized text of the rule is intended merely as a handy summary.
Resolve line breaking classes:
LB 1 Assign a line breaking class to each code point of the input. Resolve AI, CB, SA, SG, and XX into other line breaking classes depending on criteria outside the scope of this algorithm.
Start and end of text:
LB 2a Never break at the start of text.
× sot
LB 2b Always break at the end of text.
! eot
These two rules are designed to deal with degenerate cases, so that there is at least one character on each line, and at least one line break for the whole text. Emergency line breaking behavior usually also allows line breaks anywhere on the line if a legal line break cannot be found. This has the effect of preventing text from running into the margins.
Mandatory breaks:
LB 3a Always break after hard line breaks (but never between CR and LF).
BK !
LB 3b Treat CR followed by LF, as well as CR, LF and NL as hard line breaks.
CR × LF
CR !
LF !
NL !
LB 3c Do not break before hard line breaks.
× ( BK | CR | LF | NL )
Note: A hard line break can consist of BK or a New Line Function (NLF) as described in Section 5.8 Newline Guidelines of [Unicode]. These three rules are designed to handle the line ending and line separating characters as described there.
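The mandatory break rules above can be sketched as a small predicate over an array of line breaking classes. This is an illustration only: the reduced LineClass enumeration and the function name mandatoryBreakAfter are hypothetical and are not part of the sample code [Code].

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, reduced set of line breaking classes for this sketch.
enum class LineClass { AL, SP, BK, CR, LF, NL };

// True if rules LB3a/LB3b place a mandatory break (!) after position i.
bool mandatoryBreakAfter(const std::vector<LineClass>& cls, std::size_t i) {
    switch (cls[i]) {
    case LineClass::BK:
    case LineClass::LF:
    case LineClass::NL:
        return true;
    case LineClass::CR:
        // CR × LF: never break inside a CR LF pair
        return !(i + 1 < cls.size() && cls[i + 1] == LineClass::LF);
    default:
        return false;
    }
}
```

With the sequence AL CR LF AL, the predicate is false after the CR and true after the LF, so the CR LF pair acts as a single hard line break.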
Explicit breaks and non-breaks:
LB 4 Do not break before spaces or zero-width space.
× SP
× ZW
LB 5 Break after zero-width space.
ZW ÷
Combining Marks:
LB 6 [replaced by 18b and 18c].
See Section 8.3, Legacy Support for Space Character as Base for Combining Marks.
LB 7b Do not break a combining character sequence; treat it as if it has the LB class of the base character in all of the following rules.
Treat X CM* as if it were X.
Where X is any line break class except SP, BK, CR, LF, NL or ZW.
At any possible break opportunity between CM and a following character, CM behaves as if it had the type of its base character. Note that despite the summary title of this rule it is not limited to standard combining character sequences. For the purposes of line breaking, sequences containing most of the control codes or layout control characters are treated like combining sequences.
LB 7c Treat any remaining combining mark as AL.
Treat any remaining CM as if it were AL.
This catches the case where a CM is the first character on the line, or follows SP, BK, CR, LF, NL or ZW.
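Rules LB7b and LB7c can also be read as a preprocessing pass over the class sequence. The following is a minimal sketch under that reading, not the method used by the sample code in Section 7.5; the reduced LineClass enumeration and the function name resolveCombiningMarks are hypothetical. It computes only the effective class of each position; the prohibition of breaks inside X CM* must still be applied separately.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, reduced set of line breaking classes for this sketch.
enum class LineClass { AL, CM, SP, BK, CR, LF, NL, ZW, OP, ID };

// Classes that rule LB7b excludes as a base X.
static bool isExcludedBase(LineClass c) {
    return c == LineClass::SP || c == LineClass::BK || c == LineClass::CR ||
           c == LineClass::LF || c == LineClass::NL || c == LineClass::ZW;
}

// Returns, for each position, the class to use in the later rules:
//  - LB7b: a CM that follows a base X (or earlier CMs of X) takes X's class
//  - LB7c: any remaining CM is treated as AL
std::vector<LineClass> resolveCombiningMarks(const std::vector<LineClass>& in) {
    std::vector<LineClass> out(in);
    bool haveBase = false;           // is there a usable base before this CM?
    LineClass base = LineClass::AL;  // class of that base
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == LineClass::CM) {
            if (!haveBase) {         // LB7c: this CM acts as AL ...
                haveBase = true;     // ... and is the base of any CMs after it
                base = LineClass::AL;
            }
            out[i] = base;
        } else {
            haveBase = !isExcludedBase(in[i]);
            base = in[i];
        }
    }
    return out;
}
```

For the input OP CM SP CM CM, the first CM resolves to OP, while both marks after the space resolve to AL.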
Opening and closing:
These have special behavior with respect to spaces, and therefore come before rule 12.
LB 8 Do not break before ‘]’ or ‘!’ or ‘;’ or ‘/’, even after spaces.
× CL
× EX
× IS
× SY
LB 9 Do not break after ‘[’, even after spaces.
OP SP* ×
LB 10 Do not break within ‘”[’, even with intervening spaces.
QU SP* × OP
LB 11 Do not break within ‘]h’, even with intervening spaces.
CL SP* × NS
LB 11a Do not break within ‘——’, even with intervening spaces.
B2 SP* × B2
Word Joiner:
LB 11b Do not break before or after WORD JOINER and related characters.
× WJ
WJ ×
Spaces:
LB 12 Break after spaces.
SP ÷
Non-breaking characters:
LB 13 Do not break before or after NBSP and related characters.
× GL
GL ×
Special case rules:
LB 14 Do not break before or after ‘”’.
× QU
QU ×
LB 14a Break before and after unresolved CB.
÷ CB
CB ÷
Conditional breaks should be resolved external to the line breaking rules. However, the default action is to treat unresolved CB as breaking before and after.
LB 15 Do not break before hyphen-minus, other hyphens, fixed-width spaces, small kana and other non-starters, or after acute accents.
× BA
× HY
× NS
BB ×
LB 16 Do not break between two ellipses, or between letters or numbers and ellipsis.
AL × IN
ID × IN
IN × IN
NU × IN
Examples: ‘9...’, ‘a...’, ‘H...’
Numbers:
Do not break alphanumerics.
LB 17 Do not break within ‘a9’, ‘3a’, or ‘H%’.
ID × PO
AL × NU
NU × AL
In general, lines should not be broken inside numbers of the form described by the following regular expression:
PR ? ( OP | HY ) ? NU (NU | SY | IS) * CL ? PO ?
Examples: $(12.35) 2,1234 (12)¢ 12.54¢
The default line breaking algorithm approximates this with the following rule, together with PR × AL and PR × ID, which handle numeric prefix punctuation. Note that some cases are already handled above, like ‘9,’, ‘[9’. For a tailoring that supports the regular expression directly, see Section 8.2, Examples of Customization.
LB 18 Do not break between the following pairs of classes.
CL × PO
HY × NU
IS × NU
NU × NU
NU × PO
PR × AL
PR × HY
PR × ID
PR × NU
PR × OP
SY × NU
Example pairs: ‘$9’, ‘$[’, ‘$-’, ‘-9’, ‘/9’, ‘99’, ‘,9’, ‘9%’, ‘]%’
Korean syllable blocks
Conjoining jamo, Hangul syllables, or combinations of both form Korean syllable blocks. Such blocks are effectively treated as if they were Hangul syllables; no breaks can occur in the middle of a syllable block. See Unicode Standard Annex #29: Text Boundaries [Boundaries] for more information on Korean syllable blocks.
LB 18b Do not break a Korean syllable.
JL × JL | JV | H2 | H3
JV | H2 × JV | JT
JT | H3 × JT
The effective line breaking class for the syllable block matches the line breaking class for Hangul syllables, which is ID by default. This is achieved by the following rule:
LB 18c Treat a Korean Syllable Block the same as ID.
JL | JV | JT | H2 | H3 × IN
JL | JV | JT | H2 | H3 × PO
PR × JL | JV | JT | H2 | H3
When Korean uses SPACE for line breaking, these classes and characters of class ID are often tailored to AL: see Section 8, Tailoring.
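Rule LB18b above can be sketched as a predicate on a pair of classes. The enumeration and the function name keepsSyllableTogether below are hypothetical names chosen for this illustration; only the five Korean classes and two others are modeled.

```cpp
// Hypothetical, reduced set of line breaking classes for this sketch.
enum class LineClass { JL, JV, JT, H2, H3, ID, AL };

// True if rule LB18b prohibits a break between 'before' and 'after'.
bool keepsSyllableTogether(LineClass before, LineClass after) {
    switch (before) {
    case LineClass::JL:  // JL × JL | JV | H2 | H3
        return after == LineClass::JL || after == LineClass::JV ||
               after == LineClass::H2 || after == LineClass::H3;
    case LineClass::JV:  // JV | H2 × JV | JT
    case LineClass::H2:
        return after == LineClass::JV || after == LineClass::JT;
    case LineClass::JT:  // JT | H3 × JT
    case LineClass::H3:
        return after == LineClass::JT;
    default:
        return false;    // not inside a Korean syllable block
    }
}
```

A pair for which the predicate returns false is not necessarily a break opportunity; it is simply left to the remaining rules, such as LB18c.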
Finally, join alphabetic letters and break everything else.
LB 19 Do not break between alphabetics (“at”).
AL × AL
LB 19b Do not break between numeric punctuation and alphabetics ("e.g.").
IS × AL
LB 20 Break everywhere else.
ALL ÷
÷ ALL
A two-dimensional table can be used to resolve break opportunities between pairs of characters. The rows of the table are labeled by the possible values of the line breaking property of the leading character in the pair. The columns are labeled by the line breaking class for the following character of the pair. Each intersection is labeled with the resulting line break opportunity.
The Japanese standard JIS X 4051-1995 [JIS] provides an example of such a table-based definition. However, it uses line breaking classes whose membership is not solely determined by the line breaking property (as in this Annex), but in some cases by heuristic analysis or markup of the text.
The implementation provided here directly uses the line breaking classes defined above.
If two rows of the table have identical values and the corresponding columns also have identical values, then the two line breaking classes can be coalesced. For example, the JIS standard uses 20 classes of which only 14 appear to be unique. A minimal table representation is unique, except for trivial reordering of rows and columns.
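The coalescing test described above can be sketched as follows. The sketch assumes a square table indexed as table[before][after], with each break action encoded as a single character; the type alias PairTable and both function names are hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Square table of break actions, indexed as table[before][after].
using PairTable = std::vector<std::vector<char>>;

// Classes a and b can be coalesced when their rows are identical and
// their columns are identical.
bool canCoalesce(const PairTable& t, std::size_t a, std::size_t b) {
    for (std::size_t k = 0; k < t.size(); ++k) {
        if (t[a][k] != t[b][k]) return false;  // rows differ
        if (t[k][a] != t[k][b]) return false;  // columns differ
    }
    return true;
}

// Lists every pair (a, b), a < b, of classes that could share one table entry.
std::vector<std::pair<std::size_t, std::size_t>>
coalescablePairs(const PairTable& t) {
    std::vector<std::pair<std::size_t, std::size_t>> result;
    for (std::size_t a = 0; a < t.size(); ++a)
        for (std::size_t b = a + 1; b < t.size(); ++b)
            if (canCoalesce(t, a, b)) result.emplace_back(a, b);
    return result;
}
```

Running such a check over a candidate table is one way to find redundant classes before choosing a minimal representation.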
Most of the rules in Section 6, Line Breaking Algorithm involve only pairs of characters, or they apply to a single line break class preceded or followed by any character. These rules can be represented directly in a pair table. However, rules LB9 through LB11a require extended context to handle spaces.
By broadening the definition of pair from B A, where B is the line breaking class before a break, and A the one after, to B SP* A, where SP* is an optional run of space characters, the same table can be used to distinguish between cases where SP can or cannot provide a line break opportunity (that is, direct and indirect breaks). Rules equivalent to the ones given in Section 6, Line Breaking Algorithm can be formulated without explicit use of SP by using % to express indirect breaks instead. These rules can then be simplified to involve only pairs of classes, that is, only constructions of the form:
B ÷ A
B % A
B ^ A
where either A or B may be empty. These simplified rules can be automatically translated into a pair table, as in Table 2 below. Line breaking analysis then proceeds by pair table lookup as explained below.
Rule LB7b requires extended context for handling combining marks. This extended context must also be built into the code that interprets the pair table. For convenience in detecting the condition where A = CM, the symbols # and @ are used, instead of % and ^, respectively. See Section 7.5, Combining Marks.
Table 2 implements the line breaking behavior described in this Annex, with the limitation that only context of the form B SP* A is considered. BK, CR, LF, NL and SP classes are handled explicitly in the outer loop as given in the code sample below. Pair context of the form B CM* can be handled by handling the special entries @ and # in the driving loop, as explained in Section 7.5, Combining Marks. Conjoining jamos are considered separately in Section 7.6, Conjoining Jamos. In Table 2, the rows are labeled with the B class and the columns are labeled with the A class.
B \ A | OP | CL | QU | GL | NS | EX | SY | IS | PR | PO | NU | AL | ID | IN | HY | BA | BB | B2 | ZW | CM | WJ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OP | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | @ | ^ |
CL | _ | ^ | % | % | ^ | ^ | ^ | ^ | _ | % | _ | _ | _ | _ | % | % | _ | _ | ^ | # | ^ |
QU | ^ | ^ | % | % | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | # | ^ |
GL | % | ^ | % | % | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | # | ^ |
NS | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | # | ^ |
EX | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | # | ^ |
SY | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | # | ^ |
IS | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | % | _ | _ | % | % | _ | _ | ^ | # | ^ |
PR | % | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | % | % | _ | % | % | _ | _ | ^ | # | ^ |
PO | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | # | ^ |
NU | _ | ^ | % | % | % | ^ | ^ | ^ | _ | % | % | % | _ | % | % | % | _ | _ | ^ | # | ^ |
AL | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | % | _ | % | % | % | _ | _ | ^ | # | ^ |
ID | _ | ^ | % | % | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | # | ^ |
IN | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | % | % | % | _ | _ | ^ | # | ^ |
HY | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | # | ^ |
BA | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | # | ^ |
BB | % | ^ | % | % | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | # | ^ |
B2 | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | ^ | ^ | # | ^ |
ZW | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | ^ | _ | _ |
CM | _ | ^ | % | % | % | ^ | ^ | ^ | _ | _ | % | % | _ | % | % | % | _ | _ | ^ | # | ^ |
WJ | % | ^ | % | % | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | # | ^ |
Resolved outside the pair table: XX SP BK SG CR LF CB SA AI NL
Table 2 uses the following notation:
^ denotes a prohibited break: B ^ A is equivalent to B SP* × A; in other words, never break before A and after B, even if one or more spaces intervene.
% denotes an indirect break opportunity. B % A is equivalent to B × A and B SP+ ÷ A; in other words, do not break before A, unless one or more spaces follow B.
@ denotes a prohibited break for combining marks: B @ A is equivalent to B SP* × A, where A is of class CM. For more details see Section 7.5, Combining Marks.
# denotes an indirect break opportunity for combining marks following a space. B # A is equivalent to (B × A and B SP+ ÷ A) where A is of class CM.
_ denotes a direct break opportunity (equivalent to ÷ as defined above).
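As an illustration of this notation, a row of Table 2 written in the form shown above can be converted into the break action values used by the sample code. The enumeration below mirrors the one given with the sample code; the helper names actionFromSymbol and parseRow are hypothetical.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Mirrors the break_action enumeration of the sample code.
enum break_action {
    DIRECT_BRK = 0,            // _ in table
    INDIRECT_BRK,              // % in table
    COMBINING_INDIRECT_BRK,    // # in table
    COMBINING_PROHIBITED_BRK,  // @ in table
    PROHIBITED_BRK,            // ^ in table
    EXPLICIT_BRK               // ! in rules
};

// Maps one table symbol to its break action.
break_action actionFromSymbol(char symbol) {
    switch (symbol) {
    case '_': return DIRECT_BRK;
    case '%': return INDIRECT_BRK;
    case '#': return COMBINING_INDIRECT_BRK;
    case '@': return COMBINING_PROHIBITED_BRK;
    case '^': return PROHIBITED_BRK;
    default:  throw std::invalid_argument("unknown pair table symbol");
    }
}

// Parses the cells of one row (without its class label), skipping the
// spaces and '|' separators used in the table layout.
std::vector<break_action> parseRow(const std::string& cells) {
    std::vector<break_action> row;
    for (char c : cells) {
        if (c == ' ' || c == '|') continue;
        row.push_back(actionFromSymbol(c));
    }
    return row;
}
```

For example, parseRow("^ | % | _") yields PROHIBITED_BRK, INDIRECT_BRK, DIRECT_BRK.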
Hovering over the cells in a browser with tool-tips enabled reveals the rule number that determines the breaking status for the pair in question. When a pair must be tested with and without intervening spaces, multiple rules are given. Hovering over a line breaking class name gives a representative member of the class and additional information. Clicking on any line break class name anywhere in the document jumps to the definition.
The following two functions provide sample code [Code] that demonstrates how the pair table is used. For a complete implementation of the line breaking algorithm, if statements to handle the line breaking classes CR, LF, and NL need to be added. They have been omitted here for brevity, but see Section 7.7, Explicit Breaks.
The sample code assumes that the line breaking classes AI, CB, SG, and XX have been resolved according to rule LB1 as part of initializing the pcls array. The code further assumes that the complex line break analysis for characters with line break class SA is handled in function findComplexBreak, for which the following placeholder is given:

    // placeholder function for complex break analysis
    // cls - resolved line break class, may differ from pcls[0]
    // pcls - pointer to array of line breaking classes (input)
    // pbrk - pointer to array of line breaking opportunities (output)
    // cch - remaining length of input
    int findComplexBreak(enum break_class cls, enum break_class *pcls,
                         enum break_action *pbrk, int cch)
    {
        if (!cch) return 0;

        int ich = 0;    // declared outside the loop because it is returned
        for (; ich < cch; ich++) {
            // .. do complex break analysis here
            // and report any break opportunities in pbrk ..

            if (pcls[ich] != SA)
                break;
        }
        return ich;
    }

The entries in the example pair table correspond to the following enumeration. For diagnostic purposes, the sample code returns these values to indicate not only the location but also the type of rule that triggered a given break opportunity.

    enum break_action {
        DIRECT_BRK = 0,             // _ in table
        INDIRECT_BRK,               // % in table
        COMBINING_INDIRECT_BRK,     // # in table
        COMBINING_PROHIBITED_BRK,   // @ in table
        PROHIBITED_BRK,             // ^ in table
        EXPLICIT_BRK                // ! in rules
    };

Because the contexts involved in indirect breaks of the form B SP* A are of indefinite length, they need to be handled explicitly in the driver code. The sample implementation of a findLineBrk function below remembers the line break class for the last character seen, but skips any occurrence of SP without resetting this value. Once character A is encountered, a simple lookback is used to see if it is preceded by an SP. This lookback is only necessary if B % A.

    // handle spaces separately, all others by table
    // pcls - pointer to array of line breaking classes (input)
    // pbrk - pointer to array of line break opportunities (output)
    // cch - number of elements in the arrays ("count of characters") (input)
    // ich - current index into the arrays (variable) (returned value)
    // cls - current resolved line break class for 'before' character (variable)
    // fTailorSPCM - selects a tailoring to keep SP CM together (see Section 8.3)
    int findLineBrk(enum break_class *pcls, enum break_action *pbrk, int cch,
                    bool fTailorSPCM)
    {
        if (!cch) return 0;

        enum break_class cls = pcls[0];    // class of 'before' character

        // loop over all pairs in the string up to a hard break
        int ich = 1;    // declared outside the loop because it is used after it
        for (; (ich < cch) && (cls != BK); ich++) {

            // handle explicit breaks here (see Section 7.7)

            // handle spaces explicitly
            if (pcls[ich] == SP) {
                pbrk[ich-1] = PROHIBITED_BRK;    // apply rule LB4: × SP
                continue;                        // do not update cls
            }

            // handle complex scripts in a separate function
            if (pcls[ich] == SA) {
                ich += findComplexBreak(cls, &pcls[ich-1], &pbrk[ich-1],
                                        cch - (ich-1));
                if (ich < cch)
                    cls = pcls[ich];
                continue;
            }

            // lookup pair table information in brkPairs[before, after];
            enum break_action brk = brkPairs[cls][pcls[ich]];

            pbrk[ich-1] = brk;    // save break action in output array

            if (brk == INDIRECT_BRK) {              // resolve indirect break
                if (pcls[ich - 1] == SP)            // if context is B SP+ A
                    pbrk[ich-1] = INDIRECT_BRK;     //     break opportunity
                else                                // else
                    pbrk[ich-1] = PROHIBITED_BRK;   //     no break opportunity
            }

            // handle breaks involving a combining mark (see Section 7.5)

            // save cls of 'before' character (unless bypassed by 'continue')
            cls = pcls[ich];
        }
        // always break at the end
        pbrk[ich-1] = EXPLICIT_BRK;

        return ich;
    }

The function returns all the break opportunities in the array pointed to by pbrk, using the values in the table. On return, pbrk[ich] is the type of break after the character at index ich.
A common optimization in implementation is to determine only the nearest line break opportunity prior to the position of the first character that would cause the line to become overfull. Such an optimization requires backwards traversal of the string instead of forwards as shown in the sample code.
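The sketch below does not implement that optimization, which would run the pair analysis itself backwards. It shows only the simpler lookup on an already filled output array: scanning backwards from the overflow position for the nearest break opportunity. The function names are hypothetical, and the choice of which resolved values count as an opportunity is an assumption based on how the sample code fills pbrk.

```cpp
#include <cstddef>
#include <vector>

// Mirrors the break_action enumeration of the sample code.
enum break_action {
    DIRECT_BRK = 0, INDIRECT_BRK, COMBINING_INDIRECT_BRK,
    COMBINING_PROHIBITED_BRK, PROHIBITED_BRK, EXPLICIT_BRK
};

// Assumption: in the resolved output these values mark a position where the
// line may end; the two PROHIBITED values do not.
static bool isOpportunity(break_action a) {
    return a == DIRECT_BRK || a == INDIRECT_BRK ||
           a == COMBINING_INDIRECT_BRK || a == EXPLICIT_BRK;
}

// brk[i] is the type of break after character i; overflow is the index of
// the first character that no longer fits on the line. Returns the largest
// i < overflow with a break opportunity after character i, or -1 if there
// is none (the caller would then fall back to emergency breaking).
int lastBreakBefore(const std::vector<break_action>& brk, std::size_t overflow) {
    std::size_t i = overflow < brk.size() ? overflow : brk.size();
    while (i > 0) {
        --i;
        if (isOpportunity(brk[i])) return static_cast<int>(i);
    }
    return -1;
}
```

A return value of -1 corresponds to the situation in which no legal line break can be found before the margin.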
The implementation of combining marks in the pair table presents an additional complication because rule LB7b defines a context X CM* that is of arbitrary length. There are some similarities to the way contexts of the form B SP* A that are involved in indirect breaks are evaluated. However, contexts of the form SP CM* or CM* SP also need to be handled, while rule LB7c requires some CM* to be treated like AL.
The latter can be reflected directly in the example pair table in Table 2 by assigning the same values in the row marked CM as in the row marked AL. This is equivalent to rewriting the rules LB8—LB20 by duplicating any expression that contains an AL with another expression that contains a CM. For example, in LB16
AL × IN
becomes
AL × IN
CM × IN.
This is fully equivalent to rule LB7c because rule LB7b already accounts for all CMs that are not supposed to be treated like AL.
Rule LB7b is implemented in the example pair table in Table 2 by assigning a special # entry in the column marked CM for all rows referring to a line break class that allows a direct or indirect break after itself. (Note that the intersection between the row for class ZW and the column for class CM must be assigned '_' because of rule LB5.) The # corresponds to a break_action value of COMBINING_INDIRECT_BRK, which triggers the following code in the sample implementation:

    else if (brk == COMBINING_INDIRECT_BRK) {    // resolve combining mark break
        pbrk[ich-1] = PROHIBITED_BRK;            // don't break before CM
        if (pcls[ich-1] == SP) {
            if (!fTailorSPCM)                    // untailored:
                pbrk[ich-1] = COMBINING_INDIRECT_BRK;    // apply rule SP ÷
            else {
                pbrk[ich-1] = PROHIBITED_BRK;    // optionally, keep SP CM together
                if (ich > 1)
                    pbrk[ich-2] = ((pcls[ich - 2] == SP) ?
                                   INDIRECT_BRK : DIRECT_BRK);
            }
        } else                                   // apply rule LB7b: X CM* -> X
            continue;                            // don't update cls
    }

The last remembered line break class in variable cls is not updated, except for those cases covered by rule LB7c. A tailoring of rule LB7b that keeps the last SPACE character preceding a combining mark, if any, and therefore breaks before that SPACE character can easily be implemented as shown in the sample code.
Rows for line break classes that prohibit breaks after themselves must be assigned a special entry '@', which corresponds to a break action of COMBINING_PROHIBITED_BRK and triggers the following code:

    else if (brk == COMBINING_PROHIBITED_BRK) {    // this is the case OP SP* CM
        pbrk[ich-1] = COMBINING_PROHIBITED_BRK;    // no break allowed
        if (pcls[ich-1] != SP)
            continue;                              // apply rule LB7b: X CM* -> X
    }

The only line break class that unconditionally prevents breaks across a following SP is OP. This code ensures that OP CM is handled according to rule LB7b and OP SP CM is handled as OP SP AL according to rule LB7c.
For Korean syllable blocks, a simple pair table can be constructed based on the information in rule LB18b, and shown in Table 3 below.
B \ A | H2 | H3 | JL | JV | JT |
---|---|---|---|---|---|
H2 | _ | _ | _ | % | % |
H3 | _ | _ | _ | _ | % |
JL | % | % | % | % | _ |
JV | _ | _ | _ | % | % |
JT | _ | _ | _ | _ | % |
The pair table for Korean syllable blocks in Table 3 can be merged with the example pair table in Table 2 by adding the cells from Table 3 beyond the lower right corner of Table 2. Next, according to rule LB18c, any empty cells in the new rows are filled with the same values as the existing row for class ID, and any empty cells for the new columns are filled with the same values as the existing column for class ID. Such a merged table can be handled with the same sample code as above.
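The merge described above can be sketched as follows, assuming both tables are square, indexed as table[before][after], and hold one character per break action. The type alias Table, the function name mergeTables, and the parameter idIndex (the position of class ID in the base table) are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Square table of break actions, indexed as table[before][after].
using Table = std::vector<std::vector<char>>;

// Places 'jamo' beyond the lower right corner of 'base' and fills the new
// cells from the ID row and the ID column of 'base', following rule LB18c.
Table mergeTables(const Table& base, const Table& jamo, std::size_t idIndex) {
    const std::size_t n = base.size();
    const std::size_t m = jamo.size();
    Table merged(n + m, std::vector<char>(n + m, '?'));
    for (std::size_t b = 0; b < n + m; ++b) {
        for (std::size_t a = 0; a < n + m; ++a) {
            if (b < n && a < n)
                merged[b][a] = base[b][a];          // existing cell
            else if (b >= n && a >= n)
                merged[b][a] = jamo[b - n][a - n];  // cell from the jamo table
            else if (b >= n)
                merged[b][a] = base[idIndex][a];    // new row: copy the ID row
            else
                merged[b][a] = base[b][idIndex];    // new column: copy the ID column
        }
    }
    return merged;
}
```

The merged table keeps the original indices for the existing classes, so only the enumeration of line breaking classes needs to be extended with the five Korean classes.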
Handling explicit breaks is straightforward in the driver code, although it does clutter up the loop condition and body of the loop a bit. For completeness, the following sample shows how to change the loop condition and add if statements to the loop that handle BK, CR, and LF. Because NL and BK behave identically by default, this code assumes that BK has been substituted for NL.

    // handle case where input starts with an LF
    if (cls == LF)
        cls = BK;

    // loop over all pairs in the string up to a hard break
    int ich = 1;    // declared outside the loop because it is used after it
    for (; (ich < cch) && (cls != BK) && (cls != CR || pcls[ich] == LF); ich++) {

        // handle BK and LF explicitly
        if (pcls[ich] == BK || pcls[ich] == LF) {
            pbrk[ich-1] = PROHIBITED_BRK;
            cls = BK;
            continue;
        }

        // handle CR explicitly
        if (pcls[ich] == CR) {
            pbrk[ich-1] = PROHIBITED_BRK;
            cls = CR;
            continue;
        }

        // handle spaces explicitly...

A real world line breaking algorithm must be tailorable to some degree to meet user or document requirements.
In Korean, for example, two distinct line breaking modes occur, which can be summarized as breaking after each character, or breaking after spaces (as in Latin text). The former tends to occur when text is set justified, the latter, when ragged margins are used. In that case, even ideographs are only broken at space characters.
In Japanese, for example, tighter and looser specifications of prohibited line breaks may be used.
Specialized text or specialized text constructs may need specific line breaking behavior that differs from the default line breaking rules given in this annex. This may require additional tailorings beyond those considered in this section. For example, the rules given here are insufficient for mathematical equations, whether inline or in display format. Likewise, text which commonly contains lengthy URLs might benefit from special tailoring that suppresses SY × NU from rule LB18 within the scope of a URL to allow breaks after a '/' separated segment in the URL regardless of whether the next segment starts with a digit or not.
The remainder of this section gives an overview of common types of tailorings and examples of how to customize the pair table implementation of the line breaking algorithm for these tailorings.
There are three principal ways of tailoring the sample implementation of the line breaking algorithm:
Beyond these three straightforward customization steps, it is always possible to augment the algorithm itself, for example by providing specialized rules to recognize and break common constructs, such as URLs, numeric expressions, etc. Such open ended customizations place no limits on possible changes, other than the requirement that characters with normative line breaking properties be correctly implemented.
Note: Reference [Cedar97] reports on a real world implementation of a pair table-based implementation of a line breaking algorithm substantially similar to the one presented here, and including the types of customizations presented in this section. That implementation simultaneously met the requirements of customers in many European and East Asian countries with a single implementation of the algorithm.
Example 1. The exact method of resolving the line break class for characters with class SA is not specified in the default algorithm. One method of implementing line breaks for complex scripts is to invoke context-based classification for all runs of characters with class SA. For example, a dictionary-based algorithm could return different classes for Thai letters depending on their context: letters at the start of Thai words would become BB and other Thai letters would become AL. Alternatively, for text consisting of or predominantly containing characters with line breaking class SA, it may be useful instead to defer the determination of line breaks to a different algorithm entirely. Section 7.4, Sample Code sketches such an approach, in which the interface to the dictionary-based algorithm directly reports break opportunities.
Example 2. To implement terminal style line breaks, it would be necessary to allow breaks inside a run of spaces. This requires a change in the way the driver loop handles spaces and therefore cannot be simply done by customizing the pair-table. However, the additional task of line wrapping runs of spaces could also be performed after the fact at the layout system level while leaving unchanged the actual line breaking algorithm.
Example 3. Depending on the nature of the document, Korean uses either implicit breaking around characters (type 2 as defined above in Section 3, Introduction) or uses spaces (type 1). Space-based layout is common in magazines and other informal documents with ragged margins, while books, with both margins justified, use the other type, as it affords more line break opportunities and therefore leads to better justification. Reference [Suign98] shows how the necessary customizations can be elegantly handled by selectively altering the interpretation of the pair entries. Only the intersections of ID/ID, AL/ID and ID/AL are affected. For alphabetic style line breaking, breaks for these cases require spaces; for ideographic style line breaking, these cases do not require spaces. Therefore, one defines a pseudo-action, which is then resolved into either a direct or an indirect break action based on user selection of the preferred behavior for a given text.
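The pseudo-action of Example 3 can be sketched as an extra table entry value that is resolved at lookup time. The names TableEntry, StyleDependent, BreakStyle and resolveEntry are hypothetical and are taken neither from [Suign98] nor from the sample code; this is one possible shape, assuming the tailored cells hold the pseudo-action.

```cpp
// Values that may be stored in a tailored pair table (hypothetical names).
enum class TableEntry { Direct, Indirect, Prohibited, StyleDependent };

// Break actions after the pseudo-action has been resolved.
enum class Action { Direct, Indirect, Prohibited };

// User-selected line breaking style for a given text.
enum class BreakStyle { Ideographic, Alphabetic };

// Resolves a table entry into a concrete action. Only StyleDependent
// entries depend on the selected style.
Action resolveEntry(TableEntry entry, BreakStyle style) {
    switch (entry) {
    case TableEntry::Direct:     return Action::Direct;
    case TableEntry::Indirect:   return Action::Indirect;
    case TableEntry::Prohibited: return Action::Prohibited;
    case TableEntry::StyleDependent:
        // alphabetic style: a break requires spaces (indirect break);
        // ideographic style: a break needs no spaces (direct break)
        return style == BreakStyle::Alphabetic ? Action::Indirect
                                               : Action::Direct;
    }
    return Action::Prohibited;  // not reached for valid enum values
}
```

Because the style is consulted only when the pair table is read, the same table can serve both kinds of text without being rebuilt.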
Example 4. Sometimes in Far Eastern context it is required to allow alphabetic characters and digit strings to break anywhere. According to reference [Suign98] this can again be done in the same way as Korean. In this case the intersections of NU/NU, NU/AL, AL/AL, and AL/NU are affected.
Example 5. Some users prefer to relax the requirement that Kana syllables be kept together. For example, the syllable kyu, spelled with the two kana KI and “small yu”, would no longer be kept together as if KI and yu were atomic. This customization can be handled via the first method by changing the classification of the Kana small characters from NS to ID as needed.
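A reclassification of this kind can be sketched as a table of overrides that is consulted before the default class assignment. Everything named below is hypothetical: defaultClass is a stand-in that knows only the two characters of the kyu example, whereas a real implementation would take its defaults from the property data file [Data].

```cpp
#include <unordered_map>

// Hypothetical, reduced set of line breaking classes for this sketch.
enum class LineClass { AL, ID, NS };

// Stand-in for the default property lookup; covers only the example.
LineClass defaultClass(char32_t cp) {
    switch (cp) {
    case 0x304D: return LineClass::ID;  // HIRAGANA LETTER KI
    case 0x3085: return LineClass::NS;  // HIRAGANA LETTER SMALL YU
    default:     return LineClass::AL;
    }
}

// Per-document tailoring: code point -> replacement class.
using Overrides = std::unordered_map<char32_t, LineClass>;

// Returns the tailored class if one is given, else the default class.
LineClass tailoredClass(char32_t cp, const Overrides& overrides) {
    auto it = overrides.find(cp);
    return it != overrides.end() ? it->second : defaultClass(cp);
}
```

With an override that maps U+3085 to ID, the pair KI followed by small yu is looked up as ID/ID rather than ID/NS, so the pair table itself needs no change.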
Example 6. Some implementations may wish to tailor the line breaking algorithm to resolve grapheme clusters according to UAX #29: Text Boundaries [Boundaries] as a first stage. Generally, the line break algorithm does not create line break opportunities within default grapheme clusters; therefore such a tailoring would be expected to produce results that for most practical cases are close to those defined by the default algorithm. However, if such a tailoring is chosen, characters that are members of line break class CM but not part of the definition of default grapheme clusters must still be handled by rules LB7b and LB7c, or by some additional tailoring.
Example 7. Regular expression-based line breaking engines might get better results using a tailoring that directly implements the following regular expression for numeric expressions
PR ? ( OP | HY ) ? NU (NU | SY | IS) * CL ? PO ?
together with PR × AL and PR × ID from rule LB18. In that case, LB8 must be tailored as follows
[^NU] × CL
× EX
[^NU] × IS
[^NU] × SY
otherwise single digits may be handled by rule LB8 before being handled in the regular expression.
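For illustration, the regular expression of Example 7 can be matched against a sequence of line breaking classes with a short hand-written scanner. The reduced enumeration and the function name matchNumber are hypothetical. Because each optional or repeated element of the expression is distinct from the element that follows it, a greedy left-to-right scan is sufficient and finds the longest match.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, reduced set of line breaking classes for this sketch.
enum class LineClass { PR, PO, OP, CL, HY, NU, SY, IS, AL, SP };

// Matches PR ? ( OP | HY ) ? NU ( NU | SY | IS ) * CL ? PO ? starting at
// index 'start'. Returns the number of classes matched, or 0 if there is
// no match at that position.
std::size_t matchNumber(const std::vector<LineClass>& cls, std::size_t start) {
    const std::size_t n = cls.size();
    std::size_t i = start;
    if (i < n && cls[i] == LineClass::PR) ++i;                  // PR ?
    if (i < n && (cls[i] == LineClass::OP ||
                  cls[i] == LineClass::HY)) ++i;                // ( OP | HY ) ?
    if (i >= n || cls[i] != LineClass::NU) return 0;            // NU is required
    ++i;
    while (i < n && (cls[i] == LineClass::NU ||
                     cls[i] == LineClass::SY ||
                     cls[i] == LineClass::IS)) ++i;             // ( NU | SY | IS ) *
    if (i < n && cls[i] == LineClass::CL) ++i;                  // CL ?
    if (i < n && cls[i] == LineClass::PO) ++i;                  // PO ?
    return i - start;
}
```

Applied to the classes of $(12.35), that is PR OP NU NU IS NU NU CL, the scanner matches all eight positions; a tailored engine would then prohibit breaks inside the matched run.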
Example 8. Some implementations may wish to tailor the algorithm to omit rule LB7b due to the added complexity of its indefinite length context. Because combining marks are most commonly applied to characters of class AL, rule LB7c alone generally produces acceptable results for such implementations.
As stated in [Unicode], Section 7.7, Combining Marks, combining characters are shown in isolation by applying them to U+00A0 NO-BREAK SPACE (NBSP). In earlier versions, this recommendation included the use of U+0020 SPACE. This use of SPACE for this purpose is now deprecated because it has been found to lead to many complications in text processing. For either NBSP or SPACE the visual appearance is the same, but the line breaking behavior is different. Under the current rules, SP CM* will allow a break between SP and CM*, which could result in a new line starting with a combining mark. Previously, whenever the base character was SP, the sequences CM* or SP CM* were defined to act like an indivisible cluster allowing breaks on either side like ID.
Where backwards compatibility with documents created under the prior practice is desired, the following tailoring should be applied in place of the deprecated rule LB7a.
In all of the following rules, if a space is the base character for a combining mark, the space is changed to type ID. In other words, break before SP in the same cases as one would break before an ID.
Treat SP CM* as if it were ID.
The application of this rule should be limited to those CM characters with General Category M.
[Bidi] | Unicode Standard Annex #9, Unicode Bidirectional Algorithm. http://www.unicode.org/reports/tr9/ |
[Boundaries] | Unicode Standard Annex #29, Text Boundaries. http://www.unicode.org/reports/tr29/ For information on grapheme cluster boundaries |
[Cedar97] | Cy Cedar, David Veintimilla, Michel Suignard and Asmus Freytag, Report from the Trenches: Microsoft Publisher goes Unicode, Proceedings of the Eleventh International Unicode Conference, San Jose, CA 1997 |
[Code] | Sample code implementing the pair table http://www.unicode.org/Public/PROGRAMS/LineBreakSampleCpp/ Contains the code samples shown in this document together with driver code |
[Data] | Line Break property data file For the latest version, see: http://www.unicode.org/Public/UNIDATA/LineBreak.txt For the current version, see: http://www.unicode.org/Public/4.1.0/ucd/LineBreak.txt For other versions, see: http://www.unicode.org/versions/ |
[EAW] | Unicode Standard Annex #11, East Asian Width. http://www.unicode.org/reports/tr11/ For a definition of East Asian Width |
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues. |
[Feedback] | http://www.unicode.org/reporting.html For reporting errors and requesting information online. |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[HangulST] | The latest version of the Hangul Syllable Types property data file is http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt |
[JIS] | JIS X 4051-1995. Line Composition Rules for Japanese Documents. (『日本語文書の行組版方法』) Japanese Standards Association. 1995. |
[Knuth78] | Donald E. Knuth and Michael F. Plass, Breaking Lines into Paragraphs, republished in Digital Typography, CSLI 78, (Stanford, California: CSLI Publications 1997) |
[Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[Suign98] | Michel Suignard, Worldwide Typography and How to Apply JIS X 4051-1995 to Unicode, Proceedings of the Twelfth International Unicode/ISO 10646 Conference, Tokyo, Japan, 1998 |
[TEX] | Donald E. Knuth, TEX, the Program, Volume B of Computers & Typesetting, (Reading, Massachusetts: Addison-Wesley 1986) |
[Unicode] | The Unicode Standard, Version 4.0, (Reading, Massachusetts: Addison-Wesley Developers Press 2003, ISBN 0-321-18578-1) or online as http://www.unicode.org/versions/Unicode4.0.0/ |
[UCD] | Unicode Character Database http://www.unicode.org/ucd/ For an overview of the Unicode Character Database and a list of its associated files see http://www.unicode.org/Public/UNIDATA/UCD.html |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them. |
The initial assignments of properties are based on input by Michel Suignard. Mark Davis provided algorithmic verification and formulation of the rules. Ken Whistler, Rick McGowan and other members of the editorial committee provided valuable feedback. Tim Partridge enlarged the information on dictionary usage. Sun Gi Hong reviewed the information on Korean and provided copious printed samples. Eric Muller reanalyzed the behavior of the soft hyphen and collected the samples. Christopher Fynn provided the background information on Tibetan line break. Andrew West, Kamal Mansour, Andrew Glass, Daniel Yacob, and Peter Kirk suggested improvements for Mongolian, Arabic, Kharoshthi, Ethiopic, and Hebrew punctuation characters respectively. Many others provided additional review of the rules and property assignments.
This section indicates the changes introduced by each revision.
Revision 17:
Revision 15:
Revision 14:
[Revision 13, being a proposed update, is superseded and no longer publicly available. Only modifications between revisions 12 and 14 are tracked here.]
Revision 12:
Revision 10:
Revision 9:
Revision 8:
Revision 7:
Revision 6:
[No change history is available for earlier revisions.]
Copyright © 1998-2005 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.