Version | 4.0 (proposed) |
Authors | Asmus Freytag (asmus@unicode.org) |
Date | 2002-11-14 |
This Version | http://www.unicode.org/unicode/reports/tr14/tr14-13 |
Previous Version | http://www.unicode.org/unicode/reports/tr14/tr14-12 |
Latest Version | http://www.unicode.org/unicode/reports/tr14 |
Tracking Number | 13 |
This report presents the specification of line breaking properties for Unicode characters.
This document is a proposed update of a previously approved Unicode Standard Annex. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress. The links in this document to the data files do not work. Preliminary datafiles for the proposed update are available at http://www.unicode.org/Public/BETA.
[Notes to reviewers are indicated like this.]
[TBD: After Approval of the Update, this boilerplate will be revised]
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. Note that conformance to a version of the Unicode Standard includes conformance to its Unicode Standard Annexes. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).
The Unicode Standard [U4.0] presents limited description of some of the characters with specific function in line-breaking, but does not give a complete specification. This Unicode Standard Annex provides the needed information in a way that reflects best practices. The Unicode Standard assigns normative line-breaking properties to those characters that are intended to explicitly influence line-breaking and for which the line-breaking behavior is therefore expected to be identical across all implementations.
For all other characters informative line-breaking properties are provided. For these characters, considerable variation in line-breaking behavior can be expected, including variation based on local or stylistic preferences.
Following the formal definitions and summary of the line-breaking task and a brief section on conformance requirements, there are four main sections:
All terms not defined here shall be as defined in the Unicode Standard [U4.0]. The notation defined in this technical report differs somewhat from the notation defined elsewhere in the Unicode Standard. All other notation used here without an explicit definition shall be as defined in the Unicode Standard.
Line fitting - the process of determining how much text will fit on a line of text, given the available space between the margins and the actual display width of the text.
Line Break - the position in the text where one line ends and the next one starts.
Line Break Opportunity - a place where a line is allowed to end. Whether a given position in the text is a valid line break opportunity depends on the line breaking rules in force, as well as on context.
Line Breaking - the process of selecting that part of a text that can be displayed on a line. In other words, selecting one among several line breaking opportunities such that the resulting line is optimal or ends at a user-requested explicit line break.
Line Breaking Property - A character property with enumerated values, as set out in Table 1 and separated into normative and informative. Line breaking property values are used to classify characters, and taken in context, determine the type of break.
Line Breaking Class - a class of characters with the same line breaking property value.
Mandatory Break - a line must break following a character that has the mandatory break property. Such a break is also known as a forced break and is indicated in the rules as B !, where B is the character with the mandatory break property.
Direct Break - a line breaking opportunity exists between two adjacent characters of the given line breaking classes. This is indicated in the rules below as B ÷ A, where B is the character class of the character before and A is the character class of the character after the break. If they are separated by one or more space characters, a break opportunity also exists after the last space. In the pair table, the optional space characters are not shown.
Indirect Break - a line breaking opportunity exists between two characters of the given line breaking classes only if they are separated by one or more spaces. In this case, a break opportunity exists after the last space. No break opportunity exists if the characters are immediately adjacent. This is indicated in the pair table below as B % A, where B is the character class of the character before and A is the character class of the character after the break. Even though space characters are not shown in the pair table, an indirect break can only occur if one or more spaces follow B. In the notation of the rules in Section 6, Line Breaking Algorithm this would be represented as two rules: B × A and B SP+ ÷ A.
Prohibited Break - no line breaking opportunity exists between two characters of the given line breaking classes, even if they are separated by one or more space characters. This is indicated in the pair table below as B ^ A, where B is the character class of the character before and A is the character class of the character after the break and the optional space characters are not shown. In the notation of the rules in Section 6, Line Breaking Algorithm this would be expressed as a rule of the form: B SP* × A.
Hyphenation - Hyphenation uses language specific rules to provide additional line breaking opportunities within a word. Hyphenation improves the layout of narrow columns, especially for languages with many longer words, such as German or Finnish. For the purpose of this document, it is assumed that hyphenation is equivalent to insertion of soft hyphen characters. All other aspects of hyphenation are outside the scope of this document.
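The distinction between direct, indirect, and prohibited breaks defined above can be illustrated with a minimal sketch. The names Action and break_allowed are hypothetical and are not part of this annex; the sketch only restates the three definitions and says nothing about how the action for a given pair of classes is looked up.

```python
from enum import Enum


class Action(Enum):
    DIRECT = "÷"      # B ÷ A: break allowed, with or without intervening spaces
    INDIRECT = "%"    # B % A: break allowed only if one or more spaces intervene
    PROHIBITED = "^"  # B ^ A: no break, even if spaces intervene


def break_allowed(action: Action, spaces_between: int) -> bool:
    """Return True if a break opportunity exists before the character A.

    When spaces intervene, the opportunity is located after the last space.
    """
    if action is Action.DIRECT:
        return True
    if action is Action.INDIRECT:
        return spaces_between > 0
    return False
```

For example, break_allowed(Action.INDIRECT, 0) is False, while break_allowed(Action.INDIRECT, 2) is True.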
Table 1 Line Breaking Classes (* = normative)
Class | Descriptive Name | Examples | Characters with this property... |
Normative Line Breaking Classes |
BK * | Mandatory Break | NL, PS | cause a line break (after) |
CR * | Carriage Return | CR | cause a line break (after), except between CR and LF |
LF * | Line Feed | LF | cause a line break (after) |
CM * | Attached Characters and Combining Marks | Combining Marks | prohibit a line break between the character and the preceding character |
SG * | Surrogates | Surrogates | should not occur in well-formed text |
ZW * | Zero Width Space | ZWSP | provide a break opportunity |
GL * | Non-breaking (“Glue”) | NBSP, ZWNBSP, WJ, CGJ | prohibit line breaks before or after |
CB * | Contingent Break Opportunity | Inline Objects | provide a line break opportunity contingent on additional information |
SP * | Space | Space | generally provide a line break opportunity after the character, enable indirect breaks |
Break Opportunities |
B2 | Break Opportunity Before and After | EM Dash | provide a line break opportunity before and after the character |
BA | Break Opportunity After | Spaces, Hyphens | generally provide a line break opportunity after the character |
BB | Break Opportunity Before | Punctuation used in dictionaries | generally provide a line break opportunity before the character |
HY | Hyphen | Hyphen-Minus | provide a line break opportunity after the character, except in numeric context |
Characters Prohibiting Certain Breaks |
CL | Closing Punctuation | “)”, “]”, “}”, etc. | prohibit a line break before |
EX | Exclamation/Interrogation | “!”, “?” etc. | prohibit a line break before |
IN | Inseparable | Leaders | allow only indirect line breaks between pairs |
NS | Non Starter | small kana | allow only indirect line break before |
OP | Opening Punctuation | “(”, “[”, “{”, etc. | prohibit a line break after |
QU | Ambiguous Quotation | Quotation marks | act like they are both opening and closing |
Numeric Context |
IS | Infix Separator (Numeric) | . , | prevent breaks after any and before numeric |
NU | Numeric | Digits | form numeric expressions for line breaking purposes |
PO | Postfix (Numeric) | %, ¢ | do not break following a numeric expression |
PR | Prefix (Numeric) | $, £, ¥, etc. | do not break in front of a numeric expression |
SY | Symbols Allowing Breaks | / | prevent a break before, and allow a break after |
Other Characters |
AI | Ambiguous (Alphabetic or Ideographic) | Characters with Ambiguous East Asian Width | |
AL | Ordinary Alphabetic and Symbol Characters | Alphabets and regular symbols | are alphabetic characters or symbols that are used with alphabetic characters |
ID | Ideographic | Ideographs, Hangul | break before or after, except in some numeric context |
JL | Leading Jamo | Leading conjoining Jamo Consonants | start Hangul syllables |
JV | Vowel Jamo | Conjoining Jamo Vowel | continue Hangul syllables |
JT | Trailing Jamo | Trailing conjoining Jamo Consonants | terminate Hangul syllables |
SA | Complex Context (South East Asian) | South East Asian: Thai, Lao, Khmer | provide a line break opportunity contingent on additional, language specific context analysis |
XX | Unknown | Unassigned | are all characters with (as yet) unknown line breaking behavior or unassigned code positions |
Lines are broken as result of either of two conditions. The first condition is the presence of an explicit line breaking character. The second condition results from a formatting algorithm having selected among available line breaking opportunities the particular one that results in the optimal layout of the text.
The definition of optimal line break is outside the scope of this document. Different formatting algorithms may use different methods of determining an optimal break. For example, simple implementations just consider a line at a time, trying to find a locally optimal line break. A common approach is to allow no compression or expansion of the inter-character and inter-word spaces and consider the longest line that fits. When compression or expansion is allowed, a locally optimal line break seeks to balance the relative merits of the resulting amounts of compression and expansion for different line break candidates.
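The common approach mentioned above, taking the longest line that fits without compressing or expanding spaces, might be sketched as follows. This is only an illustration under stated assumptions: the function name greedy_lines is hypothetical, the break opportunities are assumed to be supplied as character offsets, and len() stands in for real display-width measurement.

```python
def greedy_lines(text, opportunities, max_width, width=len):
    """Split text at the last break opportunity that still fits on each line.

    opportunities: offsets at which a line may end (hypothetical input; how
    they are determined is the subject of the rest of this annex).
    width: measures a string; len() is a stand-in for real glyph metrics.
    """
    lines, start = [], 0
    ends = sorted(set(opportunities) | {len(text)})
    while start < len(text):
        best = None
        for end in ends:
            if end <= start:
                continue
            # Trailing spaces are not counted against the line width here.
            if width(text[start:end].rstrip(" ")) <= max_width:
                best = end
            else:
                break
        if best is None:
            # Nothing fits: take the first opportunity and let the line overflow.
            best = next(e for e in ends if e > start)
        lines.append(text[start:best])
        start = best
    return lines
```

With the text "the quick brown fox", opportunities after each space (offsets 4, 10, 16) and a width limit of 10, the sketch yields the two lines "the quick " and "brown fox".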
More complex algorithms may take into account the interaction of line breaking decisions for the whole paragraph. The well-known text layout system [TEX] implements an example of such a globally optimal strategy that may make complex tradeoffs to avoid unnecessary hyphenation and other legal, but inferior breaks. For a description of this strategy, see [Knuth78].
When expanding or compressing inter-word space, only spaces marked by U+0020 SPACE and U+3000 IDEOGRAPHIC SPACE are normally subject to compression, and only spaces marked by U+0020 SPACE, and occasionally spaces marked by U+2009 THIN SPACE, are subject to expansion. All other space characters have fixed width.
Whether to allow expansion of inter-character space to justify a line, and how much, depends on local custom. In some languages, for example, German, inter-character space is commonly used to mark e m p h a s i s (like this). In such languages, allowing variable inter-character spacing would have the unintended effect of adding random emphasis, and should therefore be avoided.
In table headings that use Han ideographs, on the other hand, even extreme amounts of inter-character space commonly occur as short texts are spread out across the entire available space to distribute the characters evenly from end to end.
For the purpose of this document, what is important is not so much what defines the optimal amount of text on the line, but how line breaking opportunities are determined.
Three principal styles of context analysis determine line-breaking opportunities.
The first is commonly used for scripts employing the space character. Hyphenation is often used with space-based line breaking to provide additional line break opportunities - however, it requires knowledge of the language and potentially user interaction or overrides.
The second style of context is used with East Asian ideographic and syllabic scripts. The precise set of prohibited line breaks may depend on user preference or local custom.
Korean makes use of both styles of line break. When Korean text is laid out justified, the second style is commonly used, even for interspersed Latin letters. But when ragged margins are used, the first style (relying on spaces) is commonly used instead, even for ideographs.
The third style is used for scripts such as Thai, which do not use spaces, but which restrict word-breaks to syllable boundaries, the determination of which requires knowledge of the language comparable to that required by a hyphenation algorithm. Such an algorithm is beyond the scope of the Unicode Standard.
For multilingual text, styles one and two can be unified into a single set of specifications, based on the information provided in this report. Some Unicode characters have explicit line breaking properties assigned to them. These can be utilized with these two styles of context analysis for line break opportunities. Customization for user preferences or document style can then be achieved by tailoring that specification.
Determining the line breaks in bidirectional text takes place before applying rule L1 of the Unicode Bidirectional Algorithm [UAX 9]. However, it is strictly independent of directional properties of the characters or of any auxiliary information determined by the application of rules of that algorithm.
There is no single method for determining line breaks; in fact, the rules may change based on user preference and document layout. Therefore the information in this annex, including the specification of the line breaking algorithm, is informative, rather than normative. However, there are some characters which have been encoded explicitly for the purpose of their effect on line breaking. Users adding such characters to a text must be able to expect that they will have the desired effect. For that reason, these characters have been given normative line breaking behavior.
As stated in [U4.0] Section 3.2, Conformance Requirements, conformant implementations are not required to implement the Unicode Line Breaking Algorithm. However, if they purport to implement it, they must do so in accordance with the specifications in this Annex. If the algorithm has been customized or tailored, that fact must be noted as set out in [U4.0] Section 3.2 Versions of the Unicode Standard.
[NOTE TO REVIEWERS: This section has been clarified and contains changes that require UTC approval]
The main emphasis in this section is to provide additional description of the line breaking behavior and to summarize the membership of character classes for each value of the line breaking property.
The classification by properties defined here is used as input into two algorithms defined below that implement workable default line breaking methods. In a few instances, the descriptions in this section provide additional detail about handling a given character at the end of a line, which goes beyond the simple determination of line breaks.
The full classification of all Unicode characters by their line breaking properties, as of the time of publication of this document, is available in the current version of the file LineBreak.txt [Data] in the Unicode Character Database [UCD]. This is a tab-delimited, two column plain text file, with code position, line breaking class. A comment at the end of each line indicates the character name. Ideographic, Hangul, Surrogate, and Private Use ranges are collapsed by giving a range in the first column.
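A reader for a data file of this shape might look like the following sketch. The function name parse_linebreak_data is hypothetical. The text above describes a tab-delimited layout, whereas released versions of the file have used a semicolon between the two fields, so the sketch accepts either separator; it also assumes that the trailing comment is introduced by "#" and that every data line is well formed.

```python
import re


def parse_linebreak_data(lines):
    """Yield (first, last, line_break_class) tuples from LineBreak.txt-style lines."""
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        # Accept either a tab or a semicolon between the two columns.
        code, cls = re.split(r"\s*[;\t]\s*", data, maxsplit=1)
        # A range is written as FIRST..LAST in the first column.
        first, _, last = code.partition("..")
        yield int(first, 16), int(last or first, 16), cls.strip()
```

A single code position is returned as a range whose first and last values are equal, so callers can treat every entry uniformly.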
As more scripts are added to the Unicode Standard, and more scripts become more widely implemented and used on computers, more line breaking classes may be added, or the assignment of line breaking class may be changed for some characters. Implementations should not make any assumptions to the contrary. Any future updates will be reflected in the latest version of the data file. (See the Unicode Character Database [UCD] for any specific version of the datafile).
Line breaking classes are listed alphabetically. Each line breaking class is marked with an annotation in parentheses for easy reference showing that...
(A) - the class allows a break opportunity after in specified contexts
(XA) - the class prevents a break opportunity after in specified contexts
(B) - the class allows a break opportunity before in specified contexts
(XB) - the class prevents a break opportunity before in specified contexts
(P) - the class allows a break opportunity for a pair of same characters
(XP) - the class prevents a break opportunity for a pair of same characters
NOTE: The use of the letters B and A in these annotations marks the position of the break opportunity relative to the character. It is not to be confused with the use of the same letters in the other parts of this document, where they indicate position of the characters relative to the break opportunity.
Characters with East Asian Width property A (ambiguous width), and which would otherwise be AL in this classification. They take on the AL line breaking class only when their resolved width is N (narrow) and take the ID line breaking class when their resolved width is W (wide). For more information on East Asian Width, and how to resolve it, see Unicode Standard Annex #11, East Asian Width [EAW]. In the absence of information needed to resolve their East Asian Width, they are treated as class AL.
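The resolution of the ambiguous class described above can be sketched as a small function. The name resolve_ambiguous and the convention of passing None for an unknown width are assumptions made for illustration; how the East Asian Width itself is resolved is described in [EAW] and is not shown here.

```python
def resolve_ambiguous(line_break_class, resolved_width=None):
    """Map class AI to AL or ID based on the resolved East Asian Width.

    resolved_width: "N" (narrow), "W" (wide), or None when no information
    is available to resolve the width.
    """
    if line_break_class != "AI":
        return line_break_class
    if resolved_width == "W":
        return "ID"
    # Narrow, or no information available: treat as AL.
    return "AL"
```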
Characters of this class require other characters to provide break opportunities; otherwise, no break is allowed between pairs of ordinary characters. However, this is tailorable. In some Far Eastern documents it may be desirable to allow breaking between pairs of ordinary characters.
NOTE: use ZWSP as a manual override to provide break opportunities around alphabetic or symbol characters.
ALPHABETIC - all characters of General Categories Lu, Ll, Lt, Lm, Lo, except as they appear below.
SYMBOLS - all characters of General Categories Sm, Sk, So, except as they appear below.
Like SPACE, the characters in this class provide a break opportunity, but unlike SPACE they do not take part in determining indirect breaks. They can be subdivided into several categories.
The following subset of characters with General Category Zs:
2000 | EN QUAD |
2001 | EM QUAD |
2002 | EN SPACE |
2003 | EM SPACE |
2004 | THREE-PER-EM SPACE |
2005 | FOUR-PER-EM SPACE |
2006 | SIX-PER-EM SPACE |
2008 | PUNCTUATION SPACE |
2009 | THIN SPACE |
200A | HAIR SPACE |
205F | MEDIUM MATHEMATICAL SPACE |
The preceding list of space characters all have a specific width, but behave otherwise as breaking spaces. In setting a justified line, normally none of these spaces, except for THIN SPACE when used in mathematical notation, will change in width. See also the SP property.
See the ID property for U+3000 IDEOGRAPHIC SPACE. For a list of all space characters in the Unicode Standard, see Section 6.2 in [U4.0].
0009 | TAB |
Except for the effect of the location of the tab stops, the tab character acts similarly to a space for the purpose of line breaking.
00AD | SOFT HYPHEN (SHY) |
SHY is rendered invisibly and has no width; it merely indicates an optional line break. The rendering of the optional line break depends on the script. For the Latin script, rendering the line break typically means displaying a hyphen at the end of the line; however, some languages require a change in spelling surrounding a line break. For examples, see Section 5.3 Additional Details on use of Soft Hyphen.
Breaking hyphens establish explicit break opportunities immediately after each occurrence.
058A | ARMENIAN HYPHEN |
2010 | HYPHEN |
2012 | FIGURE DASH |
2013 | EN DASH |
Hyphens are graphic characters with width. Since, unlike spaces, they print, they are included in the measured part of the preceding line, except where the layout style allows hyphens to hang into the margins.
0F0B | TIBETAN MARK INTERSYLLABIC TSHEG |
1361 | ETHIOPIC WORDSPACE |
1680 | OGHAM SPACE MARK |
17D5 | KHMER SIGN BARIYOOSAN |
The Tibetan tsheg is a visible mark, but it functions effectively like a space to separate words (or other units) in Tibetan. It provides a break opportunity after itself, like space.
The Ethiopian word space is a visible word delimiter and is kept on the line before.
The Ogham space mark is rendered visibly between words but should be elided at the end of a line.
2027 | HYPHENATION POINT |
A hyphenation point is a raised dot, which is used primarily to visibly indicate syllabification of words. Syllable breaks are potential line breaking opportunities in the middle of words. It is mainly used in dictionaries and similar works. When an actual line break falls inside a word containing hyphenation point characters, the hyphenation point is rendered as a regular hyphen at the end of the line.
007C | VERTICAL LINE |
In some dictionaries, a vertical bar is used instead of a hyphenation point. In this usage, U+0323 COMBINING DOT BELOW is used to mark stressed syllables, so all breaks are marked by the vertical bar. For an actual break opportunity, the vertical bar is rendered as a hyphen.
00B4 | ACUTE ACCENT |
In some dictionaries, stressed syllables are indicated with a spacing acute accent instead of the hyphenation point. In this case the accent would move to the next line, and the preceding line would end with a hyphen.
02C8 | MODIFIER LETTER VERTICAL LINE |
02CC | MODIFIER LETTER LOW VERTICAL LINE |
These characters are used in dictionaries to indicate stress and secondary stress when IPA is used. Both are prefixes to the stressed syllable in IPA. Therefore, the only sensible way to break them is to keep them with the syllable; that is to break before them.
NOTE: It is hard to find actual examples in most dictionaries, since the pronunciation fields usually occur right after the headword, and the columns are wide enough to prevent line breaks in most pronunciations.
1806 | MONGOLIAN TODO SOFT HYPHEN |
Despite its name, the Mongolian soft hyphen is not an invisible control like SOFT HYPHEN, but rather a visible character like a regular hyphen. Unlike the hyphen, it stays with the following line. SOFT HYPHEN should be used whenever optional line breaks are to be marked in any script.
2014 | EM DASH |
The EM DASH is used to set off parenthetical text, normally without spaces. However, this is language dependent; for example, in Swedish, spaces are used around the EM DASH. Line breaks can occur before and after an EM DASH, but not between two em dashes. Pairs of em dashes are sometimes used instead of a single quotation dash. For that reason, the line should not be broken between em dashes, even though not all fonts use connecting glyphs for the EM DASH.
Explicit breaks act independently of the surrounding characters.
000C | FORM FEED |
Form Feed separates pages. The text on the new page starts at the beginning of the line. No paragraph formatting is applied.
2028 | LINE SEPARATOR |
The text after the Line Separator starts at the beginning of the line. No paragraph formatting is applied.
This is similar to HTML <BR>.
2029 | PARAGRAPH SEPARATOR |
The text of the new paragraph starts at the beginning of the line. Paragraph formatting is applied.
“NEW LINE FUNCTION (NLF)”
New line functions provide additional explicit breaks. They are not individual characters, but are expressed as sequences of control characters NEL, LF, and CR. What particular sequence(s) form a NLF depends on the implementation and other circumstances as described in [U4.0] Section 5.8, Newline Guidelines.
If a character sequence for a new line function contains more than one character, it is kept together. The default behavior is to break after LF or CR, but not between CR and LF. Two additional line breaking classes have been added for convenience in this operation.
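The default behavior described above, breaking after LF or CR but not between CR and LF, might be sketched as follows. The function name mandatory_break_offsets is hypothetical, and treating NEL as a single-character new line function is an assumption; which sequences actually form an NLF depends on the implementation, as noted above.

```python
CR, LF, NEL = "\r", "\n", "\x85"


def mandatory_break_offsets(text):
    """Return the offsets just after each new line function in text."""
    offsets = []
    for i, ch in enumerate(text):
        if ch == CR and text[i + 1:i + 2] == LF:
            continue  # never break between CR and LF; the break follows the LF
        if ch in (CR, LF, NEL):
            offsets.append(i + 1)
    return offsets
```

For the string "a", CR, LF, "b", CR, "c", LF, "d" the sketch reports breaks at offsets 3, 5, and 7, so the CR LF pair is kept together.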
FFFC | OBJECT REPLACEMENT CHARACTER |
By default there is a break opportunity both before and after the object. Object-specific line break behavior is implemented in the associated object itself, and where available can override the default to prevent either or both of the break opportunities. Note that this is best implemented by querying the object itself, not by replacing the CB line breaking class by another class.
The closing character of any set of paired punctuation must be kept with the preceding character, and the same applies to all forms of wide comma and full stop.
3001..3002 | IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP |
FE50 | SMALL COMMA |
FE52 | SMALL FULL STOP |
FF0C | FULLWIDTH COMMA |
FF0E | FULLWIDTH FULL STOP |
FF61 | HALFWIDTH IDEOGRAPHIC FULL STOP |
FF64 | HALFWIDTH IDEOGRAPHIC COMMA |
plus any characters of General Category Pe in the Unicode Character Database.
Combining character sequences are treated as units for the purposes of line breaking. The line-breaking behavior of the sequence is that of the base character. If U+0020 SPACE is used as a base character, it is treated as AL instead of SP.
All characters with General Category Mc, Me, and Mn.
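The treatment of combining character sequences as units can be sketched by letting each combining mark take on the class of its base character. The function name resolve_combining and the (class, attached) pair representation are assumptions made for illustration. A combining mark with no preceding base is left as CM in this sketch, and interactions with other classes are not handled.

```python
def resolve_combining(classes):
    """Give each CM the class of its base character.

    classes: list of line breaking class names, one per character.
    Returns a list of (resolved_class, attached) pairs; attached is True for
    combining marks, which never allow a break before them.
    """
    resolved = []
    base_index = None
    for cls in classes:
        if cls == "CM" and base_index is not None:
            if resolved[base_index][0] == "SP":
                # SPACE used as a base character is treated as AL instead of SP.
                resolved[base_index] = ("AL", False)
            resolved.append((resolved[base_index][0], True))
        else:
            base_index = len(resolved)
            resolved.append((cls, False))
    return resolved
```

For the class sequence AL CM CM SP CM ID, every entry up to the final ID resolves to AL, with the three marks flagged as attached.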
1160..11F9 | Conjoining Jamos |
A sequence of conjoining Jamos is used to make up a Hangul syllable. Breaks are only allowed around the entire Hangul syllable, and the line break properties of a conjoined sequence of Jamos are the same as those of a precomposed Hangul syllable.
NOTE: For the purpose of determining line break opportunities, non-initial conjoining Jamos behave like combining marks, while the initial conjoining Jamos have the same property as Hangul Syllables.
Most controls and formatting characters are ignored in line breaking and do not contribute to the line width. Because they are given class CM, the line breaking behavior is determined by the last preceding character that is not of class CM.
NOTE: When control codes and format characters are rendered visibly during editing, more graceful layout might be achieved by assigning them the AL or ID class instead.
All characters of General Category Cc and Cf, unless listed explicitly elsewhere.
000D | CARRIAGE RETURN (CR) |
A CR indicates a mandatory break after, unless followed by a LF.
NOTE: On some platforms the sequence CR, CR, LF is used to indicate the location of actual line breaks, whereas CR LF is treated like a hard line break. As soon as a user edits the text, the location of all the CR CR LF may change as the new text breaks differently, while the relative position of the CR LF to the surrounding text stays the same. This convention allows an editor to return a buffer and the client is able to tell which text is displayed on which line, by counting CR CR LFs and CR LFs.
These behave like closing characters, except in relation to postfix and ‘non-starter’ characters.
0021 | EXCLAMATION MARK |
003F | QUESTION MARK |
2762 | HEAVY EXCLAMATION MARK ORNAMENT |
2763 | HEAVY HEART EXCLAMATION MARK ORNAMENT |
FE56..FE57 | SMALL QUESTION MARK..SMALL EXCLAMATION MARK |
FF01 | FULLWIDTH EXCLAMATION MARK |
FF1F | FULLWIDTH QUESTION MARK |
The action of these characters is to glue together both the left and right neighboring characters such that they are kept on the same line. If they follow a space character, they still allow a break.
2060 | WORD JOINER (WJ) |
FEFF | ZERO WIDTH NO-BREAK SPACE (ZWNBSP) |
The word joiner character is the preferred choice for an invisible character to keep other characters together that would otherwise be split across the line at a direct break. The character FEFF has the same effect, but since it is also used in an unrelated way as a byte order mark, the use of the WJ as the preferred interword glue will simplify the handling of FEFF. By definition WJ and ZWNBSP take precedence over the action of SP and ZW.
[NOTE TO REVIEWERS: The preceding sentence clashes with the opening sentence, which is intended to apply to all the other characters of this class. UTC may need to change some line breaking classes for some characters to allow a distinction in behavior, or UTC may need to assert that the other characters in this group must also override the effect of SPACE. The resolution of this issue could also affect the statement of rules in Section 6 Line Breaking Algorithm and the pair table entries in Section 7 Pair-Table Based Implementation. One possibility is also to give FIGURE SPACE the NU class to limit its action to numeric contexts.]
00A0 | NO-BREAK SPACE (NBSP) |
202F | NARROW NO-BREAK SPACE (NNBSP) |
180E | MONGOLIAN VOWEL SEPARATOR (MVS) |
NO-BREAK SPACE is the preferred character to use where two words should be visually separated but kept on the same line, as in the case of a title and a name “Dr.<NBSP>Joseph Becker”. NARROW NO-BREAK SPACE is used in Mongolian. The Mongolian vowel separator acts like a NNBSP in its line breaking behavior. It additionally affects the shaping of certain vowel characters as described in [U4.0] Section 12.3 Mongolian.
034F | COMBINING GRAPHEME JOINER |
This character has no visible glyph and its presence indicates that adjoining characters are to be treated as a graphemic unit, therefore preventing line breaks between them.
2007 | FIGURE SPACE |
This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.
2011 | NON-BREAKING HYPHEN (NBHY) |
This is the preferred character to use where words must be hyphenated but may not be broken at the hyphen.
0F0C | TIBETAN MARK DELIMITER TSHEG BSTAR |
This looks exactly like a Tibetan tsheg, but can be used to prevent a break. It inhibits breaking on either side, like no-break space.
Some dictionaries use a character that looks like a vertical series of four dots to indicate places where there is a syllable, but no allowable break. This character has not been encoded in Unicode yet, but is an example of a character that should be given the GL property.
002D | HYPHEN-MINUS |
Some additional context analysis is required to distinguish usage of this character as a hyphen from its use as a minus sign (or indicator of numerical range). If used as a hyphen, it acts like HYPHEN.
NOTE: In some practice, runs of HYPHEN-MINUS are used to stand in for longer dashes or horizontal rules. If it is desired to treat them like the characters or layout elements they stand for, and actual character code conversion is not performed, line breaking will need to support these special cases explicitly.
NOTE: The name ideographic for this line breaking class was chosen pars pro toto. The actual set of characters in this class includes characters other than Han ideographs.
Characters with this property do not require other characters to provide break opportunities; lines can ordinarily break before and after and between pairs of ideographic characters.
1100..115F | Initial Conjoining Jamos |
2E80..2FFF | CJK, KANGXI RADICALS, DESCRIPTION SYMBOLS |
3000 | IDEOGRAPHIC SPACE |
| HIRAGANA (except small characters) |
| KATAKANA (except small characters) |
3130..318F | HANGUL COMPATIBILITY JAMO |
3400..4DBF | CJK UNIFIED IDEOGRAPHS EXTENSION A |
4E00..9FAF | CJK UNIFIED IDEOGRAPHS |
F900..FAFF | CJK COMPATIBILITY IDEOGRAPHS |
AC00..D7AF | HANGUL SYLLABLES |
A000..A48F | YI SYLLABLES |
A490..A4CF | YI RADICALS |
FE62..FE66 | SMALL PLUS SIGN..SMALL EQUALS SIGN |
FF10..FF19 | WIDE DIGITS |
20000..2A6D6 | CJK UNIFIED IDEOGRAPHS EXTENSION B |
2F800..2FA1D | CJK COMPATIBILITY IDEOGRAPHS SUPPLEMENT |
plus all of the FULLWIDTH LATIN letters and all of the 3000-33FF blocks not covered elsewhere.
NOTE: use 2060 WORD JOINER as a manual override to prevent break opportunities around characters of class ID.
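As an illustration of the manual override mentioned in the preceding note, the following sketch inserts U+2060 WORD JOINER between every pair of adjacent characters in a string, so that a run of class ID characters offers no break opportunities internally. The function name keep_together is hypothetical; a real application would more likely insert the joiner only at the specific positions where a break is unwanted.

```python
WORD_JOINER = "\u2060"


def keep_together(text):
    """Insert WORD JOINER between every pair of adjacent characters."""
    return WORD_JOINER.join(text)
```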
These characters are intended to be used in consecutive sequence. They therefore prevent line breaks absolutely in a series of two character of this class.
2024 |
ONE DOT LEADER |
2025 |
TWO DOT LEADER |
2026 |
HORIZONTAL ELLIPSIS |
Horizontal ellipsis can be used as a three-dot leader.
002C |
COMMA |
002E |
FULL STOP |
003A |
COLON |
003B |
SEMICOLON |
0589 |
ARMENIAN FULL STOP |
Characters that usually occur inside a numerical expression may not be separated from following numeric characters, unless a space character intervenes. Since they are otherwise sentence-ending punctuation, they prevent breaks before.
There is no break in “100.00” or “10,000”, nor in “12:59”.
Standard Hangul syllables are of the form L* L V* V T*, where L, V, and T are Hangul Syllable Types. Characters of the corresponding line breaking classes JL, JV, and JT must be kept together, such that Hangul syllables are not broken. By default Hangul syllables are of class ID; therefore there is a break opportunity before the first L and after the last V or T, depending on whether the syllable ends in a V or T. Syllables may not always be of standard form; see section 3.12 Conjoining Jamo Behavior in [U4.0].
All characters of Hangul Syllable Type L.
All characters of Hangul Syllable Type V. See line breaking class JL.
All characters of Hangul Syllable Type T. See line breaking class JL.
000A |
LINE FEED (LF) |
There is a mandatory break after any LF character.
Some characters cannot start a line, but unlike CL they may allow a break in some context when they are following one or more space characters.
All characters with General Category Lm (Letter, Modifier) and East Asian Width type W or H, and all characters with General Category Sk (Symbol, Modifier) and East Asian width type W plus the following characters:
0E5A..0E5B |
THAI CHARACTER ANGKHANKHU..THAI CHARACTER KHOMUT |
17D4 |
KHMER SIGN KHAN |
17D6..17DA |
KHMER SIGN CAMNUC PII KUUH..KHMER SIGN KOOMUUT |
203C |
DOUBLE EXCLAMATION MARK |
2044 |
FRACTION SLASH |
3005 |
IDEOGRAPHIC ITERATION MARK |
301C |
WAVE DASH |
309B..309E |
KATAKANA-HIRAGANA VOICED SOUND MARK to HIRAGANA VOICED ITERATION MARK |
30FB |
KATAKANA MIDDLE DOT |
30FD |
KATAKANA ITERATION MARK |
FE54..FE55 |
SMALL SEMICOLON..SMALL COLON |
FF1A..FF1B |
FULLWIDTH COLON..FULLWIDTH SEMICOLON |
FF65 |
HALFWIDTH KATAKANA MIDDLE DOT |
FF70 |
HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK |
FF9E..FF9F | HALFWIDTH KATAKANA VOICED SOUND MARK - HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK |
Plus all Hiragana, Katakana, and Halfwidth Katakana “small” characters
NOTE: Optionally, the NS restriction may be relaxed and characters treated like ID, to achieve a more permissive style of line breaking.
These characters behave like ordinary characters in the context of ordinary characters, but activate the prefix and postfix behavior of prefix and postfix characters.
DECIMAL DIGITS (All characters of General Category Nd, except FULL WIDTH)
The opening character of any set of paired punctuation must be kept with the following character.
All characters of General Category Ps in the Unicode Character Database.
Characters that usually follow a numerical expression may not be separated from preceding numeric characters or preceding closing characters, even if one or more space characters intervene.
For example, there is no break in “(12.00) %”
The list of post-fix characters is:
0025 |
PERCENT SIGN |
00A2 |
CENT SIGN |
00B0 |
DEGREE SIGN |
2030 |
PER MILLE SIGN |
2031 |
PER TEN THOUSAND SIGN |
2032..2037 |
PRIME..REVERSED TRIPLE PRIME |
20A7 |
PESETA SIGN |
2103 |
DEGREE CELSIUS |
2109 |
DEGREE FAHRENHEIT |
2126 |
OHM SIGN |
FE6A |
SMALL PERCENT SIGN |
FF05 |
FULLWIDTH PERCENT SIGN |
FFE0 |
FULLWIDTH CENT SIGN |
Characters that usually precede a numerical expression may not be separated from following numeric characters or following opening characters, even if a space character intervenes.
There is no break in “$ (100.00)”.
All currency symbols (General Category Sc) except as listed explicitly in PO and the following:
002B |
PLUS SIGN |
005C |
REVERSE SOLIDUS |
00B1 |
PLUS-MINUS SIGN |
2116 |
NUMERO SIGN |
2212 |
MINUS SIGN |
2213 |
MINUS-OR-PLUS SIGN |
Some paired characters can be either opening or closing depending on usage. The default is to treat them as both opening and closing.
NOTE: If language information is available, it can be used to determine which character is used as the opening and which as the closing quote. (See the information in [U4.0] Section 6.2, General Punctuation.)
Characters of General Category Pf or Pi in the Unicode Character Database as well as:
0022 |
QUOTATION MARK |
0027 |
APOSTROPHE |
23B6 | BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET |
275B | HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT |
275C | HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT |
275D | HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT |
275E | HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT |
NOTE: 23B6 is subtly different from the others in this class, in that it is both an opening and a closing punctuation character at the same time. However, its use is limited to certain vertical text modes in terminal emulation. Instead of creating a one-of-a-kind class for this rarely used character, assigning it to the QU class approximates the intended behavior.
Runs of these characters require morphological analysis to determine break opportunities. This is similar to e.g. a hyphenation algorithm. For the characters that have this property, no line breaks will be found otherwise, therefore complex context analysis is mandatory.
NOTE: These characters can be mapped into their equivalent line breaking classes as a result of dictionary lookup, thus permitting a logical separation of this algorithm from the morphological analysis.
If dictionary lookup is not available, they should be treated as XX.
All characters of General Category Lo or Lm in these ranges:
0E00..0EFF |
THAI / LAO |
1000..109F |
MYANMAR |
1780..17FF |
KHMER |
All code points with General Category Cs. The line break behavior of isolated surrogates is undefined.
NOTE: The use of this line breaking class is deprecated. It was of limited usefulness for UTF-16 implementations that do not support characters beyond the BMP. The correct implementation is to resolve a pair of surrogates into a supplementary character before line breaking.
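The resolution of surrogate pairs mentioned in the note can be sketched as follows. This is a minimal illustration, not part of the specification: the function name `next_code_point` and its calling convention are assumptions made for this example, and an isolated surrogate is simply passed through, since its line break behavior is undefined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from UTF-16 text starting at
   index *pi and advance *pi past the code units consumed. An isolated
   surrogate is returned unchanged and left for the caller to handle. */
static uint32_t next_code_point(const uint16_t *text, size_t len, size_t *pi)
{
    assert(*pi < len);
    uint32_t hi = text[(*pi)++];
    if (hi >= 0xD800 && hi <= 0xDBFF && *pi < len) {
        uint32_t lo = text[*pi];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*pi)++;
            /* combine the pair into a supplementary code point */
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return hi;
}
```

The line breaking class is then looked up for the returned code point, so class SG is never needed for well-formed text.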
0020 |
SPACE (SP) |
The space characters are explicit break opportunities, but spaces at the end of a line are not measured for fit. If there is a sequence of space characters, and breaking after any of the space characters would result in the same visible line, the line breaking position after the last space character in the sequence is the locally most optimal one. In other words, since the last character measured for fit is before the space character, any number of space characters are kept together invisibly on the previous line and the first non-space character starts the next line.
NOTE: SPACE, but none of the other breaking spaces, is used in determining an indirect break.
URLs are now common enough in regular plain text that they must be taken into account when assigning general-purpose line breaking properties. The SY line break property is intended to provide a break after the solidus, but not in front of digits, so as not to break “1/2” or “06/07/99”.
002F |
SOLIDUS |
Slash (SOLIDUS) is allowed as an additional, limited break opportunity to improve the layout of web addresses.
NOTE: Normally, symbols are treated as AL. If it is desired to allow other breaks, more symbols can be added to this line breaking class, or classes BA, BB, B2 by tailoring, for example “=”. Mathematics requires additional specifications for line breaking, which are outside the scope of this document.
All characters with General Category Co and all codepoints with General Category Cn.
Unassigned code positions, private use characters and characters for which reliable line breaking information is not available are assigned this default line breaking property. The default behavior for this class is identical to class AL. Users can manually insert ZWSP or WORD JOINER around characters of class XX to allow or prevent breaks as needed.
In addition, implementations can override or tailor this default behavior, e.g. by assigning characters the property ID or another class, if that is more likely to give the correct default behavior for their users, or use other means to determine the correct behavior. For example, one implementation might treat any private use character in ideographic context as ID, while another implementation might support a method for assigning specific properties to specific definitions of private use characters. The details of such use of private use characters are outside the scope of this standard.
For supplementary characters, a useful default is to treat characters in the range 0x10000 to 0x1FFFD as AL and characters in the range 0x20000 to 0x2FFFD, and 0x30000 to 0x3FFFD as ID, until the implementation can be revised to take into account the actual line breaking properties for these characters.
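The default suggested for supplementary characters amounts to a range check. The sketch below is illustrative only: the enum and the function name `default_supplementary_class` are assumptions for this example, and code points outside the listed ranges fall back to XX here.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative class codes; a real implementation has the full set. */
enum lb_class { LB_AL, LB_ID, LB_XX };

/* Interim class for a code point whose actual property is not yet known
   to the implementation, following the ranges given in the text. */
static enum lb_class default_supplementary_class(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp >= 0x10000 && cp <= 0x1FFFD)
        return LB_AL;
    if ((cp >= 0x20000 && cp <= 0x2FFFD) || (cp >= 0x30000 && cp <= 0x3FFFD))
        return LB_ID;
    return LB_XX;   /* everything else keeps the generic default */
}
```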
For more information on handling default property values for unassigned characters see the discussion on default property values in [Section 5.3] of The Unicode Standard, Version 4.0.
200B |
ZERO WIDTH SPACE (ZWSP) |
This character does not have width. It is used to enable additional (invisible) break opportunities wherever SPACE cannot be used.
Dictionaries follow strict standards that guide their use of characters to indicate features of the terms listed. Some of these conventions mark places that can also serve as line breaking opportunities and therefore interact with line breaking and are described here. Where appropriate, these characters have been inserted in the list of characters for the corresponding line breaking class above.
However, implementing the full conventions in dictionaries requires special support. Looking up the noun “syllable” in eight dictionaries yields eight different conventions; in one dictionary, a natural hyphen in a word becomes a tilde dash if the word is split.
Dictionary of the English Language, Samuel Johnson, 1843 SY´LLABLE where ´ is a U+02B9 (and a large one at that) and follows the vowel of the main syllable (not the syllable itself).
Oxford English Dictionary (1st Edition) si·lâ'bl where · is a slightly above middle dot indicating the vowel of the stressed syllable (similar to Johnson's acute). The letter â is really U+0103. The ' is an apostrophe.
Oxford English Dictionary (2nd Edition) has gone to IPA 'sIleb(e)l where ' is U+02C8, I is U+026A, e is U+0259 (both times). The ' comes before the stressed syllable. The () indicate the schwa may be omitted.
Chambers English Dictionary (7th Edition) sil´e-bl where the stressed syllable is followed by ´ U+02B9, e is U+0259, and - is a hyphen. When splitting a word like abate´- ment, the stress mark ´ goes after the stressed syllable, followed by the hyphen. There is no special convention for splitting at a hyphen.
BBC English Dictionary sIlebl where I is U+026A U+0332, e is U+0259. The vowel of the stressed syllable is underlined.
Collins Cobuild English Language Dictionary sIlebe°l where I is U+026A U+0332, and means the same as the BBC. The e is U+0259 (both times). The ° is a U+2070 and indicates the schwa may be omitted.
Readers Digest Great Illustrated Dictionary. syl·la·ble (sílleb'l) The spelling of the word has hyphenation points (· is a U+2027) followed by phonetic spelling. The vowel of the stressed syllable is given an accent (rather than being followed by an accent). The letter e is a schwa in the actual example and ' is apostrophe.
Webster's 3rd New International Dictionary. syl·la·ble /'silebel/ The spelling of the word has hyphenation points (· is a U+2027) and is followed by phonetic spelling. The stressed syllable is preceded by ' U+02C8. The e's are schwas as usual. Webster splits words at the end of a line with a normal hyphen. When a hyphenated word is split at the hyphen this is indicated by a double hyphen which looks like a light version of the German Fraktur hyphen (short equals sign with a slight slope up to the right).
Unlike U+2010 HYPHEN, which always has a visible rendition, the character U+00AD SOFT HYPHEN (SHY) is an invisible format character that merely indicates a preferred intra-word line-break position. If the line is broken at that point, then whatever mechanism is appropriate for intra-word line-breaks should be invoked, just as if the line break had been triggered by another mechanism, such as a dictionary lookup. Depending on the language and the word, that may produce different visible results, such as:
Here are some examples of spelling changes:
Each example shows the line break as “ / ” and any inserted hyphens. There are many other cases. The inserted hyphen glyph, if any, can take a wide variety of shapes, as appropriate for the situation. Examples include shapes like U+2010 HYPHEN, U+058A ARMENIAN HYPHEN, U+180A MONGOLIAN NIRUGU, or U+1806 MONGOLIAN TODO SOFT HYPHEN.
When a SHY is used to represent a possible hyphenation location, the spelling is that of the word without hyphenation: “tug<SHY>gummi”. It is up to the line-breaking implementation to make any necessary spelling changes when such a possible hyphenation becomes actual.
Sometimes it's desirable to encode text that will not be further broken into lines, in other words, text that includes line breaking decisions. If such text includes hyphenations, the spelling must reflect the changes due to hyphenation: “tugg<U+2010>/ gummi”, including the appropriate character for any inserted hyphen. For a list of dash-like characters in Unicode see Section 6.2, General Punctuation in [U4.0].
There are three types of hyphens: Explicit hyphens, conditional hyphens, and dictionary-inserted hyphens (as a result of a hyphenation process). There is no character code for the third kind of hyphen; therefore if it is desired to make the distinction, the fact that a hyphen is dictionary-inserted must be represented out of band, or by using another control code instead of SHY.
The action of a hyphenation algorithm is equivalent to the insertion of a SHY. However, when a word contains an explicit SHY it is customarily treated as overriding the action of the hyphenator for that word.
UAX#29, Boundaries, describes a particular method for boundary detection. It is based on a set of hierarchical rules and character classifications. That method is well suited for implementation of some of the advanced heuristics for line breaking.
A slightly simplified implementation of such an algorithm can be devised that uses a two dimensional table to resolve break opportunities between pairs of characters. It is described in Section 7, Pair-Table Based Implementation.
The line breaking algorithm presented in this section can be expressed in a series of rules which take line breaking classes as input. The line breaking rules are stated in terms of regular expressions over the line breaking classes defined above and three special symbols indicating the type of line break opportunity.
! Mandatory break at the indicated position
× No break allowed at the indicated position
÷ Break allowed at the indicated position
The rules are applied in order. That is, there is an implicit “otherwise” at the front of each rule following the first. It is possible to construct alternate sets of such rules that are fully equivalent, i.e. they have the same effect.
The distinction between direct and indirect break is handled by explicitly considering the effect of SP in rule LB12. Because rules are applied in order, rule LB12 implies that a prohibited break in rules 13-19 is equivalent to an indirect break.
The examples for each rule use representative characters, where ‘H’ stands for ideographs, ‘h’ for small kana, and ‘9’ for digits.
Resolve line breaking classes:
LB 1 Assign a line break category to each character of the input. Resolve AI, CB, SA, SG, XX into other line breaking classes depending on criteria outside the scope of this algorithm.
Start and end of text:
LB 2a Never break at the start of text
× sot
LB 2b Always break at the end of text
! eot
These two rules are designed to deal with degenerate cases. Their effect is to have at least one character on each line, and at least one line break for the whole text. Emergency line breaking behavior usually also allows line breaks anywhere on the line if a legal line break cannot be found. This has the effect of preventing text from running over the margins.
Mandatory breaks:
LB 3a Always break after hard line breaks (but never between CR and LF).
BK !
CR × LF
CR !
LF !
[ED: The sequence of expressions *inside* this rule matters: the CR × LF must occur before the CR ! instead of at the end as in the last approved version. I've moved it to where it logically belongs, but it would be better if it was separated into an earlier rule in order. ]
LB 3b Don’t break before hard line breaks.
× ( BK | CR | LF )
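Rule LB 3a, including the ordering of CR × LF ahead of CR !, can be sketched as a small predicate. The class codes and the function name `mandatory_break_after` are assumptions for this illustration, and all classes other than BK, CR and LF are lumped into one placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Only the classes needed for LB 3a; everything else is LB_OTHER. */
enum lb_class { LB_BK, LB_CR, LB_LF, LB_OTHER };

/* Returns 1 if LB 3a demands a break after cls[i], else 0. */
static int mandatory_break_after(const enum lb_class *cls, size_t n, size_t i)
{
    assert(i < n);
    switch (cls[i]) {
    case LB_BK:
    case LB_LF:
        return 1;
    case LB_CR:
        /* CR × LF is tested first, so CR ! applies only to a lone CR */
        return !(i + 1 < n && cls[i + 1] == LB_LF);
    default:
        return 0;
    }
}
```

Testing the CR LF pair inside the CR case keeps the two expressions in the order the rule requires, whatever the order of the other cases.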
Explicit breaks and non-breaks:
LB 4 Don’t break before spaces or zero-width space.
× SP
× ZW
LB 5 Break after zero-width space.
ZW ÷
Combining Marks:
At any possible break opportunity between CM and a following character, CM behaves as if it had the type of its base character. Virama are treated as CM so they work correctly. Jamo are classified as JL, JV, or JT; no breaks can occur in the middle of a syllable formed by these Jamos. The effective line breaking class for the syllable should match the line breaking class for Hangul Syllables.
LB 6 Don’t break grapheme clusters (before combining marks, around virama, or within sequences of conjoining Jamos).
Treat X CM* as if it were X
Treat a sequence JL* JL JV* JV JT* as if it were a Hangul Syllable
(See the Unicode Standard Annex #29[Boundaries] for other rules regarding grapheme clusters.)
As stated in section 7.9 of The Unicode Standard, Version 3.0 [U3.0], combining characters are shown in isolation by applying them to either U+0020 SPACE (SP) or U+00A0 NO BREAK SPACE (NBSP). The visual appearance is the same, but the line breaking result is different. Correspondingly, if there is no base, or if the base character is SP, CM* or SP CM* behave like ID.
LB 7 In all of the following rules, if a space is the base character for a combining mark, the space is changed to type ID. In other words, break before SP CM* in the same cases as one would break before an ID.
Treat SP CM* as if it were ID
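The class resolution described by LB 6 and LB 7 can be sketched as a pre-pass over the array of classes. This is an illustration under stated assumptions: the enum and the function name `resolve_combining_marks` are invented for the example, only a handful of classes are shown, and the pass resolves classes only. The caller must still prohibit a break in front of each combining mark.

```c
#include <assert.h>
#include <stddef.h>

enum lb_class { LB_AL, LB_ID, LB_NU, LB_SP, LB_CM };

/* Rewrites cls[] in place so that no CM remains:
   - a CM takes the class of its base (LB 6),
   - a SP serving as base becomes ID, together with its marks (LB 7),
   - a CM with no base at the start of text becomes ID. */
static void resolve_combining_marks(enum lb_class *cls, size_t n)
{
    assert(cls != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cls[i] != LB_CM)
            continue;
        if (i == 0) {                 /* missing base character */
            cls[i] = LB_ID;
            continue;
        }
        if (cls[i - 1] == LB_SP)      /* SP CM* behaves like ID */
            cls[i - 1] = LB_ID;
        cls[i] = cls[i - 1];          /* inherit the resolved base class */
    }
}
```

Because each mark copies the already resolved class of its predecessor, a run of several marks inherits the class of the same base.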
Opening and closing:
These have special behavior with respect to spaces, and so come before rule 12.
LB 8 Don’t break before ‘]’ or ‘!’ or ‘;’ or ‘/’, even after spaces.
× CL
× EX
× IS
× SY
LB 9 Don’t break after ‘[’, even after spaces.
OP SP* ×
LB 10 Don’t break within ‘”[’, even with intervening spaces.
QU SP* × OP
LB 11 Don’t break within ‘]h’, even with intervening spaces.
CL SP* × NS
LB 11a Don’t break within ‘——’, even with intervening spaces.
B2 SP* × B2
Non-breaking characters:
LB 11b Don’t break before or after NBSP, WORD JOINER and related characters
× GL
GL ×
Spaces:
LB 12 Break after spaces
SP ÷
Many existing implementations reverse the order of precedence between rules LB11b and LB12.
Special case rules:
LB 14 Don’t break before or after ‘”’
× QU
QU ×
LB 14a Break before and after unresolved CB
÷ CB
CB ÷
Conditional breaks should be resolved external to the line break rules. However, the default action is to treat unresolved CB as breaking before and after.
LB 15 Don’t break before hyphen-minus, other hyphens, fixed-width spaces, small kana and other non-starters, or after acute accents:
× BA
× HY
× NS
BB ×
[The X HY rule is proposed to be moved here to disallow breaking '-3'. This change is subject to review and approval by the UTC.]
LB 16 Don’t break between two ellipses, or between letters or numbers and ellipsis:
AL × IN
ID × IN
IN × IN
NU × IN
Examples: ‘9...’, ‘a...’, ‘H...’
Numbers:
Don't break alphanumerics.
LB 17 Don’t break within ‘a9’, ‘3a’, or ‘H%’
ID × PO
AL × NU
NU × AL
Numbers are of the form PR? ( OP | HY )? NU ( NU | IS )* CL? PO?
Examples: $(12.35) 2,1234 (12)¢ 12.54¢
This is approximated with the following rules. (Some cases are already handled above, like ‘9,’, ‘[9’.) Regular expression-based linebreak engines will get better results implementing the above regular expression for numeric expressions.
LB 18 Don’t break between the following pairs of classes.
CL × PO
HY × NU
IS × NU
NU × NU
NU × PO
PR × AL
PR × HY
PR × ID
PR × NU
PR × OP
SY × NU
Example pairs: ‘$9’, ‘$[’, ‘$-’, ‘-9’, ‘/9’, ‘99’, ‘,9’, ‘9%’, ‘]%’
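For engines that prefer to match the numeric form directly instead of approximating it with pairs, the expression for numbers given above can be matched over a sequence of line breaking classes with a simple greedy scan. The sketch below is illustrative: the enum and the function name `match_number` are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

enum lb_class { LB_PR, LB_PO, LB_OP, LB_CL, LB_HY, LB_NU, LB_IS, LB_AL };

/* Returns the length of the longest prefix of cls[0..n) that matches
   PR? ( OP | HY )? NU ( NU | IS )* CL? PO?, or 0 if there is none.
   No break would be allowed inside the matched span. */
static size_t match_number(const enum lb_class *cls, size_t n)
{
    size_t i = 0;
    assert(cls != NULL || n == 0);
    if (i < n && cls[i] == LB_PR)
        i++;                                        /* optional prefix */
    if (i < n && (cls[i] == LB_OP || cls[i] == LB_HY))
        i++;                                        /* optional '(' or '-' */
    if (i >= n || cls[i] != LB_NU)
        return 0;                                   /* a digit is required */
    i++;
    while (i < n && (cls[i] == LB_NU || cls[i] == LB_IS))
        i++;                                        /* digits and separators */
    if (i < n && cls[i] == LB_CL)
        i++;                                        /* optional ')' */
    if (i < n && cls[i] == LB_PO)
        i++;                                        /* optional postfix */
    return i;
}
```

A greedy scan is sufficient here because each optional part uses classes that cannot start the part that follows it.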
LB 18b Break after hyphen-minus, and before acute accents:
HY ÷
÷ BB
Finally, join alphabetic letters and break everything else.
LB 19 Don’t break between alphabetics (“at”)
AL × AL
LB 20 Break everywhere else
ALL ÷
÷ ALL
A two dimensional table can be used to resolve break opportunities between pairs of characters. The rows of the table are labeled by the possible values of the line breaking property of the leading character in the pair; the columns are labeled by the line breaking class for the following character of the pair. Each intersection is labeled with the resulting line breaking opportunity.
The Japanese standard JIS X 4051-1995 [JIS] provides an example of such a table-based definition. However, it uses line breaking classes whose membership is not solely determined by the line breaking property (as in this Annex), but in some cases by heuristic analysis or markup of the text.
The implementation provided here directly uses the line breaking classes defined above.
If two rows of the table have identical values and the corresponding columns also have identical values, the two line breaking classes can be coalesced. For example, the JIS standard uses 20 classes of which only 14 appear to be unique. A minimal table representation is unique, except for trivial reordering of rows and columns.
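The coalescing test described above can be sketched as a comparison of two rows and the two corresponding columns. The table size `NCLS`, the use of characters as cell values, and the function name `can_coalesce` are assumptions made for this illustration.

```c
#include <assert.h>

#define NCLS 4   /* hypothetical number of classes in the table */

/* Returns 1 if classes a and b have identical rows and identical columns
   in the pair table, so that they could be merged into one class. */
static int can_coalesce(const char tbl[NCLS][NCLS], int a, int b)
{
    assert(a >= 0 && a < NCLS && b >= 0 && b < NCLS);
    for (int k = 0; k < NCLS; k++) {
        if (tbl[a][k] != tbl[b][k])
            return 0;   /* the rows differ */
        if (tbl[k][a] != tbl[k][b])
            return 0;   /* the columns differ */
    }
    return 1;
}
```

Both conditions are needed: in the example pair table of this annex, the rows for NS and EX are identical, but their columns differ, so those two classes cannot be merged.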
Rules LB 6 and LB 8 through LB 11 require extended context for handling combining marks and spaces. This extended context must be built into the code that interprets the pair table.
By broadening the definition of pair from B A, where B is the line breaking class before a break, and A the one after, to B SP* A, where SP* is an optional run of space characters, the same table can be used to distinguish between cases where SP can or cannot provide a line breaking opportunity (i.e. direct and indirect breaks). Rules equivalent to the ones given in Section 6 Line Breaking Algorithm can be formulated without explicit use of SP, by instead using % to express indirect breaks. These rules can then be simplified to involve only pairs of classes, e.g. only constructions of the form
B ÷ A
B % A
B ^ A
where either A or B may be empty. These simplified rules can then be automatically translated into a pair table, as in the example below. Line break analysis then proceeds by pair table lookup.
The following example table implements the line breaking behavior described in this Annex, within the limitation that only context of the form B SP* A is considered. BK, CR, LF and SP classes are handled explicitly in the outer loop as given in the code sample below. B CM* can be handled approximately in the table, or explicitly in the driving loop, as explained in Section 7.5 Combining Marks. Using the table for CM is equivalent to making the simplifying assumption that combining marks are only applied to base characters of line breaking class AL. Conjoining Jamos are considered separately in Section 7.6 Conjoining Jamos.
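Table 2 Example Pair Table (rows: ‘Before’ class, columns: ‘After’ class; ^ prohibited break, % indirect break, _ direct break)
‘Before’ \ ‘After’ | OP | CL | QU | GL | NS | EX | SY | IS | PR | PO | NU | AL | ID | IN | HY | BA | BB | B2 | ZW | CM |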
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OP | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ | ^ |
CL | _ | ^ | % | ^ | ^ | ^ | ^ | ^ | _ | % | _ | _ | _ | _ | % | % | _ | _ | ^ | % |
QU | ^ | ^ | % | ^ | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | ^ |
GL | % | ^ | % | ^ | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | ^ |
NS | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | % |
EX | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | % |
SY | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | % |
IS | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | % |
PR | % | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | % | % | _ | % | % | _ | _ | ^ | ^ |
PO | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | % |
NU | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | % | % | % | _ | % | % | % | _ | _ | ^ | ^ |
AL | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | % | _ | % | % | % | _ | _ | ^ | ^ |
ID | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | % |
IN | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | % | % | % | _ | _ | ^ | % |
HY | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | % | _ | _ | _ | % | % | _ | _ | ^ | % |
BA | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | _ | ^ | % |
BB | % | ^ | % | ^ | % | ^ | ^ | ^ | % | % | % | % | % | % | % | % | % | % | ^ | % |
B2 | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | _ | _ | _ | _ | _ | % | % | _ | ^ | ^ | % |
ZW | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | ^ | % |
CM | _ | ^ | % | ^ | % | ^ | ^ | ^ | _ | % | _ | _ | _ | % | % | % | _ | _ | ^ | ^ |
Hovering over the cells in a browser enabled for tool-tips reveals the rule number that determines the breaking status in the case in question. When a case has to be tested with and without intervening spaces, multiple rules are given.
The following two functions demonstrate how the pair table is used. For a complete implementation of the line breaking algorithm, if statements to handle the following line breaking classes need to be added: CR, LF, CB, SG, XX. They have been omitted here for brevity.
    // placeholder function for complex break analysis
    // returns the number of leading SA characters consumed
    int findComplexBreak(int *pcls, int *pbrk, int cch)
    {
        int ich;    // declared outside the loop so that it can be returned
        if (!cch)
            return 0;
        for (ich = 0; ich < cch; ich++) {
            // .. do complex break analysis here
            if (pcls[ich] != SA)
                break;
        }
        return ich;
    }

    enum break_action {
        DBK = 0,    // direct break (_ in table)
        IBK,        // indirect break (% in table)
        PBK         // prohibited break (^ in table)
    };

    // handle spaces separately, all others by table
    // pcls - pointer to array of line breaking classes (input)
    // pbrk - pointer to array of line break opportunities (output)
    // cch - number of elements in the arrays (“count of characters”) (input)
    // ich - current index into the arrays (variable)
    // brkPairs - the pair table, indexed [before class][after class]
    int findLineBrk1(int *pcls, int *pbrk, int cch)
    {
        int ich;    // declared outside the loop; used again after it
        if (!cch)
            return 0;
        int cls = pcls[0];    // class of the last non-space character
        // loop over all pairs in the string
        for (ich = 1; (ich < cch) && (cls != BK); ich++) {
            // handle spaces
            if (pcls[ich] == SP) {
                pbrk[ich-1] = PBK;
                continue;
            }
            // handle complex scripts
            if (pcls[ich] == SA) {
                ich += findComplexBreak(&pcls[ich-1], &pbrk[ich-1], cch - (ich-1));
                if (ich < cch)
                    cls = pcls[ich];
                continue;
            }
            // lookup pair table information
            int brk = brkPairs[cls][pcls[ich]];
            if (brk == IBK) {
                // an indirect break is a break only if spaces intervene
                pbrk[ich-1] = ((pcls[ich - 1] == SP) ? IBK : PBK);
            } else {
                pbrk[ich-1] = brk;
            }
            cls = pcls[ich];
        }
        // always break at the end
        pbrk[ich-1] = DBK;
        return ich;
    }
The function returns all the break opportunities in the array pointed to by pbrk, using the values in the table.
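To see the driver loop at work without the full table, the following self-contained miniature restricts itself to the classes OP, CL and AL, with cell values copied from the corresponding cells of the example pair table. It is a sketch only: the names `pairs` and `find_breaks` are assumptions for this example, complex scripts and mandatory breaks are left out, and the text is assumed not to start with a space.

```c
#include <assert.h>
#include <stddef.h>

/* SP is handled by the loop and has no row or column in the table. */
enum lb_class { OP, CL, AL, NCLS, SP = NCLS };
enum break_action { DBK, IBK, PBK };

/* Rows: class before the break; columns: class after the break. */
static const unsigned char pairs[NCLS][NCLS] = {
    /*          OP   CL   AL  */
    /* OP */ { PBK, PBK, PBK },
    /* CL */ { DBK, PBK, DBK },
    /* AL */ { DBK, PBK, IBK },
};

/* brk[i] receives the action between character i and character i + 1. */
static void find_breaks(const enum lb_class *cls, enum break_action *brk, size_t n)
{
    if (n == 0)
        return;
    assert(cls[0] != SP);                 /* simplification for this sketch */
    enum lb_class before = cls[0];        /* last non-space class seen */
    for (size_t i = 1; i < n; i++) {
        if (cls[i] == SP) {               /* never break before a space */
            brk[i - 1] = PBK;
            continue;
        }
        enum break_action act = (enum break_action)pairs[before][cls[i]];
        if (act == IBK)                   /* break only if spaces intervene */
            act = (cls[i - 1] == SP) ? IBK : PBK;
        brk[i - 1] = act;
        before = cls[i];
    }
    brk[n - 1] = DBK;                     /* always break at the end */
}
```

For the class sequence AL AL SP AL (as in “ab c”), the loop reports a prohibited break between the two letters and before the space, and an indirect break after the space.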
If one makes the simplifying assumption that combining marks are only applied to AL, or that applying a combining mark turns the combination into AL, then CM can be handled in the table as shown. (Such an assumption does not hold when conjoining Jamos are used).
Therefore it is preferable to handle CM outside of the pair table in the driver code. Adding a simple statement in the loop
    // handle combining marks
    if (pcls[ich] == CM) {
        if (pcls[ich-1] == SP) {
            cls = ID;
            if (ich > 1)
                pbrk[ich-2] = brkPairs[pcls[ich-2]][ID] == DBK ? DBK : PBK;
        }
        pbrk[ich-1] = PBK;
        continue;
    }
would have the effect of letting the CM take on the class of the preceding non-CM characters. It also takes care of rule LB7, treating a combining mark applied to a SP as if it was ID. This also requires a statement in the setup part before the loop to cover the case of a missing base character at the beginning of the line:
    // handle missing base character
    if (cls == CM)
        cls = ID;
In principle, line break analysis would follow grapheme cluster boundary detection. This would handle combining character sequences and conjoining Jamo sequences as units. However, in order to do the analysis in one pass, combining character sequences can be handled approximately as described above, and pair table entries for conjoining Jamo can be added to the pair table as described here.
Table 3 Additional Pair Table Entries for Conjoining Jamos
‘Before’ \ ‘After’ | OT | JL | JV | JT |
---|---|---|---|---|
OT | @T2 | @ID | @ID | @ID |
JL | @ID | % | % | _ |
JV | @ID | _ | % | % |
JT | @ID | _ | _ | % |
This table describes only the additions needed to the Example Pair Table in Table 2, and uses a shorthand notation. The cell labeled @T2 stands for the entire Table 2, which contains the pair entries for all the other line breaking classes, here referred to as OT. The cells labeled @ID are shorthand for rows and columns containing the pair table entries for all combinations of the given Jamo class with any of the other line break classes. By default, these rows and columns have the same values as the row and column for class ID in Table 2. However, a common tailoring is to give them the same values as for class AL instead.
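The Jamo-to-Jamo cells of Table 3 can be exercised on their own. The sketch below copies those nine cells into a small array and reports where a run of conjoining Jamos first allows a direct break; the names `jamo_pairs` and `first_jamo_break` are assumptions for this illustration, and the @ID rows and columns are not modeled.

```c
#include <assert.h>
#include <stddef.h>

enum jamo_class { JL, JV, JT, NJAMO };
enum break_action { DBK, IBK, PBK };

/* Rows: class before the break; columns: class after the break. */
static const unsigned char jamo_pairs[NJAMO][NJAMO] = {
    /*          JL   JV   JT  */
    /* JL */ { IBK, IBK, DBK },
    /* JV */ { DBK, IBK, IBK },
    /* JT */ { DBK, DBK, IBK },
};

/* Returns the index i of the first direct break between seq[i] and
   seq[i + 1], or n if the whole run stays together. */
static size_t first_jamo_break(const enum jamo_class *seq, size_t n)
{
    assert(seq != NULL || n == 0);
    for (size_t i = 0; i + 1 < n; i++) {
        if (jamo_pairs[seq[i]][seq[i + 1]] == DBK)
            return i;
    }
    return n;
}
```

A standard syllable such as JL JL JV JV JT JT yields no direct break, while in JL JV JT JL JV the first direct break falls after the JT, that is, between the two syllables.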
A real world line breaking algorithm must be tailorable to some degree to meet user or document requirements.
In Korean, for example, two distinct line breaking modes may occur, which can be summarized as breaking after each character, or breaking after spaces (as in Latin text). The former tends to occur when text is set justified, the latter, when ragged margins are used. In that case, even Ideographs are only broken at space characters.
In Japanese, for example, tighter and looser specifications of prohibited line breaks may be used.
The remainder of this section gives an overview of common types of tailorings and examples of how these can be used to customize the algorithm as needed.
There are three principal ways of tailoring the line break algorithm:
Beyond these three straightforward customization steps, it is always possible to augment the algorithm itself, for example by providing specialized rules to recognize and break common constructs, such as URLs. Such open ended customizations place no limits to possible changes, other than to correctly implement characters with normative line-breaking properties.
Example 1. One method of implementing line breaks for complex scripts is to invoke context-based classification for all runs of characters with class SA. For example a dictionary-based algorithm could return different classes for Thai letters depending on their context: letters at the start of Thai words would become BB and other Thai letters would become AL. The sample code sketches a different approach where the dictionary-based algorithm directly reports break opportunities.
Example 2. To implement terminal style line breaks, it would be necessary to allow breaks inside a run of spaces. This cannot be done in the pair-table, but requires a change in the way the driver loop handles spaces.
Example 3. Depending on the nature of the document, Korean uses either implicit breaking around characters (type 2 as defined above in section 3 Description) or uses spaces (type 1). Space-based layout is common in informal documents with ragged margins, such as magazines, while books, with both margins justified, use the other type, as it affords more line break opportunities and therefore leads to better justification. Reference [Suign98] shows how the necessary customizations can be elegantly handled by selectively altering the interpretation of the pair entries. Only the intersections of ID/ID, AL/ID and ID/AL are affected. For alphabetic style line breaking, breaks for these three cases require space; for ideographic style line breaking, these three cases do not require spaces. Therefore, [Suign98] defines a pseudo-action, which is then resolved into either direct or indirect break action based on user selection of the preferred behavior for a given (piece of) text.
Example 4. Sometimes allowing alphabetic characters and digit strings to break anywhere is required in a Far Eastern context. According to reference [Suign98] this can again be done in the same way, this time affecting the intersections of NU/NU, NU/AL, AL/AL, and AL/NU.
Example 5. Some users prefer to force Kana syllables to be kept together. For example, the syllable kyu, spelled with the two kana KI and “small yu”, would be kept together even though KI and yu are normally atomic. This customization can be handled via the first method, by changing the classification of the Kana small characters from ID to NS as needed.
Reference [Cedar97] reports on a real-world, pair-table based implementation of a line breaking algorithm substantially similar to the one presented here, including the types of customizations presented in this section. This implementation was able to simultaneously meet the requirements of customers in many European and East Asian countries with a single implementation of the algorithm.
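The reclassification described in Example 5 could be sketched as follows. This is an illustration under stated assumptions: base_class is a hypothetical stand-in for a lookup in the line break property data file and covers only Hiragana and Katakana, SMALL_KANA lists only a subset of the small kana, and tailored_class and its keep_syllables_together flag are names invented for this sketch.

```python
# Illustrative subset of the small kana (hiragana and katakana ya, yu, yo).
SMALL_KANA = frozenset("ゃゅょャュョ")


def base_class(ch):
    """Stand-in for a property lookup; only distinguishes kana from the rest."""
    return "ID" if "\u3041" <= ch <= "\u30ff" else "AL"


def tailored_class(ch, keep_syllables_together=False):
    """Return the line-break class, optionally treating small kana as NS."""
    if keep_syllables_together and ch in SMALL_KANA:
        # NS prohibits a break before the character, so KI + small yu stay together.
        return "NS"
    return base_class(ch)


print([tailored_class(c) for c in "きゅ"])
print([tailored_class(c, keep_syllables_together=True) for c in "きゅ"])
```

Only the classification step changes; the pair table and the driver loop are left untouched.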
[TBD: To be further updated for the final, approved version.]
[Boundaries] | Unicode Standard Annex #29, Text Boundaries. http://www.unicode.org/unicode/reports/tr29 For information on grapheme cluster boundaries |
[Cedar97] | Cy Cedar, David Veintimilla, Michel Suignard and Asmus Freytag, Report from the Trenches: Microsoft Publisher goes Unicode, Proceedings of the Eleventh International Unicode Conference, San Jose, CA 1997 |
[Data] | The version of the line break property data file at the time of the publication of this document is http://www.unicode.org/Public/3.2-Update/LineBreak-3.2.0.txt The latest version of the data file is http://www.unicode.org/Public/UNIDATA/LineBreak.txt |
[EAW] | Unicode Standard Annex #11, East Asian Width. http://www.unicode.org/unicode/reports/tr11 For a definition of East Asian Width |
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/unicode/faq/ For answers to common questions on technical issues. |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[JIS] | JIS X 4051-1995. Line Composition Rules for Japanese Documents. (『日本語文書の行組版方法』) Japanese Standards Association. 1995. |
[Knuth78] | Donald E. Knuth and Michael F. Plass, Breaking Lines into Paragraphs, republished in Digital Typography, CSLI 78, (Stanford, California: CSLI Publications 1997) |
[Reports] | Unicode Technical Reports http://www.unicode.org/unicode/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[Suign98] | Michel Suignard, Worldwide Typography and How to Apply JIS X 4051-1995 to Unicode, Proceedings of the Twelfth International Unicode/ISO 10646 Conference, Tokyo, Japan, 1998 |
[TeX] | Donald E. Knuth, TeX: The Program, Volume B of Computers & Typesetting, (Reading, Massachusetts: Addison-Wesley 1986) |
[U3.0] | The Unicode Standard, Version 3.0, (Reading, Massachusetts: Addison-Wesley Developers Press 2000) or online as http://www.unicode.org/unicode/uni2book/u2.html |
[U3.1] | Unicode Standard Annex #27: Unicode 3.1 http://www.unicode.org/unicode/reports/tr27/ |
[U3.2] | Unicode Standard Annex #28: Unicode 3.2 http://www.unicode.org/unicode/reports/tr28/ |
[U4.0] | The Unicode Standard, Version 4.0, (Reading, Massachusetts: Addison-Wesley Developers Press 2003) or online as http://www.unicode.org/unicode/uni2book/u2.html |
[UCD] | Unicode Character Database http://www.unicode.org/ucd/ For an overview of the Unicode Character Database and a list of its associated files see http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html |
[UAX9] | Unicode Standard Annex #9: The Bidirectional Algorithm http://www.unicode.org/unicode/reports/tr9/ |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/unicode/standard/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them. |
The initial assignments of properties are based on input by Michel Suignard. Mark Davis provided algorithmic verification and formulation of the rules. Ken Whistler, Rick McGowan and other members of the editorial committee provided valuable feedback. Tim Partridge enlarged the information on dictionary usage. Sun Gi Hong reviewed the information on Korean and provided copious printed samples. Eric Muller reanalyzed the behavior of the soft hyphen and collected the samples.
Change from Revision 12:
Change from Revision 11:
[Revision 11, being a proposed update, is superseded and no longer publicly available]
Change from Revision 10:
Change from Revision 9:
Change from Revision 8:
Change from Revision 7:
Change from Revision 6:
Change from Revision 5:
Copyright © 1998-2003 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.