| Version | Unicode 5.1.0 |
| Authors | Asmus Freytag (asmus@unicode.org), Andy Heninger (andy.heninger@gmail.com) |
| Date | 2008-03-31 |
| This Version | http://www.unicode.org/reports/tr14/tr14-22.html |
| Previous Version | http://www.unicode.org/reports/tr14/tr14-19.html |
| Latest Version | http://www.unicode.org/reports/tr14/ |
| Revision | 22 |
This annex presents the Unicode line breaking algorithm along with detailed descriptions of each of the character classes established by the Unicode line breaking property. The line breaking algorithm produces a set of "break opportunities", or positions that would be suitable for wrapping lines when preparing text for display. A model implementation using pair tables is also provided.
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
The text of The Unicode Standard [Unicode] presents a limited description of some of the characters with specific functions in line breaking, but does not give a complete specification of line breaking behavior. This annex provides more detailed information about default line breaking behavior reflecting best practices for the support of multilingual texts.
For most Unicode characters, considerable variation in line breaking behavior can be expected, including variation based on local or stylistic preferences. For that reason, the line breaking properties provided for these characters are informative. Some characters are intended to explicitly influence line breaking. Their line breaking behavior is therefore expected to be identical across all implementations. As described in this annex, the Unicode Standard assigns normative line breaking properties to those characters. The Unicode Line Breaking Algorithm is a tailorable set of rules that uses these line breaking properties in context to determine line break opportunities.
This annex opens with formal definitions, a summary of the line breaking task and the context in which it occurs in overall text layout followed by a brief section on conformance requirements. Three main sections follow:
The final two sections discuss issues of customization and implementation.
All terms not defined here shall be as defined in the Unicode Standard [Unicode5.0]. The notation defined in this annex differs somewhat from the notation defined elsewhere in the Unicode Standard. All other notation used here without an explicit definition shall be as defined elsewhere in the Unicode Standard.
LD1 Line Fitting: The process of determining how much text will fit on a line of text, given the available space between the margins and the actual display width of the text.
LD2 Line Break: The position in the text where one line ends and the next one starts.
LD3 Line Break Opportunity: A place where a line is allowed to end.
LD4 Line Breaking: The process of selecting one among several line break opportunities such that the resulting line is optimal or ends at a user-requested explicit line break.
LD5 Line Breaking Property: A character property with enumerated values, as listed in Table 1, and separated into normative and informative values.
LD6 Line Breaking Class: A class of characters with the same line breaking property value.
The Line Breaking Classes are described in Section 5.1, Description of Line Breaking Properties.
LD7 Mandatory Break: A line must break following a character that has the mandatory break property.
Such a break is also known as a forced break and is indicated in the rules as B !, where B is the character with the mandatory break property.
LD8 Direct Break: A line break opportunity exists between two adjacent characters of the given line breaking classes.
A direct break is indicated in the rules below as B ÷ A, where B is the character class of the character before and A is the character class of the character after the break. If they are separated by one or more space characters, a break opportunity exists instead after the last space. In the pair table, the optional space characters are not shown.
LD9 Indirect Break: A line break opportunity exists between two characters of the given line breaking classes only if they are separated by one or more spaces.
An indirect break is indicated in the pair table in Table 2 as B % A, where B is the character class of the character before and A is the character class of the character after the break. Even though space characters are not shown in the pair table, an indirect break can occur only if one or more spaces follow B. In the notation of the rules in Section 6, Line Breaking Algorithm, this would be represented as two rules: B × A and B SP+ ÷ A where the “+” sign means one or more occurrences.
LD10 Prohibited Break: No line break opportunity exists between two characters of the given line breaking classes, even if they are separated by one or more space characters.
A prohibited break is indicated in the pair table in Table 2 as B ^ A, where B is the character class of the character before and A is the character class of the character after the break, and the optional space characters are not shown. In the notation of the rules in Section 6, Line Breaking Algorithm, this would be expressed as a rule of the form: B SP* × A.
LD11 Hyphenation: Hyphenation uses language-specific rules to provide additional line break opportunities within a word.
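The three kinds of pair-table entries defined in LD8 through LD10 differ only in how they treat intervening spaces. The following is a minimal sketch of that distinction, not part of the annex itself; the names `Action` and `break_allowed` are hypothetical and chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    """Pair-table entries for a character pair B, A (hypothetical names)."""
    DIRECT = "÷"      # LD8: break opportunity between B and A
    INDIRECT = "%"    # LD9: break opportunity only if spaces intervene
    PROHIBITED = "^"  # LD10: no break opportunity, even across spaces


def break_allowed(action, spaces_between):
    """Return True if a line break opportunity exists before A.

    spaces_between is the number of SP characters between B and A.
    When spaces are present, the opportunity falls after the last space.
    """
    if action is Action.DIRECT:
        return True
    if action is Action.INDIRECT:
        return spaces_between > 0
    return False  # Action.PROHIBITED
```

For example, under this sketch an indirect break yields no opportunity for "BA" but does yield one for "B A", whereas a prohibited break yields none in either case.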
Table 1 lists all of the line breaking classes by name, also giving their class abbreviation and their status as tailorable or not. The examples and brief indication of line breaking behavior in this table are merely typical, not exhaustive. Section 5.1, Description of Line Breaking Properties, provides a detailed description of each line breaking class, including a detailed overview of the line breaking behavior for characters of that class.
Table 1. Line Breaking Classes (* = non-tailorable)
| Class | Descriptive Name | Examples | Characters with This Property... |
| Non-tailorable Line Breaking Classes | | | |
| BK * | Mandatory Break | NL, PS | Cause a line break (after) |
| CR * | Carriage Return | CR | Cause a line break (after), except between CR and LF |
| LF * | Line Feed | LF | Cause a line break (after) |
| CM * | Attached Characters and Combining Marks | Combining marks, control codes | Prohibit a line break between the character and the preceding character |
| NL * | Next Line | NEL | Cause a line break (after) |
| SG * | Surrogates | Surrogates | Do not occur in well-formed text |
| WJ * | Word Joiner | WJ | Prohibit line breaks before and after |
| ZW * | Zero Width Space | ZWSP | Provide a break opportunity |
| GL * | Non-breaking (“Glue”) | CGJ, NBSP, ZWNBSP | Prohibit line breaks before and after |
| SP * | Space | SPACE | Enable indirect line breaks |
| Break Opportunities | | | |
| B2 | Break Opportunity Before and After | Em dash | Provide a line break opportunity before and after the character |
| BA | Break Opportunity After | Spaces, hyphens | Generally provide a line break opportunity after the character |
| BB | Break Opportunity Before | Punctuation used in dictionaries | Generally provide a line break opportunity before the character |
| HY | Hyphen | HYPHEN-MINUS | Provide a line break opportunity after the character, except in numeric context |
| CB | Contingent Break Opportunity | Inline objects | Provide a line break opportunity contingent on additional information |
| Characters Prohibiting Certain Breaks | | | |
| CL | Closing Punctuation | “)”, “]”, “}”, etc. | Prohibit line breaks before |
| EX | Exclamation/Interrogation | “!”, “?”, etc. | Prohibit line breaks before |
| IN | Inseparable | Leaders | Allow only indirect line breaks between pairs |
| NS | Nonstarter | small kana | Allow only indirect line breaks before |
| OP | Opening Punctuation | “(”, “[”, “{”, etc. | Prohibit line breaks after |
| QU | Ambiguous Quotation | Quotation marks | Act like they are both opening and closing |
| Numeric Context | | | |
| IS | Infix Separator (Numeric) | . , | Prevent breaks after any and before numeric |
| NU | Numeric | Digits | Form numeric expressions for line breaking purposes |
| PO | Postfix (Numeric) | %, ¢ | Do not break following a numeric expression |
| PR | Prefix (Numeric) | $, £, ¥, etc. | Do not break in front of a numeric expression |
| SY | Symbols Allowing Break After | / | Prevent a break before, and allow a break after |
| Other Characters | | | |
| AI | Ambiguous (Alphabetic or Ideographic) | Characters with Ambiguous East Asian Width | Act like AL when the resolved EAW is N; otherwise, act as ID |
| AL | Ordinary Alphabetic and Symbol Characters | Alphabets and regular symbols | Are alphabetic characters or symbols that are used with alphabetic characters |
| H2 | Hangul LV Syllable | Hangul | Form Korean syllable blocks |
| H3 | Hangul LVT Syllable | Hangul | Form Korean syllable blocks |
| ID | Ideographic | Ideographs | Break before or after, except in some numeric context |
| JL | Hangul L Jamo | Conjoining jamo | Form Korean syllable blocks |
| JV | Hangul V Jamo | Conjoining jamo | Form Korean syllable blocks |
| JT | Hangul T Jamo | Conjoining jamo | Form Korean syllable blocks |
| SA | Complex Context Dependent (South East Asian) | South East Asian: Thai, Lao, Khmer | Provide a line break opportunity contingent on additional, language-specific context analysis |
| XX | Unknown | Unassigned, private-use | Have as yet unknown line breaking behavior or unassigned code positions |
Lines are broken as the result of one of two conditions. The first condition is the presence of a mandatory line breaking character. The second condition results from a formatting algorithm having selected among available line break opportunities; ideally the chosen line break results in the optimal layout of the text.
Different formatting algorithms may use different methods to determine an optimal line break. For example, simple implementations consider a single line at a time, trying to find a locally optimal line break. A basic, yet widely used approach is to allow no compression or expansion of the intercharacter and interword spaces and consider the longest line that fits. More complex formatting algorithms often take into account the interaction of line breaking decisions for the whole paragraph. The well-known text layout system [TEX] implements an example of such a globally optimal strategy that may make complex tradeoffs across an entire paragraph to avoid unnecessary hyphenation and other legal, but inferior breaks. For a description of this strategy, see [Knuth78].
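The basic "longest line that fits" approach mentioned above can be sketched as follows. This is an illustrative sketch only, not a method prescribed by this annex: the function name `greedy_breaks`, the per-character width list, and the fallback used when no opportunity fits are all assumptions made for the example. Trailing spaces are counted toward the line width for simplicity.

```python
def greedy_breaks(widths, opportunities, max_width):
    """Choose line breaks one line at a time (locally optimal, first-fit).

    widths: advance width of each character, in arbitrary units.
    opportunities: indices i such that a line may end just before
        character i (as produced by a line break opportunity analysis).
    max_width: available space between the margins.
    Returns the chosen break indices, excluding the end of the text.
    """
    allowed = set(opportunities)
    n = len(widths)
    breaks = []
    start = 0
    # Keep breaking while the remaining text does not fit on one line.
    while start < n and sum(widths[start:]) > max_width:
        best = None
        used = 0
        for i in range(start, n):
            used += widths[i]
            if used > max_width:
                break
            if i + 1 in allowed:
                best = i + 1  # the latest opportunity that still fits
        if best is None:
            # Assumption: if nothing fits, overflow to the next opportunity.
            # Real emergency handling is outside the scope of this sketch.
            later = [o for o in sorted(allowed) if start < o < n]
            if not later:
                break
            best = later[0]
        breaks.append(best)
        start = best
    return breaks
```

With ten unit-width characters and opportunities before characters 4 and 8, a width of 5 selects both opportunities, while a width of 8 selects only the second.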
When compression or expansion is allowed, a locally optimal line break seeks to balance the relative merits of the resulting amounts of compression and expansion for different line break candidates. When expanding or compressing interword space according to common typographical practice, only the spaces marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and U+3000 IDEOGRAPHIC SPACE are subject to compression, and only spaces marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and occasionally spaces marked by U+2009 THIN SPACE are subject to expansion. All other space characters normally have fixed width. When expanding or compressing intercharacter space, the presence of U+200B ZERO WIDTH SPACE or U+2060 WORD JOINER is always ignored.
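The character sets named in the preceding paragraph can be written down directly. The following sketch is illustrative; the constant and function names are hypothetical, and the `allow_thin_space` flag is an assumption standing in for the "occasionally" in the text.

```python
# Spaces subject to compression and expansion, per the paragraph above.
COMPRESSIBLE = {"\u0020", "\u00A0", "\u3000"}  # SPACE, NBSP, IDEOGRAPHIC SPACE
EXPANDABLE = {"\u0020", "\u00A0"}              # SPACE, NBSP
THIN_SPACE = "\u2009"                          # expanded only occasionally
IGNORED_FOR_INTERCHARACTER = {"\u200B", "\u2060"}  # ZWSP, WORD JOINER


def expansion_slots(line, allow_thin_space=False):
    """Return the indices of the spaces in line that may be widened."""
    expandable = EXPANDABLE | ({THIN_SPACE} if allow_thin_space else set())
    return [i for i, ch in enumerate(line) if ch in expandable]


def intercharacter_gaps(line):
    """Count intercharacter gaps, ignoring ZWSP and WORD JOINER."""
    visible = [ch for ch in line if ch not in IGNORED_FOR_INTERCHARACTER]
    return max(len(visible) - 1, 0)
```

A justification routine could divide the leftover width of a line evenly among the indices returned by `expansion_slots`, leaving all other space characters at their fixed width.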
Local custom or document style determines whether and to what degree expansion of intercharacter space is allowed in justifying a line. In languages, such as German, where intercharacter space is commonly used to mark e m p h a s i s (like this), allowing variable intercharacter spacing would have the unintended effect of adding random emphasis, and is therefore best avoided. In table headings that use Han ideographs, even extreme amounts of intercharacter space commonly occur as short texts are spread out across the entire available space to distribute the characters evenly from end to end.
In line breaking it is necessary to distinguish between two related tasks. The first is the determination of all legal line break opportunities, given a string of text. This is the scope of the Unicode Line Breaking Algorithm. The second task is the selection of the actual location for breaking a given line of text. This selection not only takes into account the width of the line compared to the width of the text, but may also apply an additional prioritization of line breaks based on aesthetic and other criteria. What defines an optimal choice for a given line break is outside the scope of this annex, as are methods for its selection.
Finally, text layout systems may support an emergency mode that handles the case of an unusual line that contains no otherwise permitted line break opportunities. In such line layout emergencies, line breaks may be placed with no regard to the ordinary line breaking behavior of the characters involved. The details of such an emergency mode are outside the scope of this annex; however, it is recommended that grapheme clusters be kept together.
Three principal styles of context analysis determine line break opportunities.
The Western style is commonly used for scripts employing the space character. Hyphenation is often used with space-based line breaking to provide additional line break opportunities—however, it requires knowledge of the language and it may need user interaction or overrides.
The second style of context analysis is used with East Asian ideographic and syllabic scripts. In these scripts, lines can break anywhere, except before or after certain characters. The precise set of prohibited line breaks may depend on user preference or local custom and is commonly tailorable.
Korean makes use of both styles of line break. When Korean text is justified, the second style is commonly used, even for interspersed Latin letters. But when ragged margins are used, the Western style (relying on spaces) is commonly used instead, even for ideographs.
The third style is used for scripts such as Thai, which do not use spaces, but which restrict word breaks to syllable boundaries, the determination of which requires knowledge of the language comparable to that required by a hyphenation algorithm. Such an algorithm is beyond the scope of the Unicode Standard.
For multilingual text, the Western and East Asian styles can be unified into a single set of specifications, based on the information in this annex. Unicode characters have explicit line breaking properties assigned to them. These properties can be utilized to implement the effect of both of these two styles of context analysis for line break opportunities. Customization for user preferences or document style can then be achieved by tailoring that specification.
In bidirectional text, line breaks are determined before applying rule L1 of the Unicode Bidirectional Algorithm [Bidi]. However, line breaking is strictly independent of directional properties of the characters or of any auxiliary information determined by the application of rules of that algorithm.
There is no single method for determining line breaks; the rules may differ based on user preference and document layout. Therefore the information in this annex, including the specification of the line breaking algorithm, allows for the necessary flexibility in determining line breaks according to different conventions. However, some characters have been encoded explicitly for their effect on line breaking. Users adding such characters to a text expect that they will have the desired effect. For that reason, these characters have been given required line breaking behavior.
To handle certain situations, some line breaking implementations use techniques that cannot be expressed within the framework of the Unicode Line Breaking Algorithm. Examples include the use of dictionaries of words for languages that do not use spaces, such as Thai; recognition of the language of the text in order to choose among different punctuation conventions; the use of dictionaries of common abbreviations or contractions to resolve ambiguities with periods or apostrophes; or a deeper analysis of common syntaxes for numbers or dates, and so on. The conformance requirements permit variations of this kind.
Processes which support multiple modes for determining line breaks are also accommodated. This situation can arise with marked-up text, rich text, style sheets, or other environments in which a higher-level protocol can carry formatting instructions that prevent or force line breaks in positions that differ from those specified by the Unicode Line Break Algorithm. The approach taken here is to require that such processes have a conforming default line break behavior, and to disclose that they also include overrides or optional behaviors that are invoked via a higher-level protocol.
The methods by which a line layout process chooses optimal line breaks from among the available break opportunities are outside the scope of this specification. The behavior of a line layout process in situations where there are no suitable break opportunities is also outside the scope of this specification.
UAX14-C1. A process that determines line breaks in Unicode text, and that purports to implement the Unicode Line Breaking Algorithm, shall do so in accordance with the specifications in this annex. In particular, the following three subconditions shall be met:
UAX14-C2. If an implementation has a default line breaking operation which conforms to UAX14-C1, but also has overrides based on a higher-level protocol, that fact must be disclosed and any behavior that differs from that specified by the rules of Section 6.1, Non-tailorable Line Breaking Rules, must be documented.
Example: An XML format provides markup that disables all line breaking over some span of text. When the markup is not in place, the default behavior is in conformance according to UAX14-C1. As long as the existence of the option is disclosed, that format can be said to conform to the Unicode Line Breaking Algorithm according to UAX14-C2.
As is the case for all other Unicode algorithms, this specification is a logical description—particular implementations can have more efficient mechanisms as long as they produce the same results. See C18 in Chapter 3, Conformance, of [Unicode]. While only disclosure of tailorings is required in the conformance clauses, documentation of the differences in behaviors is strongly encouraged.
This section provides detailed narrative descriptions of the line breaking behavior of many Unicode characters. In many instances, the descriptions in this section provide additional informative detail about handling a given character at the end of a line, or during line layout, which goes beyond the simple determination of line breaks. In some cases, the text also gives guidance as to preferred characters for achieving a particular effect in line breaking.
This section also summarizes the membership of character classes for each value of the line breaking property. Note that the mnemonic names for the line break classes are intended neither as exhaustive descriptions of their membership nor as indicators of their entire range of behaviors in the line breaking process. Instead, their main purpose is to serve as unique, yet broadly mnemonic labels. In other words, as long as their line break behavior is identical, otherwise unrelated characters will be found grouped together in the same line break class.
The classification by property values defined in this section and in the data file is used as input into two algorithms defined in Section 6, Line Breaking Algorithm, and Section 7, Pair Table-Based Implementation. These sections describe workable default line breaking methods. Section 8, Customization, discusses how the default line breaking behavior can be tailored to the needs of particular languages for particular document styles and user preferences.
The full classification of all Unicode characters by their line breaking properties is available in the file LineBreak.txt [Data14] in the Unicode Character Database [UCD]. This is a tab-delimited, two-column, plain text file, with code position and line breaking class. A comment at the end of each line indicates the character name. Ideographic, Hangul, Surrogate, and Private Use ranges are collapsed by giving a range in the first column.
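A reader for a data file of this shape might look like the following sketch. It is not an official parser: the function name is hypothetical, the sample lines in the usage note are illustrative rather than quoted from the data file, and accepting either a tab or a semicolon as the field separator is an assumption made so that the sketch does not depend on the separator used by a particular version of the file.

```python
import re


def parse_line_break_data(lines):
    """Yield (first, last, lb_class) for each data line.

    Each line holds a code position or a range (first..last), then the
    line breaking class, then an optional comment introduced by '#'.
    """
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        fields = re.split(r"[;\t]", data)  # assumption: either separator
        if len(fields) < 2:
            continue
        code, lb_class = fields[0].strip(), fields[1].strip()
        first, _, last = code.partition("..")
        yield int(first, 16), int(last or first, 16), lb_class
```

For instance, an illustrative line such as `4E00..9FC3;ID # ideograph range` would yield `(0x4E00, 0x9FC3, "ID")`, with the collapsed range carried through as a pair of end points.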
As more scripts are added to the Unicode Standard and become more widely implemented and used on computers, more line breaking classes may be added or the assignment of line breaking class may be changed for some characters. Implementers must not make any assumptions to the contrary. Any future updates will be reflected in the latest version of the data file. (See the Unicode Character Database [UCD] for any specific version of the data file.)
Line breaking classes are listed alphabetically. Each line breaking class is marked with an annotation in parentheses with the following meanings:
(A)—the class allows a break opportunity after in specified contexts
(XA)—the class prevents a break opportunity after in specified contexts
(B)—the class allows a break opportunity before in specified contexts
(XB)—the class prevents a break opportunity before in specified contexts
(P)—the class allows a break opportunity for a pair of same characters
(XP)—the class prevents a break opportunity for a pair of same characters
Note: The use of the letters B and A in these annotations marks the position of the break opportunity relative to the character. It is not to be confused with the use of the same letters in the other parts of this annex, where they indicate the positions of the characters relative to the break opportunity.
Some characters that ordinarily act like alphabetic or symbol characters (which have the AL line breaking class) are treated like ideographs (line breaking class ID) in certain East Asian legacy contexts. Their line breaking behavior therefore depends on the context. In the absence of appropriate context information, they are treated as class AL; see the note at the end of this description.
As originally defined, the line break class AI contained all characters with East_Asian_Width value A (ambiguous width) that would otherwise be AL in this classification. For more information on East_Asian_Width and how to resolve it, see Unicode Standard Annex #11, East Asian Width [EAW].
The original definition included many Latin, Greek, and Cyrillic characters. These characters are now classified by default as AL because use of the AL line breaking class better corresponds to modern practice. Where strict compatibility with older legacy implementations is desired, some of these characters need to be treated as ID in certain contexts. This can be done by always tailoring them to ID or by continuing to classify them as AI and resolving them to ID where required.
As part of the same revision, the set of ambiguous characters has been extended to completely encompass the enclosed alphanumeric characters used for numbering of bullets.
As updated, the AI line breaking class includes all characters with East Asian Width A that are outside the range U+0000..U+1FFF, plus the following characters:
| 24EA | CIRCLED DIGIT ZERO |
| 2780..2793 | DINGBAT CIRCLED SANS-SERIF DIGIT ONE..DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN |
Characters with the line break class AI with East_Asian_Width value A typically take the AL line breaking class when their resolved East_Asian_Width is N (narrow) and take the line breaking class ID when their resolved width is W (wide). The remaining characters are then resolved to AL or ID in a consistent fashion. The details of this resolution are not specified in this annex. The line breaking rules in Section 6, Line Breaking Algorithm, and the pair table in Section 7, Pair Table-Based Implementation, merely require that all ambiguous characters have been resolved appropriately as part of assigning line breaking classes to the input characters.
Note: The canonical decompositions of characters of class AI are not necessarily of class AI themselves, or conversely. The East_Asian_Width property value A, on which the definition of AI is largely based, does not preserve canonical equivalence. In the context of line breaking, the fact that a character has been assigned class AI means that the line break implementation must resolve it to either AL or ID, in the absence of further tailoring. If preserving canonical equivalence is desired, an implementation is free to make sure that the resolved line break classes preserve canonical equivalence. Unless compatibility with particular legacy behavior is important, it may be sufficient to map all such characters to AL. This achieves a canonically equivalent resolution of line breaking classes, and is compatible with emerging modern practice that treats these characters increasingly like regular alphabetic characters.
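Because the details of the resolution are left unspecified, any code for it necessarily embodies a choice. The following minimal sketch assumes a single boolean flag, hypothetically named `east_asian_context`, that stands for "the resolved width is wide"; how an implementation arrives at that flag is not addressed here.

```python
def resolve_ambiguous(lb_class, east_asian_context=False):
    """Resolve class AI to AL or ID before the line breaking rules run.

    Classes other than AI pass through unchanged. With no context
    information, AI resolves to AL, as the text above describes.
    """
    if lb_class != "AI":
        return lb_class
    return "ID" if east_asian_context else "AL"
```

Calling `resolve_ambiguous("AI")` with the default flag gives the simple "map everything to AL" behavior that the note describes as sufficient when legacy compatibility is not a concern.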
Ordinary characters require other characters to provide break opportunities; otherwise, no line breaks are allowed between pairs of them. However, this behavior is tailorable. In some Far Eastern documents, it may be desirable to allow breaking between pairs of ordinary characters—particularly Latin characters and symbols.
Note: Use ZWSP as a manual override to provide break opportunities around alphabetic or symbol characters.
Except as listed explicitly below as part of another line breaking class, and except as assigned class AI or ID based on East Asian Width, this class contains the following characters:
ALPHABETIC: all remaining characters of General Categories Lu, Ll, Lt, Lm, and Lo
SYMBOLS: all remaining characters of General Categories Sm, Sk, and So
NON-DECIMAL NUMBERS: all remaining characters of General Categories Nl and No
PUNCTUATION: all remaining characters of General Categories Pc, Pd, and Po
Plus these characters:
| 0600..0603 | ARABIC NUMBER SIGN..ARABIC SIGN SAFHA |
| 06DD | ARABIC END OF AYAH |
| 070F | SYRIAC ABBREVIATION MARK |
| 2061..2064 | FUNCTION APPLICATION..INVISIBLE PLUS |
These characters occur in the middle or at the beginning of words or alphanumeric or symbol sequences. However, when alphabetic characters are tailored to allow breaks, these characters should not allow breaks after.
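The fallback membership test for this class can be sketched with the General Category data in Python's standard `unicodedata` module. The sketch is illustrative only: it covers just the "all remaining characters" rule and the four explicitly listed ranges, so a real implementation would first have to exclude every character assigned to another class (including AI and ID). The names used are hypothetical, and results for individual characters depend on the Unicode version bundled with the Python runtime.

```python
import unicodedata

# General Categories named in the list above.
AL_GENERAL_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo",  # alphabetic
    "Sm", "Sk", "So",              # symbols
    "Nl", "No",                    # non-decimal numbers
    "Pc", "Pd", "Po",              # punctuation
}

# Code point ranges listed explicitly as additional members.
AL_EXPLICIT_RANGES = [(0x0600, 0x0603), (0x06DD, 0x06DD),
                      (0x070F, 0x070F), (0x2061, 0x2064)]


def is_al_candidate(ch):
    """Return True if ch would fall into AL by the fallback rule.

    Characters already claimed by another line breaking class must be
    filtered out by the caller before this test is applied.
    """
    cp = ord(ch)
    if any(first <= cp <= last for first, last in AL_EXPLICIT_RANGES):
        return True
    return unicodedata.category(ch) in AL_GENERAL_CATEGORIES
```

Under this test a Latin letter or a plus sign qualifies, while a decimal digit (General Category Nd) or a space (Zs) does not.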
Like SPACE, the characters in this class provide a break opportunity; unlike SPACE, they do not take part in determining indirect breaks. They can be subdivided into several categories.
Breaking spaces are the following subset of characters with General_Category Zs:
| 1680 | OGHAM SPACE MARK |
| 2000 | EN QUAD |
| 2001 | EM QUAD |
| 2002 | EN SPACE |
| 2003 | EM SPACE |
| 2004 | THREE-PER-EM SPACE |
| 2005 | FOUR-PER-EM SPACE |
| 2006 | SIX-PER-EM SPACE |
| 2008 | PUNCTUATION SPACE |
| 2009 | THIN SPACE |
| 200A | HAIR SPACE |
| 205F | MEDIUM MATHEMATICAL SPACE |
All of these space characters have a specific width, but otherwise behave as breaking spaces. In setting a justified line, none of these spaces normally changes in width, except for THIN SPACE when used in mathematical notation. See also the SP property.
The Ogham space mark may be rendered visibly between words but it is recommended that it be elided at the end of a line. For more information, see Section 5.7, Word Separator Characters.
See the ID property for U+3000 IDEOGRAPHIC SPACE. For a list of all space characters in the Unicode Standard, see Section 6.2, General Punctuation, in [Unicode5.0].
| 0009 | TAB |
Except for the effect of the location of the tab stops, the tab character acts similarly to a space for the purpose of line breaking.
| 00AD | SOFT HYPHEN (SHY) |
SHY marks the place where an optional line break may occur inside a word. It can be used with all scripts. SHY is rendered invisibly and has no width: it merely indicates an optional line break. The rendering of the optional line break depends on the script. For the Latin script, rendering the line break typically means displaying a hyphen at the end of the line; however, some languages require a change in spelling surrounding an optional line break. For examples, see Section 5.4, Use of Soft Hyphen.
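For the Latin-script convention described above, the rendering of a word broken at a SHY can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, a plain HYPHEN-MINUS stands in for the displayed hyphen, and languages that require a spelling change around the break are not handled.

```python
SHY = "\u00AD"  # SOFT HYPHEN


def render_unbroken(word):
    """Render a word with no line break: every SHY is invisible."""
    return word.replace(SHY, "")


def break_at_soft_hyphen(word, shy_index):
    """Split word at the SHY found at shy_index.

    Returns (end_of_line, start_of_next_line). A visible hyphen is shown
    at the end of the first part; all other SHYs stay invisible.
    """
    if word[shy_index] != SHY:
        raise ValueError("no SOFT HYPHEN at the given index")
    head = word[:shy_index].replace(SHY, "") + "-"
    tail = word[shy_index + 1:].replace(SHY, "")
    return head, tail
```

Given the word "hyphenation" with SHY inserted after "hy" and after "hyphen", breaking at the first SHY displays "hy-" at the end of the line and "phenation" at the start of the next.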
Breaking hyphens establish explicit break opportunities immediately after each occurrence.
| 058A | ARMENIAN HYPHEN |
| 2010 | HYPHEN |
| 2012 | FIGURE DASH |
| 2013 | EN DASH |
Hyphens are graphic characters with width. Because, unlike spaces, they are visible, they are included in the measured part of the preceding line, except where the layout style allows hyphens to hang into the margins. For additional information about how to format line breaks resulting from the presence of hyphens, see Section 5.3, Use of Hyphen.
The following are other forms of visible word dividers that provide break opportunities:
| 05BE | HEBREW PUNCTUATION MAQAF |
| 0F0B | TIBETAN MARK INTERSYLLABIC TSHEG |
| 1361 | ETHIOPIC WORDSPACE |
| 17D8 | KHMER SIGN BEYYAL |
| 17DA | KHMER SIGN KOOMUUT |
The Tibetan tsheg is a visible mark, but it functions effectively like a space to separate words (or other units) in Tibetan. It provides a break opportunity after itself. For additional information, see Section 5.6, Tibetan Line Breaking.
The Ethiopic word space is a visible word delimiter and is kept on the previous line. In contrast, U+1360 ETHIOPIC SECTION MARK is typically used in a sequence of several such marks on a separate line, and separated by spaces. As such lines are typically marked with separate hard line breaks (BK), the section mark is treated like an ordinary symbol and given line break class AL.
| 2027 | HYPHENATION POINT |
A hyphenation point is a raised dot, which is mainly used in dictionaries and similar works to visibly indicate syllabification of words. Syllable breaks frequently also are potential line break opportunities in the middle of words. When an actual line break falls inside a word containing hyphenation point characters, the hyphenation point is usually rendered as a regular hyphen at the end of the line.
| 007C | VERTICAL LINE |
In some dictionaries, a vertical bar is used instead of a hyphenation point. In this usage, U+0323 COMBINING DOT BELOW is used to mark stressed syllables, so all breaks are marked by the vertical bar. For an actual break opportunity, the vertical bar is rendered as a hyphen in such usage.
Historic texts, especially ancient ones, often do not use spaces, even for scripts where modern use of spaces is standard. Special punctuation was used to mark word boundaries in such texts. For modern text processing it is recommended to treat these as line break opportunities by default. WJ can be used to override this default, where necessary.
| 16EB | RUNIC SINGLE DOT PUNCTUATION |
| 16EC | RUNIC MULTIPLE DOT PUNCTUATION |
| 16ED | RUNIC CROSS PUNCTUATION |
| 2056 | THREE DOT PUNCTUATION |
| 2058 | FOUR DOT PUNCTUATION |
| 2059 | FIVE DOT PUNCTUATION |
| 205A | TWO DOT PUNCTUATION |
| 205B | FOUR DOT MARK |
| 205D | TRICOLON |
| 205E | VERTICAL FOUR DOTS |
| 2E19 | PALM BRANCH |
| 2E2A | TWO DOTS OVER ONE DOT PUNCTUATION |
| 2E2B | ONE DOT OVER TWO DOTS PUNCTUATION |
| 2E2C | SQUARED FOUR DOT PUNCTUATION |
| 2E2D | FIVE DOT MARK |
| 2E30 | RING POINT |
| 10100 | AEGEAN WORD SEPARATOR LINE |
| 10101 | AEGEAN WORD SEPARATOR DOT |
| 10102 | AEGEAN CHECK MARK |
| 1039F | UGARITIC WORD DIVIDER |
| 103D0 | OLD PERSIAN WORD DIVIDER |
| 1091F | PHOENICIAN WORD DIVIDER |
| 12470 | CUNEIFORM PUNCTUATION SIGN OLD ASSYRIAN WORD DIVIDER |
DEVANAGARI DANDA is similar to a full stop. The danda or historically related symbols are used with several other Indic scripts. Unlike a full stop, the danda is not used in number formatting. DEVANAGARI DOUBLE DANDA marks the end of a verse. It also has analogues in other scripts.
| 0964 | DEVANAGARI DANDA |
| 0965 | DEVANAGARI DOUBLE DANDA |
| 0E5A | THAI CHARACTER ANGKHANKHU |
| 0E5B | THAI CHARACTER KHOMUT |
| 104A | MYANMAR SIGN LITTLE SECTION |
| 104B | MYANMAR SIGN SECTION |
| 1735 | PHILIPPINE SINGLE PUNCTUATION |
| 1736 | PHILIPPINE DOUBLE PUNCTUATION |
| 17D4 | KHMER SIGN KHAN |
| 17D5 | KHMER SIGN BARIYOOSAN |
| 1B5E | BALINESE CARIK SIKI |
| 1B5F | BALINESE CARIK PAREREN |
| A8CE | SAURASHTRA DANDA |
| A8CF | SAURASHTRA DOUBLE DANDA |
| AA5D | CHAM PUNCTUATION DANDA |
| AA5E | CHAM PUNCTUATION DOUBLE DANDA |
| AA5F | CHAM PUNCTUATION TRIPLE DANDA |
| 10A56 | KHAROSHTHI PUNCTUATION DANDA |
| 10A57 | KHAROSHTHI PUNCTUATION DOUBLE DANDA |
| 0F34 | TIBETAN MARK BSDUS RTAGS |
| 0F7F | TIBETAN SIGN RNAM BCAD |
| 0F85 | TIBETAN MARK PALUTA |
| 0FBE | TIBETAN KU RU KHA |
| 0FBF | TIBETAN KU RU KHA BZHI MIG CAN |
| 0FD2 | TIBETAN MARK NYIS TSHEG |
For additional information, see Section 5.6, Tibetan Line Breaking.
Termination punctuation stays with the line, but otherwise allows a break after it. This is similar to EX, except that the latter may be separated by a space from the preceding word without allowing a break, whereas these marks are used without spaces.
| 1804 | MONGOLIAN COLON |
| 1805 | MONGOLIAN FOUR DOTS |
| 1B5A | BALINESE PANTI |
| 1B5B | BALINESE PAMADA |
| 1B5C | BALINESE WINDU |
| 1B5D | BALINESE CARIK PAMUNGKAH |
| 1B60 | BALINESE PAMENENG |
| 1C3B | LEPCHA PUNCTUATION TA-ROL |
| 1C3C | LEPCHA PUNCTUATION NYET THYOOM TA-ROL |
| 1C3D | LEPCHA PUNCTUATION CER-WA |
| 1C3E | LEPCHA PUNCTUATION TSHOOK CER-WA |
| 1C3F | LEPCHA PUNCTUATION TSHOOK |
| 1C7E | OL CHIKI PUNCTUATION MUCAAD |
| 1C7F | OL CHIKI PUNCTUATION DOUBLE MUCAAD |
| 2CFA | COPTIC OLD NUBIAN DIRECT QUESTION MARK |
| 2CFB | COPTIC OLD NUBIAN INDIRECT QUESTION MARK |
| 2CFC | COPTIC OLD NUBIAN VERSE DIVIDER |
| 2CFF | COPTIC MORPHOLOGICAL DIVIDER |
| 2E0E..2E15 | EDITORIAL CORONIS..UPWARDS ANCORA |
| 2E17 | OBLIQUE DOUBLE HYPHEN |
| A60D | VAI COMMA |
| A60F | VAI QUESTION MARK |
| A92E | KAYAH LI SIGN CWI |
| A92F | KAYAH LI SIGN SHYA |
| 10A50 | KHAROSHTHI PUNCTUATION DOT |
| 10A51 | KHAROSHTHI PUNCTUATION SMALL CIRCLE |
| 10A52 | KHAROSHTHI PUNCTUATION CIRCLE |
| 10A53 | KHAROSHTHI PUNCTUATION CRESCENT BAR |
| 10A54 | KHAROSHTHI PUNCTUATION MANGALAM |
| 10A55 | KHAROSHTHI PUNCTUATION LOTUS |
Characters of this line break class move to the next line at a line break and thus provide a line break opportunity before.
| 00B4 | ACUTE ACCENT |
| 1FFD | GREEK OXIA |
In some dictionaries, stressed syllables are indicated with a spacing acute accent instead of the hyphenation point. In this case the accent moves to the next line, and the preceding line ends with a hyphen. The oxia is canonically equivalent to the acute accent.
| 02DF | MODIFIER LETTER CROSS ACCENT |
A cross accent also appears in some dictionaries to mark the stress of the following syllable, and should be handled in the same way as the other stress marking characters in this section. The accent should not be separated from the syllable it marks by a break.
| 02C8 | MODIFIER LETTER VERTICAL LINE |
| 02CC | MODIFIER LETTER LOW VERTICAL LINE |
These characters are used in dictionaries to indicate stress and secondary stress when IPA is used. Both are prefixes to the stressed syllable in IPA. Breaking before them keeps them with the syllable.
Note: It is hard to find actual examples in most dictionaries because the pronunciation fields usually occur right after the headword, and the columns are wide enough to prevent line breaks in most pronunciations.
| 0F01 | TIBETAN MARK GTER YIG MGO TRUNCATED A |
| 0F02 | TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA |
| 0F03 | TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA |
| 0F04 | TIBETAN MARK INITIAL YIG MGO MDUN MA |
| 0F06 | TIBETAN MARK CARET YIG MGO PHUR SHAD MA |
| 0F07 | TIBETAN MARK YIG MGO TSHEG SHAD MA |
| 0F09 | TIBETAN MARK BSKUR YIG MGO |
| 0F0A | TIBETAN MARK BKA- SHOG YIG MGO |
| 0FD0 | TIBETAN MARK BSKA- SHOG GI MGO RGYAN |
| 0FD1 | TIBETAN MARK MNYAM YIG GI MGO RGYAN |
| 0FD3 | TIBETAN MARK INITIAL BRDA RNYING YIG MGO MDUN MA |
| A874 | PHAGS-PA SINGLE HEAD MARK |
| A875 | PHAGS-PA DOUBLE HEAD MARK |
Tibetan head letters allow a break before. For more information, see Section 5.6, Tibetan Line Breaking.
| 1806 | MONGOLIAN TODO SOFT HYPHEN |
Despite its name, this Mongolian character is not an invisible control like SOFT HYPHEN, but rather a visible character like a regular hyphen. Unlike the hyphen, MONGOLIAN TODO SOFT HYPHEN stays with the following line. Whenever optional line breaks are to be marked invisibly, SOFT HYPHEN should be used instead.
| 2014 | EM DASH |
The EM DASH is used to set off parenthetical text. Normally, it is used without spaces. However, this is language dependent. For example, in Swedish, spaces are used around the EM DASH. Line breaks can occur before and after an EM DASH. Because EM DASHes are sometimes used in pairs instead of a single quotation dash, the default behavior is not to break the line between two consecutive EM DASHes, even though not all fonts use connecting glyphs for the EM DASH.
Explicit breaks act independently of the surrounding characters. No characters can be added to the BK class as part of tailoring, but implementations are not required to support the VT character.
| 000C | FORM FEED (FF) |
| 000B | LINE TABULATION (VT) |
FORM FEED separates pages. The text on the new page starts at the beginning of the line. In some layout modes there may be no visible advance to a new “page”.
| 2028 | LINE SEPARATOR (LS) |
The text after the Line Separator starts at the beginning of the line. This is similar to HTML <BR>.
| 2029 | PARAGRAPH SEPARATOR (PS) |
The text of the new paragraph starts at the beginning of the line. This character defines a paragraph break, causing suitable formatting to be applied, for example, interparagraph spacing or first line indentation. LS, FF, VT as well as CR, LF and NL do not define a paragraph break.
Newline Functions are defined in the Unicode Standard as providing additional mandatory breaks. They are not individual characters, but are encoded as sequences of the control characters NEL, LF, and CR. If a character sequence for a Newline Function contains more than one character, it is kept together. The particular sequences that form an NLF depend on the implementation and other circumstances as described in Section 5.8, Newline Guidelines, of [Unicode5.0].
This specification defines the NLF implicitly. It defines the three character classes CR, LF, and NL. Their line break behavior, defined in rule LB5 in Section 6.1, Non-tailorable Line Breaking Rules, is to break after NL, LF, or CR, but not between CR and LF.
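The behavior just described for CR, LF, and NL can be illustrated with a minimal, non-normative sketch. The function name `mandatory_breaks` is hypothetical, and the sketch deliberately covers only the three characters named above; characters of class BK are left out.

```python
def mandatory_breaks(text: str) -> list[int]:
    """Return the offsets in text after which a mandatory break occurs.

    Only CR (U+000D), LF (U+000A), and NEL (U+0085) are handled;
    the BK class (FF, VT, LS, PS) is outside this sketch.
    """
    CR, LF, NEL = "\r", "\n", "\x85"
    breaks = []
    for i, ch in enumerate(text):
        if ch == CR:
            # Break after CR, but never between CR and a following LF.
            if i + 1 < len(text) and text[i + 1] == LF:
                continue
            breaks.append(i + 1)
        elif ch in (LF, NEL):
            breaks.append(i + 1)
    return breaks
```

For the input `"a\r\nb\rc"`, the sketch reports a break after the LF of the `<CR, LF>` pair and another after the lone CR, and none between the CR and LF.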
By default, there is a break opportunity both before and after any inline object. Object-specific line breaking behavior is implemented in the associated object itself, and where available can override the default to prevent either or both of the default break opportunities. Using U+FFFC OBJECT REPLACEMENT CHARACTER allows the object anchor to take a character position in the string.
| FFFC | OBJECT REPLACEMENT CHARACTER |
Object-specific line break behavior is best implemented by querying the object itself, not by replacing the CB line breaking class by another class.
The closing character of any set of paired punctuation should be kept with the preceding character, and the same applies to all forms of wide comma and full stop. This is desirable, even when there are intervening space characters, so as to prevent the appearance of a bare closing punctuation mark at the head of a line. The CL line break class contains the following characters plus any characters of General_Category Pe in the Unicode Character Database.
| 3001..3002 | IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP |
| FE11 | PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA |
| FE12 | PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP |
| FE50 | SMALL COMMA |
| FE52 | SMALL FULL STOP |
| FF0C | FULLWIDTH COMMA |
| FF0E | FULLWIDTH FULL STOP |
| FF61 | HALFWIDTH IDEOGRAPHIC FULL STOP |
| FF64 | HALFWIDTH IDEOGRAPHIC COMMA |
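The membership rule for CL given above (the listed characters plus General_Category Pe) can be sketched as follows. This is an illustration only: the names `CL_LISTED` and `is_closing_punctuation` are hypothetical, and Python's `unicodedata` module reflects the Unicode version bundled with the interpreter, so results for individual characters may differ from the data of this revision.

```python
import unicodedata

# Code points listed explicitly for CL in the table above.
CL_LISTED = {0x3001, 0x3002, 0xFE11, 0xFE12, 0xFE50, 0xFE52,
             0xFF0C, 0xFF0E, 0xFF61, 0xFF64}

def is_closing_punctuation(ch: str) -> bool:
    """True if ch is listed above or has General_Category Pe."""
    return ord(ch) in CL_LISTED or unicodedata.category(ch) == "Pe"
```

A production implementation would normally read the class from the line break property data file rather than derive it this way.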
Combining character sequences are treated as units for the purpose of line breaking. The line breaking behavior of the sequence is that of the base character.
The preferred base character for showing combining marks in isolation is U+00A0 NO-BREAK SPACE. If a line break before or after the combining sequence is desired, U+200B ZERO WIDTH SPACE can be used. The use of U+0020 SPACE as a base character is deprecated.
For most purposes, combining characters take on the properties of their base characters, and that is how the CM class is treated in rule LB9 of this specification. As a result, if the sequence <0021, 20E4> is used to represent a triangle enclosing an exclamation point, it is effectively treated as EX, the line break class of the exclamation mark. If U+2621 CAUTION SIGN had been used, which also looks like an exclamation point inside a triangle, it would have the line break class of AL. Only the latter corresponds to the line breaking behavior expected by users for this symbol. To avoid surprising behavior, always use a base character that is a symbol or letter (Line Break AL) when using enclosing combining marks (General_Category Me).
The CM line break class includes all combining characters with General_Category Mc, Me, and Mn, unless listed explicitly elsewhere. This includes viramas.
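The way a combining mark takes on the class of its base can be sketched as below. This is a simplified illustration, not the full rule: `BASE_CLASS` is a hypothetical three-entry table standing in for real property data, unknown characters are assumed to be AL, and the exceptions of the complete rule (for example, marks that follow spaces or mandatory breaks) are ignored.

```python
import unicodedata

# Hypothetical, deliberately tiny class table for this illustration only.
BASE_CLASS = {"!": "EX", "\u2621": "AL", "a": "AL"}

def resolve_classes(text: str) -> list[str]:
    """Give each combining mark (Mn, Mc, Me) the class of its base."""
    resolved = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        if is_mark and resolved:
            # Inherit the class already resolved for the preceding character.
            resolved.append(resolved[-1])
        else:
            # Assumption: anything not in the toy table is treated as AL.
            resolved.append(BASE_CLASS.get(ch, "AL"))
    return resolved
```

With this sketch, the sequence <0021, 20E4> resolves to EX for both code points, whereas U+2621 alone resolves to AL, which mirrors the contrast drawn in the text.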
Most control and formatting characters are ignored in line breaking and do not contribute to the line width. By giving them class CM, the line breaking behavior of the last preceding character that is not of class CM affects the line breaking behavior.
Note: When control codes and format characters are rendered visibly during editing, more graceful layout might be achieved by treating them as if they had the line break class of the visible symbols instead, that is AL or ID. Such visible modes do not violate the constraint on tailorability, because they are logically equivalent to having temporarily substituted symbol characters, such as the characters from the Control Pictures block, or in some cases, character sequences, for the actual control characters.
The CM line break class includes all characters of General_Category Cc and Cf, unless listed explicitly elsewhere.
| 000D | CARRIAGE RETURN (CR) |
A CR indicates a mandatory break after, unless followed by a LF. See also the discussion under BK.
Note: On some platforms the character sequence <CR, CR, LF> is used to indicate the location of actual line breaks, whereas <CR, LF> is treated like a hard line break. As soon as a user edits the text, the location of all the <CR, CR, LF> sequences may change as the new text breaks differently, while the relative position of any <CR, LF> to the surrounding text stays the same. This convention allows an editor to return a buffer and the client to tell which text is displayed on which line by counting the number of <CR, CR, LF> and <CR, LF> sequences. This convention is essentially equivalent to markup that captures the result of applying the line break algorithm, not a tailoring of the CR character. The <CR, CR, LF> sequences are thus not considered part of the plain text content.
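The convention described in this note can be illustrated with a short sketch. The function names are hypothetical, and the sketch assumes a buffer in which every <CR, CR, LF> is a recorded display break and every remaining <CR, LF> is a hard line break.

```python
import re

def split_display_lines(buffer: str) -> list[str]:
    """Split at both <CR, CR, LF> (display break) and <CR, LF> (hard break)."""
    return re.split(r"\r?\r\n", buffer)

def plain_text(buffer: str) -> str:
    """Drop the <CR, CR, LF> sequences, which are not plain text content."""
    return buffer.replace("\r\r\n", "")
```

Counting the entries returned by `split_display_lines` tells a client which text is shown on which line, while `plain_text` recovers the content with only the hard line breaks left in place.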
Characters in this line break class behave like closing characters, except in relation to postfix (PO) and non-starter characters (NS).
| 0021 | EXCLAMATION MARK |
| 003F | QUESTION MARK |
| 05C6 | HEBREW PUNCTUATION NUN HAFUKHA |
| 061B | ARABIC SEMICOLON |
| 061E | ARABIC TRIPLE DOT PUNCTUATION MARK |
| 061F | ARABIC QUESTION MARK |
| 06D4 | ARABIC FULL STOP |
| 07F9 | NKO EXCLAMATION MARK |
| 0F0D | TIBETAN MARK SHAD |
| 0F0E | TIBETAN MARK NYIS SHAD |
| 0F0F | TIBETAN MARK TSHEG SHAD |
| 0F10 | TIBETAN MARK NYIS TSHEG SHAD |
| 0F11 | TIBETAN MARK RIN CHEN SPUNGS SHAD |
| 0F14 | TIBETAN MARK GTER TSHEG |
| 1802 | MONGOLIAN COMMA |
| 1803 | MONGOLIAN FULL STOP |
| 1808 | MONGOLIAN MANCHU COMMA |
| 1809 | MONGOLIAN MANCHU FULL STOP |
| 1944 | LIMBU EXCLAMATION MARK |
| 1945 | LIMBU QUESTION MARK |
| 2762 | HEAVY EXCLAMATION MARK ORNAMENT |
| 2763 | HEAVY HEART EXCLAMATION MARK ORNAMENT |
| 2CF9 | COPTIC OLD NUBIAN FULL STOP |
| 2CFE | COPTIC FULL STOP |
| 2E2E | REVERSED QUESTION MARK |
| A60C | VAI SYLLABLE LENGTHENER |
| A60E | VAI FULL STOP |
| A876 | PHAGS-PA MARK SHAD |
| A877 | PHAGS-PA MARK DOUBLE SHAD |
| FE15 | PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK |
| FE16 | PRESENTATION FORM FOR VERTICAL QUESTION MARK |
| FE56..FE57 | SMALL QUESTION MARK..SMALL EXCLAMATION MARK |
| FF01 | FULLWIDTH EXCLAMATION MARK |
| FF1F | FULLWIDTH QUESTION MARK |
Non-breaking characters prohibit breaks on either side, but that prohibition can be overridden by SP or ZW. In particular, when NBSP follows SPACE, there is a break opportunity after the SPACE and NBSP will go as visible space onto the next line. See also WJ. The following lists the characters of line break class GL with additional description.
| 00A0 | NO-BREAK SPACE (NBSP) |
| 202F | NARROW NO-BREAK SPACE (NNBSP) |
| 180E | MONGOLIAN VOWEL SEPARATOR (MVS) |
NO-BREAK SPACE is the preferred character to use where two words are to be visually separated but kept on the same line, as in the case of a title and a name “Dr.<NBSP>Joseph Becker”. When SPACE follows NBSP, there is no break, because there never is a break in front of SPACE. NARROW NO-BREAK SPACE is used in Mongolian. The MONGOLIAN VOWEL SEPARATOR acts like a NNBSP in its line breaking behavior. It additionally affects the shaping of certain vowel characters as described in Section 13.2, Mongolian, of [Unicode5.0].
NARROW NO-BREAK SPACE (NNBSP) is a narrow version of NO-BREAK SPACE, which except for its display width behaves exactly the same in its line breaking behavior. It is regularly used in Mongolian in certain grammatical contexts (before a particle), where it also influences the shaping of the glyphs for the particle. In Mongolian text, the NNBSP is typically displayed with 1/3 the width of a normal space character.
When NARROW NO-BREAK SPACE occurs in French text, it should be interpreted as an “espace fine insécable”.
The MONGOLIAN VOWEL SEPARATOR is equivalent to a NNBSP in its line breaking behavior, but has different effects in controlling the shaping of its preceding and following characters. It constitutes a word-internal space and is typically displayed with half the width of a NNBSP.
| 034F | COMBINING GRAPHEME JOINER |
This character has no visible glyph and its presence indicates that adjoining characters are to be treated as a graphemic unit, therefore preventing line breaks between them. The use of grapheme joiner affects other processes, such as sorting; therefore, U+2060 WORD JOINER should be used if the intent is merely to prevent a line break.
| 2007 | FIGURE SPACE |
This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.
| 2011 | NON-BREAKING HYPHEN (NBHY) |
This is the preferred character to use where words need to be hyphenated but may not be broken at the hyphen. Because of this use as a substitute for ordinary hyphen, the appearance of this character should match that of U+2010 HYPHEN.
| 0F08 | TIBETAN MARK SBRUL SHAD |
| 0F0C | TIBETAN MARK DELIMITER TSHEG BSTAR |
| 0F12 | TIBETAN MARK RGYA GRAM SHAD |
The TSHEG BSTAR looks exactly like a Tibetan tsheg, but can be used to prevent a break like no-break space. It inhibits breaking on either side. For more information, see Section 5.6, Tibetan Line Breaking.
| 035C..0362 | COMBINING DOUBLE BREVE BELOW..COMBINING DOUBLE RIGHTWARDS ARROW BELOW |
These diacritics span two characters, so no word or line breaks are possible on either side.
This class includes all characters of Hangul Syllable Type LV.
Together with conjoining jamos, Hangul syllables form Korean Syllable Blocks, which are kept together; see [Boundaries]. Korean uses space-based line breaking in many styles of documents. To support these, Hangul syllables and conjoining jamo need to be tailored to use class AL. The default in this specification is class ID, which supports the case of Korean documents not using space-based line breaking. See Section 8.1, Types of Tailoring. See also JL, JT, JV, and H3.
This class includes all characters of Hangul Syllable Type LVT. See also JL, JT, JV, and H2.
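Because precomposed Hangul syllables are laid out arithmetically, the distinction between syllable types LV and LVT can be computed from the code point. The following is a non-normative sketch that assumes the standard Hangul syllable arithmetic (11,172 syllables starting at U+AC00, with 28 trailing-consonant positions per leading-consonant and vowel pair). The function name is hypothetical, and the labels H2 and H3 are the class names cross-referenced in the text above.

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable
S_COUNT = 11172   # number of precomposed Hangul syllables
T_COUNT = 28      # trailing-consonant positions per LV pair (including "none")

def hangul_syllable_class(ch: str):
    """Return "H2" for an LV syllable, "H3" for LVT, or None otherwise."""
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        return None
    # A syllable with no trailing consonant sits at a multiple of T_COUNT.
    return "H2" if index % T_COUNT == 0 else "H3"
```

An implementation that tailors Korean to space-based line breaking, as described above, would map both results to AL instead.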
| 002D | HYPHEN-MINUS |
Some additional context analysis is required to distinguish usage of this character as a hyphen from its usage as a minus sign (or indicator of numerical range). If used as a hyphen, it acts like U+2010 HYPHEN, which has line break class BA.
Note: Some typescript conventions use runs of HYPHEN-MINUS to stand in for longer dashes or horizontal rules. If actual character code conversion is not performed and it is desired to treat them like the characters or layout elements they stand for, line breaking needs to support these runs explicitly.
Note: This class includes characters other than Han ideographs.
Characters with this property do not require other characters to provide break opportunities; lines can ordinarily break before and after and between pairs of ideographic characters. The ID line break class consists of the following characters:
| 2E80..2FFF | CJK, KANGXI RADICALS, DESCRIPTION SYMBOLS |
| 3000 | IDEOGRAPHIC SPACE |
| 3040..309F | Hiragana (except small characters) |
| 30A0..30FF | Katakana (except small characters) |
| 3400..4DB5 | CJK UNIFIED IDEOGRAPHS EXTENSION A |
| 4E00..9FBB | CJK UNIFIED IDEOGRAPHS |
| F900..FAD9 | CJK COMPATIBILITY IDEOGRAPHS |
| A000..A48F | YI SYLLABLES |
| A490..A4CF | YI RADICALS |
| FE62..FE66 | SMALL PLUS SIGN to SMALL EQUALS SIGN |
| FF10..FF19 | WIDE DIGITS |
| 20000..2A6D6 | CJK UNIFIED IDEOGRAPHS EXTENSION B |
| 2F800..2FA1D | CJK COMPATIBILITY IDEOGRAPHS SUPPLEMENT |
It also includes all of the FULLWIDTH LATIN letters and all of the blocks in the range 3000..33FF not covered elsewhere.
Note: Use U+2060 WORD JOINER as a manual override to prevent break opportunities around characters of class ID.
U+3000 IDEOGRAPHIC SPACE may be subject to expansion or compression during line justification.
Korean is encoded with conjoining jamo, Hangul syllables, or both. See also JL, JT, JV, H2, and H3. The following set of compatibility jamo is treated as ID by default.
| 3130..318F | HANGUL COMPATIBILITY JAMO |
These characters are intended to be used in consecutive sequence. There is never a line break between two characters of this class.
| 2024 | ONE DOT LEADER |
| 2025 | TWO DOT LEADER |
| 2026 | HORIZONTAL ELLIPSIS |
| FE19 | PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS |
Horizontal ellipsis can be used as a three-dot leader.
Characters that usually occur inside a numerical expression may not be separated from the numeric characters that follow, unless a space character intervenes. For example, there is no break in “100.00” or “10,000”, nor in “12:59”.
| 002C | COMMA |
| 002E | FULL STOP |
| 003A | COLON |
| 003B | SEMICOLON |
| 037E | GREEK QUESTION MARK (canonically equivalent to 003B) |
| 0589 | ARMENIAN FULL STOP |
| 060C | ARABIC COMMA |
| 060D | ARABIC DATE SEPARATOR |
| 07F8 | NKO COMMA |
| 2044 | FRACTION SLASH |
| FE10 | PRESENTATION FORM FOR VERTICAL COMMA |
| FE13 | PRESENTATION FORM FOR VERTICAL COLON |
| FE14 | PRESENTATION FORM FOR VERTICAL SEMICOLON |
When not used in a numeric context, infix separators are sentence-ending punctuation. Therefore they always prevent breaks before.
Note: Figure Space, not being a punctuation mark, has been given the line break class GL.
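The two behaviors described for infix separators (no break before the separator, and no break between the separator and an immediately following numeric character) can be sketched as follows. The names `INFIX` and `infix_forbids_break` are hypothetical, and `str.isdecimal()` is used as a rough stand-in for "numeric character". A result of `False` means only that these two behaviors do not rule out the break; other rules of the algorithm may still prohibit it.

```python
# Infix separators as listed in the table above.
INFIX = set(",.:;\u037E\u0589\u060C\u060D\u07F8\u2044\uFE10\uFE13\uFE14")

def infix_forbids_break(text: str, i: int) -> bool:
    """True if the infix behaviors rule out a break before text[i]."""
    if not 0 < i < len(text):
        return False
    prev, cur = text[i - 1], text[i]
    if cur in INFIX:
        # Infix separators always prevent a break before them.
        return True
    # A separator is kept with a numeric character that directly follows it;
    # an intervening space means prev is the space, so this is False.
    return prev in INFIX and cur.isdecimal()
```

Applied to “100.00”, the sketch forbids a break both before and after the full stop; applied to “a, 5”, it leaves the position after the space open.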
The JL line break class consists of all characters of Hangul Syllable Type L.
Conjoining jamos form Korean Syllable Blocks, which are kept together; see [Boundaries]. Korean uses space-based line breaking in many styles of documents. To support these, Hangul syllables and conjoining jamo need to be tailored to use class AL. The default in this specification is class ID, which supports the case of Korean documents not using space-based line breaking. See Section 8.1, Types of Tailoring. See also JT, JV, H2, and H3.
The JT line break class consists of all characters of Hangul Syllable Type T. See also JL, JV, H2, and H3.
The JV line break class consists of all characters of Hangul Syllable Type V. See also JL, JT, H2, and H3.
| 000A | LINE FEED (LF) |
There is a mandatory break after any LF character, but see the discussion under BK.
| 0085 | NEXT LINE (NEL) |
The NL class acts like BK in all respects (there is a mandatory break after any NEL character). It cannot be tailored, but implementations are not required to support the NEL character; see the discussion under BK.
Nonstarter characters cannot start a line, but unlike CL they may allow a break in some contexts when they follow one or more space characters. Nonstarters include
| 17D6 | KHMER SIGN CAMNUC PII KUUH |
| 203C | DOUBLE EXCLAMATION MARK |
| 203D | INTERROBANG |
| 2047 | DOUBLE QUESTION MARK |
| 2048 | QUESTION EXCLAMATION MARK |
| 2049 | EXCLAMATION QUESTION MARK |
| 3005 | IDEOGRAPHIC ITERATION MARK |
| 301C | WAVE DASH |
| 303C | MASU MARK |
| 303B | VERTICAL IDEOGRAPHIC ITERATION MARK |
| 309B..309E | KATAKANA-HIRAGANA VOICED SOUND MARK..HIRAGANA VOICED ITERATION MARK |
| 30A0 | KATAKANA-HIRAGANA DOUBLE HYPHEN |
| 30FB..30FE | KATAKANA MIDDLE DOT..KATAKANA VOICED ITERATION MARK |
| A015 | YI SYLLABLE WU (misnomer for YI SYLLABLE ITERATION MARK) |
| FE54..FE55 | SMALL SEMICOLON..SMALL COLON |
| FF1A..FF1B | FULLWIDTH COLON..FULLWIDTH SEMICOLON |
| FF65 | HALFWIDTH KATAKANA MIDDLE DOT |
| FF70 | HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK |
| FF9E..FF9F | HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK |
plus all Hiragana, Katakana, and Halfwidth Katakana “small” characters.
Note: Optionally, the NS restriction may be relaxed and some or all characters treated like ID to achieve a more permissive style of line breaking, especially in some East Asian document styles.
For additional information about U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN, see Section 5.5, Use of Double Hyphen.
These characters behave like ordinary characters (AL) in the context of most characters but activate the prefix and postfix behavior of prefix and postfix characters.
Numeric characters consist of decimal digits (all characters of General_Category Nd), except those with East_Asian_Width F (Fullwidth), plus these characters:
| 066B | ARABIC DECIMAL SEPARATOR |
| 066C | ARABIC THOUSANDS SEPARATOR |
Unlike IS characters, the Arabic numeric punctuation does not occur as sentence terminal punctuation outside numbers.
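The definition of the numeric class given above translates directly into a property test. The following is a minimal sketch with a hypothetical function name; it relies on Python's `unicodedata` module, whose data reflects the Unicode version bundled with the interpreter rather than this revision.

```python
import unicodedata

# The two characters added to the class beyond the decimal digits.
ARABIC_NUMERIC_PUNCTUATION = {"\u066B", "\u066C"}

def is_numeric_class(ch: str) -> bool:
    """True for Nd digits that are not Fullwidth, plus U+066B and U+066C."""
    if ch in ARABIC_NUMERIC_PUNCTUATION:
        return True
    return (unicodedata.category(ch) == "Nd"
            and unicodedata.east_asian_width(ch) != "F")
```

Under this test, U+0037 DIGIT SEVEN and U+0663 ARABIC-INDIC DIGIT THREE qualify, while U+FF17 FULLWIDTH DIGIT SEVEN does not.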
The opening character of any set of paired punctuation should be kept with the following character. This is desirable, even when there are intervening space characters, so as to prevent the appearance of a bare opening punctuation mark at the end of a line. The OP line break class consists of all characters of General_Category Ps in the Unicode Character Database, plus
| 00A1 | INVERTED EXCLAMATION MARK |
| 00BF | INVERTED QUESTION MARK |
| 2E18 | INVERTED INTERROBANG |
Note: The first two of these characters used to be classed AI based on their East_Asian_Width assignment of A. Such characters are normally resolved to either ID or AL. However, the characters listed above are used as punctuation marks in Spanish, where they would behave more like a character of class OP.
Characters that usually follow a numerical expression may not be separated from preceding numeric characters or preceding closing characters, even if one or more space characters intervene. For example, there is no break opportunity in “(12.00) %”.
Some of these characters—in particular, degree sign and percent sign—can appear on both sides of a numeric expression. Therefore the line breaking algorithm by default does not break between PO and numbers or letters on either side.
The list of postfix characters is
| 0025 | PERCENT SIGN |
| 00A2 | CENT SIGN |
| 00B0 | DEGREE SIGN |
| 060B | AFGHANI SIGN |
| 066A | ARABIC PERCENT SIGN |
| 2030 | PER MILLE SIGN |
| 2031 | PER TEN THOUSAND SIGN |