Unicode Frequently Asked Questions

Line Breaking

Q: What is line breaking?

Computers need to have automated ways to determine where to break text into lines, so that text can automatically be wrapped into paragraphs. Note what happens if you change the width of this window in your browser. Parts of lines jump up or down into preceding and succeeding lines to keep the overall text within displayable margins. This happens as the result of an automatic process (an algorithm) that decides where lines should and should not break.
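
As a rough illustration of such an automatic process, the following Python sketch wraps text greedily, breaking only at spaces. It is a deliberately naive example and not the Unicode Line Breaking Algorithm; the function name wrap_greedy is hypothetical, and width is counted in code points rather than in rendered glyph widths.

```python
def wrap_greedy(text: str, width: int) -> list[str]:
    """Greedy wrap: break only at spaces, filling each line as far as it fits."""
    lines: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # Accept the word if it fits, or if the line is empty
        # (an overlong word gets a line of its own rather than being split).
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    for line in wrap_greedy("the quick brown fox jumps", 10):
        print(line)
```

Changing the width argument has the same effect as resizing the window: words move between lines, but breaks still occur only where the process allows them.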

Q: How does the Unicode Standard specify line breaking for Unicode text?

Unicode Standard Annex #14, Unicode Line Breaking Algorithm, specifies an algorithm for line breaking that is generalized to handle all Unicode characters. A related data file, LineBreak.txt in the Unicode Character Database, provides the character properties needed by that algorithm.
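
To give a feel for how such property data can be consumed, here is a minimal Python sketch that parses lines laid out as "code point or range ; class # comment" and looks up the line break class of a character. The SAMPLE text is an abbreviated, hand-written stand-in for the data file and not the file itself; the function names are hypothetical, and returning "XX" for every unlisted code point is a simplification of the real default values.

```python
# Abbreviated, hand-written sample in the data file's general layout.
SAMPLE = """\
# code point(s) ; line break class # comment
000A;LF # LINE FEED
0020;SP # SPACE
0041..005A;AL # LATIN CAPITAL LETTER A..Z
00A0;GL # NO-BREAK SPACE
200B;ZW # ZERO WIDTH SPACE
2060;WJ # WORD JOINER
"""


def parse_line_break_data(text: str) -> list[tuple[int, int, str]]:
    """Return (first, last, class) tuples, skipping comments and blank lines."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        code_points, cls = (field.strip() for field in line.split(";", 1))
        first, _, last = code_points.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        ranges.append((lo, hi, cls))
    return ranges


def line_break_class(ch: str, ranges: list[tuple[int, int, str]]) -> str:
    """Look up a character's class; "XX" (unknown) if it is not listed."""
    cp = ord(ch)
    for lo, hi, cls in ranges:
        if lo <= cp <= hi:
            return cls
    return "XX"


if __name__ == "__main__":
    table = parse_line_break_data(SAMPLE)
    print(line_break_class("A", table), line_break_class("\u00A0", table))
```

A real implementation would load the complete data file and use a faster lookup structure than a linear scan.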

Q: Does my line breaking implementation need to match UAX #14 to conform to Unicode?

There are many different ways to break lines of text, with different tradeoffs between speed and typographical quality. Different languages also have slightly different rules for breaking lines. The Unicode Standard does not restrict or interfere with the ways in which implementations can do this. Instead, the Unicode Line Breaking Algorithm defines a generic default that can be tailored to fit specific needs. [AF]

Q: Is the Unicode Line Breaking Algorithm suitable for all scripts and languages?

The algorithm is a carefully designed default that will work well in many ordinary situations. However, more complex tasks like hyphenation are outside its scope.

Line breaking of South East Asian scripts that don't use spaces to separate words requires an add-on module that uses dictionaries.
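
The idea behind a dictionary-based module can be sketched in Python as a greedy longest-match scan. This is only a toy under stated assumptions: the function name break_opportunities and the two-word Thai dictionary are hypothetical, and production implementations rely on far larger dictionaries and more robust matching than this.

```python
def break_opportunities(text: str, dictionary: set[str]) -> list[int]:
    """Return offsets where a line may break, using greedy longest match."""
    max_len = max(map(len, dictionary), default=1)
    offsets = []
    i = 0
    while i < len(text):
        match_len = 1  # fallback: a character not covered by any word stands alone
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                match_len = n
                break
        i += match_len
        if i < len(text):
            offsets.append(i)  # a break is allowed between two matched units
    return offsets


if __name__ == "__main__":
    toy_dictionary = {"ภาษา", "ไทย"}  # "language", "Thai"
    print(break_opportunities("ภาษาไทย", toy_dictionary))
```

The returned offsets could then be fed to a wrapping routine as the only positions where that run of text may be split.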

Also, some languages have additional requirements that may require tailoring. Finally, certain typesetting styles need specific tailoring or other adjustments, such as multi-line balancing of text, to fully match the conventions that users expect. [AF]

Q: What are the limits to tailoring?

Some Unicode characters are encoded solely or primarily for their line breaking behavior; their interpretation must be consistent with their semantics as defined by Unicode. The subset of the rules that describes this behavior is specified as non-tailorable. For more information, see Section 4 of UAX #14. [AF]