Authors: Mark Davis (firstname.lastname@example.org)
This document provides guidelines for handling the different characters used to represent new lines on different platforms, such as CR, LF, CRLF, and NEL.
This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex. It is a stable document and may be used as reference material or cited as a normative reference from another document.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, carrying the same version number, but is published as a separate document. Note that conformance to a version of the Unicode Standard includes conformance to its Unicode Standard Annexes.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).
Newlines are represented on different platforms by carriage return (CR), line feed (LF), CRLF, or next line (NEL). Unfortunately, not only are newlines represented by different characters on different platforms, they also have ambiguous behavior even on the same platform. Especially with the advent of the web, where text on a single machine can arise from many sources, this causes a significant problem.
Unfortunately, these characters are often transcoded directly into the corresponding Unicode codes when a character set is transcoded; this means that even programs handling pure Unicode have to deal with the problems. For information on handling newlines in regular expressions, see UTR #18: Unicode Regular Expression Guidelines [RegExp].
The following table provides hexadecimal values for the acronyms used in the text. The Unicode Standard does not formally assign control characters; instead, it provides the 65 code values for use as in the 7- and 8-bit standards. See The Unicode Standard, Version 2.0, Section 2.6 Controls and Control Sequences.
|Hex Values for Acronyms|
For clarity, when referring to the function that a particular character has, we will use lowercase (e.g., paragraph separator); when referring to the specific characters that represent those functions, we will use titlecase or an acronym (e.g., Paragraph Separator or PS).
The term NLF (new line function) stands for different characters depending on the platform; that is, any of CR, LF, CRLF, or NEL.
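As an illustration of this definition, the following Python sketch recognizes each NLF in a string. The names NLF_PATTERN and find_nlfs are hypothetical and chosen only for this example; the code point values are those assigned to CR, LF, and NEL in Unicode.

```python
import re

# Code points of the characters an NLF may consist of.
CR = "\u000D"   # carriage return
LF = "\u000A"   # line feed
NEL = "\u0085"  # next line

# CRLF is listed first so that the pair is matched as a single NLF
# rather than as a CR followed by a separate LF.
NLF_PATTERN = re.compile(f"{CR}{LF}|[{CR}{LF}{NEL}]")


def find_nlfs(text):
    """Return every NLF occurrence in text, in order of appearance."""
    return NLF_PATTERN.findall(text)
```

Ordering the alternatives this way is a design choice: it keeps a CRLF pair from being counted as two new lines.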
A paragraph separator is used to indicate a separation between paragraphs, while a line separator indicates where a line break alone should occur, typically within a paragraph. For example:
This is a paragraph with a line separator at this point,
causing the word "causing" to appear on a different line, but not causing the typical paragraph indentation, sentence-breaking, line spacing, or change in flush (right, center or left paragraphs).
For comparison, line separators basically correspond to HTML <BR>, and paragraph separators to older usage of HTML <P> (modern HTML delimits paragraphs by enclosing them in <P>...</P>). In word processors, paragraph separators are usually entered using a keyboard RETURN or ENTER; line separators are usually entered using a modified RETURN or ENTER, such as SHIFT-ENTER.
A record separator is used to separate records. For example, when exchanging tabular data, a common format is to tab-separate the cells, and use a CRLF at the end of a line of cells. This function is not precisely the same as line separation, but the same characters are often used.
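A minimal sketch of reading such tabular data, assuming that any NLF (not only CRLF) may terminate a record and that cells contain no embedded tabs or new lines. The function name parse_records is hypothetical.

```python
import re

# Any NLF; CRLF comes first so it is consumed as one separator.
NLF = re.compile("\r\n|[\r\n\u0085]")


def parse_records(data):
    """Split tab-separated text into records, each a list of cells."""
    records = NLF.split(data)
    # An NLF at the very end leaves one empty trailing piece; drop it
    # so that a terminated last record is not followed by a blank one.
    if records and records[-1] == "":
        records.pop()
    return [record.split("\t") for record in records]
```

For example, parse_records("a\tb\r\nc\td\r\n") yields two records of two cells each.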
Traditionally, NLF started out as a line separator (and sometimes record separator). It is still used as a line separator in simple text editors such as program editors. As platforms and programs started to handle word processing with automatic line-wrap, these characters were reinterpreted to stand for paragraph separators. For example, even such simple programs as the Windows Notepad program or the Mac SimpleText program interpret their platform's NLF as a paragraph separator, not a line separator.
Once NLF was reinterpreted to stand for a paragraph separator, in some cases some other control character was impressed into service as a line separator. For example, vertical tabulation (VT) is used in Microsoft Word. However, the choice of character for line separator is even less standardized than the choice of character for NLF.
Yet many internet protocols and a lot of existing text treat NLF as a line separator, so you cannot simply treat NLF as a paragraph separator in all circumstances.
The Unicode Standard defines two unambiguous separator characters, Paragraph Separator (PS = U+2029) and Line Separator (LS = U+2028). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous. Otherwise, the following specifies how to cope with an NLF when converting from other character sets to Unicode, when interpreting characters in text, and when converting from Unicode to other character sets.
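When the intended function of the NLFs in a piece of text is known, the replacement with PS or LS might be sketched as follows. This is an illustration under that assumption only; the function name and the "paragraph"/"line" labels are hypothetical.

```python
import re

LS = "\u2028"  # Line Separator
PS = "\u2029"  # Paragraph Separator

# Any NLF; CRLF comes first so the pair maps to a single separator.
NLF = re.compile("\r\n|[\r\n\u0085]")


def nlf_to_unicode_separator(text, function):
    """Replace each NLF with PS or LS, given its known function."""
    if function == "paragraph":
        return NLF.sub(PS, text)
    if function == "line":
        return NLF.sub(LS, text)
    # If the function is not known, the text is left for the caller
    # to handle; this sketch refuses to guess.
    raise ValueError("function must be 'paragraph' or 'line'")
```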
Note: Even if you know which characters represent NLF on your particular platform, on input and in interpretation, treat CR, LF, CRLF, and NEL the same. Only on output do you need to distinguish between them.
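The note above can be sketched as a pair of functions: one that accepts any NLF on input, and one that emits a single chosen NLF on output. Using LF as the internal form is an assumption made for this example, as are the function names.

```python
import re

# Any NLF; CRLF comes first so the pair becomes one internal new line.
NLF = re.compile("\r\n|[\r\n\u0085]")


def normalize_input(text):
    """On input, treat CR, LF, CRLF, and NEL alike (here: map to LF)."""
    return NLF.sub("\n", text)


def to_platform(text, platform_nlf):
    """On output, write the NLF that the target platform expects.

    Expects text already passed through normalize_input, so that LF
    is the only new line representation present.
    """
    return text.replace("\n", platform_nlf)
```

For example, to_platform(normalize_input(data), "\r\n") produces text that uses CRLF throughout, regardless of which NLFs appeared in data.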
FF is commonly used as a page separator, and it should be interpreted that way in text. When displaying on the screen, it causes the text after the separator to be forced to the next page. It should be independent of paragraph separation: a paragraph can start on one page and continue on the next page. Except when displaying on pages, in most parsing and in readline it is interpreted in the same way as an LS.
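A sketch of the two interpretations of FF described above: as a page separator when paginating, and as equivalent to LS when reading lines. Treating PS as a line boundary in read_lines is an assumption of this example, and both function names are hypothetical.

```python
import re

FF = "\u000C"  # form feed

# Boundaries for line-oriented reading: any NLF, plus FF, LS, and PS.
LINE_BREAK = re.compile("\r\n|[\r\n\u0085\u000C\u2028\u2029]")


def read_lines(text):
    """Split text into lines, treating FF the same way as LS."""
    return LINE_BREAK.split(text)


def split_pages(text):
    """Split text into pages at each FF, leaving other breaks alone."""
    return text.split(FF)
```

Python's built-in str.splitlines also treats FF as a line boundary, along with several other control characters, so it is broader than read_lines.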
[RegExp] Unicode Technical Report #18: Unicode Regular Expression Guidelines
The following summarizes modifications from the previous version of this document.
Copyright © 1998-2001 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.