Line Separator and Paragraph Separator

From: Jill Ramonsky (Jill.Ramonsky@aculab.com)
Date: Mon Oct 20 2003 - 08:37:50 CST


Are the LS and PS characters actually used in real plain-text documents?

I ask because plain text documents are created by text editors. The text
editor I happen to use is TextPad (there are hundreds of others, and
everyone has their favorite). It can save in UTF-8, among other
encodings, but it always saves documents with CRLF separating the
lines. (It's a Windows program.)

Going a little deeper, applications like this are often written in C or
C++. These languages have the convention that "\n" in a string literal
means "new line". Strictly speaking, BY DEFINITION (from the C and C++
specs), "\n" is supposed to mean LF, and nothing else, but programs
compiled on Windows will reinterpret "\n" in a string literal to mean
either LF only (when in memory) or CRLF (when encoded to or from a file
or stream opened in text mode). Yes, it's a kludge, but it obviously
works quite well. I suspect (but I don't know for sure) that the Mac
will interpret "\n" as CR only.
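To make the text-mode translation concrete, here is a minimal sketch in
C. The helper name bytes_on_disk and the file names are my own inventions
for illustration. It writes a string through fopen() in a caller-chosen
mode, then re-reads the file in binary mode to count what really landed
on disk.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes `text` to `path` using the given fopen()
   mode ("w" = text, "wb" = binary), reopens the file in binary mode,
   counts the bytes actually stored, deletes the file, and returns the
   count (or -1 on error). */
long bytes_on_disk(const char *path, const char *mode, const char *text)
{
    assert(path != NULL && mode != NULL && text != NULL);

    FILE *f = fopen(path, mode);
    if (f == NULL)
        return -1;
    if (fputs(text, f) == EOF) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)
        return -1;

    /* Binary mode here, so no translation hides the stored bytes. */
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    long count = 0;
    while (fgetc(f) != EOF)
        count++;
    fclose(f);
    remove(path);   /* tidy up the scratch file */
    return count;
}
```

With a Windows compiler I would expect bytes_on_disk("demo.txt", "w",
"a\n") to return 3 (a, CR, LF) and the same call with "wb" to return 2.
On Unix-like systems both calls should return 2, since text mode changes
nothing there.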

It would seem impossible (or at least, a violation of the C/C++ specs)
to reinterpret "\n" as LS in C/C++ ... but then again, that
specification has already been violated, so maybe the precedent is there
and that no longer matters.

Nonetheless, it would seem at least /slightly/ sensible to me that text
files encoded as UTF-8 should be using LS instead of CRLF. But this
appears to be difficult to achieve. There is no C/C++ escape sequence
which is defined to mean LS (unless you're prepared to write
"\xE2\x80\xA8" instead of "\n" all over the place), and what "\n"
generates is platform-dependent.
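For reference, LS is U+2028 and PS is U+2029, and in UTF-8 they come out
as E2 80 A8 and E2 80 A9 respectively. A rough sketch in C of how one
might check that follows. The macro names and the function
utf8_encode_bmp are hypothetical, not from any library, and the encoder
only covers code points up to U+FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical convenience macros for the two separators. */
#define LS_UTF8 "\xE2\x80\xA8"  /* U+2028 LINE SEPARATOR */
#define PS_UTF8 "\xE2\x80\xA9"  /* U+2029 PARAGRAPH SEPARATOR */

/* Encodes a code point in the range U+0000..U+FFFF as UTF-8 into out
   (room for 3 bytes) and returns the number of bytes written.
   Surrogates and anything above U+FFFF are rejected by returning 0. */
size_t utf8_encode_bmp(unsigned long cp, unsigned char out[3])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                         /* outside this sketch's range */

    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

One practical wrinkle: a hex escape in a C string literal swallows any
hex digits that follow it, so "\xA8abc" does not mean what it looks
like. Relying on adjacent-literal concatenation, as in
"line one" LS_UTF8 "line two", sidesteps that.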

We can't change C or C++, of course, but would it make sense for other
computer languages, in particular future computer languages, either to
redefine "\n" to mean LS (if the encoding is capable of representing
it)* or to introduce a new escape sequence, ("\l"?) to mean LS? (Of
course, if we introduced "\l" for LS, we could also introduce "\p" for PS).

Thoughts anyone?

Jill

*FOOTNOTE - this is actually quite difficult to achieve if you're
storing stuff internally as bytes. Windows knows whether or not to
convert LF->CRLF and vice versa by means of a parameter passed to
fopen(), but this parameter can only distinguish between "text" and
"binary", not between "latin-1 text" and "utf-8 text". Things get easier
if you store chars internally as Unicode chars, of course.
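Since fopen() cannot be told the encoding, one workaround is to open the
stream in binary mode and have the program choose the separator itself.
Here is a minimal sketch in C, assuming the caller knows which encoding
it is writing. The enum text_encoding and both function names are
invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical tag for the encoding the caller is writing in. */
enum text_encoding { ENC_LATIN1, ENC_UTF8 };

/* Bytes to place between lines. Latin-1 has no code for U+2028,
   so this sketch falls back to CRLF there. */
static const char *separator_for(enum text_encoding enc)
{
    return enc == ENC_UTF8 ? "\xE2\x80\xA8" : "\r\n";
}

/* Writes n lines to a stream opened in BINARY mode, inserting the
   separator between consecutive lines (not after the last one).
   Returns 0 on success, -1 on a write error. */
int write_lines(FILE *out, const char *const *lines, size_t n,
                enum text_encoding enc)
{
    assert(out != NULL && lines != NULL);

    const char *sep = separator_for(enc);
    size_t sep_len = strlen(sep);

    for (size_t i = 0; i < n; i++) {
        if (i > 0 && fwrite(sep, 1, sep_len, out) != sep_len)
            return -1;
        size_t len = strlen(lines[i]);
        if (fwrite(lines[i], 1, len, out) != len)
            return -1;
    }
    return 0;
}
```

The stream has to be opened with "wb" rather than "w"; otherwise a
Windows runtime would still expand any stray LF in the data behind the
program's back.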


