CRLF vs. LF (was Re: Unicode and end users)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Feb 21 2002 - 10:47:00 EST


David Hopwood wrote:
> Lars Kristan wrote:
> > Doug Ewell wrote:
> > > fine (as are LF->CRLF, stripped BOM's, and maybe even some edge
> > > cases like converting between tabs and spaces). If there are any
> > > security or spoofing concerns, it's best to leave everything
> > > completely untouched.
> >
> > I see this as a good reason for NOT using BOM in UTF-8 files. CRLF
> > is a major nuisance that many Windows programmers need to deal with.
> > It requires text vs. binary mode when opening the files, plus size
> > of the file does not match the number of characters written or read.
> > UNIX programs usually don't need to bother with all that.
>
> Text files in a known charset should always be opened in binary mode
> (that is, what the C stdio API refers to as binary mode). The sets of
> valid character sequences that must be accepted or generated for
> newline are defined by the file format, *not* by the platform. When
> designing a new file format, see UAX#13.

What you say makes sense, but is unfortunately not true in practice (I wish
it were). If we were discussing a *new* file format, it would be easy. Text
files, however, are an existing file format, or rather (as you suggest)
several file formats, each using a different convention for newlines.
If I switch to binary mode as you suggest, I will get two things:
A - When writing, no CR characters will be written (unless they were read
from a file). Many programs (like Notepad) will not display such files
correctly. It is a good question whether this is my problem or Notepad's.
B - When reading, I will get CR characters which are not handled anywhere in
my current code. Of course, this only happens when reading files that were
written 'the old way'.

Well, I do want to get rid of CRs, and at some point I would need to take
care of the reading side. I wish the run-time library would help me there.
As for output, as long as Notepad can't handle LF-only files, my hands are
tied. Users would kill me (although it is not really me that they should
'kill' ;).
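
To make the reading side concrete, here is roughly what I wish the run-time
library did for me: read the file in binary mode, then fold CR, LF, and CRLF
down to a bare LF. This is only a sketch under my own assumptions; the class
and method names are made up, and the charset is hard-coded to UTF-8.

import java.io.*;

// Hypothetical helper (all names illustrative): read a text file in
// binary mode and normalize CR, LF, and CRLF line ends to a bare LF.
public class NewlineNormalizer {
    public static String readNormalized(String path) throws IOException {
        FileInputStream in = new FileInputStream(path);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        in.close();
        String text = new String(buf.toByteArray(), "UTF-8");
        StringBuffer out = new StringBuffer(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                out.append('\n');
                // CRLF counts as one newline: skip the LF after a CR.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}

Reading everything through something like this, files written 'the old way'
and files written the new way would look the same to the rest of my code.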

Now UAX#13. I checked it (document date 2001-03-23) and found:
<quote>
4.3 Converting to other character code sets
1. If you know the intended target, map NLF, LS, and PS appropriately,
depending on the target conventions. For example, when mapping to Microsoft
Word's internal conventions for Windows documents you would map LS to VT,
and PS and any NLF to CRLF.
2. If you don't know the intended target, map NLF, LS, and PS to the
platform newline convention (CR, LF, CRLF, or NEL). In Java, for example,
this is done by mapping to a string nlf, defined as:
String nlf = System.getProperty("line.separator");
<end quote>
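
Rule 2 is easy enough to follow literally. A minimal sketch, assuming lines
are held internally with a bare LF (the class and method names are mine):

import java.io.*;

// Sketch of UAX#13 rule 2: when the target is unknown, emit the
// platform's newline convention after each line. Names illustrative.
public class PlatformNewlines {
    public static void writeLines(Writer w, String[] lines)
            throws IOException {
        String nlf = System.getProperty("line.separator");
        for (int i = 0; i < lines.length; i++) {
            w.write(lines[i]);
            w.write(nlf); // CRLF on Windows, LF on UNIX
        }
    }
}

(Java's own PrintWriter.println does essentially this internally.)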

Eh, now I see "platform convention". So which is it, file format or platform
convention?

My humble wish to get rid of CRLF-delimited files on DOS (well, Windows) is
really not a Unicode problem. Although, with all the expertise gathered
here, unicoders should be able to make a good suggestion about it. I began
by stating that CRLF vs. LF is a mess (even after decades, programs
obviously still have problems with it). And BOMs in UTF-8 files remind me of
that. Eventually they will be just noise, like the CRs in CRLF: Windows will
have it and UNIX will hate it.

Lars Kristan


