Re: PRODUCING and DESCRIBING UTF-8 with and without BOM

From: Doug Ewell (
Date: Tue Nov 05 2002 - 00:30:57 EST

    Joseph Boyle <Boyle at siebel dot com> wrote:

    > Newline problems are a good analogy. They still require bookkeeping of
    > different formats and attention in any new coding and cause new bugs,
    > even though the problem has been around for decades. Nobody is holding
    > their breath for any of the platforms to change their newline
    > convention to match the others or even update all their tools to deal
    > with the differences - bare LF still doesn't work in Notepad.

    Of the hundreds of little utility programs I've written over the past 10
    years or so, one of the ones I still use most often is FIXCRLF, which
    (as you might expect) converts files between different CR/LF
    conventions. I have to; most text files downloaded from the Internet
    use bare LF, but most DOS/Windows tools demand CRLF. It's a shame, but
    hardly a surprise, that the industry could never standardize on one or
    the other.
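
    For illustration only, the kind of conversion described above can be
    sketched in a few lines of Python. This is not the FIXCRLF program
    itself, whose source is not shown here; the function name
    convert_newlines and its newline parameter are hypothetical.

```python
def convert_newlines(data: bytes, newline: bytes = b"\r\n") -> bytes:
    """Rewrite CRLF, bare CR, and bare LF line ends to one convention.

    Hypothetical helper for illustration; not the FIXCRLF utility.
    """
    # Collapse every convention to LF first, so that an existing CRLF
    # is not expanded a second time when converting to CRLF.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    if newline == b"\n":
        return unified
    return unified.replace(b"\n", newline)


if __name__ == "__main__":
    mixed = b"unix\nwindows\r\nold mac\rend"
    print(convert_newlines(mixed))          # everything as CRLF
    print(convert_newlines(mixed, b"\n"))   # everything as LF
```

    Working on bytes rather than decoded text keeps the sketch independent
    of the file's encoding, as long as that encoding is ASCII-compatible
    (as UTF-8 is).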

    The invention of U+2028 LINE SEPARATOR was supposed to relieve us of all
    this misery -- but ironically, the success of UTF-8 has probably killed
    LS for good. Not only do people now expect Unicode text files to be
    backward-compatible with ASCII, which favors CR and/or LF instead of LS,
    but the single character LS requires more bytes in UTF-8 than the two
    characters CR and LF.
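
    The byte counts behind that last point can be checked directly. The
    short Python sketch below encodes U+2028 and the CR LF pair as UTF-8
    and compares their lengths; the variable names are only illustrative.

```python
# U+2028 LINE SEPARATOR versus the two-character sequence CR LF.
LS = "\u2028"
CRLF = "\r\n"

ls_bytes = LS.encode("utf-8")
crlf_bytes = CRLF.encode("utf-8")

# LS lies above U+07FF, so UTF-8 needs three bytes for it, while
# CR and LF are ASCII and take one byte each.
print(ls_bytes.hex(), len(ls_bytes))      # e280a8 3
print(crlf_bytes.hex(), len(crlf_bytes))  # 0d0a 2
```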

    -Doug Ewell
     Fullerton, California