Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 13:56:24 CST


    On 2005/01/19 19:18, Oliver Christ at oli@trados.com wrote:

    >
    >> UTF-8 BOM's seem pointless.
    >
    > On the very contrary. It's most helpful to determine a text file's
    > encoding. Without the UTF8 BOM it's hard to tell whether a file is
    > encoded in some ISO or whatever encoding/codepage or is already UTF8.
    > I'm grateful every day that .Net by default prefixes UTF8-encoded text
    > files with a UTF8 BOM, and IMO the UTF8 BOM should be part of the
    > standard or at least be generally applied best practice. It simplifies
    > at least part of the problem if you have to deal with thousands of files
    > (or char strings [such as file names ;-) ], for that matter) of which
    > you don't know the encoding.
    >
    > I agree that "byte order" is misleading in the case of UTF8 but in
    > practice it's a blessing.

    The problem is that platforms such as UNIX use methods other than file
    contents to determine file encodings, and there are other problems with
    the BOM as well; see <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
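
    For concreteness, a minimal sketch (in Python, chosen only for
    illustration) of what content-based detection via the UTF-8 signature
    amounts to: checking for the three bytes EF BB BF at the start of the data.
    The function names are hypothetical; `codecs.BOM_UTF8` is the standard
    library constant for those bytes.

```python
import codecs

def has_utf8_signature(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_signature(data: bytes) -> bytes:
    """Drop a leading UTF-8 signature, if present; otherwise return data unchanged."""
    if has_utf8_signature(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

    Note that the absence of the signature says nothing either way: a file
    without it may still be UTF-8, so such a check can only confirm, never
    rule out.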

    The use of a BOM that you describe is no longer a character encoding, but a
    file format. Local platforms might use it, but it should not be part of an
    encoding format such as UTF-8. I once suggested that Unicode include special
    escape characters, just so that file encodings could be indicated, but that
    suggestion was turned down as contrary to the Unicode spirit. In that case,
    one should not put such escape characters into the Unicode encodings in an
    even more specialized form, where they cause problems in various
    circumstances.
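
    One commonly cited circumstance, sketched here in Python under the
    assumption that two BOM-prefixed files are joined by a byte-oriented tool
    (as `cat a b` would do): the second signature ends up in the middle of the
    text as a stray U+FEFF. The variable names are illustrative only.

```python
import codecs

# Two "files", each written with a leading UTF-8 signature.
part_a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
part_b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-level concatenation knows nothing about signatures.
joined = part_a + part_b

# The "utf-8-sig" codec removes only a signature at the very start.
text = joined.decode("utf-8-sig")

# Count the U+FEFF characters left inside the decoded text.
stray = text.count("\ufeff")
```

    Here `stray` is 1: the signature of the second part survives as a
    character inside the text, which a tool that treats UTF-8 as plain 8-bit
    bytes has no reason to expect.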

    If BOMs are to be admitted, one should probably add special file-encoding
    characters to Unicode. These should then be more general, and used
    specifically for special file formats only, not as part of the encodings
    themselves. Developers of file formats could then use them as needed, and
    those merely concerned with applying 8-bit byte applications to Unicode via
    UTF-8 need not worry about these escape sequences.

      Hans Aberg


