Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 13:56:24 CST

Next message: Hans Aberg: "Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: 32'nd bit & UTF-8"
In reply to: Oliver Christ: "RE: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Subject: Re: 32'nd bit & UTF-8"

On 2005/01/19 19:18, Oliver Christ at oli@trados.com wrote:

>
>> UTF-8 BOM's seem pointless.
>
> On the very contrary. It's most helpful to determine a text file's
> encoding. Without the UTF8 BOM it's hard to tell whether a file is
> encoded in some ISO or whatever encoding/codepage or is already UTF8.
> I'm grateful every day that .Net by default prefixes UTF8-encoded text
> files with a UTF8 BOM, and IMO the UTF8 BOM should be part of the
> standard or at least be generally applied best practice. It simplifies
> at least part of the problem if you have to deal with thousands of files
> (or char strings [such as file names ;-) ], for that matter) of which
> you don't know the encoding.
>
> I agree that "byte order" is misleading in the case of UTF8 but in
> practice it's a blessing.

The problem is that platforms such as UNIX determine file encodings by
methods other than inspecting file contents, and the UTF-8 BOM causes other
problems as well; see <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

A BOM used the way you describe is no longer part of a character encoding,
but of a file format. Local platforms might use it, but it should not be a
part of an encoding format such as UTF-8. I once suggested that Unicode
include special escape characters, just so that file encodings could be
indicated, but that suggestion was turned down as contrary to the Unicode
spirit. One should then not put such escape characters into the Unicode
encodings in an even more specialized form, where they cause problems in
various circumstances.

If BOMs are to be admitted, one should probably add special file-encoding
characters to Unicode. These should then be more general, and used only in
special file formats, not as a part of the encodings themselves. Developers
of file formats could then use them as needed, and those merely concerned
with 8-bit byte applications of Unicode via UTF-8 need not worry about
these escape sequences.
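
As a rough illustration of that separation (a sketch only, in Python; the
names split_bom and read_text_payload are hypothetical and not taken from any
standard), a file-format layer could look for the three bytes EF BB BF and
strip them before handing the rest to a plain UTF-8 decoder, which then never
sees the signature:

```python
import codecs
from typing import Tuple

# The UTF-8 signature: U+FEFF encoded as the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def split_bom(data: bytes) -> Tuple[bool, bytes]:
    """File-format layer (hypothetical helper): report whether the data
    starts with the UTF-8 signature and return the bytes that follow it."""
    if data.startswith(UTF8_BOM):
        return True, data[len(UTF8_BOM):]
    return False, data


def read_text_payload(data: bytes) -> str:
    """Encoding layer (hypothetical helper): decode pure UTF-8 after the
    file-format layer has already removed any signature."""
    _had_bom, payload = split_bom(data)
    return payload.decode("utf-8")
```

Without such a layer, Python's plain "utf-8" codec keeps the signature as a
leading U+FEFF character in the decoded text; its separate "utf-8-sig" codec
is the variant that removes it.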

Hans Aberg


This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 13:58:16 CST