Re: UTF-8 in email

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Fri Oct 16 1998 - 09:58:36 EDT


Murray Sargent wrote on 1998-10-16 00:23 UTC:
> Donald Page wrote:
> > The above attachment should contain all of the Minimum European Subset
> > encoded as UTF-8. I created it for my own testing, but feel free to use
> > it.
> Donald's UTF-8 file should begin with a UTF-8 BOM in order to identify it as
> a UTF-8 encoded file. The starting bytes should be 0xEF 0xBB 0xBF.

No. The MIME attachment should just contain the header line

  Content-Type: text/plain; charset=UTF-8

as specified in RFC 2044, and then the receiving email client should
know how to activate the UTF-8 decoder and how to select an appropriate
font. Most email clients still need some work in this area before
this behaves the way it is supposed to.
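
As a rough sketch of what this looks like in practice, the following
uses Python's standard email package to write the charset into the
Content-Type header and to read it back on the receiving side. The
function names build_message and read_body and the sample subject are
made up for illustration; how a real client selects a font is outside
what this shows.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_message(text: str) -> bytes:
    """Serialize a text/plain message whose header declares UTF-8."""
    msg = EmailMessage()
    msg["Subject"] = "UTF-8 test"  # hypothetical sample subject
    # The charset goes into the Content-Type header, not into the body.
    msg.set_content(text, charset="utf-8")
    return msg.as_bytes()


def read_body(raw: bytes) -> str:
    """Decode the body using the charset named in the MIME header."""
    msg = message_from_bytes(raw, policy=policy.default)
    charset = msg.get_content_charset() or "us-ascii"
    # decode=True undoes the transfer encoding and returns raw bytes.
    payload = msg.get_payload(decode=True)
    return payload.decode(charset)
```

The receiver never inspects the first bytes of the body to guess the
encoding; the header alone tells it which decoder to activate.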

I do not like BOMs. The whole beauty of UTF-8 is that it is stateless,
and introducing byte-order-mark hacks destroys this. What happens to
BOMs in a cut-and-paste context? It just creates a mess.
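
One way to see the cut-and-paste problem, as an illustrative sketch
only: join two byte strings that each start with a UTF-8 BOM. The
sample strings and variable names are invented for this example.

```python
import codecs

# Two hypothetical "files", each prefixed with the UTF-8 BOM (EF BB BF).
a = codecs.BOM_UTF8 + "Grüße".encode("utf-8")
b = codecs.BOM_UTF8 + " aus Cambridge".encode("utf-8")

# Pasting them together leaves a second BOM in the middle of the text.
joined = (a + b).decode("utf-8")

# Python's "utf-8-sig" codec drops only a leading BOM, so the one in
# the middle survives as a stray U+FEFF character.
cleaned = (a + b).decode("utf-8-sig")
```

Plain UTF-8 without BOMs has no such problem: concatenating two valid
UTF-8 strings always yields the expected text.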

If you want to switch properly between different encodings, then use
established complete mechanisms like the MIME charset identifier or the
ISO 2022 ESC sequences. BOMs are just an ugly hack.

> These bytes are discarded when reading the file in and added when
> writing the file out.

I am not sure what exactly you mean, but I hope it is the following: If
you are working on an unfortunate platform that requires BOMs in all
UTF-8 files, then the email software on that platform should prefix the
BOM to a file whenever a MIME text/plain UTF-8 body part is saved into a
file. If a file starting with a BOM is attached to an email as a
text/plain file, then the BOM should be stripped off and the MIME
charset=UTF-8 header should be added.
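
The two directions described above could be sketched roughly as
follows. The function names and the platform_needs_bom flag are
hypothetical; this is not taken from any particular mail client.

```python
import codecs


def body_to_file_bytes(body: str, platform_needs_bom: bool) -> bytes:
    """Saving a text/plain UTF-8 body part to a local file."""
    data = body.encode("utf-8")
    # Prefix the BOM only on platforms that insist on it.
    return codecs.BOM_UTF8 + data if platform_needs_bom else data


def file_bytes_to_body(data: bytes) -> bytes:
    """Attaching a local file as a text/plain UTF-8 body part."""
    # Strip a leading BOM; the charset is declared in the MIME header.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

With this split the BOM stays a local file convention and never
travels inside the message body.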

Markus

-- 
Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
email: mkuhn at acm.org,  home page: <http://www.cl.cam.ac.uk/~mgk25/>


