RE: UTF-8 in email

From: Ienup Sung (ienup.sung@eng.sun.com)
Date: Fri Oct 16 1998 - 23:09:19 EDT


I would just like to point out that it is informative, not normative,
right? Also, the BOM can be used as a "zero width no-break space" character.

I believe an application shouldn't require a BOM in a UTF-8 file, and also
shouldn't build in such heuristics. Here are some reasons:

- EF BB BF (or in the other ordering form) doesn't mean that a particular
  file is a UTF-8 file, since in Asian codesets (and also in many single-byte
  codesets) EFBB and BFxx can be a pair of valid multibyte characters (or
  three single-byte characters). Therefore, if you hard-code such a narrowly
  scoped heuristic in your application without any override mechanism, your
  application is not going to be able to support many other codesets.
  (You have to provide a codeset selection mechanism one way or another
  anyway.)
- Also, unless you somehow already know that this mysterious file is some
  kind of Unicode file, it wouldn't make much sense to have such additional
  heuristics in your application. (Q: How often will you not know whether
  this Unicode file is UTF-16, UTF-8, or UCS-4? Isn't it easier, if
  possible, to just ask the sender what it really is? Or to just try opening
  it three times, once with each of them?)
- Since it can be a "zero width no-break space" character, you also need to
  give end users some kind of choice: treat it as a zero width no-break
  space character, or treat it as an indication that this is a UTF-8 file
  and ignore it (or both??).
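
To make the points above concrete, here is a minimal Python sketch. It is
only an illustration under my own assumptions: the helper names
decode_candidates and read_text are hypothetical, and EUC-KR and Latin-1 are
just two example codesets. It shows that the same bytes starting with
EF BB BF decode cleanly under several codesets, and that stripping the
leading U+FEFF can be left as an explicit choice for the caller.

```python
# Illustrative sketch only; the function names here are hypothetical.

# UTF-8 signature (EF BB BF) followed by U+4E2D encoded as UTF-8 (E4 B8 AD).
data = "\ufeff\u4e2d".encode("utf-8")


def decode_candidates(raw, codesets):
    """Return {codeset: text} for every codeset that decodes raw cleanly."""
    results = {}
    for name in codesets:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # this codeset rejects the byte sequence
    return results


def read_text(raw, codeset, strip_signature=False):
    """Decode raw with a codeset chosen by the caller, not guessed.

    A leading U+FEFF is dropped only when the caller asks for it;
    otherwise it stays in the text as a zero width no-break space.
    """
    text = raw.decode(codeset)
    if strip_signature and text.startswith("\ufeff"):
        text = text[1:]
    return text


# All three codesets accept the same bytes, so a leading EF BB BF
# does not by itself identify the file as UTF-8.
accepted = decode_candidates(data, ["utf-8", "latin-1", "euc_kr"])
for name, text in accepted.items():
    print(name, len(text), "characters")

kept = read_text(data, "utf-8")                            # U+FEFF kept
stripped = read_text(data, "utf-8", strip_signature=True)  # U+FEFF dropped
```

The six bytes read as two characters in UTF-8, six in Latin-1, and three
double-byte characters in EUC-KR, which is why the codeset has to come from
the user or from outside information rather than from the first three bytes.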

I just wanted to share the above with you since, to me, having the BOM in a
UTF-8 file does not always seem like such a good idea, but rather
problematic and/or unattractive.

With regards,

Ienup



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:42 EDT