I would just like to point out that it is informative rather than
normative, right? Also, the BOM can be used as a "zero width no-break
space" character. I believe applications shouldn't require a BOM in UTF-8
files, and shouldn't hard-code such heuristics either. Here are some reasons:
- EF BB BF (or in the other ordering form) doesn't mean that a particular
file is a UTF-8 file, since in Asian codesets (and in many single-byte
codesets) EFBB and BFxx can be a pair of valid multibyte characters (or
three single-byte characters). Therefore, if you hard-code such a narrowly
scoped heuristic in your application without any override mechanism, your
application will not be able to support many other codesets. (You have to
provide a codeset selection mechanism one way or another anyway.)
- Also, unless you somehow already know that this mysterious file is some
kind of Unicode file, it wouldn't make much sense to have such additional
heuristics in your application. (Q: How often will you not know whether
a Unicode file is UTF-16, UTF-8, or UCS-4? Isn't it easier, if possible,
to just ask the sender what it really is? Or to just try opening it three
times, once with each of them?)
- Since it can be a "zero width no-break space" character, you also need to
give end-users some kind of choice: whether to treat it as a zero width
no-break space character, or as a marker that indicates a UTF-8 file and
is then ignored (or both??).
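
To illustrate the points above, here is a minimal Python sketch. It is not
taken from any real application: the function names candidate_codecs and
read_text, the strip_bom parameter, and the list of codecs tried are all
made up for this example. It shows that the same three bytes decode
cleanly under more than one codeset, and how the codeset and the handling
of a leading U+FEFF could both be left to the caller instead of being
guessed.

```python
def candidate_codecs(data, names=("utf-8", "latin-1", "big5", "euc_jp")):
    """Return the codec names under which `data` decodes without error.

    The default list is an arbitrary, illustrative selection.
    """
    ok = []
    for name in names:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


def read_text(data, encoding, strip_bom=False):
    """Decode `data` with the encoding the caller selected (no guessing).

    A leading U+FEFF is dropped only when the caller asks for it;
    otherwise it stays in the text as ZERO WIDTH NO-BREAK SPACE.
    """
    text = data.decode(encoding)
    if strip_bom and text.startswith("\ufeff"):
        text = text[1:]
    return text


if __name__ == "__main__":
    sample = b"\xef\xbb\xbfA"

    # More than one codec accepts these bytes, so the prefix alone
    # cannot prove that the data is UTF-8.
    print(candidate_codecs(sample))

    # Read as ISO 8859-1, the prefix is three ordinary characters.
    print(repr(read_text(sample, "latin-1")))               # 'ï»¿A'

    # Read as UTF-8, the caller decides what U+FEFF means.
    print(repr(read_text(sample, "utf-8")))                 # '\ufeffA'
    print(repr(read_text(sample, "utf-8", strip_bom=True)))  # 'A'
```

Which of the multibyte codecs accept a given byte sequence depends on what
follows the prefix, so the sketch only reports what it finds instead of
assuming a particular answer.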
I just wanted to share the above with you since, to me, having the BOM in
UTF-8 files does not always seem like a good idea, but rather problematic
and/or unattractive.
With regards,
Ienup