Re: MS/Unix BOM FAQ again (small fix)

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Apr 10 2002 - 13:44:07 EDT


The reason for ICU's "UTF-16" converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates.
This converter name is (currently) only a convenience alias for "use the UTF-16 byte serialization that is normally used on this machine".
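
To illustrate what "the serialization normally used on this machine" could mean in practice, here is a minimal sketch in plain C. It does not show how ICU resolves the alias internally; the helper name platform_utf16_name() is hypothetical and simply probes the host byte order.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not an ICU function): returns the explicit
 * converter name that matches the byte order of the host machine.
 */
static const char *platform_utf16_name(void)
{
    const uint16_t probe = 0xFEFF;   /* U+FEFF as a native 16-bit unit */
    unsigned char first;

    /* Look at the first byte of the unit as it is laid out in memory. */
    memcpy(&first, &probe, 1);

    /* FE first means the high byte is stored first: big-endian. */
    return first == 0xFE ? "UTF-16BE" : "UTF-16LE";
}
```

On a little-endian machine this yields "UTF-16LE", on a big-endian one "UTF-16BE".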

As this discussion shows, whether an initial FF FE or FE FF is interpreted as BOM/signature or ZWNBSP or U+FFFE depends on the protocol and on what other information is available.

If a BOM can be expected, then the application should inspect the first few bytes with something like ICU's ucnv_detectUnicodeSignature().
This function in turn will provide a string "UTF-16BE" or "UTF-16LE" or "SCSU" or "UTF-8" or... and tell how many bytes to skip for the signature.
Then the application can instantiate a converter - not just one of the UTF-16*E but possibly a different one.
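
The flow described above (inspect the first bytes, get a charset name plus the number of bytes to skip, then open a matching converter) can be sketched roughly as follows. This is a self-contained stand-in in plain C, not ICU code: detect_signature() is a hypothetical function that only mimics the kind of result ucnv_detectUnicodeSignature() is described as giving, and its table covers just a handful of well-known signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a signature detector (not the ICU function).
 * Returns a charset name and stores the signature length in *sig_len,
 * or returns NULL and stores 0 if no known signature is found.
 */
static const char *detect_signature(const unsigned char *buf, size_t len,
                                    size_t *sig_len)
{
    /*
     * Longer signatures come first: FF FE 00 00 (UTF-32LE) starts with
     * FF FE (UTF-16LE). Note that FF FE 00 00 could also be a UTF-16LE
     * BOM followed by U+0000; this sketch reports it as UTF-32LE.
     */
    static const struct {
        const char *name;
        unsigned char bytes[4];
        size_t n;
    } sigs[] = {
        { "UTF-32BE", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
        { "UTF-32LE", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
        { "UTF-8",    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
        { "SCSU",     { 0x0E, 0xFE, 0xFF, 0x00 }, 3 },
        { "UTF-16BE", { 0xFE, 0xFF, 0x00, 0x00 }, 2 },
        { "UTF-16LE", { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
    };
    size_t i;

    assert(buf != NULL && sig_len != NULL);

    for (i = 0; i < sizeof sigs / sizeof sigs[0]; i++) {
        if (len >= sigs[i].n && memcmp(buf, sigs[i].bytes, sigs[i].n) == 0) {
            *sig_len = sigs[i].n;
            return sigs[i].name;
        }
    }

    *sig_len = 0;
    return NULL;   /* no signature: the application must decide otherwise */
}
```

An application would call this on the start of its input, skip sig_len bytes, and pass the returned name to whatever opens the converter. When NULL comes back, it falls back on the protocol or on other information, which is exactly the decision that belongs to the application.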

This has been the consensus for a while.
The implementation could be changed if the consensus in the ICU team changes.

markus


