UTF-7 signature

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Apr 11 2002 - 12:42:53 EDT


On 2002-apr-09, Shlomi Tal and Doug Ewell discussed on this list a UTF-7 signature byte sequence of +/v8- (which was news to me).
(Subject "MS/Unix BOM FAQ again (small fix)")

I "meditated" some over this -

+/v8 is the encoding of U+FEFF as the first code point in a text. So far, so good.
The '-' as the next byte switches UTF-7 back to direct-encoding of a subset of US-ASCII.

What if there is no '-' there? What if a non-ASCII code point immediately follows the U+FEFF?
In such a case, depending on the following code point, the first four bytes could be
   +/v8 or +/v9 or +/v+ or +/v/

The 4th byte will not be '8' if the following code point is >=U+4000.
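
The four cases can be reproduced with a small Python sketch. It assumes the usual description of a UTF-7 shifted run: '+' followed by the base64 encoding of the UTF-16BE code units, with the '=' padding dropped. The helper name shifted_prefix is made up for this illustration and is not part of any library.

```python
import base64

def shifted_prefix(following: int) -> bytes:
    """Return the first four bytes of a UTF-7 shifted run that encodes
    U+FEFF immediately followed by one more BMP code point.

    Hypothetical helper: builds the run by hand as '+' plus unpadded
    base64 of the UTF-16BE code units.
    """
    units = "\ufeff" + chr(following)
    b64 = base64.b64encode(units.encode("utf-16-be")).rstrip(b"=")
    return (b"+" + b64)[:4]

# One sample code point from each quarter of the BMP
# (top two bits 00, 01, 10, 11).
for cp in (0x00E9, 0x4E00, 0x8000, 0xC000):
    print(f"U+{cp:04X} -> {shifted_prefix(cp).decode('ascii')}")
```

The top two bits of the following code unit land in the low two bits of the 4th byte, which is why only '8', '9', '+' and '/' can appear there.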

This illustrates a property of UTF-7 that sets it further apart from most encodings than for example SCSU and BOCU-1:
In most Character Encoding Schemes, consecutive code units/points are encoded in _separate_, consecutive byte sequences.

In UTF-7, byte sequences overlap, and many bytes in the encoding (2 out of every 8, I think, since three 16-bit code units fill exactly eight base64 bytes) contain pieces of two adjacent code units.
This is more like what happens in Huffman codes.
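
The "2 out of 8" figure can be checked by counting bits. This is only a sketch of the arithmetic, assuming 16-bit code units packed back to back into 6-bit base64 bytes; straddling_bytes is a hypothetical name.

```python
UNIT_BITS = 16   # bits per UTF-16 code unit
BYTE_BITS = 6    # payload bits carried by one base64 byte

def straddling_bytes(n_bytes: int = 8) -> list:
    """Zero-based indexes of base64 bytes (not counting the leading '+')
    whose six bits span two adjacent code units."""
    result = []
    for i in range(n_bytes):
        first_bit = i * BYTE_BITS
        last_bit = first_bit + BYTE_BITS - 1
        # The byte straddles a boundary if its first and last bit
        # fall into different 16-bit code units.
        if first_bit // UNIT_BITS != last_bit // UNIT_BITS:
            result.append(i)
    return result

print(straddling_bytes())
```

Index 2 of the base64 run is the 4th byte of the stream once the leading '+' is counted, which is the byte that varies in the signature.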

As one conclusion, one cannot always remove the initial encoding of U+FEFF from a UTF-7 byte stream and start converting from the following byte offset. One must instead remove U+FEFF _from the output_.
This is also true for BOCU-1 because the initial U+FEFF is relevant for its state, although code points are encoded with non-overlapping byte sequences.
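
For UTF-7, the difference between the two approaches can be sketched in Python with the standard "utf-7" codec. Both function names are made up for this illustration, and the sample bytes are a hand-built run for U+FEFF followed by U+4E00. BOCU-1 is not covered here.

```python
BOM = "\ufeff"

def decode_utf7_strip_output(data: bytes) -> str:
    """Decode everything, then drop a leading U+FEFF from the result."""
    text = data.decode("utf-7")
    return text[1:] if text.startswith(BOM) else text

def decode_utf7_skip_bytes(data: bytes) -> str:
    """Naive approach: drop the four signature bytes, then decode the rest.
    Shown only to illustrate why this is unsafe for UTF-7."""
    if data[:3] == b"+/v" and data[3:4] in (b"8", b"9", b"+", b"/"):
        data = data[4:]
    return data.decode("utf-7")

sample = b"+/v9OAA-"   # U+FEFF and U+4E00 in a single shifted run
print(repr(decode_utf7_strip_output(sample)))
print(repr(decode_utf7_skip_bytes(sample)))
```

With this sample, stripping from the output yields the single character U+4E00, while skipping four bytes leaves the tail "OAA-" outside any shifted run, so it is read as plain ASCII.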

For SCSU and all UTFs it is equally safe to skip the signature bytes before decoding or the initial U+FEFF after decoding.
(The SCSU signature is defined to not change the initial converter state; it is one of several SCSU encodings of U+FEFF.)

For as long as we keep using the/an encoding of U+FEFF as the signature for each Unicode encoding, it is possible to remove U+FEFF from the output when a signature was detected as such.

Sorry for rambling; back to work...

markus


