On 2002-apr-09, Shlomi Tal and Doug Ewell discussed on this list a UTF-7 signature byte sequence of +/v8- (which was news to me).
(Subject "MS/Unix BOM FAQ again (small fix)")
I "meditated" some over this -
+/v8 is the encoding of U+FEFF as the first code point in a text. So far, so good.
The '-' as the next byte switches UTF-7 back to direct-encoding of a subset of US-ASCII.
What if there is no '-' there? What if a non-ASCII code point immediately follows the U+FEFF?
In such a case, depending on the following code point, the first four bytes could be
+/v8 or +/v9 or +/v+ or +/v/
The 4th byte packs the last four bits of U+FEFF (1111) together with the top two bits of the next UTF-16 code unit: it is '8' (111100) only if those bits are 00, i.e. if the following code point is below U+4000; otherwise it is '9' (U+4000..U+7FFF), '+' (U+8000..U+BFFF) or '/' (U+C000..U+FFFF).
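A quick way to see all four prefixes is Python's built-in utf-7 codec (a sketch; only the first four bytes matter here, the trailing bytes and '-' depend on the rest of the run):

```python
# Encode U+FEFF followed by a code point from each of the four
# ranges; check which 4th byte the overlap produces.
for cp, expect in [(0x0100, b"+/v8"),   # next code unit < U+4000
                   (0x4000, b"+/v9"),   # U+4000..U+7FFF
                   (0x8000, b"+/v+"),   # U+8000..U+BFFF
                   (0xC000, b"+/v/")]:  # U+C000..U+FFFF
    encoded = ("\ufeff" + chr(cp)).encode("utf-7")
    assert encoded[:4] == expect, (hex(cp), encoded)
```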
This illustrates a property of UTF-7 that sets it further apart from most encodings than for example SCSU and BOCU-1:
In most Character Encoding Schemes, consecutive code units/points are encoded in _separate_, consecutive byte sequences.
In UTF-7, byte sequences overlap: each group of three 16-bit code units packs into eight base64 bytes, and 2 of those 8 bytes contain bits from two adjacent code units.
This is more like a Huffman code.
As one conclusion, one cannot always remove the initial encoding of U+FEFF from a UTF-7 byte stream and start converting from the following byte offset. One must instead remove U+FEFF _from the output_.
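For illustration, a small Python sketch: in the stream below there is no byte offset at which decoding could correctly resume after the signature, so the U+FEFF has to be dropped from the decoded text instead:

```python
raw = ("\ufeff" + "\u4000").encode("utf-7")
# The 4th byte carries bits of both U+FEFF and U+4000, so the
# signature cannot be removed by skipping input bytes.
decoded = raw.decode("utf-7")
assert decoded[0] == "\ufeff"
text = decoded[1:]            # remove U+FEFF from the *output*
assert text == "\u4000"
```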
This is also true for BOCU-1 because the initial U+FEFF is relevant for its state, although code points are encoded with non-overlapping byte sequences.
For SCSU and all UTFs it is equally safe to skip the signature bytes before decoding or to remove the initial U+FEFF after decoding.
(The SCSU signature is defined to not change the initial converter state; it is one of several SCSU encodings of U+FEFF.)
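For UTF-8, for example, the two approaches give the same result (Python sketch):

```python
data = "\ufeffHello".encode("utf-8")              # signature is EF BB BF
skipped = data[3:].decode("utf-8")                # skip signature bytes, then decode
stripped = data.decode("utf-8").lstrip("\ufeff")  # decode, then drop U+FEFF
assert skipped == stripped == "Hello"
```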
For as long as we keep using the/an encoding of U+FEFF as the signature for each Unicode encoding, it is possible to remove U+FEFF from the output when a signature was detected as such.
Sorry for rambling; back to work...
markus
This archive was generated by hypermail 2.1.2 : Thu Apr 11 2002 - 11:59:42 EDT