UTF-7 signature

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Apr 11 2002 - 12:42:53 EDT

Previous message: jarkko.hietaniemi@nokia.com: "RE: MS/Unix BOM FAQ again (small fix)"
Next in thread: Shlomi Tal: "Re: UTF-7 signature"
Reply: Shlomi Tal: "Re: UTF-7 signature"
Reply: Doug Ewell: "Re: UTF-7 signature"

On 2002-apr-09, Shlomi Tal and Doug Ewell discussed on this list a UTF-7 signature byte sequence of +/v8- (which was news to me).
(Subject "MS/Unix BOM FAQ again (small fix)")

I "meditated" some over this -

+/v8 is the encoding of U+FEFF as the first code point in a text. So far, so good.
The '-' as the next byte switches UTF-7 back to direct-encoding of a subset of US-ASCII.

What if there is no '-' there? What if a non-ASCII code point immediately follows the U+FEFF?
In such a case, depending on the following code point, the first four bytes could be
+/v8 or +/v9 or +/v+ or +/v/

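The 4th byte will not be '8' if the following code point is >=U+4000, because that byte carries the low 4 bits of U+FEFF plus the top 2 bits of the next 16-bit code unit.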
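
The four variants can be reproduced with a small sketch. It assumes that the base64 run in UTF-7 is ordinary base64 over the UTF-16BE code units with the '=' padding dropped; the function name utf7_signature_prefix is made up for illustration.

```python
import base64

def utf7_signature_prefix(next_char: str = "") -> str:
    """Return the first four bytes (as text) of a UTF-7 stream that starts
    with U+FEFF, optionally followed by next_char in the same base64 run."""
    # UTF-7 base64 runs carry big-endian UTF-16 code units.
    utf16 = ("\ufeff" + next_char).encode("utf-16-be")
    # Standard base64, minus the '=' padding that UTF-7 never writes.
    b64 = base64.b64encode(utf16).decode("ascii").rstrip("=")
    # '+' opens the run; the next three characters cover U+FEFF
    # plus the top 2 bits of whatever follows.
    return "+" + b64[:3]

if __name__ == "__main__":
    for ch in ["", "\u00e9", "\u4e00", "\u8000", "\uc000"]:
        label = "U+%04X" % ord(ch) if ch else "(nothing)"
        print(label, utf7_signature_prefix(ch))
```

Only the top two bits of the following code unit matter, so the ranges below U+4000, U+4000..U+7FFF, U+8000..U+BFFF and U+C000 upward map to '8', '9', '+' and '/' respectively.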

This illustrates a property of UTF-7 that sets it further apart from most encodings than, for example, SCSU and BOCU-1 are:
In most Character Encoding Schemes, consecutive code units/points are encoded in _separate_, consecutive byte sequences.

In UTF-7, byte sequences overlap, and many bytes in the encoding (2 out of 8, I think) contain pieces of two adjacent code units.
This is more like Huffman codes.
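
The "2 out of 8" estimate can be checked by counting bits: three 16-bit code units fill exactly eight 6-bit base64 characters. The sketch below is only a bit-position count, not a codec, and the function name is hypothetical.

```python
BITS_PER_BASE64_CHAR = 6
BITS_PER_CODE_UNIT = 16

def straddling_positions(n_code_units=3):
    """Return the indexes of base64 characters whose 6 bits span
    two adjacent 16-bit code units in one UTF-7 base64 run."""
    total_bits = n_code_units * BITS_PER_CODE_UNIT
    # Number of base64 characters needed, rounding up.
    n_chars = -(-total_bits // BITS_PER_BASE64_CHAR)
    result = []
    for i in range(n_chars):
        first_bit = i * BITS_PER_BASE64_CHAR
        # The final character may contain padding bits past the data.
        last_bit = min(first_bit + BITS_PER_BASE64_CHAR - 1, total_bits - 1)
        if first_bit // BITS_PER_CODE_UNIT != last_bit // BITS_PER_CODE_UNIT:
            result.append(i)
    return result

if __name__ == "__main__":
    print(straddling_positions(3))
```

For three code units this gives the 3rd and 6th characters (indexes 2 and 5), i.e. 2 out of every 8, and the pattern repeats every 48 bits.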

As one conclusion, one cannot always remove the initial encoding of U+FEFF from a UTF-7 byte stream and start converting from the following byte offset. One must instead remove U+FEFF _from the output_.
This is also true for BOCU-1 because the initial U+FEFF is relevant for its state, although code points are encoded with non-overlapping byte sequences.
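
For the UTF-7 case, the difference between the two approaches can be seen with Python's built-in utf-7 codec. This is a sketch only: the sample bytes are U+FEFF followed by U+4E00 in one base64 run, and both function names are invented for the example.

```python
# U+FEFF immediately followed by U+4E00, with no '-' in between.
SAMPLE = b"+/v9OAA-"

def strip_bytes_then_decode(data: bytes) -> str:
    """Naive approach: drop the four signature bytes, decode the rest."""
    return data[4:].decode("utf-7")

def decode_then_strip(data: bytes) -> str:
    """Decode the whole stream, then drop a leading U+FEFF from the output."""
    text = data.decode("utf-7")
    return text[1:] if text.startswith("\ufeff") else text

if __name__ == "__main__":
    print(repr(strip_bytes_then_decode(SAMPLE)))
    print(repr(decode_then_strip(SAMPLE)))
```

The naive version loses both the '+' that opens the base64 run and the two bits of U+4E00 stored in the 4th byte, so the remaining bytes are read as plain ASCII instead of U+4E00.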

For SCSU and the other UTFs (UTF-8, UTF-16, UTF-32) it is equally safe to skip the signature bytes before decoding or the initial U+FEFF after decoding.
(The SCSU signature is defined to not change the initial converter state; it is one of several SCSU encodings of U+FEFF.)

For as long as we keep using the/an encoding of U+FEFF as the signature for each Unicode encoding, it is possible to remove U+FEFF from the output when a signature was detected as such.
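
A minimal sketch of that rule, using only codecs that ship with Python. The signature table is an assumption made for this example and is deliberately incomplete (no UTF-32, SCSU or BOCU-1); the function name and the default parameter are also made up. Note that plain ASCII text that happens to begin with "+/v8" would be misdetected as UTF-7.

```python
# (signature bytes, codec name); assumed table, not exhaustive.
SIGNATURES = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"+/v8", "utf-7"),
    (b"+/v9", "utf-7"),
    (b"+/v+", "utf-7"),
    (b"+/v/", "utf-7"),
]

def decode_with_signature(data: bytes, default: str = "utf-8") -> str:
    """Detect a signature, decode the WHOLE stream, then remove the
    leading U+FEFF from the output rather than from the input."""
    for signature, encoding in SIGNATURES:
        if data.startswith(signature):
            text = data.decode(encoding)
            # These codecs keep U+FEFF in the output, so drop it here.
            return text[1:] if text.startswith("\ufeff") else text
    # No signature found: fall back to the caller's default encoding.
    return data.decode(default)

if __name__ == "__main__":
    print(repr(decode_with_signature(b"+/v9OAA-")))
    print(repr(decode_with_signature(b"\xef\xbb\xbfhi")))
```

Removing U+FEFF after decoding works the same way for every entry in the table, so the UTF-7 case needs no special handling.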

Sorry for rambling; back to work...

markus


This archive was generated by hypermail 2.1.2 : Thu Apr 11 2002 - 11:59:42 EDT