Summary

There are some security and interoperability issues with PEP 383 (http://python.org/dev/peps/pep-0383/), as outlined below.
While it appears that the security issues may not be problematic in Python when PEP 383 is used as intended, we should present these issues publicly in some way so that people using similar systems are made aware of the problems.
Problem

There is a known problem with file systems that use a legacy charset. When you use a Unicode API to find the files in a directory, you typically get back a list of Unicode file names. You use those names to access the files through some other API. There are two possible problems:
A possible solution to this is to enable all charset converters to losslessly (reversibly) convert to Unicode. That is, any sequence of bytes can be converted by each charset converter to a Unicode string, and that Unicode string will be converted back to exactly that original sequence of bytes by that converter. This precludes, for example, the charset converter's mapping two different unmappable byte sequences to the same Unicode string.
PEP 383 Approach

The basic idea of PEP 383 is to be able to do this by converting all "unmappable" sequences to a sequence of one or more isolated low surrogate code points: that is, each code point's value is 0xDC00 plus the corresponding unmappable byte value (so bytes 0x80–0xFF become U+DC80–U+DCFF). With this mechanism, every maximal subsequence of bytes that can be reversibly mapped to Unicode by the charset converter is so mapped; any intervening subsequences are converted to a sequence of these surrogates. The result is a Unicode string, but it is not a well-formed UTF sequence.
For example, suppose that the byte 81 is illegal in charset n. When converted to Unicode, PEP 383 represents this as U+DC81. When mapped back to bytes (for charset n), that turns back into the byte 81. This allows any source byte sequence to be reversibly represented in a Unicode string, no matter what its contents. If this mechanism is applied to a charset converter that has no fallbacks from bytes to Unicode, then the charset converter becomes reversible (from bytes to Unicode to bytes).
Note that this only works when the Unicode string is converted back with the very same charset converter that was used to convert from bytes. For more information on PEP 383, see (http://python.org/dev/peps/pep-0383/).
Security

Unicode implementations have been subject to a number of security exploits (such as http://blogs.technet.com/srd/archive/2009/05/18/more-information-about-the-iis-authentication-bypass.aspx) centered around ill-formed encoding. Systems making incorrect use of a PEP 383 style mechanism are subject to such an attack.
Suppose the source byte stream is <A B X D>, and that according to the charset converter being used (n), X is an invalid byte. B2Un (the bytes-to-Unicode conversion for charset n) transforms the byte stream into Unicode as <G Y H>, where Y is an isolated surrogate. U2Bn (the reverse conversion) maps <G Y H> back to the correct original <A B X D>. That is the intended usage of PEP 383.
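This intended round trip can be observed directly in Python via the "surrogateescape" error handler that PEP 383 introduced (Python places each escaped byte at U+DC00 + byte value, a lone low surrogate):

```python
# PEP 383 round trip: bytes -> str -> bytes with the SAME codec.
# 0x81 is invalid in ASCII, so on decoding it is escaped as the
# isolated surrogate U+DC00 + 0x81 = U+DC81 (this is Y below).
data = b"AB\x81D"                       # <A B X D>, with X = invalid byte
text = data.decode("ascii", "surrogateescape")
assert text == "AB\udc81D"              # <G Y H> with an isolated surrogate

# Encoding with the same codec and handler restores the original bytes.
assert text.encode("ascii", "surrogateescape") == data
```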
The problem comes when that Unicode sequence is converted back to bytes by a different charset converter, m. Suppose that U2Bm maps Y to a valid byte representing "/", or to one of a number of other security-sensitive characters. That means that converting <G Y H> via U2Bm to bytes, and back to Unicode, results in <G / H>, where the "/" did not exist in the original.
This violates one of the cardinal security rules for transformations of Unicode strings: creating a character where no valid character previously existed. This was, for example, at the heart of the "non-shortest form" security exploits. A gatekeeper is watching for suspicious characters. It doesn't see Y as one of them, but past the gatekeeper, a conversion of U2Bm followed by B2Um results in a suspicious character where none previously existed.
The suggested solution for this is that a converter can only map an isolated surrogate Y onto a byte stream when the resulting byte would be an illegal byte. If not, then an exception would be thrown, or a replacement byte or byte sequence must be used instead (such as the SUB character). For details, see Safely Converting to Bytes, below. This replacement would be similar to what is used when trying to convert a Unicode character that cannot be represented in the target encoding. That preserves the ability to round-trip when the same encoding is used, but prevents security attacks. Note that simply not representing (deleting) Y in the output is not an option, since that is also open to security exploits.
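A minimal sketch of this rule, assuming Python-style DCxx escapes; the helper name safe_encode is hypothetical, and the single-byte legality probe is only sufficient for single-byte charsets (multi-byte charsets need the context-sensitive handling discussed under Safely Converting to Bytes):

```python
def safe_encode(text: str, charset: str) -> bytes:
    """Encode text, but let an escaped surrogate map back to a byte only
    if that byte is itself illegal in the target charset; otherwise raise.
    Hypothetical helper illustrating the rule, not a library API."""
    out = bytearray()
    for i, ch in enumerate(text):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:          # PEP 383 escaped byte
            b = cp - 0xDC00
            try:
                bytes([b]).decode(charset)  # is the byte legal here?
            except UnicodeDecodeError:
                out.append(b)               # illegal byte: safe to emit
            else:
                raise UnicodeEncodeError(
                    charset, text, i, i + 1,
                    "escaped byte would be a valid byte in target charset")
        else:
            out += ch.encode(charset)
    return bytes(out)
```

For example, an escaped 0x81 may be emitted for cp1252 (0x81 is unmapped there), but an escaped 0x9C may not, because 0x9C is a valid cp1252 byte.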
It appears that PEP 383, when used as intended in Python, is unlikely to present security problems, according to information from the author.
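One supporting observation that can be checked directly: Python's handler can only ever restore an escaped code point to a byte in the range 0x80–0xFF, so it cannot manufacture an ASCII-range character such as "/", and a strict re-encode of a lone surrogate fails outright:

```python
y = "\udc81"    # isolated surrogate produced by a surrogateescape decode

# A strict encode refuses the lone surrogate entirely.
raised = False
try:
    y.encode("utf-8")
except UnicodeEncodeError:
    raised = True
assert raised

# Even with surrogateescape, the restored byte is always >= 0x80,
# so no ASCII character like "/" (0x2F) can be created.
restored = y.encode("ascii", "surrogateescape")
assert restored == b"\x81" and restored[0] >= 0x80
```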
Interoperability

The choice of isolated surrogates (DCxx) as the way to represent the unconvertible bytes appears clever at first glance. However, it presents certain interoperability and security issues. Such isolated surrogates are not well formed. Although they can be represented in Unicode strings, they are not supported by conformant UTF-8, UTF-16, or UTF-32 converters or implementations. This may cause interoperability problems, since many systems replace incoming ill-formed Unicode sequences with replacement characters.
It may also cause security problems. Although strongly discouraged for security reasons, some implementations may delete the isolated surrogates, which can cause a security problem when two substrings that were previously separated become adjacent.
There are different alternatives:
Safely Converting to Bytes

The following describes how to safely convert a Unicode buffer U1 to a byte buffer B1 when the DCxx convention is used. It assumes that an exception is thrown if a DCxx code point is problematic. It can be enhanced to use substitution characters instead, if needed.
Brute Force

The simplest mechanism is brute force:
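One plausible reading of the brute-force check (an assumption, since the steps are not spelled out here): encode U1 with the escape convention, decode the result back with the same converter, and accept only if the round trip reproduces U1 exactly. A sketch, with the hypothetical name brute_force_encode:

```python
def brute_force_encode(u1: str, charset: str) -> bytes:
    """Brute-force safety check (assumed interpretation): the encoding
    is accepted only if decoding B1 with the same converter yields U1
    again; otherwise an escaped byte merged into a real character."""
    b1 = u1.encode(charset, "surrogateescape")
    u2 = b1.decode(charset, "surrogateescape")
    if u2 != u1:
        raise UnicodeEncodeError(charset, u1, 0, len(u1),
                                 "escaped byte does not round-trip safely")
    return b1
```

For instance, "A\udc81B" round-trips safely through ASCII, while "\udc81@" fails for Shift-JIS, because the bytes <81 40> decode as the single character U+3000 rather than the original pair.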
Optimized Approaches

There are a number of different ways to optimize this. Such approaches may vary depending on whether the converter is stateless or stateful.
Stateful

This is basically the same as Stateless, except for the conversion of B2 to U3, which has the following changes:
More Optimized Stateless

In building the B2Un conversion table, generate and store the following data, classifying each byte value xx:
Safe: xx is never part of any valid character in charset n.
Unsafe: xx is always part of some valid character in charset n.
Mixed: anything else; that is, safe or unsafe depending on context.
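For a single-byte charset, the Safe/Unsafe split can be computed by probing the converter directly; a minimal sketch (the function name classify_bytes is hypothetical, and Mixed never arises since every byte decodes in isolation or not at all):

```python
def classify_bytes(charset: str):
    """For a single-byte charset: a byte is Safe if the converter
    rejects it (an escape can map back to it without ambiguity),
    and Unsafe if the converter maps it to a character."""
    safe, unsafe = set(), set()
    for b in range(256):
        try:
            bytes([b]).decode(charset)
            unsafe.add(b)          # byte is part of a valid character
        except UnicodeDecodeError:
            safe.add(b)            # byte is never valid in this charset
    return safe, unsafe

safe, unsafe = classify_bytes("cp1252")
# cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D unmapped, so those are Safe,
# while ordinary letters like 0x41 ('A') are Unsafe.
assert 0x81 in safe and 0x41 in unsafe
```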
All single-byte charsets have only safe or unsafe bytes, so they are easy. The only ones that require more work are the mixed bytes, which occur only in multi-byte charsets. Take SJIS, for example. In the sequence <DC81 0030>, it is safe to map the DC81 back to 81, because <81 30> would map back to <DC81 0030>. But in the sequence <DC81 0040> it is not, because <81 40> would map back to <3000> (U+3000 IDEOGRAPHIC SPACE), not to the original sequence.
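This SJIS behavior can be checked against Python's shift_jis codec (Python escapes bytes as U+DC00 + byte, so 0x81 appears as U+DC81; the reversible case shown here uses a bare trailing lead byte, the simplest context in which 0x81 cannot merge into a character):

```python
# <81 40> is a valid SJIS pair: it decodes to U+3000 (IDEOGRAPHIC SPACE),
# so restoring an escaped 0x81 in front of 0x40 would merge into a
# different character on re-decoding.
assert b"\x81\x40".decode("shift_jis") == "\u3000"
assert "\u3000".encode("shift_jis") == b"\x81\x40"

# A lone 0x81 (a lead byte with nothing following) cannot decode, so it
# is escaped and survives the round trip under the same converter.
data = b"\x81"
text = data.decode("shift_jis", "surrogateescape")
assert text == "\udc81"
assert text.encode("shift_jis", "surrogateescape") == data
```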
Once this is done, the Stateless algorithm is modified to map only the Safe DCxx's, throw an exception on the Unsafe DCxx's, and use the plain Stateful process on any sequence of Mixed DCxx's.
More Optimized Stateful