Over-long Control Characters in UTF-8

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Sun Aug 01 1999 - 06:33:00 EDT


If I write a UTF-8 decoder for a terminal emulator, shall I accept and
execute control characters even if they are part of a UTF-8 sequence
that is longer than necessary?

The reason for the question is that a Unix plain-text editor normally
looks for the LF byte 0x0a to find out where a line ends, but LF =
U+000A = 0x0a = 0xc0 0x8a = 0xe0 0x80 0x8a = 0xf0 0x80 0x80 0x8a = ...
can be encoded in many ways legally under UTF-8. This would technically
force me to run everything through a UTF-8 decoder before I look at its
value, which is not what UTF-8 was intended for (a minimum amount of
required modifications to existing 8-bit software).
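
To make the problem concrete, here is a small C sketch (purely
illustrative, not xterm code) of a lenient decoder that skips the
shortest-form check, as Unicode 2.0 permits. All four byte strings
below come out as U+000A, yet only the first one contains the byte
0x0a that a line-oriented Unix tool actually scans for:

  #include <stdio.h>

  /* Deliberately lenient decoder: reassembles the value of one UTF-8
   * sequence without checking that the shortest encoding was used. */
  static unsigned long decode_lenient(const unsigned char *s, int len)
  {
      unsigned long c = s[0];
      int i, extra = 0;

      if      (c >= 0xfc) { c &= 0x01; extra = 5; }
      else if (c >= 0xf8) { c &= 0x03; extra = 4; }
      else if (c >= 0xf0) { c &= 0x07; extra = 3; }
      else if (c >= 0xe0) { c &= 0x0f; extra = 2; }
      else if (c >= 0xc0) { c &= 0x1f; extra = 1; }
      for (i = 1; i <= extra && i < len; i++)
          c = (c << 6) | (s[i] & 0x3f);
      return c;
  }

  int main(void)
  {
      const unsigned char forms[4][4] = {
          { 0x0a },                    /* the only legal encoding */
          { 0xc0, 0x8a },              /* over-long 2-byte form   */
          { 0xe0, 0x80, 0x8a },        /* over-long 3-byte form   */
          { 0xf0, 0x80, 0x80, 0x8a }   /* over-long 4-byte form   */
      };
      const int lens[4] = { 1, 2, 3, 4 };
      int i;

      for (i = 0; i < 4; i++)          /* prints U+000A four times */
          printf("form %d decodes to U+%04lX\n",
                 i, decode_lenient(forms[i], lens[i]));
      return 0;
  }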

In general, are there any requirements in the Unicode and ISO 10646-1
standards with regard to handling UTF-8 sequences that are longer than
necessary? I haven't found any.

ISO 10646-1 Annex R.7 <http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html>
does define the concept of a "malformed UTF-8 sequence", and XFree86
xterm does at the moment convert every received malformed UTF-8 sequence
into U+FFFD. Unfortunately, ISO 10646-1/Am.1 did not specify that a
UTF-8 sequence for which a shorter alternative also exists is a
malformed UTF-8 sequence as well. This would have simplified a number of
things greatly. For instance, a POSIX system can hardly be required to
accept a 3-byte encoding of "/" as a directory name separator, because
the OS kernel has hardwired this ASCII byte value, and we want to keep
the kernel locale-independent for many very good reasons. Allowing '\0'
and '/' to remain what they are was, after all, what FSS-UTF was
originally all about, but this was never thought through to the end by
forbidding '/' to be encoded as anything other than 0x2F in UTF-8.
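
As a small illustration of the kernel's perspective (nothing beyond the
byte values mentioned above is assumed): the 3-byte encoding of '/' is
0xe0 0x80 0xaf, which contains no 0x2f byte at all, so a byte-oriented
pathname parser never even sees a separator there:

  #include <stdio.h>
  #include <string.h>

  /* The kernel splits pathnames on the single byte 0x2f.  An over-long
   * 3-byte encoding of '/' (0xe0 0x80 0xaf) contains no such byte, so a
   * byte-oriented parser treats it as part of one file name component. */
  int main(void)
  {
      const char name[] = "usr\xe0\x80\xafbin";

      printf("contains the byte 0x2f: %s\n",
             strchr(name, 0x2f) ? "yes" : "no");   /* prints "no" */
      return 0;
  }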

The fact that Java abuses the 2-byte encoding of U+0000 (0xc0 0x80)
to get C string binary transparency for NUL has effectively established
the practice of using over-long UTF-8 sequences as a hack. :-(
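
For reference, a short sketch of what that hack buys (just an
illustration of the trick, not Java code): encoding an embedded U+0000
as the over-long pair 0xc0 0x80 keeps the encoded byte string free of
NUL bytes, so C string functions such as strlen() cover the whole
string instead of stopping at the embedded character:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* "a<NUL>b" in shortest-form UTF-8 vs. with U+0000 as 0xc0 0x80 */
      const char strict[]   = { 'a', 0x00, 'b', 0x00 };
      const char modified[] = { 'a', (char)0xc0, (char)0x80, 'b', 0x00 };

      printf("strlen(strict)   = %u\n", (unsigned) strlen(strict));    /* 1 */
      printf("strlen(modified) = %u\n", (unsigned) strlen(modified));  /* 4 */
      return 0;
  }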

RFC 2279 <ftp://ftp.funet.fi/mirrors/nic.nordu.net/rfc/rfc2279.txt>
says:

6. Security Considerations

   Implementors of UTF-8 need to consider the security aspects of how
   they handle illegal UTF-8 sequences. It is conceivable that in some
   circumstances an attacker would be able to exploit an incautious
   UTF-8 parser by sending it an octet sequence that is not permitted by
   the UTF-8 syntax.

   A particularly subtle form of this attack could be carried out
   against a parser which performs security-critical validity checks
   against the UTF-8 encoded form of its input, but interprets certain
   illegal octet sequences as characters. For example, a parser might
   prohibit the NUL character when encoded as the single-octet sequence
   00, but allow the illegal two-octet sequence C0 80 and interpret it
   as a NUL character. Another example might be a parser which
   prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
   illegal octet sequence 2F C0 AE 2E 2F.
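
The second example is easy to reproduce in a few lines of C (a
deliberately naive filter, purely for illustration): the byte-level
check for "/../" finds nothing, yet a decoder that accepts over-long
forms turns 0xc0 0xae back into U+002E ('.'):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* the illegal octet sequence 2F C0 AE 2E 2F from the RFC example */
      const unsigned char path[] = { 0x2f, 0xc0, 0xae, 0x2e, 0x2f, 0x00 };

      if (strstr((const char *) path, "/../") == NULL)
          printf("byte-level check: no \"/../\" found, path accepted\n");

      /* lenient decode of C0 AE: ((0xc0 & 0x1f) << 6) | (0xae & 0x3f) */
      printf("lenient decode of 0xc0 0xae: U+%04X ('.')\n",
             ((0xc0 & 0x1f) << 6) | (0xae & 0x3f));
      return 0;
  }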

All this leads me to the conclusion that it is probably a good idea to
extend Annex R.7 in ISO 10646-1:2000 to also declare over-long UTF-8
sequences as malformed and to require UTF-8 decoders to treat them like
other malformed sequences, e.g., signal them as a transmission error or
substitute U+FFFD for them, but under no circumstances treat them as the
corresponding UCS value. The Java NUL handling practice can be kept
legal by adding some of the usual blurbs ("except when prior agreements
between sending and receiving parties have specified something else
blabla").

Unicode 2.0 says on page A-8:

  When converting from UTF-8 to Unicode values, however, implementations
  do not need to check that the shortest encoding is being used, which
  simplifies the conversion algorithm.

I think this is a big mistake. Adding a check for whether the unique
shortest encoding has been used is trivial. Just check whether a UTF-8
sequence starts with any of the following illegal byte combinations (a
small C sketch of this check follows the list):

  1100000x
  11100000 100xxxxx
  11110000 1000xxxx
  11111000 10000xxx
  11111100 100000xx
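
A minimal sketch of that check in C (assuming the caller has already
verified that the trailing bytes are proper 10xxxxxx continuation
bytes): a multi-byte sequence is over-long exactly when its first two
bytes match one of the combinations above.

  /* Returns nonzero if a UTF-8 sequence starting with bytes b0, b1
   * encodes a value that a shorter sequence could also represent. */
  static int utf8_overlong(unsigned char b0, unsigned char b1)
  {
      return (b0 == 0xc0 || b0 == 0xc1)     /* 1100000x          */
          || (b0 == 0xe0 && b1 < 0xa0)      /* 11100000 100xxxxx */
          || (b0 == 0xf0 && b1 < 0x90)      /* 11110000 1000xxxx */
          || (b0 == 0xf8 && b1 < 0x88)      /* 11111000 10000xxx */
          || (b0 == 0xfc && b1 < 0x84);     /* 11111100 100000xx */
  }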

I think the "simplification" of the decoding algorithm given as a
rationale in the Unicode standard is somewhat naive, because the actual
simplification is really negligible, while the complications caused by
allowing over-long UTF-8 sequences can be very considerable for other
parts of the code and can become a serious problem especially for people
who want to add UTF-8 support in a robust and secure way to existing
8-bit applications. The problem is particularly severe with control
characters such as LF, but also some other ASCII characters that have a
special meaning for the processing application (e.g., '/').

Conforming UTF-8 decoders should in general not be allowed to decode
over-long UTF-8 sequences into a value that could also be represented
by a shorter sequence. Could we formally add this to ISO 10646-1:2000
and Unicode 3.0, please? Please!

If this is not possible for political reasons or for backwards
compatibility with the previous spec, then let's please at least add the
new concept of a "Secure UTF-8 decoder", which treats over-long
UTF-8 sequences like malformed sequences, and which guarantees that for
every sequence of UCS characters (that does not contain the REPLACEMENT
CHARACTER), there exists exactly one UTF-8 byte sequence that decodes to
it.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


