L2/01-311

From: Misha.Wolf@reuters.com
Sent: Wednesday, August 08, 2001 1:24 PM

W3C concerns about UTF-8 on agenda of next week's UTC

Harald, Patrik, Paul,

I have an action to draft some text on UTF-8 to send to you guys on
behalf of the W3C I18N WG.  Unfortunately, I've run out of time: the UTC
will be discussing this issue at its meeting next week, and we'd like to
draw your attention to it before that meeting takes place, in the hope
of gaining your support for our proposal.  Consequently, this mail is
from me, writing as W3C I18N WG Chair, rather than from the WG itself.

The following text from the W3C I18N WG is on the agenda of next week's
UTC meeting:

   The W3C I18N WG applauds the restrictions imposed, for security
   reasons, in TUS 3.1, on the interpretation of UTF-8 non-shortest form
   BMP characters.

   We urge the Unicode Consortium to impose the same restrictions, for
   the same reasons, on UTF-8 non-shortest form characters outside the
   BMP.  In other words, "irregular code unit sequences" in UTF-8 should
   become "illegal code unit sequences".

   Owing to the inclusion, in TUS 3.1, of many characters outside of the
   BMP, this has become very topical.  Any ambiguity in the
   interpretation of UTF-8 has the potential to allow serious security
   breaches.

Subsequently, the W3C I18N WG decided as follows:

   AGREED: The use of different definitions of UTF-8 by different groups
   working in the context of the Web/Internet is a serious problem.

   ACTION: Misha to draft a mail to Paul Hoffman, Patrik Fältström,
   Harald Alvestrand about the problem of different definitions of
   UTF-8.

Some background follows.

RFC 2279 (UTF-8, a transformation format of ISO 10646) provides an
informative definition of UTF-8, which excludes all non-standard forms.
It explicitly warns against such forms in:

6.  Security Considerations

   Implementors of UTF-8 need to consider the security aspects of how
   they handle illegal UTF-8 sequences.  It is conceivable that in some
   circumstances an attacker would be able to exploit an incautious
   UTF-8 parser by sending it an octet sequence that is not permitted by
   the UTF-8 syntax.

   [...]

For the normative definition of UTF-8, the RFC relies on:

   [ISO-10646]    ISO/IEC 10646-1:1993. International Standard --
                  Information technology -- Universal Multiple-Octet
                  Coded Character Set (UCS) -- Part 1: Architecture and
                  Basic Multilingual Plane.  Five amendments and a
                  technical corrigendum have been published up to now.
                  UTF-8 is described in Annex R, published as Amendment
                  2.  UTF-16 is described in Annex Q, published as
                  Amendment 1. 17 other amendments are currently at
                  various stages of standardization.

As ISO/IEC 10646-1:1993 has been replaced by the year 2000 version, it
would now be almost impossible for a developer to lay his/her hands on
Amendment 2 to ISO/IEC 10646-1:1993.  Consequently, developers are most
probably relying on the Unicode Standard for the definition of UTF-8.

For some years there were two differences between the IETF position
on UTF-8 and the Unicode position.  Both relate to non-standard forms of
UTF-8 and both have serious security implications.  They are:

1.  the use of non-shortest forms for characters within the BMP,

2.  the use of non-shortest forms for characters outside the BMP.

Unicode Standard 3.1 has, I'm very glad to say, banned case 1 above.  It
still, however, permits the processing (though not the production) of
case 2.  The Unicode Standard refers to these as "irregular code unit
sequences".
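
To make the two cases concrete, here is a minimal Python sketch of a
hand-written decoder.  It is an illustration only, not text from any of
the standards cited here.  The names decode_utf8, MIN_CODE_POINT and the
strict flag are hypothetical.  It also assumes that case 2 refers to a
supplementary character written as two three-byte sequences, one per
surrogate, instead of a single four-byte sequence.

```python
# Smallest code point that legitimately needs each sequence length.
MIN_CODE_POINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_utf8(data: bytes, strict: bool = True) -> str:
    """Decode UTF-8 by hand (illustrative sketch).

    strict=True rejects case 1 (non-shortest forms) and case 2
    (surrogates encoded as three-byte sequences).
    strict=False accepts both, as an incautious parser might.
    """
    code_points = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            length, cp = 1, b0
        elif 0xC0 <= b0 <= 0xDF:
            length, cp = 2, b0 & 0x1F
        elif 0xE0 <= b0 <= 0xEF:
            length, cp = 3, b0 & 0x0F
        elif 0xF0 <= b0 <= 0xF7:
            length, cp = 4, b0 & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")

        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"bad continuation bytes at offset {i}")
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)

        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        if strict and cp < MIN_CODE_POINT[length]:
            raise ValueError(f"non-shortest form at offset {i}")
        if strict and 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate code point at offset {i}")

        code_points.append(cp)
        i += length

    # Only reachable with surrogates when strict=False: join each
    # high/low surrogate pair into one supplementary code point.
    out = []
    j = 0
    while j < len(code_points):
        cp = code_points[j]
        if (0xD800 <= cp <= 0xDBFF and j + 1 < len(code_points)
                and 0xDC00 <= code_points[j + 1] <= 0xDFFF):
            low = code_points[j + 1]
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
            j += 1
        out.append(chr(cp))
        j += 1
    return "".join(out)


# Case 1: C0 AF is a two-byte, non-shortest form of U+002F ("/").
print(decode_utf8(b"\xc0\xaf", strict=False))  # prints "/"

# Case 2 (as assumed above): U+10000 written as two three-byte sequences.
irregular = b"\xed\xa0\x80\xed\xb0\x80"
shortest = b"\xf0\x90\x80\x80"
print(decode_utf8(irregular, strict=False) == decode_utf8(shortest))  # prints True
```

With strict=True, both the C0 AF input and the six-byte input raise
ValueError.  The lenient mode shows why the ambiguity matters: a filter
that scans the raw octets for 0x2F will not see the "/" that a lenient
decoder later produces from C0 AF.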

As you will see from the first W3C I18N WG quote above, we are asking
the UTC that:

   "irregular code unit sequences" in UTF-8 should become "illegal
   code unit sequences".

Your support for this change at the UTC (in person or otherwise) would
be very much appreciated.

Thanks,
Misha Wolf
W3C I18N WG Chair

