L2/01-311

From: Misha.Wolf@reuters.com
Sent: Wednesday, August 08, 2001 1:24 PM
Subject: W3C concerns about UTF-8 on agenda of next week's UTC

Harald, Patrik, Paul,

I have an action to draft some text to send to you guys on behalf of the W3C I18N WG regarding UTF-8. Unfortunately, I've run out of time, as the UTC will be discussing this issue at its meeting next week and we'd like to draw your attention to these issues before that meeting takes place, in the hope of gaining your support for our proposal. Consequently, this mail is from me, writing as W3C I18N WG Chair, rather than from the WG itself.

The following text from the W3C I18N WG is on the agenda of next week's UTC meeting:

   The W3C I18N WG applauds the restrictions imposed, for security reasons, in TUS 3.1, on the interpretation of UTF-8 non-shortest form BMP characters. We urge the Unicode Consortium to impose the same restrictions, for the same reasons, on UTF-8 non-shortest form characters outside the BMP. In other words, "irregular code unit sequences" in UTF-8 should become "illegal code unit sequences". Owing to the inclusion, in TUS 3.1, of many characters outside of the BMP, this has become very topical. Any ambiguity in the interpretation of UTF-8 has the potential to allow serious security breaches.

Subsequently, the W3C I18N WG decided as follows:

   AGREED: The use of different definitions of UTF-8 by different groups working in the context of the Web/Internet is a serious problem.

   ACTION: Misha to draft a mail to Paul Hoffman, Patrik Fältström, Harald Alvestrand about the problem of different definitions of UTF-8.

Some background follows.

RFC 2279 (UTF-8, a transformation format of ISO 10646) provides an informative definition of UTF-8, which excludes all non-standard forms. It explicitly warns against such forms in:

   6. Security Considerations

   Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax. [...]

For the normative definition of UTF-8, the RFC relies on:

   [ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. Five amendments and a technical corrigendum have been published up to now. UTF-8 is described in Annex R, published as Amendment 2. UTF-16 is described in Annex Q, published as Amendment 1. 17 other amendments are currently at various stages of standardization.

As ISO/IEC 10646-1:1993 has been replaced by the year 2000 version, it would now be almost impossible for a developer to lay his/her hands on Amendment 2 to ISO/IEC 10646-1:1993. Consequently, developers are most probably relying on the Unicode Standard for the definition of UTF-8.

For some years there were two differences between the IETF position on UTF-8 and the Unicode position. Both relate to non-standard forms of UTF-8 and both have serious security implications. They are:

   1. the use of non-shortest forms for characters within the BMP,
   2. the use of non-shortest forms for characters outside the BMP.

Unicode Standard 3.1 has, I'm very glad to say, banned case 1 above. It still, however, permits the processing (though not the production) of case 2. The Unicode Standard refers to these as "irregular code unit sequences". As you will see in the first quote from the W3C I18N WG, we are asking of the UTC that:

   "irregular code unit sequences" in UTF-8 should become "illegal code unit sequences".

Your support for this change at the UTC (in person or otherwise) would be very much appreciated.

Thanks,
Misha Wolf
W3C I18N WG Chair
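
Illustration (a minimal sketch, not taken from RFC 2279 or the Unicode Standard): the two cases listed in the mail can be made concrete with a strict decoder. The Python function below refuses non-shortest forms by comparing each decoded value with the smallest code point that actually needs a sequence of that length. The names decode_utf8_strict and rejects are hypothetical. The sketch assumes that the "irregular code unit sequences" of TUS 3.1 are the six-octet form in which each half of a surrogate pair is encoded as its own three-octet sequence, and it treats that form as illegal, which is the behaviour the mail asks for.

```python
def decode_utf8_strict(data: bytes) -> str:
    """Decode UTF-8, rejecting non-shortest forms and encoded surrogates."""
    # (lead-octet mask, lead-octet pattern, sequence length, smallest code point)
    forms = [
        (0x80, 0x00, 1, 0x0000),
        (0xE0, 0xC0, 2, 0x0080),
        (0xF0, 0xE0, 3, 0x0800),
        (0xF8, 0xF0, 4, 0x10000),
    ]
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, pattern, length, minimum in forms:
            if lead & mask == pattern:
                break
        else:
            raise ValueError(f"invalid lead octet at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any(b & 0xC0 != 0x80 for b in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        # Payload bits of the lead octet, then six bits per trailing octet.
        cp = lead & ~mask & 0xFF
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            # Cases such as C0 AF: a longer sequence than the value needs.
            raise ValueError(f"non-shortest form at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            # Assumption: this is the "irregular" six-octet surrogate form.
            raise ValueError(f"encoded surrogate at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        out.append(chr(cp))
        i += length
    return "".join(out)


def rejects(data: bytes) -> bool:
    """Return True if the strict decoder refuses the octet sequence."""
    try:
        decode_utf8_strict(data)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    print(rejects(b"\xc0\xaf"))                  # "/" in two octets: True
    print(rejects(b"\xed\xa0\x80\xed\xb0\x80"))  # U+10000 via surrogates: True
    print(rejects(b"\xf0\x90\x80\x80"))          # U+10000 in four octets: False
```

The security concern is visible in the first example: a filter that scans the raw octets for "/" (2F) never sees it in C0 AF, while a lenient decoder further down the line would still produce "/". Rejecting every form except the shortest one removes that ambiguity.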