To: UTC
From: Mark Davis
Date: 2000-11-12
Re: UTF-8 and "Non-Shortest Form" (R4)
The following is a proposal for a corrigendum to the Unicode Standard, tightening up the language for UTF-8 to close the "non-shortest form" issue. As a part of these actions, we may also want to add a conformance test file.
To address this issue, the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting the problematic non-shortest forms, and clarified some of the conformance clauses. These modifications make use of updated notation: see the Glossary for any unfamiliar terms. The UTF-8 program in http://www.unicode.org/Public/PROGRAMS/CVTUTF/ has been upgraded to reflect this corrigendum.
Add the following to the end of C12:
A conformant process shall not interpret illegal UTF code unit sequences as characters. Irregular UTF code unit sequences shall not be used for encoding any other information.
Add the following notes after C12:
- The definition of each UTF specifies the illegal code unit sequences in that UTF. For example, the definition of UTF-8 (D36) specifies that code unit sequences such as <C0, AF> are illegal.
- Internally, a particular function might be used that does not check for illegal code unit sequences. However, the conformant process can use that function only on data that has already been certified to not contain any illegal code unit sequences.
- Processes that require unique representation must not interpret irregular UTF code unit sequences as characters. They may, for example, reject or remove those sequences. Processes may transform irregular code unit sequences into the equivalent well-formed code value sequences.
- Conformant processes cannot interpret illegal code unit sequences. However, the conformance clauses do not, for example, prevent utility programs from operating on "mangled" text. For example, a UTF-8 file could have had CRLF sequences introduced at every 80 bytes by a bad mailer program. This could result in some UTF-8 byte sequences being interrupted by CRLFs, producing illegal byte sequences. This mangled text is no longer UTF-8. It is permissible for a conformant program to repair such text, recognizing that the mangled text was originally well-formed UTF-8 byte sequences. However, such repair of mangled data is a special case, and must not be used in circumstances where it would cause security problems.
Delete the second sentence in the note under D32:
For example, UTF-8 allows nonshortest code value sequences to be interpreted: a UTF-8 conformant process may map the code value sequence C0 80 (11000000₂ 10000000₂) to the Unicode value U+0000, even though a UTF-8 conformant process shall never generate that code value sequence -- it shall generate the sequence 00 (00000000₂) instead.
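As an illustration of the non-shortest forms discussed here, the following minimal sketch feeds <00>, <C0, 80>, and <C0, AF> to Python's built-in strict UTF-8 decoder. The decoder is used only as a convenient stand-in for a conformant process, and the helper name `interprets` is hypothetical rather than part of this proposal.

```python
def interprets(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # the default error handler is "strict"
        return True
    except UnicodeDecodeError:
        return False


print(interprets(b"\x00"))      # shortest form of U+0000
print(interprets(b"\xc0\x80"))  # non-shortest form of U+0000
print(interprets(b"\xc0\xaf"))  # non-shortest form of U+002F
```

The first call reports True and the other two report False: a decoder of this kind never maps <C0, 80> to U+0000, which is the behavior the revised C12 requires.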
Modify D36 as follows, and add a note:
D36. UTF-8 is the Unicode Transformation Format that serializes a Unicode code point as a sequence of one to four bytes, as specified in Table 3.1. Any UTF-8 byte sequences are illegal unless they match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences. An irregular code unit sequence in UTF-8 is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-8 sequences shall not be generated by a conformant process.
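To make the six-byte irregular form concrete, here is a rough sketch of how a process might transform it into the equivalent four-byte sequence, as the notes after C12 permit. The function names `decode3` and `repair_irregular` are hypothetical, and the code assumes its input is exactly one irregular sequence; it is not a general-purpose decoder.

```python
def decode3(b: bytes) -> int:
    """Decode one three-byte UTF-8 sequence to its code point."""
    if len(b) != 3 or (b[0] & 0xF0) != 0xE0:
        raise ValueError("not a three-byte UTF-8 sequence")
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def repair_irregular(six: bytes) -> bytes:
    """Map a six-byte surrogate-pair sequence to the four-byte form."""
    if len(six) != 6:
        raise ValueError("expected exactly six bytes")
    hi, lo = decode3(six[:3]), decode3(six[3:])
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    # Standard surrogate-pair arithmetic gives the supplementary code point.
    cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    return chr(cp).encode("utf-8")


irregular = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])  # D800 then DC00
print(repair_irregular(irregular).hex())
```

For the input shown, the result is the four-byte sequence <F0, 90, 80, 80>, the regular encoding of U+10000.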
Delete the two text paragraphs after Table 3.1. The relevant portions have been elevated into definitions or conformance clauses.
When converting a Unicode scalar value to UTF-8, the shortest form that can represent those values shall be used. This practice preserves uniqueness of encoding. For example, the Unicode binary value <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>. The latter is an example of an irregular UTF-8 byte sequence. Irregular UTF-8 sequences shall not be used for encoding any other information.
When converting from UTF-8 to a Unicode scalar value, implementations do not need to check that the shortest encoding is being used. This simplifies the conversion algorithm.
Replace them by the following table and text:
Table 3.1B. Legal UTF-8 Byte Sequences
Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|---|
U+0000..U+007F | 00..7F | | | |
U+0080..U+07FF | C2..DF | 80..BF | | |
U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
U+1000..U+FFFF | E1..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
Table 3.1B lists all of the byte sequences that are legal in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is legal in that position. Any byte value outside of the ranges listed is illegal. For example, the byte sequence <C0, AF> is illegal since C0 is not legal in the 1st Byte column. The byte sequence <E0, 9F, 80> is illegal since in the row where E0 is legal as a first byte, 9F is not legal as a second byte. The byte sequence <F4, 80, 83, 92> is legal, since every byte in that sequence matches a byte range in a row of the table (the last row).
- Cases where a trailing byte range differs from the usual 80..BF (A0..BF after E0, 90..BF after F0, and 80..8F after F4) are underlined in the table to call them to the reader's attention. These occur only in the second byte of a sequence.
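The table lends itself to a direct, table-driven check. The sketch below transcribes the seven rows of Table 3.1B into Python and tests a byte string against them; the names `TABLE_3_1B`, `match_row`, and `is_legal_utf8` are hypothetical, and this is one possible reading of the table rather than the code in the CVTUTF program.

```python
# One entry per table row; each row holds an inclusive (low, high)
# range for every byte position in the sequence.
TABLE_3_1B = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), (0x80, 0xBF)],
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
]


def match_row(data: bytes, pos: int) -> int:
    """Return the length of the legal sequence starting at pos, or 0."""
    for row in TABLE_3_1B:
        if pos + len(row) > len(data):
            continue  # not enough bytes left for this row
        if all(lo <= data[pos + i] <= hi for i, (lo, hi) in enumerate(row)):
            return len(row)
    return 0


def is_legal_utf8(data: bytes) -> bool:
    """True if data is a concatenation of sequences matching the table."""
    pos = 0
    while pos < len(data):
        n = match_row(data, pos)
        if n == 0:
            return False
        pos += n
    return True


print(is_legal_utf8(bytes([0xC0, 0xAF])))              # first byte not in table
print(is_legal_utf8(bytes([0xE0, 0x9F, 0x80])))        # second byte out of range
print(is_legal_utf8(bytes([0xF4, 0x80, 0x83, 0x92])))  # matches the last row
```

The three calls reproduce the examples in the text: the first two sequences are reported as illegal and the third as legal. Because the first-byte ranges of the rows do not overlap, at most one row can match at any position.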
Add the following to the end of DXX for UTF-32:
An irregular byte sequence in UTF-32 is an eight-byte sequence where the first four bytes correspond to a high surrogate, and the next four bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-32 sequences shall not be generated by a conformant process.
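As a rough illustration of the UTF-32 case, the sketch below scans a buffer of 32-bit code units for a high surrogate immediately followed by a low surrogate. The function name `find_irregular_utf32` is hypothetical, and big-endian byte order is an assumption made for the example, not something this proposal specifies.

```python
import struct


def find_irregular_utf32(data: bytes, byteorder: str = ">") -> list:
    """Return byte offsets of eight-byte high+low surrogate sequences."""
    if len(data) % 4:
        raise ValueError("UTF-32 data must be a multiple of four bytes")
    units = struct.unpack(f"{byteorder}{len(data) // 4}I", data)
    return [
        4 * i
        for i in range(len(units) - 1)
        if 0xD800 <= units[i] <= 0xDBFF and 0xDC00 <= units[i + 1] <= 0xDFFF
    ]


irregular = struct.pack(">II", 0xD800, 0xDC00)  # eight bytes, surrogate pair
regular = struct.pack(">I", 0x10000)            # four bytes, same character
print(find_irregular_utf32(irregular))
print(find_irregular_utf32(regular))
```

The first buffer yields one match at offset 0, while the regular four-byte form of U+10000 yields none.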