L2/00-374R2
To: | UTC |
From: | Mark Davis |
Date: | 2000-10-15 |
Re: | UTF-8 and "Non-Shortest Form" (R2) |
The following is a proposal for tightening up the language for UTF-8 to close the "non-shortest form" issue. It is modified from the previous paper, taking in feedback from the mailing list. I suggest that two additional steps be taken:
The current C12 forbids the generation of "non-shortest form" and forbids the interpretation of illegal sequences, but it does not forbid the interpretation of non-shortest form. We still need to allow for programs that do fast processing with no error checking where the data is guaranteed to be well-formed. Otherwise, we can extend the prohibition on interpretation from illegal byte sequences to all ill-formed byte sequences. We also need to change the definition of UTF-8 to make absolutely clear what is ill-formed. (The definition of UTF-8 is in Chapter 3, and duplicated in http://www.unicode.org/unicode/faq/utf_bom.html.)
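To illustrate why interpreting non-shortest form matters, here is a minimal Python sketch. The function `naive_decode_2byte` is hypothetical and is not taken from any specification or library; it only mimics a decoder that extracts payload bits without checking well-formedness. Such a decoder maps the two-byte sequence C0 AF to U+002F, the same scalar value as the single byte 2F.

```python
def naive_decode_2byte(lead: int, trail: int) -> int:
    """Decode a two-byte sequence with no well-formedness checks.

    Hypothetical helper for illustration only: it keeps the low five
    bits of the lead byte and the low six bits of the trail byte.
    """
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def strict_decode_fails(data: bytes) -> bool:
    """Return True if Python's built-in strict UTF-8 decoder rejects data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# C0 AF is a non-shortest form of U+002F, whose shortest form is 2F.
overlong_slash = naive_decode_2byte(0xC0, 0xAF)
shortest_slash = b"\x2F".decode("utf-8")
```

A process that filters input for the byte 2F and then hands the data to such a decoder would let C0 AF through undetected. Python's built-in decoder, used here only as a convenient strict reference, rejects the sequence.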
To do this, we make the following normative modifications:
Modify C12 as follows:
C12 | When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed |
Modify D36 as follows:
D36 | UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes, as specified in Table 3.1. A byte sequence is ill-formed UTF-8 if and only if some contiguous subsequence of its bytes matches one (or more) of the lines in Table 3.1b. |
Add the following table and text:
Line | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | 5th Byte |
---|---|---|---|---|---|
1 | C0-C1 | | | | |
2 | F5-FF | | | | |
3 | C2-DF | 00-7F, C0-FF | | | |
4 | E0 | 00-9F, C0-FF | | | |
5 | E1-EF | 00-7F, C0-FF | | | |
6 | F0 | 00-8F, C0-FF | | | |
7 | F4 | 00-7F, 90-FF | | | |
8 | E0-EF | XX | 00-7F, C0-FF | | |
9 | F0-F4 | XX | 00-7F, C0-FF | | |
10 | F0-F4 | XX | XX | 00-7F, C0-FF | |
11 | ED | A0-AF | XX | ED | B0-BF |
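The lines of the table can be read as byte-range patterns and checked mechanically. The following Python sketch is an illustration, not part of the proposal: the names `TABLE_3_1B` and `matching_lines` are hypothetical, and "XX" is treated as matching any byte. Line 6 is encoded here as F0 followed by 00-8F or C0-FF, on the assumption that F0 90-BF must stay well-formed because it begins the encoding of U+10000 and above. The function reports only matches against the eleven listed patterns, so it should not be taken as a complete validator.

```python
# Each table line is a list of cells; each cell is a list of inclusive
# (low, high) byte ranges. ANY stands in for the "XX" cells.
ANY = [(0x00, 0xFF)]
NOT_TRAIL = [(0x00, 0x7F), (0xC0, 0xFF)]  # anything outside 80-BF

TABLE_3_1B = [
    [[(0xC0, 0xC1)]],                                # line 1
    [[(0xF5, 0xFF)]],                                # line 2
    [[(0xC2, 0xDF)], NOT_TRAIL],                     # line 3
    [[(0xE0, 0xE0)], [(0x00, 0x9F), (0xC0, 0xFF)]],  # line 4
    [[(0xE1, 0xEF)], NOT_TRAIL],                     # line 5
    [[(0xF0, 0xF0)], [(0x00, 0x8F), (0xC0, 0xFF)]],  # line 6 (assumed 00-8F)
    [[(0xF4, 0xF4)], [(0x00, 0x7F), (0x90, 0xFF)]],  # line 7
    [[(0xE0, 0xEF)], ANY, NOT_TRAIL],                # line 8
    [[(0xF0, 0xF4)], ANY, NOT_TRAIL],                # line 9
    [[(0xF0, 0xF4)], ANY, ANY, NOT_TRAIL],           # line 10
    [[(0xED, 0xED)], [(0xA0, 0xAF)], ANY,
     [(0xED, 0xED)], [(0xB0, 0xBF)]],                # line 11
]


def cell_matches(byte, cell):
    """True if byte falls in any of the cell's inclusive ranges."""
    return any(low <= byte <= high for low, high in cell)


def matching_lines(data):
    """Return the table line numbers matched somewhere in data."""
    hits = []
    for number, pattern in enumerate(TABLE_3_1B, start=1):
        width = len(pattern)
        # Slide the pattern over every contiguous window of the input.
        for start in range(len(data) - width + 1):
            window = data[start:start + width]
            if all(cell_matches(b, c) for b, c in zip(window, pattern)):
                hits.append(number)
                break
    return hits
```

For example, `matching_lines(b"\xC0\x80")` reports line 1, while the UTF-8 encoding of ordinary text such as `"é€"` matches no line.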
Table 3.1b lists all of the byte sequences that are ill-formed in UTF-8. An "XX" in a cell matches any byte whatsoever; every other cell lists the specific byte range that it matches. Thus