To: UTC
From: Mark Davis
Date: 2000-10-15
Re: UTF-8 and "Shortest Form"

The following is a proposal for tightening up the language for UTF-8 to close the "shortest form" issue.

The current C12 forbids the generation of non-shortest forms, and forbids the interpretation of illegal sequences, but does not forbid the interpretation of non-shortest forms. Here is the current C12.

 
C12 When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat illegal byte sequences as an error condition.
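To illustrate the gap in the current wording, here is a minimal Python sketch of an interpreter that combines the payload bits of a two-byte sequence without any range checks. The function name naive_decode_2byte is hypothetical and is not taken from any existing implementation. It shows that the non-shortest form C0 AF yields the same scalar value as the shortest form 2F.

```python
def naive_decode_2byte(b1: int, b2: int) -> int:
    """Combine the payload bits of a two-byte sequence with no range checks.

    Hypothetical helper for illustration only: it takes the low 5 bits of the
    first byte and the low 6 bits of the second byte, as an unchecked decoder would.
    """
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)


# C0 AF is a non-shortest form of U+002F (SOLIDUS); the shortest form is the single byte 2F.
overlong_value = naive_decode_2byte(0xC0, 0xAF)
shortest_value = 0x2F

print(f"C0 AF decodes to U+{overlong_value:04X}")
print(f"same as shortest form: {overlong_value == shortest_value}")
```

A process that filters the byte 2F before interpreting the data would miss the C0 AF spelling, which is why the interpretation side needs the same prohibition as the generation side.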

We still want to allow fast processing with no error checking when the data is guaranteed to be well-formed. In all other cases, we can extend the prohibition on interpreting illegal byte sequences to cover all ill-formed byte sequences. Modify C12 as follows:

 
C12 When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed byte sequences as an error condition unless the data is guaranteed to be well-formed.

We also need to change the definition of UTF-8 to make absolutely clear what is ill-formed. The current definition is as stated in http://www.unicode.org/unicode/faq/utf_bom.html. Here is a modified version.

 
D36
UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes, as specified in Table 3.1. Such a byte sequence in UTF-8 is ill-formed if it does not meet the conditions in Table 3.1a.

Add the following table:

Table 3.1a. Allowable UTF-8 Byte Values
1st Byte  2nd Byte  3rd Byte  4th Byte
00-7F
C2-DF     80-BF
E0        A0-BF     80-BF
E1-EF     80-BF     80-BF
F0        90-BF     80-BF     80-BF
F1-F3     80-BF     80-BF     80-BF
F4        80-8F     80-BF     80-BF
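
The table can be applied mechanically. The following Python sketch is one possible way to do so, not a reference implementation: the names TABLE_3_1A and is_well_formed_utf8 are hypothetical, and the ranges are copied row by row from Table 3.1a exactly as proposed here.

```python
# Each row: (range of the 1st byte, [allowed range for each following byte]),
# transcribed from Table 3.1a as proposed in this document.
TABLE_3_1A = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches a row of Table 3.1a."""
    i = 0
    n = len(data)
    while i < n:
        first = data[i]
        # Find the table row whose 1st-byte range contains this byte.
        for (lo, hi), trail in TABLE_3_1A:
            if lo <= first <= hi:
                break
        else:
            # No row matches: C0, C1, F5-FF, or a stray byte in 80-BF.
            return False
        # The sequence must not be cut off before its last byte.
        if i + len(trail) >= n:
            return False
        # Each following byte must fall in the range the row allows.
        for k, (tlo, thi) in enumerate(trail, start=1):
            if not tlo <= data[i + k] <= thi:
                return False
        i += 1 + len(trail)
    return True


print(is_well_formed_utf8(bytes([0xC3, 0xA9])))        # two-byte sequence from row C2-DF
print(is_well_formed_utf8(bytes([0xC0, 0xAF])))        # non-shortest form of U+002F
print(is_well_formed_utf8(bytes([0xE0, 0x80, 0xAF])))  # non-shortest three-byte form
```

Because the restricted second-byte ranges for E0, F0, and F4 are part of the table itself, non-shortest forms and values above 10FFFF are rejected by the same lookup that rejects stray bytes, with no separate check on the decoded value.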