To | UTC |
From | Mark Davis |
Date | 2000-10-15 |
Re | UTF-8 and "Shortest Form" |
The following is a proposal for tightening up the language for UTF-8 to close the "shortest form" issue.
The current C12 forbids the generation of non-shortest forms, and forbids the interpretation of illegal sequences, but it does not forbid the interpretation of non-shortest forms. Here is the current C12.
C12 | When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat illegal byte sequences as an error condition. |
We still want to allow fast processing with no error checking when the data is guaranteed to be well-formed. In all other cases, we can extend the prohibition on interpreting illegal byte sequences so that it covers all ill-formed byte sequences, including non-shortest forms. Modify C12 as follows:
C12 | When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed byte sequences as an error condition unless the data is guaranteed to be well-formed. |
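To illustrate why interpretation matters, here is a minimal Python sketch of a decoder that does no error checking, the kind of fast path that is only safe when the data is guaranteed to be well-formed. The function name `decode_unchecked` and the sample bytes are assumptions made for illustration; they are not part of the proposal or the standard.

```python
def decode_unchecked(data: bytes) -> list[int]:
    """Decode UTF-8 by bit-masking only, with no well-formedness checks."""
    scalars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:        # 0xxxxxxx: one byte
            length, value = 1, lead
        elif lead < 0xE0:      # 110xxxxx: two bytes
            length, value = 2, lead & 0x1F
        elif lead < 0xF0:      # 1110xxxx: three bytes
            length, value = 3, lead & 0x0F
        else:                  # 11110xxx: four bytes
            length, value = 4, lead & 0x07
        # Fold in the low six bits of each trailing byte, unchecked.
        for trail in data[i + 1 : i + length]:
            value = (value << 6) | (trail & 0x3F)
        scalars.append(value)
        i += length
    return scalars


shortest = bytes([0x2F])            # U+002F in its shortest form
non_shortest = bytes([0xC0, 0xAF])  # the same value padded into two bytes

# Both inputs decode to the same scalar value, so a check applied to the
# raw bytes (for example, a search for 0x2F) can be bypassed.
assert decode_unchecked(shortest) == [0x2F]
assert decode_unchecked(non_shortest) == [0x2F]
```

A decoder like this is acceptable under the modified C12 only when its input is guaranteed to be well-formed. On any other input, it silently accepts the non-shortest form instead of treating it as an error condition.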
D36 | UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes, as specified in Table 3.1. Such a byte sequence in UTF-8 is ill-formed if it does not meet the conditions in Table 3.1a. |
Add the following table as Table 3.1a:
1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|
00-7F | | | |
C2-DF | 80-BF | | |
E0 | A0-BF | 80-BF | |
E1-EF | 80-BF | 80-BF | |
F0 | 90-BF | 80-BF | 80-BF |
F1-F3 | 80-BF | 80-BF | 80-BF |
F4 | 80-8F | 80-BF | 80-BF |
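
The table can be applied mechanically. The following Python sketch is one possible way to do so, not a definitive implementation. It transcribes the rows exactly as written in this proposal; the names `WELL_FORMED_ROWS`, `matches_row`, and `is_well_formed` are hypothetical.

```python
# Rows of the table above: one inclusive (low, high) range per byte position.
WELL_FORMED_ROWS = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), (0x80, 0xBF)],
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
]


def matches_row(data, start, row):
    """Return True if the bytes at data[start:] fit every range in one row."""
    chunk = data[start : start + len(row)]
    if len(chunk) < len(row):
        return False  # sequence is truncated at the end of the input
    return all(low <= byte <= high for byte, (low, high) in zip(chunk, row))


def is_well_formed(data):
    """Return True if data is a concatenation of sequences matching the table."""
    i = 0
    while i < len(data):
        for row in WELL_FORMED_ROWS:
            if matches_row(data, i, row):
                i += len(row)
                break
        else:
            return False  # no row matched at this position
    return True
```

With this sketch, `is_well_formed(bytes([0x2F]))` returns `True`, while `is_well_formed(bytes([0xC0, 0xAF]))` returns `False`, because C0 does not appear as a first byte in any row. The non-shortest three-byte and four-byte forms are rejected in the same way by the restricted second-byte ranges in the E0 and F0 rows, and the F4 row excludes values above 10FFFF.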