208 | Proposed Update Unicode Technical Report #36, Unicode Security Considerations | 2012.05.01
Status: Open
Originator: UTC
Informal Discussion: Unicode Mail List (Join)
Formal Feedback: Contact Form
Description of Issue:
This UTR is being prepared for an update to bring the IDNA 2008 references up to date. Public review and comment is invited on this draft.
There are significant additions and changes in the new proposed updates of these specifications. The definition of Restriction Levels has moved from UTR #36 to UTS #39, which also adds two new conformance clauses and specifications for Restriction Levels and mixed number detection, an amended specification for mixed script detection, and updates for Unicode 6.1.
There are several review notes requesting feedback on particular issues. Please submit feedback on those and the rest of this document by May 1 for consideration at the UTC meeting on May 7.
For information about how to discuss this issue and how to supply formal feedback, please see the feedback and discussion instructions. The accumulated feedback received so far on this issue is shown below, or you can look at a full page view.
The draft was updated on 2012-03-06.
(No feedback was received since the last meeting.)
Date/Time: Tue Nov 22 18:20:36 CST 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UTR#36 (Unicode security) 3.7.1 (PEP 383 Approach) error
The UTR #36 document (on lossless conversion of other encodings to Unicode) says that PEP 383 uses the code points <0xD800 + byte value> for any unmappable byte of a source encoding, i.e. high surrogates. However, PEP 383 (through its "surrogateescape" error handler) actually uses the code points <0xDC00 + byte value>, i.e. low surrogates. The advantage is that these are easier to detect when converting back to the original encoding: when the generated Unicode string uses 16-bit code units, an isolated low surrogate can be recognised without looking forward in the string, whereas a high surrogate could be the first half of a pair representing a valid non-BMP character.
In its current implementation, however, not all unmappable bytes are converted this way: if the source encoding is not based on ASCII (which is always convertible to Unicode), the current Python implementation of PEP 383 raises an exception rather than converting the bytes 0x00..0x7F to 0xDC00..0xDC7F, although the PEP 383 approach itself does not require this.
The PEP 383 approach is usable independently of the size of the code units through which the code points are represented, including when the Unicode string uses 8-bit code units (i.e. it is still a valid Unicode string at the code point level, but it is not valid UTF-8).
But in this case, it would generate 3 bytes in the 8-bit Unicode string for each unmapped byte of the original encoding, and a more efficient but similar approach could just as well map them to two bytes:
- <0xC0, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate code points 0xDC00..0xDC3F, which themselves represent the unmapped source bytes 0x00..0x3F;
- <0xC1, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate code points 0xDC40..0xDC7F, which themselves represent the unmapped source bytes 0x40..0x7F;
- <0xC2, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate code points 0xDC80..0xDCBF, which themselves represent the unmapped source bytes 0x80..0xBF;
- <0xC3, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate code points 0xDCC0..0xDCFF, which themselves represent the unmapped source bytes 0xC0..0xFF.
Nothing would change for PEP 383 if the generated Unicode string uses 16-bit or 32-bit code units. In all cases, the Unicode string will still enumerate the same number and values of code units at the Python programmatic level.
(This approach is similar to the one used in Java for representing the NULL code point as <0xC0, 0x80> to allow lossless representation of valid Unicode strings, which are internally represented as 16-bit code units at the programmatic Java level, but as 8-bit code units at the legacy JNI 8-bit interface, in network serialisations, and for strings in compiled Java classes recognized by the class loader.)
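The behaviour described in this comment can be sketched in Python. The first part uses only the standard "surrogateescape" and "surrogatepass" error handlers to show the <0xDC00 + byte value> mapping and the three-byte serialisation of a lone surrogate. The second part is a minimal sketch of the two-byte form proposed above; the helpers `pack_escaped_byte`, `unpack_escaped_byte` and `escaped_code_point` are hypothetical names introduced for illustration and are not part of PEP 383 or of any Python API.

```python
# Part 1: what PEP 383 ("surrogateescape") does today.
raw = b"abc\xff"  # 0xFF can never appear in well-formed UTF-8

# The undecodable byte 0xFF becomes the lone low surrogate U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "abc\udcff"

# Encoding with the same handler restores the original bytes losslessly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# Writing the lone surrogate itself in 8-bit code units takes three bytes
# ("surrogatepass" is used here only to display that form).
three_byte_form = "\udcff".encode("utf-8", errors="surrogatepass")
assert three_byte_form == b"\xed\xb3\xbf"


# Part 2: hypothetical helpers for the two-byte form proposed in the comment.
def pack_escaped_byte(byte: int) -> bytes:
    """Return <0xC0 + (byte >> 6), 0x80 + (byte & 0x3F)> for one source byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value")
    return bytes([0xC0 + (byte >> 6), 0x80 + (byte & 0x3F)])


def unpack_escaped_byte(pair: bytes) -> int:
    """Recover the original source byte from the two-byte form."""
    lead, trail = pair
    if not (0xC0 <= lead <= 0xC3 and 0x80 <= trail <= 0xBF):
        raise ValueError("not a packed escape pair")
    return ((lead - 0xC0) << 6) | (trail - 0x80)


def escaped_code_point(byte: int) -> int:
    """Code point the pair stands for under PEP 383: 0xDC00 + byte."""
    return 0xDC00 + byte
```

For example, `pack_escaped_byte(0xFF)` yields the bytes 0xC3 0xBF, standing for U+DCFF, compared with the three bytes 0xED 0xB3 0xBF above. One thing this sketch does not handle: the lead bytes 0xC2 and 0xC3 are also used by well-formed UTF-8 for U+0080..U+00FF, so a real decoder would need some additional way to tell the two uses apart.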