Public Review Issues

208	Proposed Update Unicode Technical Report #36, Unicode Security Considerations	2012.05.01
Status:	Open
Originator:	UTC
Informal Discussion:	Unicode Mail List (Join)
Formal Feedback:	Contact Form

Description of Issue:

This UTR is being prepared for an update to bring the IDNA 2008 references up to date. Public review and comment is invited on this draft.

There are significant additions and changes in the new proposed updates of these specifications. The definition of Restriction Levels has moved from UTR #36 to UTS #39, which also adds two new conformance clauses and specifications for Restriction Levels and mixed number detection, an amended specification for mixed script detection, and updates for Unicode 6.1.

There are several review notes requesting feedback on particular issues. Please submit feedback on those and the rest of this document by May 1 for consideration at the UTC meeting on May 7.

For information about how to discuss this issue and how to supply formal feedback, please see the feedback and discussion instructions. The accumulated feedback received so far on this issue is shown below, or you can look at a full page view.

The draft was updated on 2012-03-06.

Accumulated Feedback on PRI #208

(No feedback was received since the last meeting.)

Feedback on previous drafts of the review document is listed below

Date/Time: Tue Nov 22 18:20:36 CST 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UTR#36 (Unicode security) 3.7.1 (PEP 383 Approach) error


The UTR#36 document (for lossless conversion to Unicode of other
encodings) says that PEP 383 uses the code points <0xD800 + byte
value> for any unmappable byte of a source encoding to map them to low
surrogates.

However PEP 383 actually uses (through its "surrogateescape" error
handler) the code points <0xDC00 + byte value>, i.e. low surrogates.
This has the advantage that they are easier to detect when converting
back to the original encoding: when the generated Unicode string uses
16-bit code units, there is no need to look forward in the string to
see whether the code unit is followed by a low surrogate and so forms
a valid non-BMP character.
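
As a rough illustration of the mapping discussed above, the following sketch uses the `surrogateescape` error handler that Python 3 provides for PEP 383. The sample bytes and the variable names are only illustrative assumptions.

```python
# Latin-1 bytes for "café"; the final byte 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9"

# The undecodable byte is mapped to the lone surrogate 0xDC00 + 0xE9.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = text[-1]

# Encoding with the same handler restores the original byte.
restored = text.encode("utf-8", errors="surrogateescape")

print(hex(ord(escaped)), restored == raw)
```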

In its current implementation however, not all unmapped bytes are
converted like this: if the source encoding is not based on ASCII
(which is always convertible to Unicode), the current Python
implementation of PEP 383 generates exceptions rather than converting
bytes in 0x00..0x7F to 0xDC00..0xDC7F, although the PEP 383
approach itself does not require this restriction.

The PEP 383 approach is usable independently of the size of the code
units through which the code points are represented, including when
the Unicode string uses 8-bit code units (i.e. it is still a valid
Unicode string at the code point level, but it is not valid
UTF-8).
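
A minimal sketch of this point, assuming Python 3 and its standard `surrogatepass` error handler: a string holding an escaped byte cannot be encoded as strict UTF-8, but its lone surrogate can still be written out in 8-bit code units using the generalized three-byte form.

```python
# Lone surrogate standing for the unmapped source byte 0xFF.
text = "\udcff"

# Strict UTF-8 refuses lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" emits the surrogate code point as three 8-bit code units.
three_bytes = text.encode("utf-8", errors="surrogatepass")

print(strict_ok, three_bytes.hex())
```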

But for this case, it would generate 3 bytes in the 8-bit Unicode
string for each unmapped byte of the original encoding, and a more
efficient but similar approach could as well map them in two bytes:

- <0xC0, 0x80 + (byte & 0x3F)> to represent these specially mapped isolated surrogate 
code points 0xDC00..0xDC3F that themselves represent the unmapped source bytes 0x00..0x3F;

- <0xC1, 0x80 + (byte & 0x3F)> to represent these specially mapped isolated surrogate 
code points 0xDC40..0xDC7F that themselves represent the unmapped source bytes 0x40..0x7F;

- <0xC2, 0x80 + (byte & 0x3F)> to represent these specially mapped isolated surrogate 
code points 0xDC80..0xDCBF that themselves represent the unmapped source bytes 0x80..0xBF;

- <0xC3, 0x80 + (byte & 0x3F)> to represent these specially mapped isolated surrogate 
code points 0xDCC0..0xDCFF that themselves represent the unmapped source bytes 0xC0..0xFF;
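
The four cases above follow one pattern, which could be sketched roughly as below. The function names `pack_escaped_byte` and `unpack_escaped_byte` are hypothetical and are not part of PEP 383 or of Python. Note that standard UTF-8 already uses the lead bytes 0xC2 and 0xC3 for U+0080..U+00FF, so this sketch only illustrates the arithmetic of the proposal.

```python
def pack_escaped_byte(byte: int) -> bytes:
    """Hypothetical helper: two-byte form for one unmapped source byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value")
    # Lead byte is 0xC0..0xC3 (top two bits), trail byte carries the low six bits.
    return bytes((0xC0 | (byte >> 6), 0x80 | (byte & 0x3F)))


def unpack_escaped_byte(pair: bytes) -> int:
    """Hypothetical helper: recover the source byte from the two-byte form."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    lead, trail = pair
    if not (0xC0 <= lead <= 0xC3 and 0x80 <= trail <= 0xBF):
        raise ValueError("not a two-byte escaped form")
    return ((lead & 0x03) << 6) | (trail & 0x3F)


print(pack_escaped_byte(0x41).hex(), unpack_escaped_byte(b"\xc3\xbf"))
```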

Nothing would change for PEP 383 if the generated Unicode string uses 16-bit or 
32-bit code units. In all cases, the Unicode string will still enumerate the same 
number and values of code units at the Python programmatic level.

(This approach is similar to the approach used in Java for
representing the NULL code point as <0xC0, 0x80> to allow lossless
representation of valid Unicode strings, which will be internally
represented as 16-bit code units at the programmatic Java level, but
as 8-bit code units at the legacy JNI 8-bit interface or in network
serialisations and for strings in compiled Java classes recognized by
the class loader.)