This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Sat May 22 04:42:52 CDT 2021
Name: Timothy Gu
Report Type: Error Report
Opt Subject: UTS #46 IDNA issue report
Hello,

I'm looking to fully implement UTS #46 in Google Chrome in alignment with other web browsers and the WHATWG URL Standard. Unfortunately, while doing so, we have discovered the following issues with UTS #46. We also attached a proposed solution for each issue listed.

## Forbid double-encoded xn-- even with CheckHyphens=false

The CheckHyphens boolean flag was introduced in version 10.0.0 of UTS #46. It loosens the DNS restriction of having -- in the third and fourth places of a domain label, in order to support certain existing deployed content. However, the introduced flag has a defect: it allows double-encoded IDNA labels to be considered valid.

Here's one example of how this is bad. Consider the domain label "xn--xn---epa". Assuming CheckHyphens=false, upon applying ToUnicode, it would get converted to "xn--é" without any errors. However, this conversion would not round trip, since applying ToASCII to "xn--é" would produce a failure value.

We propose the following fix. In Section 4.1, Validity Criteria, insert the following item after criterion 3:

> If not CheckHyphens, the label must not begin with “xn--”.

## Provide a mode to keep ASCII labels identical

Under the current UTS #46 processing, the label "xn--a" is considered invalid since it suffers from a Punycode decoding error. Yet, existing implementations universally accept "xn--a.com" as a valid domain for lookup. In order to reflect reality, UTS #46 needs to make provisions for such "ASCII fast path."

However, to prevent roundtripping bugs, there is also a need to maintain the same validity status for equivalent A- and U-labels. (E.g., "xn--a-ecp.com" should continue to be invalid, just like how "a⒈.com" is invalid.)

My proposal is to introduce a new boolean flag IgnoreInvalidPunycode. The algorithm in Section 4, Processing should then be amended as follows. Replace step 4.1.1 which currently says:

> Attempt to convert the rest of the label to Unicode according to
> Punycode [RFC3492]. If that conversion fails, record that there was an
> error, and continue with the next label. Otherwise replace the
> original label in the string by the results of the conversion.

with the two steps

> **If the label contains any non-ASCII code point (i.e., a code point
> greater than U+007F), record there was an error, and continue with the
> next label.**
>
> Attempt to convert the rest of the label to Unicode according to
> Punycode [RFC3492]. If that conversion fails, **and if not
> IgnoreInvalidPunycode,** record that there was an error, and continue
> with the next label. Otherwise replace the original label in the
> string by the results of the conversion.

(Additions are surrounded by two asterisks.)

These changes would continue to make ToASCII return failure for labels such as "xn--é" and "xn--a-ecp", but keep "xn--a" as is. Additionally, since the rule operates on the basis of labels, it would be okay with "xn--a.xn--nxa" ("xn--a.β" in Unicode), matching existing real-world implementations.

As a reference, here is a partial list of important implementations that currently support IDNA but also allow "xn--a.com":

- curl
- Firefox Browser
- GNU Wget
- Go net/http
- Google Chrome
- Safari

## IdnaTestV2.txt issue

The IdnaTestV2.txt that came with UTS #46 Version 13.0.0 has a minor error. ToUnicode for "xn--mbm8237g..xn--7-7hf" is marked as failure with reason "V6", but this should not be the case. (This domain appears _twice_ around line 3070.)

The cause appears to be the lack of update for version 13.0.0. The problematic domain contains the code point U+18C4E. While this code point is "disallowed" in Unicode 12.0.0's IdnaMappingTable.txt (line 6397), it is "valid" under Unicode 13.0.0's (line 6450).

(Note, however, a similar domain "xn--mbm8237g.xn--7-7hf1526p" is correct in the file and should remain forbidden. This is since it additionally has the code point U+FE12, which is "disallowed" in both 12.0.0 and 13.0.0.)
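
To illustrate the two proposals in the earlier sections (the extra validity criterion for double-encoded labels and the IgnoreInvalidPunycode flag), here is a minimal Python sketch. It is not taken from UTS #46 or from any implementation. The names `process_ace_label`, `begins_with_ace_prefix`, and `stdlib_punycode_decode` are hypothetical, and Python's built-in `punycode` codec is only a stand-in for the Punycode step: it may accept or reject some inputs differently from a conforming UTS #46 implementation, so the decoder is passed in as a parameter. Keeping the label unchanged when decoding fails and the flag is set is this sketch's reading of "keep "xn--a" as is".

```python
from typing import Callable, Tuple

ACE_PREFIX = "xn--"


def stdlib_punycode_decode(rest: str) -> str:
    """Decode the part after "xn--" with Python's built-in punycode codec."""
    return rest.encode("ascii").decode("punycode")


def begins_with_ace_prefix(decoded_label: str) -> bool:
    """Check for the proposed criterion: a decoded label must not begin with "xn--"."""
    return decoded_label.lower().startswith(ACE_PREFIX)


def process_ace_label(
    label: str,
    ignore_invalid_punycode: bool,
    decode: Callable[[str], str] = stdlib_punycode_decode,
) -> Tuple[str, bool]:
    """Sketch of the amended step 4.1.1; returns (resulting label, error flag)."""
    if not label.lower().startswith(ACE_PREFIX):
        return label, False  # no ACE prefix, so there is nothing to decode
    if any(ord(ch) > 0x7F for ch in label):
        return label, True  # proposed new step: non-ASCII in an "xn--" label is an error
    try:
        return decode(label[len(ACE_PREFIX):]), False
    except ValueError:  # UnicodeError is a subclass of ValueError
        # Assumption: with the flag set, the ASCII label is kept as is, without an error.
        return label, not ignore_invalid_punycode


# Double encoding: decoding succeeds, but the result again begins with "xn--".
decoded, error = process_ace_label("xn--xn---epa", ignore_invalid_punycode=False)
print(decoded, error, begins_with_ace_prefix(decoded))
```

A caller that applies the proposed criterion would reject the first label above because `begins_with_ace_prefix(decoded)` is true, even though the Punycode step itself reported no error. Passing a different `decode` callable makes it possible to try the flag against whichever Punycode decoder an implementation actually uses.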
## ICU support

Finally, I request that the ICU libraries add support for the two features mentioned above. If ICU is outside the purview of this committee, please kindly let me know so that I can forward the request to the right people.

For the first issue, ICU4C does not make it possible to distinguish labels of type "xn--xn---epa" versus "ab--cde": both return UIDNA_ERROR_HYPHEN_3_4. Some way of forbidding the former (the double-encoded label) but allowing the latter would be useful.

For the second issue, additional changes may be needed as ICU returns "xn--a�" with a U+FFFD at the end for "xn--a". We would want to keep the original label unchanged for ASCII labels with Punycode decoding errors. A separate uidna_openUTS46() option may be necessary.

Similar changes would probably be needed for ICU4J. ICU4X does not plan to support IDNA.

----

Best,
Timothy