Accumulated Feedback on PRI #429

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.


Date/Time: Sat May 22 04:42:52 CDT 2021
Name: Timothy Gu
Report Type: Error Report
Opt Subject: UTS #46 IDNA issue report


Hello,

I'm looking to fully implement UTS #46 in Google Chrome in alignment
with other web browsers and the WHATWG URL Standard. Unfortunately,
while doing so, we have discovered the following issues with UTS #46. We
have also included a proposed solution for each issue listed.

## Forbid double-encoded xn-- even with CheckHyphens=false

The CheckHyphens boolean flag was introduced in version 10.0.0 of UTS
#46. It loosens the DNS restriction of having -- in the third and fourth
places of a domain label, in order to support certain existing deployed
content. However, the introduced flag has a defect: it allows
double-encoded IDNA labels to be considered valid.

Here's one example of how this is bad. Consider the domain label
"xn--xn---epa". Assuming CheckHyphens=false, upon applying ToUnicode, it
would get converted to "xn--é" without any errors. However, this
conversion would not round trip, since applying ToASCII to "xn--é" would
produce a failure value.

We propose the following fix. In Section 4.1, Validity Criteria, insert
the following item after criterion 3:

> If not CheckHyphens, the label must not begin with “xn--”.
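
The round-trip problem and the proposed criterion can be illustrated with a minimal sketch. It assumes Python's built-in "punycode" codec as the RFC 3492 decoder. The helper names `decode_ace_label` and `violates_proposed_criterion` are hypothetical and come from neither UTS #46 nor any library. Only the Punycode step and the proposed rule are modeled, not the full ToUnicode processing.

```python
ACE_PREFIX = "xn--"


def decode_ace_label(label: str) -> str:
    """Strip the ACE prefix and Punycode-decode the remainder.

    Hypothetical helper: no mapping, normalization, or validity checks.
    """
    if not label.lower().startswith(ACE_PREFIX):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def violates_proposed_criterion(decoded: str, check_hyphens: bool) -> bool:
    """Proposed rule: if not CheckHyphens, the label must not begin with "xn--"."""
    return (not check_hyphens) and decoded.lower().startswith(ACE_PREFIX)


decoded = decode_ace_label("xn--xn---epa")
print(decoded)
print(violates_proposed_criterion(decoded, check_hyphens=False))
print(violates_proposed_criterion("ab--cde", check_hyphens=False))
```

Under this sketch, the double-encoded label is rejected while an ordinary label with hyphens in the third and fourth positions, such as "ab--cde", is still accepted when CheckHyphens=false.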

## Provide a mode to keep ASCII labels identical

Under the current UTS #46 processing, the label "xn--a" is considered
invalid since it suffers from a Punycode decoding error. Yet, existing
implementations universally accept "xn--a.com" as a valid domain for
lookup. To reflect this reality, UTS #46 needs to make provision for
such an "ASCII fast path." However, to prevent round-tripping bugs,
there is also a need to maintain the same validity status for equivalent
A- and U-labels. (E.g., "xn--a-ecp.com" should continue to be invalid,
just like how "a⒈.com" is invalid.)

My proposal is to introduce a new boolean flag IgnoreInvalidPunycode.
The algorithm in Section 4, Processing should then be amended as
follows. Replace step 4.1.1, which currently says:

> Attempt to convert the rest of the label to Unicode according to
> Punycode [RFC3492]. If that conversion fails, record that there was an
> error, and continue with the next label. Otherwise replace the
> original label in the string by the results of the conversion.

with the two steps

> **If the label contains any non-ASCII code point (i.e., a code point
> greater than U+007F), record that there was an error, and continue
> with the next label.**
>
> Attempt to convert the rest of the label to Unicode according to
> Punycode [RFC3492]. If that conversion fails, **and if not
> IgnoreInvalidPunycode,** record that there was an error, and continue
> with the next label. Otherwise replace the original label in the
> string by the results of the conversion.

(Additions are surrounded by two asterisks.)

These changes would continue to make ToASCII return failure for labels
such as "xn--é" and "xn--a-ecp", but keep "xn--a" as is. Additionally,
since the rule operates on the basis of labels, it would be okay with
"xn--a.xn--nxa" ("xn--a.β" in Unicode), matching existing real-world
implementations.
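
One possible reading of the two amended steps is sketched below. It assumes Python's built-in "punycode" codec as the RFC 3492 decoder and takes the intent to be that the label is left unchanged when decoding fails and IgnoreInvalidPunycode is set. The function name `process_ace_label` is hypothetical. The later validity checks of Section 4.1 are not modeled, so a label that decodes successfully may still be rejected afterwards. Because decoders can differ in which labels they reject, the invalid-Punycode case here uses a deliberately truncated label, "xn--zz".

```python
from typing import Tuple

ACE_PREFIX = "xn--"


def process_ace_label(label: str, ignore_invalid_punycode: bool) -> Tuple[str, bool]:
    """Return (resulting label, error recorded?) for a label starting with "xn--".

    Hypothetical helper covering only the two amended steps.
    """
    # New first step: any non-ASCII code point in an "xn--" label is an error.
    if any(ord(ch) > 0x7F for ch in label):
        return label, True
    try:
        decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError:
        # Amended step: a failed conversion is an error only when the flag
        # is off; with the flag on, the ASCII label is kept unchanged.
        return label, not ignore_invalid_punycode
    return decoded, False


for flag in (False, True):
    for label in ("xn--\u00e9", "xn--zz", "xn--nxa"):
        print(flag, label, process_ace_label(label, flag))
```

With the flag set, the undecodable ASCII label passes through untouched, while the non-ASCII "xn--é" is rejected regardless of the flag.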

As a reference, here is a partial list of important implementations that
currently support IDNA but also allow "xn--a.com":

- curl
- Firefox Browser
- GNU Wget
- Go net/http
- Google Chrome
- Safari

## IdnaTestV2.txt issue

The IdnaTestV2.txt that came with UTS #46 Version 13.0.0 has a minor
error. ToUnicode for "xn--mbm8237g..xn--7-7hf" is marked as failure with
reason "V6", but this should not be the case. (This domain appears
_twice_ around line 3070.)

The cause appears to be that this test entry was not updated for
version 13.0.0. The problematic domain contains the code point U+18C4E.
While this code point is "disallowed" in Unicode 12.0.0's
IdnaMappingTable.txt (line 6397), it is "valid" in Unicode 13.0.0's
(line 6450).

(Note, however, a similar domain "xn--mbm8237g.xn--7-7hf1526p" is
correct in the file and should remain forbidden. This is because it
additionally has the code point U+FE12, which is "disallowed" in both
12.0.0 and 13.0.0.)

## ICU support

Finally, I request that the ICU libraries add support for the two
features mentioned above. If ICU is outside the purview of this
committee, please kindly let me know so that I can forward the request
to the right people.

For the first issue, ICU4C does not make it possible to distinguish
labels of type "xn--xn---epa" from those of type "ab--cde": both return
UIDNA_ERROR_HYPHEN_3_4. Some way of forbidding the former but allowing
the latter would be useful.

For the second issue, additional changes may be needed as ICU returns
"xn--a�" with a U+FFFD at the end for "xn--a". We would want to keep the
original label unchanged for ASCII labels with Punycode decoding errors.
A separate uidna_openUTS46() option may be necessary.

Similar changes would probably be needed for ICU4J. ICU4X does not plan
to support IDNA.

----

Best,

Timothy