L2/07-185
Source: Mark Davis
Date: Mar 30, 2007 6:14 PM
Subject: Re: IDNAbis compatibility
We had a bit more time to look at IDNAbis compatibility, and here are some better (and hopefully clearer) results. Out of a large sample of the web, there were about 800,000 cases where an HTML document contained an href="..." whose host name was valid under IDNA2003. We tested those host names to see whether they would also be valid under IDNAbis (based on the current working proposals). About 85% were valid, about 8% more would be valid if IDNAbis were changed to also do case and width folding, and about 6% would still be invalid even if case and width folding were applied. (Width folding here means applying NFKC to just the half-width and full-width characters to map them to their normal counterparts.)
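As a rough illustration of the test described above, here is a minimal Python sketch. It is not the code used for the study. Several things in it are assumptions: "half-width and full-width characters" is read as East Asian Width F or H, NFKC is applied to whole runs of such characters, str.casefold() stands in for the case folding, the order of the checks is inferred from the A0-A4 breakdown that follows, and passes_idnabis is a hypothetical predicate that the caller must supply, since the IDNAbis rules themselves are not reproduced here.

```python
import unicodedata
from itertools import groupby
from typing import Callable


def _is_wide(ch: str) -> bool:
    # Assumption: "half-width and full-width" means East Asian Width H or F.
    return unicodedata.east_asian_width(ch) in ("F", "H")


def width_fold(name: str) -> str:
    """Apply NFKC only to runs of half-width/full-width characters."""
    parts = []
    for wide, run in groupby(name, key=_is_wide):
        text = "".join(run)
        parts.append(unicodedata.normalize("NFKC", text) if wide else text)
    return "".join(parts)


def case_fold(name: str) -> str:
    # Assumption: str.casefold() approximates the case folding in question.
    return name.casefold()


def classify(name: str, passes_idnabis: Callable[[str], bool]) -> str:
    """Return "A0".."A4" for a host name already known to be valid IDNA2003.

    passes_idnabis is a hypothetical validity check supplied by the caller.
    """
    if passes_idnabis(name):
        return "A0"
    if passes_idnabis(case_fold(name)):
        return "A1"
    if passes_idnabis(width_fold(name)):
        return "A2"
    if passes_idnabis(case_fold(width_fold(name))):
        return "A3"
    return "A4"
```

Because each check is only reached when the earlier ones fail, every name lands in exactly one category, which is what keeps the categories disjoint.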
Here are some more details, where A0-A4 are disjoint categories.
Category                                                       Count   Percent
A0: Passes IDNAbis                                           708,760    85.26%
A1: Passes IDNAbis after case folding                         22,714     2.73%
A2: Passes IDNAbis after width folding                        47,312     5.69%
A3: Passes IDNAbis after width folding, then case folding          4     0.00%
A4: Failed to pass IDNAbis after all 3 steps                  52,456     6.31%
A5: Passes IDNA2003 = sum(A0-A4)                             831,246   100.00%
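The figures above can be cross-checked with a few lines of Python. This is only an arithmetic sanity check on the reported counts; the variable names are illustrative.

```python
# Counts for the disjoint categories, as reported above.
counts = {
    "A0": 708_760,
    "A1": 22_714,
    "A2": 47_312,
    "A3": 4,
    "A4": 52_456,
}

# The categories together should account for every IDNA2003-valid host name.
total = sum(counts.values())

# Share of each category, as a percentage rounded to two decimals.
shares = {cat: round(100 * n / total, 2) for cat, n in counts.items()}

# Names that would be rescued by case and/or width folding (A1 + A2 + A3).
rescued = counts["A1"] + counts["A2"] + counts["A3"]
rescued_share = round(100 * rescued / total, 2)
```

The total comes out at 831,246, and the names rescued by folding (A1 through A3) amount to 70,030, or 8.42%, which is the "about 8%" quoted earlier. The rounded shares add up to 99.99% rather than exactly 100% because of rounding.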
This differs from some of our previous data, because we are explicitly testing IDNA vs IDNAbis (not just approximating the latter), and also filtering out invalid URLs. I will be out next week, but we'll try to follow up with more of a breakdown of A4.
Mark