In looking at how to implement IDNA2008 (and maintain backward
compatibility), we are wrestling with some practical questions that
we'd appreciate feedback on.
Scenario
Look at the following scenario, where we have four processes that
handle an IRI (perhaps just passing it through), with the final one
using it to access the DNS. (We'll use the term IRI whether the
domain name labels are in punycode or in Unicode. They aren't
necessarily known to be A-Labels or U-Labels at any given point.)
P1 => P2 => P3 => P4 => DNS
Variables and Background
These processes may be within the same system, or they may be passing
IRIs across the web (embodied in an HTML5 document, email, XLink, etc.)
to other systems or operating systems. For example, P1 could be a web
server hosting a web page, P2 could be a search engine indexer, P3 could
be a search engine results service, and P4 could be a browser. Or these
could all be cooperating processes within a search engine indexer.
There are a lot of variables here:
- Each of P1..P4 could convert an IRI to punycode before sending it on.
  For that matter, any of them could convert back from punycode to
  Unicode for use internally, or pass that Unicode form on (IRIs with
  Unicode are recommended by the W3C in their protocols).
- Each of the processes could be doing validity checks to determine
  whether the domain name is valid or not. Such a check may be partial
  (as in the current protocol spec, which doesn't require checking
  CONTEXT or BIDI), or full. (The check for validity is orthogonal to
  whether the form is Unicode or punycode.)
- Each of the processes may be on IDNA2003, or on IDNA2008, or on some
  hybrid for compatibility.
- For IDNA2008 implementations, each might be on a different version of
  Unicode.
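
As a rough illustration of the conversions and version differences listed
above, here is a small Python sketch. It uses the standard library's "idna"
codec, which implements IDNA2003 (not IDNA2008), plus the raw "punycode"
codec. The labels, the "example" domain, and the helper name raw_alabel are
illustrative assumptions; raw_alabel performs no validity checks and is not
an IDNA2008 implementation.

```python
# Sketch only: Python's built-in "idna" codec implements IDNA2003 (RFC 3490),
# so it stands in for an IDNA2003 process somewhere in the chain.

def raw_alabel(label: str) -> bytes:
    """Hypothetical helper: punycode-encode one label exactly as given,
    with no mapping step and no validity checks."""
    if label.isascii():
        return label.encode("ascii")
    return b"xn--" + label.encode("punycode")

# Unicode form -> punycode form, and back again.
ascii_form = "bücher.example".encode("idna")
unicode_form = ascii_form.decode("idna")
print(ascii_form)    # b'xn--bcher-kva.example'
print(unicode_form)  # bücher.example

# IDNA2003 maps "ß" to "ss" before encoding. Encoding the label as given
# keeps "ß", so the two processes end up with different host names.
print("straße".encode("idna"))  # b'strasse'
print(raw_alabel("straße"))
```

The point of the last two lines is that the same Unicode IRI can leave two
processes in different punycode forms, depending on what each one does
before encoding.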
Examples:
IE6 only handles punycode, and won't do any validity checking. IE7
handles both punycode and Unicode. It checks the punycode against
IDNA2003, so a valid IDNA2008 IRI with a ZWJ will fail. There are still
enough IE6 implementations around that we (and others) need to handle
them, and for years to come there will be IE7 implementations around,
not to speak of other browsers, email clients, word processors, etc.
that handle URLs/IRIs based on IDNA2003.
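
To make the ZWJ case concrete, here is a small Python sketch. Python's
built-in "idna" codec implements IDNA2003, so it is used only as a stand-in
for an IDNA2003-based application; it is not a model of IE7 itself. The
label is an assumed example (a Sinhala sequence with ZWJ after a virama),
and the ZWJ-preserving punycode form is built directly with the punycode
codec, without running any IDNA2008 checks.

```python
ZWJ = "\u200d"
# Sinhala consonant + virama (U+0DCA) + ZWJ + consonant + vowel sign.
label = "\u0dc1\u0dca" + ZWJ + "\u0dbb\u0dd3"

# Punycode form that preserves the ZWJ, as an IDNA2008 process might emit it.
with_zwj = b"xn--" + label.encode("punycode")

# IDNA2003 ToASCII maps ZWJ to nothing, so it yields a different punycode form.
idna2003_form = label.encode("idna")
print(idna2003_form == with_zwj)  # False

# Python's IDNA2003 decoder re-encodes what it decoded and compares the result
# with its input; the mismatch makes it reject the ZWJ-preserving form.
try:
    with_zwj.decode("idna")
    accepted = True
except UnicodeError:
    accepted = False
print(accepted)  # False
```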
Note: even if validity checking is done on an IRI, non-registries don't
need to include the tests for BIDI or CONTEXT, so there is no guarantee
that a punycode form is an A-Label or that a Unicode form is a U-Label.
Questions
1. Suppose that P2 is on Unicode 5.1, and the others are on Unicode
6.0. If P2 does a validity check, then it could prevent a perfectly
valid IRI from being correctly looked up. To prevent this problem, does
that mean that the best practice is for only P4 to do validity
checking? Or should the others do some weaker form of validity
checking, like skipping a check for UNASSIGNED?
2. Suppose P3 is a non-IDNA-aware process, so IRIs should be converted
to punycode by P2 before sending. Should one do a validity check in P2?
How do we avoid problem #1 in that case?
3. The current protocol spec appears to only require validity checking
when converting to punycode. So when an IRI is already in punycode
(which could have been from an IDNA2003 application), it might not
undergo any checking at all when going from P1 to the DNS; so
everything depends on the registry's doing the right thing. Is it best
to check anyway, or does that run into problem #1?
4. If P2 accepts an IRI in Unicode and passes it on to P3 in Unicode
(never converting to punycode), should it do any validity checking?
5. When a search engine does indexing, it has to map together IRIs that
are "equivalent" (resolving to the same logical location). When it
provides an IRI to the user for a page, that IRI should go to the
indexed page. However, because IDNA2003 and IDNA2008 browsers may go to
different places with the same IRI, which do we provide? Testing for
which browser the user has is clumsy and error-prone.
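
The version mismatch in question 1 can be sketched in Python. Python ships
two character databases, the current one and a frozen Unicode 3.2 copy
(unicodedata.ucd_3_2_0), which serve here as stand-ins for the "newer" and
"older" processes; the actual scenario involves Unicode 5.1 and 6.0, which
Python cannot select. The names ALLOWED and check_label and the
skip_unassigned flag are hypothetical, and the category test is a crude
placeholder, not the IDNA2008 derivation of PVALID or DISALLOWED.

```python
import unicodedata

OLD = unicodedata.ucd_3_2_0   # frozen Unicode 3.2 database (stand-in for P2)
NEW = unicodedata             # database of the running Python (stand-in for P4)

# Crude stand-in for "allowed" code points: lowercase/other letters, marks, digits.
ALLOWED = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def check_label(label: str, db, skip_unassigned: bool = False) -> bool:
    """Hypothetical reduced validity check against one Unicode database.
    With skip_unassigned=True, code points the database does not know
    (category Cn) are passed through for a later process to judge."""
    for ch in label:
        category = db.category(ch)
        if category == "Cn":
            if not skip_unassigned:
                return False
        elif category not in ALLOWED:
            return False
    return True

# U+0242 LATIN SMALL LETTER GLOTTAL STOP was added after Unicode 3.2.
label = "a\u0242"

print(check_label(label, OLD))                        # False: older process blocks it
print(check_label(label, NEW))                        # True: newer process accepts it
print(check_label(label, OLD, skip_unassigned=True))  # True: weaker check passes it on
```

The weaker check still rejects code points the older database knows to be
unsuitable; it only defers judgment on the ones it has never heard of.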