2008-12-03
Comments to:
markdavis@google.com
This document is the result of a request by members of the Unicode Technical Committee to produce a summary of the potential security problems in the current draft IDNA 2008, for circulation to security teams within their organizations. This is a personal contribution, and does not necessarily represent any official position of any affiliated organization.
The goal here is to describe the security issues as if the current draft IDNA 2008 were approved as is. That draft is a moving target, and the text here may undergo progressive revisions if there are changes in the draft IDNA 2008.
Comments and questions are welcome. However, if readers have any concerns about the draft IDNA 2008 itself, the appropriate forum to voice them is at
idna-update (joining the email list and thereby the working group), not in response to this document.
Main Differences
Category | Description | Comments |
Strict | No mapping | Thus rejecting http://ÖBB.at (but permitting http://öbb.at) |
Hybrid | Map as in IDNA 2003 & disallow symbols | Using the Unicode comparison format. Thus it will allow http://ÖBB.at, mapping it to http://öbb.at. |
Compatible | Map as in IDNA 2003 & allow symbols | Same as Hybrid, except that it also allows IDNs like http://√.com. (See above under Subtractions.) |
Custom | Non-standard mapping | Arbitrary other mappings are allowed in the current draft of IDNA 2008. Thus a custom implementation could allow http://ÖBB.at, mapping it to http://øbb.at, or to http://oebb.at, or to http://obb.at, or to anything else, even http://phishing.com. One IDNA 2008-Custom implementation could map http://TÜRKIYE.com to http://türkiye.com while another could map it to http://türkıye.com (note the dotless i) -- and go to a different location. |
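The first three categories can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the draft's actual rules: the function names are hypothetical, the standard library's `encodings.idna.nameprep` stands in for the IDNA 2003 mapping, "symbol" is approximated by a Unicode general category starting with "S", and Strict is approximated as "already lowercase and in NFC". The four Special Cases discussed below are not handled.

```python
import unicodedata
from encodings.idna import nameprep  # stdlib IDNA 2003 (Nameprep) mapping


def has_symbol(label: str) -> bool:
    # Assumption: "symbol" means general category Sm, Sc, Sk or So.
    return any(unicodedata.category(ch).startswith("S") for ch in label)


def strict(label: str):
    # No mapping: accept only labels that are already lowercase and NFC.
    unchanged = label == unicodedata.normalize("NFC", label).lower()
    return label if unchanged and not has_symbol(label) else None


def hybrid(label: str):
    # Map as in IDNA 2003, then disallow symbols.
    mapped = nameprep(label)
    return None if has_symbol(mapped) else mapped


def compatible(label: str):
    # Map as in IDNA 2003 and keep symbols.
    return nameprep(label)


for label in ("öbb", "ÖBB", "√"):
    # None means the label is rejected by that category.
    print(label, strict(label), hybrid(label), compatible(label))
```

A Custom implementation cannot be sketched in the same way, because its mapping is by definition unspecified.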
There are a few situations (luckily only a few) where IDNA 2008-Strict will result in the resolution of IDNs to different IP addresses than in IDNA 2003. Only a small number of characters are affected, but they are relatively common in particular languages, so a significant number of strings in those languages will be affected. (For more information on why IDNA 2003 does this, see the FAQ.) These four "Special Cases" are listed in the table below:
Code | Character | IDNA 2008 | IDNA 2003 | Example: IDNA 2008 | Example: IDNA 2003 |
U+00DF | ß | ß | ss | | |
U+03C2 | ς | ς | σ | | |
U+200D | ZWJ | ZWJ | deleted | [TBD] | |
U+200C | ZWNJ | ZWNJ | deleted | [TBD] | |
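The IDNA 2003 column can be reproduced with Python's standard library, whose `encodings.idna` module implements the IDNA 2003 Nameprep step. The snippet below is only a sketch for illustration; the sample labels are arbitrary choices.

```python
from encodings.idna import nameprep  # IDNA 2003 (Nameprep) mapping

ZWJ, ZWNJ = "\u200d", "\u200c"

samples = {
    "sharp s": "fa\u00df",                            # faß
    "final sigma": "\u03b2\u03cc\u03bb\u03bf\u03c2",  # βόλος
    "ZWJ": "a" + ZWJ + "b",
    "ZWNJ": "a" + ZWNJ + "b",
}

for name, label in samples.items():
    mapped = nameprep(label)
    # ascii() renders the otherwise invisible joiners as escape sequences.
    print(f"{name}: {ascii(label)} -> {ascii(mapped)}")
```

An implementation that performs no mapping would leave each of these labels unchanged, which is where the two versions diverge.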
These differences allow for security exploits. Consider a URL on the host "IDNBANK.xx", which represents an IDN for a bank.
Alice's browser supports IDNA 2003. Under those rules, "IDNBANK.xx"
is mapped to "xn--blahblah", which is registered by the "xx" registry
and resolves to the IP address 127.0.63.245.
Bob's browser supports IDNA 2008. Under those rules, "IDNBANK.xx"
is also valid, but converts to a different punycode "xn--gorpblah",
which it turns out is also registered by the "xx" registry and resolves
to a different IP address: 136.17.22.221.
The site at
http://136.17.22.221/index.html turns out to be a deliberate spoof page
(put up by a scammer) of the legitimate page
http://127.0.63.245/index.html, a banking site. Alice gets to the
correct page she is seeking. Bob gets to the phishing site instead,
supplies his bank password, and is robbed.
Note that this exploit can be carried out no matter which of the IDNA 2008 implementation categories Bob's browser uses.
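The divergence between Alice's and Bob's browsers can be made concrete with the label "faß" from the Special Cases table. In this sketch, Python's built-in `idna` codec plays the role of the IDNA 2003 client. The function `ace_2008_unmapped` is a hypothetical stand-in for an IDNA 2008 client: it assumes the label is kept as is and skips all of the draft's validity checks.

```python
def ace_2003(label: str) -> bytes:
    # Python's built-in "idna" codec implements IDNA 2003 (map, then encode).
    return label.encode("idna")


def ace_2008_unmapped(label: str) -> bytes:
    # Assumption: the client keeps the label unchanged. Validity checks are
    # omitted; non-ASCII labels are only Punycode-encoded with the ACE prefix.
    if label.isascii():
        return label.encode("ascii")
    return b"xn--" + label.encode("punycode")


label = "fa\u00df"  # faß, one of the special cases
alice = ace_2003(label)
bob = ace_2008_unmapped(label)

# The two clients send different names to the DNS for the same typed label.
print(alice, bob, alice != bob)
```

If both resulting names are registered to different parties, the same typed URL reaches two different servers.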
The ZWJ and ZWNJ characters are of particular concern, because they are normally invisible. That is, the sequence "a<ZWJ>b" looks just like "ab". IDNA 2008 does provide a special category for characters like this (called CONTEXT), and only permits them in certain contexts (certain sequences of Arabic or Indic characters, for example). However, lookup applications are not required to check for these contexts, so overall security is dependent on registries' correct implementations.
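A lookup application that does choose to check contexts could start from something like the following sketch. It is a simplified stand-in for the draft's contextual rules, not a reproduction of them: it accepts a joiner only when the preceding character is a virama (canonical combining class 9), which is one Indic context, and it ignores the Arabic contexts entirely. The function name is hypothetical.

```python
import unicodedata

JOINERS = {"\u200c": "ZWNJ", "\u200d": "ZWJ"}
VIRAMA_CCC = 9  # canonical combining class shared by virama characters


def joiners_out_of_context(label: str) -> list:
    """Return (index, name) for each joiner not directly preceded by a virama."""
    problems = []
    for i, ch in enumerate(label):
        if ch in JOINERS:
            prev_is_virama = (
                i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC
            )
            if not prev_is_virama:
                problems.append((i, JOINERS[ch]))
    return problems


# "a<ZWJ>b" looks like "ab" but carries a joiner in a Latin context.
print(joiners_out_of_context("a\u200db"))
# Devanagari KA + VIRAMA + ZWJ + SSA: the joiner follows a virama.
print(joiners_out_of_context("\u0915\u094d\u200d\u0937"))
```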
The existence of these special cases means that the Unicode comparison format used in Hybrid and Compatible implementations needs to be modified to exclude these characters.
While registries could take some steps to mitigate the above problems, we are not only talking about top-level or second-level domains, but also about lower-level domains that are under the control of thousands of different organizations. For example, the domain names under "blogspot.com", such as http://café.blogspot.com, are controlled by the company that has registered "blogspot". Ideally, no registry would allow two IDNs that correspond according to the Special Cases table to resolve to different IP addresses. So blogspot would need to disallow registering both http://gefäss.blogspot.com and http://gefäß.blogspot.com to prevent problems (and likewise for other cases, such as the normally invisible ZWJ and ZWNJ). However, applications cannot depend on all such registries behaving correctly, because the odds are high that at least some (and perhaps many) of the many thousands of registries will not check for this. Thus the burden is primarily on applications handling IDNs to prevent the situation.
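For a registry that does want to check, one possible approach is to key every registration by its IDNA 2003 mapped form and refuse a second label that maps to the same key. The class below is a hypothetical sketch of that idea, using the standard library's `nameprep` as the mapping; it is not taken from any actual registry's policy.

```python
from encodings.idna import nameprep  # IDNA 2003 (Nameprep) mapping


class LabelRegistry:
    """Hypothetical registry that blocks labels colliding under IDNA 2003."""

    def __init__(self):
        self._by_key = {}  # mapped form -> label as originally registered

    def register(self, label: str) -> bool:
        key = nameprep(label)
        existing = self._by_key.get(key)
        if existing is not None and existing != label:
            # A different label with the same mapped form already exists.
            return False
        self._by_key[key] = label
        return True


registry = LabelRegistry()
print(registry.register("gef\u00e4ss"))      # gefäss
print(registry.register("gef\u00e4\u00df"))  # gefäß maps to the same key
```

Because the mapped form also drops ZWJ and ZWNJ, the same key blocks labels that differ only by those invisible characters.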
The worst of all possible cases is an IDNA 2008-Custom implementation. Unfortunately, there appears to be no good way to prevent security problems with IDNA 2008 Custom implementations, because it is impossible to anticipate what such implementations would do. Such an implementation is not limited to just the above four special cases for exploits -- it could remap even characters like "A" or "B" to an arbitrary other character (or sequence). Because there is no way to predict what it will do, there are no effective countermeasures.
Clients such as search engines have another practical issue facing
them. They will probably opt for IDNA 2008-Compatible, allowing all
valid IDNA 2003 characters so that they can access all of the web.
Normally they also need to canonicalize URLs, so that they can
determine when two URLs are actually the same. For IDNA 2003 this was
straightforward. For IDNA 2008-Hybrid/Compatible, the canonicalization
can result in two different possibilities (with or without Special
Cases). It may then require two DNS lookups to determine which of the
two possibilities is to be used.
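The two possibilities can be generated mechanically. The sketch below assumes that the "without Special Cases" form is the plain IDNA 2003 mapping, and that the "with Special Cases" form applies the same mapping character by character while passing ß, ς, ZWJ and ZWNJ through unchanged. All function names are hypothetical, and validity checks are omitted.

```python
import unicodedata
from encodings.idna import nameprep  # IDNA 2003 (Nameprep) mapping

SPECIAL = frozenset("\u00df\u03c2\u200c\u200d")  # ß, ς, ZWNJ, ZWJ


def to_ace(label: str) -> str:
    # Punycode-encode non-ASCII labels and add the ACE prefix.
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def map_keeping_special(label: str) -> str:
    # Map character by character so that the special cases pass through.
    mapped = "".join(ch if ch in SPECIAL else nameprep(ch) for ch in label)
    return unicodedata.normalize("NFC", mapped)


def candidate_hosts(host: str) -> list:
    """Return the one or two ACE host names a client may need to look up."""
    parts = host.split(".")
    without_special = ".".join(to_ace(nameprep(p)) for p in parts)
    with_special = ".".join(to_ace(map_keeping_special(p)) for p in parts)
    # Deduplicate while keeping order: most hosts yield a single candidate.
    return list(dict.fromkeys([without_special, with_special]))


print(candidate_hosts("ÖBB.at"))
print(candidate_hosts("Faß.de"))
```

Only hosts that contain one of the four special characters produce two candidates, so the extra DNS lookup is needed for those hosts alone.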
Whatever approach is taken, IDNA 2008 does not make any appreciable
difference in reducing problems with visually-confusable characters
(so-called homographs). Thus programmers still need to be aware of
those issues as detailed in
Unicode Security Considerations,
including the list of potentially visually-confusable characters that
can be used in programmatic tests found in that Unicode Technical
Report.
As implementations update to IDNA 2008, there will be a considerable
period in which both IDNA 2003 and IDNA 2008 implementations are in use,
with the possible categories of IDNA 2008 given above: Strict, Hybrid,
Compatible, or Custom.
To reduce security concerns, we strongly hope that no implementations choose a Custom variant, to avoid indeterminacies which can cause security problems. (Even better would be if this option were removed from the IDNA 2008 specs!) To maintain compatibility, we anticipate that few implementations will opt for the Strict variant.
That is, most would implement either IDNA 2008-Hybrid or IDNA 2008-Compatible in the near term. Once sufficiently many high-level registries disallow symbols, the IDNA 2008-Compatible implementations could probably move towards IDNA 2008-Hybrid. It is unclear when, if ever, it would be reasonable for those implementations to move to being Strict.
Q. What are examples of where the different categories of IDNA implementation behave differently?
Q. What are the main advantages of IDNA2008?
Q. What is "bidi label hopping"?
Q. What are the main disadvantages of IDNA2008?
Q. Are the "local" mappings just a UI issue?
Q. Do the Custom exploits require unscrupulous registries?
Q. What is the motivation for allowing arbitrary (Custom) mappings?
A. Here is a table that illustrates the differences, where 2003 is the current behavior.
URL | 2003 | 2008-Compatible | 2008-Hybrid | 2008-Strict | 2008-Custom | Comments |
http://öbb.at | Yes | Yes | Yes | Yes | Yes | Simple characters |
http://ÖBB.at | Yes | Yes | Yes | No | ? | Case mapping |
http://√.com | Yes | Yes | No | No | ? | Symbol |
http://faß.de | Yes | Yes* | Yes* | Yes* | Yes* | Special (different IP address) |
http://ԛәлп.com | No | Yes | Yes | Yes | Yes | New Unicode (version 5.1) U+051B (ԛ) cyrillic qa |
The larger concern is cases like http://Brüder.com that work now (under IDNA 2003, being equivalent to http://brüder.com) but fail under a strict implementation.
Note that the text "ΒόλοΣ.com", which appears on http://Βόλος.com, illustrates this: the normal case mapping of Σ is to σ. If σ and ς are not treated as case variants, there wouldn't be a match between ΒόλοΣ and Βόλος.
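The sigma example can be checked directly. In this sketch the standard library's `nameprep` represents IDNA 2003. The helper `lower_per_char` is a hypothetical stand-in for a mapping that lowercases each character on its own and does not treat σ and ς as variants.

```python
from encodings.idna import nameprep  # IDNA 2003 (Nameprep) mapping

page_text = "\u0392\u03cc\u03bb\u03bf\u03a3"  # ΒόλοΣ
site_name = "\u0392\u03cc\u03bb\u03bf\u03c2"  # Βόλος

# IDNA 2003 folds both Σ and ς to σ, so the two spellings match.
print(nameprep(page_text) == nameprep(site_name))


def lower_per_char(text: str) -> str:
    # Context-free lowercasing: Σ becomes σ, while ς is left alone.
    return "".join(ch.lower() for ch in text)


# Without treating σ and ς as variants, the spellings no longer match.
print(lower_per_char(page_text) == lower_per_char(site_name))
```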
In German, the situation is even more complicated:
IDNA 2003 deletes ZWJ, ZWNJ and other characters that are
themselves invisible but may affect rendering. IDNA 2008 allows them,
but only in limited contexts.
A. It is extremely difficult to restrict on the basis of language, because the letters used in a particular language are not well defined. The "core" letters typically are, but many others are typically accepted in loan words, and have perfectly legitimate commercial and social use.
It is a bit easier to maintain a bright line based on script differences between characters: every Unicode character has a defined script (or is Common/Inherited). Even there it is problematic to have that as a restriction. Some languages (Japanese) require multiple scripts. And in most cases, mixtures of scripts are harmless. One can have SONY日本.com with no problems at all -- while there are many cases of "homographs" (visually confusable characters) within the same script that a restriction based on script doesn't deal with.
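To illustrate why a script-based line is both easy to draw and of limited use, here is a rough sketch. Python's standard library does not expose the Unicode Script property, so the code below approximates it by taking the first word of each letter's character name. That proxy is an assumption made for this example only; a real implementation would use the Script property data published by Unicode.

```python
import unicodedata


def rough_scripts(label: str) -> set:
    """Approximate the scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():  # skip digits, hyphens and other common characters
            # Assumption: the first word of the name identifies the script.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


print(rough_scripts("SONY\u65e5\u672c"))  # SONY日本: a harmless mixture
print(rough_scripts("p\u0430ypal"))       # Latin with one Cyrillic letter
print(rough_scripts("paypa1"))            # digit 1 for l: single script
```

The last label uses only one script, so a script-mixing rule would accept it even though it may be visually confusable with another name.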
It would have been of some aid to remove historic scripts (like cuneiform) from the protocol, but the IDNA working group didn't agree to that. See Unicode Specific Character Adjustments, Table 4.
The rough consensus among the working group is that script/language mixing restrictions are not appropriate for the lowest-level protocol. So in this respect, IDNA 2008 is no different than IDNA 2003. IDNA doesn't try to attack the homograph problem, because it is too difficult to have a bright line. Effective solutions depend on information or capabilities outside of the protocol's control, such as language restrictions appropriate for a particular registry, the language of the user looking at this URL, the ability of a UI to display suspicious URLs with special highlighting, and so on.
Responsible registries will have their own rules, since they can apply such restrictions. For example, DENIC can decide on a restricted set of characters appropriate for German. Apps also take certain precautions -- MSIE, Safari, and Chrome all display domain names in Unicode only if the user's language(s) typically use the scripts in those domain names. Firefox is the odd man out, expecting TLD registries to publish rules to Firefox management's liking. There is more on the kinds of techniques that implementations can use on the Unicode web site, at [Unicode Security Considerations].