Technical Reports |
Authors | Mark Davis (mark.davis@us.ibm.com) |
Date | 2005-02-20 |
This Version | http://www.unicode.org/reports/tr36/tr36-2.html |
Previous Version | http://www.unicode.org/reports/tr36/tr36-1.html |
Latest Version | http://www.unicode.org/reports/tr36/ |
Revision | 2 |
This document describes security considerations that are important to be aware of when working with Unicode, and provides specific recommendations for dealing with the issues that arise.
This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
Note to Reviewers: The original working title was "Unicode Security Considerations". Should the above title be changed back to that, or changed to something else, eg "Unicode Security Recommendations"? Feedback is welcome.
Unicode represents a very significant advance over all previous methods of encoding characters. For the first time, all of the world's characters can be represented in a uniform manner, making it feasible for the vast majority of programs to be globalized: built to handle any language in the world.
In many ways, the use of Unicode makes programs much more robust and secure. When systems needed to use a hodge-podge of different charsets for representing characters, it was possible to take advantage of differences between those charsets, or in the way in which programs converted to and from them.
However, because Unicode contains such a large number of characters, and because it incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This document describes some of the security considerations that should be taken into account by programmers, system analysts, standards-developers, and users.
We anticipate that this document will grow over time, adding additional sections as needed. Initially, there are two areas that will be discussed: canonical representation and visual spoofing. For more information, see also the Unicode FAQ on Security Issues.
Each section below presents background information on the kinds of problems that can occur, then a list of specific recommendations for avoiding the problems.
Note to Reviewers: Some of the examples below use Unicode characters which some browsers will not show, or may not show in a way that illustrates the problem. For more information about improving the display, see [Display]. In the final version, we'll prepare GIFs for the characters where necessary.
A common practice is to have a 'gatekeeper' for a system. That gatekeeper checks over incoming data to ensure that it is safe, and passes only safe data through. Once in the system, the other components assume that the data is safe. A problem arises when a component treats two pieces of text as identical — typically by canonicalizing them to the same form — while the gatekeeper only detected that one of them was unsafe.
There are three equivalent encoding forms for Unicode: UTF-8, UTF-16, and UTF-32. UTF-8 is commonly used in XML and HTML; UTF-16 is the most common in program APIs; and UTF-32 is the best for representing single characters. While these forms are all equivalent in terms of the ability to express Unicode, the original usage of UTF-8 was open to a canonicalization exploit.
Up to The Unicode Standard, Version 3.0, the generation of "non-shortest form" UTF-8 was forbidden, as was the interpretation of illegal sequences, but the interpretation of what was called the "non-shortest form" was not forbidden. Where software does interpret the non-shortest forms, security issues can arise.
For example, the backslash character "\" can often be a dangerous character to let through a gatekeeper, since it can be used to access different directories. Thus a gatekeeper might specifically prevent it from getting through. The backslash is represented in UTF-8 as the byte sequence <5C>. However, as a non-shortest form, backslash could also be represented as the byte sequence <C1 9C>. When a gatekeeper doesn't catch that, but a component converts non-shortest forms, it can allow a real security breach. For more information, see http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx and http://www.ins.com/downloads/whitepapers/ins_white_paper_ms_iis_unicode_exploit_0801.pdf.
To address this issue, the Unicode Technical Committee modified the definition of UTF-8 in Unicode 3.1 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses.
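As a rough illustration of the exploit and of the fix, the following Python sketch contrasts a deliberately lenient two-byte decoder with Python's built-in strict UTF-8 decoder. The functions `lenient_two_byte_decode`, `gatekeeper_allows`, and `strict_decode_ok` are hypothetical names invented for this example; they are not part of any standard or library.

```python
def lenient_two_byte_decode(data: bytes) -> str:
    """Hypothetical, deliberately unsafe decoder for one 2-byte sequence.

    It applies the UTF-8 bit layout 110xxxxx 10xxxxxx without checking
    that the resulting code point actually requires two bytes.
    """
    lead, trail = data
    code_point = ((lead & 0x1F) << 6) | (trail & 0x3F)
    return chr(code_point)


def gatekeeper_allows(data: bytes) -> bool:
    """Naive gatekeeper: only looks for the raw backslash byte 0x5C."""
    return b"\x5c" not in data


def strict_decode_ok(data: bytes) -> bool:
    """True if the bytes are well-formed UTF-8 for Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong = bytes([0xC1, 0x9C])  # non-shortest form of the backslash

print(gatekeeper_allows(overlong))                 # the naive byte check lets it through
print(lenient_two_byte_decode(overlong) == "\\")   # the lenient decoder yields a backslash
print(strict_decode_ok(overlong))                  # the strict decoder rejects the sequence
```

One possible design consequence is that a gatekeeper should decode strictly first and only then inspect the resulting characters, so that it and the downstream components agree on what the text is.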
Note to Reviewers: To Do:
Add information about other possible exploits in this area:
Unicode Normalization
Case mapping
Buffer overflows with all of the above, and when converting encoding forms
Visual spoofing is where a similarity in visual appearance fools a user, and causes him or her to take unsafe actions. This is not new to Unicode: it was possible to spoof simply with ASCII characters: "inteI.com", for example, uses a capital I instead of a lowercase L. The infamous example here is of course "paypaI.com":
... Not only was "Paypai.com" very convincing, but the scam artist even goes one step further. He or she is apparently emailing PayPal customers, saying they have a large payment waiting for them in their account.
The message then offers up a link, urging the recipient to claim the funds. But the URL that is displayed for the unwitting victim uses a capital "i" (I), which looks just like a lowercase "L" (l), in many computer fonts. ...
And the spoofs nowadays are pretty clever. One is an email that looks like it comes from a trusted source, like your bank. It even has an explicit disclaimer to not trust links in email, and directs you to copy text to your address bar in your browser. The text looks ok to you, so you won't realize that you are going to a completely different site, which is then set up to simulate your bank well enough to get your password.
These spoofs depend on the use of visually confusable strings:
D1. | Two different strings of Unicode characters are said to be visually confusable when their appearance in common fonts in small sizes at screen resolutions is sufficiently close that people easily mistake one for the other. |
There are no hard-and-fast rules for visual confusability: it is of course possible to make any characters look like any others with a suitably faulty font. "Small sizes at screen resolutions" here means fonts whose ascent + descent is from 9 to 12 pixels for most scripts, and somewhat larger for scripts where users typically have a larger font size, such as Japanese. Of course, at sufficiently small sizes, such as 4px, a great many characters would become confusable. In some cases sequences of characters can be used to spoof: for example, "rn" ("r" followed by "n") in many sans-serif fonts is visually confusable with "m". Where two different strings are essentially identical in most fonts at all sizes, they are called homographs. However, spoofing is not dependent on just homographs; if the visual appearance is close enough at small sizes, that can be sufficient to cause problems.
Note that characters are not visually confusable if the positioning of the glyph is sufficiently different. For example, foo·com (using the hyphenation point instead of the period) should be distinguishable from foo.com by the positioning of the dot (except in faulty fonts).
To a certain extent, the new forms of visual spoofing available with Unicode are a matter of degree and not kind. However, because of the very large number of Unicode characters (over 94,000 in the current version), the number of opportunities for visual spoofing is significantly larger than with a restricted character set such as ASCII.
For examples of visually confusable characters, see [confusables].
Spoofing is an especially important subject given the recent introduction of international domain names (IDN). There is a natural desire for people to see domain names in their own languages and writing systems; English speakers can understand this if they consider what it would be like if they always had to type web addresses with Russian characters! So IDN represents a very significant advance for most people in the world. The avoidance of spoofing vulnerability requires proper implementation in browsers and other programs, so as to minimize security risks without making the use of non-ASCII characters too onerous.
International domain names are, of course, not the only cases where visual spoofing can occur. For example, you might get a message asking you to allow the installation of software from "IBM", authenticated with the proper Verisign certificate, but the "M" character happens to be the Russian (Cyrillic) character that looks precisely like the English "M". Any place where strings are used as identifiers is subject to this kind of spoofing. For more information on identifiers, see UAX #31: Identifier and Pattern Syntax.
However, IDN provides a good starting point for a discussion of visual spoofing. The good news is that the design of IDN prevents a huge number of spoofing attacks. All conformant users of IDN are required to process domain names to convert compatibility-equivalent characters into a unique form; this processing eliminates most of the possibilities for visual spoofing by mapping away a large number of visually confusable characters and sequences. For example, Unicode contains the "ä" (a-umlaut) character, but also contains a free-standing umlaut ("¨") which can be used in combination with any character, including an "a". But the compatibility normalization will convert any sequence of "a" plus "¨" into the regular "ä".
Thus you can't spoof an a-umlaut with a + umlaut; it simply results in the same domain name. See example 1 below. The String column shows the actual characters; the UTF-16 shows the underlying encoding, while the IDNA column shows the IDNA format used to represent the string internally in International Domain Names.
Example | String | UTF-16 | IDNA |
---|---|---|---|
1a | ät.com | 0061 0308 0074 002E 0063 006F 006D | xn--t-zfa.com |
1b | ät.com | 00E4 0074 002E 0063 006F 006D | xn--t-zfa.com |
Note: The ICU demo at http://ibm.com/software/globalization/icu/demo/domain/ can be used to demonstrate the results of processing different domain names. That demo was also used to get the IDNA values shown here.
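The effect of normalization in example 1 can be reproduced with a short Python sketch. It assumes Python's standard `unicodedata` module and its built-in `idna` codec (which implements the IDNA 2003 processing, including the StringPrep-based normalization step); the helper name `same_after_nfkc` is invented for this illustration.

```python
import unicodedata

decomposed = "a\u0308t.com"   # 'a' + COMBINING DIAERESIS, as in example 1a
precomposed = "\u00e4t.com"   # precomposed a-umlaut, as in example 1b


def same_after_nfkc(first: str, second: str) -> bool:
    """Compare two strings after compatibility normalization (NFKC)."""
    return (unicodedata.normalize("NFKC", first)
            == unicodedata.normalize("NFKC", second))


# The raw strings differ, but they normalize to the same form ...
print(decomposed == precomposed)
print(same_after_nfkc(decomposed, precomposed))

# ... and therefore map to the same ASCII-compatible domain name.
print(decomposed.encode("idna") == precomposed.encode("idna"))
```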
The IDN processing also removes case distinctions by performing a case folding to reduce characters to a lowercase form. This is also useful for avoiding spoofing problems, since characters are generally more distinctive in their lowercase forms. That means that we can focus on just the lowercase characters.
For a list of allowable characters in IDN, see [idn-chars]. There are many misperceptions about which characters are allowed in IDN, so referencing this explicit list should be useful for dispelling some of them. The characters are those left after string processing has been performed, so case-folding and normalization have already been applied.
Although normalization and case-folding prevent many possible spoofing attacks, there remain many cases where visual spoofing can still occur with international domain names. Ideally, much of this would be handled on the registries' side instead of by user-agents (browsers, emailers, and other programs that display and process URLs). The registry has the most data available, and can process it most efficiently at the time of registration, using policies to reduce visual spoofing. For example, given confusable mapping data, the registry can easily determine if a proposed registration conflicts with an existing one; that is much more difficult for user-agents because of the sheer number of combinations that they would have to probe.
However:
So efforts need to be made on the part of user-agents as an additional line of defense.
Note: since the top-level domain names (TLD: .com, .ru, etc.) are currently always ASCII, all discussions below of domain names pertain to all but the top level.
Visually confusable characters are not usually unified across scripts. Thus a Greek omicron is encoded as a different character from the Latin "o", even though it is usually identical or nearly identical in appearance. There are good reasons for this: often the characters were separate in legacy encodings, and preservation of those distinctions was necessary for existing data to be mapped to Unicode without loss. Moreover, the characters generally have very different behavior: two visually confusable characters may be different in casing behavior, in category (letter vs. number), or in numeric value. After all, ASCII doesn't unify lowercase L and digit 1, even though those are visually confusable. Encoding the Cyrillic character б (corresponding to the letter "b") by using the numeral 6, would clearly have been a mistake, even though they are visually confusable.
However, the existence of these cases means that there is a significant number of spoofing possibilities using characters from different scripts. For example, a domain name can be spoofed by using a Greek omicron instead of an 'o', as in example 2a.
Example | String | UTF-16 | IDNA |
---|---|---|---|
2a | tοp.com | 0074 03BF 0070 002E 0063 006F 006D | xn--tp-jbc.com |
2b | tοp.com | 0074 006F 0070 002E 0063 006F 006D | top.com |
There are many legitimate uses of mixed scripts. Because of the prevalence of Latin characters, it is quite common, for example, to use English words (with Latin characters) in the middle of other languages using other scripts. For example, one could have XML-документы.com (which would be a site for "XML documents" in Russian). Even in English, legitimate product or organization names may contain non-Latin characters, such as Ωmega, Teχ, Toys-Я-Us, or HλLF-LIFE. The lack of IDNs in the past has also led some registries (such as the .ru TLD) to use Latin letters to create pseudo-Cyrillic names. For example, see http://caxap.ru/ (сахар means sugar in Russian).
The Unicode Standard supplies information that can be used for detecting mixed-script text: for more information, see UAX #24: Script Names.
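A minimal sketch of mixed-script detection follows. Python's standard library does not expose the Script property from UAX #24, so this example makes a loud simplifying assumption: it treats the first word of each letter's character name (for example "LATIN", "GREEK", "CYRILLIC") as a stand-in for its script. The function names are hypothetical, and a real implementation would use the actual Script property data instead.

```python
import unicodedata


def rough_scripts(label: str) -> set:
    """Approximate the set of scripts used by the letters in a label.

    ASSUMPTION: the first word of the Unicode character name is used as a
    proxy for the Script property; this is a heuristic, not UAX #24 data.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def is_mixed_script(label: str) -> bool:
    """True if the letters of the label appear to come from several scripts."""
    return len(rough_scripts(label)) > 1


print(rough_scripts("t\u03bfp"))   # Latin 't' and 'p' around a Greek omicron
print(is_mixed_script("top"))      # all Latin
# A legitimate mixed-script label is flagged too, so a flag is a reason
# for an alert or closer inspection, not proof of spoofing.
print(is_mixed_script("XML-\u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u044b"))
```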
Cyrillic and Latin represent special challenges, since the number of common glyphs shared between them is so high, as can be seen from [idn-chars]. It may be possible to compose an entire domain name (except the TLD) in Cyrillic using letters that will be essentially always identical in form to Latin letters, such as "scope.com": with "scope" in Cyrillic looking just like "scope" in Latin. These are called whole-script confusables.
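Mixed-script detection cannot catch a whole-script confusable, because only one script is present. One possible approach, sketched below, is to map each character to a representative look-alike and compare the results. The mapping table here is a tiny hand-written illustration, not data taken from [confusables], and the names `CYRILLIC_TO_LATIN`, `confusable_key`, and `conflicts` are hypothetical.

```python
# Illustrative mapping only: a few Cyrillic letters and the Latin letters
# they resemble. This is NOT the content of confusables.txt.
CYRILLIC_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
}


def confusable_key(label: str) -> str:
    """Replace each mapped character by its look-alike; keep the rest."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in label)


def conflicts(candidate: str, registered: str) -> bool:
    """True if two different labels collapse to the same confusable key."""
    return (candidate != registered
            and confusable_key(candidate) == confusable_key(registered))


cyrillic_scope = "\u0455\u0441\u043e\u0440\u0435"  # looks like "scope"
print(conflicts(cyrillic_scope, "scope"))
print(conflicts("scale", "scope"))
```

A registry holding the full list of existing registrations could store the key for each one and look up a proposed name's key directly, rather than probing every possible combination.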
While compatibility normalization and mixed-script detection can handle the vast majority of cases, there are other visual confusables that could cause problems. With fonts increasingly able to handle international characters, and especially with smaller font sizes in the context of an address bar, these visual confusables could be used to spoof. Importantly, these problems can be illustrated with common, widely available fonts on widely available operating systems; this is not pointing a finger at any one vendor.
Consider the following examples, all in the same script. In each numbered case, in commonly available browsers, the strings will look identical.
Example | String | UTF-16 | IDNA |
---|---|---|---|
3a | a‐b.com | 0061 2010 0062 002E 0063 006F 006D | xn--ab-v1t.com |
3b | a-b.com | 0061 002D 0062 002E 0063 006F 006D | a-b.com |
4a | so̷s.com | 0073 006F 0337 0073 002E 0063 006F 006D | xn--sos-rjc.com |
4b | søs.com | 0073 00F8 0073 002E 0063 006F 006D | xn--ss-lka.com |
5a | z̵o.com | 007A 0335 006F 002E 0063 006F 006D | xn--zo-pyb.com |
5b | ƶo.com | 01B6 006F 002E 0063 006F 006D | xn--o-zra.com |
6a | an͂o.com | 0061 006E 0342 006F 002E 0063 006F 006D | xn--ano-0kc.com |
6b | año.com | 0061 00F1 006F 002E 0063 006F 006D | xn--ao-zja.com |
7a | Đo.org | 0110 006F 002E 006F 0072 0067 | xn--o-kia.org |
7b | Ɖo.org | 0189 006F 002E 006F 0072 0067 | xn--o-40a.org |
An additional problem arises when a font and/or rendering engine has inadequate support for certain sequences of characters. These are characters that should be visually distinguishable, but don't appear that way. In example 8a, the a-umlaut is followed by another umlaut. The Unicode Standard guidelines indicate that the second umlaut should be 'stacked' above the first, producing a distinct visual difference. But as this example shows, common fonts will simply superimpose the second umlaut; and if the positioning is close enough, the user will not see a difference between 8a and 8b.
Example | String | UTF-16 | IDNA |
---|---|---|---|
8a | ä̈t.com | 00E4 0308 0074 002E 0063 006F 006D | xn--t-zfa85n.com |
8b | ät.com | 00E4 0074 002E 0063 006F 006D | xn--t-zfa.com |
9a | eḷ.com | 0065 006C 0323 002E 0063 006F 006D | xn--e-zom.com |
9b | ẹl.com | 0065 0323 006C 002E 0063 006F 006D | xn--l-ewm.com |
9c | ẹl.com | 1EB9 006C 002E 0063 006F 006D | xn--l-ewm.com |
In example 9, we have an even worse case. The underdot character in 9a is actually under the 'l', but in many fonts, it appears as under the 'e'! It is thus visually confusable with 9b (where the underdot is under the e) or the equivalent normalized form 9c.
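The following Python sketch shows one way a program might flag strings like example 8a, and confirms the normalization relationship in example 9. It assumes the standard `unicodedata` module; the function `has_repeated_mark` is a hypothetical helper written for this illustration. It does not address the rendering problem of example 9a, which depends on the font rather than on the character sequence.

```python
import unicodedata


def has_repeated_mark(text: str) -> bool:
    """True if the same combining mark occurs twice in a row after NFD.

    Such sequences are the ones a weak font may superimpose, making the
    string look like one that carries the mark only once.
    """
    previous = None
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and ch == previous:
            return True
        previous = ch
    return False


print(has_repeated_mark("\u00e4\u0308t.com"))  # example 8a: a-umlaut + umlaut
print(has_repeated_mark("\u00e4t.com"))        # example 8b: a-umlaut only

# Example 9: 9b and 9c are the same string after normalization, while 9a
# (underdot on the 'l') remains a different string.
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc("e\u0323l.com") == nfc("\u1eb9l.com"))
print(nfc("el\u0323.com") == nfc("\u1eb9l.com"))
```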
Spoofing syntax characters can be even worse than regular characters. For example, U+2044 '⁄' FRACTION SLASH can look like '/' in many fonts (ideally the spacing and angle are sufficiently different as to be distinguishable, but this is not always maintained). This allows http://www.example.org/not.mydomain.com to pretend to be in the example.org domain, whereas it is actually the subzone www.example.org/not in the domain mydomain.com. Thus anything that is visually similar to '.', '/', or '#' is especially dangerous. Most of these cases, such as U+2024 (·) ONE DOT LEADER, are disallowed by StringPrep, but not all.
It is important also not to show a missing glyph or character with a simple "?", since that makes every such character be visually confusable with a real question mark. Instead, follow the Unicode guidelines for displaying missing glyphs using a rounded-rectangle, as described in Section 5.3 Unknown and Missing Characters of [Unicode]. For examples of this, see also [Charts].
Turning away from IDN for a moment, there is another area where visual spoofs can be used. Many scripts have sets of decimal digits that are different in shape than the typical European digits {0 1 2 3 4 5 6 7 8 9}. For example, Bengali has {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while Oriya has {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}. While the sets taken as a whole are different in shape, individual digits may have the same shapes as digits from other scripts, even digits of different values. For example, the string ৪୨ is visually confusable with 89 (at small sizes), but actually has the numeric value 42! Where software simply interprets the numeric value of a string of digits, without detecting that the digits are from different scripts, it is possible to generate such spoofs.
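The digit problem is easy to reproduce in Python, whose built-in `int()` accepts decimal digits from any script. The sketch below also shows one possible check: since each script's decimal digits occupy ten consecutive code points, subtracting a digit's value from its code point gives the code point of that script's zero, and digits from different scripts give different zeros. The helper names `digit_zero_bases` and `has_mixed_digit_sets` are hypothetical.

```python
import unicodedata


def digit_zero_bases(text: str) -> set:
    """Collect the code point of the 'zero' of each decimal digit's set."""
    bases = set()
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        if value is not None:
            bases.add(ord(ch) - value)
    return bases


def has_mixed_digit_sets(text: str) -> bool:
    """True if the decimal digits in the text come from more than one set."""
    return len(digit_zero_bases(text)) > 1


spoof = "\u09ea\u0b68"  # BENGALI DIGIT FOUR followed by ORIYA DIGIT TWO

print(int(spoof))                   # interpreted numerically as 42
print(has_mixed_digit_sets(spoof))  # the digits come from two scripts
print(has_mixed_digit_sets("89"))   # plain European digits
```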
We are in the process of gathering data that would allow for a finer-grained approach, but until such time as that is more comprehensive, we'd recommend having a more conservative stance. It is always easier to widen restrictions than narrow them. We do expect these recommendations to be refined over time.
Some people have proposed prevention of spoofing by restricting domain names according to language. In practice, that is very problematic. It is very difficult to determine the intended language of many terms, especially product or company names, which are often constructed to be neutral regarding language. Moreover, languages tend to be quite fluid; foreign words are continually being adopted. Except for registries with very special policies (such as the blocking used by some East Asian registries, as described in RFC 3743), the language association doesn't make too much sense.
Instead, what is recommended is a combination of string preprocessing to remove basic equivalences, promoting adequate rendering support, and putting restrictions in place according to script and restricting by confusable characters. While the ICANN guidelines say "top-level domain registries will (a) associate each registered internationalized domain name with one language or set of languages" (http://www.icann.org/general/idn-guidelines-20jun03.htm), that is better interpreted as limiting to script rather than language.
In the following, "appropriate alerts" are recommended. The form of such alerts could be minimal, such as special coloring or icons (perhaps with a tool-tip for more information), or more in-your-face, such as an alert dialog describing the issue and requiring user confirmation before continuing. The strength of the alert can be scaled according to the level of the potential problem. The user-agent could also remember when the user has accepted an alert, for say Ωmega.com, and permit future access without bothering the user again with an alert.
The term "Registry" is to be interpreted broadly. The .com operator
can impose restrictions on the 2nd level domain label,
but if someone registers foo.com, then it's up to them to decide what will be allowed at the 3rd
level (e.g. bar.foo.com). So for that purpose, the owner of foo.com is treated as the "registry"
for the 3rd level (the 'bar'). The term "Registrant" is used to refer to someone applying to a
registry for a domain name.
Also see the security discussions in [IRI], [URI],
and [StringPrep].
Note to Reviewers: To Do:
Give more background as to why normalization fixes certain problems, and which it does not fix. Describe how implementations of normalization can use small data set limited to only supported characters. Describe the recommended use of normalization in non-domain part of URL.
Describe BIDI spoofs. Use material from Michel's slides. Show how reverse-bidi (visual order -> storage order) can be used to detect bidi spoofs. That is: one can apply bidi then reverse bidi: if the result does not match the original, then reject the string.
Explain that private use characters can cause security problems, and recommend strongly against their use.
Describe cases in complex languages (eg Indic) where the same visual appearance may result from two different underlying character sequences, in the right context.
Add information on spoofs that only work with contextual scripts, such as Arabic.
Discuss security issues in Collation (sorting, searching, matching)
Describe how TrueType/OpenType fonts can be used in spoofing: fonts are actually programs that can deform glyph shapes radically according to resolution, platform, or language. For example $100.00 could appear as $200.00 when printed.
Discuss SSL and how root Certificate Authorities can be a problem, but are also part of the solution; most customers would quickly lose faith in internet financial transactions if SSL/https could be easily compromised.
Add other applications of visual spoofing, aside from the example of IDN. International domain names are actually in much better shape than many other areas, since the problem will be much more severe in any area where text is not normalized. So focus on those issues.
Discuss Unicode properties. Eg more characters have numeric properties than developers might expect.
Discuss Use of Regular Expressions in validating data — ensuring that the Regular Expression Engine follows the Unicode Guidelines, but also that use of regular expressions makes use of properties rather than fixed lists of characters.
Discuss and/or point to other items:
There are three data files currently associated with this document.
[idn-chars] | IDN Characters:
idn-chars.html Lists all the possible IDN chars, after StringPrep is performed. Contains all the possible characters, sorted by script, then whether atomic (non-decomposable) or decomposable, then according to UCA collation order. Scripts that are bicameral (have both upper and lower cases) are further divided into two sets, based on whether letters have both cases or not. If your browser supports tool-tips, hovering the mouse over any character will show its name and code point. |
[confusables] |
Visually Confusable Characters:
confusables.txt The format and usage of the file are described in the file header. Note: we are just starting the project of collecting this data, and examining the feasibility of different approaches, so we have just begun to gather data in this file. |
[special] | Special-Purpose Characters:
special_purpose.txt Characters that are not in common modern use. Note: we are just starting the project of collecting this data, and examining the feasibility of different approaches, so we have just begun to gather data in this file. |
Steven Loomis and other people on the ICU team were very helpful in developing the original proposal for this technical report. Thanks also to the following people for their feedback or contributions to this document or earlier versions of it: Martin Dürst, Paul Hoffman, Eric Muller, and especially Michel Suignard. This document also draws on examples or ideas suggested in email discussions from Alexander Savenkov, Eric van der Poel, and others.
To Do: comb through the text and convert the references to the standard form.
[CharMod] | Character Model for the World Wide
Web 1.0: Fundamentals http://www.w3.org/TR/charmod/ |
[Charts] | Unicode Charts http://www.unicode.org/charts/ |
[Display] | Display Problems? http://www.unicode.org/help/display_problems.html |
[IRI] | RFC 3987 Internationalized
Resource Identifiers (IRIs). M. Duerst, M. Suignard. January 2005. http://ietf.org/rfc/rfc3987.txt |
[Feedback] | Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html |
[Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[StringPrep] | RFC 3454 Preparation of
Internationalized Strings ("stringprep"). P. Hoffman, M. Blanchet. December 2002. http://ietf.org/rfc/rfc3454.txt |
[UCD] | Unicode Character Database. http://www.unicode.org/ucd For an overview of the Unicode Character Database and a list of its associated files |
[Unicode] | The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1. |
[URI] | RFC 3986 Uniform Resource
Identifier (URI): Generic Syntax. T. Berners-Lee, R. Fielding, L. Masinter. January 2005. http://ietf.org/rfc/rfc3986.txt |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/standard/versions For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports. |
The following summarizes modifications from the previous revision of this document.
Revision 2:
Revision 1:
Copyright © 2004-2005 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.