Technical Reports |
Revision | 3 (working draft $Revision: 1.16 $) |
Authors | Mark Davis (mark.davis@us.ibm.com) |
Date | $Date: 2005/05/09 00:23:51 $ |
This Version | http://www.unicode.org/reports/tr36/tr36-2.html |
Previous Version | http://www.unicode.org/reports/tr36/tr36-1.html |
Latest Version | http://www.unicode.org/reports/tr36/ |
Because Unicode contains such a large number of characters, and because it incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This document describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account, and provides specific recommendations to reduce the risk of problems.
New text is marked with this style. This draft has rearranged a number of pieces of text -- that rearrangement is not marked.
Review notes appear with this style.
This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
Based on feedback, the title was shortened to "Unicode Security Considerations".
Unicode represents a very significant advance over all previous methods of encoding characters. For the first time, all of the world's characters can be represented in a uniform manner, making it feasible for the vast majority of programs to be globalized: built to handle any language in the world.
In many ways, the use of Unicode makes programs much more robust and secure. When systems used a hodge-podge of different charsets for representing characters, it was possible to take advantage of differences between those charsets, or in the way in which programs converted to and from them.
But because Unicode contains such a large number of characters, and because it incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This document describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account.
An example is visual spoofing, where a similarity in visual appearance fools a user, and causes him or her to take unsafe actions. Suppose that you get an email notifying you that your paypal.com account has a problem. You, being a security-savvy user, realize that it might be a spoof; the HTML email you get might be presenting the URL http://paypal.com/... visually, but might be hiding the real URL. You realize that even what shows up in the status bar might be a lie, since clever JavaScript or ActiveX can work around that. (And users may have these turned on unless they know to turn them off.) So you click on the link, and carefully examine your browser's address box to make sure that it is actually going to http://paypal.com/...; and according to what you see it is. But actually it is going to a spoof site that has a fake "paypal.com", using the Cyrillic letter that looks precisely like a 'p'. You use the site without suspecting, and your password ends up compromised.
This problem is not new to Unicode: it was possible to spoof even with ASCII characters alone. For example, "inteI.com" uses a capital I instead of an L. The infamous example here is of course "paypaI.com":
... Not only was "Paypai.com" very convincing, but the scam artist even goes one step further. He or she is apparently emailing PayPal customers, saying they have a large payment waiting for them in their account.
The message then offers up a link, urging the recipient to claim the funds. But the URL that is displayed for the unwitting victim uses a capital "i" (I), which looks just like a lowercase "L" (l), in many computer fonts. ...
Some browsers prevent this spoof by lowercasing domain names, but others do not.
Thus to a certain extent, the new forms of visual spoofing available with Unicode are a matter of degree and not kind. However, because of the very large number of Unicode characters (over 96,000 in the current version), the number of opportunities for visual spoofing is significantly larger than with a restricted character set such as ASCII.
We anticipate that this document will grow over time, adding additional sections as needed. Initially, it is organized into two sections: visual security issues and non-visual security issues. For more information, see also the Unicode FAQ on Security Issues.
Each section presents background information on the kinds of problems that can occur, then lists specific recommendations for reducing the risk of such problems.
Some of the examples below use Unicode characters which some browsers will not show, or may not show in a way that illustrates the problem. For more information about improving the display, see [Display]. In the final version, we'll prepare GIFs for the characters where necessary.
Visual spoofs depend on the use of visually confusable strings: two different strings of Unicode characters whose appearance in common fonts in small sizes at screen resolutions is sufficiently close that people easily mistake one for the other.
There are no hard-and-fast rules for visual confusability: it is of course possible to make any characters look like any others with a suitably faulty font. "Small sizes at screen resolutions" means fonts whose ascent + descent is from 9 to 12 pixels for most scripts, and somewhat larger for scripts where the font size users typically have is larger, such as Japanese. Of course, at sufficiently small sizes, such as 4 pixels for ascent + descent, a great many characters would become confusable. In some cases sequences of characters can be used to spoof: for example, "rn" ("r" followed by "n") is visually confusable with "m" in many sans-serif fonts.
Where two different strings are essentially identical in most fonts at all sizes, they are called homographs. However, spoofing is not dependent on just homographs; if the visual appearance is close enough at small sizes, that can be sufficient to cause problems. Note that some people use the term homograph broadly, encompassing all visual confusables.
Note that characters are not visually confusable if the positioning of the glyph is sufficiently different. For example, foo·com (using the hyphenation point instead of the period) should be distinguishable from foo.com by the positioning of the dot (except in faulty fonts). For examples of visually confusable characters, see [confusables].
Visual spoofing is an especially important subject given the recent introduction of international domain names (IDN). There is a natural desire for people to see domain names in their own languages and writing systems; English speakers can understand this if they consider what it would be like if they always had to type web addresses with Russian characters! So IDN represents a very significant advance for most people in the world. The avoidance of spoofing vulnerabilities requires proper implementation in browsers and other programs, to minimize security risks while still allowing for effective use of non-ASCII characters.
International domain names are, of course, not the only cases where visual spoofing can occur. For example, you might get a message asking you to allow the installation of software from "IBM", authenticated with the proper Verisign certificate, but the "М" character happens to be the Russian (Cyrillic) character that looks precisely like the English "M". Any place where strings are used as identifiers is subject to this kind of spoofing. For more information on identifiers, see UAX #31: Identifier and Pattern Syntax.
However, IDN provides a good starting point for a discussion of visual spoofing. Fortunately the design of IDN prevents a huge number of spoofing attacks. All conformant users of IDN are required to process domain names to convert what are called compatibility-equivalent characters into a unique form using a process called compatibility normalization (NFKC); for more information on this, see [UAX15]. This processing eliminates most of the possibilities for visual spoofing by mapping away a large number of visually confusable characters and sequences. For example, Unicode contains the "ä" (a-umlaut) character, but also contains a free-standing umlaut (" ̈") which can be used in combination with any character, including an "a". But the compatibility normalization will convert any sequence of "a" plus " ̈" into the regular "ä". It will also convert characters like the half-width Japanese katakana character ｶ (U+FF76) to the regular character カ (U+30AB), and single ligature characters like "ﬁ" (U+FB01) to the regular characters "fi".
Thus you cannot spoof an a-umlaut with a + umlaut; it simply results in the same domain name. See the example Safe Domain Names below. The String column shows the actual characters; the UTF-16 column shows the underlying encoding, while the IDN Internal column shows the IDNA format used to represent the string internally in International Domain Names.
# | String | UTF-16 | IDN Internal | Comments |
---|---|---|---|---|
1a | ät.com | 0061 0308 0074 002E 0063 006F 006D | xn--t-zfa.com | Uses the decomposed form, a + umlaut |
1b | ät.com | 00E4 0074 002E 0063 006F 006D | xn--t-zfa.com | But it ends up being identical to the composed form, in IDNA |
Note: The ICU demo at [IDN-Demo] can be used to demonstrate the results of processing different domain names. That demo was also used to get the IDNA values shown here.
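The effect of compatibility normalization on rows 1a and 1b can be reproduced with a short sketch. It assumes Python's standard `unicodedata` module and the built-in `idna` codec (which implements the 2003 version of IDNA); the helper name `to_idna` is introduced here purely for illustration.

```python
import unicodedata

def to_idna(name):
    """Convert a domain name to its internal (ASCII) IDNA form.

    Assumption: Python's built-in "idna" codec stands in for a
    conformant IDNA implementation.
    """
    return name.encode("idna")

decomposed = "a\u0308t.com"  # row 1a: a + COMBINING DIAERESIS
composed = "\u00e4t.com"     # row 1b: precomposed a-umlaut

# NFKC folds the decomposed spelling into the composed one.
print(unicodedata.normalize("NFKC", decomposed) == composed)

# Compatibility characters are mapped to their regular forms as well.
print(unicodedata.normalize("NFKC", "\uff76") == "\u30ab")  # half-width KA
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")      # fi ligature

# Both spellings end up as the same internal domain name.
print(to_idna(decomposed) == to_idna(composed))
```

Because both spellings collapse to one internal name, registering the decomposed form cannot produce a second, look-alike domain.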
Similarly, for most scripts, when the text is normalized, two accents that don't interact typographically are put into a determinate order. Thus the sequence <x, dot_above, dot_below> is reordered as <x, dot_below, dot_above>. This ensures that the two sequences that look identical (ẋ̣ and ẋ̣̇) have the same representation.
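The reordering of non-interacting accents can be checked directly. The following is a minimal sketch using Python's `unicodedata` module; the constant and function names are illustrative only.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

def canonical_order(text):
    """Return the canonically decomposed and reordered form (NFD)."""
    return unicodedata.normalize("NFD", text)

above_first = "x" + DOT_ABOVE + DOT_BELOW
below_first = "x" + DOT_BELOW + DOT_ABOVE

# Marks are sorted by combining class, so the lower class (220) comes first
# regardless of the order in which the marks were typed.
print(canonical_order(above_first) == canonical_order(below_first))
```

The two inputs differ as code point sequences but are indistinguishable after normalization, which is what removes this particular spoofing opportunity.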
The IDN processing also removes case distinctions by performing a case folding to reduce characters to a lowercase form. This is also useful for avoiding spoofing problems, since characters are generally more distinctive in their lowercase forms. That means that we can focus on just the lowercase characters.
For a list of allowable characters in IDN, see [idn-chars]. There are many misperceptions about which characters are allowed in IDN, so referencing this explicit list should help dispel some of them.
Note: Users expect diacritical marks to distinguish domain names. For example, the domain names "resume.com" and "résumé.com" are (and should be) distinguished. In languages where the spelling may allow certain words with and without diacritics, two domain names would need to be registered (just as one may register both "analyze.com" and "analyse.com"). However, if the practice of dropping diacriticals is widespread in a particular language, a registry may want to pay attention to this.
Although normalization and case-folding prevent many possible spoofing attacks, visual spoofing can still occur with many international domain names. Some of this can be handled on the registry side instead of the user-agent side (browsers, emailers, and other programs that display and process URLs). The registry has the most data available about alternative registered names, and can process that information the most efficiently at the time of registration, using policies to reduce visual spoofing. For example, given confusable mapping data, the registry can easily determine if a proposed registration conflicts with an existing one; that is much more difficult for user agents because of the sheer number of combinations that they would have to probe.
However:
So efforts need to be made on the part of user-agents as an additional line of defense.
Note: since the top-level domain names (TLD: .com, .ru, etc.) are currently always ASCII, all discussions below of the domain names pertain to all but the top level.
Visually confusable characters are not usually unified across scripts. Thus a Greek omicron is encoded as a different character from the Latin "o", even though it is usually identical or nearly identical in appearance. There are good reasons for this: often the characters were separate in legacy encodings, and preservation of those distinctions was necessary for existing data to be mapped to Unicode without loss. Moreover, the characters generally have very different behavior: two visually confusable characters may be different in casing behavior, in category (letter versus number), or in numeric value. After all, ASCII does not unify lowercase letter l and digit 1, even though those are visually confusable. (Many fonts always distinguish them, but many do not.) Encoding the Cyrillic character б (corresponding to the letter "b") by using the numeral 6, would clearly have been a mistake, even though they are visually confusable.
However, the existence of visually confusable characters across scripts means that there is a significant number of spoofing possibilities using characters from different scripts. For example, a domain name can be spoofed by using a Greek omicron instead of an 'o', as in example 2a.
# | String | UTF-16 | IDN Internal | Comments |
---|---|---|---|---|
2a | tοp.com | 0074 03BF 0070 002E 0063 006F 006D | xn--tp-jbc.com | Uses a Greek omicron in place of the o |
2b | tοp.com | 0074 006F 0070 002E 0063 006F 006D | top.com |
There are many legitimate uses of mixed scripts. For example, the prevalence of Latin characters means that it is quite common to use English words (with Latin characters) in the middle of other languages using other scripts. For example, one could have XML-документы.com (which would be a site for "XML documents" in Russian). Even in English, legitimate product or organization names may contain non-Latin characters, such as Ωmega, Teχ, Toys-Я-Us, or HλLF-LIFE. The lack of IDNs in the past has also led some registries, such as the .ru (Russian) top-level domain, to use Latin characters to create pseudo-Cyrillic names. For example, see http://caxap.ru/ (сахар means sugar in Russian).
The Unicode Standard supplies information that can be used for detecting mixed-script text. For more information, see [UAX24].
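As a rough illustration of mixed-script detection, the sketch below guesses a script from the first word of each letter's Unicode character name. This is only a heuristic chosen because Python's standard library does not expose the Script property; a real implementation would use the data described in [UAX24]. The function name `rough_scripts` is hypothetical.

```python
import unicodedata

def rough_scripts(label):
    """Approximate the set of scripts used by the letters in a label.

    Heuristic (an assumption, not the UAX #24 algorithm): the first word of
    a letter's character name, such as LATIN, GREEK, or CYRILLIC, is taken
    as its script. Non-letters such as '-' and digits are ignored.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

print(rough_scripts("t\u03bfp"))  # 'top' spelled with a Greek omicron
print(rough_scripts("top"))       # plain Latin
```

A label that reports more than one script is not necessarily a spoof, as the legitimate mixed-script names above show, but it is a useful signal for closer inspection.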
Cyrillic and Latin represent special challenges, since the number of common glyphs shared between them is so high, as can be seen from [idn-chars]. It may be possible to compose an entire domain name (except the TLD) in Cyrillic using letters that will be essentially always identical in form to Latin letters, such as "scope.com": with "scope" in Cyrillic looking just like "scope" in Latin. These are called whole-script confusables.
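Whole-script confusables pass a mixed-script check, so catching them requires confusable mapping data. The sketch below shows the idea with a tiny hand-picked table; the table contents and the names `skeleton` and `conflicts` are assumptions made for illustration and are not the contents or format of [confusables].

```python
# Hypothetical, hand-picked subset of a confusable mapping (illustration only).
CONFUSABLE_TO_LATIN = {
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}

def skeleton(label):
    """Map each character to a representative look-alike."""
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in label)

def conflicts(proposed, registered):
    """Return registered labels that look like the proposed one."""
    target = skeleton(proposed)
    return [name for name in registered
            if name != proposed and skeleton(name) == target]

cyrillic_scope = "\u0455\u0441\u043e\u0440\u0435"  # all-Cyrillic "scope"
print(conflicts(cyrillic_scope, ["scope", "top"]))
```

A registry can run such a comparison once at registration time against its existing names, which is far cheaper than a user agent probing every possible look-alike.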
The use of characters entirely within one script, or using characters that are common across scripts is called in-script spoofing, and the strings that cause it are correspondingly called in-script confusables. While compatibility normalization and mixed-script detection can handle the majority of cases, they do not handle in-script confusables. Especially at the smaller font sizes in the context of an address bar, any visual confusables within a single script can be used in spoofing. Importantly, these problems can be illustrated with common, widely available fonts on widely available operating systems — the problems are not specific to any single vendor.
Consider the following examples, all in the same script. In each numbered case, in commonly available browsers, the strings will look identical or close to identical.
# | String | UTF-16 | IDN Internal | Comments |
---|---|---|---|---|
3a | a‐b.com | 0061 2010 0062 002E 0063 006F 006D | xn--ab-v1t.com | Uses a real hyphen, instead of the ASCII hyphen-minus |
3b | a-b.com | 0061 002D 0062 002E 0063 006F 006D | a-b.com | |
4a | so̷s.com | 0073 006F 0337 0073 002E 0063 006F 006D | xn--sos-rjc.com | Uses o + combining slash |
4b | søs.com | 0073 00F8 0073 002E 0063 006F 006D | xn--ss-lka.com | |
5a | z̵o.com | 007A 0335 006F 002E 0063 006F 006D | xn--zo-pyb.com | Uses z + combining bar |
5b | ƶo.com | 01B6 006F 002E 0063 006F 006D | xn--o-zra.com | |
6a | an͂o.com | 0061 006E 0342 006F 002E 0063 006F 006D | xn--ano-0kc.com | Uses n + greek perispomeni |
6b | año.com | 0061 00F1 006F 002E 0063 006F 006D | xn--ao-zja.com | |
7a | ʣe.org | 02A3 0065 002E 006F 0072 0067 | xn--e-j5a.org | Uses d-z digraph |
7b | dze.org | 0064 007A 0065 002E 006F 0072 0067 | dze.org |
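That normalization does not merge the pairs in examples 3 through 7 can be verified with a small sketch. It uses Python's `unicodedata` module with the code points from the UTF-16 column of the table; the helper name `still_distinct` is illustrative.

```python
import unicodedata

# (a-form, b-form) for examples 3 to 7, without the TLD.
PAIRS = [
    ("a\u2010b", "a-b"),        # 3: HYPHEN vs. HYPHEN-MINUS
    ("so\u0337s", "s\u00f8s"),  # 4: o + combining slash vs. o-slash
    ("z\u0335o", "\u01b6o"),    # 5: z + combining bar vs. z with stroke
    ("an\u0342o", "a\u00f1o"),  # 6: n + Greek perispomeni vs. n-tilde
    ("\u02a3e", "dze"),         # 7: dz digraph vs. d + z
]

def still_distinct(a, b):
    """True if two labels remain different after NFKC and case folding."""
    def fold(s):
        return unicodedata.normalize("NFKC", s).casefold()
    return fold(a) != fold(b)

for a, b in PAIRS:
    print(a, b, still_distinct(a, b))
```

Each pair survives normalization as two different strings, so a further layer of checking, such as confusable data, is needed to catch them.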
There are other examples where sequences cause problems. The example of "rn" is mentioned above; there are other examples in other scripts. For example, the sequence अ + ा typically looks identical to आ.
As mentioned above, in most cases two sequences of accents that have the same visual appearance are put into a canonical order. This does not happen, however, for certain scripts used in Southeast Asia, so reordering characters may be used for spoofs in those cases.
[TBD add example]
An additional problem arises when a font and/or rendering engine has inadequate support for certain sequences of characters. These are characters that should be visually distinguishable, but do not appear that way. In example 8a, the a-umlaut is followed by another umlaut. The Unicode Standard guidelines indicate that the second umlaut should be 'stacked' above the first, producing a distinct visual difference. But as this example shows, common fonts will simply superimpose the second umlaut; and if the positioning is close enough, the user will not see a difference between 8a and 8b.
# | String | UTF-16 | IDN Internal | Comments |
---|---|---|---|---|
8a | ä̈t.com | 00E4 0308 0074 002E 0063 006F 006D | xn--t-zfa85n.com | a-umlaut + umlaut |
8b | ät.com | 00E4 0074 002E 0063 006F 006D | xn--t-zfa.com | |
9a | eḷ.com | 0065 006C 0323 002E 0063 006F 006D | xn--e-zom.com | Has a dot under the l; may appear under the e |
9b | ẹl.com | 0065 0323 006C 002E 0063 006F 006D | xn--l-ewm.com | |
9c | ẹl.com | 1EB9 006C 002E 0063 006F 006D | xn--l-ewm.com |
In example 9, we have an even worse case. The underdot character in 9a is actually under the 'l', but in many fonts, it appears as under the 'e'! It is thus visually confusable with 9b (where the underdot is under the e) or the equivalent normalized form 9c.
Spoofing syntax characters can be even worse than regular characters. For example, U+2044 ( ⁄ ) FRACTION SLASH can look like a regular ASCII '/' in many fonts (ideally the spacing and angle are sufficiently different to be distinguishable, but this is not always maintained). This allows the following name:
http://example.org/not.mydomain.com
to pretend to be a subdomain in
http://example.org
whereas it is actually the subzone "example.org/not" in the domain
http://mydomain.com
Thus anything that is visually similar to '.', '/', '#', is especially dangerous. Most of these cases, such as U+2024 (·) ONE DOT LEADER are disallowed by StringPrep [RFC3454], but not all.
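A user agent could scan for such characters before display. The following is a minimal sketch; the list of look-alikes is a small illustrative assumption, not a complete inventory, and the function name is hypothetical.

```python
import unicodedata

# Illustrative subset of characters that resemble URL syntax characters.
SYNTAX_LOOKALIKES = {
    "\u2044": "/",  # FRACTION SLASH
    "\u2215": "/",  # DIVISION SLASH
    "\u2024": ".",  # ONE DOT LEADER
}

def find_syntax_lookalikes(text):
    """Return (index, character name, imitated ASCII character) triples."""
    return [
        (i, unicodedata.name(ch), SYNTAX_LOOKALIKES[ch])
        for i, ch in enumerate(text)
        if ch in SYNTAX_LOOKALIKES
    ]

print(find_syntax_lookalikes("example.org\u2044not.mydomain.com"))
```

Any hit inside a host name is a strong reason to alert the user, since the apparent path separator is really part of a domain label.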
Of course, this approach can also work even without IDN, where the user is fooled into thinking that the domain name is the first part of the URL, not where it actually is. For example, in the following the real domain name, mydomain.com, is also obscured for the casual user, who may not realize that -- does not terminate the domain name.
http://example.org--long-and-obscure-list-of-characters.mydomain.com
In retrospect, it would have been much better if domain names were customarily written with "most significant part first". The following hypothetical display would be harder to spoof: the fact that it is "com.mydomain" is not as easily lost.
http://com.mydomain.org/not.example
http://com.mydomain.org--long-and-obscure-list-of-characters.example
But that would be an impossible change at this point: those horses have long since left the barn. However, a possible solution is to always visually distinguish the second-level domain, for example:
http://example.org
http://mydomain.com
http://example.org/not.mydomain.com
http://example.org--long-and-obscure-list-of-characters.mydomain.com
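One way to pick out the part of the host name to emphasize is sketched below. It makes the simplifying assumption that the last two labels form the second-level domain, which does not hold for every registry, and it marks the emphasized part with square brackets in place of visual styling. The function name is hypothetical.

```python
from urllib.parse import urlsplit

def highlight_domain(url):
    """Return the host name with its last two labels wrapped in brackets.

    Assumption: the second-level domain is always the last two labels.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    core = ".".join(labels[-2:])
    prefix = host[: len(host) - len(core)]
    return f"{prefix}[{core}]"

print(highlight_domain("http://example.org/not"))
print(highlight_domain(
    "http://example.org--long-and-obscure-list-of-characters.mydomain.com"))
```

With the emphasis in place, the misleading "example.org" prefix in the second URL is visibly outside the highlighted domain.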
It is important also not to show a missing glyph or character with a simple "?", since that makes every such character be visually confusable with a real question mark. Instead, follow the Unicode guidelines for displaying missing glyphs using a rounded-rectangle, as described in Section 5.3 Unknown and Missing Characters of [Unicode]. For examples of this, see also [Charts].
Turning away from IDN for a moment, there is another area where visual spoofs can be used. Many scripts have sets of decimal digits that are different in shape from the typical European digits {0 1 2 3 4 5 6 7 8 9}. For example, Bengali has {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while Oriya has {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}. While the sets taken as a whole are different in shape, individual digits may have the same shapes as digits from other scripts, even digits of different values. For example, the string ৪୨ is visually confusable with 89 (at small sizes), but actually has the numeric value 42! Where software interprets the numeric value of a string of digits without detecting that the digits are from different scripts, it is possible to generate such spoofs.
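The digit example can be reproduced in a few lines. The sketch below relies on the fact that Python's `int()` accepts decimal digits from any script, and identifies each digit's family by the code point of its zero; the function names are illustrative.

```python
import unicodedata

def digit_families(text):
    """Return the code point of DIGIT ZERO for each digit family in text."""
    return {ord(ch) - unicodedata.decimal(ch) for ch in text if ch.isdecimal()}

def parse_single_family_int(text):
    """Parse an integer, rejecting digits drawn from more than one script."""
    if len(digit_families(text)) > 1:
        raise ValueError("digits from more than one script")
    return int(text)

spoof = "\u09ea\u0b68"  # BENGALI DIGIT FOUR + ORIYA DIGIT TWO

print(int(spoof))             # lenient parsing accepts the mixed string
print(digit_families(spoof))  # two different digit families
```

The lenient parse silently yields a value the user did not see, while the stricter function refuses the mixed string.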
Many opportunities for spoofing can be removed by using a case-folded format. This format, defined by the Unicode Standard, produces a string that only contains lowercase characters where possible.
However, there is one particular situation where the pure case-folded format of a string as defined by the standard is not desired. The character U+03A3 "Σ" capital sigma lowercases to U+03C3 "σ" small sigma if it is followed by another letter, but lowercases to U+03C2 "ς" small final sigma if it is not. Because both σ and ς have a case-insensitive match to Σ, and the case folding algorithm needs to map both of them together (so that transitivity is maintained), only one of them appears in the case-folded form.
When the case-folded format of a string is to be displayed to the user, it should be processed so as to choose the proper form for the small sigma, depending on the context. That is provided in Table 3-13 of [Unicode], where C = σ. For more information on case mapping and folding, see the following: Section 3.13 Default Case Operations of [Unicode], Section 4.2 Case Normative of [Unicode], and Section 5.18 Case Mappings of [Unicode].
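The sigma behavior, and one way to restore the final form for display, can be sketched as follows. The rule used in `display_form` (a sigma preceded by a letter and not followed by one) is a simplification of the context condition in [Unicode], stated here as an assumption; the function name is hypothetical.

```python
SMALL_SIGMA = "\u03c3"  # regular small sigma
FINAL_SIGMA = "\u03c2"  # small final sigma

def display_form(folded):
    """Choose the final sigma where a case-folded word ends in sigma.

    Simplified rule (assumption): sigma becomes final sigma when it is
    preceded by a letter and not followed by a letter.
    """
    out = []
    for i, ch in enumerate(folded):
        if ch == SMALL_SIGMA:
            prev_is_letter = i > 0 and folded[i - 1].isalpha()
            next_is_letter = i + 1 < len(folded) and folded[i + 1].isalpha()
            if prev_is_letter and not next_is_letter:
                ch = FINAL_SIGMA
        out.append(ch)
    return "".join(out)

word = "\u039f\u0394\u039f\u03a3"  # Greek capitals ending in capital sigma
folded = word.casefold()           # ends in the regular small sigma
print(folded, display_form(folded))
```

Comparison and storage would keep using the case-folded string; only the string shown to the user is adjusted.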
A number of characters are included in Unicode for compatibility. What is called Compatibility Normalization (NFKC) can be used to map these characters to the regular variants (this is what is done in IDNA). For example, a half-width Japanese katakana character ｶ (U+FF76) is mapped to the regular character カ (U+30AB). Additional mappings can be added to this; for example, IDNA adds additional mappings such as:
200D; | ZERO WIDTH JOINER maps to nothing (that is, is removed)
0041; 0061; | Case maps 'A' to 'a'
20A8; 0072 0073; | Additional folding, mapping ₨ to "rs"
In addition, characters may be prohibited. For example, IDNA prohibits space and no-break space (U+00A0). Instead of removing a ZERO WIDTH JOINER or mapping ₨ to "rs", for example, one could prohibit these characters. There are pluses and minuses to both approaches. If compatibility characters are widely used in practice when entering text, then it is much more user-friendly to remap them. This also extends to deletion; for example, the ZERO WIDTH JOINER is commonly used to affect the presentation of characters in languages such as Hindi or Arabic. In this case, text copied into the address box may often contain the character.
Where this is not the case, however, it may be advisable to simply prohibit the character. It is unlikely, for example, that ㋕ would be typed by a Japanese user, nor that it need work in copied text.
Where both mapping and prohibition are used, the mapping should be done before the prohibition, to ensure that characters don't "sneak past". For example, the Greek character TONOS (΄) ends up being prohibited, because it normalizes to space + acute, and space ends up being prohibited.
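The ordering of mapping and prohibition can be sketched as follows. This is a simplified stand-in for the actual IDNA preparation steps, with only the mappings and prohibitions mentioned above; the sets and the function name `prepare` are assumptions for illustration.

```python
import unicodedata

REMOVED = {"\u200d"}          # ZERO WIDTH JOINER: mapped to nothing
PROHIBITED = {" ", "\u00a0"}  # SPACE and NO-BREAK SPACE

def prepare(label):
    """Map first, then prohibit (simplified; not the full IDNA procedure)."""
    mapped = "".join(ch for ch in label if ch not in REMOVED)
    mapped = unicodedata.normalize("NFKC", mapped).casefold()
    mapped = unicodedata.normalize("NFKC", mapped)
    # Prohibition is checked on the mapped text, so characters that only
    # turn into a prohibited character after mapping are still caught.
    if any(ch in PROHIBITED for ch in mapped):
        raise ValueError("label contains a prohibited character")
    return mapped

print(prepare("A\u200db"))  # joiner removed, case folded
print(prepare("\u20a8"))    # rupee sign mapped to two letters

tonos = "\u0384"            # GREEK TONOS
print(tonos in PROHIBITED)  # not prohibited as typed
print(" " in unicodedata.normalize("NFKC", tonos))  # contains a space once mapped
```

Checking prohibitions before mapping would let TONOS through, after which normalization would introduce the very space that was meant to be excluded.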
The Security Levels 1-5 are defined below for use in implementations. These place restrictions on the use of identifiers according to the recommended Identifier Characters as specified in Appendix C. The Special-Purpose Characters are also specified in that appendix.
The determination of Script is according to the Unicode Standard [UAX24]. (A visual breakdown of characters by script is given in [idn-chars].) In determining script, Common and Inherited script characters are ignored, except for characters outside of XID_Continue. For example, "abc-def" counts as a single script, and would be allowed at Security Level 2. That is, the script of "-" is ignored. The string "I♥NY", on the other hand, is first allowed at Security Level 4, since the heart character is outside of XID_Continue, and thus the string counts as containing two scripts: Latin and Common.
An "appropriate alert" should be generated if a domain name fails to satisfy the chosen security level. Depending on the circumstances and the level difference, the form of such alerts could be minimal, such as special coloring or icons (perhaps with a tool-tip for more information); or more obvious, such as an alert dialog describing the issue and requiring user confirmation before continuing; or even more stringent, such as disallowing the use of the identifier. User-agents should remember when the user has accepted an alert, for say Ωmega.com, and permit future access without bothering the user again. Where icons are used to indicate the presence of characters from scripts, the glyphs in Appendix D. Missing Character Glyphs can be used.
A possible future extension to Level 2 is to exclude any combining character sequences outside of NamedSequences.txt.
The Unicode Consortium recommends a somewhat conservative approach at this point, because it is always easier to widen restrictions than narrow them. The Consortium is gathering data that would allow for a finer-grained approach, and expects to refine these recommendations in the future.
Some have proposed restricting domain names according to language, to prevent spoofing. In practice, that is very problematic: it is very difficult to determine the intended language of many terms, especially product or company names, which are often constructed to be neutral regarding language. Moreover, languages tend to be quite fluid; foreign words are continually being adopted. Except for registries with very special policies (such as the blocking used by some East Asian registries such as described in [RFC3743]), the language association does not make too much sense.
Instead, the recommendations call for a combination of string preprocessing to remove basic equivalences, promoting adequate rendering support, and putting restrictions in place according to script and restricting by confusable characters. While the ICANN guidelines say "top-level domain registries will ... associate each registered internationalized domain name with one language or set of languages" [ICANN], that guidance is better interpreted as limiting to script rather than language.
Also see the security discussions in IRI [RFC3987], URI [RFC3986], and StringPrep [RFC3454].
The following are recommendations for user agents in dealing with domain names.
The following are recommendations for registries in dealing with domain names. The term "Registry" is to be interpreted broadly. The .com operator can impose restrictions on the 2nd level domain label, but if someone registers foo.com, then it is up to them to decide what will be allowed at the 3rd level (for example, bar.foo.com). So for that purpose, the owner of foo.com is treated as the "registry" for the 3rd level (the bar). The term "Registrant" is used to refer to someone applying to a registry for a domain name.
Thus a registry could allow registration of http://caxap.ru/ in Latin (which is already registered), or the Cyrillic equivalent, or both — but for both, only with the same registrant!
To Do:
Give more background as to why normalization fixes certain problems, and which it does not fix. Describe how implementations of normalization can use small data set limited to only supported characters. Describe the recommended use of normalization in non-domain part of URL.
Describe BIDI spoofs. Use material from Michel's slides. Show how reverse-bidi (visual order -> storage order) can be used to detect bidi spoofs. That is: one can apply bidi then reverse bidi: if the result does not match the original, then reject the string.
Explain that private use characters can cause security problems, and recommend strongly against their use (not a problem for IDN, but for other identifiers it can be).
Describe cases in complex languages (eg Indic) where the same visual appearance may result from two different underlying character sequences, in the right context.
Add information on spoofs that only work with contextual scripts, such as Arabic.
Discuss security issues in Collation (sorting, searching, matching)
Describe how TrueType/OpenType fonts can be used in spoofing: fonts are actually programs that can deform glyph shapes radically according to resolution, platform, or language. For example $100.00 could appear as $200.00 when printed.
Discuss SSL and how root Certificate Authorities can be a problem, but are also part of the solution; most customers would lose faith quickly in internet financial transactions if SSL/https can be easily compromised.
Expand other applications of visual spoofing, aside from the example of IDN. International domain names are actually in much better shape than many other areas, since the problem will be much more severe in any area where text is not normalized. So focus on those issues.
A common practice is to have a 'gatekeeper' for a system. That gatekeeper checks incoming data to ensure that it is safe, and passes only safe data through. Once in the system, the other components assume that the data is safe. A problem arises when a component treats two pieces of text as identical — typically by canonicalizing them to the same form — while the gatekeeper only detected that one of them was unsafe.
There are three equivalent encoding forms for Unicode: UTF-8, UTF-16, and UTF-32. UTF-8 is commonly used in XML and HTML; UTF-16 is the most common in program APIs; and UTF-32 is the best for representing single characters. While these forms are all equivalent in terms of the ability to express Unicode, the original usage of UTF-8 was open to a canonicalization exploit.
Up to The Unicode Standard, Version 3.0, the generation of "non-shortest form" UTF-8 was forbidden, as was the interpretation of illegal sequences, but the interpretation of non-shortest forms was not forbidden. Where software does interpret the non-shortest forms, security issues can arise.
For example, the backslash character "\" can often be a dangerous character to let through a gatekeeper, since it can be used to access different directories. Thus a gatekeeper might specifically prevent it from getting through. The backslash is represented in UTF-8 as the byte sequence <5C>. However, as a non-shortest form, backslash could also be represented as the byte sequence <C1 9C>. When a gatekeeper does not catch that, but a component converts non-shortest forms, it can allow a real security breach. For more information, see [Related Material].
To address this issue, the Unicode Technical Committee modified the definition of UTF-8 in Unicode 3.1 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses.
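The backslash case can be worked through in a short sketch. The lenient decoder below is deliberately naive and exists only to show what a non-conformant component would do; the gatekeeper and all function names are hypothetical.

```python
def gatekeeper_allows(data):
    """A byte-level gatekeeper that blocks only the literal byte 5C."""
    return b"\\" not in data

def naive_decode_2byte(data):
    """Decode a 2-byte UTF-8 sequence WITHOUT rejecting non-shortest forms."""
    lead, trail = data
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))

def strict_decode(data):
    """Decode with a conformant decoder, which rejects non-shortest forms."""
    return data.decode("utf-8")

overlong = b"\xc1\x9c"  # non-shortest form of U+005C

print(gatekeeper_allows(overlong))           # slips past the gatekeeper
print(repr(naive_decode_2byte(overlong)))    # lenient decoder yields a backslash
try:
    strict_decode(overlong)
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```

The arithmetic is (0xC1 & 0x1F) << 6 = 0x40 and 0x9C & 0x3F = 0x1C, giving 0x5C. A conformant decoder refuses the sequence, which closes the gap between what the gatekeeper sees and what later components interpret.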
To Do:
Add information about other possible exploits in this area:
Unicode Normalization
Case mapping
Buffer overflows with all of the above, and when converting encoding forms
Discuss Unicode properties. Eg more characters have numeric properties than developers might expect.
Discuss use of Regular Expressions in validating data — ensuring that the Regular Expression Engine follows the Unicode Guidelines, but also that use of regular expressions makes use of properties rather than fixed lists of characters.
Explain that private use characters can cause security problems, and recommend strongly against their use.
Discuss security issues in Collation (sorting, searching, matching)
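As a rough illustration of the points above about numeric properties and regular expressions, the following sketch uses only the Python standard library. The sample string is an assumption chosen for the example. Whether the property-based or the fixed-list behavior is the desired one depends on what later components accept.

```python
import re
import unicodedata

# ARABIC-INDIC DIGIT ONE, TWO, THREE (U+0661..U+0663)
arabic_indic = "\u0661\u0662\u0663"

# In Python 3, \d on str patterns matches any Unicode decimal digit.
property_match = re.fullmatch(r"\d+", arabic_indic) is not None

# A fixed list of ASCII digits does not match the same input.
fixed_list_match = re.fullmatch(r"[0-9]+", arabic_indic) is not None

# re.ASCII restricts \d to ASCII digits.
ascii_flag_match = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII) is not None

# int() accepts Unicode decimal digits, so a value validated one way
# may still be interpreted another way by a later component.
value = int(arabic_indic)

# Characters that are not decimal digits can also carry numeric values.
half = unicodedata.numeric("\u00BD")  # VULGAR FRACTION ONE HALF
```

The risk is a mismatch: a validator that accepts these digits in front of a component that handles only ASCII digits, or a validator limited to ASCII in front of a parser that quietly accepts more.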
There are three data files currently associated with this document.
Note: we are just starting the project of collecting data for the second two files, and examining the feasibility of different approaches, so we have just begun to gather data.
[idn-chars] | IDN Characters: Categorizes all the possible IDN chars, under the current definition of IDN. The format of both files is described in the html file. idn-chars.html idn-chars.txt |
[confusables] | Visually Confusable Characters: Provides a mapping for visually confusables. The format and usage of the file are described in the file header. confusables.txt |
The following points to background information that may be useful.
The characters recommended for general use as identifiers include those characters having the XID_Continue property as defined in the Unicode Character Database (see [DCore]), plus the characters listed in Additional Word Characters below, which are required for expressing words in some languages.
The list of allowable characters may be reduced or expanded according to the requirements of the specific domain. For example, programming language identifiers typically add some characters like '$', and remove others like '-' (because of the use as minus), while IDNA removes '_' (among others). For more information, see UAX #31, Identifier and Pattern Syntax [UAX31].
0027 ; word-chars # Po (') APOSTROPHE
002D ; word-chars # Pd (-) HYPHEN-MINUS
002E ; word-chars # Po (.) FULL STOP
003A ; word-chars # Po (:) COLON
00B7 ; word-chars # Po (·) MIDDLE DOT
02B9 ; word-chars # Lm (ʹ) MODIFIER LETTER PRIME
02BA ; word-chars # Lm (ʺ) MODIFIER LETTER DOUBLE PRIME
04C0 ; word-chars # L& (Ӏ) CYRILLIC LETTER PALOCHKA
055A ; word-chars # Po (՚) ARMENIAN APOSTROPHE
058A ; word-chars # Pd (֊) ARMENIAN HYPHEN
05F3 ; word-chars # Po (׳) HEBREW PUNCTUATION GERESH
05F4 ; word-chars # Po (״) HEBREW PUNCTUATION GERSHAYIM
0F85 ; word-chars # .. (྅) TIBETAN MARK PALUTA
2010 ; word-chars # Pd (‐) HYPHEN
2019 ; word-chars # Pf (’) RIGHT SINGLE QUOTATION MARK
2027 ; word-chars # Po (‧) HYPHENATION POINT
3003 ; word-chars # .. (〃) DITTO MARK
30A0 ; word-chars # Pd (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN
30FB ; word-chars # .. (・) KATAKANA MIDDLE DOT
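A membership test for this combined set might look like the following sketch. The function names are hypothetical, and only a few of the entries above are embedded as sample data. Python does not expose XID_Continue directly, so the helper approximates it through `str.isidentifier()`, which reflects the Unicode version bundled with the interpreter and not necessarily the version this document targets.

```python
# A few entries copied from the list above, in the same format.
ADDITIONAL_WORD_CHARS_SAMPLE = """\
0027 ; word-chars # Po (') APOSTROPHE
002D ; word-chars # Pd (-) HYPHEN-MINUS
00B7 ; word-chars # Po (·) MIDDLE DOT
2019 ; word-chars # Pf (’) RIGHT SINGLE QUOTATION MARK
"""

def parse_code_points(data: str) -> set:
    """Collect the code points from lines of the form 'XXXX ; label # comment'."""
    points = set()
    for line in data.splitlines():
        # Drop the trailing comment, then take the field before the semicolon.
        field = line.split("#", 1)[0].split(";", 1)[0].strip()
        if field:
            points.add(int(field, 16))
    return points

WORD_CHARS = parse_code_points(ADDITIONAL_WORD_CHARS_SAMPLE)

def is_xid_continue(ch: str) -> bool:
    """Approximation: 'a' + ch is a Python identifier when ch is XID_Continue."""
    return ("a" + ch).isidentifier()

def is_general_identifier_char(ch: str) -> bool:
    """XID_Continue plus the (sample) Additional Word Characters."""
    return is_xid_continue(ch) or ord(ch) in WORD_CHARS

checks = {ch: is_general_identifier_char(ch) for ch in ("a", "-", "'", "$", " ")}
```

A specific domain would then remove or add characters on top of this base set, as described above for programming languages and IDNA.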
These are all characters of General Category Mn or Me (Nonspacing or Enclosing Mark).
The Special-Purpose Characters are used in the definition of Security Levels. They include all characters with the General Category Nl (Letter-Number), plus the following list:
0138 ; Ll # (ĸ) LATIN SMALL LETTER KRA
0180 ; Ll # (ƀ) LATIN SMALL LETTER B WITH STROKE
018D ; Ll # (ƍ) LATIN SMALL LETTER TURNED DELTA
019B ; Ll # (ƛ) LATIN SMALL LETTER LAMBDA WITH STROKE
01AA ; Ll # (ƪ) LATIN LETTER REVERSED ESH LOOP
01AB ; Ll # (ƫ) LATIN SMALL LETTER T WITH PALATAL HOOK
01BA ; Ll # (ƺ) LATIN SMALL LETTER EZH WITH TAIL
01BE ; Ll # (ƾ) LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
0250 ; Ll # (ɐ) LATIN SMALL LETTER TURNED A
0251 ; Ll # (ɑ) LATIN SMALL LETTER ALPHA
0252 ; Ll # (ɒ) LATIN SMALL LETTER TURNED ALPHA
0255 ; Ll # (ɕ) LATIN SMALL LETTER C WITH CURL
0258 ; Ll # (ɘ) LATIN SMALL LETTER REVERSED E
025A ; Ll # (ɚ) LATIN SMALL LETTER SCHWA WITH HOOK
025C ; Ll # (ɜ) LATIN SMALL LETTER REVERSED OPEN E
025D ; Ll # (ɝ) LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
025E ; Ll # (ɞ) LATIN SMALL LETTER CLOSED REVERSED OPEN E
025F ; Ll # (ɟ) LATIN SMALL LETTER DOTLESS J WITH STROKE
0261 ; Ll # (ɡ) LATIN SMALL LETTER SCRIPT G
0262 ; Ll # (ɢ) LATIN LETTER SMALL CAPITAL G
0264 ; Ll # (ɤ) LATIN SMALL LETTER RAMS HORN
0265 ; Ll # (ɥ) LATIN SMALL LETTER TURNED H
0266 ; Ll # (ɦ) LATIN SMALL LETTER H WITH HOOK
0267 ; Ll # (ɧ) LATIN SMALL LETTER HENG WITH HOOK
026A ; Ll # (ɪ) LATIN LETTER SMALL CAPITAL I
026B ; Ll # (ɫ) LATIN SMALL LETTER L WITH MIDDLE TILDE
026C ; Ll # (ɬ) LATIN SMALL LETTER L WITH BELT
026D ; Ll # (ɭ) LATIN SMALL LETTER L WITH RETROFLEX HOOK
026E ; Ll # (ɮ) LATIN SMALL LETTER LEZH
0270 ; Ll # (ɰ) LATIN SMALL LETTER TURNED M WITH LONG LEG
0271 ; Ll # (ɱ) LATIN SMALL LETTER M WITH HOOK
0273 ; Ll # (ɳ) LATIN SMALL LETTER N WITH RETROFLEX HOOK
0274 ; Ll # (ɴ) LATIN LETTER SMALL CAPITAL N
0276 ; Ll # (ɶ) LATIN LETTER SMALL CAPITAL OE
0277 ; Ll # (ɷ) LATIN SMALL LETTER CLOSED OMEGA
0278 ; Ll # (ɸ) LATIN SMALL LETTER PHI
0279 ; Ll # (ɹ) LATIN SMALL LETTER TURNED R
027A ; Ll # (ɺ) LATIN SMALL LETTER TURNED R WITH LONG LEG
027B ; Ll # (ɻ) LATIN SMALL LETTER TURNED R WITH HOOK
027C ; Ll # (ɼ) LATIN SMALL LETTER R WITH LONG LEG
027D ; Ll # (ɽ) LATIN SMALL LETTER R WITH TAIL
027E ; Ll # (ɾ) LATIN SMALL LETTER R WITH FISHHOOK
027F ; Ll # (ɿ) LATIN SMALL LETTER REVERSED R WITH FISHHOOK
0281 ; Ll # (ʁ) LATIN LETTER SMALL CAPITAL INVERTED R
0282 ; Ll # (ʂ) LATIN SMALL LETTER S WITH HOOK
0284 ; Ll # (ʄ) LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK
0285 ; Ll # (ʅ) LATIN SMALL LETTER SQUAT REVERSED ESH
0286 ; Ll # (ʆ) LATIN SMALL LETTER ESH WITH CURL
0287 ; Ll # (ʇ) LATIN SMALL LETTER TURNED T
0289 ; Ll # (ʉ) LATIN SMALL LETTER U BAR
028C ; Ll # (ʌ) LATIN SMALL LETTER TURNED V
028D ; Ll # (ʍ) LATIN SMALL LETTER TURNED W
028E ; Ll # (ʎ) LATIN SMALL LETTER TURNED Y
028F ; Ll # (ʏ) LATIN LETTER SMALL CAPITAL Y
0290 ; Ll # (ʐ) LATIN SMALL LETTER Z WITH RETROFLEX HOOK
0291 ; Ll # (ʑ) LATIN SMALL LETTER Z WITH CURL
0293 ; Ll # (ʓ) LATIN SMALL LETTER EZH WITH CURL
0295 ; Ll # (ʕ) LATIN LETTER PHARYNGEAL VOICED FRICATIVE
0296 ; Ll # (ʖ) LATIN LETTER INVERTED GLOTTAL STOP
0297 ; Ll # (ʗ) LATIN LETTER STRETCHED C
0298 ; Ll # (ʘ) LATIN LETTER BILABIAL CLICK
0299 ; Ll # (ʙ) LATIN LETTER SMALL CAPITAL B
029A ; Ll # (ʚ) LATIN SMALL LETTER CLOSED OPEN E
029B ; Ll # (ʛ) LATIN LETTER SMALL CAPITAL G WITH HOOK
029C ; Ll # (ʜ) LATIN LETTER SMALL CAPITAL H
029D ; Ll # (ʝ) LATIN SMALL LETTER J WITH CROSSED-TAIL
029E ; Ll # (ʞ) LATIN SMALL LETTER TURNED K
029F ; Ll # (ʟ) LATIN LETTER SMALL CAPITAL L
02A0 ; Ll # (ʠ) LATIN SMALL LETTER Q WITH HOOK
02A1 ; Ll # (ʡ) LATIN LETTER GLOTTAL STOP WITH STROKE
02A2 ; Ll # (ʢ) LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE
02A3 ; Ll # (ʣ) LATIN SMALL LETTER DZ DIGRAPH
02A4 ; Ll # (ʤ) LATIN SMALL LETTER DEZH DIGRAPH
02A5 ; Ll # (ʥ) LATIN SMALL LETTER DZ DIGRAPH WITH CURL
02A6 ; Ll # (ʦ) LATIN SMALL LETTER TS DIGRAPH
02A7 ; Ll # (ʧ) LATIN SMALL LETTER TESH DIGRAPH
02A8 ; Ll # (ʨ) LATIN SMALL LETTER TC DIGRAPH WITH CURL
02A9 ; Ll # (ʩ) LATIN SMALL LETTER FENG DIGRAPH
02AA ; Ll # (ʪ) LATIN SMALL LETTER LS DIGRAPH
02AB ; Ll # (ʫ) LATIN SMALL LETTER LZ DIGRAPH
02AC ; Ll # (ʬ) LATIN LETTER BILABIAL PERCUSSIVE
02AD ; Ll # (ʭ) LATIN LETTER BIDENTAL PERCUSSIVE
03D7 ; Ll # (ϗ) GREEK KAI SYMBOL
03F3 ; Ll # (ϳ) GREEK LETTER YOT
# Total code points: 84
0559 ; Lm # (ՙ) ARMENIAN MODIFIER LETTER LEFT HALF RING
# Total code points: 1
01BB ; Lo # (ƻ) LATIN LETTER TWO WITH STROKE
01C0 ; Lo # (ǀ) LATIN LETTER DENTAL CLICK
01C1 ; Lo # (ǁ) LATIN LETTER LATERAL CLICK
01C2 ; Lo # (ǂ) LATIN LETTER ALVEOLAR CLICK
01C3 ; Lo # (ǃ) LATIN LETTER RETROFLEX CLICK
# Total code points: 5
0483 ; Mn # (҃) COMBINING CYRILLIC TITLO
0484 ; Mn # (҄) COMBINING CYRILLIC PALATALIZATION
0485 ; Mn # (҅) COMBINING CYRILLIC DASIA PNEUMATA
0486 ; Mn # (҆) COMBINING CYRILLIC PSILI PNEUMATA
These lists are still draft! Once finalized, they should become regular properties.
We should extend the Special Purpose Characters to include problematic characters even outside XID_Continue, so that an implementation that extends the set of Identifier Characters can distinguish between characters like heart (♥), which do not cause a spoofing problem, and those like division slash (∕), which do.
Latin | Tibetan | Hanunoo |
---|---|---|
Greek | Myanmar | Buhid |
Cyrillic | Georgian | Tagbanwa |
Armenian | Hangul | Limbu |
Hebrew | Ethiopic | Tai Le |
Arabic | Cherokee | Linear B |
Syriac | Canadian Aboriginal | Ugaritic |
Thaana | Ogham | Shavian |
Devanagari | Runic | Osmanya |
Bengali | Khmer | Cypriot |
Gurmukhi | Mongolian | Braille |
Gujarati | Hiragana | Buginese |
Oriya | Katakana | Coptic |
Tamil | Bopomofo | New Tai Lue |
Telugu | Han | Glagolitic |
Kannada | Yi | Tifinagh |
Malayalam | Old Italic | Syloti Nagri |
Sinhala | Gothic | Old Persian |
Thai | Deseret | Kharoshthi |
Lao | Tagalog | |
Common | Inherited |
Steven Loomis and other people on the ICU team were very helpful in developing the original proposal for this technical report. Thanks also to the following people for their feedback or contributions to this document or earlier versions of it: Martin Dürst, Paul Hoffman, Peter Karlsson, Gervase Markham, Eric Muller, and especially Erik van der Poel and Michel Suignard. This document also draws on examples or ideas suggested in email discussions from Alexander Savenkov, Erik van der Poel, and others.
[CharMod] | Character Model for the World Wide Web 1.0: Fundamentals http://www.w3.org/TR/charmod/ |
[Charts] | Unicode Charts (with Last Resort Glyphs) http://www.unicode.org/charts/lastresort.html See also: |
[DCore] | Derived Core Properties http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt |
[Display] | Display Problems? http://www.unicode.org/help/display_problems.html |
[ICANN] | Guidelines for the Implementation of Internationalized Domain Names http://www.icann.org/general/idn-guidelines-20jun03.htm |
[IDN-Demo] | ICU (International Components for Unicode) IDN Demo http://ibm.com/software/globalization/icu/demo/domain/ |
[Feedback] | Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html |
[Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[RFC3454] | P. Hoffman, M. Blanchet. "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002. http://ietf.org/rfc/rfc3454.txt |
[RFC3490] | Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003. http://ietf.org/rfc/rfc3490.txt |
[RFC3491] | Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003. http://ietf.org/rfc/rfc3491.txt |
[RFC3492] | Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003. http://ietf.org/rfc/rfc3492.txt |
[RFC3743] | Konishi, K., Huang, K., Qian, H. and Y. Ko, "Joint Engineering Team (JET) Guidelines for Internationalized Domain Names (IDN) Registration and Administration for Chinese, Japanese, and Korean", RFC 3743, April 2004. http://ietf.org/rfc/rfc3743.txt |
[RFC3986] | T. Berners-Lee, R. Fielding, L. Masinter. "Uniform Resource Identifier (URI): Generic Syntax", RFC 3986, January 2005. http://ietf.org/rfc/rfc3986.txt |
[RFC3987] | M. Duerst, M. Suignard. "Internationalized Resource Identifiers (IRIs)", RFC 3987, January 2005. http://ietf.org/rfc/rfc3987.txt |
[UCD] | Unicode Character Database. http://www.unicode.org/ucd For an overview of the Unicode Character Database and a list of its associated files |
[UAX15] | UAX #15: Unicode Normalization Forms |
[UAX24] | UAX #24, Script Names http://unicode.org/reports/tr24/ |
[UAX31] | UAX #31, Identifier and Pattern Syntax http://www.unicode.org/reports/tr31/ |
[Unicode] | The Unicode Standard, Version 4.1.0 http://www.unicode.org/versions/Unicode4.1.0/ |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/standard/versions For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports. |
The following summarizes modifications from the previous revision of this document.
Revision 3:
Revision 2:
Revision 1:
Copyright © 2004-2005 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.