Draft
Unicode Technical Standard #39
Unicode Security Mechanisms
Summary
Because Unicode contains such a large number of characters and incorporates the
varied writing systems of the world, incorrect usage can expose programs or systems to possible
security attacks. This document specifies mechanisms that can be used
in detecting possible security problems.
Status
This is a draft document which may be updated, replaced, or
superseded by other documents at any time. Publication does not imply endorsement by the Unicode
Consortium. This is not a stable document; it is inappropriate to cite this document as other
than a work in progress.
A Unicode Technical Standard (UTS) is an independent
specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the online reporting form [Feedback].
Related information that is useful in understanding this document is found in
References. For the latest version of the Unicode Standard see [Unicode].
For a list of current Unicode Technical Reports see [Reports]. For more
information about versions of the Unicode Standard, see [Versions].
To allow access to the most recent work of the Unicode security subcommittee on this
document, the "Latest Working Draft" link in the header points to the latest
working-draft document under development.
Contents
1. Introduction
Unicode Technical Report #36: Unicode Security Considerations
[UTR36] provides guidelines for detecting and avoiding security problems
connected with the use of Unicode. This document specifies mechanisms that are used in that
document, and can be used elsewhere. Readers should be familiar with [UTR36]
before continuing.
An implementation claiming conformance to this specification must
do so in conformance to the following clauses.
C2. An implementation claiming to implement any of the following confusable-detection functions must do so in accordance with the specifications in Section 4. Confusable Detection.
- X and Y are single-script confusables
- X and Y are mixed-script confusables
- X and Y are whole-script confusables
- X has any simple single-script confusables
- X has any mixed-script confusables
- X has any whole-script confusables
C3. An implementation claiming to detect mixed scripts must do so in accordance with the specifications in Section 5. Mixed Script Detection.
Identifiers are special-purpose strings used for identification —
strings that are deliberately limited to particular repertoires for that purpose. Exclusion of
characters from identifiers does not at all affect the general use of those characters, such as
within documents. UAX #31, Identifier and Pattern Syntax [UAX31]
provides a recommended method of determining which strings should qualify as identifiers. The
UAX #31 specification extends the common practice of defining identifiers in terms of letters
and numbers to the Unicode repertoire.
UAX #31 also permits other protocols to use that method as a base, and to define a
profile that adds or removes characters. For example, identifiers for specific programming
languages typically add some characters like '$', and remove others like '-' (because of its use
as minus), while IDNA removes '_' (among others). For more information, see UAX #31,
Identifier and Pattern Syntax [UAX31].
This document provides for alternative identifier profiles. These are profiles of the
extended identifiers based on the XID_Start and XID_Continue properties as defined in the
Unicode Character Database (see [DCore]). In both cases, the identifiers
are folded: that is, there is a larger set of input characters that are allowed, but
these are folded together into a set of allowed output characters. The folding uses Case
Folding as defined in Chapter 3. Conformance of [Unicode], and
NFKD normalization as defined in [UAX15].
The data files used in defining these profiles follow the UCD File Format, which has a
semicolon-delimited list of data fields associated with given characters. For more details, see
[UCDFormat].
The file [idmod] provides data for a profile of identifiers in
environments where security is at issue. The file contains a small set of characters that are
recommended as additions (to the list of characters defined by the XID_Start and XID_Continue
properties), because they may be used in identifiers in a broader context than programming
identifiers, and a set of characters recommended to be restricted from use.
An implementation claiming conformance to the General Security
Profile for Identifiers shall not include any Alphabetic characters in the restricted list. It
should include the additions, and may include other non-Alphabetic characters.
The restricted characters are characters not in common use, removed so as to further reduce
the possibilities for visual confusion. Initially, the following are being excluded: characters
not in modern use; characters only used in specialized fields, such as liturgical characters,
mathematical letter-like symbols, and certain phonetic alphabets; and ideographic characters
that are not part of a set of core CJK ideographs consisting of the CJK Unified Ideographs block
plus IICore (the set of characters defined by the IRG as the minimal set of required ideographs
for East Asian use). A small number of such characters are allowed back in so that the profile
includes all the characters in the country-specific restricted IDN lists: see Appendix
F. Country-Specific IDN Restrictions.
The principle has been to be more conservative initially, allowing for the set to be
modified in the future as requirements for characters are refined. For information on handling
that, see Section 2.9.1 Backwards Compatibility.
In the file [idmod], Field 2 is an action (either restricted or
addition), and Field 3 is a reason.
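For illustration only, the following Java sketch (using the Java version of [ICU]) shows one way such a file could be read into two UnicodeSets, assuming the usual UCD layout in which Field 1 is a code point or code point range; the method and variable names are illustrative, not part of this specification.
static void loadIdentifierModifications(BufferedReader in,
        UnicodeSet restricted, UnicodeSet additions) throws IOException {
    String line;
    while ((line = in.readLine()) != null) {
        int hash = line.indexOf('#');                     // strip comments
        if (hash >= 0) line = line.substring(0, hash);
        line = line.trim();
        if (line.length() == 0) continue;
        String[] fields = line.split(";");
        String[] cps = fields[0].trim().split("\\.\\.");  // "XXXX" or "XXXX..YYYY"
        int start = Integer.parseInt(cps[0], 16);
        int end = cps.length > 1 ? Integer.parseInt(cps[1], 16) : start;
        String action = fields[1].trim();                 // Field 2: restricted or addition
        if (action.equalsIgnoreCase("restricted")) restricted.add(start, end);
        else if (action.equalsIgnoreCase("addition")) additions.add(start, end);
    }
}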
This list is also used in deriving the IDN Identifiers list given below. It is, however,
designed to be applied to other environments, and is not limited to Unicode 3.2 (as IDNA is
currently), so that it can be applied to a future version of IDNA that includes the (large)
repertoire of characters that have been added since Unicode 3.2.
The file [idnchars] provides data for composing a list of
all and only those characters recommended for use in IDN, as described in the recommendations
above. It is presented as a series of tables organized by type, as given in Field 2
in the file.
Two profiles are defined: strict and lenient.
Recommended IDN Identifier Profiles
                                        Strict Profile    Lenient Profile
Types allowed in output identifiers     output            output
Types allowed in input identifiers      output + input    output + input + input-lenient
In both profiles, neither input nor output identifiers can start with a nonstarting
character. The only difference between the profiles is that the lenient profile allows more
characters on input.
The following provides BNF descriptions using the extended BNF found in Section 0.3 of [Unicode],
supplemented by the property syntax of [UTS18].
Strict Profile BNF:
<strict-profile-output> := <SP-output-start> <SP-output-continue>*
<SP-output-start> := [[:SP-output-continue:] - [:nonstarting:]]
<SP-output-continue> := [:output-type:]
<strict-profile-input> := <SP-input-start> <SP-input-continue>*
<SP-input-start> := [[:SP-input-continue:] - [:nonstarting:]]
<SP-input-continue> := [[:output-type:][:input-type:]]
Lenient Profile BNF:
<lenient-profile-output> := <strict-profile-output>
<lenient-profile-input> := <LP-input-start> <LP-input-continue>*
<LP-input-start> := [[:LP-input-continue:] - [:nonstarting:]]
<LP-input-continue> := [[:output-type:][:input-type:][:input-lenient-type:]]
The only distinction between them is the inclusion of [:input-lenient-type:] on the very last
line.
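For illustration only, the strict output profile above could be checked as in the following Java sketch (using the Java version of [ICU]), assuming the sets outputType and nonstarting have been built from the corresponding types in the data file; the names are illustrative, not normative.
static boolean isStrictProfileOutput(String s, UnicodeSet outputType, UnicodeSet nonstarting) {
    if (s.length() == 0) return false;
    int first = s.codePointAt(0);
    // <SP-output-start> := [[:SP-output-continue:] - [:nonstarting:]]
    if (!outputType.contains(first) || nonstarting.contains(first)) return false;
    // <SP-output-continue> := [:output-type:]
    for (int i = Character.charCount(first); i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (!outputType.contains(cp)) return false;
        i += Character.charCount(cp);
    }
    return true;
}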
The following table provides more description of the types given by Field 2.
IDN Identifier Profile Types

output
This type marks characters that are retained in this profile in the output of IDN; that is, any characters outside of this set are not allowed by this profile. It is formed by taking everything in IDNA [RFC3491], and intersecting that with the characters in Section 3.1 General Security Profile for Identifiers.

input
This type marks additional characters (beyond those in output) which are retained on input to IDNA in this profile. These are characters that case-fold to the characters in output.

input-lenient
This type marks additional characters (beyond those in output and input) which are retained on input to IDNA, but only in the non-strict profile. These are IDNA characters that normalize to the characters in output.

nonstarting
This type marks characters that are disallowed at the start of an identifier, either input or output. (IDNA, unlike [UAX31] or most programming languages, does not place restrictions on which characters can start an identifier.)
In both profiles, on input the following characters should be pre-mapped. That is, in
circumstances where the user is typing a URL into an address bar, these mappings are recommended
so as to allow people to type characters that they may not otherwise easily be able to type.
However, this is not formally part of the identifier profile; it is simply a recommendation for
GUIs, given the constraints of the identifier profile.
Remapping Characters

0027 → 2019    ' → ʼ    APOSTROPHE → MODIFIER LETTER APOSTROPHE
2018 → 02BB    ‘ → ʻ    LEFT SINGLE QUOTATION MARK → MODIFIER LETTER TURNED COMMA
2019 → 02BC    ’ → ʼ    RIGHT SINGLE QUOTATION MARK → MODIFIER LETTER APOSTROPHE
309B → 3099    ゛ → ゙    KATAKANA-HIRAGANA VOICED SOUND MARK → COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
309C → 309A    ゜ → ゚    KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK → COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
The tables in the data file [confusables] provide a mechanism for
determining when two strings are visually confusable. The data in these files may be refined
and extended over time. For information on handling that, see Section 2.9.1
Backwards Compatibility. The data is organized into
four different tables, depending on the desired parameters. Each table provides a mapping from
source characters to target strings.
On the basis of this data, there are three main classes of confusable strings:
X and Y are single-script confusables if they are confusable according to the
Single-Script table, and each of them is a single script string according to Section 5.
Mixed Script Detection. Examples: "so̷s" and "søs"
in Latin.
X and Y are mixed-script confusables if they are confusable according to the
Mixed-Script table, and they are not single-script confusables. Example: "paypal" in Latin and
"paypal" with the 'a' being in Cyrillic.
X and Y are whole-script confusables if they are mixed-script confusables,
and each of them is a single script string. Example: "scope" in Latin and "scope" in
Cyrillic.
To see whether two strings X and Y are confusable according to a given table, an
implementation first converts both X and Y to NFKD format, as described in [UAX15].
It then produces transform(X) from X, by successively mapping each source character in X to the
target string, and then produces transform(Y) from Y by the same process. The resulting strings
transform(X) and transform(Y) are then compared. If they are identical (codepoint-for-codepoint),
then the original strings are visually confusable according to the table.
Note: the strings transform(X) and transform(Y) are not intended for
display, storage or transmission. They should be thought of instead as an intermediate
processing form, a kind of hashcode or skeleton. The characters in transform(X) and
transform(Y) are not guaranteed to be identifier characters.
Implementations do not have to recursively apply the mappings, because the transforms are
idempotent. That is,
transform(transform(X)) = transform(X).
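For illustration only, the transform and comparison might be coded as in the following Java sketch (using the Java version of [ICU]), assuming the chosen table has been loaded into a map from source code points to target strings; the table loading is not shown and the names are illustrative.
static String transform(String s, Map<Integer, String> table) {
    String nfkd = Normalizer.normalize(s, Normalizer.NFKD);
    StringBuilder skeleton = new StringBuilder();
    for (int i = 0; i < nfkd.length(); ) {
        int cp = nfkd.codePointAt(i);
        String target = table.get(cp);
        if (target != null) skeleton.append(target);  // map source character to target string
        else skeleton.appendCodePoint(cp);            // unmapped characters pass through
        i += Character.charCount(cp);
    }
    return skeleton.toString();
}

// X and Y are visually confusable according to the table if the skeletons are identical.
static boolean areConfusable(String x, String y, Map<Integer, String> table) {
    return transform(x, table).equals(transform(y, table));
}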
For each table, Field 1 is the source, Field 2 is the target, and Field 3 is a type. The
different tables are distinguished by type:
Confusable Data Table Types

SL (Single-Script, Lowercase)
This table is used to test cases of single-script confusables, where the output only allows lowercase. For example:
# ( ø → o̷ ) LATIN SMALL LETTER O WITH STROKE → LATIN SMALL LETTER O, COMBINING SHORT SOLIDUS OVERLAY

SA (Single-Script, Any-Case)
This table is used to test cases of single-script confusables, where the output allows for mixed case (which may be later folded away). For example, this table contains the following entry not found in SL:
# ( O → 0 ) LATIN CAPITAL LETTER O → DIGIT ZERO

ML (Mixed-Script, Lowercase)
This table is used to test cases of mixed-script and whole-script confusables, where the output only allows lowercase. For example, this table contains the following entry not found in SL or SA:
# ( ν → v ) GREEK SMALL LETTER NU → LATIN SMALL LETTER V

MA (Mixed-Script, Any-Case)
This table is used to test cases of mixed-script and whole-script confusables, where the output allows for mixed case (which may be later folded away). For example, this table contains the following entry not found in SL, SA, or ML:
# ( Ι → l ) GREEK CAPITAL LETTER IOTA → LATIN SMALL LETTER L
Note: It would be possible to provide a more sophisticated confusable detection, by
providing a metric between given characters, indicating their 'closeness'. However, that is
computationally much more expensive, and requires more sophisticated data, so at this point in
time the simpler mechanism has been chosen. It does impose transitivity on the data, so if X ~ Y
and Y ~ Z, then X ~ Z.
Data is also provided for testing whether a string X has
any whole-script confusable, using the file [confusablesWS].
This file consists of a list of lines of the form:
<range>; <sourceScript>; <targetScript>; <type> #comment
The types are either L for lowercase-only, or A for any-case, where the any-case ranges are
broader (including uppercase and lowercase characters). If the string is only lowercase, use the
lowercase-only table. Otherwise, first test according to the any-case table, then lowercase the
string and test according to the lowercase-only table.
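For illustration only, that table-selection rule might look like the following Java sketch (using the Java version of [ICU]); isLowercaseOnly and the variant of hasWholeScriptConfusable that takes an explicit data set are assumed helpers, not part of the data format.
static boolean hasWholeScriptConfusableAnyCase(String s) {
    if (isLowercaseOnly(s)) {
        return hasWholeScriptConfusable(s, lowercaseData);      // lowercase-only (L) data
    }
    if (hasWholeScriptConfusable(s, anyCaseData)) return true;  // any-case (A) data first
    // then lowercase the string and test against the lowercase-only data
    return hasWholeScriptConfusable(UCharacter.toLowerCase(s), lowercaseData);
}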
In using the data, all of the lines having the same sourceScript and targetScript
are collected together to form a set of Unicode characters. Logically, the file is thus a set of
tuples of the form <sourceScript, unicodeSet, targetScript>. For example, the following
lines are present for Latin to Cyrillic:
0061 ; Latn; Cyrl; L # (a) LATIN SMALL LETTER A
0063..0065 ; Latn; Cyrl; L # [3] (c..e) LATIN SMALL LETTER C..LATIN SMALL LETTER E
...
0292 ; Latn; Cyrl; L # (ʒ) LATIN SMALL LETTER EZH
They logically form a tuple <Latin, [a c-e ... \u0292], Cyrillic>, which indicates
that a Latin string containing characters only from that Unicode set can have a whole-script
confusable in Cyrillic (lowercase-only).
To test whether a single-script string givenString has a whole-script confusable in
targetScript, the following process is used.
- Convert the givenString to NFKD format, as specified in [UAX15]
- Let givenSet be the set of all characters in givenString
- Remove all [:script=common:] and [:script=inherited:] characters from givenSet
- Let givenScript be the script of the characters in givenSet
- (if there is more than one script, fail with error).
- See if there is a tuple <sourceScript, unicodeSet, targetScript> where
- sourceScript = givenScript
- unicodeSet ⊇ givenSet
- If so, then there is a whole-script confusable in
targetScript
The test is actually slightly broader than simply a whole-script confusable; what it tests is
whether the given string has a whole-script confusable string in another script, possibly with
the addition or removal of common/inherited characters, such as numbers and combining marks,
in both strings. In practice, however, this has no significant impact.
Implementations would normally read the data into appropriate data structures in memory for
processing. A quick additional optimization is to keep, for each script, a fastReject
set, containing those characters in the script that are contained in none of the unicodeSet values.
The following is a Java sample of how this code can work (using the Java version of [ICU]):
/*
 * For this routine, we don't care what the target scripts are,
 * just whether there is at least one whole-script confusable.
 */
boolean hasWholeScriptConfusable(String s) {
    int givenScript = getSingleScript(s);
    if (givenScript == UScript.INVALID_CODE) {
        throw new IllegalArgumentException("Not single script string");
    }
    // Collect the characters of the string, ignoring Common and Inherited.
    UnicodeSet givenSet = new UnicodeSet()
        .addAll(s)
        .removeAll(commonAndInherited);
    // Fast path: a character outside every unicodeSet rules out a match.
    if (fastReject[givenScript].containsSome(givenSet)) return false;
    // Otherwise look for a tuple whose unicodeSet contains all the characters.
    UnicodeSet[] possibles = scriptToUnicodeSets[givenScript];
    for (int i = 0; i < possibles.length; ++i) {
        if (possibles[i].containsAll(givenSet)) return true;
    }
    return false;
}
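For illustration only, the fastReject set for a script might be built as in the following sketch, once scriptToUnicodeSets has been populated from [confusablesWS]; the method name is illustrative.
static UnicodeSet buildFastReject(int script, UnicodeSet[] setsForScript) {
    // Union of all whole-script confusable sets whose source is this script.
    UnicodeSet union = new UnicodeSet();
    for (int i = 0; i < setsForScript.length; ++i) union.addAll(setsForScript[i]);
    // Any other character of the script can never occur in a confusable string.
    return new UnicodeSet()
        .applyIntPropertyValue(UProperty.SCRIPT, script)
        .removeAll(union);
}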
The data in [confusablesWS] is built using the data in [confusables],
and subject to the same caveat: the data in these files may be refined and extended over
time. For information on handling that, see Section 2.9.1
Backwards Compatibility.
To test for mixed-script confusables, use the following process.
Convert the given string to NFKD format, as specified in [UAX15]. For
each script found in the given string, see if all the characters in the string outside of that
script have whole-script confusables for that script (according to Section 4.1
Whole-Script Confusables).
Example 1: 'pаypаl', with Cyrillic 'а's.
There are two scripts, Latin and Cyrillic. The set of Cyrillic characters {a} has a
whole-script confusable in Latin. Thus the string is a mixed-script confusable.
Example 2: 'toys-я-us', with one Cyrillic
character 'я'.
The set of Cyrillic characters {я} does not have a whole-script
confusable in Latin (there is no Latin character that looks like 'я'),
nor does the set of Latin characters {o s t u y} have a whole-script confusable in Cyrillic
(there is no Cyrillic character that looks like 't' or 'u'). Thus this string is not a
mixed-script confusable.
Example 3: '1iνе', with a Greek 'ν' and Cyrillic 'е'.
There are three scripts, Latin, Greek, and Cyrillic. The set of Cyrillic characters {е} and
the set of Greek characters {ν} each have a whole-script confusable in Latin. Thus the string
is a mixed-script confusable.
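For illustration only, this mixed-script test might be coded as in the following Java sketch (using the Java version of [ICU]); hasWholeScriptConfusableIn is an assumed helper that applies the whole-script test of Section 4.1 for a specific target script.
static boolean isMixedScriptConfusable(String s) {
    String nfkd = Normalizer.normalize(s, Normalizer.NFKD);
    // Group the characters by script, ignoring Common and Inherited.
    Map<Integer, UnicodeSet> byScript = new HashMap<Integer, UnicodeSet>();
    for (int i = 0; i < nfkd.length(); ) {
        int cp = nfkd.codePointAt(i);
        int script = UScript.getScript(cp);
        if (script != UScript.COMMON && script != UScript.INHERITED) {
            UnicodeSet set = byScript.get(script);
            if (set == null) byScript.put(script, set = new UnicodeSet());
            set.add(cp);
        }
        i += Character.charCount(cp);
    }
    if (byScript.size() < 2) return false;  // not a mixed-script string
    // For some script, every other script's characters must have
    // whole-script confusables in that script.
    for (Integer candidate : byScript.keySet()) {
        boolean allOthersConfusable = true;
        for (Map.Entry<Integer, UnicodeSet> e : byScript.entrySet()) {
            if (e.getKey().equals(candidate)) continue;
            if (!hasWholeScriptConfusableIn(e.getValue(), candidate)) {
                allOthersConfusable = false;
                break;
            }
        }
        if (allOthersConfusable) return true;
    }
    return false;
}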
The Unicode Standard supplies information that can be used for determining the script of
characters and detecting mixed-script text. The determination of script is according to
the Unicode Standard [UAX24], using data from the Unicode Character
Database [UCD].
In determining mixed script, Common and Inherited script characters are
ignored, except for non-Identifier Characters (see Section 3.
Identifier Characters). For example, "abc-def" counts
as a single script: the script of "-" is ignored. The string "I♥NY",
on the other hand, counts as mixed-script, since the heart character is outside of the
Identifier Characters.
The following is a Java sample of how this code can work (using the Java version of [ICU]):
private static boolean isMixedScript(String source) {
    int lastScript = UScript.INVALID_CODE;
    int cp;
    for (int i = 0; i < source.length(); i += UTF16.getCharCount(cp)) {
        cp = UTF16.charAt(source, i);
        int script = UScript.getScript(cp);
        if (script == UScript.COMMON || script == UScript.INHERITED) {
            // Ignore Common/Inherited characters that are identifier characters;
            // treat the rest as belonging to the Common "script".
            if (IdentifierSet.contains(cp)) continue;
            script = UScript.COMMON;
        }
        if (lastScript == UScript.INVALID_CODE) lastScript = script;
        else if (script != lastScript) return true;
    }
    return false;
}
This depends on IdentifierSet (a UnicodeSet) being set up with the
proper contents according to Section 3. Identifier Characters.
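For illustration only, IdentifierSet might be assembled roughly as follows, assuming the restricted and additions sets have been loaded from [idmod]; the exact contents are governed by Section 3. Identifier Characters, not by this snippet.
UnicodeSet identifierSet = new UnicodeSet("[:XID_Continue:]")  // base identifier characters
    .addAll(additions)       // recommended additions from [idmod]
    .removeAll(restricted);  // restricted characters from [idmod]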
Using the Unihan data in the Unicode Character Database [UCD] it
is possible to extend this mechanism, to qualify strings as 'mixed script' where they
contain both simplified-only and traditional-only Chinese characters.
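For illustration only, such an extension might use two precomputed UnicodeSets, simplifiedOnly and traditionalOnly, derived from the Unihan variant data (for example the kSimplifiedVariant and kTraditionalVariant fields); the set names and the derivation are illustrative, not specified here.
static boolean mixesSimplifiedAndTraditional(String s,
        UnicodeSet simplifiedOnly, UnicodeSet traditionalOnly) {
    UnicodeSet chars = new UnicodeSet().addAll(s);
    // Flag strings containing both simplified-only and traditional-only ideographs.
    return chars.containsSome(simplifiedOnly) && chars.containsSome(traditionalOnly);
}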
As discussed in [UTR36],
confusability among characters cannot be an exact science. There are many
factors that make confusability among characters a matter of degree:
- Shapes of characters vary greatly among the fonts used to represent them. The Unicode Standard represents them in the chart section with representative glyphs, but font designers are free to create their own glyphs. Because fonts can easily be created representing any Unicode code position using an arbitrary glyph, character confusability can never be avoided. For example, one could design a font where the ‘a’ looks like a ‘b’, ‘c’ like a ‘d’, and so on.
- Writing systems using contextual shaping (such as Arabic and many South Asian systems) introduce even more variation in text rendering. Characters do not really have an abstract shape in isolation and are only rendered as part of a cluster of characters making up words, expressions, and sentences. It is in fact a fairly common occurrence to find the same visual text representation corresponding to very different logical words that can only be recognized by context, if at all.
- A font style variant may introduce confusability that does not exist in another style (for example, normal versus italic). For example, in the Cyrillic script, the small letter TE (U+0442) looks like a small-caps Latin ‘T’ in normal style, while it looks like a small Latin ‘m’ in italic style.
The confusability tables were created by
collecting a number of prospective confusables, examining those confusables
according to a set of fonts, and processing the result for transitive
closure.
The prospective confusables were gathered from
a number of sources. Volunteers from within IBM and Microsoft, with native
speakers for languages with different writing systems, gathered initial
lists. The compatibility mappings were also used as a source, as were the
mappings from the draft UTR #30 Character Foldings
[http://unicode.org/reports/tr30/]. Erik van der Poel also contributed a
list derived from running a program over a large number of OpenType fonts to
catch characters that shared identical glyphs within a font. The process of
gathering visual confusables is ongoing: the Unicode Consortium welcomes
submission of additional mappings. In particular, it would be useful to
compare glyphs from common Macintosh, Linux, and Unix fonts as well. The
complex scripts of South / South East Asia also need special attention.
Please submit suggestions for additional
confusables, or suggested corrections to the given ones, with the online
reporting form [Feedback]. Additions must be
listed in a plain-text file in the standard format, such as:
#comment
2500 ; 4E00 # comment
002E ; 0702 # comment
...
The initial focus is on characters that can be
in the recommended profile for identifiers, because they are of most
concern. For mixed-script confusability, the initial focus is on confusable
characters between the Latin script and other scripts, because this is
currently perceived as the most important threat. Other combinations of
scripts should be more extensively reviewed in the future.
In addition, in-script confusability is extremely
user-dependent. For example, in the Latin script, characters with accents or
other appendages may look similar to the unadorned characters for some users,
especially if they are not familiar with their meaning in a particular
language. However, most users in a position to trust identifiers will have at
least a minimum understanding of the range of characters in their own
script, and there are separate mechanisms available to deal with other
scripts, as discussed in [UTR36].
The fonts used to assess the confusables
included those used by the major operating systems in user interfaces. In
addition, the representative glyphs used in the Unicode Standard were also
considered. Fonts used for the user interface in operating systems are an
important source, because they are the ones that will usually be seen by
users in circumstances where confusability is important, such as when
using IRIs (Internationalized Resource Identifiers) and their sub-elements
(e.g. domain names). These fonts have a number of other relevant
characteristics. They are rarely changed by the OS or applications; changes brought
by system upgrades tend to be gradual to avoid usability disruption. Because
user interface elements need to be legible at low screen resolutions
(implying a small number of pixels per em), fonts used in these
contexts tend to be designed in a sans-serif style, which has a tendency to
increase the possibility of confusables. (There are, however, some
locales where a serif style is in common use, for example, Chinese.)
Furthermore, strict bounding-box requirements create even more constraints
for scripts which use relatively large ascenders and descenders. This also
limits the space allocated for accent or tone marks, and can create more
opportunities for confusability.
Pairs of prospective confusables were removed
if they were always visually distinct at common sizes, both within and
across fonts.
This data was then closed under transitivity
(so that if X≅Y and Y≅Z, then X≅Z), and processed to produce the in-script
and cross-script tables. This is so that a single table can be used to
map an input string to a resulting so-called skeleton.
The files contain some internal information in
comments, indicating how the transitive closure was done. For example:
2500 ; 4E00 ; MA # ( ─ ↔ 一) BOX DRAWINGS LIGHT HORIZONTAL ↔ CJK UNIFIED IDEOGRAPH-4E00
# {source:1192} ― {source:961} — {source:1785}
The second comment mark (#), here on a separate
line, indicates intermediate steps in the transitive closure, with {..}
indicating the reason (the original source mapping between the characters).
In this case, the mappings are:
U+2500 (─) ↔ U+2015 (―) ↔ U+2014 (—) ↔ U+4E00 (一)
This skeleton is intended only for
internal use for testing confusability of strings; the resulting text is not
at all suitable for display to users or as a "normalization", since it will
appear to be a hodgepodge of different scripts. In particular, the result of
mapping an identifier will not necessarily be an identifier: for example, a
free-standing accent and a corresponding <space + combining mark> may be
confusable. Thus the confusability mappings can be used to test whether two
identifiers are confusable (if their skeletons are the same), but should
definitely not be used as a "normalization" of identifiers.
The data may be refined in future
versions of this specification. For information on handling this, see Section 2.9.1
Backwards Compatibility of [UTR36].
Note that allowing mixtures of upper and lowercase
text would complicate the process, and produce a large number of false
positives. For example, mixing cases in Latin and Greek may make the Latin
letter pairs {Y, U} and {N, V} confusable. That is because Y is
confusable with the Greek capital Upsilon, and the lowercase upsilon is
confusable with the lowercase Latin u.
Acknowledgments
Steven Loomis and other people on the ICU team were very helpful in developing the original
proposal for this technical report. Thanks also to the following people for their feedback or
contributions to this document or earlier versions of it: Douglas Davidson, Martin
Dürst, Asmus Freytag, Deborah Goldsmith, Paul Hoffman, Peter Karlsson,
Gervase Markham, Eric Muller, Erik van der Poel, Michael van Riper, Marcos Sanz,
Alexander Savenkov, Dominikus Scherkl, and Kenneth Whistler.
Data Files
The following files provide data used to
implement the recommendations in this document. The data may be
refined in future versions of this specification. For
information on handling this, see Section 2.9.1
Backwards Compatibility of [UTR36].
[idnchars]
idnchars.txt
IDN Characters: Provides a profile of identifiers from UAX #31, Identifier and Pattern Syntax [UAX31] as a recommended restriction of IDN identifiers for security purposes.

[idmod]
xidmodifications.txt
Identifier Modifications: Provides the list of additions and restrictions recommended for building a profile of identifiers for environments where security is at issue.

[confusables]
confusables.txt
Visually Confusable Characters: Provides a mapping for visual confusables for use in further restricting identifiers for security. The usage of the file is described in Section 4. Confusable Detection.

[confusablesWS]
confusablesWholeScript.txt
Whole Script Confusables: Data for testing for the possible existence of whole-script and mixed-script confusables. See Section 4. Confusable Detection.

[intentional]
intentional.txt
Intentional Confusable Mappings: The class of characters whose glyphs in any particular typeface would probably be designed to be identical in shape, by intention, at least when using a harmonized typeface design.

[source]
source/
Source Data Files: These are the source data files used to build the above files.
General References
Warning: all internet-draft and news links are unstable; you may have
to adjust the URL to get to the latest document.
[CharMod]
Character Model for the World Wide Web 1.0: Fundamentals
http://www.w3.org/TR/charmod/

[Charts]
Unicode Charts (with Last Resort Glyphs)
http://www.unicode.org/charts/lastresort.html
See also:
http://developer.apple.com/fonts/LastResortFont/
http://developer.apple.com/fonts/LastResortFont/LastResortTable.html

[DCore]
Derived Core Properties
http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

[Display]
Display Problems?
http://www.unicode.org/help/display_problems.html

[DNS-Case]
Donald E. Eastlake 3rd. "Domain Name System (DNS) Case Insensitivity Clarification". Internet Draft, January 2005.
http://www.ietf.org/internet-drafts/draft-ietf-dnsext-insensitive-06.txt

[FAQSec]
Unicode FAQ on Security Issues
http://www.unicode.org/faq/security.html

[ICANN]
Guidelines for the Implementation of Internationalized Domain Names
http://www.icann.org/general/idn-guidelines-20jun03.htm

[ICU]
International Components for Unicode
http://www.ibm.com/software/globalization/icu/

[IDNReg]
Registry for IDN Language Tables
http://www.iana.org/assignments/idn/
Tables are found at:
http://www.iana.org/assignments/idn/registered.htm

[IDN-Demo]
ICU (International Components for Unicode) IDN Demo
http://ibm.com/software/globalization/icu/demo/domain/

[Feedback]
Reporting Errors and Requesting Information Online
http://www.unicode.org/reporting.html

[Museum]
Internationalized Domain Names (IDN) in .museum - Supported Languages
http://about.museum/idn/language.html

[Reports]
Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.

[RFC1034]
P. Mockapetris. "DOMAIN NAMES - CONCEPTS AND FACILITIES", RFC 1034, November 1987.
http://ietf.org/rfc/rfc1034.txt
[RFC1035]
P. Mockapetris. "DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION", RFC 1035, November 1987.
http://ietf.org/rfc/rfc1035.txt
[RFC1535]
E. Gavron. "A Security Problem and Proposed Correction With Widely Deployed DNS Software", RFC 1535, October 1993.
http://ietf.org/rfc/rfc1535.txt

[RFC3454]
P. Hoffman, M. Blanchet. "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002.
http://ietf.org/rfc/rfc3454.txt

[RFC3490]
Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003.
http://ietf.org/rfc/rfc3490.txt

[RFC3491]
Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003.
http://ietf.org/rfc/rfc3491.txt

[RFC3492]
Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003.
http://ietf.org/rfc/rfc3492.txt

[RFC3743]
Konishi, K., Huang, K., Qian, H. and Y. Ko, "Joint Engineering Team (JET) Guidelines for Internationalized Domain Names (IDN) Registration and Administration for Chinese, Japanese, and Korean", RFC 3743, April 2004.
http://ietf.org/rfc/rfc3743.txt

[RFC3986]
T. Berners-Lee, R. Fielding, L. Masinter. "Uniform Resource Identifier (URI): Generic Syntax", RFC 3986, January 2005.
http://ietf.org/rfc/rfc3986.txt

[RFC3987]
M. Duerst, M. Suignard. "Internationalized Resource Identifiers (IRIs)", RFC 3987, January 2005.
http://ietf.org/rfc/rfc3987.txt

[UCD]
Unicode Character Database
http://www.unicode.org/ucd/
For an overview of the Unicode Character Database and a list of its associated files.

[UCDFormat]
UCD File Format
http://www.unicode.org/Public/UNIDATA/UCD.html#UCD_File_Format

[UAX9]
UAX #9: The Bidirectional Algorithm
http://www.unicode.org/reports/tr9/

[UAX15]
UAX #15: Unicode Normalization Forms
http://www.unicode.org/reports/tr15/

[UAX24]
UAX #24: Script Names
http://www.unicode.org/reports/tr24/

[UAX31]
UAX #31: Identifier and Pattern Syntax
http://www.unicode.org/reports/tr31/

[UTR36]
UTR #36: Unicode Security Considerations
http://www.unicode.org/reports/tr36/

[UTS18]
UTS #18: Unicode Regular Expressions
http://www.unicode.org/reports/tr18/

[Unicode]
The Unicode Standard, Version 4.1.0
http://www.unicode.org/versions/Unicode4.1.0/

[Versions]
Versions of the Unicode Standard
http://www.unicode.org/standard/versions/
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.
The following summarizes modifications from the previous revision of this document.
Revision 1:
- Created from Appendix A, B, and D from [UTR36].
- Created Section 6. Development Process based on document L2/06-055.
- Removed DITTO Mark, added intentional mappings.
- Added 5.0 scripts to removals: Balinese, Cuneiform, Phoenician, Phags_Pa.
- Added the intentional mappings, plus a pointer to source data.
Copyright © 2004-2005 Unicode, Inc. All Rights Reserved. The Unicode
Consortium makes no expressed or implied warranty of any kind, and assumes no liability for
errors or omissions. No liability is assumed for incidental and consequential damages in
connection with or arising out of the use of the information or programs contained or
accompanying this technical report. The Unicode
Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are
registered in some jurisdictions.