This file consists of tables with links to the available mapping data files. For the most current information, please refer to the Unicode ftp site for mapping data (ftp://ftp.unicode.org/Public/MAPPINGS/).
This file is provided as-is by Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If this file has been provided on optical media by Unicode, Inc., the sole remedy for any claim will be exchange of defective media within 90 days of receipt.
Unicode, Inc. hereby grants the right to freely use the information supplied in this file in the creation of products supporting the Unicode Standard, and to make copies of this file in any form for internal or external distribution as long as this notice remains attached.
Date of last update: 1999-10-09
Revision history:
1. 1999-10-07: created for Unicode 3.0
2. 1999-10-08: editorial changes
3. 1999-10-09: further changes
Various preexisting line ending conventions are used with these, but use of PARAGRAPH SEPARATOR and LINE SEPARATOR is recommended. All commonly occurring line ending conventions should be properly interpreted (even if mixed in the same file). See also UTR 13 (Unicode Newline Guidelines) regarding line/paragraph ending/separation, and UTR 14 (Line Breaking Properties) and its associated Unicode database file regarding line breaking, as well as UTR 9 (The Bidirectional Algorithm).
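The recommendation that all common line ending conventions be interpreted, even when mixed in one file, can be illustrated with a minimal sketch. The function name `split_lines` and the exact set of recognised separators (CRLF, CR, LF, NEXT LINE, LINE SEPARATOR, PARAGRAPH SEPARATOR) are assumptions made for this illustration; UTR 13 remains the authoritative description.

```python
import re

# CRLF must be tried first so that it counts as one line end, not two.
_LINE_END = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")

def split_lines(text):
    """Split text on CRLF, CR, LF, NEL, LS or PS, even when mixed."""
    lines = _LINE_END.split(text)
    # A trailing line end does not start a new (empty) line.
    if lines and lines[-1] == "":
        lines.pop()
    return lines

mixed = "a\r\nb\rc\nd\u0085e\u2028f\u2029g"
print(split_lines(mixed))
```

Python's built-in `str.splitlines` recognises a broader set of separators; the explicit pattern here only makes the list of conventions visible.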
Character encoding | Mapping to Unicode/UTF-16 | Date of last update | Remark |
Unicode/UTF-8 (UTF-8, UTF-8N) | Given by algorithm (normative) | | In Unicode UTF-8 is limited to planes 0-16 |
Unicode/UTF-16 (UTF-16, UTF-16BE, [UTF-16LE]) | Identity | | Identical to ISO/IEC 10646/UTF-16, if big endian when serialised into octets |
Unicode/SCSU | Given by algorithm (UTR 6) | | Standard Compression Scheme for Unicode; Big endian |
Unicode/UTF-32 (UTF-32, UTF-32BE, [UTF-32LE]) | Given by algorithm (UTR 19) | | UCS-4 limited to planes 0-16 |
[Unicode/UTF-7 WITHDRAWN] | Was given by algorithm | | Was intended only for e-mail. Withdrawn and obsolescent. |
ISO/IEC 10646/UTF-8 | Given by algorithm (normative) for planes 0-16 | | Suitable for "8-bit clean" ASCII oriented programs |
ISO/IEC 10646/UTF-16 | Identity | | UCS-2 extended to planes 0-16; Big endian when serialised into octets |
ISO/IEC 10646/UCS-2 | Identity | | UTF-16 restricted to plane 0 (BMP); Big endian when serialised into octets; Stepping stone to UTF-16 |
ISO/IEC 10646/UCS-4 | Given by algorithm (normative) for planes 0-16 | | Big endian when serialised into octets |
[ISO/IEC 10646/UTF-1 WITHDRAWN] | Was given by algorithm | | Withdrawn and obsolete. |
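As an illustration of how UTF-16 extends UCS-2 to planes 0-16 and is serialised big endian, the following is a minimal sketch of the surrogate pair computation. The function names `utf16_code_units` and `utf16be_octets` are hypothetical and chosen only for this example; the Unicode Standard itself is the normative definition.

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units for one code point in planes 0-16."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside planes 0-16")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if cp < 0x10000:
        # Plane 0 (BMP): identical to UCS-2, a single code unit.
        return [cp]
    # Planes 1-16: split the 20-bit offset into a surrogate pair.
    offset = cp - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]

def utf16be_octets(cp):
    """Serialise the code units into octets, most significant octet first."""
    return b"".join(unit.to_bytes(2, "big") for unit in utf16_code_units(cp))

print([hex(u) for u in utf16_code_units(0x1D11E)])
```

The result can be compared against Python's own codec, for example `chr(0x1D11E).encode("utf-16-be")`.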
See also iso8859/readme.txt.
Line ending convention for these is often LINE FEED.
Character encoding | Mapping to Unicode/UTF-16 | Date of last update | Remark |
ISO/IEC 646:1991-IR | (By implicit algorithm) | | '7-bit' ASCII; US-ASCII. |
ISO/IEC 646-SE/FI | | | '7-bit'; Obsolescent |
ISO/IEC 646-DK/NO | | | '7-bit'; Obsolescent |
ISO/IEC 646-DE | | | '7-bit'; Obsolescent |
ISO/IEC 646-FR | | | '7-bit'; Obsolescent |
ISO/IEC 646-IT | | | '7-bit'; Obsolescent |
ISO/IEC 646-ES | | | '7-bit'; Obsolescent |
ETSI 03.38 '7-bit' default alphabet | | | GSM/SMS (UCS-2 can also be used for GSM/SMS) |
ISO/IEC 8859-1:1998 | | 1999 July 27 | Latin-1 (see Latin-9 below) |
ISO/IEC 8859-2:1999 | | 1999 July 27 | Latin-2 |
ISO/IEC 8859-3:1999 | | 1999 July 27 | Latin-3 |
ISO/IEC 8859-4:1998 | | 1999 July 27 | Latin-4 |
ISO/IEC 8859-5:1999 | | 1999 July 27 | Latin/Cyrillic |
ISO/IEC 8859-6:1999 | | 1999 July 27 | Latin/Arabic; L-to-R storage? |
ISO/IEC 8859-7:1987 | | 1999 July 27 | Latin/Greek |
ISO/IEC 8859-8:1999 | | 1999 July 27 | Latin/Hebrew; L-to-R storage? |
ISO/IEC 8859-9:1999 | | 1999 July 27 | Latin-5 |
ISO/IEC 8859-10:1998 | | 1999 July 27 | Latin-6 |
ISO/IEC 8859-11 | | | Latin/Thai |
12 | | | Unused 8859 part number |
ISO/IEC 8859-13:1998 | | 1999 July 27 | Latin-7 |
ISO/IEC 8859-14:1998 | | 1999 July 27 | Latin-8 |
ISO/IEC 8859-15:1999 | | 1999 July 27 | Latin-9 (Latin-1 replacement) |
ISO/IEC 8859-16 | | | Latin-10 |
ISO/IEC 6937:1994 | | | Note that a combining character is stored before its base character for ISO/IEC 6937. |
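The ordering difference noted for ISO/IEC 6937 matters when converting to Unicode, where combining marks follow their base character. The sketch below assumes the input has already been mapped to Unicode characters one by one but is still in ISO/IEC 6937 order; the function name `marks_after_base` is hypothetical, and the sketch does not implement the ISO/IEC 6937 byte-to-character mapping itself.

```python
import unicodedata

def marks_after_base(chars):
    """Move each combining mark so that it follows the next base character.

    Assumes the input is in ISO/IEC 6937 order (mark first, base second);
    text that is already in Unicode order must not be passed through this.
    """
    out = []
    pending = []  # marks seen but not yet attached to a base character
    for ch in chars:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # keep stray trailing marks rather than drop them
    return "".join(out)

reordered = marks_after_base("caf\u0301e")  # acute accent stored before 'e'
print(unicodedata.normalize("NFC", reordered))
```

Normalising to NFC afterwards is optional; it only composes the reordered sequence into a precomposed character where one exists.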
Mac OS 8.5 and onwards is Unicode enabled. See also vendors/apple/readme.txt.
Line ending convention for these is often CARRIAGE RETURN.
Character encoding | Mapping to Unicode/UTF-16 | Date of last update | Remark |
Mac OS Arabic | | 1999-Sep-22 | Reading order storage? |
Mac OS Central European | | 1999-Sep-22 | CP 10029 |
Mac OS Chinese Simplified | | 1999-Sep-22 | |
Mac OS Chinese Traditional | | 1999-Sep-22 | |
Mac OS Croatian | | 1999-Sep-22 | |
Mac OS Cyrillic | | 1999-Sep-22 | CP 10007 |
Mac OS Devanagari | | 1999-Sep-22 | |
Mac OS Farsi | | 1999-Sep-22 | |
Mac OS Greek | | 1999-Sep-22 | CP 10006 |
Mac OS Gujarati | | 1999-Sep-22 | |
Mac OS Gurmukhi | | 1999-Sep-22 | |
Mac OS Hebrew | | 1999-Sep-22 | Reading order storage? |
Mac OS Icelandic | | 1999-Sep-22 | CP 10079 |
Mac OS Japanese | | 1999-Sep-22 | Apple Shift-JIS |
Mac OS Korean | | 1999-Sep-22 | |
Mac OS Roman | | 1999-Sep-22 | CP 10000 |
Mac OS Romanian | | 1999-Sep-22 | |
Mac OS Thai | | 1999-Sep-22 | |
Mac OS Turkish | | 1999-Sep-22 | CP 10081 |
Mac OS Ukrainian | | 1999-Sep-22 | |
CP 10007 MacCyrillic | | 04/24/96 | |
CP 10006 MacGreek | | 04/24/96 | |
CP 10079 MacIcelandic | | 04/24/96 | |
CP 10029 MacLatin2 | | 04/24/96 | |
CP 10000 MacRoman | | 04/24/96 | |
CP 10081 MacTurkish | | 04/24/96 | |
NEXTSTEP Encoding | | 1999 September 23 | Line ending convention: LF |
Windows NT is Unicode enabled. Windows 95 and onwards can output Unicode text.
Line ending convention for these is often CARRIAGE RETURN followed by LINE FEED.
Character encoding | Mapping to Unicode/UTF-16 | Date of last update | Remark |
CP 874 | | 02/28/98 | Latin/Thai |
CP 932 | | 04/15/98 | MS Shift-JIS |
CP 936 | | 04/15/98 | MS Chinese (Simpl.) |
CP 949 | | 04/15/98 | MS Korean |
CP 950 | | 04/15/98 | MS Big-5 (Trad. Chinese) |
CP 1250 | | 04/15/98 | |
CP 1251 | | 04/15/98 | Latin/Cyrillic |
CP 1252 | | 04/15/98 | Extends on ISO/IEC 8859-1 Latin-1 |
CP 1253 | | 04/15/98 | Latin/Greek |
CP 1254 | | 04/15/98 | |
CP 1255 | | 04/15/98 | Latin/Hebrew; Reading order storage? |
CP 1256 | | 01/5/99 | Latin/Arabic; Reading order storage? |
CP 1257 | | 04/15/98 | |
CP 1258 | | 04/15/98 | |
Line ending convention for these is often CARRIAGE RETURN followed by LINE FEED.
See also the IBM README file (vendors/ibm/readme.txt) on encoding mappings.
Character encoding | Mapping to Unicode/UTF-16 | Date of last update | Remark |
CP 437 Latin (US) | | 04/24/96 | Obsolescent |
CP 737 Greek (A) | | 04/24/96 | Obsolescent |
CP 775 BaltRim | | 04/24/96 | Obsolescent |
CP 850 Latin (A) | | 04/24/96 | Obsolescent |
CP 852 Latin (B) | | 04/24/96 | Obsolescent |
CP 855 Cyrillic (A) | | 04/24/96 | Obsolescent |
CP 857 Turkish | | 04/24/96 | Obsolescent |
CP 860 Portuguese | | 04/24/96 | Obsolescent |
CP 861 Icelandic | | 04/24/96 | Obsolescent |
CP 862 Hebrew | | 04/24/96 | Obsolescent; Reading order storage? |
CP 863 Canada F | | 04/24/96 | Obsolescent |
CP 864 Arabic | | 04/24/96 | Obsolescent; Reading order storage? |
CP 865 Nordic | | 04/24/96 | Obsolescent |
CP 866 Cyrillic (B) | | 04/24/96 | Obsolescent |
CP 869 Greek (B) | | 04/24/96 | Obsolescent |
CP 874 Thai | | 04/15/98 | |
Non-ISO encodings on Unixes, Adobe's encoding, non-MS PC encodings, non-Apple Mac encodings, RDS&DAB encodings, ...
Character encoding | Mapping to Unicode/UTF-16 | Date of last update | Remark |
Adobe Standard Encoding | | 30 March 1999 | |
IBM CP 1006 | | 1999 July 27 | ASCII+Arabic; Reading order storage? |
CP 856 | | 1999 July 27 | ASCII+Hebrew; Reading order storage? |
KOI 8-R (RFC 1489) | | 18 August 1999 | ASCII+Cyrillic |
JIS X 0201 (1976) | | 8 March 1994 | |
Shift-JIS | | 8 March 1994 | |
Johab | | 08/16/99 | |
See also vendors/ibm/readme.txt.
Except for Unicode, line ending convention for these is often NEXT LINE.
Character encoding | Mapping to Unicode/UTF-16 | Date of last update | Remark |
Unicode/UTF-EBCDIC | Given by algorithm (UTR 16) | | Only for use where EBCDIC is required. |
IBM EBCDIC CP 424 (Hebrew) | | 1999 July 27 | L-to-R storage? |
CP 037 IBM US Canada | | 04/24/96 | |
CP 500 IBM International | | 04/24/96 | |
CP 875 IBM Greek | | 04/24/96 | |
CP 1026 IBM Latin-5 Turkish | | 04/24/96 | |
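To illustrate the NEXT LINE convention mentioned above for EBCDIC-based encodings, here is a minimal sketch using Python's bundled `cp037` codec, which maps EBCDIC octet 0x15 to U+0085 NEXT LINE. The function name `ebcdic_lines` is hypothetical, and real EBCDIC data may well use other line or record conventions, so this is an illustration rather than a general converter.

```python
NEL = "\u0085"  # NEXT LINE

def ebcdic_lines(data, codec="cp037"):
    """Decode EBCDIC octets and split the text on NEXT LINE."""
    text = data.decode(codec)
    return text.split(NEL)

# 0xC1 and 0xC2 are 'A' and 'B' in CP 037; 0x15 decodes to NEXT LINE.
sample = bytes([0xC1, 0x15, 0xC2])
print(ebcdic_lines(sample))
```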
East Asian without ASCII/EBCDIC, symbol, dingbat, private use area/corporate zone, character entities, cross-references, ...
Character encoding | Mapping to Unicode/UTF-16 | Date of last update | Remark |
IBM PC memory-mapped video graphics | | 1999 July 27 | Obsolescent |
SGML character entities | | 25 July 1997 | |
Adobe Symbol Encoding | | 30 March 1999 | |
Adobe Zapf Dingbats Encoding | | 30 March 1999 | |
Registry of Apple use of Unicode corporate-zone | | 1999-Sep-22 | Registry, not a mapping |
Mac OS Dingbats | | 1999-Sep-22 | |
Mac OS Symbol | | 1999-Sep-22 | |
TCVN-NSCII HyperCard stack | | | |
Unicode Han Character Cross-Reference | | 14 March 1994 | |
Unihan database | | 23 September 1996 | |
Korean Hangul Encoding Conversion | | Oct 04, 1995 | |
KS C 5601 | | 6 December 1993 | Note: For Unicode 1.1! Obsolete! |
Unified Hangeul (KS C 5601-1992) | | 07/24/95 | For Unicode 2.0 and onwards. |
Unified Hangul (KS X 1001) | | 08/16/99 | |
JIS X 0208 (1990) | | 8 March 1994 | |
JIS X 0212 (1990) | | 8 March 1994 | |
GB 12345-80 | | 6 December 1993 | |
GB 2312-80 | | 6 December 1993 | |
BIG5 | | 11 February 1994 | |
CNS 11643-1986 | | 21 October 1994 | |
The 'conscript' registry has a number of unofficial registrations of possible use of the private use areas, for those interested in constructed writing systems. The private use areas can be used for any experimental, temporary, or 'private' characters. There can by definition be no standard use of the private use areas. Non-standardised use of code points that are not designated as private use violates Unicode and ISO/IEC 10646 conformity. The "corporate zone" is part of the private use area in the BMP, but is not excluded from use by anyone.
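A check for whether a code point lies in a private use area can be sketched as follows. The ranges used (U+E000..U+F8FF in the BMP, plus U+F0000..U+FFFFD and U+100000..U+10FFFD in planes 15 and 16) are taken from the Unicode Standard rather than from this file, and the names `PRIVATE_USE_RANGES` and `is_private_use` are hypothetical.

```python
# Private use ranges as (first, last) code points, both inclusive.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # private use area in the BMP
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
)

def is_private_use(cp):
    """Return True if the code point is designated for private use."""
    return any(first <= cp <= last for first, last in PRIVATE_USE_RANGES)

for cp in (0x0041, 0xE000, 0xF8FF, 0xF900, 0x10FFFD):
    print(f"U+{cp:04X}", is_private_use(cp))
```

The same information is available from `unicodedata.category(chr(cp)) == "Co"`; the explicit ranges are shown only to make the boundaries visible.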