This file consists of tables of links to the
available mapping data files. For the most current information, please refer to
the Unicode ftp site for mapping data (ftp://ftp.unicode.org/Public/MAPPINGS/).
This file is provided as-is by Unicode, Inc. (The
Unicode Consortium). No claims are made as to fitness for any particular
purpose. No warranties of any kind are expressed or implied. The recipient
agrees to determine applicability of information provided. If this file has
been provided on optical media by Unicode, Inc., the sole remedy for any claim
will be exchange of defective media within 90 days of receipt.
Unicode, Inc. hereby grants the right to freely use
the information supplied in this file in the creation of products supporting
the Unicode Standard, and to make copies of this file in any form for internal
or external distribution as long as this notice remains attached.
Date of last update: 2000-08-03
Revision history:
1. 1999-10-07: created for Unicode 3.0
2. 1999-10-08: editorial changes
3. 1999-10-09: further changes
4. 2000-08-03: addition of GSM/7-bit and fix of table headings
Various preexisting line ending conventions are
used with the Unicode and ISO/IEC 10646 encodings listed below, but use of
PARAGRAPH SEPARATOR and LINE SEPARATOR is recommended.
All commonly occurring line ending conventions should be properly interpreted
(even if mixed in the same file). See also UTR 13 (Unicode Newline
Guidelines) regarding line/paragraph ending/separation, and UTR 14 (Line
Breaking Properties) and its associated Unicode database file regarding line
breaking, as well as UTR 9 (The Bidirectional Algorithm).
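As an illustration of interpreting mixed line endings, the following is a minimal sketch, not a normative algorithm. It assumes the conventions of interest are CR LF, LF, CR, NEXT LINE (U+0085), LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029); the function name split_lines is hypothetical. UTR 13 remains the authoritative guidance.

```python
import re

# Assumption: these are the line ending conventions to recognise.
# CR LF is listed first so that the pair counts as ONE break, not two.
_LINE_BREAK = re.compile(r"\r\n|[\n\r\x85\u2028\u2029]")


def split_lines(text: str) -> list[str]:
    """Split text at every recognised line ending, even if mixed in one file."""
    return _LINE_BREAK.split(text)


mixed = "a\r\nb\rc\nd\x85e\u2028f\u2029g"
print(split_lines(mixed))
```

This sketch does not distinguish line breaks from paragraph breaks; a converter that emits LINE SEPARATOR or PARAGRAPH SEPARATOR would need a policy for that choice.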
Character encoding |
Mapping to Unicode |
Date of last update |
Remark |
Unicode/UTF-8 (UTF-8) |
Given by algorithm (normative) |
|
In Unicode, UTF-8 is limited to planes 0-16 |
Unicode/UTF-16 (UTF-16, UTF-16BE) |
Given by algorithm (normative) |
|
Identical to ISO/IEC 10646/UTF-16 |
Unicode/UTF-16 (UTF-16LE) |
Byte pair swap if serialised into octets |
|
Does not conform to 10646 if used in serialisation into octets |
Unicode/SCSU |
Given by algorithm (UTR 6) |
|
Standard Compression Scheme for Unicode; Big endian |
Unicode/UTF-32 (UTF-32, UTF-32BE) |
Given by algorithm (UTR 19) |
|
UCS-4 limited to planes 0-16 |
Unicode/UTF-32 (UTF-32LE) |
Byte quartet reversal if serialised into octets; then given by algorithm (UTR 19) |
|
Does not conform to 10646 if used in serialisation into octets |
[Unicode/UTF-7 WITHDRAWN] |
Was given by algorithm in Unicode 2.0. Not included in Unicode 3.0. |
|
Was intended only for e-mail. Withdrawn and obsolescent. |
|
|||
ISO/IEC 10646/UTF-8 |
Given by algorithm (normative) for planes 0-16 |
|
Suitable for "8-bit clean" ASCII oriented programs |
ISO/IEC 10646/UTF-16 |
Given by algorithm |
|
UCS-2 extended to planes 0-16; Big endian when serialised into octets |
ISO/IEC 10646/UCS-2 |
Identity |
|
UTF-16 restricted to plane 0 (BMP); Big endian when serialised into octets; Stepping stone to UTF-16 |
ISO/IEC 10646/UCS-4 |
Given by algorithm (normative) for planes 0-16 |
|
Big endian when serialised into octets |
[ISO/IEC 10646/UTF-1 WITHDRAWN] |
Was given by algorithm |
|
Withdrawn and obsolete. |
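The relation between the big endian and little endian serialisations in the table above can be sketched as follows. This is an illustrative example only; the helper names swap_byte_pairs and reverse_byte_quartets are hypothetical, and Python's built-in utf-16-le, utf-16-be, utf-32-le and utf-32-be codecs are used as the reference.

```python
def swap_byte_pairs(data: bytes) -> bytes:
    """Swap the two octets of every 16-bit code unit (UTF-16BE <-> UTF-16LE)."""
    if len(data) % 2:
        raise ValueError("length must be a multiple of 2 octets")
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)


def reverse_byte_quartets(data: bytes) -> bytes:
    """Reverse every four-octet group (UTF-32BE <-> UTF-32LE)."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 octets")
    return b"".join(data[i:i + 4][::-1] for i in range(0, len(data), 4))


# One BMP letter, one BMP symbol, and one character from plane 1.
sample = "A\u20ac\U00010000"
assert swap_byte_pairs(sample.encode("utf-16-le")) == sample.encode("utf-16-be")
assert reverse_byte_quartets(sample.encode("utf-32-le")) == sample.encode("utf-32-be")
```

The sketch operates on octets without a byte order mark; handling of a leading byte order mark is left out.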
See also iso8859/readme.txt.
Line ending convention for these is often LINE
FEED.
Character encoding |
Mapping to Unicode |
Date of last update |
Remark |
ISO/IEC 646:1991-IRV |
(By implicit algorithm) |
|
'7-bit' ASCII; US-ASCII. |
|
|||
ETSI 03.38 '7-bit' default alphabet |
|
GSM/SMS (UCS-2 can also be used for GSM/SMS) |
|
|
|||
ISO/IEC 8859-1:1998 |
1999 July 27 |
Latin-1 (Western Europe, no Euro, not French) |
|
ISO/IEC 8859-2:1999 |
1999 July 27 |
Latin-2 (Central Europe) |
|
ISO/IEC 8859-3:1999 |
1999 July 27 |
Latin-3 |
|
ISO/IEC 8859-4:1998 |
1999 July 27 |
Latin-4 |
|
ISO/IEC 8859-5:1999 |
1999 July 27 |
Latin/Cyrillic |
|
ISO/IEC 8859-6:1999 |
1999 July 27 |
Latin/Arabic; L-to-R storage? |
|
ISO/IEC 8859-7:1987 |
1999 July 27 |
Latin/Greek |
|
ISO/IEC 8859-8:1999 |
1999 July 27 |
Latin/Hebrew; L-to-R storage? |
|
ISO/IEC 8859-9:1999 |
1999 July 27 |
Latin-5 |
|
ISO/IEC 8859-10:1998 |
1999 July 27 |
Latin-6 |
|
ISO/IEC 8859-11 |
|
|
Latin/Thai |
12 |
|
|
Unused 8859 part number |
ISO/IEC 8859-13:1998 |
1999 July 27 |
Latin-7 |
|
ISO/IEC 8859-14:1998 |
1999 July 27 |
Latin-8 |
|
ISO/IEC 8859-15:1999 |
1999 July 27 |
Latin-9 |
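A mapping data file of the kind listed above can be loaded with a few lines of code. The sketch below assumes the common layout of such files: one mapping per line, with the encoded value and the Unicode value as hexadecimal columns separated by whitespace, and "#" starting a comment. The function name parse_mapping and the sample lines are hypothetical; check the header of each data file for its actual format.

```python
def parse_mapping(lines):
    """Return {encoded value: Unicode character} from mapping-file lines."""
    table = {}
    for line in lines:
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            # Assumption: a lone first column marks an undefined code.
            continue
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


# Hypothetical excerpt in the assumed format (values as in ISO/IEC 8859-15).
SAMPLE = """\
# illustrative excerpt, not a complete table
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0xA4\t0x20AC\t# EURO SIGN
""".splitlines()

table = parse_mapping(SAMPLE)
print(sorted(table))
```

A dictionary keeps the sketch short; multi-byte encodings would use the same approach with larger keys.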
Mac OS 8.5 and onwards is Unicode enabled. See
also vendors/apple/readme.txt.
Line ending convention for these is often
CARRIAGE RETURN.
Character encoding |
Mapping to Unicode |
Date of last update |
Remark |
Mac OS Arabic |
1999-Sep-22 |
Reading order storage? |
|
Mac OS Central European |
1999-Sep-22 |
CP 10029 |
|
Mac OS Chinese Simplified |
1999-Sep-22 |
|
|
Mac OS Chinese Traditional |
1999-Sep-22 |
|
|
Mac OS Croatian |
1999-Sep-22 |
|
|
Mac OS Cyrillic |
1999-Sep-22 |
CP 10007 |
|
Mac OS Devanagari |
1999-Sep-22 |
|
|
Mac OS Farsi |
1999-Sep-22 |
|
|
Mac OS Greek |
1999-Sep-22 |
CP 10006 |
|
Mac OS Gujarati |
1999-Sep-22 |
|
|
Mac OS Gurmukhi |
1999-Sep-22 |
|
|
Mac OS Hebrew |
1999-Sep-22 |
Reading order storage? |
|
Mac OS Icelandic |
1999-Sep-22 |
CP 10079 |
|
Mac OS Japanese |
1999-Sep-22 |
Apple Shift-JIS |
|
Mac OS Korean |
1999-Sep-22 |
|
|
Mac OS Roman |
1999-Sep-22 |
CP 10000 |
|
Mac OS Romanian |
1999-Sep-22 |
|
|
Mac OS Thai |
1999-Sep-22 |
|
|
Mac OS Turkish |
1999-Sep-22 |
CP 10081 |
|
Mac OS Ukrainian |
1999-Sep-22 |
||
|
|||
CP 10007 MacCyrillic |
04/24/96 |
||
CP 10006 MacGreek |
04/24/96 |
||
CP 10079 MacIcelandic |
04/24/96 |
||
CP 10029 MacLatin2 |
04/24/96 |
||
CP 10000 MacRoman |
04/24/96 |
||
CP 10081 MacTurkish |
04/24/96 |
||
|
|||
NEXTSTEP Encoding |
1999 September 23 |
Line ending convention: LF |
Windows NT is Unicode enabled. Windows 95 and
onwards can output Unicode text.
Line ending convention for these is often
CARRIAGE RETURN followed by LINE FEED.
Character encoding |
Mapping to Unicode |
Date of last update |
Remark |
CP 874 |
02/28/98 |
Latin/Thai |
|
CP 932 |
04/15/98 |
MS Shift-JIS |
|
CP 936 |
04/15/98 |
MS Chinese (Simpl.) |
|
CP 949 |
04/15/98 |
MS Korean |
|
CP 950 |
04/15/98 |
MS Big-5 (Trad. Chinese) |
|
CP 1250 |
04/15/98 |
Central Europe |
|
CP 1251 |
04/15/98 |
Latin/Cyrillic |
|
CP 1252 |
04/15/98 |
Extends on ISO/IEC 8859-1 Latin-1 |
|
CP 1253 |
04/15/98 |
Latin/Greek |
|
CP 1254 |
04/15/98 |
Turkish |
|
CP 1255 |
04/15/98 |
Latin/Hebrew; Reading order storage? |
|
CP 1256 |
01/5/99 |
Latin/Arabic; Reading order storage? |
|
CP 1257 |
04/15/98 |
Baltic |
|
CP 1258 |
04/15/98 |
Vietnamese |
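The remark that CP 1252 extends ISO/IEC 8859-1 can be checked with a short sketch. It relies on the cp1252 and latin-1 codecs shipped with Python, which are assumed here to reflect the corresponding mapping tables; the function name differing_bytes is hypothetical.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list[int]:
    """List the single-byte values that the two codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" turns bytes undefined in a codec into U+FFFD.
        if raw.decode(codec_a, errors="replace") != raw.decode(codec_b, errors="replace"):
            diffs.append(value)
    return diffs


diffs = differing_bytes("cp1252", "latin-1")
print(f"{len(diffs)} differing bytes, from 0x{diffs[0]:02X} to 0x{diffs[-1]:02X}")
```

With these codecs the differences are confined to the range 0x80 to 0x9F, where ISO/IEC 8859-1 has control codes and CP 1252 has additional graphic characters.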
Line ending convention for these is often
CARRIAGE RETURN followed by LINE FEED.
See also the IBM README file (vendors/ibm/readme.txt) on encoding mappings.
Character encoding |
Mapping to Unicode |
Date of last update |
Remark |
CP 437 Latin (US) |
04/24/96 |
Obsolescent |
|
CP 737 Greek (A) |
04/24/96 |
Obsolescent |
|
CP 775 BaltRim |
04/24/96 |
Obsolescent |
|
CP 850 Latin (A) |
04/24/96 |
Obsolescent |
|
CP 852 Latin (B) |
04/24/96 |
Obsolescent |
|
CP 855 Cyrillic (A) |
04/24/96 |
Obsolescent |
|
CP 857 Turkish |
04/24/96 |
Obsolescent |
|
CP 860 Portuguese |
04/24/96 |
Obsolescent |
|
CP 861 Icelandic |
04/24/96 |
Obsolescent |
|
CP 862 Hebrew |
04/24/96 |
Obsolescent; Reading order storage? |
|
CP 863 Canada F |
04/24/96 |
Obsolescent |
|
CP 864 Arabic |
04/24/96 |
Obsolescent; Reading order storage? |
|
CP 865 Nordic |
04/24/96 |
Obsolescent |
|
CP 866 Cyrillic (B) |
04/24/96 |
Obsolescent |
|
CP 869 Greek (B) |
04/24/96 |
Obsolescent |
|
CP 874 Thai |
04/15/98 |
Non-ISO encodings on Unixes, Adobe's encoding, non-MS
PC encodings, non-Apple Mac encodings, RDS&DAB encodings, ...
Character encoding |
Mapping to Unicode |
Date of last update |
Remark |
Adobe Standard Encoding |
30 March 1999 |
||
|
|||
IBM CP 1006 |
1999 July 27 |
ASCII+Arabic; Reading order storage? |
|
CP 856 |
1999 July 27 |
ASCII+Hebrew; Reading order storage? |
|
KOI 8-R (RFC 1489) |
18 August 1999 |
ASCII+Cyrillic |
|
|
|||
JIS X 0201 (1976) |
8 March 1994 |
|
|
Shift-JIS |
8 March 1994 |
|
|
Johab |
08/16/99 |
|
See also: vendors/ibm/readme.txt.
Except for Unicode, line ending convention for
these is often NEXT LINE.
Character encoding |
Mapping to Unicode |
Date of last update |
Remark |
Unicode/UTF-EBCDIC |
Given by algorithm (UTR 16) |
|
Only for use where EBCDIC is required. |
|
|||
IBM EBCDIC CP 424 (Hebrew) |
1999 July 27 |
L-to-R storage? |
|
|
|||
CP 037 IBM US Canada |
04/24/96 |
|
|
CP 500 IBM International |
04/24/96 |
|
|
CP 875 IBM Greek |
04/24/96 |
|
|
CP 1026 IBM Latin-5 Turkish |
04/24/96 |
|
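To illustrate how an EBCDIC code page differs from ASCII-based encodings, including the NEXT LINE convention mentioned above, here is a small sketch using Python's built-in cp037 codec. Treating that codec as equivalent to the CP 037 table listed here is an assumption.

```python
text = "A"
ebcdic = text.encode("cp037")

# In EBCDIC the letter A is 0xC1, not 0x41 as in ASCII.
print(ebcdic.hex())

# Round trip back to Unicode.
assert ebcdic.decode("cp037") == text

# The EBCDIC New Line control 0x15 maps to U+0085 NEXT LINE,
# while U+000A LINE FEED sits at 0x25.
assert bytes([0x15]).decode("cp037") == "\x85"
assert "\n".encode("cp037") == bytes([0x25])
```

Converters between EBCDIC and other platforms therefore need an explicit policy for translating NEXT LINE to the target line ending convention.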
East Asian without ASCII/EBCDIC, symbol,
dingbat, private use area/corporate zone, character entities, cross-references,
...
Character encoding |
Mapping to Unicode |
Date of last update |
Remark |
IBM PC memory-mapped video graphics |
1999 July 27 |
Obsolescent |
|
|
|||
SGML character entities |
25 July 1997 |
|
|
|
|||
Adobe Symbol Encoding |
30 March 1999 |
||
Adobe Zapf Dingbats Encoding |
30 March 1999 |
|
|
|
|||
Registry of Apple use of Unicode corporate-zone |
1999-Sep-22 |
Registry, not a mapping |
|
Mac OS Dingbats |
1999-Sep-22 |
|
|
Mac OS Symbol |
1999-Sep-22 |
|
|
|
|||
TCVN-NSCII HyperCard stack |
|
||
Unicode Han Character Cross-Reference |
14 March 1994 |
|
|
Unihan database |
23 September 1996 |
|
|
|
|||
Korean Hangul Encoding Conversion |
Oct 04, 1995 |
|
|
KS C 5601 |
6 December 1993 |
Note: For Unicode 1.1! Obsolete! |
|
Unified Hangeul (KS C 5601-1992) |
07/24/95 |
For Unicode 2.0 and onwards. |
|
Unified Hangul (KS X 1001) |
08/16/99 |
|
|
|
|||
JIS X 0208 (1990) |
8 March 1994 |
|
|
JIS X 0212 (1990) |
8 March 1994 |
|
|
|
|||
GB 12345-80 |
6 December 1993 |
|
|
GB 2312-80 |
6 December 1993 |
|
|
|
|||
BIG5 |
11 February 1994 |
|
|
CNS 11643-1986 |
21 October 1994 |
|
The 'conscript' registry has a number of
unofficial registrations of possible use of the private use areas, for those
interested in constructed writing systems. The private use areas can be used
for any experimental, temporary, or 'private' characters. There can by
definition be no standard use of the private use areas. Non-standardised use of
code points that are not designated as private use violates Unicode and ISO/IEC
10646 conformity. The "corporate zone" is part of the private use
area in the BMP, but is not excluded from use by anyone.
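A check for private use code points can be sketched as below. The ranges are the private use area in the BMP (U+E000 to U+F8FF) and the private use planes 15 and 16; the names PRIVATE_USE_RANGES and is_private_use are hypothetical. The sketch does not single out the "corporate zone", since it is simply part of the BMP private use area.

```python
import unicodedata

# Private use ranges: BMP private use area, then planes 15 and 16
# (the last two code points of each plane are noncharacters).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(code_point: int) -> bool:
    """Return True if the code point lies in a private use area."""
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)


# Cross-check against the general category "Co" (private use).
for cp in (0x0041, 0xE000, 0xF8FF, 0xF0000, 0x10FFFD):
    assert is_private_use(cp) == (unicodedata.category(chr(cp)) == "Co")
```

Such a check only identifies where private characters may live; the meaning of any private use code point is still a matter of private agreement.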