Character encoding mappings and related files

This file consists of tables with links to the available mapping data files. For the most current information, please refer to the Unicode FTP site for mapping data (ftp://ftp.unicode.org/Public/MAPPINGS/).

This file is provided as-is by Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If this file has been provided on optical media by Unicode, Inc., the sole remedy for any claim will be exchange of defective media within 90 days of receipt.

Unicode, Inc. hereby grants the right to freely use the information supplied in this file in the creation of products supporting the Unicode Standard, and to make copies of this file in any form for internal or external distribution as long as this notice remains attached.

Date of last update: 1999-10-09

Revision history:

1. 1999-10-07: created for Unicode 3.0

2. 1999-10-08: editorial changes

3. 1999-10-09: further changes

1. ASCII based

1.1 Unicode, ISO/IEC 10646

Various preexisting line ending conventions are used with these, but use of PARAGRAPH SEPARATOR and LINE SEPARATOR is recommended. All commonly occurring line ending conventions should be properly interpreted (even if mixed in the same file). See also UTR 13 (Unicode Newline Guidelines) regarding line/paragraph ending/separation, and UTR 14 (Line Breaking Properties) and its associated Unicode database file regarding line breaking, as well as UTR 9 (The Bidirectional Algorithm).
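
The recommendation that mixed line ending conventions be interpreted properly can be sketched as follows. This is a minimal illustration, not a conforming implementation of UTR 13: the function name split_lines and the exact set of terminators handled (CR LF, CR, LF, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR) are assumptions made for this example.

```python
import re

# Terminators recognised in this sketch: CR LF, CR, LF, NEL (U+0085),
# LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029).
# CR LF is listed first so that the pair is consumed as a single break.
_BREAK = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")


def split_lines(text: str) -> list[str]:
    """Split text on any of the line endings above, even if they are mixed."""
    lines = _BREAK.split(text)
    # A trailing terminator ends the last line; it does not start a new one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

A text such as "a\r\nb\rc\u2028d" is split into four lines regardless of which platform produced each break. Empty lines between two consecutive breaks are kept.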

Character encoding	Mapping to Unicode/UTF-16	Date of last update	Remark
Unicode/UTF-8 (UTF-8, UTF-8N)	Given by algorithm (normative)		In Unicode, UTF-8 is limited to planes 0-16
Unicode/UTF-16 (UTF-16, UTF-16BE, [UTF-16LE])	Identity		Identical to ISO/IEC 10646/UTF-16, if big endian when serialised into octets
Unicode/SCSU	Given by algorithm (UTR 6)		Standard Compression Scheme for Unicode; Big endian
Unicode/UTF-32 (UTF-32, UTF-32BE, [UTF-32LE])	Given by algorithm (UTR 19)		UCS-4 limited to planes 0-16
[Unicode/UTF-7 WITHDRAWN]	Was given by algorithm		Was intended only for e-mail. Withdrawn and obsolescent.

ISO/IEC 10646/UTF-8	Given by algorithm (normative) for planes 0-16		Suitable for "8-bit clean" ASCII oriented programs
ISO/IEC 10646/UTF-16	Identity		UCS-2 extended to planes 0-16; Big endian when serialised into octets
ISO/IEC 10646/UCS-2	Identity		UTF-16 restricted to plane 0 (BMP); Big endian when serialised into octets; Stepping stone to UTF-16
ISO/IEC 10646/UCS-4	Given by algorithm (normative) for planes 0-16		Big endian when serialised into octets
[ISO/IEC 10646/UTF-1 WITHDRAWN]	Was given by algorithm		Withdrawn and obsolete.
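
To illustrate what "given by algorithm" means for the entries above, here is a minimal sketch of the UTF-8 and big-endian UTF-16 encodings of a single code point, restricted to planes 0-16. The function names are made up for this example. Rejecting surrogate code points is an assumption that follows current definitions of the encoding forms; this file does not state it.

```python
def _check(cp: int) -> None:
    # Planes 0-16 only; surrogate code points are not encodable on their own.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"cannot encode U+{cp:04X}")


def utf8_encode(cp: int) -> bytes:
    """Encode one code point as 1 to 4 octets of UTF-8."""
    _check(cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16be_encode(cp: int) -> bytes:
    """Encode one code point as UTF-16, serialised big endian."""
    _check(cp)
    if cp < 0x10000:
        units = [cp]                      # plane 0 (BMP): same as UCS-2
    else:
        v = cp - 0x10000                  # 20 bits spread over two units
        units = [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
    return b"".join(u.to_bytes(2, "big") for u in units)
```

For code points in plane 0 the UTF-16 result is a single 16-bit unit, which is why UCS-2 can be described as UTF-16 restricted to the BMP. Planes 1-16 need a surrogate pair.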

1.2 Other character encodings from ISO, IEC, ISO/IEC, ECMA, ETSI

1.3 Mac OS

Mac OS 8.5 and later versions are Unicode enabled. See also vendors/apple/readme.txt.

Line ending convention for these is often CARRIAGE RETURN.

Character encoding	Mapping to Unicode/UTF-16	Date of last update	Remark
Mac OS Arabic	vendors/apple/arabic.txt	1999-Sep-22	Reading order storage?
Mac OS Central European	vendors/apple/centeuro.txt	1999-Sep-22	CP 10029
Mac OS Chinese Simplified	vendors/apple/chinsimp.txt	1999-Sep-22
Mac OS Chinese Traditional	vendors/apple/chintrad.txt	1999-Sep-22
Mac OS Croatian	vendors/apple/croatian.txt	1999-Sep-22
Mac OS Cyrillic	vendors/apple/cyrillic.txt	1999-Sep-22	CP 10007
Mac OS Devanagari	vendors/apple/devanaga.txt	1999-Sep-22
Mac OS Farsi	vendors/apple/farsi.txt	1999-Sep-22
Mac OS Greek	vendors/apple/greek.txt	1999-Sep-22	CP 10006
Mac OS Gujarati	vendors/apple/gujarati.txt	1999-Sep-22
Mac OS Gurmukhi	vendors/apple/gurmukhi.txt	1999-Sep-22
Mac OS Hebrew	vendors/apple/hebrew.txt	1999-Sep-22	Reading order storage?
Mac OS Icelandic	vendors/apple/iceland.txt	1999-Sep-22	CP 10079
Mac OS Japanese	vendors/apple/japanese.txt	1999-Sep-22	Apple Shift-JIS
Mac OS Korean	vendors/apple/korean.txt	1999-Sep-22
Mac OS Roman	vendors/apple/roman.txt	1999-Sep-22	CP 10000
Mac OS Romanian	vendors/apple/romanian.txt	1999-Sep-22
Mac OS Thai	vendors/apple/thai.txt	1999-Sep-22
Mac OS Turkish	vendors/apple/turkish.txt	1999-Sep-22	CP 10081
Mac OS Ukrainian	vendors/apple/ukraine.txt	1999-Sep-22	See vendors/apple/cyrillic.txt

CP 10007 MacCyrillic	vendors/micsft/mac/cyrillic.txt	04/24/96	See vendors/apple/cyrillic.txt
CP 10006 MacGreek	vendors/micsft/mac/greek.txt	04/24/96	See vendors/apple/greek.txt
CP 10079 MacIcelandic	vendors/micsft/mac/iceland.txt	04/24/96	See vendors/apple/iceland.txt
CP 10029 MacLatin2	vendors/micsft/mac/latin2.txt	04/24/96	See vendors/apple/centeuro.txt
CP 10000 MacRoman	vendors/micsft/mac/roman.txt	04/24/96	See vendors/apple/roman.txt
CP 10081 MacTurkish	vendors/micsft/mac/turkish.txt	04/24/96	See vendors/apple/turkish.txt

NEXTSTEP Encoding	vendors/next/nextstep.txt	1999 September 23	Line ending convention: LF

1.4 Windows

Windows NT is Unicode enabled. Windows 95 and onwards can output Unicode text.

Line ending convention for these is often CARRIAGE RETURN followed by LINE FEED.

Character encoding	Mapping to Unicode/UTF-16	Date of last update	Remark
CP 874	vendors/micsft/windows/cp874.txt	02/28/98	Latin/Thai
CP 932	vendors/micsft/windows/cp932.txt	04/15/98	MS Shift-JIS
CP 936	vendors/micsft/windows/cp936.txt	04/15/98	MS Chinese (Simpl.)
CP 949	vendors/micsft/windows/cp949.txt	04/15/98	MS Korean
CP 950	vendors/micsft/windows/cp950.txt	04/15/98	MS Big-5 (Trad. Chinese)
CP 1250	vendors/micsft/windows/cp1250.txt	04/15/98
CP 1251	vendors/micsft/windows/cp1251.txt	04/15/98	Latin/Cyrillic
CP 1252	vendors/micsft/windows/cp1252.txt	04/15/98	Extends on ISO/IEC 8859-1 Latin-1
CP 1253	vendors/micsft/windows/cp1253.txt	04/15/98	Latin/Greek
CP 1254	vendors/micsft/windows/cp1254.txt	04/15/98
CP 1255	vendors/micsft/windows/cp1255.txt	04/15/98	Latin/Hebrew; Reading order storage?
CP 1256	vendors/micsft/windows/cp1256.txt	01/5/99	Latin/Arabic; Reading order storage?
CP 1257	vendors/micsft/windows/cp1257.txt	04/15/98
CP 1258	vendors/micsft/windows/cp1258.txt	04/15/98
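
The remark that CP 1252 extends ISO/IEC 8859-1 can be checked with the codecs that ship with Python, used here only as a convenient stand-in for the mapping files listed above. The function name differing_bytes is made up for this sketch, and Python's "cp1252" and "latin-1" codecs are assumed to match the corresponding tables.

```python
def differing_bytes() -> list[int]:
    """Byte values that CP 1252 and ISO/IEC 8859-1 decode differently."""
    differing = []
    for value in range(0x100):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")    # defined for every byte value
        try:
            cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None                 # unassigned in Python's cp1252 codec
        if cp1252 != latin1:
            differing.append(value)
    return differing


if __name__ == "__main__":
    print([hex(v) for v in differing_bytes()])
```

The two encodings agree outside the range 0x80-0x9F. In that range ISO/IEC 8859-1 has control codes, while CP 1252 assigns graphic characters such as the euro sign at 0x80.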

1.5 DOS

Line ending convention for these is often CARRIAGE RETURN followed by LINE FEED.

See also the IBM README file (vendors/ibm/readme.txt) on encoding mappings.

Character encoding	Mapping to Unicode/UTF-16	Date of last update	Remark
CP 437 Latin (US)	vendors/micsft/pc/cp437.txt	04/24/96	Obsolescent
CP 737 Greek (A)	vendors/micsft/pc/cp737.txt	04/24/96	Obsolescent
CP 775 BaltRim	vendors/micsft/pc/cp775.txt	04/24/96	Obsolescent
CP 850 Latin (A)	vendors/micsft/pc/cp850.txt	04/24/96	Obsolescent
CP 852 Latin (B)	vendors/micsft/pc/cp852.txt	04/24/96	Obsolescent
CP 855 Cyrillic (A)	vendors/micsft/pc/cp855.txt	04/24/96	Obsolescent
CP 857 Turkish	vendors/micsft/pc/cp857.txt	04/24/96	Obsolescent
CP 860 Portuguese	vendors/micsft/pc/cp860.txt	04/24/96	Obsolescent
CP 861 Icelandic	vendors/micsft/pc/cp861.txt	04/24/96	Obsolescent
CP 862 Hebrew	vendors/micsft/pc/cp862.txt	04/24/96	Obsolescent; Reading order storage?
CP 863 Canada F	vendors/micsft/pc/cp863.txt	04/24/96	Obsolescent
CP 864 Arabic	vendors/micsft/pc/cp864.txt	04/24/96	Obsolescent; Reading order storage?
CP 865 Nordic	vendors/micsft/pc/cp865.txt	04/24/96	Obsolescent
CP 866 Cyrillic (B)	vendors/micsft/pc/cp866.txt	04/24/96	Obsolescent
CP 869 Greek (B)	vendors/micsft/pc/cp869.txt	04/24/96	Obsolescent
CP 874 Thai	vendors/micsft/pc/cp874.txt	04/15/98	See vendors/micsft/windows/cp874.txt
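
As a rough guide to using the simpler single-byte mapping files, the sketch below parses lines of the form "0xNN<tab>0xNNNN<tab>#comment" and decodes bytes with the resulting table. The line layout is an assumption based on the simple tables; consult the header of each file for its actual format. SAMPLE is a hypothetical excerpt written for this example: its first three rows follow CP 437, and the last row is invented to show an undefined entry. Multi-byte encodings and mappings to sequences of code points are not handled.

```python
# Hypothetical excerpt; not copied from any mapping file.
SAMPLE = (
    "# Column 1: byte value, column 2: Unicode code point\n"
    "0x80\t0x00c7\t#LATIN CAPITAL LETTER C WITH CEDILLA\n"
    "0x81\t0x00fc\t#LATIN SMALL LETTER U WITH DIAERESIS\n"
    "0x82\t0x00e9\t#LATIN SMALL LETTER E WITH ACUTE\n"
    "0x83\t      \t#UNDEFINED\n"
)


def parse_mapping(text: str) -> dict[int, str]:
    """Build a byte-to-character table from mapping file text."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue                          # no code point: undefined entry
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


def decode(data: bytes, table: dict[int, str]) -> str:
    """Decode bytes, substituting U+FFFD for unmapped values."""
    return "".join(table.get(b, "\ufffd") for b in data)
```

For example, decode(bytes([0x80, 0x81]), parse_mapping(SAMPLE)) yields the two characters U+00C7 and U+00FC.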

1.6 Other ASCII-based

Non-ISO encodings on Unixes, Adobe's encoding, non-MS PC encodings, non-Apple Mac encodings, RDS&DAB encodings, ...

Character encoding	Mapping to Unicode/UTF-16	Date of last update	Remark
Adobe Standard Encoding	vendors/adobe/stdenc.txt	30 March 1999	vendors/adobe/readme.txt

IBM CP 1006	vendors/misc/cp1006.txt	1999 July 27	ASCII+Arabic; Reading order storage?
CP 856	vendors/misc/cp856.txt	1999 July 27	ASCII+Hebrew; Reading order storage?
KOI 8-R (RFC 1489)	vendors/misc/koi8-r.txt	18 August 1999	ASCII+Cyrillic

JIS X 0201 (1976)	eastasia/jis/jis0201.txt	8 March 1994
Shift-JIS	eastasia/jis/shiftjis.txt	8 March 1994
Johab	eastasia/ksc/johab.txt	08/16/99

2. EBCDIC based

3. Others

East Asian without ASCII/EBCDIC, symbol, dingbat, private use area/corporate zone, character entities, cross-references, ...

Character encoding	Mapping to Unicode/UTF-16	Date of last update	Remark
IBM PC memory-mapped video graphics	vendors/misc/ibmgraph.txt	1999 July 27	Obsolescent

SGML character entities	vendors/misc/sgml.txt	25 July 1997

Adobe Symbol Encoding	vendors/adobe/symbol.txt	30 March 1999	vendors/adobe/readme.txt
Adobe Zapf Dingbats Encoding	vendors/adobe/zdingbat.txt	30 March 1999

Registry of Apple use of Unicode corporate-zone	vendors/apple/corpchar.txt	1999-Sep-22	Registry, not a mapping
Mac OS Dingbats	vendors/apple/dingbats.txt	1999-Sep-22
Mac OS Symbol	vendors/apple/symbol.txt	1999-Sep-22

TCVN-NSCII HyperCard stack	EASTASIA/TCVN/TCV-SEA.HQX		eastasia/tcvn/readme.txt
Unicode Han Character Cross-Reference	eastasia/cjkxref.txt	14 March 1994
Unihan database	eastasia/unihan.txt	23 September 1996

Korean Hangul Encoding Conversion	eastasia/ksc/hangul.txt	Oct 04, 1995
KS C 5601	eastasia/ksc/old5601.txt	6 December 1993	Note: For Unicode 1.1! Obsolete!
Unified Hangeul (KS C 5601-1992)	eastasia/ksc/ksc5601.txt	07/24/95	For Unicode 2.0 and onwards.
Unified Hangul (KS X 1001)	eastasia/ksc/ksx1001.txt	08/16/99

JIS X 0208 (1990)	eastasia/jis/jis0208.txt	8 March 1994
JIS X 0212 (1990)	eastasia/jis/jis0212.txt	8 March 1994

GB 12345-80	eastasia/gb/gb12345.txt	6 December 1993
GB 2312-80	eastasia/gb/gb2312.txt	6 December 1993

BIG5	eastasia/other/big5.txt	11 February 1994
CNS 11643-1986	eastasia/other/cns11643.txt	21 October 1994

The 'conscript' registry has a number of unofficial registrations of possible use of the private use areas, for those interested in constructed writing systems. The private use areas can be used for any experimental, temporary, or 'private' characters. There can by definition be no standard use of the private use areas. Non-standardised use of code points that are not designated as private use violates Unicode and ISO/IEC 10646 conformity. The "corporate zone" is part of the private use area in the BMP, but is not excluded from use by anyone.
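
A program that must detect private use characters in text can test code points against the private use ranges. The sketch below is illustrative only: the function names are made up, and the ranges listed (U+E000-U+F8FF in the BMP, plus planes 15 and 16 without their last two code points) are those defined by the Unicode Standard rather than anything stated in this file.

```python
import unicodedata

# Private use ranges: BMP private use area, then planes 15 and 16.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(cp: int) -> bool:
    """True if the code point is designated for private use."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def private_use_chars(text: str) -> list[str]:
    """Return the private use characters found in text, in order."""
    return [ch for ch in text if is_private_use(ord(ch))]


def agrees_with_database(cp: int) -> bool:
    """Cross-check against general category Co in Python's Unicode data."""
    return is_private_use(cp) == (unicodedata.category(chr(cp)) == "Co")
```

Such a check cannot say what a private use character means. That depends entirely on a private agreement between sender and receiver.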

Tables for section 1.2 (Other character encodings from ISO, IEC, ISO/IEC, ECMA, ETSI):

Character encoding	Mapping to Unicode/UTF-16	Date of last update	Remark
ISO/IEC 646:1991-IR	(By implicit algorithm)		'7-bit' ASCII; US-ASCII.
ISO/IEC 646-SE/FI			'7-bit'; Obsolescent
ISO/IEC 646-DK/NO			'7-bit'; Obsolescent
ISO/IEC 646-DE			'7-bit'; Obsolescent
ISO/IEC 646-FR			'7-bit'; Obsolescent
ISO/IEC 646-IT			'7-bit'; Obsolescent
ISO/IEC 646-ES			'7-bit'; Obsolescent

ETSI 03.38 '7-bit' default alphabet			GSM/SMS (UCS-2 can also be used for GSM/SMS)

ISO/IEC 8859-1:1998	iso8859/8859-1.txt	1999 July 27	Latin-1 (see Latin-9 below)
ISO/IEC 8859-2:1999	iso8859/8859-2.txt	1999 July 27	Latin-2
ISO/IEC 8859-3:1999	iso8859/8859-3.txt	1999 July 27	Latin-3
ISO/IEC 8859-4:1998	iso8859/8859-4.txt	1999 July 27	Latin-4
ISO/IEC 8859-5:1999	iso8859/8859-5.txt	1999 July 27	Latin/Cyrillic
ISO/IEC 8859-6:1999	iso8859/8859-6.txt	1999 July 27	Latin/Arabic; L-to-R storage?
ISO/IEC 8859-7:1987	iso8859/8859-7.txt	1999 July 27	Latin/Greek
ISO/IEC 8859-8:1999	iso8859/8859-8.txt	1999 July 27	Latin/Hebrew; L-to-R storage?
ISO/IEC 8859-9:1999	iso8859/8859-9.txt	1999 July 27	Latin-5
ISO/IEC 8859-10:1998	iso8859/8859-10.txt	1999 July 27	Latin-6
ISO/IEC 8859-11			Latin/Thai
ISO/IEC 8859-12			Unused 8859 part number
ISO/IEC 8859-13:1998	iso8859/8859-13.txt	1999 July 27	Latin-7
ISO/IEC 8859-14:1998	iso8859/8859-14.txt	1999 July 27	Latin-8
ISO/IEC 8859-15:1999	iso8859/8859-15.txt	1999 July 27	Latin-9 (Latin-1 replacement)
ISO/IEC 8859-16			Latin-10

ISO/IEC 6937:1994			Note that a combining character is stored before its base character for ISO/IEC 6937.
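
Because ISO/IEC 6937 stores a combining character before its base character, while Unicode places combining marks after the base, a converter has to reorder them. The sketch below assumes the bytes have already been mapped one-to-one to Unicode characters and only fixes the order. The function name marks_after_base is hypothetical, and a real converter would also need the ISO/IEC 6937 byte-to-character table, which is not listed in this file.

```python
import unicodedata


def marks_after_base(chars: str) -> str:
    """Move each combining mark from before its base character to after it."""
    out = []
    pending = []                      # marks waiting for their base
    for ch in chars:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)               # stray marks with no base stay at the end
    return "".join(out)


def to_composed(chars: str) -> str:
    """Reorder, then compose to precomposed characters where they exist."""
    return unicodedata.normalize("NFC", marks_after_base(chars))
```

For example, COMBINING ACUTE ACCENT followed by "e" becomes "e" followed by the accent, which normalization form C then composes to U+00E9.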

Tables for section 2 (EBCDIC based):

Character encoding	Mapping to Unicode/UTF-16	Date of last update	Remark
Unicode/UTF-EBCDIC	Given by algorithm (UTR 16)		Only for use where EBCDIC is required.

IBM EBCDIC CP 424 (Hebrew)	vendors/misc/cp424.txt	1999 July 27	L-to-R storage?

CP 037 IBM US Canada	vendors/micsft/ebcdic/cp037.txt	04/24/96
CP 500 IBM International	vendors/micsft/ebcdic/cp500.txt	04/24/96
CP 875 IBM Greek	vendors/micsft/ebcdic/cp875.txt	04/24/96
CP 1026 IBM Latin-5 Turkish	vendors/micsft/ebcdic/cp1026.txt	04/24/96
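
The difference between ASCII based and EBCDIC based encodings is easy to see with the codecs bundled with Python, used here as a stand-in for the mapping files above. This is a small sketch: the function name to_ebcdic is made up, and Python's "cp037" and "cp500" codecs are assumed to correspond to the CP 037 and CP 500 tables.

```python
def to_ebcdic(text: str, codec: str = "cp037") -> bytes:
    """Encode text with one of Python's EBCDIC codecs (cp037 by default)."""
    return text.encode(codec)


def from_ebcdic(data: bytes, codec: str = "cp037") -> str:
    """Decode EBCDIC bytes back to a Unicode string."""
    return data.decode(codec)


if __name__ == "__main__":
    sample = "A0 "
    print(sample.encode("ascii").hex())   # ASCII byte values
    print(to_ebcdic(sample).hex())        # EBCDIC byte values for the same text
    # The EBCDIC code pages differ among themselves for some punctuation.
    print(to_ebcdic("[", "cp037") != to_ebcdic("[", "cp500"))
```

Letters, digits and the space all have different byte values than in ASCII, so EBCDIC data cannot be passed through ASCII oriented programs unchanged.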