Unicode Technical Report #16
UTF-EBCDIC
Summary
This document presents the specifications of UTF-EBCDIC - EBCDIC Friendly
Unicode (or UCS) Transformation Format.
Status
This document has been reviewed by Unicode members and other interested
parties, and has been approved by the Unicode Technical Committee as a Unicode
Technical Report. It is a stable document and may be used as reference
material or cited as a normative reference from another document.
A Unicode Technical Report (UTR) may contain either informative
material or normative specifications, or both. Each UTR may specify a base
version of the Unicode Standard. In that case, conformance to the UTR requires
conformance to that version or higher.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/.
For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
Please mail corrigenda and other comments to the author(s).
1 Scope
The term UTF-EBCDIC stands for EBCDIC-friendly Unicode (or UCS) Transformation
Format. EBCDIC, IBM's Extended Binary Coded Decimal Interchange Code, is one of
the widely used 8-bit industry encodings. Detailed information on EBCDIC can be
found in the IBM publication IBM Character Data Representation Architecture,
Reference and Registry, SC09-2190-00, December 1996.
To address the use of Unicode character data in byte-oriented ASCII-based
systems, the Unicode Standard (see section A.2 of the Unicode Standard)
(also ISO/IEC 10646-1, Amendment no. 2) has defined UTF-8. Use of UTF-8
permits existing ASCII-based systems that have hard-coded dependency on the
encoding of the ASCII repertoire of characters to safely process the
corresponding Unicode characters. There is a similar requirement to transform
Unicode characters to a form that is safe for EBCDIC systems for the control
characters and invariant characters.
This Technical Report defines UTF-EBCDIC.
Neither UTF-EBCDIC nor its intermediate form, called UTF-8-Mod in this
technical report, is intended to be used in open interchange environments.
UTF-EBCDIC is useful in homogeneous EBCDIC systems and networks.
2 Description
The UTF-EBCDIC encoding is derived from the Unicode scalar values following a
two step process:
- Conversion of the Unicode scalar values to a variable length byte sequence
called I8-sequence (intermediate 8-bit sequence) by applying a
modified UTF-8 transformation (UTF-8-Mod), enabling the preservation of 65
control characters as single bytes.
Valid pairs of surrogates (see Section 3.7, Surrogates, in the Unicode
Standard 2.0) must be converted first to their corresponding Unicode scalar
values by applying the UTF-16 transformation. Unicode scalar values in the
range X'10000' to X'10FFFF' obtained from I8-sequences, are transformed into
the corresponding surrogate pairs using the UTF-16 transformation.
- The bytes in the I8-sequence are then converted to the UTF-EBCDIC byte
sequence by using a single-byte to single-byte reversible conversion.
These two steps are defined below.
Note: The following notation is used in this Technical Report: X'nn ..
mm' represents hexadecimal values; <bb...bb> represents values in bit
notation; U+abcd represents a Unicode character.
3 Definition
3.1 Step 1: UTF-8-Mod
The UTF-8-Mod transformation definition is modeled after the UTF-8 definition in
the Unicode standard. UTF-8-Mod transforms the Unicode scalar values into
I8-sequences. The Unicode characters U+0000 to U+001F (corresponding to the C0
control characters X'00' to X'1F' of ASCII), U+0020 to U+007E (the ASCII
repertoire), and U+007F (the ASCII 'DEL' control character) are represented as
single bytes in the I8-sequence, similar to UTF-8. In addition, U+0080 to U+009F
(corresponding to the so-called C1 set of controls in ISO/IEC 6429) are also
represented as single bytes (X'80' to X'9F'). Thus the 65 Unicode characters
corresponding to the 65 ISO/IEC 6429 control characters and the 95 characters
corresponding to the 95 ASCII graphic characters (the G0 set) are represented in
the I8-sequence as single bytes.
When these I8-sequence bytes are converted to the UTF-EBCDIC form, the
corresponding 65 EBCDIC control characters and 95 EBCDIC graphic characters are
preserved as single bytes in the UTF-EBCDIC byte sequence. The 95 EBCDIC graphic
characters include 82 invariant (occupy the same code position) characters
(including SPACE) across most EBCDIC single-byte code pages and 13 variant ASCII
graphic characters (occupy varying code positions). Positions assigned to EBCDIC
controls, the invariant graphic characters and the variant graphics are shown in
Table B.1.
Furthermore, the values X'00'...X'9F' do not appear in any byte of an
I8-sequence except as the direct representation of U+0000 to U+009F. Each
Unicode scalar value that is not a part of a valid surrogate pair is
represented in an I8-sequence by 1, 2, 3 or 4 bytes, depending on the value. A
valid surrogate pair is first converted to its corresponding Unicode scalar
value, which then maps into either 4 bytes or 5 bytes, depending on the value.
The UTF-8-Mod transformation is intended to be used only as an intermediate
step in arriving at UTF-EBCDIC. It is not intended to be used elsewhere.
The I8-sequence is a variable length encoding of Unicode characters as 8-bit
byte sequences, where the high bits of each byte indicate which part of the
sequence a byte belongs to. Table 1 shows how the bits in
a Unicode scalar value (or a valid surrogate pair) are distributed among the
bytes in the I8-sequence. The I8-sequence corresponding to a valid surrogate pair is
also shown, including the UTF-16 transformation that converts the valid pair to the
corresponding Unicode scalar value.
Table 1: I8-Sequence Bit Distribution
Unicode Scalar Value (hex) | Bit pattern of Unicode Scalar Value | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | 5th Byte
0 to 7F | 000000000xxxxxxx | 0xxxxxxx | - | - | - | -
80 to 9F | 00000000100xxxxx | 100xxxxx | - | - | - | -
A0 to 3FF | 000000yyyyyxxxxx | 110yyyyy | 101xxxxx | - | - | -
400 to 3FFF | 00zzzzyyyyyxxxxx | 1110zzzz | 101yyyyy | 101xxxxx | - | -
4000 to 3FFFF | 0wwwzzzzzyyyyyxxxxx | 11110www | 101zzzzz | 101yyyyy | 101xxxxx | -
40000 to 10FFFF | rwwwwwzzzzzyyyyyxxxxx | 1111100r | 101wwwww | 101zzzzz | 101yyyyy | 101xxxxx
Note: The UTF-8-Mod transformation is valid for UCS-4 values X'0' to
X'7FFFFFFF' (the full extent of the ISO/IEC 10646 coding space). Only the
Unicode scalar values up to the end of plane 16 (the reach of the UTF-16
transformation) are shown in the above table.
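The bit distribution in Table 1 can be sketched in C as follows. This is a minimal illustration rather than a reference implementation: the function name i8_encode is hypothetical, the input is assumed to already be a Unicode scalar value (surrogate code points are not checked), and only the range of Table 1 (up to X'10FFFF') is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one scalar value (0 .. 0x10FFFF) as an I8-sequence, following the
 * bit distribution of Table 1. Returns the number of bytes written to out
 * (1 to 5), or 0 if the value is beyond the range shown in Table 1.
 * The name i8_encode is hypothetical, not part of the report. */
size_t i8_encode(uint32_t scalar, unsigned char out[5])
{
    /* Lead-byte markers for 2..5 byte sequences: 110, 1110, 11110, 1111100 */
    static const unsigned char lead_mark[6] = {0, 0, 0xC0, 0xE0, 0xF0, 0xF8};
    size_t len, i;

    assert(out != NULL);
    if (scalar > 0x10FFFF) return 0;

    if (scalar < 0xA0) {              /* C0, G0, DEL and C1: one byte */
        out[0] = (unsigned char)scalar;
        return 1;
    }

    /* Always pick the shortest length whose range contains the value. */
    if (scalar <= 0x3FF)        len = 2;
    else if (scalar <= 0x3FFF)  len = 3;
    else if (scalar <= 0x3FFFF) len = 4;
    else                        len = 5;

    /* Trailing bytes are 101xxxxx: five payload bits each, filled from the
     * low-order end of the scalar value. */
    for (i = len - 1; i > 0; --i) {
        out[i] = (unsigned char)(0xA0 | (scalar & 0x1F));
        scalar >>= 5;
    }
    /* Whatever bits remain go into the lead byte. */
    out[0] = (unsigned char)(lead_mark[len] | scalar);
    return len;
}
```

For example, calling i8_encode(0xFEFF, buf) fills buf with the four bytes X'F1 BF B7 BF', and U+00A0 becomes the two bytes X'C5 A0'.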
A valid surrogate pair -- a high surrogate from the range X'D800' to X'DBFF'
followed by a low surrogate from the range X'DC00' to X'DFFF' -- must be
converted to its corresponding Unicode scalar value in the range X'10000' to
X'10FFFF', using the UTF-16 transformation. The following table shows the
correspondence between the bit patterns of the surrogate pairs and the
corresponding I8-sequence bytes.
Unicode Scalar Value (hex) | Bit pattern of valid Surrogate Pairs | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | 5th Byte
10000 to 3FFFF | 110110uuuuwzzzzz + 110111yyyyyxxxxx | 11110ppw (a) | 101zzzzz | 101yyyyy | 101xxxxx | -
40000 to 10FFFF | 110110uuuuwzzzzz + 110111yyyyyxxxxx | 1111100q (b) | 101ppppw (b) | 101zzzzz | 101yyyyy | 101xxxxx
where (a) uuuu = 000pp - 1, or (b) uuuu = qpppp - 1
(to account for the addition of X'10000' as in Section 3.7,
Surrogates, in the Unicode Standard 2.0)
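The UTF-16 transformation referred to above can be sketched as a pair of C helpers. The function names are hypothetical; the arithmetic is the standard UTF-16 relationship, with X'10000' added after combining the ten payload bits of each surrogate.

```c
#include <assert.h>
#include <stdint.h>

/* Combine a valid surrogate pair into its Unicode scalar value.
 * The function names in this sketch are hypothetical. */
uint32_t scalar_from_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);   /* high surrogate range */
    assert(low  >= 0xDC00 && low  <= 0xDFFF);   /* low surrogate range  */
    return 0x10000u + (((uint32_t)(high - 0xD800) << 10)
                       | (uint32_t)(low - 0xDC00));
}

/* Split a scalar value in the range 0x10000 .. 0x10FFFF back into a
 * surrogate pair. */
void surrogates_from_scalar(uint32_t scalar, uint16_t *high, uint16_t *low)
{
    assert(high != 0 && low != 0);
    assert(scalar >= 0x10000 && scalar <= 0x10FFFF);
    scalar -= 0x10000;                           /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (scalar >> 10));
    *low  = (uint16_t)(0xDC00 + (scalar & 0x3FF));
}
```

The resulting scalar value is then encoded as a 4-byte or 5-byte I8-sequence according to its range.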
When converting Unicode values to I8-sequences, always use the shortest
number of bytes that can represent these values. This preserves uniqueness
of encoding. For example the Unicode value <0000000000000001> is encoded
as <00000001>, not as <11000000> <10100001>. The latter is an
example of an unused I8-sequence. Do not make use of these unused byte
sequences for encoding any other information.
When converting from I8-sequences to Unicode scalar values, however,
implementations do not need to check that the shortest number of bytes is being
used, which simplifies the conversion algorithm.
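A decoder along these lines might look like the following sketch. The name i8_decode and its return convention are assumptions made for illustration. In keeping with the paragraph above, it does not reject sequences that are longer than necessary, and it handles only the lengths shown in Table 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the I8-sequence that starts at in[0], reading at most avail bytes.
 * On success stores the scalar value in *scalar and returns the number of
 * bytes consumed (1 to 5). Returns 0 for a stray trailing byte, a truncated
 * sequence, a lead byte beyond Table 1, or a value above 0x10FFFF.
 * Non-shortest forms are accepted. The name i8_decode is hypothetical. */
size_t i8_decode(const unsigned char *in, size_t avail, uint32_t *scalar)
{
    size_t len, i;
    uint32_t value;

    assert(in != NULL && scalar != NULL);
    if (avail == 0) return 0;

    if (in[0] < 0xA0) {                 /* single byte: U+0000 .. U+009F */
        *scalar = in[0];
        return 1;
    }
    if (in[0] < 0xC0) return 0;         /* trailing byte in lead position */
    else if (in[0] < 0xE0) { len = 2; value = in[0] & 0x1F; }  /* 110yyyyy */
    else if (in[0] < 0xF0) { len = 3; value = in[0] & 0x0F; }  /* 1110zzzz */
    else if (in[0] < 0xF8) { len = 4; value = in[0] & 0x07; }  /* 11110www */
    else if (in[0] < 0xFA) { len = 5; value = in[0] & 0x01; }  /* 1111100r */
    else return 0;                      /* lead bytes outside Table 1 */

    if (avail < len) return 0;
    for (i = 1; i < len; ++i) {
        if ((in[i] & 0xE0) != 0xA0) return 0;   /* must be 101xxxxx */
        value = (value << 5) | (uint32_t)(in[i] & 0x1F);
    }
    if (value > 0x10FFFF) return 0;
    *scalar = value;
    return len;
}
```

With this sketch, the unused two-byte form X'C0 A1' decodes to the value 1, the same result as the single byte X'01'. An application that must distinguish the two would need an additional shortest-form check.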
3.2 Characteristics of the I8-sequence
Some of the important characteristics of the I8-sequence are:
- Unicode characters from U+0000 to U+009F (ASCII repertoire, C0 and C1
controls) map to single-byte I8-sequence values X'00' to X'9F' (ASCII values
X'00' to X'7F' and ISO/IEC 4873 values X'80' to X'9F'). ASCII values or ISO/IEC
4873 control values do not otherwise occur in an I8-sequence. This
paves the way for transforming these into corresponding single-byte EBCDIC
controls and graphics in the second step of UTF-EBCDIC transform.
- The I8-sequence is reasonably compact in terms of number of bytes used for
encoding. It is very simple and efficient to convert to and from Unicode
text.
- The first byte indicates the number of bytes to follow in a multi-byte
sequence. This allows for efficient forward parsing. It is also efficient to
find the start of a character string from an arbitrary location in a byte
stream. You need to search at most five bytes (seven bytes, if the full
range of 31 bits of ISO/IEC 10646 is considered) backwards, and it is simple
to recognize an initial byte. For example, after converting a UTF-EBCDIC
byte back into its I8-sequence byte, in C:
    isInitialByte = ((byte & 0xE0) != 0xA0);
The search for initial or trailing bytes can also be done directly on UTF-EBCDIC
bytes by using a shadow vector (see Table 4,
described later).
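The backward search described above can be sketched as follows. The helper names are hypothetical, and the buffer is assumed to hold I8-sequence bytes (that is, UTF-EBCDIC bytes already mapped back through Table 3).

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if b is a single byte or a lead byte, zero if it is a trailing
 * byte (bit pattern 101xxxxx). Same test as the one-line example above. */
int i8_is_initial(unsigned char b)
{
    return (b & 0xE0) != 0xA0;
}

/* Given an arbitrary position pos inside a buffer of I8-sequence bytes,
 * return the index of the initial byte of the character that contains
 * buf[pos]. The function names here are hypothetical. */
size_t i8_char_start(const unsigned char *buf, size_t pos)
{
    assert(buf != NULL);
    while (pos > 0 && !i8_is_initial(buf[pos]))
        --pos;                      /* step back over trailing bytes */
    return pos;
}
```

In well-formed text the loop steps back over at most four trailing bytes for the sequences of Table 1, or six if the full 31-bit range is in use.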
3.3 Step 2: Byte Conversion
The second step of UTF-EBCDIC transforms the
I8-sequences, using a reversible one-to-one mapping, into the byte sequences of
UTF-EBCDIC.
The 64 control characters (U+0000 to U+001F, U+0080 to U+009F), the ASCII
DELETE character (U+007F), and the 95 ASCII graphic characters (including the SPACE
character) (U+0020 to U+007E) are mapped respecting EBCDIC conventions, as
defined in IBM Character Data Representation Architecture (CDRA), with one
exception: the pairings of the EBCDIC Line Feed and New Line control characters are
swapped from their CDRA default pairings with the ISO/IEC 6429 Line Feed
(U+000A) and Next Line (U+0085) control characters. This is in line with IBM
OS/390 UNIX Services (Open MVS) practice and preference, stemming from the
hard-coding of X'0A' as the New Line in most ASCII-C compilers.
The map preserves the invariance for a set of 82 graphic characters
(including SPACE) (known as the IBM Syntactic Graphic Character set), and
maintains consistency with the IBM MVS Open Systems Code page (CPGID 1047) for
the variant characters from within the ASCII repertoire.
The remaining 96 bytes of the EBCDIC 8-bit structure are allocated to X'A0' to
X'FF', the trailing bytes and leading bytes of the I8-sequence (from Table 1). The minimum criterion for the allocation of these
bytes is that it provides for a reversible map.
The trailing and leading bytes (X'A0' to X'FF' of the I8-sequence) are paired
with the unassigned UTF-EBCDIC bytes in increasing order. Table
2 and Table 3 show the byte maps between the
I8-sequence bytes and UTF-EBCDIC bytes in the forward and reverse directions
respectively. The resulting UTF-EBCDIC multi-byte sequences will be in the same
lexical (numerical) order as their corresponding Unicode scalar values (when the
sequences are zero-filled to equal number of bytes and compared with each
other). Please note that the UTF-EBCDIC single-byte values, however, will not
be in the same order as their corresponding Unicode scalar values.
The resulting UTF-EBCDIC byte sequence can be transparently processed in most
EBCDIC systems. It also retains all the characteristics of the I8-sequence
(see section 3.2, Characteristics of the I8-sequence, above). Since EBCDIC code page definitions have 13
variants (and only 82 invariants), the above byte map for the
graphic characters has been chosen to accommodate the MVS Open Systems environment
for standardization purposes.
Table 2: Byte map from I8-sequence to UTF-EBCDIC byte sequence
Rows: high nibble (hex); columns: low nibble (hex); all entries are in hex.
   | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -A | -B | -C | -D | -E | -F
0- | 00 | 01 | 02 | 03 | 37 | 2D | 2E | 2F | 16 | 05 | 15 | 0B | 0C | 0D | 0E | 0F
1- | 10 | 11 | 12 | 13 | 3C | 3D | 32 | 26 | 18 | 19 | 3F | 27 | 1C | 1D | 1E | 1F
2- | 40 | 5A | 7F | 7B | 5B | 6C | 50 | 7D | 4D | 5D | 5C | 4E | 6B | 60 | 4B | 61
3- | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | 7A | 5E | 4C | 7E | 6E | 6F
4- | 7C | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | D1 | D2 | D3 | D4 | D5 | D6
5- | D7 | D8 | D9 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | AD | E0 | BD | 5F | 6D
6- | 79 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 91 | 92 | 93 | 94 | 95 | 96
7- | 97 | 98 | 99 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | C0 | 4F | D0 | A1 | 07
8- | 20 | 21 | 22 | 23 | 24 | 25 | 06 | 17 | 28 | 29 | 2A | 2B | 2C | 09 | 0A | 1B
9- | 30 | 31 | 1A | 33 | 34 | 35 | 36 | 08 | 38 | 39 | 3A | 3B | 04 | 14 | 3E | FF
A- | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 4A | 51 | 52 | 53 | 54 | 55 | 56
B- | 57 | 58 | 59 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 6A | 70 | 71 | 72 | 73
C- | 74* | 75* | 76* | 77* | 78* | 80 | 8A | 8B | 8C | 8D | 8E | 8F | 90 | 9A | 9B | 9C
D- | 9D | 9E | 9F | A0 | AA | AB | AC | AE | AF | B0 | B1 | B2 | B3 | B4 | B5 | B6
E- | B7* | B8 | B9 | BA | BB | BC | BE | BF | CA | CB | CC | CD | CE | CF | DA | DB
F- | DC | DD | DE | DF | E1 | EA | EB | EC | ED | EE | EF | FA | FB | FC | FD | FE
Note: I8-sequence bytes C0 ... C4, and E0, and the corresponding UTF-EBCDIC
bytes 74 ... 78, and B7, will not be used with the shortest number of bytes in
the transformed byte sequences. The corresponding entries are marked with an
asterisk in the above table.
Table 3: Byte map from UTF-EBCDIC byte-sequence to I8-sequence
Rows: high nibble (hex); columns: low nibble (hex); all entries are in hex.
   | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -A | -B | -C | -D | -E | -F
0- | 00 | 01 | 02 | 03 | 9C | 09 | 86 | 7F | 97 | 8D | 8E | 0B | 0C | 0D | 0E | 0F
1- | 10 | 11 | 12 | 13 | 9D | 0A | 08 | 87 | 18 | 19 | 92 | 8F | 1C | 1D | 1E | 1F
2- | 80 | 81 | 82 | 83 | 84 | 85 | 17 | 1B | 88 | 89 | 8A | 8B | 8C | 05 | 06 | 07
3- | 90 | 91 | 16 | 93 | 94 | 95 | 96 | 04 | 98 | 99 | 9A | 9B | 14 | 15 | 9E | 1A
4- | 20 | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | 2E | 3C | 28 | 2B | 7C
5- | 26 | AA | AB | AC | AD | AE | AF | B0 | B1 | B2 | 21 | 24 | 2A | 29 | 3B | 5E
6- | 2D | 2F | B3 | B4 | B5 | B6 | B7 | B8 | B9 | BA | BB | 2C | 25 | 5F | 3E | 3F
7- | BC | BD | BE | BF | C0* | C1* | C2* | C3* | C4* | 60 | 3A | 23 | 40 | 27 | 3D | 22
8- | C5 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | C6 | C7 | C8 | C9 | CA | CB
9- | CC | 6A | 6B | 6C | 6D | 6E | 6F | 70 | 71 | 72 | CD | CE | CF | D0 | D1 | D2
A- | D3 | 7E | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 7A | D4 | D5 | D6 | 5B | D7 | D8
B- | D9 | DA | DB | DC | DD | DE | DF | E0* | E1 | E2 | E3 | E4 | E5 | 5D | E6 | E7
C- | 7B | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | E8 | E9 | EA | EB | EC | ED
D- | 7D | 4A | 4B | 4C | 4D | 4E | 4F | 50 | 51 | 52 | EE | EF | F0 | F1 | F2 | F3
E- | 5C | F4 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 5A | F5 | F6 | F7 | F8 | F9 | FA
F- | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | FB | FC | FD | FE | FF | 9F
Note: I8-sequence bytes C0 ... C4, and E0, and the corresponding UTF-EBCDIC
bytes 74 ... 78, and B7, will not be used with the shortest number of bytes in
the transformed byte sequences. The corresponding entries are marked with an
asterisk in the above table.
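As an illustration of how the two byte maps might be applied, the following C sketch holds the forward map as a 256-entry lookup vector, transcribed from Table 2, and derives the reverse map from it. The identifier names are hypothetical. Deriving the reverse vector from the forward one is a design choice of this sketch and not something the report prescribes.

```c
#include <assert.h>
#include <stddef.h>

/* Forward byte map, transcribed from Table 2 (index: I8-sequence byte).
 * All identifier names in this sketch are hypothetical. */
static const unsigned char I8_TO_EBCDIC[256] = {
/* 0- */ 0x00,0x01,0x02,0x03,0x37,0x2D,0x2E,0x2F,0x16,0x05,0x15,0x0B,0x0C,0x0D,0x0E,0x0F,
/* 1- */ 0x10,0x11,0x12,0x13,0x3C,0x3D,0x32,0x26,0x18,0x19,0x3F,0x27,0x1C,0x1D,0x1E,0x1F,
/* 2- */ 0x40,0x5A,0x7F,0x7B,0x5B,0x6C,0x50,0x7D,0x4D,0x5D,0x5C,0x4E,0x6B,0x60,0x4B,0x61,
/* 3- */ 0xF0,0xF1,0xF2,0xF3,0xF4,0xF5,0xF6,0xF7,0xF8,0xF9,0x7A,0x5E,0x4C,0x7E,0x6E,0x6F,
/* 4- */ 0x7C,0xC1,0xC2,0xC3,0xC4,0xC5,0xC6,0xC7,0xC8,0xC9,0xD1,0xD2,0xD3,0xD4,0xD5,0xD6,
/* 5- */ 0xD7,0xD8,0xD9,0xE2,0xE3,0xE4,0xE5,0xE6,0xE7,0xE8,0xE9,0xAD,0xE0,0xBD,0x5F,0x6D,
/* 6- */ 0x79,0x81,0x82,0x83,0x84,0x85,0x86,0x87,0x88,0x89,0x91,0x92,0x93,0x94,0x95,0x96,
/* 7- */ 0x97,0x98,0x99,0xA2,0xA3,0xA4,0xA5,0xA6,0xA7,0xA8,0xA9,0xC0,0x4F,0xD0,0xA1,0x07,
/* 8- */ 0x20,0x21,0x22,0x23,0x24,0x25,0x06,0x17,0x28,0x29,0x2A,0x2B,0x2C,0x09,0x0A,0x1B,
/* 9- */ 0x30,0x31,0x1A,0x33,0x34,0x35,0x36,0x08,0x38,0x39,0x3A,0x3B,0x04,0x14,0x3E,0xFF,
/* A- */ 0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48,0x49,0x4A,0x51,0x52,0x53,0x54,0x55,0x56,
/* B- */ 0x57,0x58,0x59,0x62,0x63,0x64,0x65,0x66,0x67,0x68,0x69,0x6A,0x70,0x71,0x72,0x73,
/* C- */ 0x74,0x75,0x76,0x77,0x78,0x80,0x8A,0x8B,0x8C,0x8D,0x8E,0x8F,0x90,0x9A,0x9B,0x9C,
/* D- */ 0x9D,0x9E,0x9F,0xA0,0xAA,0xAB,0xAC,0xAE,0xAF,0xB0,0xB1,0xB2,0xB3,0xB4,0xB5,0xB6,
/* E- */ 0xB7,0xB8,0xB9,0xBA,0xBB,0xBC,0xBE,0xBF,0xCA,0xCB,0xCC,0xCD,0xCE,0xCF,0xDA,0xDB,
/* F- */ 0xDC,0xDD,0xDE,0xDF,0xE1,0xEA,0xEB,0xEC,0xED,0xEE,0xEF,0xFA,0xFB,0xFC,0xFD,0xFE
};

/* Step 2, applied in place to n I8-sequence bytes. */
void i8_to_utf_ebcdic(unsigned char *buf, size_t n)
{
    size_t i;
    assert(buf != NULL || n == 0);
    for (i = 0; i < n; ++i)
        buf[i] = I8_TO_EBCDIC[buf[i]];
}

/* Fill reverse[] (index: UTF-EBCDIC byte) so that it matches Table 3.
 * Because the forward map is one-to-one, every entry is written once. */
void build_ebcdic_to_i8(unsigned char reverse[256])
{
    int b;
    assert(reverse != NULL);
    for (b = 0; b < 256; ++b)
        reverse[I8_TO_EBCDIC[b]] = (unsigned char)b;
}
```

Applied to the I8-sequence X'F1 BF B7 BF' (U+FEFF), i8_to_utf_ebcdic yields X'DD 73 66 73'.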
3.4 Shadow Flags
A shadow flags table, shown in Table 4, helps determine whether a byte in a
UTF-EBCDIC sequence is a leading byte or a trailing byte, and how many bytes
are in the sequence corresponding to a Unicode character. The table contains
the category of each byte, so there is no need to look at the byte's bit
combination (after converting it into its corresponding I8-sequence byte) or to
check the I8-sequence byte against the known ranges of leading or trailing
bytes. Bytes with a value of '0' in the category table are control characters,
'1' are single-byte graphic characters, '9' are trailing bytes, and '2' ... '7'
are lead bytes whose value gives the number of bytes in the sequence. Even though Table 1 shows I8-sequences of only up to 5 bytes (to
transform up to plane 16), the I8-sequence can contain up to 7 bytes to address
all of the UCS-4 space (31 bits) in the ISO/IEC 10646 standard (see Table B.2 in Annex B).
Table 4: Shadow flags associated with UTF-EBCDIC bytes
LEGEND:
0 = Single-octet control characters
1 = Single-octet invariant and variant graphic characters from ASCII
2 = Lead octet of a 2-octet string
3 = Lead octet of a 3-octet string
4 = Lead octet of a 4-octet string
5 = Lead octet of a 5-octet string
6 = Lead octet of a 6-octet string
7 = Lead octet of a 7-octet string
9 = A trailing octet of a multi-octet string
Rows: high nibble (hex); columns: low nibble (hex).
   | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -A | -B | -C | -D | -E | -F
0- | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1- | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2- | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
3- | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4- | 1 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 1 | 1 | 1 | 1 | 1
5- | 1 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 1 | 1 | 1 | 1 | 1 | 1
6- | 1 | 1 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 1 | 1 | 1 | 1 | 1
7- | 9 | 9 | 9 | 9 | 2* | 2* | 2* | 2* | 2* | 1 | 1 | 1 | 1 | 1 | 1 | 1
8- | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2
9- | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2
A- | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 2 | 2
B- | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3* | 3 | 3 | 3 | 3 | 3 | 1 | 3 | 3
C- | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 3 | 3 | 3 | 3
D- | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 4 | 4 | 4 | 4
E- | 1 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 4 | 4 | 5 | 5 | 5
F- | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 6 | 6 | 7 | 7 | 0
Note: I8-sequence bytes C0 ... C4, and E0, and the corresponding UTF-EBCDIC
bytes 74 ... 78, and B7, will not be used with the shortest number of bytes in
the transformed byte sequences. The corresponding shadow flag values are
marked with an asterisk in the above table.
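The entries of Table 4 follow from the I8-sequence byte ranges, so a shadow vector could also be generated rather than typed in. The sketch below assumes the caller supplies the reverse byte map of Table 3 as a 256-entry array. The function and parameter names are hypothetical.

```c
#include <assert.h>

/* Category of an I8-sequence byte, using the legend of Table 4.
 * The function names in this sketch are hypothetical. */
int i8_shadow_flag(unsigned char b)
{
    if (b < 0x20 || b == 0x7F) return 0;   /* C0 controls and DEL        */
    if (b < 0x7F) return 1;                /* ASCII graphic characters   */
    if (b < 0xA0) return 0;                /* C1 controls                */
    if (b < 0xC0) return 9;                /* trailing byte, 101xxxxx    */
    if (b < 0xE0) return 2;                /* lead byte, 2-byte sequence */
    if (b < 0xF0) return 3;                /* lead byte, 3-byte sequence */
    if (b < 0xF8) return 4;                /* lead byte, 4-byte sequence */
    if (b < 0xFC) return 5;                /* lead byte, 5-byte sequence */
    if (b < 0xFE) return 6;                /* lead byte, 6-byte sequence */
    return 7;                              /* lead byte, 7-byte sequence */
}

/* Build the shadow vector indexed by UTF-EBCDIC byte. ebcdic_to_i8 is
 * assumed to hold the reverse byte map of Table 3. */
void build_shadow(const unsigned char ebcdic_to_i8[256],
                  unsigned char shadow[256])
{
    int e;
    assert(ebcdic_to_i8 != 0 && shadow != 0);
    for (e = 0; e < 256; ++e)
        shadow[e] = (unsigned char)i8_shadow_flag(ebcdic_to_i8[e]);
}
```

For instance, UTF-EBCDIC X'80' maps back to I8 X'C5', a 2-byte lead, which agrees with the flag 2 at row 8-, column -0 of Table 4.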
3.5 Signature
The signature character U+FEFF (zero width no-break space) of Unicode
transforms into the I8-byte sequence X'F1 BF B7 BF' which maps to X'DD 73 66 73'
in UTF-EBCDIC. When this sequence is displayed (erroneously) using different
single-byte EBCDIC code pages, it can be visualized as different character
strings. In Latin-1 EBCDIC code page 1047 (and coincidentally also in Latin-1
code pages 500 and 37), this byte sequence appears as "ùËÃË" (small letter u with grave, capital letter
E with diaeresis, capital letter A with tilde, capital letter E with diaeresis). It can appear differently with other
single-byte EBCDIC code pages. As with UTF-8, the byte-swapped ("little-endian")
serialized Unicode byte strings must be converted to their "big-endian"
equivalents before applying the UTF-EBCDIC transformation.
3.6 Where to Use UTF-EBCDIC?
UTF-EBCDIC is intended to be used inside EBCDIC systems or in closed networks
where there is a dependency on EBCDIC hard-coding assumptions. It is not meant
to be used for open interchange among heterogeneous platforms using different
data encodings. Due to specific requirements for ASCII encoding for line endings
in some Internet protocols, UTF-EBCDIC is unsuitable for use over the Internet
using such protocols. UTF-8 or UTF-16 forms should be used in open interchange.
4 Bibliography
- The Unicode Standard Version 2.0: The Unicode Consortium ISBN
0-201-48345-9, Addison Wesley Developers Press, July 1996.
- CDRA: IBM - Character Data Representation Architecture - Reference and
Registry, SC09-2190-00, December 1996.
- ISO/IEC 10646-1: 1993(E): Information Processing - Universal Coded
Character Set (UCS): Part 1, Basic Multilingual Plane
- Amendment 1 to ISO/IEC 10646-1: Transformation Format for 16 Planes of
Group 00 (UTF-16); 1996
- Amendment 2 to ISO/IEC 10646-1: Transformation Format 8 (UTF-8)
- ISO/IEC 646: Information Processing - 7-Bit Coded Character Set for
Information Interchange
- ASCII - ANSI Standard X3.4; also the International Reference Version of
ISO/IEC 646 - 1993
- ISO/IEC 4873: Information Processing - 8-Bit Code for Information
Interchange - Structure and Rules for implementation
- ISO/IEC 6429: Information Processing - 7-Bit and 8-Bit Coded Character
Sets - Control Functions for Coded Character Sets
- ISO/IEC 8859-xx: Information Processing - 8-Bit Single-Byte Coded Graphic
Character Sets (several parts)
- SHARE Report SSD No. 366: ASCII and EBCDIC Character Set and Code Issues
in Systems Application Architecture, The ASCII/EBCDIC Character Set Task
Force. Edited by Edwin Hart, The Johns Hopkins University, Applied Physics
Laboratory, Laurel, Maryland, USA; published by Share Inc., 111 East Wacker
Drive, Chicago, Illinois, USA 60601; June 1989
5 Annex A: Intellectual Property Related
Transcript of Letter
regarding Disclosure of IBM Technology - EF-UTF
(Hard copy is on file with the Chair of UTC and the Chair of NCITS/L2)
Transcribed on 1998-07-11
IBM LOGO
International Business Machines Corporation Route 100
Somers, NY 10589
June 2, 1998
The Chair, Unicode Technical Committee
Subject: Disclosure of IBM Technology - EBCDIC-Friendly UCS Transformation
Format (EF-UTF)
The attached document entitled "EBCDIC-Friendly UCS Transformation
Format (EF-UTF)" contains IBM technology that has been filed for
application for Canadian Patent. However, IBM believes that the technology could
be beneficial to the EBCDIC community at large; allowing the community to derive
the enormous benefits provided by UCS (ISO/IEC 10646 and Unicode).
This letter is to inform you that IBM is pleased to make the attached
documentation, and the associated technology that has been filed for patent,
freely available to anyone concerned towards making the transformation format as
part of the UCS standards.
Sincerely
SIGNED
Elizabeth G. Nichols
Director of National Language Support
and Information Development
EGN:ghs
Attachment
(Note: The term EF-UTF has been changed to UTF-EBCDIC at the suggestion of
UTC meeting 78 -- V.S. Umamaheswaran)
6 Annex B: Additional Information
6.1 Controls, Variants, and Invariants in
EBCDIC
The positions assigned to the 65 control characters, the 82 invariant graphic
characters (including SPACE) and the 13 variant graphic characters among the various
EBCDIC code pages in use are shown in the following table.
Table B.1: Positions of controls, variants and invariants in EBCDIC
Rows: high nibble (hex); columns: low nibble (hex).
   | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -A | -B | -C | -D | -E | -F
0- | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc
1- | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc
2- | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc
3- | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc
4- | ii | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | ii | ii | ii | ii | vv
5- | ii | -- | -- | -- | -- | -- | -- | -- | -- | -- | vv | vv | ii | ii | ii | vv
6- | ii | ii | -- | -- | -- | -- | -- | -- | -- | -- | -- | ii | ii | ii | ii | ii
7- | -- | -- | -- | -- | -- | -- | -- | -- | -- | vv | ii | vv | vv | ii | ii | ii
8- | -- | ii | ii | ii | ii | ii | ii | ii | ii | ii | -- | -- | -- | -- | -- | --
9- | -- | ii | ii | ii | ii | ii | ii | ii | ii | ii | -- | -- | -- | -- | -- | --
A- | -- | vv | ii | ii | ii | ii | ii | ii | ii | ii | -- | -- | -- | vv | -- | --
B- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | vv | -- | --
C- | vv | ii | ii | ii | ii | ii | ii | ii | ii | ii | -- | -- | -- | -- | -- | --
D- | vv | ii | ii | ii | ii | ii | ii | ii | ii | ii | -- | -- | -- | -- | -- | --
E- | vv | -- | ii | ii | ii | ii | ii | ii | ii | ii | -- | -- | -- | -- | -- | --
F- | ii | ii | ii | ii | ii | ii | ii | ii | ii | ii | -- | -- | -- | -- | -- | cc
cc = EBCDIC control character positions; ii = EBCDIC invariants from ASCII
repertoire; vv = EBCDIC variants from ASCII repertoire; -- = none of these
6.2 A comparison of UTF-EBCDIC and UTF-8
UTF-EBCDIC is a byte-mapped version of the I8-sequence. The bit patterns of UTF-EBCDIC
bytes and UTF-8 bytes are therefore different, and a direct comparison of them
is not very meaningful. However, the I8-sequence and the UTF-8 sequence can be
compared to understand the salient differences between the two. UTF-8-Mod, being
derived from UTF-8, retains all of its salient features. A comparative summary of
the basic characteristics of the I8-sequence and the UTF-8 sequence is shown in Table B.2 below. Note that this table shows the entire
31-bit UCS-4 range in the transformation, whereas Table 1
includes only the BMP and up to plane 16 using surrogate pairs.
Table B.2: Comparison of I8-Sequence with UTF-8 Generated Byte Sequence
No. of bytes in transformed sequence | I8-sequence: Scalar Values (hex) | UTF-8-sequence: Scalar Values (hex) | Remarks
1 | 00 to 9F | 00 to 7F | C0, G0 and C1 in I8-sequence; C0 and G0 in UTF-8
2 | A0 to 3FF | 80 to 7FF | -
3 | 400 to 3FFF | 800 to FFFF | To end of first quarter of BMP in I8-sequence; to end of BMP in UTF-8
4 | 4000 to 3 FFFF | 1 0000 to 1F FFFF | To end of plane 3 in I8-sequence; to end of plane 31 in UTF-8
5 | 4 0000 to 3F FFFF | 20 0000 to 3FF FFFF | To end of plane 63 in I8-sequence
6 | 40 0000 to 3FF FFFF | 400 0000 to 7FFF FFFF | -
7 | 400 0000 to 7FFF FFFF | Not used | To end of UCS in I8-sequence

Trailing Bytes | I8-sequence | UTF-8-sequence | Remarks
(all lengths) | 32 values, X'A0' to X'BF'; B'101vvvvv'; 5 v-bits per byte | 64 values, X'80' to X'BF'; B'10vvvvvv'; 6 v-bits per byte | I8-sequence has only five information bits per trailing byte, compared to six in UTF-8

Lead Bytes | I8-sequence (hex) | UTF-8-sequence (hex) | Remarks
2-Byte sequence | C0 to DF | C0 to DF | Same in both
3-Byte sequence | E0 to EF | E0 to EF | Same in both
4-Byte sequence | F0 to F7 | F0 to F7 | Same in both
5-Byte sequence | F8 to FB | F8 to FB | Same in both
6-Byte sequence | FC and FD | FC and FD | Same in both
7-Byte sequence | FE and FF | Not used | Only used in UTF-8-Mod
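The upper limits in the I8-sequence column of Table B.2 follow from counting payload bits: each trailing byte carries five bits, and the lead byte carries whatever its marker leaves free. The sketch below reproduces that arithmetic; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest scalar value that an I8-sequence of the given length (1 to 7)
 * can represent. The name i8_max_scalar is hypothetical. */
uint32_t i8_max_scalar(int length)
{
    /* Payload bits available in the lead byte for lengths 2..7:
     * 110yyyyy, 1110zzzz, 11110www, 111110ww, 1111110w, 1111111w */
    static const int lead_bits[8] = {0, 0, 5, 4, 3, 2, 1, 1};
    int bits;

    assert(length >= 1 && length <= 7);
    if (length == 1) return 0x9F;       /* single bytes cover 00 to 9F */

    bits = lead_bits[length] + 5 * (length - 1);   /* at most 31 bits */
    return ((uint32_t)1 << bits) - 1;
}
```

A 4-byte sequence, for example, has 3 + 15 = 18 payload bits, giving the limit X'3FFFF' (end of plane 3), and a 7-byte sequence has 1 + 30 = 31 bits, giving X'7FFFFFFF'.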
6.3 FEFF, FFFE, and FFFF in UTF-EBCDIC
U+FFFE and U+FFFF are not used for character allocation in any plane of Unicode.
U+FEFF (zero width no-break space) is used as a signature for Unicode,
for both UCS-2 and UTF-16 forms. U+FFFE may strongly suggest a byte-reversed
Unicode string. U+FFFF is used to represent a numeric value that is guaranteed
not to be a character, for uses such as the final value at the end of an index.
UTF-8 also avoids the use of X'FF' and X'FE' as octets in its sequences. In
I8-sequence, however, X'FE' and X'FF' may appear. The following paragraphs
expand on which combinations of X'FF' and X'FE' may occur in an I8-sequence or
UTF-EBCDIC sequence.
- X'FE' X'FF', X'FF' X'FE' and X'FF' X'FF' in the I8-sequence
X'FE' and X'FF' are lead octets of a seven-byte I8-sequence (assuming values
from all the planes of UCS-4). In a properly formed I8-sequence, each of them is
preceded and followed by a value less than X'C0'. None of the sequences X'FF' X'FF', X'FE'
X'FF', and X'FF' X'FE' can therefore appear in a well-formed I8-sequence.
- X'FE' X'FF', X'FF'X'FE' and X'FF' X'FF' in the UTF-EBCDIC sequence
The I8-sequence to UTF-EBCDIC byte mappings are: X'FE' to X'FD', and X'FF' to
X'FE' (see Table 2). The values X'FE' and X'FF' can be
generated in a UTF-EBCDIC byte sequence from I8-sequence values by mapping
X'FF' to X'FE' and X'9F' to X'FF' (from Table 2).
X'FF' is the lead byte of a seven-byte I8 sequence and must be followed by six
trailing bytes in the range X'A0' to X'BF', which does not include X'9F'. So the
X'FE' X'FF' sequence cannot appear in UTF-EBCDIC.
The X'9F' is assigned to the control character -- Application Program Command
(APC) -- in ISO-8 C1. According to ISO/IEC 6429, the APC is followed by a
parameter string using bit combinations from 0/8 to 0/13 (X'08' to X'0D') and
2/0 to 7/14 (X'20' to X'7E') and terminated by the control function String
Terminator (ST) (coded at X'9C' in C1). Therefore, the sequence X'FF' X'FF', the
equivalent of two APC controls without intervening parameters or ST-s, also
should not appear in UTF-EBCDIC sequence. None of the valid parameter bit
combinations can generate a 7-byte I8 sequence that starts with X'FF'. So the
sequence X'FF' X'FE' also cannot appear in a UTF-EBCDIC sequence.
6.4 Normalization to Fixed Width
Dealing with a variable number of bytes may not be possible or desirable in some
processing situations (even though proper handling of Unicode text strings will
require the ability to correctly deal with combining sequences). Normalization
into a form with a fixed number of bits is needed for such cases. It would
always be desirable to revert to the original 16-bit form or the corresponding
32-bit form as a normalization to fixed-width data.
However, this would be possible only if processing is tolerant to native
Unicode encoding. If transparency to EBCDIC invariance and controls is needed
also in the normalized form, then Unicode cannot be directly used for
normalization. It can be seen from Table 1 that the last
code position in the BMP (U+FFFF) requires four bytes in the I8-sequence and in
the corresponding UTF-EBCDIC sequence. A 32-bit integer can be used for
normalization of up to four-byte UTF-EBCDIC sequences.
The maximum Unicode scalar value that a four-byte I8-sequence or UTF-EBCDIC
sequence can represent is:
<11110111 10111111 10111111 10111111> (X'3FFFF')
corresponding to the end of plane 3 in group 0. Using UTF-16 to represent
planes 1 to 16, the surrogate characters in the BMP can be used. By treating the
surrogate characters as any other BMP characters, up to plane 16 can be encoded
using the 16-bit form, and hence can be contained within the 32-bit normalized
form of UTF-EBCDIC. Care has to be taken to correctly process the
UTF-EBCDIC sequences corresponding to the surrogate pairs, similar to dealing
with combining sequences. When it is desirable to convert valid surrogate
pairs into corresponding Unicode scalar value and then apply UTF-EBCDIC, only up
to plane 3 can be contained within the 32-bit normalized value. For all values
beyond group 0, plane 3 of UCS, the UTF-EBCDIC will contain more than four
octets. The normalization for these cases will need 64 bits (assuming nothing
between 32 and 64 bits is practical).
6.5 Mapping of Bytes in Step 2
The control code position mappings used in default Unicode to EBCDIC code page
mappings follow the pairings between the ISO/IEC 6429 C0, DEL and C1 sets and
the EBCDIC controls defined as the default in IBM Character Data Representation
Architecture, customized to the practice of OS/390 UNIX Services (MVS Open).
These pairings may not suit all EBCDIC environments. A well-known problem,
mentioned earlier, is that of mapping EBCDIC New Line to Next Line in C1 of
ISO/IEC 6429 versus Line Feed in C0. Similarly, it is known that the 13 variant
characters are different among the various single-byte EBCDIC code pages. The
well-known impact of this is exemplified by the different code positions of the
Square Bracket characters. Even the lowercase a to z is variant in the EBCDIC
Katakana code page. A judicious one-to-one reversible byte map that converts only
those code points whose category is marked as '0' or '1' may be employed as a step
3. Such a step 3 is not considered to be part of the UTF-EBCDIC transformation
defined in this technical report, and is considered a customization to suit
individual environments.
Similarly the pairing of I8-sequence bytes and UTF-EBCDIC sequence bytes
could be done in multiple ways. The simplest requirement on this byte-pairing is
that it should be unique and reversible. The pairing adopted in this version of
the UTR is based on the request from Oracle Corporation's representative Mr.
Jianping Yang -- to be able to maintain the order of the UTF-EBCDIC multi-byte
sequences the same as the order of the corresponding Unicode scalar values.
6.6 Ordering of UTF-EBCDIC Sequences
The mapping of the I8-bytes to UTF-EBCDIC bytes allows the multi-byte UTF-EBCDIC
sequences (corresponding to a Unicode character each) to be in the same order as
their corresponding Unicode scalar values. The ordering of the trailing bytes
and the leading bytes in the UTF-EBCDIC sequence (from Table 4) is:
trailing bytes << lead bytes of 2-byte sequences << ... << lead bytes of 7-byte sequences
The byte values within each set are ordered in increasing order. Note that
the UTF-EBCDIC single-bytes do not have this property - either among themselves
or between themselves and the bytes of the multi-byte UTF-EBCDIC sequences. The
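single-bytes are ordered according to their CP1047 order. A
"binary comparison" of the text would therefore look like:
    #include <stddef.h>

    /* shadow is the 256-entry shadow flag vector of Table 4, indexed by
       UTF-EBCDIC byte (the function name is illustrative) */
    int compare_utf_ebcdic(const unsigned char *source1,
                           const unsigned char *source2,
                           size_t n, const unsigned char shadow[256])
    {
        size_t i;
        for (i = 0; i < n; ++i) {
            unsigned char byte1 = source1[i];
            unsigned char byte2 = source2[i];
            if (byte1 == byte2) continue;            // fast path
            // check for the single bytes vs multibytes
            // (flags 0 and 1 are single bytes; 2 and above are multi-byte)
            if (shadow[byte1] < 2) {
                if (shadow[byte2] >= 2) return -1;   // single bytes less than multi
            } else {
                if (shadow[byte2] < 2) return 1;     // multibyte greater than single
            }
            // now the shadows are of the same type, so just compare the bytes
            if (byte1 < byte2) return -1;
            return 1;
        }
        return 0;
    }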
The resulting order is a mix of EBCDIC CP1047 order for the single bytes and
Unicode order for the multi-byte UTF-EBCDIC characters.
However, if the desired order is to be the same order as Unicode scalar
values for all the characters, both the single-byte and the multi-byte
characters, the intermediate I8-sequence bytes should be compared. This approach
also makes the comparison immune to any local customization of the mapping (see section 6.5, Mapping of Bytes in Step 2) and provides a consistent
Unicode value order. The following is a sample of the comparison code.
    #include <stddef.h>

    /* ebtoi8 is the reverse mapping vector from UTF-EBCDIC to I8 bytes
       (Table 3); the function name is illustrative */
    int compare_as_i8(const unsigned char *source1,
                      const unsigned char *source2,
                      size_t n, const unsigned char ebtoi8[256])
    {
        size_t i;
        for (i = 0; i < n; ++i) {
            unsigned char byte1 = source1[i];
            unsigned char byte2 = source2[i];
            if (byte1 == byte2) continue;   // fast path
            // compare the I8-sequence counterparts, taking advantage of
            // I8-sequence bytes being similar to UTF-8 bytes in preserving
            // the same order as the Unicode scalar values
            if (ebtoi8[byte1] < ebtoi8[byte2]) return -1;
            return 1;
        }
        return 0;
    }
If the desire is to preserve the EBCDIC order for the single-bytes (the ASCII
repertoire) or the traditional order of the multi-byte sequences (such as for
EBCDIC-Japanese, EBCDIC-Cyrillic, EBCDIC-Arabic, etc.), localization resources
such as a weight lookup table in locales should be employed.
7 Acknowledgments
The UTF-EBCDIC transformation was originally created and developed in the
National Language Technical Centre in IBM Toronto Laboratory by Messrs. Baldev
Soor, Alexis Cheng, Rick Pond, Ibrahim Meru and V.S. (Uma) Umamaheswaran. The
original version has been modified based on review feedback on the previous
versions of this Unicode Technical Report.
8 Revisions
This is the seventh revision of this technical report.
It corrects an error in the section 3.5 Signature, to read as "...
this byte sequence appears as "ùËÃË"
(small letter u with grave, capital letter E with diaeresis, capital letter A
with tilde, capital letter E with diaeresis )".
(Thanks to Robert Rosenberg -- Bob.Rosenberg@digitscorp.com for reporting this
error.)
Copyright © 1999, 2000 Unicode, Inc. All Rights Reserved. The Unicode
Consortium makes no expressed or implied warranty of any kind, and assumes no
liability for errors or omissions. No liability is assumed for incidental and
consequential damages in connection with or arising out of the use of the
information or programs contained or accompanying this technical report.