DRAFT Unicode Technical Report #16
UTF-EBCDIC
EBCDIC-Friendly UCS Transformation Format
This report presents the specification of UTF-EBCDIC, the EBCDIC-Friendly
UCS Transformation Format.
Status of this document
This draft is published for public review. A previous version of this
document has been considered by the Unicode Technical Committee, and it
has had preliminary approval as a Draft Unicode Technical Report. The Unicode
Technical Committee may approve, reject, or further amend this document
before it becomes an approved Unicode Technical Report. This document does
not, at this time, imply any endorsement by the Consortium's staff or member
organizations. Please mail comments to unicore@unicode.org.
Scope
The term UTF-EBCDIC stands for EBCDIC-friendly UCS Transformation Format.
EBCDIC, IBM’s Extended Binary Coded Decimal Interchange Code, is one of
the widely used 8-bit industry encodings. Detailed information on EBCDIC
can be found in the IBM publication IBM Character Data Representation
Architecture, Reference and Registry, SC09-2190-00, December 1996.
To address the use of Unicode character data in byte-oriented ASCII-based
systems, the Unicode Standard (see section A.2 of the Unicode Standard;
also ISO/IEC 10646-1, Amendment 2) has defined UTF-8. Use
of UTF-8 permits existing ASCII-based systems that have a hard-coded dependency
on the encoding of the ASCII repertoire of characters to safely process
the corresponding Unicode characters. There is a similar requirement to
transform Unicode characters to a form that is safe for EBCDIC systems
for the control characters and invariant characters.
This Technical Report defines the EBCDIC-friendly UCS transformation
format, UTF-EBCDIC.
Neither UTF-EBCDIC nor its intermediate form, called UTF-8M in this technical
report, is intended to be used in open interchange environments. UTF-EBCDIC is
useful in homogeneous EBCDIC systems and networks.
Description
The UTF-EBCDIC encoding is derived from the Unicode values in a
two-step process:
-
Conversion of the Unicode values to a variable-length byte sequence called an
I8-sequence (intermediate 8-bit sequence) by applying a modified UTF-8
transformation (UTF-8M), which preserves 65 control characters as single
bytes.
-
Conversion of the bytes in the I8-sequence to the UTF-EBCDIC byte
sequence by a reversible single-byte to single-byte mapping.
These two steps are defined below.
Note: The following notation is used in this Technical Report: X'nn
.. mm' represents hexadecimal values; <bb…bb> represents values in bit
notation; U+abcd represents a Unicode character.
Definition
Step 1: UTF-8M
The UTF-8M transformation definition is modeled after the UTF-8 definition
in the Unicode standard. UTF-8M transforms the Unicode values into I8-sequences.
The Unicode characters U+0000 to U+001F (corresponding to the C0 control
characters X'00' to X'1F' of ASCII), U+0020 to U+007E (the ASCII repertoire),
and U+007F (the ASCII ‘DEL’ control character) are represented as single
bytes in the I8-sequence. In addition, U+0080 to U+009F (corresponding
to the so-called C1 set of controls in ISO/IEC 6429) are also represented
as single bytes (X'80' to X'9F'). Thus the 65 Unicode characters corresponding
to the 65 ISO/IEC 6429 control characters and the 95 characters corresponding
to the 95 ASCII graphic characters (the G0 set) are represented in the
I8-sequence as single bytes.
When these are converted to the UTF-EBCDIC form, the corresponding 65
EBCDIC control characters and 95 EBCDIC graphic characters are preserved
as single bytes in the UTF-EBCDIC byte sequence. The 95 EBCDIC graphic
characters include 82 invariant (occupy the same code position) characters
(including SPACE) across most EBCDIC single-byte code pages and 13 variant
ASCII graphic characters (occupy varying code positions). Positions assigned
to controls, the invariant graphic characters and the variant graphics
are shown in Table B.1.
Furthermore, the values X'00'...X'9F' do not appear in any byte of an
I8-sequence except as the direct representation of U+0000 to U+009F. Each
Unicode value that is not a part of a valid surrogate pair is represented
in an I8-sequence by 1, 2, 3 or 4 bytes, depending on the Unicode value.
A valid surrogate pair maps into either 4 bytes or 5 bytes, depending on
the Unicode scalar values represented by the surrogate pair.
UTF-8M is intended to be used only as an intermediate step in arriving
at UTF-EBCDIC. It is not intended to be used elsewhere.
The I8-sequence is a variable length encoding of Unicode characters
using 8-bit sequences, where the high bits indicate which part of the sequence
a byte belongs to. Table 1 shows how the bits in
a Unicode value (or a valid surrogate pair) are distributed among the bytes
in the I8-sequence. The corresponding ranges of Unicode scalar values are
also shown.
Table 1: I8-Sequence Bit Distribution
Scalar Value (hex) | Unicode Value | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | 5th Byte |
0 to 7F | 000000000xxxxxxx | 0xxxxxxx | | | | |
80 to 9F | 00000000100xxxxx | 100xxxxx | | | | |
A0 to 3FF | 000000yyyyyxxxxx | 110yyyyy | 101xxxxx | | | |
400 to 3FFF | 00zzzzyyyyyxxxxx | 1110zzzz | 101yyyyy | 101xxxxx | | |
4000 to FFFF | wzzzzzyyyyyxxxxx | 1111000w | 101zzzzz | 101yyyyy | 101xxxxx | |
10000 to 3FFFF | 110110uuuuwzzzzz + 110111yyyyyxxxxx | 11110ppw (a) | 101zzzzz | 101yyyyy | 101xxxxx | |
40000 to 10FFFF | 110110uuuuwzzzzz + 110111yyyyyxxxxx | 1111100q (b) | 101ppppw (b) | 101zzzzz | 101yyyyy | 101xxxxx |
where (a) uuuu = 000pp - 1, or (b) uuuu = qpppp - 1
(to account for the addition of X'10000' as in Section 3.7,
Surrogates, in the Unicode Standard 2.0)
When converting Unicode values to I8-sequences, always use the shortest
number of bytes that can represent these values. This preserves uniqueness
of encoding. For example the Unicode value <0000000000000001> is encoded
as <00000001>, not as <11000000> <10100001>. The latter is an
example of an unused I8-sequence. Do not make use of these unused byte
sequences for encoding any other information.
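The bit distribution in Table 1 can be sketched in C as follows. This is a minimal illustration, not a reference implementation: the function name `i8_encode` and its return convention are assumptions made for this example, and the caller is assumed to have already combined any valid surrogate pair into a single scalar value.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch (names are not from this report).
 * Encode one Unicode scalar value (0 .. 0x10FFFF) as an I8-sequence,
 * always using the shortest form. Writes 1 to 5 bytes to out and
 * returns the byte count, or 0 if the value is out of range. */
size_t i8_encode(uint32_t scalar, uint8_t out[5])
{
    /* Lead-byte markers for 2, 3, 4 and 5-byte sequences. */
    static const uint8_t lead_mark[6] = { 0, 0, 0xC0, 0xE0, 0xF0, 0xF8 };
    size_t n, i;

    if (scalar <= 0x9F) {                 /* C0, G0, DEL and C1: one byte */
        out[0] = (uint8_t)scalar;
        return 1;
    }
    if (scalar <= 0x3FF)         n = 2;   /* 110yyyyy 101xxxxx */
    else if (scalar <= 0x3FFF)   n = 3;   /* 1110zzzz + 2 trailing bytes */
    else if (scalar <= 0x3FFFF)  n = 4;   /* 11110www + 3 trailing bytes */
    else if (scalar <= 0x10FFFF) n = 5;   /* 111110qq + 4 trailing bytes */
    else return 0;

    /* Each trailing byte is 101xxxxx and carries 5 bits, lowest bits last. */
    for (i = n - 1; i > 0; --i) {
        out[i] = (uint8_t)(0xA0 | (scalar & 0x1F));
        scalar >>= 5;
    }
    /* Whatever bits remain fit in the payload of the lead byte. */
    out[0] = (uint8_t)(lead_mark[n] | scalar);
    return n;
}
```

With this sketch, U+00A0 becomes X'C5 A0' and U+FEFF becomes X'F1 BF B7 BF'. Because the length is chosen from the value ranges of Table 1, the lead bytes X'C0' to X'C4' and X'E0' are never produced.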
When converting from I8-sequences to Unicode values, however, implementations
do not need to check that the shortest number of bytes is being used, which
simplifies the conversion algorithm.
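The reverse direction can be sketched in the same way. The names `i8_length` and `i8_decode` are assumptions made for this example. In line with the paragraph above, the sketch does not reject sequences that are longer than necessary; it only checks that the expected number of trailing bytes is present.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch (names are not from this report).
 * Length announced by the first byte of an I8 character:
 * 1 for a single byte, 2..7 for a lead byte, 0 for a trailing byte. */
static size_t i8_length(uint8_t b)
{
    if (b < 0xA0) return 1;   /* X'00' to X'9F' */
    if (b < 0xC0) return 0;   /* 101xxxxx: trailing byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return 7;
}

/* Decode one character from s (len bytes available). Stores the value
 * in *out and returns the number of bytes consumed, or 0 if the input
 * is truncated or does not start with a valid lead byte. */
size_t i8_decode(const uint8_t *s, size_t len, uint32_t *out)
{
    size_t n, i;
    uint32_t v;

    if (len == 0) return 0;
    n = i8_length(s[0]);
    if (n == 0 || n > len) return 0;
    if (n == 1) { *out = s[0]; return 1; }

    /* Payload bits in the lead byte: 5, 4, 3, 2, 1, 1 for n = 2..7. */
    v = (n == 7) ? (s[0] & 0x01u) : (s[0] & (0xFFu >> (n + 1)));
    for (i = 1; i < n; ++i) {
        if ((s[i] & 0xE0) != 0xA0) return 0;   /* not a trailing byte */
        v = (v << 5) | (s[i] & 0x1Fu);
    }
    *out = v;
    return n;
}
```

A caller that needs strict uniqueness would have to add its own shortest-form check, for example by comparing the decoded value against the lower bound of the range for the consumed length.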
As in the case of UTF-8, any byte-serialized Unicode characters in the
"little-endian" (byte-swapped) form must be converted to the corresponding
"big-endian" form before they are converted to the I8-sequence.
Characteristics of I8-sequence
Some of the important characteristics of the I8-sequence are:
-
Unicode characters from U+0000 to U+007F (ASCII repertoire) map to single-byte
I8-sequence values X'00' to X'7F' (ASCII values X'00' to X'7F').
-
Unicode characters from U+0080 to U+009F (C1 controls) map to single-byte
I8-sequence values X'80' to X'9F' (IS 4873 values X'80' to X'9F').
-
ASCII values or IS 4873 values do not otherwise occur in an I8-sequence.
This paves the way for transforming these into the corresponding single-byte
EBCDIC controls and graphics in the second step of the UTF-EBCDIC transform.
-
It is very simple and efficient to convert to and from Unicode text.
-
The first byte indicates the number of bytes to follow in a multi-byte
sequence. This allows for efficient forward parsing.
-
It is efficient to find the start of a character from an arbitrary
location in a byte stream. You need to search at most five bytes (seven
if the full 31-bit range of ISO/IEC 10646 is to be considered) backwards,
and it is simple to recognize an initial byte. For example, after converting
a UTF-EBCDIC byte back into its I8-sequence byte, in C:
    isInitialByte = ((byte & 0xE0) != 0xA0);
The search can also be done directly on the UTF-EBCDIC byte by utilizing
a shadow vector (see Table 4, described later).
-
The I8-sequence is reasonably compact in terms of the number of bytes used
for encoding.
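As an illustration of the backward search described in the list above, the following sketch steps back over trailing bytes until it reaches an initial byte. The helper name `i8_char_start` is an assumption made for this example, and the buffer is assumed to hold I8-sequence bytes (that is, UTF-EBCDIC bytes already mapped back through the reverse byte map).

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch (name is not from this report).
 * Return the index of the first byte of the character that contains
 * buf[pos]. Trailing bytes have the bit pattern 101xxxxx. */
size_t i8_char_start(const uint8_t *buf, size_t pos)
{
    while (pos > 0 && (buf[pos] & 0xE0) == 0xA0)
        --pos;
    return pos;
}
```

For the byte string X'41 F1 BF B7 BF 42', positions 1 to 4 all resolve to index 1, the lead byte of the four-byte character.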
Step 2: Byte Conversion
The second step of UTF-EBCDIC transforms the I8-sequences, using a reversible
one-to-one mapping, into the byte sequences of UTF-EBCDIC.
The 64 control characters (U+0000 to U+001F, U+0080 to U+009F), the
ASCII 'DEL' character (U+007F), and the 95 ASCII graphic characters (including
the SPACE character) (U+0020 to U+007E) are mapped respecting EBCDIC conventions,
as defined in IBM Character Data Representation Architecture (CDRA),
with one exception: the pairings of the EBCDIC Line Feed and New Line control
characters are swapped from their CDRA default pairings with the
ISO 6429 Line Feed (U+000A) and Next Line (U+0085) control characters. This
is in line with IBM OS/390 UNIX Services (Open MVS) practice and preference,
stemming from the hard-coding of X'0A' as the New Line in most ASCII-based C
compilers.
(Note: This is a significant change not discussed
before at UTC meetings. This was brought to light by the implementers of
UTF-EBCDIC in IBM -- that in the MVS Open environment the mapping to and
from Unicode maps the EBCDIC New Line to Unicode (U+000A) and Unicode (U+0085)
to EBCDIC Line Feed. This draft incorporates this change and requires approval
by UTC members / reviewers.)
The map preserves the invariance for a set of 82 graphic characters
(including SPACE) (known as the IBM Syntactic Graphic Character set), and
maintains consistency with IBM MVS Open Systems Code page (CPGID 1047)
for the variant characters from within the ASCII repertoire.
The remaining 96 bytes of the EBCDIC 8-bit structure are allocated to X'A0'
to X'FF', the trailing bytes and leading bytes of the I8-sequence (from
Table 1). The minimum criterion for the allocation of these bytes is that it
provides a reversible map.
The trailing and leading bytes (X'A0' to X'FF' of the I8-sequence) are
paired with the unassigned UTF-EBCDIC bytes in increasing order. Table
2 and Table 3 show the byte maps between the
I8-sequence bytes and UTF-EBCDIC bytes in the forward and reverse directions
respectively. The resulting UTF-EBCDIC multi-byte sequences will be in
the same order as their corresponding Unicode scalar values (when the sequences
are zero-filled to an equal number of bytes and compared with each other).
Please note that the UTF-EBCDIC single-byte values, however, will not
be in the same order as their corresponding Unicode values.
(Note: At the UTC 79 meeting in San Jose, there
was a request for reallocating these bytes to permit the multi-byte UTF-EBCDIC
sequences to be in the same order as their corresponding Unicode values,
as with UTF-8 sequences. This was not accepted at the meeting itself, on
the basis that the perceived benefit was not strong enough to justify possibly
breaking existing implementations. After the meeting, it was ascertained
that it is not too late for implementations to change in order to derive
the perceived benefits of the above ordering. This draft incorporates this
change and requires approval by UTC members / reviewers.)
The resulting UTF-EBCDIC byte sequence can be transparently processed
in most EBCDIC systems. It also retains all the characteristics of the
I8-sequence (see the section "Characteristics of I8-sequence"
above). Since EBCDIC code page definitions
have 13 variants (and only 82 invariants), the above byte
map for the graphic characters has been chosen to accommodate the MVS Open
Systems environment for standardization purposes.
Table 2: Byte map from I8-sequence to UTF-EBCDIC byte sequence
Rows give the high nibble (hex) and columns the low nibble (hex) of the I8-sequence byte; all entries are in hex.
   | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -A | -B | -C | -D | -E | -F |
0- | 00 | 01 | 02 | 03 | 37 | 2D | 2E | 2F | 16 | 05 | 15 | 0B | 0C | 0D | 0E | 0F |
1- | 10 | 11 | 12 | 13 | 3C | 3D | 32 | 26 | 18 | 19 | 3F | 27 | 1C | 1D | 1E | 1F |
2- | 40 | 5A | 7F | 7B | 5B | 6C | 50 | 7D | 4D | 5D | 5C | 4E | 6B | 60 | 4B | 61 |
3- | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | 7A | 5E | 4C | 7E | 6E | 6F |
4- | 7C | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | D1 | D2 | D3 | D4 | D5 | D6 |
5- | D7 | D8 | D9 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | AD | E0 | BD | 5F | 6D |
6- | 79 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 91 | 92 | 93 | 94 | 95 | 96 |
7- | 97 | 98 | 99 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | C0 | 4F | D0 | A1 | 07 |
8- | 20 | 21 | 22 | 23 | 24 | 25 | 06 | 17 | 28 | 29 | 2A | 2B | 2C | 09 | 0A | 1B |
9- | 30 | 31 | 1A | 33 | 34 | 35 | 36 | 08 | 38 | 39 | 3A | 3B | 04 | 14 | 3E | FF |
A- | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 4A | 51 | 52 | 53 | 54 | 55 | 56 |
B- | 57 | 58 | 59 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 6A | 70 | 71 | 72 | 73 |
C- | 74 | 75 | 76 | 77 | 78 | 80 | 8A | 8B | 8C | 8D | 8E | 8F | 90 | 9A | 9B | 9C |
D- | 9D | 9E | 9F | A0 | AA | AB | AC | AE | AF | B0 | B1 | B2 | B3 | B4 | B5 | B6 |
E- | B7 | B8 | B9 | BA | BB | BC | BE | BF | CA | CB | CC | CD | CE | CF | DA | DB |
F- | DC | DD | DE | DF | E1 | EA | EB | EC | ED | EE | EF | FA | FB | FC | FD | FE |
Underscored entries denote changes from the previous draft of this UTR.
Note: I8-sequence bytes C0 … C4, and E0, and the corresponding UTF-EBCDIC
bytes 74 ... 78, and B7, will not be used with the shortest
number of bytes in the transformed byte sequences. The corresponding
entries are shown italicized in the above table.
Table 3: Byte map from UTF-EBCDIC byte sequence to I8-sequence
Rows give the high nibble (hex) and columns the low nibble (hex) of the UTF-EBCDIC byte; all entries are in hex.
   | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -A | -B | -C | -D | -E | -F |
0- | 00 | 01 | 02 | 03 | 9C | 09 | 86 | 7F | 97 | 8D | 8E | 0B | 0C | 0D | 0E | 0F |
1- | 10 | 11 | 12 | 13 | 9D | 0A | 08 | 87 | 18 | 19 | 92 | 8F | 1C | 1D | 1E | 1F |
2- | 80 | 81 | 82 | 83 | 84 | 85 | 17 | 1B | 88 | 89 | 8A | 8B | 8C | 05 | 06 | 07 |
3- | 90 | 91 | 16 | 93 | 94 | 95 | 96 | 04 | 98 | 99 | 9A | 9B | 14 | 15 | 9E | 1A |
4- | 20 | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | 2E | 3C | 28 | 2B | 7C |
5- | 26 | AA | AB | AC | AD | AE | AF | B0 | B1 | B2 | 21 | 24 | 2A | 29 | 3B | 5E |
6- | 2D | 2F | B3 | B4 | B5 | B6 | B7 | B8 | B9 | BA | BB | 2C | 25 | 5F | 3E | 3F |
7- | BC | BD | BE | BF | C0 | C1 | C2 | C3 | C4 | 60 | 3A | 23 | 40 | 27 | 3D | 22 |
8- | C5 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | C6 | C7 | C8 | C9 | CA | CB |
9- | CC | 6A | 6B | 6C | 6D | 6E | 6F | 70 | 71 | 72 | CD | CE | CF | D0 | D1 | D2 |
A- | D3 | 7E | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 7A | D4 | D5 | D6 | 5B | D7 | D8 |
B- | D9 | DA | DB | DC | DD | DE | DF | E0 | E1 | E2 | E3 | E4 | E5 | 5D | E6 | E7 |
C- | 7B | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | E8 | E9 | EA | EB | EC | ED |
D- | 7D | 4A | 4B | 4C | 4D | 4E | 4F | 50 | 51 | 52 | EE | EF | F0 | F1 | F2 | F3 |
E- | 5C | F4 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 5A | F5 | F6 | F7 | F8 | F9 | FA |
F- | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | FB | FC | FD | FE | FF | 9F |
Underscored entries denote changes from the previous draft of this UTR.
Note: I8-sequence bytes C0 … C4, and E0, and the corresponding UTF-EBCDIC
bytes 74 ... 78, and B7, will not be used with the shortest
number of bytes in the transformed byte sequences. The corresponding
entries are shown italicized in the above table.
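The byte conversion of step 2 amounts to a 256-entry table lookup in each direction. The sketch below holds the forward map of Table 2 as a C array and derives the reverse map by inversion, which works only because the map is one-to-one. The identifiers (`i8_to_ebcdic`, `utfebcdic_init`, and so on) are assumptions made for this example and are not defined by this report.

```c
#include <stddef.h>
#include <stdint.h>

/* Forward map, transcribed row by row from Table 2:
 * index = I8-sequence byte, value = UTF-EBCDIC byte. */
static const uint8_t i8_to_ebcdic[256] = {
    0x00,0x01,0x02,0x03,0x37,0x2D,0x2E,0x2F,0x16,0x05,0x15,0x0B,0x0C,0x0D,0x0E,0x0F,
    0x10,0x11,0x12,0x13,0x3C,0x3D,0x32,0x26,0x18,0x19,0x3F,0x27,0x1C,0x1D,0x1E,0x1F,
    0x40,0x5A,0x7F,0x7B,0x5B,0x6C,0x50,0x7D,0x4D,0x5D,0x5C,0x4E,0x6B,0x60,0x4B,0x61,
    0xF0,0xF1,0xF2,0xF3,0xF4,0xF5,0xF6,0xF7,0xF8,0xF9,0x7A,0x5E,0x4C,0x7E,0x6E,0x6F,
    0x7C,0xC1,0xC2,0xC3,0xC4,0xC5,0xC6,0xC7,0xC8,0xC9,0xD1,0xD2,0xD3,0xD4,0xD5,0xD6,
    0xD7,0xD8,0xD9,0xE2,0xE3,0xE4,0xE5,0xE6,0xE7,0xE8,0xE9,0xAD,0xE0,0xBD,0x5F,0x6D,
    0x79,0x81,0x82,0x83,0x84,0x85,0x86,0x87,0x88,0x89,0x91,0x92,0x93,0x94,0x95,0x96,
    0x97,0x98,0x99,0xA2,0xA3,0xA4,0xA5,0xA6,0xA7,0xA8,0xA9,0xC0,0x4F,0xD0,0xA1,0x07,
    0x20,0x21,0x22,0x23,0x24,0x25,0x06,0x17,0x28,0x29,0x2A,0x2B,0x2C,0x09,0x0A,0x1B,
    0x30,0x31,0x1A,0x33,0x34,0x35,0x36,0x08,0x38,0x39,0x3A,0x3B,0x04,0x14,0x3E,0xFF,
    0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48,0x49,0x4A,0x51,0x52,0x53,0x54,0x55,0x56,
    0x57,0x58,0x59,0x62,0x63,0x64,0x65,0x66,0x67,0x68,0x69,0x6A,0x70,0x71,0x72,0x73,
    0x74,0x75,0x76,0x77,0x78,0x80,0x8A,0x8B,0x8C,0x8D,0x8E,0x8F,0x90,0x9A,0x9B,0x9C,
    0x9D,0x9E,0x9F,0xA0,0xAA,0xAB,0xAC,0xAE,0xAF,0xB0,0xB1,0xB2,0xB3,0xB4,0xB5,0xB6,
    0xB7,0xB8,0xB9,0xBA,0xBB,0xBC,0xBE,0xBF,0xCA,0xCB,0xCC,0xCD,0xCE,0xCF,0xDA,0xDB,
    0xDC,0xDD,0xDE,0xDF,0xE1,0xEA,0xEB,0xEC,0xED,0xEE,0xEF,0xFA,0xFB,0xFC,0xFD,0xFE
};

/* Reverse map (the content of Table 3), filled in by utfebcdic_init(). */
static uint8_t ebcdic_to_i8[256];

/* Invert the forward map; must be called once before utfebcdic_to_i8(). */
void utfebcdic_init(void)
{
    for (int i = 0; i < 256; ++i)
        ebcdic_to_i8[i8_to_ebcdic[i]] = (uint8_t)i;
}

/* Step 2: convert n I8-sequence bytes to UTF-EBCDIC, byte for byte. */
void i8_to_utfebcdic(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = i8_to_ebcdic[in[i]];
}

/* Reverse of step 2: convert n UTF-EBCDIC bytes to I8-sequence bytes. */
void utfebcdic_to_i8(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = ebcdic_to_i8[in[i]];
}
```

Deriving the reverse array from the forward one avoids keeping two hand-typed tables in step; an implementation could equally store Table 3 as a second constant array.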
Shadow flags
A shadow flags table (Table 4) can be used to find out whether a byte in a
UTF-EBCDIC sequence is a leading byte or a trailing byte, and how many bytes
make up the sequence for a Unicode character. The table gives the category
of each byte directly, without examining the bit combination of the
corresponding I8-sequence byte or comparing that byte against the known
ranges of leading and trailing bytes. Bytes with a value of '0' in
the category table are control characters, '1' are single-byte graphic
characters, '9' are trailing bytes, and '2' to '7' are leading bytes, where
the value gives the number of bytes in the sequence.
Even though Table 1 shows I8-sequences of only up
to 5 bytes (to transform up to plane 16), an I8-sequence can contain up
to 7 bytes to address all of the UCS-4 space (31 bits) in the ISO/IEC 10646
standard (see Table B.2 in Annex B).
(Note: The categories of bytes in the shadow
flag table have changed to reflect the changes in Tables 2 and 3 above.
This draft incorporates this change and requires approval by UTC members
/ reviewers.)
Table 4: Shadow flags associated with UTF-EBCDIC bytes
LEGEND:
0 = Single-octet control characters
1 = Single-octet invariant and variant graphic characters from ASCII
2 = Lead octet of a 2-octet string
3 = Lead octet of a 3-octet string
4 = Lead octet of a 4-octet string
5 = Lead octet of a 5-octet string
6 = Lead octet of a 6-octet string
7 = Lead octet of a 7-octet string
9 = A trailing octet of a multi-octet string
(Underscore indicates change from previous draft of this TR)
Rows give the high nibble (hex) and columns the low nibble (hex) of the UTF-EBCDIC byte.
   | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -A | -B | -C | -D | -E | -F |
0- | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  |
1- | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  |
2- | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  |
3- | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  |
4- | 1  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 1  | 1  | 1  | 1  | 1  |
5- | 1  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 1  | 1  | 1  | 1  | 1  | 1  |
6- | 1  | 1  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 9  | 1  | 1  | 1  | 1  | 1  |
7- | 9  | 9  | 9  | 9  | 2  | 2  | 2  | 2  | 2  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |
8- | 2  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 2  | 2  | 2  | 2  | 2  | 2  |
9- | 2  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 2  | 2  | 2  | 2  | 2  | 2  |
A- | 2  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 2  | 2  | 2  | 1  | 2  | 2  |
B- | 2  | 2  | 2  | 2  | 2  | 2  | 2  | 3  | 3  | 3  | 3  | 3  | 3  | 1  | 3  | 3  |
C- | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 3  | 3  | 3  | 3  | 3  | 3  |
D- | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 3  | 3  | 4  | 4  | 4  | 4  |
E- | 1  | 4  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 4  | 4  | 4  | 5  | 5  | 5  |
F- | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 5  | 6  | 6  | 7  | 7  | 0  |
Note: I8-sequence bytes C0 … C4, and E0, and the corresponding UTF-EBCDIC
bytes 74 ... 78, and B7, will not be used with the shortest
number of bytes in the transformed byte sequences. The corresponding shadow
flag values are shown italicized in the above table.
Underscored entries denote changes from the previous draft of this UTR.
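The shadow flags need not be typed in by hand: each entry of Table 4 follows from the category of the I8-sequence byte that the UTF-EBCDIC byte maps back to. The sketch below shows one way to compute them. The names `i8_shadow` and `build_shadow` are assumptions made for this example, and the caller is assumed to supply the 256-entry reverse byte map of Table 3.

```c
#include <stdint.h>

/* Illustrative sketch (names are not from this report).
 * Category of an I8-sequence byte, using the codes of the Table 4 legend. */
int i8_shadow(uint8_t b)
{
    if (b < 0x20 || b == 0x7F) return 0;  /* C0 controls and DEL */
    if (b < 0x7F) return 1;               /* ASCII graphic characters */
    if (b < 0xA0) return 0;               /* C1 controls */
    if (b < 0xC0) return 9;               /* trailing byte, 101xxxxx */
    if (b < 0xE0) return 2;               /* lead byte of 2-byte sequence */
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return 7;
}

/* Fill shadow[], indexed by UTF-EBCDIC byte, from a caller-supplied
 * UTF-EBCDIC to I8 byte map (the content of Table 3). */
void build_shadow(const uint8_t ebcdic_to_i8[256], uint8_t shadow[256])
{
    for (int i = 0; i < 256; ++i)
        shadow[i] = (uint8_t)i8_shadow(ebcdic_to_i8[i]);
}
```

For example, UTF-EBCDIC X'DD' maps back to I8-sequence X'F1' in Table 3, which falls in the four-byte lead range, in agreement with the value 4 at row D-, column -D of Table 4.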
Signature
The signature character U+FEFF (zero width no-break space) of Unicode
transforms into the I8-sequence X'F1 BF B7 BF', which maps to X'DD
73 66 73' in UTF-EBCDIC. When this sequence is displayed (erroneously)
using different single-byte EBCDIC code pages, it can be visualized as
different character strings. In Latin-1 EBCDIC code page 1047 (and coincidentally
also in Latin-1 code pages 500 and 37), this byte sequence appears as "ùËÃË"
(small letter u with grave, capital letter E with diaeresis, capital
letter A with tilde, capital letter E with diaeresis); the second and
fourth bytes are both X'73' and so display identically. It can appear
differently with other single-byte EBCDIC code pages. As with UTF-8,
byte-swapped ("little-endian") serialized Unicode byte strings must be
converted to their "big-endian" equivalents before applying the UTF-EBCDIC
transformation.
(Note: The signature sequence of UTF-EBCDIC
has changed to reflect the changes in Tables 3 and 4 above. This draft
incorporates this change and requires approval by UTC members / reviewers.)
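A receiver that wants to test for the signature can compare the first four bytes of the data against the UTF-EBCDIC sequence given above. The following is a minimal sketch; the names `UTFEBCDIC_SIGNATURE` and `has_utfebcdic_signature` are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* UTF-EBCDIC form of U+FEFF as stated in this section. */
static const uint8_t UTFEBCDIC_SIGNATURE[4] = { 0xDD, 0x73, 0x66, 0x73 };

/* Illustrative sketch: returns 1 if buf starts with the signature, else 0. */
int has_utfebcdic_signature(const uint8_t *buf, size_t len)
{
    return len >= sizeof UTFEBCDIC_SIGNATURE &&
           memcmp(buf, UTFEBCDIC_SIGNATURE, sizeof UTFEBCDIC_SIGNATURE) == 0;
}
```

Since the byte values depend on the byte map of this draft, a change to Table 2 would change the constant as well.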
Where to use UTF-EBCDIC?
UTF-EBCDIC is intended to be used inside EBCDIC systems or in closed networks
where there is a dependency on EBCDIC hard-coding assumptions. It is not
meant to be used for open interchange among heterogeneous platforms using
different data encodings. Due to specific requirements for ASCII encoding
for line endings in some Internet protocols, UTF-EBCDIC is unsuitable for
use over the Internet using such protocols. UTF-8 or UTF-16 forms should
be used in open interchange.
Bibliography
The Unicode Standard Version 2.0: The Unicode Consortium ISBN 0-201-48345-9,
Addison Wesley Developers Press, July 1996.
CDRA: IBM - Character Data Representation Architecture - Reference and
Registry, SC09-2190-00, December 1996.
ISO/IEC 10646-1: 1993(E): Information Processing - Universal Coded Character
Set (UCS):Part 1, Basic Multilingual Plane
Amendment 1 to ISO/IEC 10646-1: Transformation Format for 16 Planes
of Group 00 (UTF-16); 1996
Amendment 2 to ISO/IEC 10646-1: Transformation Format 8 (UTF-8)
ISO/IEC 646: Information Processing - 7-Bit Coded Character Set for
Information Interchange
ASCII – ANSI Standard X3.4; also the International Reference Version
of ISO/IEC 646 - 1993
ISO/IEC 4873: Information Processing - 8-Bit Code for Information Interchange
- Structure and Rules for implementation
ISO/IEC 6429: Information Processing - 7-Bit and 8-Bit Coded Character
Sets - Control Functions for Coded Character Sets
ISO/IEC 8859-xx: Information Processing - 8-Bit Single-Byte Coded Graphic
Character Sets (several parts)
SHARE Report SSD No. 366: ASCII and EBCDIC Character Set and Code Issues
in Systems Application Architecture, The ASCII/EBCDIC Character Set Task
Force. Edited by Edwin Hart, The Johns Hopkins University, Applied Physics
Laboratory, Laurel, Maryland, USA; published by Share Inc., 111 East Wacker
Drive, Chicago, Illinois, USA 60601; June 1989
ANNEX A: Intellectual Property Related
Transcript of Letter
regarding Disclosure of IBM Technology - EF-UTF
(Hard copy is on file with the Chair of UTC and the Chair of NCITS/L2)
Transcribed on 1998-07-11
IBM LOGO
International Business Machines Corporation Route 100
Somers, NY 10589
June 2, 1998
The Chair, Unicode Technical Committee
Subject: Disclosure of IBM Technology - EBCDIC-Friendly UCS Transformation
Format (EF-UTF)
The attached document entitled "EBCDIC-Friendly UCS Transformation Format
(EF-UTF)" contains IBM technology that has been filed for application for
Canadian Patent. However, IBM believes that the technology could be beneficial
to the EBCDIC community at large; allowing the community to derive the
enormous benefits provided by UCS (ISO/IEC 10646 and Unicode).
This letter is to inform you that IBM is pleased to make the attached
documentation, and the associated technology that has been filed for patent,
freely available to anyone concerned towards making the transformation
format as part of the UCS standards.
Sincerely
SIGNED
Elizabeth G. Nichols
Director of National Language Support
and Information Development
EGN:ghs
Attachment
(Note: The term EF-UTF has been changed to UTF-EBCDIC at the suggestion
of UTC meeting 78 -- V.S. Umamaheswaran)
ANNEX B: Additional Information
Positions of controls, variants and invariants in EBCDIC
The positions assigned to the 65 control characters, the 82 invariant graphic
characters (including SPACE) and the 13 variant graphic characters among the
various EBCDIC code pages in use are shown in the following table.
Table B.1: Positions of controls, variants and invariants in EBCDIC
Rows give the high nibble (hex) and columns the low nibble (hex); blank cells are the remaining positions.
   | -0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -A | -B | -C | -D | -E | -F |
0- | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc |
1- | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc |
2- | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc |
3- | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc | cc |
4- | ii |    |    |    |    |    |    |    |    |    |    | ii | ii | ii | ii | vv |
5- | ii |    |    |    |    |    |    |    |    |    | vv | vv | ii | ii | ii | vv |
6- | ii | ii |    |    |    |    |    |    |    |    |    | ii | ii | ii | ii | ii |
7- |    |    |    |    |    |    |    |    |    | vv | ii | vv | vv | ii | ii | ii |
8- |    | ii | ii | ii | ii | ii | ii | ii | ii | ii |    |    |    |    |    |    |
9- |    | ii | ii | ii | ii | ii | ii | ii | ii | ii |    |    |    |    |    |    |
A- |    | vv | ii | ii | ii | ii | ii | ii | ii | ii |    |    |    | vv |    |    |
B- |    |    |    |    |    |    |    |    |    |    |    |    |    | vv |    |    |
C- | vv | ii | ii | ii | ii | ii | ii | ii | ii | ii |    |    |    |    |    |    |
D- | vv | ii | ii | ii | ii | ii | ii | ii | ii | ii |    |    |    |    |    |    |
E- | vv |    | ii | ii | ii | ii | ii | ii | ii | ii |    |    |    |    |    |    |
F- | ii | ii | ii | ii | ii | ii | ii | ii | ii | ii |    |    |    |    |    | cc |
cc = EBCDIC control character positions;
ii = EBCDIC invariants from ASCII repertoire; vv = EBCDIC variants from
ASCII repertoire
A comparison of UTF-EBCDIC and UTF-8
UTF-EBCDIC is a byte-mapped version of the I8-sequence. The bit patterns of
UTF-EBCDIC bytes and UTF-8 bytes are therefore different, and a direct
comparison of them is not very meaningful. However, the I8-sequence
and the UTF-8 sequence can be compared to understand the salient differences
between the two. UTF-8M, being derived from UTF-8, retains all of its salient
features. A comparative summary of the basic characteristics of the I8-sequence
and the UTF-8 sequence is shown in Table B.2
below. Note that this table shows the entire 31-bit range of the
transformation, whereas Table 1 includes only the
BMP and up to plane 16 using surrogate pairs.
Table B.2: Comparison of I8-Sequence with UTF-8 Generated Byte Sequence
 | I8-sequence | UTF-8-sequence | Remarks |
No. of bytes in transformed sequence | Scalar Values (hex) | Scalar Values (hex) | |
1 | 00 to 9F | 00 to 7F | C0, G0 and C1 in I8-sequence; C0 and G0 in UTF-8 |
2 | A0 to 3FF | 80 to 7FF | |
3 | 400 to 3FFF | 800 to FFFF | To U+3FFF (first quarter of BMP) in I8-sequence; to end of BMP in UTF-8 |
4 | 4000 to 3 FFFF | 1 0000 to 1F FFFF | To end of plane 3 in I8-sequence; to end of plane 16 in UTF-8 |
5 | 4 0000 to 3F FFFF | 20 0000 to 3FF FFFF | To end of plane 16 in I8-sequence |
6 | 40 0000 to 3FF FFFF | 400 0000 to 7FFF FFFF | To end of UCS in UTF-8 |
7 | 400 0000 to 7FFF FFFF | Not used | To end of UCS in I8-sequence |
Trailing Bytes | 32 values - X'A0' -- X'BF'; B'101vvvvv'; 5 v-bits per byte | 64 values - X'80' -- X'BF'; B'10vvvvvv'; 6 v-bits per byte | I8-sequence trailing byte has only five information bits per trailing byte, compared to 6 in UTF-8 |
Lead Bytes for: | Hex | Hex | |
2-Byte sequence | C0 -- DF | C0 -- DF | Same in both |
3-Byte sequence | E0 -- EF | E0 -- EF | Same in both |
4-Byte sequence | F0 -- F7 | F0 -- F7 | Same in both |
5-Byte sequence | F8 -- FB | F8 -- FB | Same in both |
6-Byte sequence | FC and FD | FC and FD | Same in both |
7-Byte sequence | FE and FF | Not used | Only used in UTF-8M |
Special nature of UCS values FEFF, FFFE and FFFF
U+FFFE and U+FFFF are not used for character allocation in any plane of
Unicode. U+FEFF (zero width no-break space) is used as a signature
for Unicode, for both UCS-2 and UTF-16 forms. U+FFFE may strongly
suggest a byte-reversed Unicode string. U+FFFF is used to represent a numeric
value that is guaranteed not to be a character, for uses such as the final
value at the end of an index. UTF-8 also avoids the use of X'FF' and X'FE'
as octets in its sequences. In I8-sequence, however, X'FE' and X'FF' may
appear. The following paragraphs expand on which combinations of X'FF'
and X'FE' may occur in an I8-sequence or UTF-EBCDIC sequence.
-
X'FE' X'FF', X'FF' X'FE' and X'FF' X'FF' in the I8-sequence
The X'FE' and X'FF' are lead octets of seven-byte I8-sequence (assuming
values from all the planes of UCS-4). They will be surrounded (in a properly
formed I8-sequence) by a value less than X'C0'. None of the sequences X'FF'
X'FF', X'FE' X'FF', and X'FF' X'FE' can appear in a well-formed I8-sequence.
-
X'FE' X'FF', X'FF' X'FE' and X'FF' X'FF' in the UTF-EBCDIC sequence
The I8-sequence to UTF-EBCDIC byte mappings are: X'FE' to X'FD',
and X'FF' to X'FE' (see Table 2). The
values X'FE' and X'FF' can be generated in a UTF-EBCDIC byte sequence from
I8-sequence values by mapping X'FF' to X'FE' and X'9F' to X'FF' (from Table
2).
X'FF' is the lead byte of a seven-byte I8 sequence and must be followed
by six trailing bytes in the range X'A0' to X'BF', which does not include
X'9F'. So the X'FE' X'FF' sequence cannot appear in UTF-EBCDIC.
The X'9F' is assigned to the control character -- Application Program
Command (APC) -- in ISO-8 C1. According to ISO/IEC 6429, the APC is followed
by a parameter string using bit combinations from 0/8 to 0/13 (X'08' to
X'0D') and 2/0 to 7/14 (X'20' to X'7E') and terminated by the control function
String Terminator (ST) (coded at X'9C' in C1). Therefore, the sequence
X'FF' X'FF', the equivalent of two APC controls without intervening parameters
or ST-s, also should not appear in UTF-EBCDIC sequence.
None of the valid parameter bit combinations can generate a 7-byte I8 sequence
that starts with X'FF'. So the sequence X'FF' X'FE' also cannot appear
in a UTF-EBCDIC sequence.
(Note: The above section has been rewritten to
reflect the changes in Tables 3 and 4 above. This draft incorporates this
change and requires approval by UTC members / reviewers.)
Normalization to fixed width
Dealing with a variable number of bytes may not be possible or desirable
in some processing situations (even though proper handling of Unicode text
strings will require the ability to correctly deal with combining sequences).
Normalization into a form with a fixed number of bits is needed for such
cases. It would always be desirable to revert to the original 16-bit form
or the corresponding 32-bit form as a normalization to fixed-width data.
However, this is possible only if the processing is tolerant of native
Unicode encoding. If transparency to EBCDIC invariance and controls is
also needed in the normalized form, then Unicode cannot be used directly
for normalization. It can be seen from Table 1 that
the last code position in the BMP (U+FFFF) requires four bytes in the
I8-sequence and in the corresponding UTF-EBCDIC sequence. A 32-bit integer
can be used for normalization of UTF-EBCDIC sequences of up to four bytes.
The maximum Unicode scalar value that a four-byte I8-sequence or UTF-EBCDIC
sequence can represent is:
<11110111 10111111 10111111 10111111> (X'3FFFF')
corresponding to the end of plane 3 in group 0. Planes 1 to 16 can be
represented in UTF-16 using the surrogate characters in the BMP. By treating
the surrogate characters as any other BMP characters, up to plane 16 can
be encoded using the 16-bit form, and hence can be contained within the
32-bit normalized form of UTF-EBCDIC. Care has to be taken to correctly
process the UTF-EBCDIC sequences corresponding to the surrogate
pairs, similar to dealing with combining sequences. When it is desirable
to convert valid surrogate pairs into the corresponding Unicode scalar values
and then apply UTF-EBCDIC, only values up to plane 3 can be contained within the
32-bit normalized value. For all values beyond group 0, plane 3 of UCS,
the UTF-EBCDIC sequence will contain more than four octets. The normalization for
these cases will need 64 bits (assuming nothing between 32 and 64 bits
is practical).
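One possible shape for the 32-bit normalized form is sketched below. The report does not prescribe a layout, so the choice made here (bytes packed in order, right-aligned, with leading zero bytes) is an assumption for illustration only, as is the function name `utfebcdic_pack32`.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch (name and layout are assumptions, not from this
 * report). Pack one UTF-EBCDIC character of 1 to 4 bytes into a 32-bit
 * value, right-aligned with leading zero bytes. Returns 0 on success,
 * or -1 if the sequence does not fit in four bytes. */
int utfebcdic_pack32(const uint8_t *seq, size_t n, uint32_t *out)
{
    uint32_t v = 0;

    if (n < 1 || n > 4) return -1;
    for (size_t i = 0; i < n; ++i)
        v = (v << 8) | seq[i];
    *out = v;
    return 0;
}
```

Under this layout the single-byte controls and graphic characters keep their one-byte EBCDIC values in the low-order byte, and a five-byte sequence is reported as not representable, which corresponds to the values beyond plane 3 discussed above.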
On the mapping of bytes in step 2
The control code position mappings used in the default Unicode to EBCDIC code
page mappings follow the pairings between the IS 6429 C0, DEL and C1 sets
and the EBCDIC controls as defined by default in IBM Character Data
Representation Architecture, customized to the practice of OS/390 UNIX
Services (MVS Open). These pairings may not suit all EBCDIC environments.
A well-known problem, mentioned earlier, is that of mapping EBCDIC New Line
to Next Line in C1 of IS 6429 versus Line Feed in C0. Similarly, it is known that
the 13 variant characters are different among the various single-byte EBCDIC
code pages. The well-known impact of this is exemplified by the different
code positions of the square bracket characters. Even the lowercase a to
z are variant in the EBCDIC Katakana code page. A judicious one-to-one
reversible byte map to convert only those code points with the category marked as
'0' or '1' may be employed as a step 3. Such a step 3 is not considered
to be part of the UTF-EBCDIC transformation that is defined in this technical
report, and is considered a customization to suit individual environments.
Similarly, the pairing of I8-sequence bytes and UTF-EBCDIC sequence bytes
could be done in multiple ways. The simplest requirement on this byte pairing
is that it should be unique and reversible. The pairing adopted in this
version of the UTR is based on the request from Oracle Corporation's representative,
Mr. Jianping Yang, to be able to maintain the order of the UTF-EBCDIC
multi-byte sequences the same as the order of the corresponding Unicode
scalar values.
On the ordering of UTF-EBCDIC sequences
The mapping of the I8-bytes to UTF-EBCDIC bytes allows the multi-byte UTF-EBCDIC
sequences (corresponding to a Unicode character each) to be in the same
order as their corresponding Unicode values. The ordering of the
trailing bytes and the leading bytes in the UTF-EBCDIC sequence (from Table
4) is:
    trailing bytes << leading bytes of 2-byte sequences << .. .. ..
        << leading bytes of 7-byte sequences
The byte values within each set are ordered in increasing order. Note
that the UTF-EBCDIC single bytes do not have this property, either among
themselves or between themselves and the bytes of the multi-byte UTF-EBCDIC
sequences. The single bytes are ordered according to their CP1047
order. So doing a "binary comparison" of the text would look like:
    for (i = 0; i < n; ++i) {
        byte1 = source1[i];
        byte2 = source2[i];
        if (byte1 == byte2) continue; // fast path
        // check for the single bytes vs multibytes
        // (shadow values below 2 mark single bytes)
        if (shadow[byte1] < 2) {
            if (shadow[byte2] >= 2) return -1; // single bytes less than multi
        } else {
            if (shadow[byte2] < 2) return 1; // multibyte greater than single
        }
        // now the shadows are of the same type, so just compare the bytes
        if (byte1 < byte2) return -1;
        return 1;
    }
    return 0; // the first n bytes are equal
The resulting order is a mix of EBCDIC CP1047 order for the single bytes
and Unicode order for the multi-byte UTF-EBCDIC characters.
However, if the desired order is to be the same order as Unicode values
for all the characters, both the single-byte and the multi-byte characters,
the intermediate I8-sequence bytes should be compared. This approach
also makes the comparison immune to any local customization of the mapping
(see On the Mapping .. ) and provides a
consistent Unicode value order. The following is a sample for the
comparison code.
    for (i = 0; i < n; ++i) {
        byte1 = source1[i];
        byte2 = source2[i];
        if (byte1 == byte2) continue; // fast path
        // compare the I8-sequence counterparts;
        // I8-sequence bytes, like UTF-8 bytes, preserve the same
        // order as Unicode values
        // ebtoi8 is the reverse mapping vector from UTF-EBCDIC to I8 bytes
        if (ebtoi8[byte1] < ebtoi8[byte2]) return -1;
        return 1;
    }
    return 0; // the first n bytes are equal
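That a byte-wise comparison of I8-sequences follows the order of the Unicode values can be checked with a small encoder. The sketch below assumes the I8 bit distribution defined earlier in this report (single bytes up to 9F, trailing bytes of the form 101xxxxx carrying five bits each, and leading bytes C0, E0, F0, F8, FC and FE for sequences of two to seven bytes). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encode one scalar value as an I8-sequence.
   Returns the number of bytes written (1..7), or 0 if out of range. */
int i8_encode(uint32_t scalar, unsigned char out[7])
{
    static const uint32_t limit[7] = {
        0x9Fu, 0x3FFu, 0x3FFFu, 0x3FFFFu, 0x3FFFFFu, 0x3FFFFFFu, 0x7FFFFFFFu
    };
    static const unsigned char lead[7] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE
    };
    int len = 1, i;

    if (scalar > 0x7FFFFFFFu) return 0;
    while (scalar > limit[len - 1]) ++len;
    if (len == 1) {
        out[0] = (unsigned char)scalar;     /* single byte: value itself */
        return 1;
    }
    for (i = len - 1; i > 0; --i) {         /* trailing bytes, low bits last */
        out[i] = (unsigned char)(0xA0u | (scalar & 0x1Fu));
        scalar >>= 5;
    }
    out[0] = (unsigned char)(lead[len - 1] | scalar);
    return len;
}

/* Compare two scalar values by the bytes of their I8-sequences.
   Returns a negative, zero or positive value. */
int i8_compare(uint32_t a, uint32_t b)
{
    unsigned char ea[7], eb[7];
    int la = i8_encode(a, ea);
    int lb = i8_encode(b, eb);
    int shared, c;

    assert(la > 0 && lb > 0);               /* both values must be encodable */
    shared = la < lb ? la : lb;
    c = memcmp(ea, eb, (size_t)shared);
    return c != 0 ? c : la - lb;
}
```

With this layout, single bytes (00 to 9F) sort below trailing bytes (A0 to BF), which sort below leading bytes, and longer sequences have larger leading bytes, so the sign of i8_compare(a, b) agrees with the numeric comparison of a and b.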
If the desire is to preserve the EBCDIC order for the single bytes (the
ASCII repertoire) or the traditional order of the multi-byte sequences
(such as for EBCDIC-Japanese, EBCDIC-Cyrillic, EBCDIC-Arabic etc.),
localization resources such as a weight lookup table in locales should be
employed.
Credits
The UTF-EBCDIC transformation was originally created and developed
in the National Language Technical Centre in IBM Toronto Laboratory by
Messrs. Baldev Soor, Alexis Cheng, Rick Pond, Ibrahim Meru and V.S. (Uma)
Umamaheswaran. The original version has been modified based on review
feedback on the previous versions of this Unicode Technical Report.
Changes from previous revisions
This is the stable second version of this technical report. It has been
completely re-written based on input received from some UTC reviewers to
make it more suitable as a Unicode technical report than a tutorial document.
It incorporates changes to address the review comments received on the
distributed copy prior to Unicode Technical Committee meeting UTC 79, and
corrects errors pointed out by several e-mail comments received. It also
incorporates the significant changes to the byte pairings in Step 2 to
swap the pairings of Line Feed and New Line controls and reorder the allocation
of the trailing bytes and leading bytes to result in UTF-EBCDIC multi-byte
sequences being in the same order as their corresponding Unicode scalar
values (the single byte values will be in their EBCDIC CP 1047 order).
Copyright
Copyright (C)1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium
makes no expressed or implied warranty of any kind, and assumes no liability
for errors or omissions. No liability is assumed for incidental and consequential
damages in connection with or arising out of the use of the information
or programs contained or accompanying this technical report.