UTF-17

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jun 21 2001 - 17:38:40 EDT


In the way of solutions seeking a problem, I would like to
propose a new UTF: UTF-17.

UTF-17 converts each Unicode code point to a sequence of
1 synchronizing byte followed by 7 further bytes, for a total
of 8 bytes per character. Each code point in the range
0..10FFFF is treated as a 21-bit integer, and the 21 bits
are distributed according to the following formula:

x xxxx xxxx xxxx xxxx xxxx

==>

00111000 00110xxx 00110xxx 00110xxx 00110xxx 00110xxx 00110xxx 00110xxx
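
For anyone who would rather not do the bit shuffling by hand, here is a
minimal sketch of the pattern above in Python. The function name
utf17_encode_code_point is an illustrative choice and not part of the
proposal, and the sketch implements only the general pattern, with no
special cases.

```python
def utf17_encode_code_point(cp: int) -> bytes:
    """Encode one code point as 8 bytes: a lead byte plus seven 00110xxx bytes."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    out = bytearray([0x38])  # 00111000, the synchronizing byte
    # Walk the 21 bits from most significant to least, three at a time.
    for shift in range(18, -1, -3):
        out.append(0x30 | ((cp >> shift) & 0b111))
    return bytes(out)
```

Called on U+0041, this should give the bytes <38 30 30 30 30 31 30 31>,
since 0x41 is 101 in octal.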

In UTF-17, for example, the Han character sequence <U+5341, U+4E03>
('17') would be converted to:

<38 30 30 35 31 35 30 31 38 30 30 34 37 30 30 33>

Because all UTF-17 bytes are in the range 0x30..0x38, this
UTF-17 byte sequence would also be visibly displayed in
ASCII (or Latin-1) as: "8005150180047003".

One special exception is provided. U+0000 is transformed into
UTF-17 with 00000xxx as the pattern for the 8th byte, rather
than 00110xxx. Thus U+0000 has the unique representation:

<38 30 30 30 30 30 30 00> (or "8000000'\0'")

This is for C compatibility, so that any null-terminated Unicode
string will also be null-terminated in UTF-17.
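
The exception is easy to bolt onto a string encoder. The following is a
rough sketch in Python, assuming the general 00110xxx pattern for every
other code point; utf17_encode is a hypothetical name chosen for
illustration.

```python
def utf17_encode(text: str) -> bytes:
    """Encode a string as UTF-17, using the 00000xxx 8th byte for U+0000."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        out.append(0x38)  # lead byte
        for shift in range(18, -1, -3):
            out.append(0x30 | ((cp >> shift) & 0b111))
        if cp == 0:
            out[-1] = 0x00  # special case: 8th byte becomes a real NUL
    return bytes(out)
```

With this, utf17_encode("A\0") ends in a 0x00 byte, so a C routine such
as strlen would stop exactly where the original terminator was encoded.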

UTF-17 is self-synchronizing, since every byte sequence begins with
the same unique lead byte, and all trail bytes start with a
different bit sequence.
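
As a sketch of what resynchronizing might look like in practice, the
Python function below scans forward from an arbitrary offset to the next
lead byte. The names LEAD and next_boundary are assumptions made for
this example only.

```python
LEAD = 0x38  # 00111000; no trail byte ever has this value

def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first lead byte at or after pos, or len(data)."""
    idx = data.find(LEAD, pos)
    return len(data) if idx == -1 else idx
```

Dropped into the middle of the two-character stream b"8000010180000102",
say at offset 3, this returns 8, the start of the second character.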

UTF-17 is fixed-width, and so could be implemented as a wchar_t
processing code on new 64-bit systems (the wave of the future).

UTF-17 is highly patterned, and would be easy to auto-identify
for any charset converter.
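
A converter's sniffing routine could be as short as the sketch below. It
is only a heuristic written for illustration (looks_like_utf17 is a
made-up name), and it assumes the input is a whole number of 8-byte
units, including the special U+0000 form.

```python
def looks_like_utf17(data: bytes) -> bool:
    """Heuristically check whether data is a well-formed UTF-17 stream."""
    if not data or len(data) % 8 != 0:
        return False
    for i in range(0, len(data), 8):
        unit = data[i:i + 8]
        if unit == b"8000000\x00":  # the special form for U+0000
            continue
        if unit[0] != 0x38:  # every unit starts with the lead byte
            return False
        if any(not 0x30 <= b <= 0x37 for b in unit[1:]):
            return False
        if int(unit[1:], 8) > 0x10FFFF:  # seven octal digits can overshoot
            return False
    return True
```

Ordinary text fails almost immediately, which is rather the point.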

UTF-17 is easy to calculate, even for the hex-impaired, as only
eight bit combinations (000 through 111) need be remembered, and they
correspond directly to the second hex digit of the ASCII codes for
the digits 0..7 (0x30..0x37).

UTF-17 will interoperate easily with UTF-64.

UTF-17 is compatible with ASCII, as long as you avoid digits in
your ASCII text.

Since all UTF-17 bytes display as digits, it is programmer-friendly.
All UTF-17 values will display visibly and correctly
in any debugger, and the programmer need only recall that
"80051501" means U+5341, for instance, to get back to the
original Unicode character.
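
For programmers whose recall is imperfect, the digits can be turned back
into a code point mechanically: drop the leading "8" and read the rest
as octal. A small sketch in Python, with utf17_digits_to_code_point as
an illustrative name; it does not handle the special U+0000 form, whose
last byte is not a digit.

```python
def utf17_digits_to_code_point(digits: str) -> int:
    """Convert the 8-character debugger view of one UTF-17 unit to a code point."""
    if len(digits) != 8 or digits[0] != "8":
        raise ValueError("expected '8' followed by seven octal digits")
    if any(c not in "01234567" for c in digits[1:]):
        raise ValueError("trail characters must be octal digits")
    cp = int(digits[1:], 8)  # seven octal digits carry the 21 bits
    if cp > 0x10FFFF:
        raise ValueError("value is beyond U+10FFFF")
    return cp
```

For example, hex(utf17_digits_to_code_point("80000101")) gives '0x41'.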

It is true that UTF-17 takes up twice the space of UTF-32, but
with 64-bit machines and the continuing rapid progress in
the lowering of cost/megabyte of storage, this should not
really be a barrier to the rapid acceptance of UTF-17.
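
The arithmetic is easy to check. The sketch below (encoded_sizes is a
hypothetical helper) compares byte counts for a string, taking UTF-17 as
a flat 8 bytes per code point and measuring UTF-32 without a byte order
mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte count of text in UTF-8, UTF-32 (no BOM), and UTF-17."""
    n = len(text)  # Python 3 strings are sequences of code points
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-32": len(text.encode("utf-32-be")),
        "utf-17": 8 * n,  # 1 lead byte + 7 trail bytes per code point
    }
```

For the two-character string "\u5341\u4e03" this reports 6, 8, and 16
bytes respectively.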

--Ken


