UTF-c is motivated by the inefficiency of UTF-8 for most of the world’s alphabetic scripts, where each character typically takes two or three bytes. UTF-c allows one alphabet in addition to the Latin/ASCII alphabet to be encoded in one byte per character, provided its letters fall within a single 64-character Unicode block. For example, the Cyrillic alphabet can be encoded in one byte per character instead of the two bytes used by UTF-8, and the Indic alphabets in one byte per character instead of three. As an additional benefit, a two-byte sequence carries 12 bits rather than UTF-8’s 11, so twice as many characters can be encoded in two bytes, and a three-byte sequence carries 18 bits rather than 16, so four times as many characters can be encoded in three bytes.
A UTF-c file also begins with a four-byte prefix, which identifies the file as UTF-c text and encodes the page number of the selected alphabet. The prefix consists of four zero-width control bytes {FS, GS, RS, US}, so that existing browsers and text editors can still handle the file correctly, provided the appropriate code-page or font/script is selected.
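As a rough illustration of the prefix check, the sketch below tests whether a buffer starts with four bytes drawn from the ASCII separators FS (0x1C), GS (0x1D), RS (0x1E) and US (0x1F). The function name `looks_like_utfc` is hypothetical, and the sketch does not attempt to recover the page number, since the exact bit layout of the prefix is not described here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `data` starts with four bytes that
// are each one of the ASCII separators FS, GS, RS, US (0x1C..0x1F).
// Assumption: this is only a plausibility check; how the page number is
// packed into these four bytes is not decoded here.
bool looks_like_utfc(const std::string& data) {
    constexpr std::size_t kPrefixLen = 4;
    if (data.size() < kPrefixLen) return false;
    for (std::size_t i = 0; i < kPrefixLen; ++i) {
        unsigned char b = static_cast<unsigned char>(data[i]);
        if (b < 0x1C || b > 0x1F) return false;  // not FS, GS, RS or US
    }
    return true;
}
```

A caller might run this on the first bytes of a file and fall back to treating the file as plain ASCII or a legacy code-page when it returns false.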
Code point | Bits | Binary value | UTF-c bytes |
U+00..U+7f | 7 | 0xxxxxxx | 0xxxxxxx |
U+c0..U+ff | 6 | 11xxxxxx | 11xxxxxx |
U+80..U+bf, | 12 | U – 0x80 | 10yyyyxx 10xxxxxx |
U+1080 to | 18 | U – 0x1080 | 10zzyyyy 11yyyyxx 10xxxxxx |
U+041080 to | 21 | U – 0x41080 | 10ººººzz 11zzyyyy 11yyyyxx 10xxxxxx |
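The one- to three-byte rows of the table can be sketched in C++ roughly as follows. This is an illustration under stated assumptions, not the accompanying program: the names `utfc_encode` and `utfc_decode` are hypothetical, no alphabet page is assumed to be selected (so bytes 0xC0..0xFF stand for U+c0..U+ff exactly as in the second row), the two-byte form is assumed to cover every code point from U+80 to U+107f that the second row does not, and the four-byte form is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one code point using the 1- to 3-byte rows of the table.
// Assumption: no alphabet page is selected. Returns an empty string for
// code points that would need the four-byte form (not handled here).
std::string utfc_encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80 || (cp >= 0xC0 && cp <= 0xFF)) {
        out += static_cast<char>(cp);                        // 0xxxxxxx or 11xxxxxx
    } else if (cp < 0x1080) {
        std::uint32_t v = cp - 0x80;                         // 12-bit value
        out += static_cast<char>(0x80 | (v >> 6));           // 10yyyyxx
        out += static_cast<char>(0x80 | (v & 0x3F));         // 10xxxxxx
    } else if (cp < 0x41080) {
        std::uint32_t v = cp - 0x1080;                       // 18-bit value
        out += static_cast<char>(0x80 | (v >> 12));          // 10zzyyyy
        out += static_cast<char>(0xC0 | ((v >> 6) & 0x3F));  // 11yyyyxx
        out += static_cast<char>(0x80 | (v & 0x3F));         // 10xxxxxx
    }
    return out;
}

// Decode a single character of 1 to 3 bytes, as produced by utfc_encode.
std::uint32_t utfc_decode(const std::string& s) {
    assert(!s.empty() && s.size() <= 3);
    auto b = [&s](std::size_t i) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(s[i]));
    };
    if (s.size() == 1) return b(0);
    if (s.size() == 2) return (((b(0) & 0x3F) << 6) | (b(1) & 0x3F)) + 0x80;
    return (((b(0) & 0x3F) << 12) | ((b(1) & 0x3F) << 6) | (b(2) & 0x3F)) + 0x1080;
}
```

Under these assumptions, `utfc_encode(0x20AC)` (the euro sign) yields the three bytes 0x81 0xC0 0xAC, and `utfc_decode` maps them back to 0x20AC. Subtracting the offset before packing is what rules out over-long forms: each code point has exactly one length.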
Main Features:
· no null bytes except for the null character
· alphabetic scripts of common languages can be encoded in 1 byte per character
· backward-compatible with ASCII and other code-pages
· full Unicode character set, with no byte-order marks
· may be quickly scanned in forward and backward directions
· avoids over-long forms of characters
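The forward and backward scanning property can be illustrated with a short C++ sketch. It relies on a reading of the byte patterns in the table, which is an assumption here: every multi-byte sequence opens and closes with a 10xxxxxx byte, any bytes in between are 11xxxxxx, and a character that starts with 0xxxxxxx or 11xxxxxx is a single byte. The function names are hypothetical, and the input is assumed to be well formed with the file prefix already removed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, which (by assumption) open and
// close every multi-byte sequence.
static bool is_bracket(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Length in bytes of the character that starts at s[pos].
// Precondition: pos is a character boundary in well-formed text.
std::size_t utfc_next_len(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    if (!is_bracket(s[pos])) return 1;                  // 0xxxxxxx or 11xxxxxx
    std::size_t i = pos + 1;
    while (i < s.size() && !is_bracket(s[i])) ++i;      // skip 11xxxxxx middle bytes
    return (i < s.size() ? i + 1 : i) - pos;            // include the closing byte
}

// Length in bytes of the character that ends just before s[end].
// Precondition: end is a character boundary in well-formed text.
std::size_t utfc_prev_len(const std::string& s, std::size_t end) {
    assert(end > 0 && end <= s.size());
    if (!is_bracket(s[end - 1])) return 1;              // single-byte character
    std::size_t i = end - 1;                            // index of the closing byte
    while (i > 0 && !is_bracket(s[i - 1])) --i;         // skip 11xxxxxx middle bytes
    return end - (i > 0 ? i - 1 : 0);                   // include the opening byte
}
```

Because the same test is used in both directions, stepping backward from a character boundary costs no more than stepping forward, and neither direction needs a lookup table of sequence lengths.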
The accompanying C++ program converts between UTF-c and UTF-8 text files, and is provided as an example of how UTF-c files may be processed.