UTF-c

Preliminary Proposal for a new Unicode Text Format

The motivation for the UTF-c encoding is that UTF-8 is inefficient for most of the world’s alphabetic scripts, typically using two or three bytes per character. UTF-c allows one alphabet in addition to the Latin/ASCII alphabet to be encoded in one byte per character, provided its letters fall within a 64-character Unicode block. For example, the Cyrillic alphabet can be encoded in one byte per character instead of the two bytes used by UTF-8, and the Indic alphabets in one byte per character instead of three. As an additional benefit, a two-byte UTF-c sequence carries 12 payload bits against 11 in UTF-8, so twice as many characters can be encoded in two bytes; a three-byte sequence carries 18 payload bits against 16, so four times as many characters can be encoded in three bytes.

UTF-c also has a four-byte file prefix, which identifies the file as a UTF-c text file and encodes the page number of the selected alphabet. The file prefix consists of four zero-width control bytes {FS, GS, RS, US}, so that existing browsers and text editors can handle the files correctly, provided the appropriate code-page or font/script is selected.

Code point	Bits	Binary value	UTF-c bytes
U+00..U+7f	7	0xxxxxxx	0xxxxxxx
U+c0..U+ff (default)	6	11xxxxxx	11xxxxxx
U+80..U+bf, U+100..U+107f	12	U – 0x80 = 0000yyyy xxxxxxxx	10yyyyxx 10xxxxxx
U+1080..U+4107f	18	U – 0x1080 = 000000zz yyyyyyyy xxxxxxxx	10zzyyyy 11yyyyxx 10xxxxxx
U+41080..U+10ffff	21	U – 0x41080 = 0000zzzz yyyyyyyy xxxxxxxx	10ººººzz 11zzyyyy 11yyyyxx 10xxxxxx
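
The table can be read as the following minimal C++ sketch. It is an illustration of one reading of the table, not the accompanying converter: the names utfc_encode, utfc_decode and alphabet_base are hypothetical, the ºººº bits of the four-byte form are assumed to be zero (U – 0x41080 never exceeds 0xcef7f, which fits in 20 bits), the file prefix is not handled, and the decoder does not reject a multi-byte form of a character that has a one-byte form.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-c bytes for one code point to `out` (hypothetical helper).
// `alphabet_base` is the first code point of the 64-character block mapped to
// the single bytes 0xC0..0xFF; 0xC0 is the default row of the table.
// Assumption: alphabet_base >= 0x80, so the block never overlaps ASCII.
inline bool utfc_encode(char32_t cp, std::string& out, char32_t alphabet_base = 0xC0) {
    if (cp > 0x10FFFF) return false;
    auto put = [&out](std::uint32_t b) { out += static_cast<char>(b); };
    if (cp < 0x80) { put(cp); return true; }                  // 0xxxxxxx
    if (cp >= alphabet_base && cp < alphabet_base + 64) {     // 11xxxxxx
        put(0xC0 | (cp - alphabet_base));
        return true;
    }
    if (cp < 0x1080) {                                        // 12 bits
        std::uint32_t v = cp - 0x80;
        put(0x80 | (v >> 6));                                 // 10yyyyxx
        put(0x80 | (v & 0x3F));                               // 10xxxxxx
    } else if (cp < 0x41080) {                                // 18 bits
        std::uint32_t v = cp - 0x1080;
        put(0x80 | (v >> 12));                                // 10zzyyyy
        put(0xC0 | ((v >> 6) & 0x3F));                        // 11yyyyxx
        put(0x80 | (v & 0x3F));                               // 10xxxxxx
    } else {                                                  // fits in 20 bits
        std::uint32_t v = cp - 0x41080;
        put(0x80 | (v >> 18));                                // 10ººººzz, ºººº assumed 0
        put(0xC0 | ((v >> 12) & 0x3F));                       // 11zzyyyy
        put(0xC0 | ((v >> 6) & 0x3F));                        // 11yyyyxx
        put(0x80 | (v & 0x3F));                               // 10xxxxxx
    }
    return true;
}

// Decode one code point starting at s[i] and advance i past it.
// Returns false on truncated or malformed input.
inline bool utfc_decode(const std::string& s, std::size_t& i, char32_t& cp,
                        char32_t alphabet_base = 0xC0) {
    if (i >= s.size()) return false;
    std::uint8_t b = static_cast<std::uint8_t>(s[i++]);
    if (b < 0x80) { cp = b; return true; }                    // ASCII
    if (b >= 0xC0) { cp = alphabet_base + (b & 0x3F); return true; }
    // b is 10xxxxxx: it opens a sequence that runs through any 11xxxxxx
    // middle bytes and is closed by the next 10xxxxxx byte.
    std::uint32_t v = b & 0x3F;
    int n = 1;
    for (;;) {
        if (i >= s.size() || n >= 4) return false;
        std::uint8_t c = static_cast<std::uint8_t>(s[i++]);
        if (c < 0x80) return false;
        v = (v << 6) | (c & 0x3F);
        ++n;
        if (c < 0xC0) break;                                  // closing byte
    }
    static const std::uint32_t offset[5] = {0, 0, 0x80, 0x1080, 0x41080};
    cp = v + offset[n];
    return cp <= 0x10FFFF;
}
```

Under these assumptions, U+0410 encodes as the two bytes 8e 90 with the default alphabet, and as the single byte d0 when alphabet_base is set to 0x0400.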

Main Features:

· no null bytes except for the null character

· alphabetic scripts of common languages can be encoded in 1 byte per character

· backward-compatible with ASCII and other code-pages

· full Unicode character set, but with no byte-order marks

· may be quickly scanned in forward and backward directions

· avoids over-long forms of characters
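
As an illustration of the backward-scanning feature, the sketch below steps back by one character. It rests on one reading of the byte patterns in the table: a multi-byte sequence opens with a 10xxxxxx byte, may contain 11xxxxxx middle bytes, and closes with a 10xxxxxx byte. The name utfc_prev is hypothetical, the input is assumed to be well formed, and the index passed in is assumed to be a known character boundary, such as the end of the text.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Given a character boundary i (0 < i <= s.size()) in well-formed UTF-c text,
// return the index at which the previous character starts (hypothetical helper).
inline std::size_t utfc_prev(const std::string& s, std::size_t i) {
    // Top two bits of a byte: 0 or 1 = ASCII, 2 = 10xxxxxx, 3 = 11xxxxxx.
    auto kind = [&s](std::size_t k) {
        return static_cast<std::uint8_t>(s[k]) >> 6;
    };
    --i;
    if (kind(i) != 2) return i;    // ASCII or one-byte alphabet character
    // s[i] closes a multi-byte sequence: walk back over any 11xxxxxx middle
    // bytes until the opening 10xxxxxx byte is reached.
    while (i > 0) {
        --i;
        if (kind(i) != 3) break;
    }
    return i;
}
```

Starting from the end of the text and calling utfc_prev repeatedly visits every character boundary in reverse order, inspecting each byte once.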

The accompanying C++ program converts between UTF-c and UTF-8 text files, and is provided here to give an example of how UTF-c files may be processed.