Unicode Technical Note #6

BOCU-1:
MIME-Compatible Unicode Compression

Version	1
Authors	Markus W. Scherer, Mark Davis
Date	2002-08-09
This Version	http://www.unicode.org/notes/tn6/tn6-1.html
Previous Version	[none]
Latest Version	http://www.unicode.org/notes/tn6

Summary

This document describes a Unicode compression scheme that is MIME-compatible (directly usable for emails) and preserves binary order (for databases and sorted lists). BOCU-1 is a MIME-compatible application of the Binary Ordered Compression for Unicode [BOCU] base algorithm.

Status

This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. For general information on Unicode Technical Notes, see http://www.unicode.org/notes.

Introduction
Features of BOCU-1
Signature Byte Sequence
Encoding Algorithm
Sample C Code
BOCU-1 Performance
1. SCSU vs. BOCU-1
2. Test Setup
References
Modifications

1 Introduction

The most popular encoding for Unicode text data exchange is UTF-8. It is relatively simple and widely applicable: MIME/email/HTML/XML, in-process use in some systems, etc. However, UTF-8 uses more bytes per code point for non-Latin scripts than language-specific codepages.

In some markets where scripts other than Latin are used, the high bytes/code point ratio of UTF-8 (and of UTF-16 for scripts other than Latin and CJK ideographs) has been criticized and used as a motivation to not use Unicode for the exchange of documents in some languages. Larger document sizes are also a problem in low-bandwidth environments such as wireless networks for handheld systems.

The Standard Compression Scheme for Unicode [SCSU] was created as a Unicode compression scheme with a byte/code point ratio similar to language-specific codepages. It has not been widely adopted although it fulfills the criteria for an IANA charset and is registered as one. [SCSU] is not suitable for MIME “text” media types, i.e., it cannot be used directly in emails and similar protocols. [SCSU] requires a complicated encoder design for good performance.

BOCU-1 combines the wide applicability of UTF-8 with the compactness of [SCSU]. It is useful for short strings and maintains code point order.

2 Features of BOCU-1

2.1 Compression

The byte/code point ratio is 1 for runs of code points from the same block of 0x80 code points (and for Hiragana), and 2 for runs of CJK Unihan code points, as with [SCSU]. The startup overhead is very low (similar to [SCSU]), which makes it useful for very short strings like individual names. The maximum number of bytes per code point is four.

With BOCU-1, texts in languages with Latin-based scripts take about the same amount of space as with UTF-8, while texts in all other languages take about 25%-60% less space. Compared to UTF-16, texts in all languages with small character repertoires take approximately half as much space in BOCU-1. For large character sets, i.e., Chinese/Japanese/Korean, texts are about the same size for UTF-16 and BOCU-1.

2.2 Binary Order

The lexical order of BOCU-1 bytes is the same as the code point order of the original text — like UTF-8 but unlike [SCSU] — which allows the compression of large, sorted lists of strings. This makes BOCU-1 suitable for use in databases to reduce storage requirements as described above.

2.3 MIME Compatibility

The C0 control codes NUL, CR and LF and nine others are encoded with the same byte values as in US-ASCII, and those byte values are used only for the encoding of these twelve C0 control codes. This makes BOCU-1 suitable for MIME “text” media types, directly usable in emails and generally “friendly” for ASCII-based tools. The SUB control and its byte value 1A is included in this set to avoid problems in DOS/Windows/OS/2 systems where 1A is interpreted as an end-of-file marker in text mode.

2.4 Details

BOCU-1 is byte-oriented and stateful. Its inter-character state consists of a single integer. It is deterministic, i.e., the same complete input text is always encoded the same way by all encoders (unlike with [SCSU]). Like [SCSU] it can be classified as a Character Encoding Scheme (CES) or as a Transfer Encoding Syntax (TES). It is a “charset” according to IANA, and it is suitable for MIME “text” media types.

The state is reset at each C0 control (U+0000..U+001F, includes CR, LF, TAB). CRLF-separated lines do not affect each other’s encoding. Together with BOCU-1 being deterministic, this allows line-based file comparisons (diff) and makes BOCU-1 usable with RCS, CVS and other source-code control systems (unlike [SCSU]). This also allows some limited random access.

Byte values for single-byte codes and lead bytes overlap with trail bytes. So unlike UTF-8, character boundaries cannot be determined in random access, except by backing up to a reset point. Byte values 7F..9F (DEL and C1 control codes) are used as lead and trail bytes. US-ASCII characters (code points U+0021..U+007E) are not encoded with the same bytes as in US-ASCII. Therefore, the charset must be specified with a signature byte sequence or in a higher-level protocol.

As a single/lead byte, byte value FF is used as a special “reset-state-only” command. It does not encode any code point. FF is also a normal trail byte. Having a “reset only” command allows simple concatenation of BOCU-1 byte streams. (All other BOCU-1 byte sequences would append some code point.) Using FF to reset the state breaks the ordering! The use of FF resets is discouraged. Byte stream concatenation without resetting with FF requires to scan back to a C0 control whose byte value is not used for trail bytes (last known reset to initial state); then decode to the end of the first stream and encode the first non-U+0020 code point of the second stream according to the current state; then append the rest of the second stream. The same procedure could be used to remove an FF reset command.

3 Signature Byte Sequence

An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.

4 Encoding Algorithm

The basic algorithm is as described in Binary Ordered Compression for Unicode [BOCU].

BOCU-1 differs from the generic algorithm by using a different set of byte value ranges and by encoding U+0000..U+0020 directly with byte values 00..20. In addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of each word.

Partial pseudo-code for a per-code point encoding function is as follows:

encode(int &prev, int c) {
    if(c<=0x20) {
        output (byte)c;
        if(c!=0x20) {
            prev=0x40;
        }
    } else {
        int diff=c-prev;
        // encode diff in 1..4 bytes and output them

        // adjust prev
        if(c is Hiragana) {
            prev=middle of Hiragana;
        } else if(c is CJK Unihan) {
            prev=middle of CJK Unihan;
        } else if(c is Hangul) {
            prev=middle of Hangul;
        } else {
            prev=(c&~0x7f)+0x40;
        }
    }
}

5 Sample C Code

The sample C code serves as the full specification of BOCU-1. Every conformant encoder and decoder must generate equivalent output and detect any illegal input code points and illegal input byte sequences. Recovery from illegal input is not specified. Single surrogates are encoded if present in the input (e.g., unmatched single surrogate code units in UTF-16). Proper input of supplementary code points (e.g., matched surrogate pairs in UTF-16) must be encoded by code points.

This code uses International Components for Unicode [ICU] standard headers and the one implementation file icu/source/common/utf_impl.c. (It is not necessary to link the entire [ICU] common library.) This is for convenience in the surrounding test functions and not necessary for the core BOCU-1 functions. These headers and implementation file provide the following:

Include inttypes.h or define its types.
Define UChar for UTF-16 as an unsigned 16-bit type (wchar_t or uint16_t).
Define UTF* macros to handle reading and writing of in-process UTF-8/16 strings.

This code is under the X license (ICU version) [ICULicense].

Files:

bocu1.h (constants and macros)
bocu1.c (encoder and decoder functions)
bocu1tst.c (test code with main() function, see below)

A complete, compiled sample executable for Windows® from this source code is available for download. Aside from basic implementation and consistency tests, this also provides file conversion between UTF-8 and BOCU-1. Use a command-line argument of “?” or “-h” for usage.

6 BOCU-1 Performance

This is a performance comparison between BOCU-1, UTF-8, and [SCSU] when implemented as [ICU] converters, which convert to and from UTF-16. The time is for roundtrip conversions from the internal UTF-16 form to the encoding and back. Values are relative to UTF-8.

	BOCU-1		SCSU
Languages	Size of text	Time to convert to/from UTF-16	Size of text	Time to convert to/from UTF-16
English/French	100%	160..170%	100%	125%
Greek/Russian/Arabic/Hebrew	60%	65..70%	55%	70%
Hindi	40%	45%	40%	45%
Thai (see below)	45%	60%	40%	55%
Japanese	60%	150%	55%	110%
Korean	75%	155%	85%	70%
Chinese	75%	165%	70%	65%

(Smaller percentages are better. Percentages are rounded to the nearest 5%.)

The compression ratio is smaller for web pages (lots of ASCII in HTML). The performance difference tends to be smaller for smaller buffers. When the text is transmitted between machines (emails, web pages), then the transmission time may swamp the conversion time. Smaller text will then transmit faster.

6.1 SCSU vs. BOCU-1

[SCSU] and BOCU-1 have similar compression characteristics.
[SCSU] tends to be faster than BOCU-1.
BOCU-1 is applicable in many more environments (it is MIME-friendly, CVS-friendly, and has the same binary order as UTF-8).
The optimized [ICU] converter for BOCU-1 is somewhat less complex than the one for [SCSU], and the output is deterministic, unlike with [SCSU].

6.2 Test Setup

The texts are the “What is Unicode” pages from www.unicode.org, except for Thai. Note that english.html contains non-ASCII characters in the index sidebar. The Thai text, th18057.txt, has a different structure: It is a Thai word list from [ICU]’s test data, with one Thai word on each line. Compared to the other texts, it contains only a few characters between CRLF.

This comparison uses full-fledged [ICU] converters for UTF-8, [SCSU] and BOCU-1. “Full-fledged [ICU] converter” means that this is with the [ICU] conversion API, designed for external encodings, as used e.g. by an XML parser or web browser.

The [ICU] converter code for [SCSU] and BOCU-1 that is tested here is part of [ICU] 2.2. The [SCSU] converter was optimized slightly (conversion function variants without offset handling). The BOCU-1 converter is optimized compared to the reference code in the design document. The test machine is a Pentium 2 laptop.

7 References

[BOCU]	Binary Ordered Compression for Unicode http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html Describes the base algorithm for BOCU-1.
[SCSU]	Unicode Technical Standard #6: A Standard Compression Scheme for Unicode http://www.unicode.org/reports/tr6/
[ICULicense]	The X license, modified for ICU http://oss.software.ibm.com/cvs/icu/~checkout~/icu/license.html
[ICU]	International Components for Unicode http://oss.software.ibm.com/icu/ Open-source C/C++ and Java libraries for Unicode and I18N support.

Modifications

The following summarizes modifications from the previous version of this document.

1	Initial version

Copyright © 2002 Markus W. Scherer, Mark Davis. All Rights Reserved. The Unicode Consortium and [authors] make no expressed or implied warranty of any kind, and assume no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical note. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.