Technical Reports

| Version | 2 |
|---|---|
| Authors | Markus W. Scherer, Mark Davis |
| Date | 2004-01-01 |
| This Version | http://www.unicode.org/reports/tr99/tr99-2.html (http://www.mindspring.com/~markus.scherer/unicode/tr-bocu/tr-bocu-20040123.html) |
| Previous Version | http://www.unicode.org/notes/tn6/tn6-1.html |
| Latest Version | http://www.unicode.org/reports/tr99 |
| Base Unicode Version | Unicode 2.0.0 |
This document describes a Unicode compression scheme that is MIME-compatible (directly usable for emails) and preserves binary order (for databases and sorted lists). BOCU-1 is a MIME-compatible application of the Binary Ordered Compression for Unicode [BOCU] base algorithm.
This document is a proposed draft Unicode Technical Standard (from Unicode Technical Note #6). Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. Each UTS specifies a base version of the Unicode Standard. Conformance to the UTS requires conformance to that version or higher.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
The most popular encoding for Unicode text data exchange is UTF-8. It is relatively simple and widely applicable: MIME/email/HTML/XML, in-process use in some systems, etc. However, UTF-8 uses more bytes per code point for non-Latin scripts than language-specific codepages.
In some markets where scripts other than Latin are used, the high bytes-per-code-point ratio of UTF-8 (and of UTF-16 for scripts other than Latin, CJK ideographs, and Korean Hangul) has been criticized and used as a motivation not to use Unicode for the exchange of documents in some languages. Larger document sizes are also a problem in low-bandwidth environments such as wireless networks for handheld systems.
The Standard Compression Scheme for Unicode [SCSU] was created as a Unicode compression scheme with a byte/code point ratio similar to language-specific codepages. It has not been widely adopted although it fulfills the criteria for an IANA charset and is registered as one. [SCSU] is not suitable for MIME “text” media types, i.e., it cannot be used directly in emails and similar protocols. [SCSU] requires a relatively complicated encoder design for good performance (see there for details about small vs. full encoders).
BOCU-1 combines the wide applicability of UTF-8 with the compactness of [SCSU]. It is useful for short strings and maintains code point order.
Binary Ordered Compression for Unicode [BOCU] employs a kind of difference encoding tailored to Unicode text. To compress a sequence of code points, the encoder subtracts the previous code point from the current one, producing a signed delta value that can range from -0x10FFFF to +0x10FFFF. The delta is then encoded in a series of bytes: small differences are encoded in a small number of bytes, larger differences in a successively larger number of bytes.
Compression is achieved because normal Unicode text contains sequences of code points from the same script block. Small alphabets are allocated in blocks of 128 code points, so that the difference between two code points from such a block can be encoded with a single byte. Large script blocks like CJK Unihan are covered with 2-byte-encoded differences.
In order to minimize the absolute values of differences and thus improve the compression, the previous code point is adjusted to the middle of a script block. This covers a 128-block with differences of -64..+63 around the center code point, instead of requiring differences of -127..+127 without the adjustment. A special range check for the larger CJK Unihan block ensures that it is covered with differences of at most about ±0x2911 (10513), which can be encoded with 2 bytes each.
BOCU preserves binary code point order by allocating lead bytes for encoded differences such that increasing lead byte values correspond to increasing difference values. Because signed differences are encoded, the center of the lead byte range is used for a zero difference, and lead bytes for positive and negative differences are symmetrically arranged around it, with the lead byte values farther from the middle for larger absolute differences.
BOCU-1 differs from the generic algorithm mainly by using a different set of byte value ranges to achieve MIME compatibility. It encodes U+0000..U+0020 directly with byte values 00..20 and does not use the ASCII byte values for common C0 control codes (especially CR and LF) for anything else but encoding those control codes. In addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of each word.
As a result, only the code points above U+0020 are encoded with multi-byte sequences for their differences, and the lead byte range for these sequences starts at 0x21. Some byte values below 0x20 corresponding to rarely-used C0 control codes are also used as trail bytes in multi-byte sequences.
BOCU-1 uses FF as a special reset byte; see 2.4 Details.
The byte/code point ratio is 1 for runs of code points from the same block of 0x80 code points (and for Hiragana), and 2 for runs of CJK Unihan code points, as with [SCSU]. All characters up to U+2DD4B, which includes almost all assigned Unicode characters, will always be encoded with at most 3 bytes each. The startup overhead is very low (similar to [SCSU]), which makes it useful for very short strings like individual names. The maximum number of bytes per code point is four.
With BOCU-1, texts in languages with Latin-based scripts take about the same amount of space as with UTF-8, while texts in all other languages take about 25%-60% less space. Compared to UTF-16, texts in all languages with small character repertoires take approximately half as much space in BOCU-1. For large character sets, i.e., Chinese/Japanese/Korean, texts are about the same size for UTF-16 and BOCU-1.
The lexical order of BOCU-1 bytes is the same as the code point order of the original text — like UTF-8 but unlike [SCSU] — which allows the compression of large, sorted lists of strings. This makes BOCU-1 suitable for use in databases to reduce storage requirements as described above.
The C0 control codes NUL, CR and LF and nine others are encoded with the same byte values as in US-ASCII, and those byte values are used only for the encoding of these twelve C0 control codes. This makes BOCU-1 suitable for MIME “text” media types, directly usable in emails and generally “friendly” for ASCII-based tools. The SUB control and its byte value 1A is included in this set to avoid problems in DOS/Windows/OS/2 systems where 1A is interpreted as an end-of-file marker in text mode.
BOCU-1 is byte-oriented and stateful. Its inter-character state consists of a single integer. It is deterministic, i.e., the same complete input text is always encoded the same way by all encoders (unlike with [SCSU]). Like [SCSU] it can be classified as a Character Encoding Scheme (CES) or as a Transfer Encoding Syntax (TES). It is a “charset” according to IANA, and it is suitable for MIME “text” media types.
The state is reset at each C0 control (U+0000..U+001F, includes CR, LF, TAB). CRLF-separated lines do not affect each other’s encoding. Together with BOCU-1 being deterministic, this allows line-based file comparisons (diff) and makes BOCU-1 usable with RCS, CVS and other source-code control systems (unlike [SCSU]). This also allows some limited random access.
Byte values for single-byte codes and lead bytes overlap with trail bytes. So unlike UTF-8, character boundaries cannot be determined in random access, except by backing up to a reset point. Byte values 7F..9F (DEL and C1 control codes) are used as lead and trail bytes. US-ASCII characters (code points U+0021..U+007E) are not encoded with the same bytes as in US-ASCII. Therefore, the charset must be specified with a signature byte sequence or in a higher-level protocol.
As a single/lead byte, byte value FF is used as a special “reset-state-only” command; it does not encode any code point. (FF is also a normal trail byte.) Having a “reset only” command allows simple concatenation of BOCU-1 byte streams, since every other BOCU-1 byte sequence would append some code point. However, using FF to reset the state breaks both the ordering and the deterministic encoding, so the use of FF resets is discouraged. Concatenating byte streams without an FF reset requires scanning back to a C0 control whose byte value is not used for trail bytes (the last known reset to the initial state), decoding to the end of the first stream, encoding the first non-U+0020 code point of the second stream according to the current state, and then appending the rest of the second stream. The same procedure can be used to remove an FF reset command.
An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
The basic algorithm is as described in Binary Ordered Compression for Unicode [BOCU]. BOCU-1 differs from the generic algorithm by using a different set of byte value ranges and by encoding U+0000..U+0020 directly with byte values 00..20. In addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of each word.
BOCU-1 encodes exactly the Unicode code points U+0000..U+10FFFF. Single surrogates are encoded if present in the input (e.g., unmatched single surrogate code units in 16-bit Unicode strings). Proper input of supplementary code points (e.g., matched surrogate pairs in UTF-16) must be encoded by code points.
Partial pseudo-code for a per-code point encoding function is as follows:
```c
// inter-code point state variable to be passed into the encode() function:
// previous code point, adjusted according to the code below
static int prev=0x40; // initial setting in the middle of ASCII

// pass the "adjusted previous code point" state variable
// and the next Unicode code point;
// outputs the bytes for the code point and sets prev accordingly
encode(int &prev, int c) {
    if(c<=0x20) {
        // space or C0 control
        output((byte)c); // output the character in its ASCII encoding
        if(c!=0x20) {
            // do not change the state for a space
            prev=0x40;
        }
    } else {
        int diff=c-prev;
        // encode diff in 1..4 bytes and output them

        // adjust prev:
        // usually, set it to the middle of a 128-block of code points,
        // which works well with the assignment pattern of scripts in Unicode;
        // use custom adjustments for Hiragana, which is not 128-aligned,
        // and for CJK Unihan and Hangul syllables, which have much larger blocks
        if(0x3040<=c && c<=0x309f) {
            // c is Hiragana
            prev=0x3070; // middle of Hiragana
        } else if(0x4e00<=c && c<=0x9fa5) {
            // c is CJK Unihan
            prev=0x7711; // middle of CJK Unihan
                         // (0x4e00 + half of the reach of a 2-byte difference)
        } else if(0xac00<=c && c<=0xd7a3) {
            // c is Hangul
            prev=0xc1d1; // middle of Hangul = (0xd7a3+0xac00)/2
        } else {
            prev=(c&~0x7f)+0x40; // middle of a 128-block
        }
    }
}
```
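The state-adjustment rule in the pseudo-code above can be made concrete as a small self-contained C function. This is an illustrative sketch, not part of the normative sample code; `adjust_prev` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the adjusted "previous code point" state for a just-encoded
   code point c, as in the pseudo-code above: the middle of its 128-block,
   with special cases for Hiragana, CJK Unihan and Hangul syllables. */
static int32_t adjust_prev(int32_t c) {
    if (c >= 0x3040 && c <= 0x309F) return 0x3070; /* middle of Hiragana */
    if (c >= 0x4E00 && c <= 0x9FA5) return 0x7711; /* middle of CJK Unihan */
    if (c >= 0xAC00 && c <= 0xD7A3) return 0xC1D1; /* middle of Hangul */
    return (c & ~0x7F) + 0x40;                     /* middle of a 128-block */
}
```

For example, after encoding CYRILLIC CAPITAL LETTER ZHE (U+0416) the state becomes U+0440, the middle of the 0x0400..0x047F block, so following Cyrillic letters encode with single bytes.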
Any code point above U+0020 (space) is encoded with a byte sequence representing the difference between the code point value and the prev state variable (see the diff variable in the pseudo-code above). The sequence is 1 to 4 bytes long; it is longer for larger absolute values of the difference. Consecutive code points from the same 128-block are encoded with single bytes each.
Most byte values are used for both lead/single bytes and trail bytes, which allows large difference values to be encoded with few bytes but makes BOCU-1 unsuitable for text processing (as opposed to storage and data transfer). An exception is that BOCU-1-encoded text can be efficiently compared in Unicode code point order, because that order is preserved in the lexical order of BOCU-1 byte streams (as long as the FF reset byte is not used).
The following table shows the use of each byte value. All but 13 byte values are used as trail bytes and are assigned “digit” values in ascending order.
| Byte value | Lead/single byte | Trail byte “digit” values |
|---|---|---|
| FF | reset prev to 0x40, does not encode any code point | F2 (242) |
| FE | lead byte for 4-byte sequences for positive differences | F1 (241) |
| FB..FD | lead bytes for 3-byte sequences for positive differences | EE..F0 (238..240) |
| D0..FA | lead bytes for 2-byte sequences for positive differences | C3..ED (195..237) |
| 50..CF | single bytes for differences -0x40..+0x3F | 43..C2 (67..194) |
| 25..4F | lead bytes for 2-byte sequences for negative differences | 18..42 (24..66) |
| 22..24 | lead bytes for 3-byte sequences for negative differences | 15..17 (21..23) |
| 21 | lead byte for 4-byte sequences for negative differences | 14 (20) |
| 20 | single byte for space (U+0020), does not change the state | - |
| 1C..1F | single bytes for U+001C..U+001F | 10..13 (16..19) |
| 1A..1B | single bytes for U+001A..U+001B | - |
| 10..19 | single bytes for U+0010..U+0019 | 06..0F (6..15) |
| 07..0F | single bytes for U+0007..U+000F | - |
| 01..06 | single bytes for U+0001..U+0006 | 00..05 (0..5) |
| 00 | single byte for U+0000 | - |

For byte values 21..FF, the trail byte “digit” value is the byte value minus 0D (13), i.e., 14..F2 (20..242).
Differences of -0x40..+0x3F (-64..+63) are encoded with single bytes 0x50..0xCF by adding 0x90.
Differences outside of this range are encoded in a way similar to decimal or hexadecimal numbers, but with a number base of 243 (=256-13) instead of 10 or 16. Lead bytes (“lead digits”) are chosen such that byte sequences for lower difference values sort lexically before those for higher difference values.
Encode them with the following steps:

1. Determine the number of trail bytes and the difference offset from the range of the difference value (diff) according to the table below, and subtract the offset from diff.
2. Calculate the least significant trail byte “digit” as the remainder (t) of dividing diff by 243 (the number of possible trail byte values): quotient=diff/243 and t=diff%243 (modulo, division remainder).
3. Because diff can be negative, the result must be adjusted so that 0<=t<243 while diff=(quotient*243)+t still holds; adjust if necessary by adding 243 to t and subtracting 1 from the quotient.
4. Encode t with a trail byte according to table 1 above. Then set diff=quotient and repeat from step 2 until all trail bytes have been produced, from the least significant to the most significant.
5. Determine the lead byte by adding the final diff value (the last quotient) from step 4 to the base lead byte value for the difference range.

The following table shows the boundary values of each difference range together with their final encoded bytes, the number of trail bytes, the base lead byte value, and the difference offset:

| Diff. value, hexadecimal | Decimal | Final bytes | Number of trail bytes | Base lead byte value | Diff. offset, hexadecimal | Decimal |
|---|---|---|---|---|---|---|
10FFFF | 1114111 | FE 19 B4 94 | 3 | FE | 2DD0C | 187660 |
2DD0C | 187660 | FE 01 01 01 | ||||
2DD0B | 187659 | FD FF FF | 2 | FB | 2911 | 10513 |
2911 | 10513 | FB 01 01 | ||||
2910 | 10512 | FA FF | 1 | D0 | 40 | 64 |
40 | 64 | D0 01 | ||||
3F | 63 | CF | 0 | 90 | 0 | 0 |
1 | 1 | 91 | ||||
0 | 0 | 90 | ||||
-1 | -1 | 8F | ||||
-40 | -64 | 50 | ||||
-41 | -65 | 4F FF | 1 | 50 | -40 | -64 |
-2911 | -10513 | 25 01 | ||||
-2912 | -10514 | 24 FF FF | 2 | 25 | -2911 | -10513 |
-2DD0C | -187660 | 22 01 01 | ||||
-2DD0D | -187661 | 21 FF FF FF | 3 | 22 | -2DD0C | -187660 |
-10FFFF | -1114111 | 21 F0 58 79 |
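The encoding steps above can be sketched in self-contained C. This is an illustrative sketch, not the normative sample code; `pack_diff` and `trail_byte` are hypothetical helper names, and the ranges, base lead bytes, and offsets are taken from the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Map a trail "digit" 0..242 to its trail byte value (table 1). */
static uint8_t trail_byte(int32_t d) {
    if (d < 6)  return (uint8_t)(d + 0x01); /* digits 0..5    -> bytes 01..06 */
    if (d < 16) return (uint8_t)(d + 0x0A); /* digits 6..15   -> bytes 10..19 */
    if (d < 20) return (uint8_t)(d + 0x0C); /* digits 16..19  -> bytes 1C..1F */
    return (uint8_t)(d + 0x0D);             /* digits 20..242 -> bytes 21..FF */
}

/* Encode a difference value into out[]; returns the number of bytes (1..4). */
static int pack_diff(int32_t diff, uint8_t out[4]) {
    int32_t base, count;
    if (-0x40 <= diff && diff <= 0x3F) {        /* single-byte difference */
        out[0] = (uint8_t)(diff + 0x90);
        return 1;
    }
    if (diff > 0) {                             /* positive ranges */
        if (diff <= 0x2910)        { base = 0xD0; diff -= 0x40;    count = 1; }
        else if (diff <= 0x2DD0B)  { base = 0xFB; diff -= 0x2911;  count = 2; }
        else                       { base = 0xFE; diff -= 0x2DD0C; count = 3; }
    } else {                                    /* negative ranges */
        if (diff >= -0x2911)       { base = 0x50; diff += 0x40;    count = 1; }
        else if (diff >= -0x2DD0C) { base = 0x25; diff += 0x2911;  count = 2; }
        else                       { base = 0x22; diff += 0x2DD0C; count = 3; }
    }
    for (int32_t i = count; i >= 1; --i) {      /* trail bytes, least significant first */
        int32_t q = diff / 243, t = diff % 243;
        if (t < 0) { t += 243; q -= 1; }        /* keep 0 <= t < 243 */
        out[i] = trail_byte(t);
        diff = q;
    }
    out[0] = (uint8_t)(base + diff);            /* lead byte = base + last quotient */
    return count + 1;
}
```

The boundary rows of the table can be reproduced with this sketch; for example, a difference of 0x10FFFF yields the bytes FE 19 B4 94, and -0x41 yields 4F FF.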
A BOCU-1 byte stream is decoded (converted back to Unicode text) by reversing the encoding algorithm. Briefly:

- The same prev state variable is used, initialized (to 0x40) and adjusted just like in the encoder.
- A single byte in the range 00..20 is output directly as the code point with the same value; for byte values other than 20 (space), reset prev=0x40.
- A single byte in the range 50..CF directly encodes a small difference: compute c=prev+(byte-0x90), output c, and adjust prev.
- A lead byte in the ranges 21..4F (negative differences) or D0..FE (positive differences) determines the number of trail bytes. Collect the trail byte “digits”, compute the difference, add it to prev, and output the resulting Unicode code point. Adjust prev.
- The byte FF resets the state: prev=0x40. No code point is output.

While there is no illegal lead/single byte, there are illegal trail bytes (see table 1). Multi-byte sequences that result in code points outside of the Unicode range U+0000..U+10FFFF are illegal. Recovery from illegal input is not specified.
The sample C code serves as the full specification of BOCU-1. Every conformant encoder and decoder must generate equivalent output and detect any illegal input code points and illegal input byte sequences. Recovery from illegal input is not specified. Single surrogates are encoded if present in the input (e.g., unmatched single surrogate code units in UTF-16). Proper input of supplementary code points (e.g., matched surrogate pairs in UTF-16) must be encoded by code points.
This code uses International Components for Unicode [ICU] standard headers and the one implementation file icu/source/common/utf_impl.c. (It is not necessary to link the entire [ICU] common library.) This is for convenience in the surrounding test functions and is not necessary for the core BOCU-1 functions. These headers and implementation file provide the following:
This code is under the X license (ICU version) [ICULicense].
Files: the sample source files, including the one with the main() function (see below).

A complete, compiled sample executable for Windows® from this source code is available for download. Aside from basic implementation and consistency tests, it also provides file conversion between UTF-8 and BOCU-1. Use a command-line argument of “?” or “-h” for usage.
This is a performance comparison between BOCU-1, UTF-8, and [SCSU] when implemented as [ICU] converters, which convert to and from UTF-16. The time is for roundtrip conversions from the internal UTF-16 form to the encoding and back. Values are relative to UTF-8.
| Languages | BOCU-1 size of text | BOCU-1 time to convert to/from UTF-16 | SCSU size of text | SCSU time to convert to/from UTF-16 |
|---|---|---|---|---|
English/French | 100% | 160..170% | 100% | 125% |
Greek/Russian/Arabic/Hebrew | 60% | 65..70% | 55% | 70% |
Hindi | 40% | 45% | 40% | 45% |
Thai (see below) | 45% | 60% | 40% | 55% |
Japanese | 60% | 150% | 55% | 110% |
Korean | 75% | 155% | 85% | 70% |
Chinese | 75% | 165% | 70% | 65% |
(Smaller percentages are better. Percentages are rounded to the nearest 5%.)
The compression ratio is smaller for web pages (lots of ASCII in HTML). The performance difference tends to be smaller for smaller buffers. When the text is transmitted between machines (emails, web pages), then the transmission time may swamp the conversion time. Smaller text will then transmit faster.
The texts are the “What is Unicode” pages from www.unicode.org, except for Thai. Note that english.html contains non-ASCII characters in the index sidebar. The Thai text, th18057.txt, has a different structure: It is a Thai word list from [ICU]’s test data, with one Thai word on each line. Compared to the other texts, it contains only a few characters between CRLF.
This comparison uses full-fledged [ICU] converters for UTF-8, [SCSU] and BOCU-1. “Full-fledged [ICU] converter” means that this is with the [ICU] conversion API, designed for external encodings, as used e.g. by an XML parser or web browser.
The [ICU] converter code for [SCSU] and BOCU-1 that is tested here is part of [ICU] 2.2. The [SCSU] converter was optimized slightly (conversion function variants without offset handling). The BOCU-1 converter is optimized compared to the reference code in the design document. The test machine is a Pentium 2 laptop.
| [BOCU] | Binary Ordered Compression for Unicode http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html Describes the base algorithm for BOCU-1. |
| [ICU] | International Components for Unicode http://oss.software.ibm.com/icu/ Open-source C/C++ and Java libraries for Unicode and I18N support. |
| [ICULicense] | The X license, modified for ICU http://oss.software.ibm.com/cvs/icu/~checkout~/icu/license.html |
| [SCSU] | Unicode Technical Standard #6: A Standard Compression Scheme for Unicode http://www.unicode.org/reports/tr6/ |
| [Versions] | Versions of the Unicode Standard http://www.unicode.org/standard/versions/ For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports. |
Thanks to Doug Ewell for valuable feedback on this document.
The following summarizes modifications from the previous version of this document.
2 | Changed into a Unicode Technical Standard; added a formal description. The sample C code has minor fixes for 16-bit compilers. |
1 | Initial version |
Copyright © 2002-2004 Unicode, Inc., Markus W. Scherer, Mark Davis. All Rights Reserved. The Unicode Consortium and authors make no expressed or implied warranty of any kind, and assume no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.