From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Aug 14 2004 - 08:21:08 CDT
About UTS#6: SCSU (A Standard Compression Scheme for Unicode).
http://www.unicode.org/reports/tr6/tr6-3.5.html
I know that this is not part of the SCSU standard, but the reference section
10 about private extensions of SCSU seems to forget some other well-known
transport encoding syntaxes that allow transporting SCSU content within
streams where usage of control bytes (like the null byte) is restricted.
One well-known method is to apply a "COBS" encoding.
See reference and implementation details in
http://www.acm.org/sigcomm/sigcomm97/papers/p062.pdf
It is MUCH better than the proposed method in section 10.1 that uses "DLE
escaping", and the method is generic enough to allow escaping ANY byte value
(not only the 0x00 byte):
(1) When used with the default profile (which just avoids the null byte
value), COBS allows avoiding any occurrence of the null byte, with the worst
case producing no more than 1 additional byte for every 254 source bytes, and
no more than 1 additional byte for any random source stream.
(2) With an extended COBS profile, where N byte values need to be avoided
in the encoded stream, the worst case produces only 1 additional byte for
every (255-N) source bytes, and also no more than 1 additional byte for any
random source stream. So this can be used to restrict the output stream to
avoid ALL control bytes that are undesirable during transport, notably all
C0 control bytes used by SCSU as "tags" (i.e. bytes 0x00-0x1F except
CR=0x0D, LF=0x0A, TAB=0x09), or even all C1 control bytes (in 0x80-0x9F,
notably the NL character).
(3) A COBS profile that would avoid all C0&C1 control bytes except CR, LF
and TAB excludes N=61 byte values, so by the rule in (2) it would cost no
more than 1 additional byte for every 194 bytes of SCSU-encoded source bytes
(226 bytes if only the C0 controls are avoided): this worst case represents
about +0.5% of transported data size, still much better than the +100% you
get in the worst case with the transport syntaxes suggested in 10.1!
(4) COBS can be used as well to restrict the allowed bytes to the 7-bit
range, making SCSU plus a COBS transfer encoding syntax in this COBS profile
suitable for emails, and still much better than UTF-7 for Asian languages or
multilanguage documents that largely benefit from the SCSU compression.
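
To make the arithmetic behind points (1) to (3) concrete, here is a small
Python sketch. It only applies the (255-N) rule stated in (2); it does not
implement an extended COBS profile, and the names C0_AVOIDED, C1_AVOIDED,
max_run and worst_case_overhead are made up for this illustration.

```python
# Worst-case figures under the (255 - N) rule quoted in point (2).
# This is arithmetic only, not an encoder; all names are hypothetical.

# C0 controls except TAB (0x09), LF (0x0A) and CR (0x0D): 29 byte values.
C0_AVOIDED = set(range(0x00, 0x20)) - {0x09, 0x0A, 0x0D}
# C1 controls 0x80-0x9F: 32 byte values.
C1_AVOIDED = set(range(0x80, 0xA0))


def max_run(avoided):
    """Source bytes covered by one overhead byte when len(avoided) values are excluded."""
    return 255 - len(avoided)


def worst_case_overhead(avoided):
    """Worst-case expansion as a fraction of the source size."""
    return 1 / max_run(avoided)


for label, profile in [
    ("null byte only", {0x00}),
    ("C0 except TAB/LF/CR", C0_AVOIDED),
    ("C0 and C1 except TAB/LF/CR", C0_AVOIDED | C1_AVOIDED),
]:
    print(f"{label}: 1 byte per {max_run(profile)} source bytes "
          f"(+{worst_case_overhead(profile):.2%})")
```

The three profiles give runs of 254, 226 and 194 source bytes per overhead
byte, all far below the +100% worst case of an escaping scheme.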
A COBS profile can also handle the case of repeated byte values in the
SCSU compressed stream (case discussed in section 10.2 of UTS#6).
It also works much better than other well-known Transform Encoding Syntaxes
like Base64 or Quoted-Printable, often used for emails but that behave
poorly with Asian languages: these TES also have very poor worst cases (that
can completely break the compression benefits offered by SCSU).
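
As a rough illustration of those worst cases, the following Python sketch
measures Base64 and Quoted-Printable on a block of high bytes using the
standard base64 and quopri modules. The sample is only a stand-in I chose for
a compressed stream that is heavy in bytes above 0x7F; it is not real SCSU
output, and the COBS figure is just the bound for the default profile, not
the output of an encoder.

```python
import base64
import quopri

# Assumed stand-in for a compressed stream made mostly of bytes >= 0x80.
# This is NOT real SCSU output, only a convenient worst case for the TES.
sample = bytes(range(0x80, 0x100)) * 4  # 512 bytes

b64_size = len(base64.b64encode(sample))     # 4 output bytes per 3 input bytes
qp_size = len(quopri.encodestring(sample))   # "=XX" per high byte, plus soft line breaks

# Upper bound for default-profile COBS: one extra byte per 254 source
# bytes, plus one for the final block.
cobs_bound = len(sample) + len(sample) // 254 + 1

print("source           :", len(sample))
print("Base64           :", b64_size)
print("Quoted-Printable :", qp_size)
print("COBS upper bound :", cobs_bound)
```

On this input Base64 costs about a third more and Quoted-Printable roughly
triples the size, while the COBS bound stays within three bytes of the source.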
Implementing COBS is also very straightforward, with very little CPU
overhead (COBS just needs internal buffering of at most 254 bytes with the
default profile that avoids null byte values, which is very reasonable, and
easy to implement in low-cost hardware too).
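
For illustration, a minimal Python sketch of the default profile (null byte
avoided) could look like the following. It is my own sketch rather than the
code from the paper; the function names cobs_encode and cobs_decode are
hypothetical, and this simple variant always ends with one final code byte,
so its overhead is at most one byte per 254 source bytes plus one.

```python
def cobs_encode(data: bytes) -> bytes:
    """Encode data so that the result contains no 0x00 byte (default profile)."""
    out = bytearray()
    block = bytearray()  # pending non-zero bytes, never more than 254
    for b in data:
        if b == 0:
            # The code byte gives the distance to the zero it stands in for.
            out.append(len(block) + 1)
            out += block
            block.clear()
        else:
            block.append(b)
            if len(block) == 254:
                # Full block: code 0xFF means "254 data bytes, no implied zero".
                out.append(0xFF)
                out += block
                block.clear()
    # Final block (possibly empty), closed by its own code byte.
    out.append(len(block) + 1)
    out += block
    return bytes(out)


def cobs_decode(encoded: bytes) -> bytes:
    """Reverse cobs_encode; raises ValueError on malformed input."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        block = encoded[i + 1 : i + code]
        if code == 0 or len(block) != code - 1 or 0 in block:
            raise ValueError("malformed COBS data")
        out += block
        i += code
        # Every block except a full one (0xFF) and the last one implies a zero.
        if code != 0xFF and i < len(encoded):
            out.append(0)
    return bytes(out)


print(cobs_encode(b"\x11\x22\x00\x33").hex())  # 0311220233
```

The encoder only ever holds one block of at most 254 bytes, which is the
buffering mentioned above.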
Because of these properties, there's no need to modify the standard SCSU
algorithm: one just needs to apply COBS encoding directly to the output of
the SCSU compressor. COBS thus appears to be a better solution than what is
suggested in sections 10.1 and 10.2 of TR6...
Setting up COBS profiles is not necessary when implementing SCSU, so such
extensions are really not needed. I would suggest that TR6 remove section
10, and instead move it into an annex showing how a transport encoding
syntax can be used to solve the suggested problems:
The solutions presented in sections 10.1 and 10.2 are definitely not the best
ones if one needs a good compression of Unicode, because they have very bad
worst cases that double the size of the output stream.
Another option would be to add a section 10.3 referencing COBS as a better
transfer encoding syntax, and saying that the existing 10.1 and 10.2
solutions would be better modeled as simple transfer encoding syntaxes too,
completely out of scope of the SCSU UTF itself. SCSU really does not need
such extensions in its core, where they will produce interoperability
problems now that it is a Unicode Technical Standard, to be implemented
notably in XML or HTML parsers.