From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jan 20 2004 - 14:54:29 EST
From: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>
> Has anyone done any work on Unicode formats for this use-case? Does
> anyone have any references or ideas to share?
If you want something very simple that converts easily to and from UTF-8 and
UTF-16, why not use them directly? Require a leading BOM and encode the
string in whichever of UTF-8 or UTF-16 is shorter, dropping the BOM only if
the string contains only 7-bit ASCII. As UTF-16 must then start with a BOM,
U+FEFF, encoded as the leading bytes {0xFE,0xFF} or {0xFF,0xFE}, there will
never be any confusion between ASCII and UTF-16. Nor can ASCII be confused
with UTF-8 with a BOM (leading bytes {0xEF,0xBB,0xBF}), or UTF-8 with
UTF-16, since their BOMs are encoded differently.
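To illustrate why the three forms cannot be confused, here is a minimal Python sketch of a decoder that looks only at the leading bytes. The function name decode_tagged is made up for this example, and it assumes the input was produced under the rules above (plain ASCII never carries a BOM):

```python
import codecs

def decode_tagged(data: bytes) -> str:
    """Decode a string stored as plain ASCII, BOM + UTF-8, or BOM + UTF-16."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return data[3:].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return data[2:].decode("utf-16-le")
    # No BOM: by convention the string must be pure 7-bit ASCII.
    return data.decode("ascii")
```

An ASCII string always starts with a byte below 0x80 (or is empty), so none of the BOM tests can match it by accident.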
So you get the best of all worlds, without necessarily implementing a
complex compressor like BOCU-1 or SCSU: your final encoded strings will be
one of:
- 7-bit ASCII
- 8-bit UTF-8 starting with a forced leading BOM
- 16-bit UTF-16 starting with a forced leading BOM
The cost is only the size of the BOM when encoding anything other than 7-bit
ASCII: 3 bytes for UTF-8, 2 bytes for UTF-16. In all cases, the final
encoding will be the shortest of the 3 alternatives above. Deciding which
alternative to use can be done in a single pass in which you count the
number of bytes needed for UTF-8 and UTF-16 without the BOM, and note
whether there are any characters outside the 7-bit ASCII range (this allows
you to allocate the final buffer and perform the actual encoding once you
have determined the size of each approach).
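The single counting pass could look roughly like the Python sketch below. The names measure and encode_shortest are invented for this example. It assumes the string contains no unpaired surrogates, breaks ties in favour of UTF-8, and arbitrarily picks big-endian for UTF-16; none of those choices is dictated by the scheme itself.

```python
import codecs

def measure(s: str):
    """Return (utf8_len, utf16_len, is_ascii) in one pass, BOM not counted."""
    utf8_len = 0
    utf16_len = 0
    is_ascii = True
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            utf8_len += 1
            utf16_len += 2
            continue
        is_ascii = False
        if cp < 0x800:
            utf8_len += 2
            utf16_len += 2
        elif cp < 0x10000:
            utf8_len += 3
            utf16_len += 2
        else:
            # Supplementary planes: 4 bytes in UTF-8, a surrogate pair in UTF-16.
            utf8_len += 4
            utf16_len += 4
    return utf8_len, utf16_len, is_ascii

def encode_shortest(s: str) -> bytes:
    utf8_len, utf16_len, is_ascii = measure(s)
    if is_ascii:
        return s.encode("ascii")                 # no BOM needed
    if utf8_len + 3 <= utf16_len + 2:            # BOM sizes: 3 vs 2 bytes
        return codecs.BOM_UTF8 + s.encode("utf-8")
    return codecs.BOM_UTF16_BE + s.encode("utf-16-be")
```

For example, "café" needs 5 + 3 bytes as UTF-8 against 8 + 2 as UTF-16, so it is stored as UTF-8, while "日本語" needs 9 + 3 against 6 + 2 and is stored as UTF-16.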
Finally, nothing forbids using a compressor after this step (for example a
deflate compressor without the GZIP header, as implemented in zlib and
Java), if this helps. And since your string will start with either a leading
ASCII byte, a 3-byte UTF-8 encoded BOM, or a 2-byte UTF-16 encoded BOM, you
could also argue that the leading BOM may be removed and replaced by a
single non-ASCII byte. As there are 128 such byte values (128..255), the
leading byte can specify one of these meanings:
- 0..127: ASCII byte, which is itself part of a string coded with 7-bit
ASCII only
- 129: indicates an uncompressed UTF-8 string, coded after this byte without
the BOM
- 130: indicates an uncompressed UTF-16LE string, coded after this byte
without the BOM
- 131: indicates an uncompressed UTF-16BE string, coded after this byte
without the BOM
- 192: indicates a compressed string, coded after this byte as a deflated
stream of ASCII bytes
- 193: indicates a compressed string, coded after this byte as a deflated
stream of UTF-8 bytes without the leading BOM
- 194: indicates a compressed string, coded after this byte as a deflated
stream of UTF-16LE bytes without the leading BOM
- 195: indicates a compressed string, coded after this byte as a deflated
stream of UTF-16BE bytes without the leading BOM
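As a rough illustration of the lead-byte variant, here is a Python sketch using zlib's raw deflate mode (negative window bits, so no zlib or gzip header is written). The names pack and unpack are hypothetical, and the value 131 for uncompressed UTF-16BE is an assumption made here so that every tag is distinct.

```python
import zlib

# Lead-byte values taken from the list above; 131 is an assumed value.
PLAIN_TAGS = {129: "utf-8", 130: "utf-16-le", 131: "utf-16-be"}
DEFLATE_TAGS = {192: "ascii", 193: "utf-8", 194: "utf-16-le", 195: "utf-16-be"}

def raw_deflate(data: bytes) -> bytes:
    # wbits = -15 selects a raw deflate stream without any header.
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()

def pack(s: str, codec: str, compress: bool = False) -> bytes:
    payload = s.encode(codec)
    if compress:
        tag = next(t for t, c in DEFLATE_TAGS.items() if c == codec)
        return bytes([tag]) + raw_deflate(payload)
    if codec == "ascii":
        return payload              # untagged: first byte is already 0..127
    tag = next(t for t, c in PLAIN_TAGS.items() if c == codec)
    return bytes([tag]) + payload

def unpack(data: bytes) -> str:
    if not data or data[0] < 128:
        return data.decode("ascii")
    tag, body = data[0], data[1:]
    if tag in PLAIN_TAGS:
        return body.decode(PLAIN_TAGS[tag])
    if tag in DEFLATE_TAGS:
        return zlib.decompress(body, -15).decode(DEFLATE_TAGS[tag])
    raise ValueError(f"unknown lead byte {tag}")
```

A caller would typically try the candidate forms and keep the shortest result; for short strings the uncompressed forms usually win, since deflate has a few bytes of fixed overhead.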
You can create many variants of this for your internal storage...