From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 30 2003 - 18:43:12 EST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> As I have not determined the correct size of these bitfields, I need
> some intermediate solution to pack them a little, and the UTF-8 TES
> (not the UTF-8 CES used by Unicode)venient for now, until I change it
> to a better encoding, which may or may not leak out (I am not sure
> that I need to make the encoding accessible from an interface, except
> for debugging).
I hope I understand the "venient" passage correctly.
I'm pretty sure you mean "... the UTF-8 CES (not the UTF-8 CEF used by
Unicode)..." A CEF maps code points to code units, and you don't mean
that because you're not mapping Unicode code points.
A CES, on the other hand, maps code units to bytes, and that *is* what
you are doing with the code units in your internal mechanism: mapping
them to bytes using the original 31-bit definition of UTF-8.
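To make the "code units to bytes" step concrete, here is a minimal Python sketch of the original 31-bit UTF-8 bit layout (the six-byte form described in RFC 2279), applied to plain integers rather than Unicode code points. The function name encode_utf8_31bit is hypothetical and this is only an illustration, not the actual internal mechanism under discussion.

```python
def encode_utf8_31bit(value: int) -> bytes:
    """Serialize an integer in 0..0x7FFFFFFF with the original UTF-8 bit layout."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value must fit in 31 bits")
    if value < 0x80:
        # Values below 0x80 are stored as a single byte, unchanged.
        return bytes([value])
    # Each row: (exclusive upper limit, total byte count, lead-byte marker).
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if value < limit:
            break
    out = []
    # Emit trailing bytes from least significant to most: 10xxxxxx each.
    for _ in range(length - 1):
        out.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever bits remain go into the lead byte.
    out.append(lead | value)
    return bytes(reversed(out))


print(encode_utf8_31bit(0xE9).hex())        # two-byte form
print(encode_utf8_31bit(0x7FFFFFFF).hex())  # six-byte form, beyond U+10FFFF
```

Nothing in this function knows or cares whether its input is a Unicode code point, which is why it behaves as a byte-serialization layer here and not as a mapping from characters.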
A TES is a very specific thing. Apparently this term is reserved for
mappings that explicitly solve a particular problem, such as MIME
compatibility or compression. So quoted-printable is a good example of
a TES, because it makes an arbitrary text stream -- already encoded in
UTF-8, Windows code page 1252, or whatever -- transferable through
mechanisms that support RFC 822, avoiding all of the bytes that mean
something special. Likewise, Base64 is applied directly to an arbitrary
byte stream, which means the data was already encoded in a CES before
applying the additional Base64 layer.
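As a rough illustration of that layering, the following Python sketch uses the standard quopri and base64 modules. The sample text and the variable names are my own assumptions; the point is only that both transforms take bytes that some CES has already produced, whichever CES that happened to be.

```python
import base64
import quopri

text = "\u00bfQu\u00e9?"  # sample text, chosen only for illustration

results = {}
for charset in ("utf-8", "cp1252"):
    # Step 1: the character encoding turns text into bytes.
    raw = text.encode(charset)
    # Step 2: the transfer encoding is applied to those bytes, whatever they are.
    qp = quopri.encodestring(raw)
    b64 = base64.b64encode(raw)
    results[charset] = (raw, qp, b64)
    print(charset, raw.hex(), qp, b64)
```

Decoding reverses the layers in the opposite order: undo the QP or Base64 step to get the bytes back, then decode those bytes with the right charset.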
I've always had trouble with the assertion that SCSU (for example) is a
TES rather than a CES. Certainly it solves a particular problem
(compression) and avoids, to an extent, gratuitous use of bytes like 0D
and 0A. However, it is applied to a sequence of *Unicode code points*,
not code units, and certainly not bytes the way QP is. You don't take
the UTF-8-encoded stream <C2 BF 51 75 C3 A9 3F> and encode *those seven
bytes* in SCSU; rather, you encode the stream of five Unicode code
points <00BF 0051 0075 00E9 003F>.
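A small Python sketch of the two views of that same stream, for anyone who wants to check the counts. The Python standard library has no SCSU codec, so this only shows the input an SCSU encoder would be handed, not the SCSU output itself; the variable names are mine.

```python
# The seven UTF-8 bytes from the example above.
utf8_bytes = bytes([0xC2, 0xBF, 0x51, 0x75, 0xC3, 0xA9, 0x3F])

# What a byte-level transform such as QP operates on: the bytes themselves.
print(len(utf8_bytes), [f"{b:02X}" for b in utf8_bytes])

# What SCSU operates on: the code points recovered by undoing UTF-8 first.
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]
print(len(code_points), [f"{cp:04X}" for cp in code_points])
```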
That said, the definitions in UTR #17 were surprisingly difficult for me
to wrap my brain around in general, so I might be off-base on some of
this.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/