About Unicode encodings

From: Miguel Angel Suárez (mglsrz@sion.com)
Date: Sat Feb 16 2002 - 10:17:08 EST


In these notes I want to discuss the notion of "encoding" and introduce a
series of concepts that I find useful. My intention is to try to identify
abstract properties of encodings and to be able to classify encodings
according these
properties.

In its most general terms, and encoding is a binary relation (that is, a set
of ordered pairs) whose members are of the form (s, b), where s is a string
and b is a byte sequence. If E is an encoding, as usual, I write s E b to
express that (s, b) is in E, in which case I say that b is an E-encoding of
s.

This is a quite general definition, so general that it may be too general.
First, with this definition we may have an encoding E, two different strings
s0 and s1 and a byte sequence b such that s0 E b and s1 E b, that is, b
is an E-encoding of two different strings. Should the definition disallow
this? I don't know. Such an encoding may be useful if we restrict ourselves
to use it to encode a subset of strings on which the enconding is
non-ambiguous. Second, we tend to intuitively assume that if we can encode a
string s in an encoding E, then we can also encode any substring of s.
However, the definition above does not demand this.

The definition talks about encoding strings, not characters. This is not a
problem if, as usual, we identify a character c with the string "c". This
makes strings of length one very important when talking about encodings, so
I call these strings "singletons".

For a given encoding E, it is posible for a string to have more than one
E-encoding,
so we may have an enconding E, a string
s and two different byte sequences b0 and b1 such that both b0 and b1 are
E-encodings of s. An encoding E is said to be "functional" when no string
has more than
one E-encoding. Hence, if E is functional and we have s E b0 and s E b1, we
can conclude that b0 and b1 are identical byte sequences.

I will use "e" to represent a functional encoding, and use the functional
notation:
hence, e(s) is THE encoding of s in e. Also, when e is functional encoding ,
the following statements are equivalent:
- e(s) is defined
- s is in the domain of e
- s has an e-encoding.

Is it good or bad for an encoding to be non-functional? It all depends on
what we want to do with the encoded strings. If we simply want to store and
retrieve
them, a non-functional encoding seems to be as good as a functional one.
But what if we want to operate on the strings? The most elementary operation
is equality, so suppose that we have an encoding E and two byte sequences b0
and b1 which are E-encodings of strings s0 and s1, and suppose that we must
determine wheter s0 is equal to s1 just operating on the encodings. If the
encoding is functional, all we have to do is determine whether b0 and b1 are
identical byte sequences; if they are identical, s0 equals s1; if they are
not identical, s0 is not equal to s1 (if b0 and b1 are different, they
certainly encode different strings, because no string has more than one
E-encoding). This strategy works when E is functional, but fails miserably
if E is non-functional, so testing equality with functional encodings is
typically (much?) more efficient than testing equality with non-functional
encodings.

It is possible to define an encoding E in such a way that one or more
strings have no E-encoding. An encoding is said to be "total" if all strings
have at least one encoding; an
encoding is "partial" if it is not total. So, for instance, UTF-8, UTF-16x
and
UTF-32x are total encodings, while UCS2 is a partial encoding. What about
ASCII? Is ASCII a Unicode encoding? It seems to me that the Standard does
not consider ASCII a Unicode encoding. However, I personally treat it as a
partial Unicode encoding. Of course, analogously, all the ISO-8859-x and a
lot of other encodings that I am not even aware of, can be considered
partial Unicode encodings. What do YOU think about this?

Another concept that I find important is what I call a "serial" encoding.
A serial encoding is a functional encoding e such that the following
conditions hold:
   a) for any strings s0 and s1, e(s0) and e(s1) are both defined if and
only if e(s0 + s1) is defined, and if all these expressions are defined,
then
                  e(s0 + s1) = e(s0)e(s1)
   b) for any two different singletons s0 and s1 in the domain of e, then
neither e(s0) is a prefix of e(s1), nor e(s1) is a prefix of e(s0).

Let us discuss this definition. Condition a) is two conditions in one; on
the one hand, it says that if we have defined the encodings e(s0) and e(s1),
then, e(s0 + s1) is automatically defined: e(s0 + s1) is the encoding of s0
immediately followed by the encoding of s1, and we have no right to change
this definition. On the other hand, a) also says that if e(s0 + s1) is
defined, then, necessarily e(s0) and e(s1) must also be defined an the three
encodings must satisfy the given equality. In particular, this implies that
if s has an e-encoding, all characters appearing in s must have an
e-encoding (remember that the definition given above does not demand this).
Condition a) also implies that a serial encoding is totally determined by
how it encodes characters; for instance, if I defined e("a") and e("b"),
then condition a) totally determines how e encodes all strings containing
only a's and b's. Another consequence of the definition is that a character
is always encoded by the same byte sequence, irrespective of the context in
which it appears.

Condition b) is a little subtle. In particular, it implies that e(s0) and
e(s1) must be different, but it demands more than that. Why? Consider the
following example:

s | e(s)
________________
a | 0x01
b | 0x01 0x02

We extend e by e(s0 + s1) = e(s0)e(s1). With this definition e("a") is a
prefix of e("b"). Now suppose that we have the following byte sequence
        0x01 0x02 yy zz ...
that is an e-encoded string. The idea behind a serial encoding is to be able
to decode an encoded string character by character; from condition a) we
know that the first few bytes are the encoded form of the first character.
So we have to determine a prefix of this byte sequence that is an encoded
character. The problem is obviously that we have two prefixes that are
encoded characters, namely 0x01 and 0x01 0x02. With condition b) of the
definition, this cannot happen, so, with a serial encoding (as defined
above), we are sure that for any encoded string, there is exactly one prefix
that is an encoded character, which implies that the described decoding
strategy can be successfully applied.

A serial encoding e is "fixed-length" when for all singletons s0 and s1 in
the domain of e,
e(s0) and e(s1) are of the same length; those serial encodings that are not
fixed-length are "variable-length". For instance, UTF-8 and UTF-16x are
variable-length serial encodings, while UCS2 and UTF-32x are fixed-length
serial encodings.

Author's address: mglsrz@sion.com



This archive was generated by hypermail 2.1.2 : Sat Feb 16 2002 - 09:45:50 EST