From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed May 28 2008 - 07:03:09 CDT
Kenneth Whistler wrote:
> John Jenkins said:
> > UTF-16, after all, is stateful: if you lose the BOM, things can look
> > very different.
>
> That is true of the UTF-16 encoding *scheme*. (See TUS 5.0,
> D98, p. 106.)
No, I included the encoding *forms* as well, without reference to the
byte order, just the relative order of code units.
The BOM is another case where you need *another* state variable. But even in
encoding schemes without any BOM, you need a state variable to parse the
encoded text. This is true for all encodings that handle streams of
code units or streams of bytes, and in fact for any data stream whose
elements need a relative order to be interpreted (at least down to the level
of bits, or discrete numeric symbols in communication and transport
devices).
The *only* level that is stateless in Unicode is the level of the stream of
*code points*, which have their own distinctive identity and their own
properties independently of their context, but may still be given some
additional inferred properties from the context (such as the effective
directionality for characters with weak or neutral directionality). Code
points exist as indivisible single points in a well-defined discrete space
(i.e. as elements in a finite set), and they don't have any "length" or
"current state": each one counts as exactly one element, encoding only its
own existence.
Their effective representation (as an integer or as a bitset
with just one bit set to one) is not relevant, and neither is their relative
order: the encoding space itself has no dimension, even if it is enumerable
and effectively defined along with a normative enumeration that maps code
points to integers. That enumeration does not say that code points are
integers themselves, as code points have *no* defined arithmetic behavior
except in small subsets of the space for some applications.
Of course, to handle code points in computers, you need at least an encoding
form (at the interface level) or scheme (for the effective storage or
transmission). Such a mapping (encoding and decoding) always requires a
stateful operation, even for the simplest UTF-32BE or UTF-32LE encoding
schemes. The right question to ask is where the state variable resides: it
lives in the encoder or decoder itself, not anywhere in the data stream of
code units, bytes or bits. Such a state variable gets set in operations known
as "I/O" operations: all these operations are ordered (in processing time, or
storage address, or relative position in the stream).
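To make this concrete, here is a minimal sketch of a byte-at-a-time decoder
for the UTF-32BE and UTF-32LE schemes. The class name Utf32ByteDecoder and
the helper decode_utf32 are hypothetical names chosen for this illustration,
and the sketch does no range or surrogate validation. The point is only that
the byte order, the byte position and the partial value are all held by the
decoder, and none of them is carried in the bytes themselves.

```python
class Utf32ByteDecoder:
    """Hypothetical incremental decoder for UTF-32BE / UTF-32LE byte streams."""

    def __init__(self, big_endian=True):
        self.big_endian = big_endian  # state: byte order, chosen by the scheme
        self.count = 0                # state: bytes seen in the current unit (0..3)
        self.value = 0                # state: partially assembled code unit

    def feed(self, byte):
        """Accept one byte; return a code point after every fourth byte, else None."""
        if self.big_endian:
            self.value = (self.value << 8) | byte
        else:
            self.value |= byte << (8 * self.count)
        self.count += 1
        if self.count < 4:
            return None
        code_point, self.count, self.value = self.value, 0, 0
        return code_point


def decode_utf32(data, big_endian=True):
    """Decode a whole byte string by feeding it to the decoder one byte at a time."""
    decoder = Utf32ByteDecoder(big_endian)
    return [cp for cp in map(decoder.feed, data) if cp is not None]
```

Feeding the same four bytes to a decoder set up with the other byte order
gives a different number, which is the sense in which the state is outside
the stream.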
Saying that any encoding scheme or form is stateless is completely false:
all you can say is that some representations require *fewer* free state
variables than others, but you absolutely cannot exclude *all* state
variables. As a consequence, *all* Unicode encoding schemes or forms are
stateful (the only exception is the UTF-32 encoding form when working with
it at the interface level, because this is the only standardized
representation that uses a bijective one-to-one mapping between code points
and numeric code units).
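The contrast between the two encoding forms can be sketched as follows,
working on sequences of code units rather than bytes. The function names
are hypothetical. The UTF-16 function has to remember a pending high
surrogate from one code unit to the next, while the UTF-32 function has
nothing to remember because each code unit already is the code point's
number.

```python
def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of code points."""
    code_points = []
    pending_high = None  # state: a high surrogate waiting for its low surrogate
    for unit in units:
        if pending_high is not None:
            if not 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("high surrogate not followed by a low surrogate")
            code_points.append(
                0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
            )
            pending_high = None
        elif 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_points.append(unit)
    if pending_high is not None:
        raise ValueError("truncated surrogate pair")
    return code_points


def decode_utf32_units(units):
    """UTF-32 code units map one-to-one to code points: no state is needed."""
    return list(units)
```

For U+1F600, the UTF-16 form is the pair D83D DE00, and the meaning of DE00
depends entirely on the decoder still remembering D83D.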
You can also compare the various schemes or forms by the amount of space
needed to store these state variables:
* for UTF-8 streams of bytes without BOM, this space is a number from 0 to 3
(so it requires two bits);
* for UTF-8 streams of bytes with possible BOM, you need another bit of
state to represent the presence or absence of the leading BOM;
* when working at the bit level, you need three other bits of state to
represent the bit position and order in bytes;
* you can do the same kind of analysis for UTF-16 and UTF-32 encoding forms
and schemes, but you'll also need a few more bits of state to
represent the byte order and the relative position of bytes in streams in
that order.
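The first item in this list can be checked with a small sketch that records
the decoder's state after each byte of a BOM-less UTF-8 stream. The function
name utf8_pending_trace is hypothetical, and the sketch only classifies lead
and continuation bytes (it does not reject overlong forms or surrogates).
The state is the number of continuation bytes still expected, which never
leaves the range 0 to 3.

```python
def utf8_pending_trace(data):
    """Return the number of continuation bytes still expected after each byte."""
    pending = 0  # state: 0..3, so it fits in two bits
    trace = []
    for byte in data:
        if pending:
            if byte & 0xC0 != 0x80:
                raise ValueError("expected a continuation byte")
            pending -= 1
        elif byte < 0x80:
            pending = 0          # single-byte sequence
        elif byte & 0xE0 == 0xC0:
            pending = 1          # lead byte of a 2-byte sequence
        elif byte & 0xF0 == 0xE0:
            pending = 2          # lead byte of a 3-byte sequence
        elif byte & 0xF8 == 0xF0:
            pending = 3          # lead byte of a 4-byte sequence
        else:
            raise ValueError("invalid lead byte")
        trace.append(pending)
    return trace
```

For the bytes of U+0041, U+20AC and U+1F600 in sequence (41, E2 82 AC,
F0 9F 98 80), the trace is 0, then 2 1 0, then 3 2 1 0.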
The number of state variables needed is zero *only* for the standardized
UTF-32 encoding form (or for any non-standard encoding form that represents
code points in a numeric representation capable of storing about 21 bits of
information, with at least 17×2^16 distinct values). Some state variables are
implied by the hardware architecture handling the representation and cannot
be changed easily at the software level without costly conversion operations
(such as bit reordering), but they do not "disappear": they are effectively
implemented by the computing host. When working at the level of encoding
*schemes* (not *forms*), these variables are always present and must be
supported by the software handling them, in order to obtain or rebuild easily
usable code units.
So there exist absolutely NO "stateless" encoding schemes. Another way to
say it: ALL encoding *schemes* are "stateful", even if you don't immediately
perceive the effective need for these state variables, and even if these
variables are very few, extremely small and simple to handle in software
(though handling them can sometimes be quite costly in terms of application
performance or computing resources, especially when it requires bit
reordering).