From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Dec 27 2009 - 18:10:55 CST
It has been suggested that the number of states needed in a finite state
machine is a useful metric in comparing encoding schemes. I suggest a
different point of differentiation that has more widespread practical
importance.
Legacy double-byte encodings require a very small number of machine
states to decode, but anyone who has experience with these encodings
knows how difficult they can be to work with in practice. Clearly, then,
it is not the size of the finite state machine that matters.
What distinguishes the UTFs from encodings such as DBCS and ISO-2022 are
two other metrics:
1. the number of input bytes that must be examined, starting from a
random access point, to determine the state
2. the number of input bytes for which state information must be maintained
These two metrics are similar, but not identical.
The first speaks directly to the topic of resynchronization. In the
legacy DBCS encodings, certain byte values can satisfy the conditions
for being either a leading byte or a trailing byte. The encoding scheme
imposes no limit on the length of runs of such bytes, making
resynchronization, in the worst case, equivalent to re-reading the data
stream from the start. Compare that to the UTFs, where the worst case
requires examining 4 bytes to resynchronize. (Four bytes allow error
checking and identification of the character, while three bytes are
enough to calculate where the adjacent character boundaries must be in
well-formed input.)
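To make the first metric concrete, here is a minimal sketch (in C, not
from the original message) of resynchronizing to a character boundary
in well-formed UTF-8: from an arbitrary byte offset, at most three
continuation bytes need to be skipped backwards before a lead byte is
found. The function name and interface are illustrative only.

#include <stdint.h>

/* Illustrative sketch: back up from an arbitrary position in
 * well-formed UTF-8 to the start of the enclosing character.
 * Continuation bytes match the pattern 10xxxxxx, and at most three
 * of them can precede the lead byte of a 4-byte sequence. */
const uint8_t *utf8_sync_back(const uint8_t *start, const uint8_t *p)
{
    int steps = 0;
    while (p > start && (*p & 0xC0) == 0x80 && steps < 3) {
        --p;          /* skip a continuation byte */
        ++steps;
    }
    return p;         /* now points at a lead byte (or at start) */
}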
The second metric refers to encodings like ISO-2022 or SCSU, which use
control bytes or sequences to switch among character sets. There are
cases where such a scheme could be set up to allow easy
resynchronization in terms of character boundaries, yet still require
that state information be maintained for very long (unbounded)
stretches of data. Assume an ISO-2022-style combination of several
single-byte character sets. If that restriction is known (by
announcement), then resynchronizing to any character boundary is
trivial (as long as you recognize and avoid the escape codes). However,
interpreting (or correctly converting) any given character is
impossible without going back to the most recent character set
switching escape code.
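By contrast, the second metric shows up in a sketch like the following
(again illustrative C, and it assumes a simplified two-byte escape of
ESC plus a single designator byte rather than full ISO-2022 designation
sequences): finding the character set in effect at an arbitrary offset
requires scanning backwards an unbounded distance for the most recent
switch.

#include <stddef.h>

#define ESC 0x1B

/* Illustrative sketch: return the charset designator in effect at
 * byte offset pos, or dflt if no escape precedes it.  The backward
 * scan is unbounded: O(pos) in the worst case. */
static unsigned char charset_at(const unsigned char *buf, size_t pos,
                                unsigned char dflt)
{
    for (size_t i = pos; i >= 2; --i) {
        if (buf[i - 2] == ESC)
            return buf[i - 1];   /* designator following the escape */
    }
    return dflt;
}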
Except in the limiting case of a straight-through converter (in other
words, in most practical architectures), these two metrics describe
quite well how "difficult" an encoding is to work with.
A./