From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 27 2008 - 09:38:34 CDT
Jeroen Ruigrok van der Werven wrote:
> Say you have defined a stateful object: that object can tell
> you about its datatype, probably memory use, and so on. Of
> course, this all depends on what is defined to be the 'state'.
Of course yes! UTF-8 (and also UTF-16) is a stateful encoding because you
have to remember the state of the previous leading bytes (or leading high
surrogate in UTF-16) when a non-leading byte (or non-leading low surrogate
in UTF-16) occurs.
However, this state is bounded in UTF-8, unlike in ISO 2022, where you may
need to remember the state over an unlimited distance from where it was set
by a prior code.
"Stateful" is not a particularly useful distinction for encodings. In fact,
almost everything we handle is stateful (starting at least at the bit level:
you need to keep the state of some other prior bits to recognize distinct
codes for distinct characters).
What is more productive, when speaking about encodings, is the minimum
distance (in terms of volume, or time of transmission...) at which the state
is fully defined, because it also conditions other things, notably:
- the resistance to transmission errors, or recoverability from such
errors;
- the searchability from an arbitrary position in the middle of text: how
far do you have to read backward from an arbitrary position in order to be
sure to decode the rest of the text correctly, with all the needed decoding
state variables unambiguously defined?
- the predictability of this backward distance: can you bound it by a
limited number of read operations?
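
To make the "backward distance" concrete, here is a minimal sketch of
resynchronizing from an arbitrary position. It assumes well-formed input, and
the function names utf8_sync_backward and utf16_sync_backward are hypothetical
helpers chosen for this illustration, not part of any standard library. In
UTF-8 you never need to step back more than three bytes; in UTF-16, never more
than one code unit.

```python
import struct


def utf8_sync_backward(data: bytes, pos: int) -> int:
    """Return the index where the code point containing data[pos] starts.

    Assumes well-formed UTF-8. Trailing bytes look like 10xxxxxx, and a
    sequence holds at most three of them, so at most three steps are needed.
    """
    start = pos
    while start > 0 and pos - start < 3 and (data[start] & 0xC0) == 0x80:
        start -= 1
    return start


def utf16_sync_backward(units: list, pos: int) -> int:
    """Return the index where the code point containing units[pos] starts.

    Assumes well-formed UTF-16 code units. Only a low surrogate
    (0xDC00..0xDFFF) requires stepping back, and only by one unit.
    """
    if pos > 0 and 0xDC00 <= units[pos] <= 0xDFFF:
        return pos - 1
    return pos


text = "naïve €uro 𝄞 clef"
data = text.encode("utf-8")
raw16 = text.encode("utf-16-le")
units = list(struct.unpack("<%dH" % (len(raw16) // 2), raw16))

# From every byte offset, a bounded backward scan finds a valid starting point.
for pos in range(len(data)):
    start = utf8_sync_backward(data, pos)
    print(pos, "->", start, repr(data[start:].decode("utf-8")))
```

The loop bound (three bytes) is known in advance and does not depend on the
length of the text, which is the predictability the list above asks about.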
UTF-8 and UTF-16 satisfy the three conditions above with a finite/bounded,
small, and fully predictable number of operations (requiring a fully
predetermined finite set of state variables), whereas ISO 2022 does not offer
the same features. Even if it requires only a finite set of state variables,
it does not offer full predictability for searches from an arbitrary position
in large texts, so its processing is "almost necessarily" sequential only.
The alternative is some heuristic "guessing", similar to the one used in web
browsers to guess which encoding is used in a web page that has no explicit
metadata specifying it. In that case you must be prepared to accept false
guesses, errors, or the need to redecode the same text starting from several
other positions to see what makes the most "sense" for your application.
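
The contrast can be sketched with ISO-2022-JP, one profile of ISO 2022, using
Python's built-in "iso2022_jp" codec. The sample strings are assumptions
chosen for illustration. Decoding from the middle of the text, after the
escape sequence that selected the Japanese character set, silently yields
different characters rather than an error, and the escape that defines the
state may lie arbitrarily far back.

```python
text = "abc 日本語 def"
data = text.encode("iso2022_jp")

# ESC $ B switches the decoder to JIS X 0208; it appears once, before "日".
esc = data.index(b"\x1b$B")
mid = esc + 3  # first byte after the escape sequence

from_start = data.decode("iso2022_jp")
# A decoder started here is in its initial (ASCII) state, so the same
# bytes are interpreted differently, without any error being raised.
from_middle = data[mid:].decode("iso2022_jp")

print(repr(from_start))
print(repr(from_middle))

# The state-setting escape can be arbitrarily far behind the current position:
# one escape at the start governs every following character.
long_run = ("語" * 1000).encode("iso2022_jp")
print(long_run.count(b"\x1b$B"), len(long_run))

# The UTF-8 form of the same text decodes correctly from the same character.
utf8_tail = text.encode("utf-8")[4:].decode("utf-8")
print(repr(utf8_tail))
```

Nothing in the bytes near the starting position reveals that the decoding
state is wrong; only a scan back to the last escape sequence, or a guess,
can establish it.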