From: Kenneth Whistler (kenw@sybase.com)
Date: Thu May 19 2005 - 19:08:02 CDT
In addition to clarifications provided by Peter and Philippe,
which I won't repeat, ...
> Surely you are not denying that surrogates, ... are stateful mechanisms?
> It is irrelevant for the discussion
> of stateful mechanisms in encoding and the problems they pose for
> fragment interpretability whether or not those mechanisms are in the
> text content; they are in the text stream and must be dealt with.
Surrogate pairs are *not* a stateful mechanism in the sense
in which that term is generally applied to character encodings.
Dean quoted:
> SURROGATES:
>
> The Unicode Standard 4.1, section 3.9
> "In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
> represented as
> <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302."
He failed to quote the parallel text nearby:
"In UTF-8, the code point sequence <004D, 0430, 4E8C, 10302> is
represented as <4D D0 B0 E4 BA 8C F0 90 8C 82>, where ...
<F0 90 8C 82> corresponds to U+10302."
This is not "stateful" -- in both cases it is simply an encoding
scheme that has a non-one-to-one mapping of code units to
encoded characters.
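The two quoted passages can be checked with a short Python sketch.
It uses only the standard codecs; the variable names and the
hex_units helper are illustrative and not taken from the standard.

```python
# The code point sequence <004D, 0430, 4E8C, 10302> from section 3.9.
text = "\u004D\u0430\u4E8C\U00010302"

# Encode it in both encoding forms (big-endian UTF-16, no BOM).
utf16_bytes = text.encode("utf-16-be")
utf8_bytes = text.encode("utf-8")

# Regroup the UTF-16 bytes into 16-bit code units.
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]


def hex_units(units, width):
    """Format integers as space-separated uppercase hex of the given width."""
    return " ".join(f"{u:0{width}X}" for u in units)


print(hex_units(utf16_units, 4))  # 004D 0430 4E8C D800 DF02
print(hex_units(utf8_bytes, 2))   # 4D D0 B0 E4 BA 8C F0 90 8C 82
```

In both forms the first three characters take one, or one to three,
code units each, and only U+10302 needs the longer sequence. Neither
encoder carries any mode from one character to the next.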
In UTF-16, 0xD800 does not set a "state" which then alters the
interpretation of a subsequent code unit. 0xDF02 has its own, unique
status, regardless of what precedes or follows it. Some sequences
are valid, some are not -- that's all.
In UTF-8, 0xF0 does not set a "state" which then alters the
interpretation of a subsequent byte. 0x90 has its own, unique
status, regardless of what precedes or follows it. Some sequences
are valid, some are not -- that's all.
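As a rough illustration of "its own, unique status", here is a
sketch in Python that classifies a single UTF-16 code unit or UTF-8
byte from its value alone. The function names and the labels they
return are my own, chosen for illustration; the ranges are the usual
ones for well-formed UTF-16 and UTF-8.

```python
def utf16_unit_kind(unit: int) -> str:
    """Classify one UTF-16 code unit without looking at its neighbours."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP code unit"


def utf8_byte_kind(byte: int) -> str:
    """Classify one UTF-8 byte without looking at its neighbours."""
    if byte <= 0x7F:
        return "single byte"
    if byte <= 0xBF:
        return "continuation byte"
    if 0xC2 <= byte <= 0xDF:
        return "lead byte of 2"
    if 0xE0 <= byte <= 0xEF:
        return "lead byte of 3"
    if 0xF0 <= byte <= 0xF4:
        return "lead byte of 4"
    return "never valid"  # C0, C1, F5..FF


print(utf16_unit_kind(0xD800))  # high surrogate
print(utf16_unit_kind(0xDF02))  # low surrogate
print(utf8_byte_kind(0xF0))     # lead byte of 4
print(utf8_byte_kind(0x90))     # continuation byte
```

Neither function takes any context argument. Whether a given
sequence is valid does depend on the neighbours, but the role of
each unit does not, so a reader dropped into the middle of a
fragment can find the next character boundary within a few units.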
The ISO 2022 framework, on the other hand, *is* generally acknowledged
to be a stateful approach to character encoding. See the example
shown in Figure 1-2, p. 4 of TUS 4.0:
The presence of the byte sequence <1B 2D> in an ISO 2022 text stream
*alters* the interpretation of an immediately following 0x46 byte
from being LATIN CAPITAL LETTER F to being a code set shifter picking
the character set ISO 8859-7, which sets a further state changing
the interpretation of all subsequent bytes in the stream (until
the next escape sequence).
The presence of the byte sequence <1B 24 42> in an ISO 2022 text stream
*alters* the interpretation of an immediately following 0x46 byte
from being LATIN CAPITAL LETTER F to being the initial byte of
a two-byte JIS X 0208 encoding of the Japanese ideograph for
'hi' "day", and sets a state changing the interpretation of all
subsequent bytes in the stream (until the next escape sequence).
*That* is stateful character encoding.
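The <1B 24 42> case can be tried out in Python. This is only a
sketch: it assumes Python's built-in "iso2022_jp" codec, which
implements one profile of the ISO 2022 framework, as a stand-in for
a general ISO 2022 decoder. The constant names are illustrative.

```python
# Escape sequences, written as the bytes that appear in the stream.
ESC_TWO_BYTE = b"\x1b$B"  # <1B 24 42>: switch to the two-byte set
ESC_ASCII = b"\x1b(B"     # <1B 28 42>: switch back to ASCII

payload = b"F|"           # the bytes <46 7C>

# With no preceding escape sequence, 0x46 is LATIN CAPITAL LETTER F.
plain = payload.decode("iso2022_jp")

# After <1B 24 42>, the same 0x46 is the first byte of a two-byte code.
shifted = (ESC_TWO_BYTE + payload + ESC_ASCII).decode("iso2022_jp")

print(plain)                   # F|
print(f"U+{ord(shifted):04X}")  # U+65E5, the ideograph for "day"
```

The payload bytes are identical in both calls. A fragment that
starts after the escape sequence cannot be interpreted correctly
without knowing what came before it, which is exactly the property
that UTF-8 and UTF-16 do not have.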
--Ken