From: Dean Snyder (dean.snyder@jhu.edu)
Date: Fri May 20 2005 - 13:53:37 CDT
Tim Greenwood wrote at 1:24 PM on Friday, May 20, 2005:
>On 5/19/05, Dean Snyder <dean.snyder@jhu.edu> wrote:
>> Well that, of course, depends on how you define state, acknowledgment of
>> which, I presume, is related to both your qualified dissension and your
>> use of quotes around the word "state" here.
>
>While I do not agree that your definition of state matches that
>commonly accepted, it is a coherent argument.
The surrogate mechanism and its UTF-8 analog are SELF-BOUNDING state
mechanisms, whereas ones like the bidi mechanism are OTHER-BOUNDING.
Both kinds are stateful in that they exhibit co-dependency across atoms
(code units).
>However if you make that
>argument then you must address Ken's other point. You criticise the
>use of 'stateful' code units in UTF-16, yet do not do the same for
>UTF-8. Why not? The structure of both is very similar.
No particular reason, other than that I consider it a side-stepping
distraction from the discussion of surrogates.
But, of course, the UTF-8 mechanism makes the same point I am making for
UTF-16; in fact, it makes it even more strongly. The fact that you may
have to backtrack anywhere from one to three code units in order to
interpret code unit sequences in UTF-8 makes it more fragment-fragile
than UTF-16: the stateful mechanism is spread over up to twice as many
code units (four per character instead of two).
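
To make the backtracking concrete, here is a minimal sketch in Python.
The function names (utf8_backtrack, utf16_backtrack, utf16_units) are
hypothetical helpers invented for illustration, and the sketch assumes
well-formed input; it is not taken from any particular library.

```python
def utf8_backtrack(data, i):
    """Count the code units to step back from data[i] to reach a lead byte.

    Assumes well-formed UTF-8, so the answer is always 0, 1, 2, or 3.
    """
    steps = 0
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while i - steps > 0 and (data[i - steps] & 0xC0) == 0x80:
        steps += 1
    return steps


def utf16_units(text):
    """Split text into a list of 16-bit UTF-16 code units (illustrative helper)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[k:k + 2], "big") for k in range(0, len(raw), 2)]


def utf16_backtrack(units, i):
    """Return 1 if units[i] is a low surrogate preceded by a high surrogate, else 0."""
    is_low = 0xDC00 <= units[i] <= 0xDFFF
    if is_low and i > 0 and 0xD800 <= units[i - 1] <= 0xDBFF:
        return 1
    return 0


# U+12000 (CUNEIFORM SIGN A) needs four code units in UTF-8, two in UTF-16.
sample = "a\U00012000"
utf8_data = sample.encode("utf-8")
utf16_data = utf16_units(sample)
```

Landing on the last code unit of the cuneiform sign requires stepping
back three units in UTF-8 but only one in UTF-16.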
As the Unicode Standard (section 2.5) says regarding multiple code units
for single characters - "This property [self-synchronization] has
another very important implication: corruption of a single code unit
corrupts only a single character; none of the surrounding characters are
affected."
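
The containment property in that quotation can be checked with a rough
sketch like the following. The helper name corrupt_and_decode and the
choice of 0xFF as the corrupting byte are assumptions made for
illustration; 0xFF is used because it never appears in valid UTF-8.

```python
def corrupt_and_decode(text, index, bad_byte=0xFF):
    """Overwrite one UTF-8 code unit and decode, replacing invalid sequences."""
    data = bytearray(text.encode("utf-8"))
    data[index] = bad_byte  # damage exactly one code unit
    # errors="replace" substitutes U+FFFD for each undecodable sequence.
    return data.decode("utf-8", errors="replace")


# Byte 4 is the middle code unit of the three-unit euro sign.
damaged = corrupt_and_decode("abc\u20acdef", 4)
```

The euro sign is lost, but the characters on either side of it decode
unchanged.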
That, of course, is the ingenuous sheep's clothing; the wolf inside the
sheep's clothing, however, is the complexity and its concomitant
fragility.
But to return to one of my main points: when, in the future, we move to
a monolithic 4-byte text encoding architecture, all of this becomes
needless complexity, and none of this statefulness between code units
and code points will exist.
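
As a rough sketch of what that looks like, take UTF-32BE as a stand-in
for such a 4-byte architecture (that choice, and the helper name
code_point_at, are assumptions for illustration). Every code point then
sits at a fixed offset, so no neighboring code unit has to be examined:

```python
def code_point_at(data, n):
    """Return the n-th code point from UTF-32BE bytes by direct indexing."""
    # One code unit per code point: the offset is simply 4 * n.
    return int.from_bytes(data[4 * n:4 * n + 4], "big")


utf32_data = "a\U00012000b".encode("utf-32-be")  # the -be codec writes no BOM
second = code_point_at(utf32_data, 1)
```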
In such an era I suggest we refer to the text encoding atom as a
"gulp" (as opposed to the current "byte" ;-)
Respectfully,
Dean A. Snyder
Assistant Research Scholar
Manager, Digital Hammurabi Project
Computer Science Department
Whiting School of Engineering
218C New Engineering Building
3400 North Charles Street
Johns Hopkins University
Baltimore, Maryland, USA 21218
office: 410 516-6850
cell: 717 817-4897
www.jhu.edu/digitalhammurabi/
http://users.adelphia.net/~deansnyder/