Re: Stateful encoding mechanisms

From: Dean Snyder (dean.snyder@jhu.edu)
Date: Thu May 19 2005 - 12:07:45 CDT

Next message: Peter Constable: "RE: ASCII and Unicode lifespan"

Previous message: Hans Aberg: "Re: ASCII and Unicode lifespan"
Maybe in reply to: Dean Snyder: "Stateful encoding mechanisms"
Reply: Philippe Verdy: "Re: Stateful encoding mechanisms"

Peter Constable wrote at 8:46 AM on Thursday, May 19, 2005:

>Note that surrogates, BOM and annotation characters (FFF9..FFFB) are not
>used in the text content of a file:

I do not understand why you are making the irrelevant, and even
partially wrong, assertion here that surrogates, BOM and annotation
characters are not used in the text content of a file.

I was not addressing the concept of "text content of a file"; I
specifically addressed "stateful mechanisms for plain text encoding".

Surely you are not denying that surrogates, BOM and annotation
characters are stateful mechanisms? It is irrelevant for the discussion
of stateful mechanisms in encoding and the problems they pose for
fragment interpretability whether or not those mechanisms are in the
text content; they are in the text stream and must be dealt with.

And for that matter, I don't understand why you left out the bidi
operators here, which I also mentioned. Do you consider them part of the
text content?

SURROGATES:

The Unicode Standard 4.1, section 3.9
"In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds
to U+10302."

How can you say that, for example, the surrogates in this very example
in TUS are not used in text content?
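
A minimal Python sketch of the fragment problem with that same TUS
sequence. The helper names (to_utf16_units, is_low_surrogate) are
hypothetical and used only for illustration; the arithmetic is the
standard UTF-16 surrogate calculation.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units for a single code point."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def is_low_surrogate(unit):
    """True if the code unit can only be the second half of a pair."""
    return 0xDC00 <= unit <= 0xDFFF


units = []
for cp in (0x004D, 0x0430, 0x4E8C, 0x10302):
    units.extend(to_utf16_units(cp))

# A fragment cut between the two surrogates begins with an orphaned
# low surrogate; it cannot be interpreted without the preceding unit.
fragment = units[4:]
fragment_starts_mid_pair = is_low_surrogate(fragment[0])
```

The last code unit only means U+10302 when read together with the unit
before it, which is the sense of "stateful" at issue here.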

BOM:

The Unicode Standard 4.1, section 15.8
"Detection of U+FFFE at the start of an input stream should be taken as
a strong indication that the input stream should be byte-swapped before
interpretation."

Note the use of the word "strong" here, signaling the BOM's ambiguity.
U+FEFF can occur almost anywhere in a text stream but if it is a BOM it
is used to interpret the text content, and is therefore, by definition,
a stateful mechanism. Notice the troublesome possibility of a text
fragment that happens to begin with U+FEFF used originally as a zero
width no-break space but now "should be taken as a strong [yet wrong]
indication that the input stream should be byte-swapped before
interpretation".
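
A small Python sketch of a fragment whose leading U+FEFF changes role.
It assumes a stream encoded as big-endian UTF-16 with no BOM written,
and relies on the behavior of Python's "utf-16" codec, which treats a
leading FE FF or FF FE as a byte order mark and consumes it. The
variable names are illustrative only.

```python
ZWNBSP = "\ufeff"  # zero width no-break space, same code point as the BOM

original = "ab" + ZWNBSP + "cd"
stream = original.encode("utf-16-be")  # explicit byte order, no BOM added

# Cut the stream after "ab": the fragment now begins with bytes FE FF.
fragment = stream[4:]

# Decoded with the byte order known out of band, U+FEFF stays in the text.
as_declared = fragment.decode("utf-16-be")

# Decoded by a BOM-sniffing decoder, the same bytes are taken as a
# signature and dropped from the text.
as_sniffed = fragment.decode("utf-16")
```

The two results differ by one character, although the bytes are
identical; only the position at the start of the fragment changed how
U+FEFF was read.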

ANNOTATION CHARACTERS:

The Unicode Standard 4.1, section 15.9
"For all regular editing and text-processing algorithms, the annotated
characters are treated as part of the text stream. The annotating text
is also part of the content, but for all or some text processing, it
does not form part of the main text stream."

Obviously the annotation characters themselves are not rendered but they
are rendering triggers, and they are in the text stream, and they are
stateful.
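
A rough Python sketch of why processing them needs state. The function
main_text is a hypothetical helper, not something defined by TUS; it
assumes the usual layout of U+FFF9 (anchor), annotated text, U+FFFA
(separator), annotating text, U+FFFB (terminator), and ignores nesting.

```python
ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"


def main_text(stream):
    """Keep the annotated text, drop the annotating text and the controls."""
    out = []
    in_annotation = False  # state carried from one character to the next
    for ch in stream:
        if ch == SEPARATOR:
            in_annotation = True
        elif ch == TERMINATOR:
            in_annotation = False
        elif ch == ANCHOR:
            continue
        elif not in_annotation:
            out.append(ch)
    return "".join(out)


full = "see " + ANCHOR + "kanji" + SEPARATOR + "reading" + TERMINATOR + " here"
whole = main_text(full)

# A fragment that starts inside the annotating text has lost the state
# set by the separator, so the annotation leaks into the main text.
cut = full[full.index("reading"):]
broken = main_text(cut)
```

Whether a given character belongs to the main text depends on controls
seen earlier in the stream, so a fragment taken mid-annotation is
misread.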

Dean A. Snyder

Assistant Research Scholar
Manager, Digital Hammurabi Project
Computer Science Department
Whiting School of Engineering
218C New Engineering Building
3400 North Charles Street
Johns Hopkins University
Baltimore, Maryland, USA 21218

office: 410 516-6850
cell: 717 817-4897
www.jhu.edu/digitalhammurabi/
http://users.adelphia.net/~deansnyder/

This archive was generated by hypermail 2.1.5 : Thu May 19 2005 - 14:07:09 CDT