From: Gregg Reynolds (unicode@arabink.com)
Date: Fri Jun 24 2005 - 15:18:38 CDT
James Kass wrote:
> Gregg Reynolds wrote,
>  
> 
>>The unicode definition of "plain text" works for me; it's more or less 
>>mathematical and allows us to avoid metaphysics.  But you surely see 
>>that the definition of "rich text" is hopelessly broken and inconsistent 
>>with that of plain text, no?
> 
> 
> Surely I can see that the definition of rich text is inconsistent
> with that of plain text.  After all, if they weren't inconsistent,
> they'd be the same thing and the glossary entry for "rich text"
> could be changed to:  'see "plain text"'.
Consistent does not mean identical.
> 
> But, what's hopelessly broken about it?
> 
Hi James,
Sorry about getting back to you late.
I hope the following (longish) message will make clear I don't bring 
this stuff up just to be curmudgeonly.
From the glossary:
"Plain Text. Computer-encoded text that consists only of a sequence of 
code points from a given standard, with no other formatting or 
structural information."
Not bad; but not good enough.  It should say "a sequence of codepoints 
*each of which has single-character semantics*...".  I.e. a standard 
which defines a codepoint for "red" or "skip 24 points" or "poodle" 
cannot be used for plaintext.
"Rich Text. Also known as styled text. The result of adding information 
to plain text. Examples of information that can be added include font 
data, color, formatting information, phonetic annotations, interlinear 
text, and so on. The Unicode Standard does not address the 
representation of rich text. It is expected that systems and 
applications will implement proprietary forms of rich text. Some public 
forms of rich text are available (for example, ODA, HTML, and SGML). 
When everything except primary content is removed from rich text, only 
plain text should remain."
Most obvious problem:  SGML is plain text, as is XML, a subset of PDF, 
etc.  HTML is also plaintext; it happens to have some formatting 
semantics at the lexical level, but considered as a "sequence of 
codepoints" it clearly meets the Unicode definition of plain text.  For 
that matter, isn't RTF plaintext with formatting semantics?  I'm not 
that familiar with it, but doesn't it use a plain text character repertoire?
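
To make the "sequence of codepoints" point concrete, here is a minimal Python sketch. The name is_plain_text and the rule it applies (reject control, unassigned, private-use and surrogate code points, but allow tab and line breaks) are illustrative assumptions, not anything the glossary specifies. Under that rule an HTML fragment passes like any other string.

```python
import unicodedata

# Hypothetical rule, for illustration only: a code point "has character
# semantics" if it is not a control (Cc), unassigned (Cn), private-use (Co)
# or surrogate (Cs) code point. Tab, LF and CR are allowed as whitespace.
REJECTED_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def is_plain_text(s: str) -> bool:
    """Return True if every code point in s passes the rule above."""
    return all(
        ch in ALLOWED_CONTROLS
        or unicodedata.category(ch) not in REJECTED_CATEGORIES
        for ch in s
    )


html = '<p style="color: red">poodle</p>'

# The markup is just more code points: '<', 'p', ' ', ...
print([f"U+{ord(ch):04X}" for ch in html[:3]])  # ['U+003C', 'U+0070', 'U+0020']
print(is_plain_text(html))                      # True
print(is_plain_text("abc\x00\x01"))             # False
```

The formatting semantics of the "style" attribute live in the HTML grammar, not in the code points themselves, which is why the check cannot tell markup from prose.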
The basic problem: by these definitions, plain text and rich text are in 
semantically different categories.  One is a sequence of code points; 
the other is - what?  Figure on ground?  Ink on paper?  Any result of 
presenting plain text visually?
What can it mean to "add information" to plain text, given that plain 
text is by definition a sequence of codepoints?  If you add 
"information" consisting of codepoints with character semantics, then 
you still have plain text.  If you add "information" consisting of 
codepoints with non-character semantics, well then you no longer have 
text of any kind.  You have non-text.  If you add "information" by 
writing a syntax-coloring editor, you haven't added anything to the 
plain text, you've added a completely separate semantic layer.
The fact that a plain text string may conform to a higher-level grammar 
(like XML), even if that grammar also has an associated non-text 
semantics (like HTML), doesn't change the fact that the string is plain 
text.
So the important distinction is not between plain text and rich text, 
but between plain text and non-text on the one hand, and text versus 
representation on the other.  Or at a higher level, between that family 
of grammars that use plaintext at the lowest syntactic level, and those 
that use non-text at the lowest level.  The former includes SGML, HTML, 
XML, RTF, SVG, etc. etc.  The latter includes the MSWord doc format, 
xls, image formats, various proprietary typesetting languages, etc.  The 
Unicode glossary would be improved if, instead of "The Unicode Standard 
does not address the representation of rich text" it said something like 
"Unicode does not impose any syntactic or semantic constraints on 
higher-level grammars that use Unicode at the character text level."
This is important in the context of training.  I occasionally have to 
try to explain XML in 30 seconds or less to non-techy business types. 
One of the crucial points (IMO) is that XML is plain text, which means 
the kind of file corruption problems we often have with Word docs go 
away, since we can use any one of thousands of plaintext editors to 
examine and fix the docs.  The contrast with .doc files is not plain v. 
rich, but plain v. non-text, and therefore tool-agnostic v. vendor 
dependent.  The fact that the non-text elements of the .doc format may 
represent formatting information is irrelevant; you can't edit them no 
matter what they mean without a specialized editor.
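
The plain v. non-text contrast can be sketched at the byte level. The helper below is a rough heuristic under stated assumptions: looks_like_plain_text is a made-up name, UTF-8 is assumed as the encoding, and the sample bytes are hand-built rather than taken from real files. The second sample starts with the eight-byte signature that opens OLE2 compound files, the container used by legacy .doc.

```python
import unicodedata


def looks_like_plain_text(data: bytes) -> bool:
    """Heuristic: the bytes decode as UTF-8 and contain no control
    characters other than tab, LF and CR."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return all(
        ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
        for ch in text
    )


# A small XML document, encoded as UTF-8.
xml_bytes = '<?xml version="1.0"?><note>caf\u00e9</note>'.encode("utf-8")

# OLE2 compound-file signature followed by padding; not a complete .doc file.
doc_bytes = bytes.fromhex("D0CF11E0A1B11AE1") + b"\x00" * 8

print(looks_like_plain_text(xml_bytes))  # True
print(looks_like_plain_text(doc_bytes))  # False
```

Anything that passes a check like this can be opened and repaired in any plaintext editor; anything that fails needs a tool that understands the format.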
Complementary to this is the importance of the notion of a distinction 
between the thing and its representation, which is where XSL stylesheets 
come in.  XSL stylesheets don't turn plain text into rich text; they may 
generate (possibly "fancy", colorful) representations of a plain text 
information asset.  Such representations may themselves use a plaintext 
(HTML) or a non-text (PDF) language.  But the information asset remains 
in plaintext.  When I show somebody a hardcopy of a colorful fancied-up 
PDF document generated from an XML document, I say, not "this is rich 
text", but "this is a plain text document formatted with a stylesheet; 
we can change it however we want without disturbing the plaintext".  It 
seems to me that using the terminology as you and some others recommend 
would make this impossible.  I just don't see how this idea of "rich 
text" is really very useful.
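
A small Python sketch of the thing-versus-representation split follows. Python's standard library has no XSLT processor, so an ordinary function (render_html, a hypothetical name) stands in for the stylesheet; the note document and its element names are also invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# The information asset: a plain text XML document (invented example).
SOURCE = "<note><to>James</to><body>Plain text &amp; stylesheets</body></note>"


def render_html(xml_text: str, color: str) -> str:
    """Stand-in for a stylesheet: build one HTML representation of the
    document. The source string is only read, never modified."""
    root = ET.fromstring(xml_text)
    to = escape(root.findtext("to", default=""))
    body = escape(root.findtext("body", default=""))
    return f'<p style="color: {color}"><b>{to}</b>: {body}</p>'


# Two different representations generated from the same source.
print(render_html(SOURCE, "red"))
print(render_html(SOURCE, "blue"))
```

Changing the color argument changes the output only. SOURCE stays the same plain text before and after, and the HTML that comes out is itself plain text.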
-gregg