From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Mon Aug 09 2010 - 13:21:13 CDT
John H. Jenkins wrote:
> The basic idea is that "plain text" is the minimum amount of
> information to process the given language in a "normal" way.
That's a bit vague. We don't normally "process" languages; we read texts. 
Whether font or color variation is essential for understanding really 
depends on the author's purposes and choices, not on language.
> FOR
> EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY
> ISN'T, AND DOING IT LOOKS WRONG.
I wouldn't say it looks wrong. Surely it is often typographically poor or 
just stupid, but it might be a consequence of technical limitations (there 
are still loads of systems that make no case distinction in texts, so in any 
relevant aspect, they are effectively "uppercase-only"), and all-caps 
English is quite understandable, though boring to read, provided that some 
precautions are taken by writers.
> We therefore have both upper- and
> lower-case letters for English.
It's just a distinction that you _can_ (and usually do) make in plain text 
English. It's not an inherent distinction: all-caps English is still 
English, though poorly written by modern standards.
> Arabic, on the other hand, absolutely must have some way of allowing
> for different letter shapes in different contexts, or it looks just
> wrong, so Arabic "plain text" must have facility to allow for that,
> either by explicitly having different characters for the different
> shapes the letters take, or by providing a default layout algorithm
> that defines them.
But "layout algorithms" are not part of character encoding or part of the 
definition of "plain text". It's not OK to render plain text Arabic, encoded 
at logical level (i.e., letters encoded abstractly and not as contextual 
forms), in a simplistic manner that uses a one letter - one glyph model. But 
that's not part of the definition of "plain text" at all.
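The difference between letters encoded abstractly and letters encoded as contextual forms can be illustrated with a small Python sketch. It relies only on the standard unicodedata module; the helper name logical_form is made up for this illustration. Unicode happens to include compatibility characters for contextual Arabic shapes (the Arabic Presentation Forms blocks), and compatibility normalization (NFKC) maps them back to the abstract letter. Choosing the right glyph for the abstract letter is then left to the rendering software.

```python
import unicodedata

def logical_form(text: str) -> str:
    """Hypothetical helper: fold compatibility characters, such as
    Arabic presentation forms, into abstract letters using NFKC."""
    return unicodedata.normalize("NFKC", text)

BEH = "\u0628"          # ARABIC LETTER BEH (abstract, logical letter)
BEH_INITIAL = "\uFE91"  # ARABIC LETTER BEH INITIAL FORM (contextual form)
BEH_FINAL = "\uFE90"    # ARABIC LETTER BEH FINAL FORM (contextual form)

# Each line shows a character and the code point it normalizes to.
for ch in (BEH, BEH_INITIAL, BEH_FINAL):
    folded = logical_form(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(folded):04X}")
```

All three characters end up as U+0628: the contextual shape is a rendering matter, not part of the letter's identity in logically encoded text.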
> Yes, there are issues which end up being judgment calls, and it's
> easy to come up with cases where you can't really capture the full
> semantic intent of the author without what Unicode calls "rich text."
We don't need to invent contrived examples for that. Every time an author 
uses italics or bolding to make an essential point in emphasizing something, 
he does something that cannot be captured in a plain version of the text. To 
make an even simpler point, if you insert an essential content image into a 
document you step outside the realm of plain text.
I don't see any better definition for "plain text" than a negative one: it 
is text without formatting, except to the extent that forced line breaks and 
the choice of alternative forms for a character (to the extent that such 
differences are encoded in the character code) can be considered as 
formatting. "Plain text", though apparently a very simple concept, is a very 
abstract one. I don't think you can explain the concept to your neighbor 
while standing on one foot, if at all.
Human writing did not originate as plain text, and at the surface level, it 
is never "plain text": it always has some specific physical appearance, and 
abstract "plain text" can only be found below the surface, as the underlying 
data format where only character identities (character numbers in a specific 
code) are encoded, with no reference to a particular rendering.
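As a rough illustration of "character identities only", the following Python sketch lists the Unicode code points of a string. The function name code_points is just an assumption for this example. It shows that the case distinction is encoded (different numbers for "P" and "p"), whereas nothing in the data says anything about font, size, or italics.

```python
def code_points(text: str) -> list:
    """Return the code points of the characters in text, in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The same word in mixed case and in all caps: different character
# identities, but no information about appearance in either case.
print(code_points("Plain"))
print(code_points("PLAIN"))
```

Whether such a sequence is shown in bold, in italics, or in a particular typeface is decided elsewhere; the plain text itself carries only the numbers.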
-- Yucca, http://www.cs.tut.fi/~jkorpela/