From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Mon Aug 09 2010 - 13:21:13 CDT
John H. Jenkins wrote:
> The basic idea is that "plain text" is the minimum amount of
> information to process the given language in a "normal" way.
That's a bit vague. We don't normally "process" languages; we read texts.
Whether font or color variation is essential for understanding really
depends on the author's purposes and choices, not on language.
> FOR
> EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY
> ISN'T, AND DOING IT LOOKS WRONG.
I wouldn't say it looks wrong. Surely it is often typographically poor or
just stupid, but it might be a consequence of technical limitations (there
are still loads of systems that make no case distinction in texts, so in any
relevant aspect, they are effectively "uppercase-only"), and all-caps
English is quite understandable, though boring to read, provided that some
precautions are taken by writers.
> We therefore have both upper- and
> lower-case letters for English.
It's just a distinction that you _can_ (and usually do) make in plain text
English. It's not an inherent distinction: all-caps English is still
English, though poorly written by modern standards.
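
As a small illustration of case being a distinction that is encoded in plain
text, here is a minimal Python sketch. The helper name same_ignoring_case is
made up for this example; it only shows that upper- and lowercase letters are
separate coded characters and that folding case changes the character data
while the text remains comparable.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings while ignoring case."""
    return a.casefold() == b.casefold()


mixed = "Plain Text"
shouted = mixed.upper()  # the all-caps variant of the same text

# "A" and "a" are distinct coded characters, so the two strings differ as data.
print(hex(ord("A")), hex(ord("a")))
print(shouted != mixed)

# Ignoring the case distinction, they are still the same English text.
print(same_ignoring_case(mixed, shouted))
```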
> Arabic, on the other hand, absolutely must have some way of allowing
> for different letter shapes in different contexts, or it looks just
> wrong, so Arabic "plain text" must have facility to allow for that,
> either by explicitly having different characters for the different
> shapes the letters take, or by providing a default layout algorithm
> that defines them.
But "layout algorithms" are not part of character encoding or part of the
definition of "plain text". It's not OK to render plain text Arabic, encoded
at logical level (i.e., letters encoded abstractly and not as contextual
forms), in a simplistic manner that uses a one-letter, one-glyph model. But
that's not part of the definition of "plain text" at all.
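
The difference between a letter encoded abstractly and its contextual forms
can be seen in Unicode character data. The following is a minimal Python
sketch using only the standard unicodedata module; the helper name
abstract_letter is made up for this example. It assumes that compatibility
normalization (NFKC) is an acceptable way to map the Arabic presentation
forms of BEH back to the abstract letter. It does no shaping or rendering.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstractly encoded letter

# Compatibility characters that encode the contextual shapes of BEH.
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def abstract_letter(ch: str) -> str:
    """Hypothetical helper: map a presentation form to its abstract letter."""
    return unicodedata.normalize("NFKC", ch)


for form in PRESENTATION_FORMS:
    base = abstract_letter(form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(base):04X}")
```

Text encoded at the logical level contains only U+0628; choosing among the
four shapes is then left to the rendering software.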
> Yes, there are issues which end up being judgment calls, and it's
> easy to come up with cases where you can't really capture the full
> semantic intent of the author without what Unicode calls "rich text."
We don't need to invent contrived examples for that. Every time an author
uses italics or bold to emphasize an essential point, he does something
that cannot be captured in a plain text version of the text. To make an
even simpler point, if you insert an image with essential content into a
document, you step outside the realm of plain text.
I don't see any better definition for "plain text" than a negative one: it
is text without formatting, except to the extent that forced line breaks and
the choice of alternative forms for a character (to the extent that such
differences are encoded in the character code) can be considered as
formatting. "Plain text", though apparently a very simple concept, is a very
abstract one. I don't think you can explain the concept to your neighbor
while standing on one foot, if at all.
Human writing did not originate as plain text, and at the surface level, it
is never "plain text": it always has some specific physical appearance, and
abstract "plain text" can only be found below the surface, as the underlying
data format where only character identities (character numbers in a specific
code) are encoded, with no reference to a particular rendering.
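
To make "character numbers in a specific code" concrete, here is a minimal
Python sketch that lists the Unicode code points underlying a string. The
function name code_points is made up for this example. Nothing in the result
says anything about font, size, or color.

```python
def code_points(text: str) -> list[str]:
    """Hypothetical helper: list the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Only character identities are recorded, with no reference to rendering.
print(code_points("Plain"))
```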
-- Yucca, http://www.cs.tut.fi/~jkorpela/