Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Mon Aug 09 2010 - 13:21:13 CDT


    John H. Jenkins wrote:

    > The basic idea is that "plain text" is the minimum amount of
    > information to process the given language in a "normal" way.

    That's a bit vague. We don't normally "process" languages; we read texts.
    Whether font or color variation is essential for understanding really
    depends on the author's purposes and choices, not on language.

    > FOR
    > EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY
    > ISN'T, AND DOING IT LOOKS WRONG.

    I wouldn't say it looks wrong. Surely it is often typographically poor or
    just stupid, but it might be a consequence of technical limitations (there
    are still loads of systems that make no case distinction in texts, so in any
    relevant aspect, they are effectively "uppercase-only"), and all-caps
    English is quite understandable, though boring to read, provided that some
    precautions are taken by writers.

    > We therefore have both upper- and
    > lower-case letters for English.

    It's just a distinction that you _can_ (and usually do) make in plain text
    English. It's not an inherent distinction: all-caps English is still
    English, though poorly written by modern standards.

    > Arabic, on the other hand, absolutely must have some way of allowing
    > for different letter shapes in different contexts, or it looks just
    > wrong, so Arabic "plain text" must have facility to allow for that,
    > either by explicitly having different characters for the different
    > shapes the letters take, or by providing a default layout algorithm
    > that defines them.

    But "layout algorithms" are not part of character encoding or part of the
    definition of "plain text". It's not OK to render plain text Arabic, encoded
    at logical level (i.e., letters encoded abstractly and not as contextual
    forms), in a simplistic manner that uses a one-letter, one-glyph model. But
    that's not part of the definition of "plain text" at all.
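
    To make the distinction between logical-level encoding and contextual
    forms concrete, here is a minimal sketch, assuming Python's standard
    unicodedata module. The helper names (describe, to_logical) are
    hypothetical and only for illustration. It compares the abstract letter
    BEH with the compatibility characters that encode its four contextual
    shapes, and shows that compatibility normalization (NFKC) maps each of
    them back to the abstract letter.

```python
import unicodedata

# Logical-level encoding: one abstract letter, no contextual shape implied.
BEH = "\u0628"  # ARABIC LETTER BEH

# Compatibility characters that each encode one specific contextual shape
# (isolated, final, initial, medial).
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe(ch):
    """Return (code point, character name) for one character (hypothetical helper)."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


def to_logical(text):
    """Map presentation forms back to abstract letters using NFKC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


for form in BEH_FORMS:
    print(describe(form), "->", describe(to_logical(form)))
```

    Choosing which shape to draw for the abstract letter is left to the
    rendering software; nothing in the encoded text itself selects it.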

    > Yes, there are issues which end up being judgment calls, and it's
    > easy to come up with cases where you can't really capture the full
    > semantic intent of the author without what Unicode calls "rich text."

    We don't need to invent contrived examples for that. Every time an author
    uses italics or bolding to emphasize an essential point, he does something
    that cannot be captured in a plain version of the text. To
    make an even simpler point, if you insert an essential content image into a
    document you step outside the realm of plain text.

    I don't see any better definition for "plain text" than a negative one: it
    is text without formatting, except to the extent that forced line breaks and
    the choice of alternative forms for a character (to the extent that such
    differences are encoded in the character code) can be considered as
    formatting. "Plain text", though apparently a very simple concept, is a very
    abstract one. I don't think you can explain the concept to your neighbor
    while standing on one foot, if at all.

    Human writing did not originate as plain text, and at the surface level, it
    is never "plain text": it always has some specific physical appearance, and
    abstract "plain text" can only be found below the surface, as the underlying
    data format where only character identities (character numbers in a specific
    code) are encoded, with no reference to a particular rendering.
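
    As a small illustration of "character numbers in a specific code", the
    following sketch lists the Unicode code points behind a short string. The
    function name code_points and the sample string are hypothetical, chosen
    only for this example.

```python
def code_points(text):
    """Return the Unicode code point of each character, in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


sample = "Plain"
print(code_points(sample))

# The same character identities survive a round trip through an encoding
# form such as UTF-8; no font, size, or color is stored anywhere.
assert sample.encode("utf-8").decode("utf-8") == sample
```

    Whatever appearance the string takes on screen or paper comes from the
    rendering, not from these numbers.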

    -- 
    Yucca, http://www.cs.tut.fi/~jkorpela/ 
    

