Terminology (was Re: Proofreading fonts)

From: Gregg Reynolds (unicode@arabink.com)
Date: Tue Jul 12 2005 - 02:30:35 CDT

    Kenneth Whistler wrote:

    > O.k., but as you surmised in an earlier note, what you are trying
    > to do here is distinct from a *character* encoding of the sort
    > that the Unicode Standard does.


    One problem (IMO), not with Unicode per se, but with its metalanguage,
    is that we don't really have good technical terminology for many of the
    concepts involved in talk of written language and encoding. So I
    propose the following terminology, which I hope will be somewhat useful:

    1. Unicode "character" => gramma (pl. grammata)

    2. Unicode "plaintext" => shallowtext (surfacetext?)

    3. Unicode "markup" => [restrict this term to its literal meaning, i.e.
    marking up text by adding more text elements ("characters")]

    4. semantic "character" => grammeme (better than sememe?)

    5. grammemic text => deeptext

    Motivation: Unicode uses these terms with a restricted, technical
    meaning. Unfortunately, they are common words with wider denotations
    and lots of (culturally-dependent) connotations. "Character" in
    particular is very complex. In my estimation, most people think of some
    combination of gramma and grammeme when they hear the word "character".
    (There's an interesting discussion to be had about the inner lives of
    characters, but that's for another thread. I'll just point out that in
    many religious traditions "characters" are almost mystical critters, and
    for good reason.)

    So now I can (I hope) articulate more precisely (and abstractly) some
    assertions I've made elsewhere about the relation between Unicode and
    various written language communities:

    Proposition A: the relation between shallowtext and deeptext is not
    uniform across written languages.

    Proposition B: it is possible to classify written languages according
    to the type of encoding design that best reflects the semiotic operation
    of the written language. E.g., English is a shallowtext language, and
    (written) Arabic is a deeptext language. Which is another way of saying
    individual grammata in Arabic have broader/deeper/more complex meaning
    than the grammata of English.

    Corollary: a shallowtext encoding "works" best for a wlanguage like
    English, in that it doesn't omit any of the semiotic operations of the
    written text. It doesn't work as well for a deeptext wlanguage like
    Arabic, because it omits large chunks of meaning. That is, the grammata
    of written Arabic carry a heavier semantic load than the grammata of
    written English, but shallowtext encodings explicitly ignore that load,
    whereas a deeptext encoding can capture it.

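
    The contrast might be sketched concretely. Below is a toy illustration,
    not a proposal: the "deeptext" record, its field names, and the pattern
    notation are invented for the sake of the example; only the Unicode code
    points and the well-known k-t-b root of kitab are factual.

```python
# "Shallowtext": just the sequence of grammata (Unicode code points).
shallow = "كتاب"  # kitab, "book"
print([hex(ord(c)) for c in shallow])  # U+0643 U+062A U+0627 U+0628

# A hypothetical "deeptext" record pairs the surface grammata with
# grammemic analysis -- here, the triliteral root k-t-b. The field
# names and the pattern notation are illustrative only.
deep = {
    "surface": shallow,
    "root": ("ك", "ت", "ب"),  # the radicals; the semantic core
    "pattern": "CiCaaC",       # noun pattern (rough transliteration)
}
```

    A shallowtext encoding stores only the first representation; the
    semantic load described above lives in the extra fields of the second.
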
    > of course.) It doesn't get into issues of morphological or
    > phonological analysis, nor should it, in my assessment.

    For English, no. But I think you have to ask how such analysis is
    related to literacy. You can't be literate in Arabic if you can't
    recognize the morphological and phonological structure of written words.
    In contrast to English, such meanings are often borne by single characters.

    > What you are presenting might well be a very interesting and useful
    > way to represent Arabic text, but from the Unicode point-of-view
    > it is a *markup* of the plain text with more information beyond
    > what is simply carried by the surface form of the letters.

    I understand your meaning, but strictly speaking this begs the
    (metaphysical?) question of just what information "is simply carried by
    the surface form of the letters". I think a pretty good argument could
    be made that the surface form of the letters carries both nothing and
    everything. Nothing, because letters only operate within a semiotic
    system (which includes deep orthography, morphology, etc.); and
    everything, because, well, if you can analyze the semiotic operations of
    a letter (or the surface form thereof), then it must be that the letter
    carries all of those operations (meanings). :) I suppose one has to
    ask "who wants to know?"; a literate might "see" lots of meaning in the
    surface form; somebody who has simply memorized the letterforms but
    doesn't know the language will "see" only the surface gramma.

    I think the Unicode point of view would be that the surface form carries
    no semantics, no?

    > The important thing, from my point of view, is that this kind
    > of issue and this kind of representation of text is not
    > a character encoding issue per se, but rather builds on top
    > of the character encoding to present a deeper analysis of the
    > text that carries information not simply the result of the
    > identification of the characters alone.

    That's one (legit) way of looking at it. But you can turn it on its
    head, as well. I.e. a shallowtext (grammata) encoding necessarily
    piggybacks on a (possibly implicit) deeptext understanding. Which I
    guess is maybe another way of saying that "identification of the
    characters alone" depends on an implicit notion of deeptext. Maybe. I
    guess that's a hypothesis.

    > In principle, this is no different than color coding all the
    > "c's" in English text to indicate their different pronunciations,

    Yes and no. Structurally maybe. But pragmatically it's quite
    different. A phonocode for English might be useful for learners, but it
    wouldn't really be very useful for literates. It doesn't seem likely
    that very many people would be interested in, say, searching for all
    occurrences of "c" pronounced /k/. You wouldn't sort by pronunciation,
    usually. By contrast, explicitly encoding e.g. radicals for Arabic would
    be enormously useful for pretty much everybody. Dictionaries are
    organized by root structure, so if you can't pick out the radicals in a
    word, well good luck finding it in the dictionary.
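
    To make the pragmatic difference concrete, here is a toy sketch of what
    root-based lookup could look like if radicals were explicitly encoded.
    The word list and field names are invented for illustration, though the
    roots shown are the standard ones for those words.

```python
# Hypothetical root-annotated "deeptext" entries (illustrative data).
deeptext_words = [
    {"surface": "كتاب",  "root": "كتب"},  # kitab, "book"
    {"surface": "مكتبة", "root": "كتب"},  # maktaba, "library"
    {"surface": "درس",   "root": "درس"},  # dars, "lesson"
]

# Group surface forms under their root, as an Arabic dictionary does.
by_root = {}
for word in deeptext_words:
    by_root.setdefault(word["root"], []).append(word["surface"])

print(by_root["كتب"])  # all k-t-b words, whatever their surface shape
```

    With only shallowtext, the same lookup would first require a full
    morphological analyzer to recover the radicals.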

    (BTW, just in case it looks like I'm trying to be difficult: improved
    technical terminology and a clearly contrastive encoding design should
    make it easier to explain what Unicode is and isn't. So I hope it's useful.)


    This archive was generated by hypermail 2.1.5 : Tue Jul 12 2005 - 02:31:55 CDT