Re: Demystifying the Politburo (was: Re: Arabic encoding model (alas, static!))

From: Gregg Reynolds (
Date: Sat Jul 09 2005 - 09:46:16 CDT

    Kenneth Whistler wrote:
    > Gregg said:
    >>This gets back
    >>to the design principles (and the interests that drive them) of Unicode,
    >>which work better for some languages than others.
    > for some... writing systems than others.

    Ok, now this is an interesting point. I should have said "written
    languages"; you contrasted "writing systems". What's the diff?

    When I first got into this stuff in earnest, in the late 90s, right
    about the time I earned my "Ranting Unicode Newbie" wings, I ended up
    with the notion that written languages are, well, languages in graphical
    form. I recall having found solid theoretical support for this, but
    naturally the argument escapes me at the moment.

    In any case, I like "written language" because (to me at least) it
    implies that letterforms are part of a larger linguistic reality,
    related to the various levels of standard grammar, i.e. phonology,
    syntax. That's why I say Unicode works better for some written
    languages than others. To put it another way, different (spoken)
    language types have different semantic relations to their (written)
    graphical forms. From there we can go to different encoding
    philosophies. Unicode is one, based on a certain way of looking at
    things. Looking from a "written languages" perspective, one comes up
    with a different set of design principles. (Can you tell I'm struggling
    a bit to articulate just what the heck I have in mind?)

    In contrast, "writing system" captures a different set of ideas. It's
    the right term to use for Unicode, insofar as it does *not* imply
    anything about grammar, semantics, etc. To me Unicode looks like a kind
    of surface orthography, concerned only about the (abstract structure of)
    marks on the paper, not the linguistic structures represented by those
    marks. Hence "writing systems".

    In fact one could argue that letterforms are of secondary importance to
    written languages, just as digit forms are of secondary importance to
    mathematics. One could create a written English text using Arabic (or
    Korean, etc.) letterforms if one were so inclined (by which I *don't* mean
    transliteration); it would still be written English. Actually, the
    recent posting regarding the relation of Tamil grammar to letterforms
    captures the idea nicely.

    Which brings up the question of "what is a character, *really*?"

    Maybe "written languages use writing systems" is what I mean.

    Does that make any sense? (If any of you real linguists can pull the
    worms out of my head on this point please do. They're nice worms. If
    not, I'd appreciate help articulating this stuff cleanly. I think it
    might be helpful to newcomers.)

    > Furthermore, as much as it would be nice to have Arabic simply
    > be implemented consistently right-to-left, in any *practical*
    > implementation, you *must* deal with bidirectionality.

    Hmmm. I guess it depends on what you mean by "practical", and for whom.
    I can easily imagine monodirectional implementations being very
    practical for certain user groups. In fact, I use such software all the
    time: vim, which has non-bidi Arabic support, and
    (believe it or not) emacs, which has very strange but usable
    quasi-monodirectional Arabic support (lines flow left to right, words
    right to left. CRAZY.). I also use emacs to write Arabic with Latin-1
    characters. Written Arabic *language*, non-Arabic writing system? More
    on this later...

    > I realize that you think you may have a better mousetrap in
    > approaching the problem of encoding Arabic text than the
    > encoding used in the Unicode Standard --- but...

    Well, better for some things, maybe; but also based on different design
    principles. I expect to post a webpage soon with a whole bunch of CRAZY
    ideas for encoding (written) Arabic. No fewer than 19 - count 'em,
    nineteen! - key design decisions! Some of them may even fit into the
    Unicode model. (Much of the fun of speculative Arabic encoding design
    derives from the fact that traditional Arabic grammar finds lots of
    semantics in individual characters. In fact, I'll bet a strong argument
    could be made that every Arabic letterform in a text has either
    graphotactic (empty) semantics or morphemic semantics.)

    > However you cut the pie, you are still faced with the
    > difficulties that the script presents you in dealing with
    > the basic information processing requirements: keyboard
    > input, text storage, searching, sorting, editing, layout
    > and rendering, and so on. The whole stack of information
    > processing has to function -- and has to function in the
    > context of existing software systems, data storage technologies,
    > databases, fonts, libraries, internet protocols, and on and
    > on ... or you haven't got any solution at all. Just ideas
    > and a theory.

    Running code always wins. I've actually given this quite a bit of
    thought. (You'd be amazed what you can do with just emacs in the way of
    encoding experimentation.) Unfortunately I'm a rather lazy hacker so
    I'll have to depend on the kindness of strangers.

