From: Gregg Reynolds (firstname.lastname@example.org)
Date: Tue Jul 12 2005 - 02:30:35 CDT
Kenneth Whistler wrote:
> O.k., but as you surmised in an earlier note, what you are trying
> to do here is distinct from a *character* encoding of the sort
> that the Unicode Standard does.
One problem (IMO), not with Unicode per se, but with its metalanguage,
is that we don't really have good technical terminology for many of the
concepts involved in talk of written language and encoding. So I
propose the following terminology, which I hope will be somewhat clearer:
1. Unicode "character" => gramma (pl. grammata)
2. Unicode "plaintext" => shallowtext (surfacetext?)
3. Unicode "markup" => [restrict this term to its literal meaning, i.e.
marking up text by adding more text elements ("characters")]
4. semantic "character" => grammeme (better than sememe?)
5. grammemic text => deeptext
Motivation: Unicode uses these terms with a restricted, technical
meaning. Unfortunately, they are common words with wider denotations
and lots of (culturally-dependent) connotations. "Character" in
particular is very complex. In my estimation, most people think of some
combination of gramma and grammeme when they hear the word "character".
(There's an interesting discussion to be had about the inner lives of
characters, but that's for another thread. I'll just point out that in
many religious traditions "characters" are almost mystical critters, and
for good reason.)
So now I can (I hope) articulate more precisely (and abstractly) some
assertions I've made elsewhere about the relation between Unicode and
various written language communities:
Proposition A: the relation between shallowtext and deeptext is not
uniform across written languages.
Proposition B: it is possible to classify written languages according
to the type of encoding design that best reflects the semiotic operation
of the written language. E.g., English is a shallowtext language, and
(written) Arabic is a deeptext language. Which is another way of saying
individual grammata in Arabic have broader/deeper/more complex meaning
than the grammata of English.
Corollary: a shallowtext encoding "works" best for a wlanguage like
English, in that it doesn't omit any of the semiotic operations of the
written text. It doesn't work as well for a deeptext wlanguage like
Arabic, because it omits large chunks of meaning. That is, the grammata of
written Arabic carry a heavier semantic load than the grammata of
written English, but shallowtext encodings explicitly ignore that load,
whereas a deeptext encoding can capture it.
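To make the corollary concrete, here is a minimal sketch (my own illustration, not a proposal for an actual encoding; the field names "surface", "root", and "pattern" are invented) contrasting a shallowtext record with a hypothetical deeptext record for the Arabic word kitab ("book"):

```python
# Shallowtext: just the sequence of grammata (Unicode code points).
shallow = "\u0643\u062A\u0627\u0628"  # kaf, ta, alif, ba

# Deeptext: the same surface string plus the grammemic analysis that a
# literate reader supplies implicitly when reading the shallow form.
deep = {
    "surface": shallow,
    "root": ("\u0643", "\u062A", "\u0628"),  # the triliteral root k-t-b
    "pattern": "CiCaaC",                     # a conventional noun pattern
}

# A shallowtext encoding preserves only the "surface" field; the other
# fields are exactly the heavier semantic load the corollary describes.
assert deep["surface"] == shallow   # the surface is fully recoverable
print(deep["root"])                 # the root is not, from shallow alone
```

The point of the sketch is only that the shallow record is a strict projection of the deep one: you can always throw the analysis away, but you cannot mechanically get it back.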
> of course.) It doesn't get into issues of morphological or
> phonological analysis, nor should it, in my assessment.
For English, no. But I think you have to ask how such analysis is
related to literacy. You can't be literate in Arabic if you can't
recognize the morphological and phonological structure of written words.
In contrast to English, such meanings are often borne by single characters.
> What you are presenting might well be a very interesting and useful
> way to represent Arabic text, but from the Unicode point-of-view
> it is a *markup* of the plain text with more information beyond
> what is simply carried by the surface form of the letters.
I understand your meaning, but strictly speaking this begs the
(metaphysical?) question of just what information "is simply carried by
the surface form of the letters". I think a pretty good argument could
be made that the surface form of the letters carries both nothing and
everything. Nothing, because letters only operate within a semiotic
system (which includes deep orthography, morphology, etc.); and
everything, because, well, if you can analyze the semiotic operations of
a letter (or the surface form thereof), then it must be that the letter
carries all of those operations (meanings). :) I suppose one has to
ask "who wants to know?"; a literate might "see" lots of meaning in the
surface form; somebody who has simply memorized the letterforms but
doesn't know the language will "see" only the surface gramma.
I think the Unicode point of view would be that the surface form carries
no semantics, no?
> The important thing, from my point of view, is that this kind
> of issue and this kind of representation of text is not
> a character encoding issue per se, but rather builds on top
> of the character encoding to present a deeper analysis of the
> text that carries information not simply the result of the
> identification of the characters alone.
That's one (legit) way of looking at it. But you can turn it on its
head, as well. I.e. a shallowtext (grammata) encoding necessarily
piggybacks on a (possibly implicit) deeptext understanding. Which I
guess is maybe another way of saying that "identification of the
characters alone" depends on an implicit notion of deeptext. Maybe. I
guess that's a hypothesis.
> In principle, this is no different than color coding all the
> "c's" in English text to indicate their different pronunciations,
Yes and no. Structurally maybe. But pragmatically it's quite
different. A phonocode for English might be useful for learners, but it
wouldn't really be very useful for literates. It doesn't seem likely
that very many people would be interested in, say, searching for all
occurrences of "c" pronounced /k/. You wouldn't sort by pronunciation,
usually. By contrast, explicitly encoding e.g. radicals for Arabic would
be enormously useful for pretty much everybody. Dictionaries are
organized by root structure, so if you can't pick out the radicals in a
word, well good luck finding it in the dictionary.
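A toy example of what root-based lookup buys you (all the annotations here are supplied by hand, which is the point: a deeptext encoding would carry them explicitly, a shallowtext one forces the reader or software to reconstruct them):

```python
KTB = ("\u0643", "\u062A", "\u0628")  # the root k-t-b, "writing"

# A tiny hand-annotated lexicon: surface form plus explicit root.
lexicon = [
    {"surface": "\u0643\u062A\u0627\u0628", "root": KTB},  # kitaab, "book"
    {"surface": "\u0645\u0643\u062A\u0628", "root": KTB},  # maktab, "office"
    {"surface": "\u0643\u062A\u0628",       "root": KTB},  # kataba, "he wrote"
    {"surface": "\u0628\u0627\u0628",
     "root": ("\u0628", "\u0648", "\u0628")},              # baab, "door"
]

def by_root(root):
    """Dictionary-style lookup: group words by radicals, not surface form."""
    return [entry["surface"] for entry in lexicon if entry["root"] == root]

matches = by_root(KTB)
# All three derivations of k-t-b are found, including maktab, whose
# surface form does not even begin with kaf; a plain substring search
# over the shallowtext could not group these.
print(len(matches))  # 3
```

This is the dictionary-organization argument in miniature: the grouping key is the root, and the root is nowhere in the shallowtext except by implication.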
(BTW, just in case it looks like I'm trying to be difficult: improved
technical terminology and a clearly contrastive encoding design should
make it easier to explain what Unicode is and isn't. So I hope it's useful.)
This archive was generated by hypermail 2.1.5 : Tue Jul 12 2005 - 02:31:55 CDT