From: Kenneth Whistler (firstname.lastname@example.org)
Date: Mon Jul 11 2005 - 21:01:02 CDT
> Ok, you asked for it. Here's an example taken from my own little
> speculative semantic encoding design for Arabic. Soon to be inflicted
> on an innocent world.
> The letterform waw U+0648 has at least four distinct functions in
> written Arabic.
O.k., but as you surmised in an earlier note, what you are trying
to do here is distinct from a *character* encoding of the sort
that the Unicode Standard does.
The Unicode encoding sees a waw in the written form, and represents
that by a waw in the text representation, with a single waw
character encoded. (Compatibility presentation form gorp aside,
of course.) It doesn't get into issues of morphological or
phonological analysis, nor should it, in my assessment.
What you are presenting might well be a very interesting and useful
way to represent Arabic text, but from the Unicode point-of-view
it is a *markup* of the plain text with more information beyond
what is simply carried by the surface form of the letters.
Another way to look at it is simply to correlate your Latin-1
transliteration scheme with the plain text representation, and
treat that correlation as the markup (however it is implemented):
1. waw-rad: waw --> W
2. waw-nonrad: waw --> w
3. sister of damma: waw --> û (Latin-1 u-circumflex, in case anyone
gets character hash here)
4. lazy waw: waw --> o
As long as your markup scheme synchronizes the plain text element
on the left with your Latin-1 transcriptional equivalent on the
right, by whatever means, you have the piece of information
then available to make the distinctions you are after.
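To make that synchronization concrete, here is a minimal sketch in Python (a toy data model of my own devising, not anything from the Unicode Standard or the original proposal): each plain-text character is paired with an optional function tag, and the tag drives the Latin-1 transliteration of waw per the four-way mapping above.

```python
WAW = "\u0648"  # ARABIC LETTER WAW

# Latin-1 transliteration of waw by tagged function, per the list above.
WAW_TRANSLIT = {
    "rad": "W",         # 1. waw-rad
    "nonrad": "w",      # 2. waw-nonrad
    "damma": "\u00fb",  # 3. sister of damma (u-circumflex)
    "lazy": "o",        # 4. lazy waw
}

def transliterate(annotated):
    """annotated: list of (char, tag) pairs; tag is None for untagged text."""
    return "".join(
        WAW_TRANSLIT[tag] if ch == WAW and tag in WAW_TRANSLIT else ch
        for ch, tag in annotated
    )

# Two differently tagged waws around an untagged lam (U+0644):
sample = [(WAW, "rad"), ("\u0644", None), (WAW, "lazy")]
print(transliterate(sample))  # waw-rad -> W, lazy waw -> o
```

The point of the sketch is only that the plain text survives untouched (strip the tags and you get back the encoded waws), while the tags carry the extra analytical layer.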
How that is *rendered* then is "an exercise left for the implementer".
*hehe*. It could be simply interlinear annotation, or it could be
popup tooltips, or it could be the kind of hacked-up font you
were all talking about that would visually diacriticalize waw's of
different types. Or you could just color code the text, separating
out all the radical waw's in green, and the lazy waw's in pink, or
whatever, based on the information you have represented in the markup.
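For the color-coding option, a terminal toy along these lines would do (ANSI escape codes; the green/pink assignment follows the text above, but the tag names are just my assumptions carried over from the transliteration list):

```python
WAW = "\u0648"  # ARABIC LETTER WAW

# ANSI SGR escapes: green for radical waws, pink (bright magenta) for lazy waws.
GREEN, PINK, RESET = "\x1b[32m", "\x1b[95m", "\x1b[0m"
COLOR = {"rad": GREEN, "lazy": PINK}

def colorize(annotated):
    """annotated: list of (char, tag) pairs; untagged chars pass through."""
    return "".join(
        COLOR[tag] + ch + RESET if tag in COLOR else ch
        for ch, tag in annotated
    )

print(colorize([(WAW, "rad"), ("\u0644", None), (WAW, "lazy")]))
```

Again, the rendering is pure presentation layered on the markup; the underlying character stream is just waw, lam, waw.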
The important thing, from my point of view, is that this kind
of issue and this kind of representation of text is not
a character encoding issue per se, but rather builds on top
of the character encoding to present a deeper analysis of the
text that carries information that is not simply the result of
identifying the characters alone.
In principle, this is no different than color coding all the
"c's" in English text to indicate their different pronunciations,
for example -- which could also be carried around by
subcategorizing and marking them up with phonetic information,
including c's participation in digraphs:
[ch](=esh)ute [ch](=k)yle [ch](=t-esh)ime
and so on and so on.
This archive was generated by hypermail 2.1.5 : Mon Jul 11 2005 - 21:02:00 CDT