From: Gregg Reynolds (unicode@arabink.com)
Date: Mon Jul 11 2005 - 20:01:22 CDT
Asmus Freytag wrote:
> At 03:26 PM 7/11/2005, Peter Kirk wrote:
>
>> In fact I think Gregg started this thread with a bad example. The two
>> encodings for a with circumflex are canonically equivalent and so
>> different encodings of the same data. The cases Gregg really needs to
>> deal with are when the alternatives are not canonically equivalent but
>> semantically distinct.
>
It was a great example! I just didn't make myself clear. ;) I meant
it as a graphic design problem, not as a practical problem to be solved.
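A minimal sketch of the canonical-equivalence point Peter raises above, using Python's standard unicodedata module. The helper name canonically_equivalent is only illustrative; it compares the NFD forms of two strings.

```python
import unicodedata

PRECOMPOSED = "\u00e2"   # LATIN SMALL LETTER A WITH CIRCUMFLEX
DECOMPOSED = "a\u0302"   # LATIN SMALL LETTER A + COMBINING CIRCUMFLEX ACCENT


def canonically_equivalent(s: str, t: str) -> bool:
    """True if the two strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


if __name__ == "__main__":
    # Different code point sequences, same canonical form.
    print(PRECOMPOSED == DECOMPOSED)
    print(canonically_equivalent(PRECOMPOSED, DECOMPOSED))
```

Normalization settles this case mechanically, which is why it does not by itself show encodings that are semantically distinct.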
>
> I'm still waiting for an actual (or correctly contrived) example.
>
Ok, you asked for it. Here's an example taken from my own little
speculative semantic encoding design for Arabic. Soon to be inflicted
on an innocent world.
The letterform waw U+0648 has at least four distinct functions in
written Arabic.
1. waw-rad. latin-1 translit: W; phono: consonant /w/; semantics:
radical; e.g. Wjd وجد /wajada/; shows up in the dictionary under the
letter waw.
2. waw-nonrad. latin-1 translit: w; phono: consonant /w/; semantics:
non-radical; e.g. bwâdr بوادر /bawâdir/; shows up under b-d-r, the waw
is ignored for (first-level) lexical lookup.
3. sister of damma. latin-1 translit: û; phono: short vowel /u/;
semantics: non-lexical (though it can change meaning within a lexical
category, e.g. from active to passive voice); e.g. mktûb
مكتوب /maktoob/; like damma, does not affect lexical ordering (except
within subentries under the root k-t-b); mnemonic: called sister of
damma because it always comes after damma (which may not be written
explicitly) and denotes a lengthening of the vowel /u/.
4. lazy waw. latin-1 translit: o; phono: null; semantics: null; e.g. bo's
بؤس /bu's/ where ' is hamza; purely graphotactic; mnemonic: too lazy to
bear the burden of phonological or lexical meaning; too lazy to grow the
tail that would make it look like a real waw.
Ok, so now we have four different encoding elements. BTW, they don't
have to map to single codepoints. My scheme maps them to latin-1, for
the transliteration. They could be mapped to PUA points, or to XML
elements. In any case, they all have the same typographic denotation,
namely waw U+0648. But you probably would have a hard time writing
software that could automatically check spelling/encoding, and since
all four look the same on the page, a human proofreader can't tell
which one was entered either. So you need a font with four almost but
not quite identical waw glyphs. I think. For example, lazy waw might
use a small subfixed ring or null sign.
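Here is a rough Python sketch of the mapping just described: four encoding elements, keyed by their latin-1 transliteration, that all share the one letterform U+0648. The class, table, and function names are assumptions made for this illustration, not part of the actual scheme.

```python
# Illustrative sketch only: the class, table, and function names are
# assumptions made for this example, not part of the scheme described above.
from dataclasses import dataclass

WAW = "\u0648"  # ARABIC LETTER WAW, the shared letterform


@dataclass(frozen=True)
class WawElement:
    name: str       # label used in the list above
    translit: str   # latin-1 transliteration character
    phono: str      # phonological value
    semantics: str  # lexical role


ELEMENTS = [
    WawElement("waw-rad", "W", "consonant /w/", "radical"),
    WawElement("waw-nonrad", "w", "consonant /w/", "non-radical"),
    WawElement("sister of damma", "\u00fb", "short vowel /u/", "non-lexical"),
    WawElement("lazy waw", "o", "null", "null"),
]


def letterform(element: WawElement) -> str:
    """Project a semantic element onto its typographic denotation."""
    return WAW  # every element is drawn as the same letterform


def candidates(ch: str) -> list[WawElement]:
    """Return every semantic element that a rendered character could stand for."""
    return [e for e in ELEMENTS if letterform(e) == ch]


if __name__ == "__main__":
    for e in candidates(WAW):
        print(f"U+{ord(WAW):04X} <- {e.translit!r} ({e.name})")
```

The forward projection is trivial, while the inverse is one-to-four. That ambiguity is why the rendered text alone cannot be checked, and why visually distinct glyphs would help.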
-gregg