RE: Latin ligatures and Unicode

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Wed Dec 22 1999 - 14:51:08 EST

Next message: John Cowan: "Re: Latin ligatures and Unicode"
Previous message: Doug Ewell: "Re: Where to Add new Currency Sign?"
Maybe in reply to: Eberhard Pehlemann: "Latin ligatures and Unicode"
Next in thread: John Cowan: "Re: Latin ligatures and Unicode"

> -----Original Message-----
> From: John Cowan [mailto:jcowan@reutershealth.com]
> Sent: Wednesday, December 22, 1999 11:44 AM
>
> "Reynolds, Gregg" wrote:
>
> > But with ZWNBSP, we have no semantics with respect to joining behavior,
> > or if we do it's well-hidden.
>
> ZWNBSP has no effect on joining behavior, correct. You were saying

Not "no effect", but "no semantics". "No semantics" means (to me) that in
practice implementers get to do as they please. If the standard says, as
Mark just noted in a message, that they are to be ignored for the purposes
of join analysis, then I stand corrected; but I haven't been able to find
anything (admittedly I'm looking at v. 2) that says this. Wouldn't be
surprised to find it's in there somewhere, but I would like to know where.

> that kitAbuhA should be expressed as kitAbu+hA where + is some sort
> of word boundary that doesn't involve whitespace.

More precisely, that doesn't alter the normal joining behavior, i.e. is
purely semantic. (Sometimes the characters involved don't join, so
typographic whitespace would be involved.)
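
As a rough illustration of a boundary that "doesn't alter the normal joining
behavior", here is a minimal Python sketch of join analysis for the kitAbuhA
example (consonants only, short vowels omitted). It assumes that ZWNBSP is
treated as transparent, i.e. skipped when looking for a letter's neighbours,
which is the behaviour Mark describes. The small joining-type table and the
function name `letter_forms` are illustrative only, not quoted from the
standard.

```python
ZWNJ, ZWNBSP = "\u200c", "\ufeff"

# Joining types for the letters of kitAbuhA only (short vowels omitted).
# D = dual-joining, R = right-joining, T = transparent.
# Treating ZWNBSP as "T" is the assumption under discussion.
JOINING_TYPE = {
    "\u0643": "D",  # KAF
    "\u062a": "D",  # TEH
    "\u0627": "R",  # ALEF
    "\u0628": "D",  # BEH
    "\u0647": "D",  # HEH
    ZWNBSP: "T",
}


def letter_forms(text):
    """Return the contextual form of each Arabic letter in text."""
    # Anything not in the table (including ZWNJ) counts as non-joining "U".
    types = [JOINING_TYPE.get(ch, "U") for ch in text]
    forms = []
    for i, t in enumerate(types):
        if t not in ("D", "R"):
            continue  # only letters take a form
        # Nearest neighbours in logical order, skipping transparent characters.
        prev = next((x for x in reversed(types[:i]) if x != "T"), "U")
        nxt = next((x for x in types[i + 1:] if x != "T"), "U")
        joins_prev = prev == "D"
        joins_next = t == "D" and nxt in ("D", "R")
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


kitabu, ha = "\u0643\u062a\u0627\u0628", "\u0647\u0627"
plain = letter_forms(kitabu + ha)
with_zwnbsp = letter_forms(kitabu + ZWNBSP + ha)
with_zwnj = letter_forms(kitabu + ZWNJ + ha)
```

Under this assumption the ZWNBSP version shapes exactly like the plain word
(beh initial, heh medial), while ZWNJ breaks the join (beh isolated, heh
initial).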

> ZWNBSP is precisely
> a word boundary that doesn't generate any whitespace, neither
> horizontal nor vertical. The b and h would remain joined, but
> word-boundary analysis would show two words here.

Well, sort of, but "word" isn't sufficient. You'd still want to be able to
distinguish between distinct lexemes packaged as a single lexigraphic word
and lexigraphic words - ordinarily whitespace delimited, but then again,
because Arabic encodes word boundaries in the letterforms themselves, one
could also remove SP word boundaries and use ZWSP (I think ;). In terms of
a user interface:

    find-lexigraph-only("kitAbu+hum")
        ==> does not match e.g. "fa-kitAbu+hum"

    find-lexeme("kAtibu")
        ==> matches "fa-kAtibu+hum", "kAtibu+hA", "al-kAtibu", etc.
        ==> does not match "kAtibuwna"??? (no LEXDELIM after 'u')
        ==> does not match "makAtibu+hum" ???

The last one - makAtibu+hum - is an interesting case; "kAtibu" may or may
not be construed as a distinct lexeme in it, depending on how you look at
it, but the "ma-" prefix is a morpheme, definitely not an independently
meaningful lexeme. Still, decomposing such a form into its constituent root
(k,t,b) and theme (ma-prefix, internal shape) is utterly elementary for
anybody with a little Arabic. That's how it would be entered and looked up
in dictionaries, for example.
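
The two lookups above can be sketched in Python. This assumes the
transliteration convention used in the examples, with '-' and '+' standing in
for a LEXDELIM character. The function names and the sample word list are
illustrative only, and the strict whole-segment match is just one possible
reading of the cases marked "???".

```python
import re

# '-' and '+' stand in for a lexeme delimiter (LEXDELIM) in transliteration.
LEXDELIM = re.compile(r"[-+]")


def lexemes(word):
    """Split one lexigraphic word into its delimited lexemes."""
    return [part for part in LEXDELIM.split(word) if part]


def find_lexigraph_only(query, words):
    """Match whole lexigraphic words only."""
    return [w for w in words if w == query]


def find_lexeme(query, words):
    """Match words that contain query as a complete delimited lexeme."""
    return [w for w in words if query in lexemes(w)]


words = ["kitAbu+hum", "fa-kitAbu+hum", "fa-kAtibu+hum", "kAtibu+hA",
         "al-kAtibu", "kAtibuwna", "makAtibu+hum"]
print(find_lexigraph_only("kitAbu+hum", words))
print(find_lexeme("kAtibu", words))
```

With this strict reading, "kAtibuwna" and "makAtibu+hum" are not found by the
lexeme search, because "kAtibu" never appears in them as a complete delimited
segment. Finding them would need morphological analysis beyond a delimiter.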

I'm not sure the ZW... stuff would provide sufficient nuance for this kind
of stuff, since you could use it both within and between "words".

> > But more to the point, I would argue that "use" and "interpretation"
> > are and should be distinct. An encoding should provide a semantics,
> > not usage guidelines.
>
> Unicode has historically provided usage guidance, not semantics (still
> less formal semantics).

Seems a shame; a little formal semantics would go a long way.

>
> > Example: li-al-HayAt, "to Al-HayAt", as in "Write to Al-HayAt for more
> > info" (it's a newspaper). To indicate that al-HayAt is a proper name,
> > you enclose it in guillemets or some other typographic quoting figures;
> > since "li-al" is ligated, this means you have to break the join. "li-"
> > then takes initial form, as does the following alif of "-al". "li" and
> > "al" also happen to be distinct lexemes, so we want them both demarcated
> > as such. How would you encode that, both with and without the
> > guillemets?
>
> So you want a ZWNBSP between "al" and "HayAt" in any case, and between
> "li" and "al" if there is no punctuation. Inserting the guillemets
> should provoke the correct shaping results. No need for ZWJ or ZWNJ here.

But the "li-" in "li-<<-al-..." must be lam-initial, so I think ZWJ would be
the thing for it. Otherwise wouldn't the guillemets send it into isolate
form?
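
The isolate-form worry can be checked with a small Python sketch. It assumes
the usual rule that punctuation such as a guillemet is non-joining while ZWJ
is join-causing. The helper `lam_form` and its tiny joining-type table are
hypothetical and cover only this example.

```python
ZWJ = "\u200d"
LAM, ALEF = "\u0644", "\u0627"
GUILLEMET = "\u00ab"

# D = dual-joining, R = right-joining, C = join-causing.
# Characters missing from the table, such as guillemets, are non-joining.
JOINING_TYPE = {LAM: "D", ALEF: "R", ZWJ: "C"}


def lam_form(before, after):
    """Form of a LAM given its logical neighbours ('' means no neighbour)."""
    prev_type = JOINING_TYPE.get(before, "U")
    next_type = JOINING_TYPE.get(after, "U")
    joins_prev = prev_type in ("D", "C")
    joins_next = next_type in ("D", "R", "C")
    if joins_prev and joins_next:
        return "medial"
    if joins_prev:
        return "final"
    if joins_next:
        return "initial"
    return "isolated"


print(lam_form("", GUILLEMET))  # lam directly before the guillemet
print(lam_form("", ZWJ))        # lam, then ZWJ, then the guillemet
```

In this model the bare lam before a guillemet comes out isolated, and
inserting ZWJ after it yields the initial form.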

> > I suspect I could come up with examples where ZWNBSP could divide a
> > single lexigraphic word into two parts, both of which could be
> > interpreted as distinct lexigraphic words, in which case an
> > implementation could either join or not join and still get readable
> > Arabic.
>
> To get non-joining behavior, you need both ZWNJ (for non-joining) and
> ZWNBSP (for word separation).
>
> > > > In this example, ZWJ falls between two characters of the joining
> > > > class; it has no effect on their form, and the ligation is formed.
> > >
> > > Then there is no point in it, at least not according to the
> > > standard definitions.
> >
> > See above; semantics doesn't (shouldn't) address issues of utility.
>
> It occurs to me at this point that we may be on different tracks.
> ZWJ and ZWNJ are about *shaping* stricto sensu: they have to do
> with whether initial, medial, final, or isolated forms are chosen.
> Arabic ligatures as such are *not* affected.

My mistake; I tend to think of "ligature" as meaning "tie stroke", and
"collocation" as what is usually meant by "ligature" in the context of
Unicode. I'm referring to the former, not the latter. I think I need to
get some examples typeset.

> > > > While we're at it, we also need a way to stretch the space between
> > > > two adjacent Arabic letterforms that don't join, but without
> > > > introducing word separation. Tatweel would work just fine if
> > > > marking semantics were made dependent on syntactic context - i.e.
> > > > it should not be considered "join-causing"; its semantics should
> > > > simply be "stretch whatever's there, be it whitespace or a ligating
> > > > stroke."
> > >
> > > That is the function of NBSP.
> >
> > Same problem with the relation of joining, spacing, and word
> > boundaries. Might be just an issue of making what's implicit explicit:
> > place NBSP (and ZWNBSP) in the dual-join category. Then it has the same
> > semantics as ZWJ, with NB added.
>
> Every non-Arabic character (except ZWJ itself) is non-joining. The whole
> notion of joining *across* visible whitespace makes little sense to me.

Remember whitespace is (typographically) negative space. The tie stroke in a
string of Arabic letterforms is best thought of (IMHO) as a bridge across
the void. How's that for dramatic? So a tatweel (which, btw, does not
refer to a unit stroke or character in Arabic, but to a general verbal noun
"lengthening") would be more accurately and apocalyptically construed as a
widening of the void, indirectly requiring the heroic, Nietzschean extension
of the connecting uberstroke. Oh, the humanity!

> The function of NBSP is to create visible whitespace without a word
> boundary. If you want a connecting line, use TATWEEL.
>
> > What does it mean to put a space of any kind between two ligated
> > letterforms?
>
> SP is also a non-joining character. I thought you were asking about
> isolated diacritics, which are represented by SP+diacritic.
>

In some cases one may want to place diacritics over some whitespace or a
tatweel stroke, within a word. So you wouldn't want a SP there. The best
solution, I think, would be to define tatweel as, essentially, NBSP of class
dual join. Then it stretches things out, preserves joining behaviour, and
carries a diacritic if asked to.

Anyway thanks (you and Mark) for the feedback; this gives me some stuff to
ponder over the holidays. Obviously need to finish some samples too. (I'm
out for about three weeks starting tomorrow.)

Thanks and happy holidays,

-gregg


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:57 EDT