Re: combining characters using ZWJ

From: Eric Muller (emuller@adobe.com)
Date: Sat Jan 28 2006 - 13:41:50 CST

Next message: Sandeep Srivastava: "Re: combining characters using ZWJ"

Previous message: Mark Davis: "Re: combining characters using ZWJ"
In reply to: Sandeep Srivastava: "combining characters using ZWJ"
Next in thread: Sandeep Srivastava: "Re: combining characters using ZWJ"
Reply: Sandeep Srivastava: "Re: combining characters using ZWJ"

In the context of Unicode, it is important to distinguish ligatures
which have only a graphic motivation from ligatures which have a
"semantic impact".

The common "ff" ligature for example is all about solving a graphic
design problem, namely when the shape of a single "f" is such that
putting two in a row is ugly. In some font designs, two single "f" in a
row are not a problem at all, and such fonts do not need an "ff"
ligature at all.

The "œ" ligature, on the other hand, has a "semantic impact". In the
French orthography I learned at school, some words need to be spelled
with œ (cœur, bœuf) and other words with oe (coexister). Coeur, boeuf,
cœxister would all be considered mistakes (I don't know of a minimal
pair, i.e. of two words that differ exactly by œ vs. oe). Therefore,
pretty much all fonts need to have an "œ" ligature, regardless of
whether "oe" is graphically problematic or not. [I qualified "French
orthography" by "[that] I learned at school" because orthographies do
change, either de jure or de facto, and we certainly see tremendous
changes with instant messaging and Internet games.]

This leads to the following rule of thumb in Unicode: ligatures of the
first kind are not inherent to the text being written, and therefore do
not need their own code points; ligatures of the second kind are
inherent and need their own code points. In fact, we do have U+0153 œ
LATIN SMALL LIGATURE OE and U+00E6 æ LATIN SMALL LETTER AE as regular
characters, without decompositions (canonical or other). U+FB00 ﬀ LATIN
SMALL LIGATURE FF is justified not by its "semantic impact" but by
compatibility with legacy character standards and it does have a
compatibility decomposition; for the purpose of this discussion, this
character and its friends can be ignored.
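
As an illustration of the point about decompositions, here is a minimal
Python sketch that queries the Unicode Character Database through the
standard unicodedata module. The helper name describe is hypothetical,
and the exact values reported depend on the Unicode version bundled with
your Python build.

```python
import unicodedata

def describe(ch):
    """Return (name, raw decomposition field, NFKC form) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.decomposition(ch),  # empty string means "no decomposition"
        unicodedata.normalize("NFKC", ch),
    )

# U+0153 and U+00E6 are the "semantic" ligatures; U+FB00 is the legacy one.
for ch in ("\u0153", "\u00e6", "\ufb00"):
    name, decomp, nfkc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: decomposition={decomp!r}, NFKC={nfkc!r}")
```

The first two characters should report an empty decomposition and
survive compatibility normalization unchanged, while U+FB00 should fold
to the two-letter sequence "ff".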

Back to your question, if you want æ for the second reason, then you
really want to use U+00E6 æ LATIN SMALL LETTER AE. If on the other hand
you want a ligature of a and e for graphic reasons (and in the
orthography you use, that does not interfere with an æ ligature of the
semantic kind), then you really want to use U+0061 a LATIN SMALL LETTER
A, U+0065 e LATIN SMALL LETTER E; the best you can do to encourage the
rendering system to use a ligature is to insert ZWJ between "a" and
"e", and you can discourage the formation of a ligature by inserting
ZWNJ. However, that does not guarantee the result: a
rendering system is free to ignore your request (it's even free to
ignore it on even pages and satisfy it on odd pages - as far as Unicode
is concerned, of course).
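
The difference between the two encodings can be sketched in Python as
follows. This only shows what ends up in the text; whether a ligature
actually appears is up to the rendering system. The helper names
with_joiner and strip_joiners are hypothetical, chosen for this
illustration.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def with_joiner(left, right, joiner):
    """Place a joiner between two letters; this is a request, not a guarantee."""
    return left + joiner + right

def strip_joiners(text):
    """Remove ZWJ and ZWNJ, recovering the underlying letters."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

requested = with_joiner("a", "e", ZWJ)     # "please ligate a and e"
discouraged = with_joiner("a", "e", ZWNJ)  # "please do not ligate a and e"
semantic = "\u00e6"                        # the letter ae itself

for label, text in (("requested", requested),
                    ("discouraged", discouraged),
                    ("semantic", semantic)):
    names = [unicodedata.name(c) for c in text]
    print(label, names)
```

The graphic request remains the letters "a" and "e" once the joiner is
removed, whereas U+00E6 is a single, different character; nothing in the
text itself turns one into the other.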

Incidentally, a rendering system is the combination of a layout engine
and one or more fonts. Both participate in the result, so it's often not
possible to say that a font will or will not produce a ligature outside
the context of a given layout engine, hence my previous message.

> So, if I understand you correctly, ligatures are full blown
> characters, and that they cannot be created using the individual
> characters they represent in any way.

It entirely depends on the kind of ligature we are talking about. Your
statement is essentially true for the "semantic" ligatures, and the
opposite statement is essentially true for the "graphic" ligatures.

For completeness, I should add that there are edge cases where a
ligature which is normally graphic only may have a semantic impact. For
example, there is often an "fi" graphic ligature, because the top of "f"
often collides with the dot of the "i", and the typical solution
involves dropping the dot. But in orthographies which distinguish dotted
i from dotless i (e.g. Turkish), such a ligature is not acceptable and
font designers really need to find another way to solve the graphic
problem (maybe put more space between f and dotted i, or find another
modification than dropping the dot).

And while we are there, the use of ZWJ and ZWNJ in the context of the
Latin script is different from their use in Arabic or the Brahmi-derived
scripts.

> I also found that every script has a different 'combining mark' to
> combine characters. For example, U+09CD is the combining mark used for
> the Bengali script, and U+094D is the combining mark used for the
> Hindi script. If that's the case, then what is the use of ZWJ?

First, you are right that U+094D ◌् DEVANAGARI SIGN VIRAMA and the other
virama characters are formally combining marks.

Second, the virama in the Indic scripts serves a very different purpose
than the joiners (ZWJ and ZWNJ) in Latin. A स्त (sta) conjunct is much
more like an "œ" ligature than it is like an "fi" ligature: "सत" (sata)
and "स्त" (sta) are simply not interchangeable, you need to use the
appropriate one.
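
A small Python sketch of the two spellings, using the standard
unicodedata module, may make the difference concrete. The helper name
code_points is hypothetical; the property values come from whatever
Unicode version your Python build ships.

```python
import unicodedata

SA = "\u0938"      # DEVANAGARI LETTER SA
TA = "\u0924"      # DEVANAGARI LETTER TA
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA

sata = SA + TA           # two full consonants
sta = SA + VIRAMA + TA   # the conjunct, built with the virama

def code_points(text):
    """List each character of text as 'U+XXXX NAME'."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]

print(code_points(sata))
print(code_points(sta))

# The virama is formally a combining mark: a nonspacing mark (Mn)
# with a non-zero canonical combining class.
print(unicodedata.category(VIRAMA), unicodedata.combining(VIRAMA))
```

The two strings differ by one encoded character, so a spell checker or a
search sees them as different words, just as it would for "œ" vs. "oe".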

For Latin, we have a small number of pairs that form semantic ligatures,
and it is therefore reasonable to encode a separate character for each
pair as needed.

Devanagari on the other hand has a large number of conjuncts (including
some formed of three or four characters), so it was deemed preferable to
have a constructive mechanism to represent conjuncts, namely to link the
letters entering in a conjunct by the VIRAMA coded character. That way,
there is no need to rework the standard every time somebody exhibits a
new, up-to-now not encoded conjunct. [This is a bit of an historical
revision: for one thing, Unicode followed the lead of ISCII; and I
strongly suspect that having a small character set was a constraint for
ISCII. But you get the point, I can pretty much guarantee that without
legacy, Unicode would have selected a constructive approach anyway.]

You could wonder what we would have done in Latin had the set of
semantic ligatures been large or unbounded. A very viable approach would
have been to not encode U+0153 œ LATIN SMALL LIGATURE OE and U+00E6 æ
LATIN SMALL LETTER AE and friends, to encode LATIN SIGN VIRAMA instead,
and to represent "œ" by <U+006F o LATIN SMALL LETTER O, LATIN SIGN
VIRAMA, U+0065 e LATIN SMALL LETTER E>.

As to whether we need a single VIRAMA character for all the scripts or
one per script, it's six one way and half a dozen the other (although I
am sure we will see answers from vehement proponents of each approach).

Finally, the joiners are used in Devanagari for a function that is
almost always similar to their use in Latin. It is to encourage the
rendering system to select one form or another for a conjunct, when
those forms are "semantically" equivalent (full conjunct vs. half-form +
full-form vs. full-form + halant + full-form).
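
The three forms can be sketched as code point sequences in Python. The
mapping used here (virama + ZWJ to request a half-form, virama + ZWNJ to
request a visible halant) is the convention described in the Unicode
Standard's Devanagari chapter as I understand it, not something spelled
out above, so treat it as an assumption. The names forms and
without_joiners are hypothetical.

```python
SA = "\u0938"      # DEVANAGARI LETTER SA
TA = "\u0924"      # DEVANAGARI LETTER TA
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER

# Assumed convention: the joiner goes right after the virama.
forms = {
    "full conjunct (renderer's choice)": SA + VIRAMA + TA,
    "half-form + full-form": SA + VIRAMA + ZWJ + TA,
    "full-form + halant + full-form": SA + VIRAMA + ZWNJ + TA,
}

def without_joiners(text):
    """Drop ZWJ/ZWNJ to recover the sequence that carries the meaning."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

for label, text in forms.items():
    print(label, [f"U+{ord(c):04X}" for c in text])
```

All three sequences reduce to the same SA, VIRAMA, TA once the joiners
are removed, which is the sense in which the forms are "semantically"
equivalent; as in Latin, the rendering system may or may not honor the
request.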

Eric.

This archive was generated by hypermail 2.1.5 : Sat Jan 28 2006 - 13:43:25 CST