From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Aug 02 2004 - 11:14:40 CDT
On 02/08/2004 13:12, Antoine Leca wrote:
> ...
>
>However, if I can agree with you about the area being fuzzy when it comes to
>*ZWJ* and its numerous uses and some abuses (like Devanagari half-forms),
>the verdict is not anywhere as bad about ZWNJ.
>Behaviour of ZWNJ is consistent in about any place, and the correct
>explanation is the one that is, among others, in chapter 15, that is that
>ZWNJ restricts rendering to unconnected and unligatured forms (or prevents
>use of any connected form or ligature, if you prefer), where possible.
>
>
>
I agree that the situation with ZWJ is more complex than that with ZWNJ.
But ZWNJ is not free of problems either, because it is unclear what
actually counts as a "ligature", and so what exactly may be broken by
ZWNJ.
In discussions on my Holam proposal, John Hudson wrote:
> [Note that on Unicode lists I tend to use the term ligature in a
> purely technical sense: a single glyph representing two or more
> characters. This says nothing about the form of that glyph. Discussion
> of ligatures in complex scripts can become confusing unless this
> strictly technical definition is kept in mind. It helps to remember
> that when you are looking at rendered text, what *looks* like a
> ligature -- i.e. two or more conjoined forms -- may or may not in fact
> be a single glyph.]
He clarified later that it is irrelevant to him whether a glyph consists
of a single continuous block or is graphically equivalent to a base
character plus a diacritical mark; all that matters is how it is
implemented.
But other experts use a very different definition of "ligature", one
apparently restricted to glyphs with a particular *form*, perhaps John
Hudson's "two or more conjoined forms". This definition seems to
exclude combinations of a base character with a diacritical mark, even
when these are represented as two Unicode characters (i.e. not
precomposed) but are implemented with a single glyph, e.g. by
substitution of a presentation form. On this latter definition, the
glyph for the alphabetic presentation form U+FB4B HEBREW LETTER VAV WITH
HOLAM cannot be considered a ligature, even though it is used (and is
automatically substituted by rendering engines, e.g. Uniscribe) only to
represent the combination of two characters <VAV, HOLAM>, to which
U+FB4B decomposes in all normalisation forms.
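The normalisation point can be checked with a minimal sketch using Python's standard unicodedata module. The variable names are only illustrative, and the result depends on the Unicode data shipped with the interpreter; the check relies on U+FB4B having a canonical decomposition and being excluded from recomposition.

```python
import unicodedata

VAV_WITH_HOLAM = "\uFB4B"  # HEBREW LETTER VAV WITH HOLAM (presentation form)
VAV = "\u05D5"             # HEBREW LETTER VAV
HOLAM = "\u05B9"           # HEBREW POINT HOLAM

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Normalise the presentation form under each of the four forms.
results = {form: unicodedata.normalize(form, VAV_WITH_HOLAM) for form in FORMS}

for form, text in results.items():
    # Show the code points that survive normalisation.
    print(form, " ".join(f"U+{ord(ch):04X}" for ch in text))
```

Each form should give the two-character sequence <VAV, HOLAM>, so normalised text never contains U+FB4B as a character; it can only reappear as a glyph chosen by the rendering engine.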
The situation is even more confused in that some Unicode characters,
e.g. U+0152 LATIN CAPITAL LIGATURE OE, are called LIGATUREs in their
character names but are unambiguously single Unicode characters (e.g.
they have no decomposition even for compatibility). (These are in
addition to the characters named LIGATURE in the Alphabetic Presentation
Forms block, which mostly have compatibility decompositions.)
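As a rough illustration of this difference, the decomposition data can be inspected with Python's standard unicodedata module. The helper name ligature_decompositions is hypothetical, and the output depends on the Unicode data version bundled with the interpreter.

```python
import unicodedata

def ligature_decompositions(start=0xFB00, end=0xFB4F):
    """Map each code point named LIGATURE in the given range
    (by default the Alphabetic Presentation Forms block)
    to its decomposition string from the Unicode database."""
    found = {}
    for cp in range(start, end + 1):
        name = unicodedata.name(chr(cp), "")
        if "LIGATURE" in name:
            found[cp] = unicodedata.decomposition(chr(cp))
    return found

# U+0152 is named LIGATURE but has no decomposition at all.
oe = "\u0152"
print(unicodedata.name(oe), repr(unicodedata.decomposition(oe)))

# Characters named LIGATURE in the presentation forms block.
for cp, decomp in ligature_decompositions().items():
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), repr(decomp))
```

An empty decomposition string means the character is atomic as far as the standard is concerned, whatever its name says.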
The Unicode definition in the TUS glossary
(http://www.unicode.org/versions/Unicode4.0.0/b1.pdf) seems ambiguous.
Here it is:
> Ligature. A glyph representing a combination of two or more
> characters. In the Latin script, there are only a few in modern use,
> such as the ligatures between “f” and “i” (= fi) or “f” and “l”
> (= fl). Other scripts make use of many ligatures, depending on the
> font and style.
The first sentence would seem to confirm John Hudson's definition, for a
"glyph" is defined in terms of rendering engine implementation rather
than graphical identity or continuity. But the comment that there are
only a few ligatures in modern use in Latin script seems to restrict the
concept to certain graphical forms without making a proper definition.
So the uncertain point is, what exactly are the "ligatures" whose
formation ZWNJ should inhibit? Are they the technical ligatures as
understood by John Hudson, or are they the undefined formal ligatures or
conjoined forms?
Which brings me back to the specific debate over the Holam proposals: Is
it a proper use of ZWNJ to block the mapping of the character sequence
<VAV, HOLAM> on to the glyph for the alphabetic presentation form U+FB4B
HEBREW LETTER VAV WITH HOLAM, so that the HOLAM dot is positioned in its
regular top left position relative to the base character, rather than
the irregular (top centre or top right) place in the alphabetic
presentation form?
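To make the question concrete, here is a toy sketch of ligature substitution in John Hudson's technical sense. It is not how Uniscribe or any OpenType engine actually works: the LIGATURES table, the glyph names and the shape function are all hypothetical, and the only point is that an intervening ZWNJ stops the pair <VAV, HOLAM> from matching.

```python
import unicodedata

VAV, HOLAM, ZWNJ = "\u05D5", "\u05B9", "\u200C"

# Hypothetical ligature table: character pair -> single glyph name.
# "vav_holam.lig" stands in for the glyph of U+FB4B.
LIGATURES = {(VAV, HOLAM): "vav_holam.lig"}

def shape(text):
    """Map characters to glyph names, substituting a ligature glyph
    for any adjacent pair listed in LIGATURES."""
    glyphs = []
    i = 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in LIGATURES:
            glyphs.append(LIGATURES[pair])  # one glyph for two characters
            i += 2
        elif text[i] == ZWNJ:
            i += 1  # ZWNJ draws nothing; it only keeps the pair apart
        else:
            glyphs.append(f"uni{ord(text[i]):04X}")
            i += 1
    return glyphs

print(shape(VAV + HOLAM))         # ligature glyph
print(shape(VAV + ZWNJ + HOLAM))  # separate base and mark glyphs

# ZWNJ survives normalisation, so the distinction is kept in plain text.
print(unicodedata.normalize("NFC", VAV + ZWNJ + HOLAM) == VAV + ZWNJ + HOLAM)
```

With separate glyphs the engine is free to position the HOLAM dot by its ordinary mark-placement rules, which is the behaviour the proposal asks for.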
>
>
>>Another argument against our proposal is that by defining
>>ZWNJ as breaking a ligature I am specifying implementation.
>>
>>
>
>This is a dubious argument. Unicode specifies encodings. When two different
>"meanings" are identified, different encodings are requested, so it is a
>task for Unicode.
>
>OTOH, if there is no underlying difference and the matter is purely of
>presentation (like the aspect of a, like a reversed e or like a o with left
>stem), then Unicode is not to be involved.
>
>I know the border is fuzzy. ;-) or :-(.
>
>Here, whether the fact that it ligates or not does mean something (and this
>is the hard part of the demonstration) is what should be examined. How it is
>implemented is largely irrelevant (in fact, it is relevant when the result is
>*not* implementable!)
>
>
There is a separate issue of whether it is proper to use ZWNJ or ZWJ for
a semantically significant distinction. It is arguable whether the Holam
distinction is actually semantic, although it does need to be made in
plain text for proper exact typography. But then there are other
distinctions made by ZWNJ e.g. in Persian which are certainly
semantically significant.
My proposal was criticised at one point for restricting how something
could be implemented. I had demonstrated that there was at least one
feasible implementation strategy, i.e. that the proposal is *not*
unimplementable. Is it really necessary to demonstrate that there is
more than one feasible strategy so that implementers have a choice? In
any case, the restriction to one strategy was not imposed by the
proposal or by TUS, but by the rendering system (OpenType) and
particular implementations of it, which had the effect of restricting
the font implementer's options.
>
>OTOH, regarding your problem, I should point out that the Bengali
>precedent is anything but something that should be taken as an example: it
>appears to me as an ad-hoc solution built in a hurry, that happened to fit
>well with certain technical implementations; it is a nightmare to handle for
>others; and now there is on the table a proposal, PR-37, which among other
>things will (try to) remove this hack and replace it with another, more
>orthogonal (using ZWJ).
>
>
>
Thanks for your advice about PR-37. I realised after including this
example in the draft Holam proposal that it is in fact controversial.
However, it seems that the controversy is over whether to use ZWJ or
ZWNJ; the principle seems to be accepted that one or other may be used,
and in this position between a base character and a combining mark. The
UTC obviously needs to decide this issue once and for all, and then
implementers will need to adjust their implementations to fit. Any
adjustments are likely to make things easier also for implementation of
my Holam proposal.
No one, as far as I know, has proposed a resolution of the Bengali
ligature issue by defining a new Unicode character. Why not? Presumably
because this would be a breach of the character/glyph model. Very
similar principles apply to the Holam case. Use of ZWNJ has been
proposed because it seems to fit Unicode definitions better. But I would
not object if the UTC preferred a representation with ZWJ for continued
compatibility with the Bengali case, especially if this solves actual
implementation difficulties. My objection to a new character solution is
basically that it breaks the character/glyph model by defining a new
character for what is no more than a glyph variant.
-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/