From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Mar 30 2007 - 15:52:55 CST
Asmus wrote:
> Just because printers in the past grabbed whatever combination worked,
> is not a good guidance as to the suitability of using combinations.
Of course. I was using an analysis of the typography to help
determine how the forms in question were made, and in turn using
that as another clue to what the concept of the mark was (i.e. a
modification of a macron), to add to the information provided
by the paradigmatic pattern of its use and the explicit annotation
of the intent of the mark.
> If
> the underlying intent is to create a new 'entity' then the requirement
> is to encode that entity.
I gotta take issue with that. We can impute that the underlying
intent was to create a new entity (the modified macron to indicate
a modified pronunciation of an English "long" vowel). But that
doesn't lead directly to a requirement to encode that entity.
One has to first pass the hurdle of determining whether the
entity *deserves* encoding via encoded characters, or is
better treated via markup of some sort, or represents a nonce
usage that doesn't rise to the level of requiring international
standardization.
Even assuming consensus is reached that the entity *does* require
a character representation (as this one probably does) and is
important enough to bother with (as this one might well be),
you then simply have a requirement for *representation* in
terms of characters -- which doesn't force the conclusion that
the entity itself must be encoded as a character.
Obvious exceptions: the discovery of deliberate ligatures in
texts. Those constitute textual entities and you may well
need to represent them for digital text, but that doesn't
require you to go directly to encoding the ligature as a
character.
> Some entities decompose, but our rule for
> decompositions is not merely that a similar visual effect *can* be
> produced, but that the elements of the decomposition, when combined,
> correctly form the new entity.
I agree with that. I'm not suggesting that we start treating
the Unicode characters as a visual lego set.
>
> I will not quibble with Ken's analysis that the new entity is not an Up
> Tack. And I will not quibble with the fact that the entity is a
> modification of a MACRON. I'll take these as read, for the sake of the
> following argument.
O.k.
> I remain very much unconvinced that the decoration
> on that macron is correctly represented by a combining vertical line
> above. I can see no convincing evidence in the discussion.
Nor am I. What I am convinced of, instead, is that the modification
of the macron is a vertical tick diacritic on the macron. This
could be proven, I suspect, if anybody could turn up the
presumably manuscript material from which the books in question
were typeset. That is unlikely a century later, however, for
somewhat obscure material like this.
But in any case, the question devolves to determining whether it
makes sense to posit that a graphological diacritic with a
roughly apostrophic shape, applied to another diacritic, deserves
treatment in the Unicode character encoding as a *character*
itself, or whether, like descender diacritics on Cyrillic letters,
the diacritic nature of the mark doesn't lend itself to
separate character encoding -- leading to the conclusion that
the diacritic modified base should simply be encoded separately
as a unit.
>
> Because of that, I see as viable alternatives either, the encoding of a
> character to correspond with the entity as a whole,
Which is what I would be inclined to in this particular instance,
as the easier option to implement and explain.
> or the encoding of
> the correct modification for the macron.
The correct modification of the macron is a vertical tick added
to the top of it.
You don't get that for free, because there already is a combining
diacritic vertical tick above, namely, U+030D.
You either claim:
A. That isn't it (your straw position here), so a separate
mark needs to be encoded.
or
B. That is it.
In case A, you end up introducing another problematical confusable
issue. By claiming functional distinction for two marks that would
be visually virtually indistinguishable, you end up with the same
kind of confusion that occurs anytime visually indistinguishable
characters are claimed to be distinct: ordinary users will have
trouble determining which to use when, and you will end up with
data corruption as a result.
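A rough sketch of how that corruption shows up in practice, using Python's standard unicodedata module. Since U+XXXX is unassigned, the private-use code point U+E000 stands in for it here; that stand-in and the helper name same_text are assumptions made for illustration only, not part of any proposal.

```python
import unicodedata

VERTICAL_LINE_ABOVE = "\u030D"  # existing COMBINING VERTICAL LINE ABOVE
HYPOTHETICAL_TICK = "\uE000"    # private-use stand-in for the unassigned U+XXXX (assumption)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The same intended text, keyed by two users who picked different tick marks.
with_existing = "a\u0304" + VERTICAL_LINE_ABOVE
with_new = "a\u0304" + HYPOTHETICAL_TICK

print(same_text(with_existing, with_existing))  # identical choice of mark
print(same_text(with_existing, with_new))       # look-alike mark, different code point
```

Normalization cannot reconcile the two spellings, because nothing relates the two code points to each other; searching and matching would treat them as different text even though a reader sees the same shape.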
In case B, you end up with the possibility (or likelihood) that
presentation of marks in combination won't result in the exact
shapes expected, and the need to specify rules for glyphic
combination in particular contexts.
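For case B, the character-level side of the sequence can be checked with Python's standard unicodedata module; the variable names below are illustrative only. The sketch shows that U+0304 and U+030D share a combining class, so normalization keeps their relative order and "macron, then tick" stays distinct from "tick, then macron". It says nothing about how a font would actually draw the stack, which is the hard part.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON
TICK = "\u030D"    # COMBINING VERTICAL LINE ABOVE

# Print code point, character name, and canonical combining class.
for mark in (MACRON, TICK):
    print(f"U+{ord(mark):04X}", unicodedata.name(mark), unicodedata.combining(mark))

# Marks with the same combining class are never reordered by normalization,
# so the two orders remain different strings.
macron_then_tick = unicodedata.normalize("NFC", "a" + MACRON + TICK)
tick_then_macron = unicodedata.normalize("NFC", "a" + TICK + MACRON)

print([f"U+{ord(c):04X}" for c in macron_then_tick])
print([f"U+{ord(c):04X}" for c in tick_then_macron])
```

In the first order the base and macron compose to U+0101 with the tick following; in the second the tick sits between base and macron and blocks that composition.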
Case A is more difficult to justify paradigmatically. You end
up with a mark that looks like X but only occurs in context Y,
and another mark that looks like X but only occurs in context Z,
when contexts Y and Z don't overlap. In particular, you have
a vertical tick that is applied to base vowels (U+030D) and
another vertical tick that is applied to macrons (U+XXXX).
Case B is more difficult to justify practically, because it
potentially requires more font smarts for contextual shaping
of a single character, rather than simply designing the
new character (U+XXXX) to fit correctly on any given font's
macron (U+0304), without requiring any other contextual
shaping beyond that already perhaps required for the macron
itself.
> Overall, placement of multiple
> combining marks strikes me as fragile (except in the context of
> strong, well supported language-based requirements such as for
> Vietnamese, Polytonic Greek and ignoring scripts with so called 'complex
> layout' such as Arabic and similar cases for the moment). Because of
> this, I think that the best user experience might be generated by
> encoding the entity as such.
I agree with that assessment in the end.
But I think the stronger precedent to look towards here is
the handling of letter diacritics when the diacritic form
itself is a modification of the letter itself (descenders,
bars through, hooks, and so on), rather than being a
free-floating diacritic above or below the entire letter
form. The UTC precedent in such instances is to acknowledge
that a diacritic modification is present, but to encode the
entire modified letter as a unit.
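That precedent is visible in the character data. The following is a small sketch using Python's standard unicodedata module; the choice of sample letters is mine, for illustration. A letter with a free-floating macron carries a canonical decomposition, while letters with a descender, a bar through, or a hook carry none.

```python
import unicodedata

# Sample letters chosen for illustration (an assumption, not an exhaustive list).
samples = {
    "\u0101": "free-floating macron above the letter",
    "\u04A3": "descender attached to the letter",
    "\u0142": "bar through the letter",
    "\u0253": "hook attached to the letter",
}

# Map each letter to its canonical decomposition ("" means it is atomic).
decompositions = {ch: unicodedata.decomposition(ch) for ch in samples}

for ch, kind in samples.items():
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "|", kind, "|",
          decompositions[ch] or "no decomposition")
```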
And the UTC has precedents in place for handling diacritic
modification of marks themselves in an analogous way.
The recently accepted Lithuanian tone marks, U+1DCB
COMBINING BREVE-MACRON and U+1DCC COMBINING MACRON-BREVE
are themselves obviously simply graphological combinations
of two existing combining marks that are already encoded
as characters. Yet they were separately encoded as
unitary characters. Now in those cases, the macron and the
breve were linked graphically side by side,
for which encoding simply as sequences of the
existing marks wouldn't make much sense. But in principle,
other than the placement of the diacritic modification
above, rather than side-to-side, the MACRON-TICK is
not much different in the problem it presents for
encoding.
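The unitary status of those two marks can be confirmed with Python's standard unicodedata module, assuming a Python build whose Unicode database includes them (any Python 3 release should). The dictionary name tone_marks is illustrative only.

```python
import unicodedata

# For each mark, record its name, canonical decomposition, and combining class.
tone_marks = {
    cp: (
        unicodedata.name(chr(cp)),
        unicodedata.decomposition(chr(cp)),
        unicodedata.combining(chr(cp)),
    )
    for cp in (0x1DCB, 0x1DCC)
}

for cp, (name, decomposition, ccc) in tone_marks.items():
    print(f"U+{cp:04X}", name, "|", decomposition or "no decomposition", "| ccc", ccc)
```

Neither mark decomposes into a macron plus a breve, so each is an atomic character as far as normalization is concerned.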
--Ken