Re: U+0140

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Apr 17 2004 - 16:57:29 EDT

Next message: Peter Kirk: "Re: U+0140"

Previous message: Michael Everson: "Re: U+0140"
In reply to: John Hudson: "Re: U+0140"
Next in thread: Peter Kirk: "Re: U+0140"
Reply: Peter Kirk: "Re: U+0140"

----- Original Message -----
From: "John Hudson" <tiro@tiro.com>
To: <unicode@unicode.org>
Sent: Saturday, April 17, 2004 6:03 PM
Subject: Re: U+0140

> Michael Everson wrote:
>
> > I have had suboptimal connectivity over the last while, and so have
> > missed some of this discussion. As a type designer I personally consider
> > the middle dot to be ordinary punctuation that should harmonize with
> > other punctuation marks. My solution to this is to treat it as the top
> > dot of a colon. So for me, MIDDLE DOT is to COLON as MODIFIER LETTER
> > HALF TRIANGULAR COLON is to MODIFIER LETTER TRIANGULAR COLON.
>
> This would make the mid-dot too high. The top dot of the colon usually
> sits toward the top of the x-height; the *mid*-dot should sit lower,
> optically midway up the x-height (which means slightly higher than the
> actual halfway mark). The top dot of a colon is typically closer to the
> height of the Greek ano teleia, which aligns with the x-height (and
> which should align with the cap height in all-cap settings, and with
> the small-cap height in smallcap settings).

So we can see three different vertical positions for this middle-dot, and two
are encoded:

(1) centered midway between the baseline and the x-height: this is the
mathematical middle-dot symbol; because most mathematical variables are
lowercase letters, this position is appropriate for noting a
multiplication. There is a fairly large horizontal gap between the two
variables or numbers, and the dot is centered horizontally between the
right edge of the previous character and the left edge of the next one.
This is basically the U+00B7 character, which can also be used as a
punctuation mark, notably in dictionary entries. Its weight should be
the same as that of the regular baseline dot used as a sentence period.
Note that Unicode also defines a superfluous mathematical middle-dot
symbol at U+22C5 (I wonder if this is caused by the fact that
mathematical formulas often happen to use Greek letters); this symbol is
however thicker, but still thinner than the bullet operator U+2219,
itself thinner than the bullet punctuation U+2022 which sits on the
baseline...

(2) centered exactly at the x-height: this is the normal position for the
Catalan symbol and for the Greek Ano Teleia. The horizontal gap is
minimal, just enough to make the dot easily distinguishable, when
reading, from the two surrounding characters. So the horizontal spacing
is smaller than with the middle dot in (1). One bad thing is that the
Greek Ano Teleia was unified with the middle dot. Had it not been, the
Catalan middle dot could have been unified with the Greek Ano Teleia.
It is significant that fonts do not actually respect the unification of
the Greek Ano Teleia (2) with the middle-dot symbol or punctuation (1):
this demonstrates that the two should not have been unified through a
canonical equivalence...

(3) the upper dot of the colon or semicolon is in fact a better position
for the Catalan middle-dot; we can see it as a middle-dot diacritic
centered above another character (a period or comma), but below the
upper dot used on lowercase or uppercase letters. For the Catalan
middle-dot, the base character should be the thinnest space (a sixth of
an em) whose invisible height would be the middle of the x-height, under
which other baseline punctuation marks are drawn (period, comma,
connecting underscore). Michael may be right in saying that this
position should match the vertical position of the hyphen, in which case
the hyphenation point is probably the best character to use for
rendering the Catalan middle-dot: this dot or hyphen is not centered at
the x-height but just below it, so that the dot fits fully under the
x-height with a tiny vertical gap below that line, approximately the
weight of the dot or hyphen. A more exact definition would place it at
exactly the middle of the M-height.

Characters (2) and (3) are very close to each other, as both are
modifiers of the surrounding letters rather than symbols or punctuation
marks in their own right.
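
To keep the various dots apart, here is a minimal sketch that looks up
the code points discussed in this thread in the Unicode Character
Database, using Python's standard unicodedata module. The helper name
describe() and the selection of code points are only illustrative
choices (U+2027 is taken to be the "hyphenation point" mentioned above),
and the results depend on the UCD version bundled with the interpreter.

```python
import unicodedata

# Code points mentioned in this thread (the selection is illustrative).
DOTS = [0x00B7, 0x0387, 0x22C5, 0x2219, 0x2022, 0x2027, 0x0140]

def describe(code_point):
    """Return (character name, general category) for one code point."""
    ch = chr(code_point)
    return unicodedata.name(ch), unicodedata.category(ch)

if __name__ == "__main__":
    for cp in DOTS:
        name, category = describe(cp)
        print(f"U+{cp:04X}  {category}  {name}")
```

The general category separates the punctuation dots (Po) from the
mathematical operators (Sm), but it says nothing about the vertical
position, which is left to fonts.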

But currently Unicode has unified the first two cases, through the
canonical equivalence between Ano Teleia and the middle-dot
symbol/punctuation, which is probably wrong, even if there is a legacy
use of U+00B7 on keyboards that generate ISO 8859 Greek text. The
unification in fact comes from the mapping of the ISO 8859 repertoire to
Unicode, at a time when the hyphenation point did not exist, or possibly
even earlier, from legacy mappings between unrelated ISO 8859
repertoires (notably between Basic-Latin/Greek and Basic-Latin/Latin1).
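
The canonical equivalence in question can be checked directly. The
sketch below assumes Python's standard unicodedata module; the helper
name canonically_equivalent() is hypothetical. It compares two strings
after NFD normalization, which is one common way of testing canonical
equivalence.

```python
import unicodedata

ANO_TELEIA = "\u0387"   # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"   # MIDDLE DOT
L_MIDDLE_DOT = "\u0140"  # LATIN SMALL LETTER L WITH MIDDLE DOT

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

if __name__ == "__main__":
    # Singleton canonical decomposition of U+0387
    print(unicodedata.decomposition(ANO_TELEIA))
    # Any normalization form replaces Ano Teleia by the middle dot
    print(unicodedata.normalize("NFC", ANO_TELEIA) == MIDDLE_DOT)
    # U+0140 only has a compatibility decomposition (NFKC/NFKD)
    print(unicodedata.decomposition(L_MIDDLE_DOT))
```

Note the asymmetry: U+0387 disappears under every normalization form,
while U+0140 is only split into "l" plus U+00B7 under the compatibility
forms.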

Who's to blame there? Only the software designers who did not offer
better keyboards for entering a regular Ano Teleia on Greek keyboards,
or who wrongly accepted the middle-dot punctuation as an approximation
of the Greek Ano Teleia. Maybe the voices of Greek typographers were not
heard in the ISO or UTC decision committees when this unification was
incorrectly decided...

What this suggests is that a note should be added as an exception to the
unification rule for renderers. A renderer would then be officially
allowed to render Ano Teleia differently from the middle-dot
symbol/punctuation, ignoring their canonical equivalence. Text processes
would likewise be allowed to ignore this equivalence when they normalize
text, without being considered non-conforming: the mapping of Ano Teleia
to the middle-dot could become optional, used only by applications that
require security. This also suggests that normalization of text should
not be a default text handling option for all applications (it is
already NOT required, for example, for XML processing, as an XML
processor should not alter the normalization form of a string until it
really cannot do without it, for text transformations such as foldings).
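
As an illustration of what such an optional mapping could look like in
practice, here is a minimal sketch, again assuming Python's unicodedata
module. The function normalize_keeping_ano_teleia() is hypothetical and
is NOT a conforming Unicode normalization: it deliberately leaves
U+0387 untouched and normalizes only the text around it.

```python
import unicodedata

ANO_TELEIA = "\u0387"  # GREEK ANO TELEIA

def normalize_keeping_ano_teleia(form, text):
    """Normalize text to the given form, but keep every U+0387 as is.

    Hypothetical, non-conforming helper: the runs between occurrences
    of U+0387 are normalized separately and then rejoined.
    """
    parts = text.split(ANO_TELEIA)
    return ANO_TELEIA.join(unicodedata.normalize(form, p) for p in parts)

if __name__ == "__main__":
    sample = "e\u0301" + ANO_TELEIA  # e + combining acute, then Ano Teleia
    standard = unicodedata.normalize("NFC", sample)
    relaxed = normalize_keeping_ano_teleia("NFC", sample)
    print([f"U+{ord(c):04X}" for c in standard])
    print([f"U+{ord(c):04X}" for c in relaxed])
```

A process that needs the strict behaviour (for example for security
checks) would keep calling the standard normalization instead.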

This also means that a collation algorithm could make a level-3
distinction between Ano Teleia and the middle-dot (this could be
introduced in the DUCET), so that applications that perform
case-insensitive comparisons (at level 2 only) can ignore this
difference, just as they ignore the case differences weighted at
level 3. However, this would require an update to the standard collation
algorithm, which assumes that strings are fully normalized before
proceeding.
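
The idea of a level-3 distinction can be sketched with a toy sort key.
This is only an illustration under stated assumptions: toy_sort_key()
is a hypothetical function, it is not the Unicode Collation Algorithm,
it does not use DUCET weights, it has no secondary (accent) level, and
it skips normalization so that U+0387 survives as input.

```python
ANO_TELEIA = "\u0387"  # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"  # MIDDLE DOT

def toy_sort_key(text, strength=3):
    """Build a toy multi-level key (not the UCA, no DUCET weights).

    Primary level: lowercased characters, with Ano Teleia mapped to the
    middle dot. Tertiary level: 1 for uppercase letters and for Ano
    Teleia, 0 otherwise. The tertiary level is dropped below strength 3.
    """
    primary = []
    tertiary = []
    for ch in text:
        if ch == ANO_TELEIA:
            primary.append(MIDDLE_DOT)
            tertiary.append(1)
        else:
            primary.append(ch.lower())
            tertiary.append(1 if ch != ch.lower() else 0)
    if strength >= 3:
        return (tuple(primary), tuple(tertiary))
    return (tuple(primary),)

if __name__ == "__main__":
    words = ["\u0387b", "\u00b7b", "\u00b7a"]
    ordered = sorted(words, key=toy_sort_key)
    print([[f"U+{ord(c):04X}" for c in w] for w in ordered])
```

At strength 2 the two dots compare equal, exactly like "A" and "a"; at
strength 3 the middle dot sorts before Ano Teleia in this toy ordering.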

Wow! so many dots with distinct properties and rendering... And still no good
definition of them to make a clear choice or distinction that will work in all
sorts of apps...

This archive was generated by hypermail 2.1.5 : Sat Apr 17 2004 - 17:30:42 EDT