Re: U+0140

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Apr 17 2004 - 16:57:29 EDT

    ----- Original Message -----
    From: "John Hudson" <tiro@tiro.com>
    To: <unicode@unicode.org>
    Sent: Saturday, April 17, 2004 6:03 PM
    Subject: Re: U+0140

    > Michael Everson wrote:
    >
    > > I have had suboptimal connectivity over the last while, and so have
    > > missed some of this discussion. As a type designer I personally consider
    > > the middle dot to be ordinary punctuation that should harmonize with
    > > other punctuation marks. My solution to this is to treat it as the top
    > > dot of a colon. So for me, MIDDLE DOT is to COLON as MODIFIER LETTER
    > > HALF TRIANGULAR COLON is to MODIFIER LETTER TRIANGULAR COLON.
    >
    > This would make the mid-dot too high. The top dot of the colon usually sits
    > toward the top of the x-height; the *mid*-dot should sit lower, optically
    > midway up the x-height (which means slightly higher than the actual halfway
    > mark). The top dot of a colon is typically closer to the height of the Greek
    > ano teleia, which aligns with the x-height (and which should align with the
    > cap height in all-cap settings, and with the small-cap height in smallcap
    > settings).

    So we can see three different vertical positions for this middle-dot, and two
    are encoded:

    (1) centered midway between the baseline and the x-height: this is the
    mathematical middle-dot symbol, because most mathematical variables are
    lowercase letters, making this position appropriate to denote a
    multiplication. There is a fairly large horizontal gap between the two
    variables or numbers, and the dot is centered horizontally between the right
    edge of the previous character and the left edge of the next character. This is
    basically the U+00B7 character, which can also be used as a punctuation mark,
    notably in dictionary entries. Its weight should be the same as that of the
    regular baseline dot used as a sentence period. Note that Unicode also defines
    a superfluous mathematical middle-dot symbol (I wonder if this is caused by the
    fact that mathematical formulas often happen to use Greek letters; this symbol
    at U+22C5 is however thicker, but still thinner than the bullet operator
    U+2219, itself thinner than the bullet punctuation U+2022 which sits on the
    baseline...)

    (2) centered exactly at the x-height: this is the normal position for the
    Catalan symbol and for the Greek Ano Teleia. The horizontal gap is minimal, just
    enough to make the dot easily distinguishable from the two surrounding
    characters when reading. So the horizontal spacing is smaller than with the
    middle dot in (1). One bad thing is that the Greek Ano Teleia was unified with
    the middle dot. Had it not been, the Catalan middle dot could have been unified
    with the Greek Ano Teleia. It is significant that fonts do not actually respect
    the unification of the Greek Ano Teleia (2) and the middle-dot symbol or
    punctuation (1): this demonstrates that the two should not have been unified by
    a canonical equivalence...

    (3) the upper dot of the colon or semicolon is in fact a better position for
    the Catalan middle-dot; we can see it as a middle-dot diacritic centered above
    another character (a period or comma), but below the upper dot used on lowercase
    or uppercase letters. For the Catalan middle-dot, the base character should be
    the thinnest space (one sixth of an em), whose invisible height would be the
    middle of the x-height, under which the other baseline punctuation marks are
    drawn (period, comma, connecting underscore). Michael may be right in saying
    that this position should match the vertical position of the hyphen, in which
    case the hyphenation point is probably the best character to use for rendering
    the Catalan middle-dot: this dot or hyphen is not centered at the x-height but
    just below it, so that the dot fits fully under the x-height with a tiny
    vertical gap beneath that line, approximately the weight of the dot or hyphen.
    A more exact definition would place it at exactly the middle of the M-height.
    Characters (2) and (3) are very close to each other, as they are both modifiers
    of the surrounding letters, and not symbols or punctuation themselves.
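
    To keep the candidates apart, here is a minimal sketch that lists the names and
    general categories recorded in the Unicode character database for the dots
    discussed above. The choice of code points is my reading of which characters
    are meant (in particular, U+2022 for the bullet punctuation and U+2027 for the
    hyphenation point are assumptions), and `describe` is a hypothetical helper.

```python
import unicodedata

# Code points assumed to be the dots under discussion.
DOTS = ["\u00B7", "\u0387", "\u22C5", "\u2219", "\u2022", "\u2027"]


def describe(ch):
    """Return (code point label, character name, general category)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in DOTS:
    print(*describe(ch))
```

    The general category separates the punctuation dots (Po) from the mathematical
    operators (Sm), but it says nothing about vertical position, which is left to
    the fonts.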

    But Unicode has currently unified the first two cases, through the canonical
    equivalence between Ano Teleia and the middle-dot symbol/punctuation, which is
    probably wrong, even if there is a legacy use of U+00B7 on keyboards that
    generate ISO 8859 Greek text. The unification in fact comes from the mapping of
    the ISO 8859 repertoire to Unicode, at a time when the hyphenation point did
    not exist, or possibly even earlier, from legacy mappings between unrelated
    ISO 8859 repertoires (notably between Basic-Latin/Greek and Basic-Latin/Latin1).
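
    The effect of this canonical equivalence is easy to observe. The following is a
    small sketch using Python's standard `unicodedata` module; the constant names
    are mine. Any canonical normalization replaces the Ano Teleia with the
    middle dot, so the distinction does not survive normalized text.

```python
import unicodedata

ANO_TELEIA = "\u0387"
MIDDLE_DOT = "\u00B7"

# The character database records a singleton canonical decomposition.
print(unicodedata.decomposition(ANO_TELEIA))

# Both canonical normalization forms map U+0387 to U+00B7.
nfc = unicodedata.normalize("NFC", ANO_TELEIA)
nfd = unicodedata.normalize("NFD", ANO_TELEIA)
print(nfc == MIDDLE_DOT, nfd == MIDDLE_DOT)
```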

    Who is to blame here? Only the software designers who have not offered better
    keyboards for entering a regular Ano Teleia on Greek keyboards, or who wrongly
    accepted the approximation of the Greek Ano Teleia by the middle-dot
    punctuation. Maybe the votes of Greek typographers were not heard at the ISO or
    UTC decision committees when this unification was incorrectly decided...

    What this suggests is that a note should be added as an exception to the
    unification rule for renderers. A renderer would then be officially allowed to
    render Ano Teleia differently from the middle-dot symbol/punctuation, ignoring
    their canonical equivalence. Text processes would likewise be allowed to ignore
    this equivalence when they normalize text, without being considered
    non-conforming: the mapping of Ano Teleia to the middle-dot could become
    optional, and be used only by applications that require security. This also
    suggests that normalization of text should not be a default text-handling
    option for all applications (it is already NOT required, for example, for XML
    processing, as an XML processor should not alter the normalization form of a
    string until it really cannot do without it, for text transformations such as
    foldings).
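
    As an illustration of what such an optional mapping could look like in a text
    process, here is a rough sketch that normalizes everything except the Ano
    Teleia. The helper `nfc_keeping_ano_teleia` is hypothetical, and its output is
    by definition NOT a conforming normalization form today; it also normalizes
    each segment independently, which is a simplification.

```python
import unicodedata

ANO_TELEIA = "\u0387"


def nfc_keeping_ano_teleia(text):
    """NFC-normalize text segment by segment, leaving U+0387 untouched.

    Hypothetical helper: the result is not a standard normalization form.
    """
    parts = text.split(ANO_TELEIA)
    return ANO_TELEIA.join(unicodedata.normalize("NFC", p) for p in parts)


sample = "e\u0301" + ANO_TELEIA + "a"  # e + combining acute, ano teleia, a
print(ascii(unicodedata.normalize("NFC", sample)))
print(ascii(nfc_keeping_ano_teleia(sample)))
```

    The standard call folds the Ano Teleia into U+00B7, while the sketch still
    composes the accented letter but keeps U+0387 as entered.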

    This also means that a collation algorithm could make a level-3 distinction
    between Ano Teleia and the middle-dot (this could be introduced in the DUCET),
    so that applications that perform case-insensitive comparisons (at level 2
    only) can ignore this difference, just as they can compare strings while
    ignoring the other differences at level 3. However, this would require an
    update to the standard collation algorithm, which assumes that strings are
    fully normalized before proceeding.
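
    To make the idea concrete, here is a toy sketch of such a level-3 distinction.
    It is not the Unicode Collation Algorithm: the weights are invented, levels 1
    and 2 are collapsed into a plain code point comparison, and `toy_key` and
    `VARIANTS` are hypothetical names. It only shows that the two dots can compare
    equal up to level 2 and still be ordered at level 3, provided the strings are
    not normalized first.

```python
MIDDLE_DOT = "\u00B7"
ANO_TELEIA = "\u0387"

# Hypothetical tailoring: (shared base character, level-3 weight).
# These weights are invented for illustration; they are not DUCET values.
VARIANTS = {ANO_TELEIA: (MIDDLE_DOT, 1)}


def toy_key(text, max_level=3):
    """Build a simplified sort key; levels 1 and 2 are collapsed into one."""
    base = []
    level3 = []
    for ch in text:
        shared, weight = VARIANTS.get(ch, (ch, 0))
        base.append(ord(shared))
        level3.append(weight)
    if max_level >= 3:
        return (tuple(base), tuple(level3))
    return (tuple(base),)


a = "\u03B1" + MIDDLE_DOT   # alpha + middle dot
b = "\u03B1" + ANO_TELEIA   # alpha + ano teleia
print(toy_key(a, 2) == toy_key(b, 2))  # same key when level 3 is ignored
print(toy_key(a, 3) < toy_key(b, 3))   # ordered once level 3 is compared
```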

    Wow! So many dots with distinct properties and renderings... And still no good
    definition of them that allows a clear choice or distinction that will work in
    all sorts of apps...


