Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

From: Jim Allan (jallan@smrtytrek.com)
Date: Tue Aug 05 2003 - 12:42:21 EDT


    Peter Kirk posted:

    > If I want to do this, should I explicitly encode a dotted circle, or
    > should I encode nothing and expect the font to generate the dotted
    > circle, as it often does?

    I think that the practice of a font or application automatically
    inserting a dotted circle under an orphaned combining character is
    dubiously compliant with the Unicode specifications.

    In http://www.unicode.org/book/preview/ch03.pdf the space characters in
    general are given class Zs:

    << Zs, Zl, and Zp are considered format characters, but their membership
    in the Z (separator) class takes precedence over their membership in the
    Cf class, because the General Category assigns only a single value to
    each character. >>

    So the various space characters (class Zs) are also classified as format
    characters.
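    The General Category values under discussion can be inspected directly.
    This is a minimal sketch that assumes Python's standard unicodedata
    module, whose values reflect whatever Unicode version ships with the
    interpreter, not necessarily the one quoted here. The sample list and the
    helper name general_categories are illustrative only.

```python
import unicodedata

# Sample code points; the selection is illustrative, not exhaustive.
SAMPLES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "EM SPACE": "\u2003",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
    "ZERO WIDTH JOINER": "\u200d",
}

def general_categories(samples=SAMPLES):
    """Map each label to the single General Category value of its character."""
    return {label: unicodedata.category(ch) for label, ch in samples.items()}

for label, cat in general_categories().items():
    print(f"{label}: {cat}")
```

    Each character reports exactly one value, which is why a separator
    shows up as Zs, Zl or Zp and never as Cf.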

    From http://www.unicode.org/book/ch04.pdf:

    << _D13 Base character:_ a character that does not graphically combine
    with preceding characters, and that is neither a control nor a format
    character. >>

    Accordingly, by definition, spaces are not base characters.
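    The reading above can be written down as a small predicate. This is
    only a sketch under stated assumptions: "graphically combines" is
    approximated by General Category M* (Mn, Mc, Me), and the hypothetical
    flag separators_are_format encodes the interpretation that Z* characters
    count as format characters. Neither assumption comes from the text of the
    standard itself.

```python
import unicodedata

def is_base_character(ch, separators_are_format=True):
    """Approximate D13 using General Category values only.

    With separators_are_format=True, Zs/Zl/Zp are treated as format
    characters (the reading argued above); with False, only Cc and Cf
    are excluded.
    """
    cat = unicodedata.category(ch)
    if cat.startswith("M"):      # combining marks are not base characters
        return False
    if cat in ("Cc", "Cf"):      # neither a control nor a format character
        return False
    if separators_are_format and cat.startswith("Z"):
        return False
    return True
```

    Under this reading is_base_character(" ") is false, while passing
    separators_are_format=False makes the same call return true, which is
    the whole point of contention.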

    Also from http://www.unicode.org/book/ch04.pdf:

    << _D14 Combining character:_ a character that graphically combines
    with a preceding base character. The combining character is said to
    _apply_ to the base character. >>

    So we know what happens when a combining character follows a base
    character. It combines with it.

    What happens when a combining character follows a character that is not
    a base character or appears initially? The same source explains:

    << o Even though a combining character is intended to be presented in
    graphical combination with a base character, circumstances may arise
    where either (1) no base character precedes the combining character or
    (2) a process is unable to perform graphical combination. In both cases
    it may present a combining character without graphical combination; that
    is, it may present it as if it were a base character.

    o The representative images of combining characters are depicted with a
    dotted circle in the code charts; when presented in a graphical
    combination with a preceding base character, that base character is
    intended to appear in the position occupied by the dotted circle. >>

    So a display device *may* present an orphaned combining character as
    suggested.

    But the word "may" is weak. Are there other things it may do that would
    still be compliant with Unicode? May it ignore the character
    altogether? May it display the character as U+FFFD REPLACEMENT
    CHARACTER? May it display the mark over some other character altogether,
    perhaps even U+25CC DOTTED CIRCLE? This is the only way I can justify
    the display of U+25CC DOTTED CIRCLE in such cases by the Unicode
    specifications.

    But then is there any display that is not acceptable according to
    these specifications?

    Note that even if an application takes the suggestion made here, the
    combination of the non-base character SPACE followed by a combining
    character would be rendered as the non-base character SPACE followed by
    the combining character rendered as a base character. They would not be
    combined.

    From the same source:

    << _D17a Defective combining character sequence:_ a combining character
    sequence that does not start with a base character.

    o Defective combining character sequences occur when a sequence of
    combining characters appears at the start of a string or follows a
    control or format character. Such sequences are defective from the point
    of view of handling of combining marks, but are not _ill-formed_.
    (See D30.) >>

    Accordingly any space character followed by a combining character is a
    defective combining character sequence.
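    A rough way to locate such sequences in a string is sketched below. It
    assumes that General Category M* stands in for "combining character",
    and the hypothetical flag separators_are_format applies the reading that
    space characters are format characters. The function names are
    illustrative, not taken from any library.

```python
import unicodedata

def is_combining(ch):
    # Assumption: General Category M* stands in for "combining character".
    return unicodedata.category(ch).startswith("M")

def is_base(ch, separators_are_format=True):
    # Approximation of D13; Z* is excluded only when the flag is set.
    cat = unicodedata.category(ch)
    if cat.startswith("M") or cat in ("Cc", "Cf"):
        return False
    return not (separators_are_format and cat.startswith("Z"))

def defective_sequences(text, separators_are_format=True):
    """Return (index, marks) for each run of marks with no base before it."""
    found = []
    i = 0
    while i < len(text):
        if is_combining(text[i]):
            start = i
            while i < len(text) and is_combining(text[i]):
                i += 1
            has_base = start > 0 and is_base(text[start - 1],
                                             separators_are_format)
            if not has_base:
                found.append((start, text[start:i]))
        else:
            i += 1
    return found
```

    With the flag set, SPACE followed by U+0301 is reported as defective;
    with the flag cleared, the same string is reported as clean.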

    From http://unicode.org/book/ch07.pdf:

    << *Marks as Spacing Characters.* By convention, combining marks may be
    exhibited in (apparent) isolation by applying them to U+0020 SPACE or to
    U+00A0 NO-BREAK SPACE. This approach might be taken, for example, when
    referring to the diacritical mark itself as a mark, rather than by using
    it in its normal way in text. The use of U+0020 SPACE versus U+00A0
    NO-BREAK SPACE affects line-break behavior.>>
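    The same convention is visible in the character data: several spacing
    accents in Latin-1 have compatibility decompositions to SPACE followed
    by a combining mark. The sketch below assumes Python's unicodedata
    module; the helper name and the choice of four characters are
    illustrative.

```python
import unicodedata

# Spacing accents from Latin-1: DIAERESIS, MACRON, ACUTE ACCENT, CEDILLA.
SPACING_ACCENTS = ["\u00a8", "\u00af", "\u00b4", "\u00b8"]

def spacing_clone_decompositions(chars=SPACING_ACCENTS):
    """Map each character name to its NFKD form as a list of U+XXXX strings."""
    result = {}
    for ch in chars:
        nfkd = unicodedata.normalize("NFKD", ch)
        result[unicodedata.name(ch)] = [f"U+{ord(c):04X}" for c in nfkd]
    return result

for name, parts in spacing_clone_decompositions().items():
    print(name, "->", " ".join(parts))
```

    Each of these decomposes to U+0020 plus the corresponding combining
    mark, so the data files already treat SPACE as the carrier for an
    isolated mark.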

    The words "by convention" are odd. They perhaps acknowledge that this
    shouldn't work according to the other general Unicode rules and
    definitions.

    This passage, however, does not even hint that "by convention" a dotted
    circle should appear under the diacritic.

    Presumably if someone wanted a combining character applied to a dotted
    circle that person would code U+25CC followed by the combining character.
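    Encoding the dotted circle explicitly is a one-line operation. The
    sketch assumes the dotted circle meant is U+25CC, the code point that
    Python's unicodedata names DOTTED CIRCLE; the helper on_dotted_circle is
    hypothetical.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"  # assumption: the dotted circle meant is U+25CC

def on_dotted_circle(mark):
    """Return the mark explicitly applied to DOTTED CIRCLE."""
    # Assumption: General Category M* identifies a combining mark.
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a combining mark")
    return DOTTED_CIRCLE + mark

sample = on_dotted_circle("\u0301")
print([f"U+{ord(c):04X}" for c in sample])
```

    Because the circle is then an ordinary character in the text, the
    result does not depend on what a font or renderer chooses to do with an
    orphaned mark.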

    One could fix this messiness by changing the definition of base
    character to specifically include U+0020 SPACE and U+00A0 NO-BREAK
    SPACE. That in effect is exactly what the above passage does. So do it
    in a structured manner by making it part of the rule instead of burying
    it in the text as an odd exception to the rule.

    But it does seem philosophically odd that U+0020 and U+00A0 alone of the
    category Zs characters should be especially singled out.

    It would be more intuitive if all Zs characters could be included in the
    category of base characters. Is there any philosophical reason why
    combining characters should not be applied to the other spaces?

    The combining character might of course increase the width of the space:

    Again from http://www.unicode.org/book/ch04.pdf:

    << o Such characters may be large enough to affect the placement of
    their base character relative to preceding and succeeding base
    characters. For example, a circumflex applied to an "i" may affect
    spacing ("î"), as might the character U+20DD COMBINING ENCLOSING
    CIRCLE. >>

    In any case, I see nothing in the Unicode specifications that suggests
    replacing either U+0020 or U+00A0 by U+25CC when followed by a combining
    character, or applying the combining character to an inserted U+25CC
    when it is part of a defective combining character sequence.

    Jim Allan

    _D15 Nonspacing mark:_ a combining character whose positioning in
    presentation is dependent on the base character. It generally does not
    consume space along the visual baseline in and of itself.



