From: Jim Allan (jallan@smrtytrek.com)
Date: Tue Aug 05 2003 - 12:42:21 EDT
Peter Kirk posted:
> If I want to do this, should I explicitly encode a dotted circle, or
> should I encode nothing and expect the font to generate the dotted
> circle, as it often does?
I think that the practice of a font or application automatically
inserting a dotted circle under an orphaned combining character is
dubiously compliant with the Unicode specifications.
In http://www.unicode.org/book/preview/ch03.pdf the space characters in
general are given class Zs:
<< Zs, Zl, and Zp are considered format characters, but their membership
in the Z (separator) class takes precedence over their membership in the
Cf class, because the General Category assigns only a single value to
each character. >>
So the various space characters (class Zs) are also classified as format
characters.
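For anyone who wants to check the single-value General Category claim, here is a minimal Python sketch using the standard `unicodedata` module. The sample characters and the helper name `general_categories` are my own choices for illustration, and the values reported come from whatever Unicode version ships with the Python build, not necessarily the version quoted here.

```python
import unicodedata

# Sample separator characters (my own selection, for illustration only).
SAMPLES = [
    "\u0020",  # SPACE
    "\u00A0",  # NO-BREAK SPACE
    "\u2003",  # EM SPACE
    "\u3000",  # IDEOGRAPHIC SPACE
    "\u2028",  # LINE SEPARATOR
    "\u2029",  # PARAGRAPH SEPARATOR
]

def general_categories(chars):
    """Map each character to its single General Category value."""
    return {f"U+{ord(c):04X}": unicodedata.category(c) for c in chars}

categories = general_categories(SAMPLES)
for code_point, category in categories.items():
    print(code_point, category)
```

Each character reports exactly one category value, so a space shows up as Zs and never as Cf.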
From http://www.unicode.org/book/ch04.pdf:
<< _D13 Base character:_ a character that does not graphically combine
with preceding characters, and that is neither a control nor a format
character. >>
Accordingly, by definition, spaces are not base characters.
Also from http://www.unicode.org/book/ch04.pdf:
<< _D14 Combining character:_ a character that graphically combines
with a preceding base character. The combining character is said to
_apply_ to the base character. >>
So we know what happens when a combining character follows a base
character. It combines with it.
What happens when a combining character follows a character that is not
a base character or appears initially? The same source explains:
<< o Even though a combining character is intended to be presented in
graphical combination with a base character, circumstances may arise
where either (1) no base character precedes the combining character or
(2) a process is unable to perform graphical combination. In both cases
it may present a combining character without graphical combination; that
is, it may present it as if it were a base character.
o The representative images of combining characters are depicted with a
dotted circle in the code charts; when presented in a graphical
combination with a preceding base character, that base character is
intended to appear in the position occupied by the dotted circle. >>
So a display device *may* present an orphaned combining character as
suggested.
But the word "may" is weak. Are there other things it may do that would
still be compliant with Unicode? May it ignore the character
altogether? May it display the character as U+FFFD REPLACEMENT
CHARACTER? May it display the character over some other character
altogether, perhaps even U+25CC DOTTED CIRCLE? This is the only way I can
justify the display of U+25CC DOTTED CIRCLE in such cases by the Unicode
specifications.
But then is there any display that is not acceptable according to
these specifications?
Note that even if an application takes the suggestion made here, the
combination of the non-base character SPACE followed by a combining
character would be rendered as the non-base character SPACE followed by
the combining character rendered as a base character. They would not be
combined.
From the same source:
<< _D17a Defective combining character sequence:_ a combining character
sequence that does not start with a base character.
o Defective combining character sequences occur when a sequence of
combining characters appears at the start of a string or follows a
control or format character. Such sequences are defective from the point
of view of handling of combining marks, but are not _ill-formed_. (See D30.) >>
Accordingly any space character followed by a combining character is a
defective combining character sequence.
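The reading argued above can be sketched in Python. This is only an illustration under my own assumptions: I treat any character whose General Category starts with M as a combining character, and I treat control, format, and separator characters (categories C* and Z*) as non-base, which encodes the interpretation in this message and is not a statement of how the standard itself classifies spaces. The function names are hypothetical.

```python
import unicodedata

def is_combining(ch):
    """Assumption: combining characters are those with General Category M*."""
    return unicodedata.category(ch).startswith("M")

def is_base_strict(ch):
    """Base character under the reading in this message:
    not combining (M*), not control/format (C*), not a separator (Z*)."""
    return not unicodedata.category(ch).startswith(("M", "C", "Z"))

def defective_positions(text):
    """Return the indexes of combining marks that begin a defective
    combining character sequence (no base character in front of them)."""
    positions = []
    prev = None
    for i, ch in enumerate(text):
        if is_combining(ch):
            # A mark continues a sequence if it follows a base or another mark.
            continues = prev is not None and (
                is_base_strict(prev) or is_combining(prev)
            )
            if not continues:
                positions.append(i)
        prev = ch
    return positions

print(defective_positions("a\u0301"))   # base + mark
print(defective_positions(" \u0301"))   # SPACE + mark
print(defective_positions("\u0301x"))   # mark at start of string
```

Under these assumptions SPACE followed by a combining acute is flagged, while "a" followed by the same mark is not.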
From http://unicode.org/book/ch07.pdf
<< *Marks as Spacing Characters.* By convention, combining marks may be
exhibited in (apparent) isolation by applying them to U+0020 SPACE or to
U+00A0 NO-BREAK SPACE. This approach might be taken, for example, when
referring to the diacritical mark itself as a mark, rather than by using
it in its normal way in text. The use of U+0020 SPACE versus U+00A0
NO-BREAK SPACE affects line-break behavior.>>
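As a small sketch of the convention just quoted, the following Python builds a mark "in isolation" on either U+0020 or U+00A0. The helper name `isolate` and the choice of U+0301 COMBINING ACUTE ACCENT are mine. The last lines look at the compatibility decomposition of U+00B4 ACUTE ACCENT in the character database that ships with Python, which as far as I can tell records the same SPACE-plus-mark pattern.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, chosen only as an example

def isolate(mark, no_break=False):
    """Exhibit a combining mark in apparent isolation by applying it to
    U+0020 SPACE, or to U+00A0 NO-BREAK SPACE when no_break is True."""
    carrier = "\u00A0" if no_break else "\u0020"
    return carrier + mark

breaking = isolate(ACUTE)
non_breaking = isolate(ACUTE, no_break=True)

for label, text in (("SPACE", breaking), ("NO-BREAK SPACE", non_breaking)):
    print(label, [f"U+{ord(c):04X}" for c in text])

# Compatibility decomposition of the spacing acute accent U+00B4.
spacing_acute_decomposition = unicodedata.decomposition("\u00B4")
print(spacing_acute_decomposition)
```

How the result is drawn, with or without a dotted circle, is still up to the font and rendering engine; the code only fixes the character sequence.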
The words "by convention" are odd. They perhaps acknowledge that this
shouldn't work according to the other general Unicode rules and definitions.
This passage, however, does not even hint that "by convention" a dotted
circle should appear under the diacritic.
Presumably if someone wanted a combining character applied to a dotted
circle, that person would code U+25CC followed by the combining character.
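Encoding the dotted circle explicitly might look like the Python sketch below. In the character database that ships with Python, DOTTED CIRCLE is at U+25CC, and the code checks that name rather than taking the code point on trust. The helper name `on_dotted_circle` is hypothetical, and treating category M* as "combining" is my simplification.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def on_dotted_circle(mark):
    """Return an explicit DOTTED CIRCLE followed by the given combining mark."""
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError(f"U+{ord(mark):04X} is not a combining mark")
    return DOTTED_CIRCLE + mark

sample = on_dotted_circle("\u0301")  # COMBINING ACUTE ACCENT as an example
print([unicodedata.name(c) for c in sample])
```

Because the dotted circle is an ordinary symbol character here, the sequence starts with a base character and is not defective.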
One could fix this messiness by changing the definition of base
character to specifically include U+0020 SPACE and U+00A0 NO-BREAK
SPACE. That in effect is exactly what the above passage does. So do it
in a structured manner by making it part of the rule instead of burying
it in the text as an odd exception to the rule.
But it does seem philosophically odd that U+0020 and U+00A0 alone of the
category Zs characters should be especially singled out.
It would be more intuitive if all Zs characters could be included in the
category of base characters. Is there any philosophical reason why
combining characters should not be applied to the other spaces?
The combining character might of course increase the width of the space:
Again from http://www.unicode.org/book/ch04.pdf:
<< o Such characters may be large enough to affect the placement of
their base character relative to preceding and succeeding base
characters. For example, a circumflex applied to an "i" may affect
spacing ("î"), as might the character U+20DD COMBINING ENCLOSING CIRCLE. >>
In any case, I see nothing in the Unicode specifications that suggests
replacing either U+0020 or U+00A0 by U+25CC when followed by a combining
character, or applying the combining character to an inserted U+25CC
when it is part of a defective combining character sequence.
Jim Allan
_D15 Nonspacing mark:_ a combining character whose positioning in
presentation is dependent on the base character. It generally does not
consume space along the visual baseline in and of itself.