From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 30 2003 - 21:53:02 EDT
Philippe Verdy continued:
> What surprises me the most in the Unicode spec is that it
> both says that its purpose is to create arbitrary length
> of leaders
As in plain text, as can be seen in Table of Contents listings
in many RFCs, for example. (Which, however, use ASCII 0x2E for the
same purpose.)
> (you say that the spacing statement in the Xerox name was
> not considered important by Xerox, so how many leaders would
> be needed to fit an en space with the Unicode designation?).
If you mean how many leader *dots* would it take to fit an en
space, that would depend on the font, as for so much else in
Unicode. My guess would be that the correct answer is
approximately the same as the number of angels that can stand
on the dot.
Very few characters in Unicode have any specified widths. That
is by design.
> Why then do you insist that it represents one dot ?
Because that was the intent of the Unicode Technical Committee
when it encoded the character, and is the clear intent of the
standard as currently specified.
> You also seem to insist on the "compatibility" decomposition
> which is normally removing an important semantic (else it
> would be canonical).
I'm simply restating the specification in the standard. Read it
yourself.
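
A minimal sketch for checking what the character database records for the dot leaders, assuming Python's standard `unicodedata` module as the lookup tool (the standard itself does not prescribe any tool, and the helper name `leader_properties` is purely illustrative). It reports the name, general category, and decomposition mapping for U+2024..U+2026:

```python
import unicodedata

def leader_properties(cp):
    """Return (name, general category, decomposition mapping) for a code point.

    Illustrative helper; the values come from the Unicode data bundled
    with the running Python interpreter.
    """
    ch = chr(cp)
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.decomposition(ch))

if __name__ == "__main__":
    # ONE DOT LEADER, TWO DOT LEADER, HORIZONTAL ELLIPSIS
    for cp in (0x2024, 0x2025, 0x2026):
        print(f"U+{cp:04X}", *leader_properties(cp))
```

The mapping for U+2024 is a single U+002E carrying the `<compat>` tag, which is the specification being restated here.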
> All this seems like creating contradictions.
>
> Also it would be the only punctuation sign whose number of
> occurrences is not relevant
False. See the discussion of Tibetan justifying tseks in:
http://www.unicode.org/versions/Unicode4.0.0/ch09.pdf
> (in dotted lines used as leaders),
Or, for that matter, in plain text visual line separations
also created by stringing together ASCII punctuation:
**********************************************************
like that. Such legacy use of punctuation characters is no
different than legacy use of a sequence of periods to create
leader lines in plain text.
> as the final presentation of the text will need to compensate
> for font metrics differences in order to produce the correct
> effect (also because the size of the dots was removed from
> the Unicode designation.)
So? That is irrelevant to the question at hand. People who do
stuff like this, as in plain text RFCs, display text in
monospace fonts and don't expect dynamic reflowing of text.
People who do leader lines correctly for fine typography do
them with internal data abstractions, and those data abstractions
aren't based on interpreting U+2024 as a format control character.
> I do not agree with your argument that says that it is like a
> full dot to be used in limited applications
You can disagree with my argument all you like. But if you insist
on coming on the unicode list and spouting nonsense about
particular characters in the standard, suggesting that people
implement them in ways that would be nonconformant with the
standard, then expect people to respond to the nonsense.
> (if Unicode wanted to remove the spacing, it was to generalize
> its use as an abstract character, not to reinforce its mapping
> to an approximate full dot.)
That claim is arrant nonsense.
> I never heard about the Xerox CCS before, but there's a large
> legacy usage of the ellipsis as a single unbreakable character
Correct. And U+2026 is encoded precisely for that legacy practice.
> (and the two dots for the notation of interval bounds are also
> unbreakable).
True, but this kind of behavior falls automatically out of most
implementations' treatment of U+002E characters in sequence.
Check UAX #14, which discusses the line break behavior of both
the leader dot characters and U+002E FULL STOP. U+002E is lb class
IS, and since class IS prohibits a break before it, a sequence of
two periods in a row, as in [0..1], does not have a break
opportunity in the middle of the sequence.
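
The point about class IS can be sketched in a few lines of Python. This is a toy, not an implementation of UAX #14: the class table below is a hand-written assumption that classifies only U+002E (as IS) and gives every other character a placeholder class, and the function names are illustrative. It applies just the "no break before IS" rule:

```python
# Toy line-break classes: only U+002E is classified (as IS, as stated in
# the text above); every other character gets the placeholder class "XX".
# This is an illustrative assumption, not the UAX #14 data file.
def line_break_class(ch):
    return "IS" if ch == "\u002E" else "XX"

def prohibited_breaks(text):
    """Return indices i where no break is allowed immediately before text[i],
    applying only the 'no break before class IS' rule."""
    return [i for i in range(1, len(text))
            if line_break_class(text[i]) == "IS"]

if __name__ == "__main__":
    # Index 3 is the position between the two periods in "[0..1]".
    print(prohibited_breaks("[0..1]"))
```

Because the second period is itself class IS, the position between the two periods is never a break opportunity under this rule, with no special handling of the pair.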
> The single dot leader looks like a way to fill the gap,
> only because two-dot three-dots ellipsis did not allow,
> in most fonts and applications, to create a regular leader,
> using smaller dots than the one used for the regular full stop
> punctuation.
You are mixing up glyphs and characters here.
In "most fonts and applications" leader dots are *glyphs* used
to express a measured leader line, not characters at all.
> The fact that it was unified with XCCS (with some
> compromises accepted by Xerox) clearly demonstrates that
> the Xerox design was not the main focus:
In the case of encoding of the ONE DOT LEADER, you don't know what you
are talking about.
> - Who knows XCCS and use it ? Very few people.
Today, yes. But it was a key source of character repertoire for
Unicode 1.0, and choices made in the XCCS often guided thinking
about character/glyph distinctions for Unicode.
> - Who uses leaders ? Every publisher and author of long documents
> that do not want to see irregularly spaced leaders, or a dotted
> grid instead of a true dotted horizontal line.
This is irrelevant to the claims you have been making about U+2024.
>
> Leaders are visual helpers for the eye of readers, they have
> absolutely no punctuation or symbolic semantic (unlike the
> two-dots symbol or the ellipsis). The fact that it was categorized
> as a punctuation is probably an initial error
It was not. The error is your assumption that the TWO DOT LEADER
was encoded to represent the convention of using <U+002E, U+002E>
to indicate a range.
> that can't be corrected and that comes from the classification
> of its approximate fallback "compatibility decomposition".
>
> So you seem to mix the very distinct concept of compatibility
> characters and compatibility decompositions:
I see...
[*looks around the office to see who else it was who wrote that
text in Chapter 2*]
...but I do appreciate the coals delivered to Newcastle. ;-)
> - compatibility characters are for the initial mapping from an
> important legacy encoding with full roundtrip, and the
> exact semantic is preserved in this mapping to Unicode. The usage
> of these Unicode codepoints is discouraged out of this legacy usage.
>
> - characters that have compatibility decompositions are intended
> as guides for acceptable fallback characters that will not create
> too confusing interpretation by readers, but the exact semantic
> is not preserved with their compatibility decomposition. Their
> usage is not discouraged but instead favored by Unicode which
> adds important semantics in the "composed" character.
I won't deconstruct this sentence by sentence. But use of
compatibility characters is not discouraged. *Some* of them
are deprecated; *some* of them are inappropriate for particular
uses; *some* of them are, in fact, required for other contexts.
It depends on what you are doing in your implementations.
Compatibility decompositions were *not* defined as
guides for acceptable fallback. They can be used as part of
a fallback conversion implementation, but fallback is a much
more general problem, and applies to characters that have no
decompositions and to characters with canonical decompositions,
as well.
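
To illustrate why fallback is the more general problem, here is a rough sketch of an ASCII fallback converter in Python. Everything in it is an assumption made for illustration: the function name `ascii_fallback`, the small hand-made `EXTRA_FALLBACK` table, and the choice of "?" as the last resort. It uses decompositions where they happen to help, but still needs a separate table for characters that have no decomposition at all:

```python
import unicodedata

# Hypothetical hand-made table for characters that have no decomposition
# of any kind, such as the curly single quotation marks.
EXTRA_FALLBACK = {"\u2018": "'", "\u2019": "'"}

def ascii_fallback(text):
    """Best-effort ASCII rendering of text (illustrative only)."""
    out = []
    for ch in text:
        if ch in EXTRA_FALLBACK:
            out.append(EXTRA_FALLBACK[ch])
            continue
        # NFKD applies both canonical and compatibility decompositions.
        decomposed = unicodedata.normalize("NFKD", ch)
        kept = "".join(c for c in decomposed if ord(c) < 128)
        out.append(kept if kept else "?")
    return "".join(out)

if __name__ == "__main__":
    print(ascii_fallback("caf\u00e9"))        # canonical decomposition
    print(ascii_fallback("a\u2024b"))         # compatibility decomposition
    print(ascii_fallback("\u2018x\u2019"))    # no decomposition: table lookup
```

The decomposition data is one input among several; the table and the last-resort policy are decisions of the converter, not of the standard.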
Finally, some compatibility decomposable characters are not
only discouraged, they may even be "strongly discouraged", for
one reason or another. See, for example, U+0F77 and U+0F79.
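
The two Tibetan examples can be inspected the same way. A small sketch, again assuming Python's `unicodedata` module and an illustrative helper name (`compat_mapping`), that extracts the code points of a compatibility decomposition mapping:

```python
import unicodedata

def compat_mapping(cp):
    """Return the code points of a <compat> decomposition mapping,
    or None if the character has no mapping tagged <compat>."""
    fields = unicodedata.decomposition(chr(cp)).split()
    if not fields or fields[0] != "<compat>":
        return None
    return [int(f, 16) for f in fields[1:]]

if __name__ == "__main__":
    for cp in (0x0F77, 0x0F79):
        mapping = compat_mapping(cp)
        print(f"U+{cp:04X}", unicodedata.name(chr(cp)),
              [f"U+{m:04X}" for m in mapping])
```

The data only shows that these characters are compatibility decomposable; the discouragement itself is stated in the text of the standard, not in the decomposition mapping.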
I'd advise more care in making unjustified generalizations
and then proclaiming them to the unicode list as if they
were expert opinions.
--Ken