Marco:
Interesting discussion. I've chosen to respond to the bottom
half of your message.
>There is only a minority of Unicode characters that require
more than one glyph.
There are a whole lot of Unihan characters that need multiple
glyphs (depending on whether simplified Chinese, traditional
Chinese, Japanese or Korean conventions are wanted). Of course,
multiple glyphs per character
are needed for Arabic, Devanagari, Thai, and several other
scripts. A font designer may also want to include lots of
composites for Latin, Hebrew or any other script that has
diacritics in order to provide optimal output quality. Many
scripts, e.g. Arabic, Devanagari, require lots of ligatures; a
type designer may also choose to add additional precomposed
ligature forms for optimal output quality. Monotype, for
example, has done type design for Nastaliq style Arabic in
which they have something on the order of 20,000 glyphs to
render a small number of characters.
I.e. there are many ways in which this statement is not really
representative of the actual situation.
>In my mind, all those letter-accent pairs, all those
ligatures, all those "presentation forms" for Arabic and
vertical CJK, all those ideograph variants, etc. are there to
allow font designers using Unicode as a glyph indexing system.
All the characters you refer to here are in the standard as
part of a compromise with the past. There are many ways in
which we'd be better off if they could have been left out. They
are *definitely not* in there for the benefit of font
designers. They are only there because of legacy encodings and
the requirement of round-trip conversion.
>What I am trying to say is that Unicode should pragmatically
give up the "abstract character" concept in some delimited
cases, and explicitly admit that some of the code points are
not there to represent "abstract" characters, but rather to
represent "actual" glyphs.
>If this distinction is made clear, then everything would fit
nicely in its proper slot: it would become clear(er) that some
"characters" are actually graphemes designed to be used as
glyph indexes inside fonts (or inside rendering algorithms),
and that applications are not encouraged to use them to encode
text.
It could perhaps work if *everyone* recognised what should be
used inside a document and what is there for font-internal
purposes only, and if everyone *obeyed* the rule that only the
former get used except in the rendering process. The reality is
that that could never be enforced. We'd have a *very serious
mess*.
Also, it was clear from day one that developing a standard
encoding for all the glyphs of the world (even assuming an
abstract definition of glyph) would be a very unpleasant task
that even a Vogon wouldn't want to impose on anybody - and
possibly an impossible one, and that it would take a lot more
than 16 bits since the number of potential glyphs is, in
principle, open ended. E.g. if Monotype wanted a hole into
which to sink their capital, they could have created ligature
glyphs for all possible combinations of characters from lengths
2 to 20. To encode all those glyphs would take something like a
70-bit encoding (very quick and crude estimate).
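To make the combinatorics concrete, here is a rough sketch of
my own (not part of the original discussion), in Python. The
repertoire sizes are assumptions, and the exact bit count
depends entirely on what one assumes, but any plausible choice
blows far past a 16-bit code space:

    import math

    def ligature_code_bits(repertoire, min_len=2, max_len=20):
        # Bits needed to give every ordered sequence of min_len..max_len
        # characters drawn from `repertoire` characters its own glyph code.
        total = sum(repertoire ** k for k in range(min_len, max_len + 1))
        return math.ceil(math.log2(total))

    # Assumed repertoire sizes: one script's letters, an 8-bit set, a 16-bit set.
    for n in (28, 256, 65536):
        print("repertoire %6d: roughly a %d-bit code space" % (n, ligature_code_bits(n)))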
Furthermore, it isn't necessary to do that. For all text
processing purposes other than rendering, glyph IDs are of no
interest whatsoever. It is the text element that matters. It
would be a far better suggestion to encode every grapheme, but
it's possible to accomplish the same results by encoding
characters as currently defined in Unicode using decomposition
wherever possible and reasonable (e.g. it's not reasonable to
decompose "P" into stem and circle components), and it's a
whole lot easier to develop an encoding standard this way.
Ideally, it would have been done in a consistent manner with no
precomposed forms, but for practical reasons some compromises
were considered necessary for purposes of legacy encodings and
round-trip conversion.
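As a concrete illustration (a sketch of mine using Python's
standard unicodedata module, not anything prescribed by the
standard), the precomposed U+00E1 and the decomposed sequence
<U+0061, U+0301> are canonically equivalent, so text processes
can work entirely with the decomposed form:

    import unicodedata

    precomposed = "\u00E1"    # LATIN SMALL LETTER A WITH ACUTE
    decomposed  = "a\u0301"   # letter a + COMBINING ACUTE ACCENT

    # The two spellings normalize to one another.
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True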
>This would open the door to 3 different things:
>1) Greater relaxation in adding new pre-composed glyphs: if
font designers ask for them, they must have their good reasons.
However much font designers might ask, strictly speaking they
don't actually need them in a character encoding. That may make
things easier for end users in the short term, but as Ken
Whistler suggested, everything will eventually work without
them, it's just a matter of time. Furthermore, adding new
pre-composed glyphs just creates a lot of additional work in
the area of normalisation and canonical equivalency, and will
only lead to grief.
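To illustrate that cost (again a hedged sketch of mine with
Python's unicodedata): every precomposed character carries a
canonical-decomposition entry, and every comparison, search and
sort has to fold the two spellings together forever after:

    import unicodedata

    # The canonical-decomposition entry that U+00E1 drags along with it.
    print(unicodedata.decomposition("\u00E1"))                      # '0061 0301'

    # Raw code point comparison fails; only normalisation makes them equal.
    print("\u00E1" == "a\u0301")                                    # False
    print(unicodedata.normalize("NFD", "\u00E1") == "a\u0301")      # True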
>2) The possibility to standardize, up to a certain degree, the
process of transforming a string of "abstract characters" to a
string of renderable glyphs. Of course, some details will
always be totally dependent on the font and required quality.
But at least some basic readability features could be expressed
as simple mapping lists, rather than as lengthy algorithms
expressed in natural-language.
In reality, there is far too much that is dependent upon the
font and how a type designer wants to handle certain problems.
Consider again the Nastaliq case: Monotype chose to implement
this using a particular set of precomposed ligature glyphs.
They could have added more, but chose not to; they could also
have left some out, but chose not to. There is no single,
obvious set of rendering rules that could be immediately stated
for most scripts, and even coming up with a set of reasonably
obvious, basic rules would not be trivial. It's beyond the
scope of this encoding standard. The prose descriptions in the
standard for things like Arabic and Devanagari provide a
guideline for how the characters in the standard should be
understood and implemented. There are a whole lot of choices
that are still left open to the font designer, and restating
the prose descriptions as mapping tables wouldn't change this.
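To see why, here is a toy sketch of my own of what such a
"simple mapping list" looks like for one Arabic letter. The
presentation-form code points are real (Arabic Presentation
Forms-B), but the join logic is deliberately minimal; a
production font and shaper make far more choices than a table
like this can capture:

    BEH_FORMS = {                  # U+0628 ARABIC LETTER BEH
        "isolated": "\uFE8F",
        "final":    "\uFE90",
        "initial":  "\uFE91",
        "medial":   "\uFE92",
    }

    def shape_beh(joins_before, joins_after):
        # Pick a contextual form from whether the neighbours join to this letter.
        if joins_before and joins_after:
            return BEH_FORMS["medial"]
        if joins_before:
            return BEH_FORMS["final"]
        if joins_after:
            return BEH_FORMS["initial"]
        return BEH_FORMS["isolated"]

    print(hex(ord(shape_beh(True, False))))   # 0xfe90, the final form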
>3) Greater relaxation for applicative developers: they would
still be free to nicely display a character like U+00E1, but
they would no longer be blamed if they want to be extremist and
show a white box instead.
??? I don't follow this point. Application developers are
always free to display a white box if they want; nobody would
blame them for anything, though users might not buy their
software. But it's not the application developer that chooses
to display a white box. They simply display whatever characters
are in the data using whatever font is specified. They rely on
the OS provider to handle a mapping from characters to glyphs
in a reasonable way (of course, they have to make use of the
appropriate services the OS provider has made available), and
they rely on the font designer to provide appropriate glyphs.
If a white box appears, it's not because an app developer has
done anything; it's because the data contains a character for
which the selected font has no glyph. None of this is the app
developer's responsibility, though some app developers go out
of their way to keep the user from seeing white boxes and thus
implement clever techniques such as MS's font linking. (We may
start seeing OS providers including services for font fallback
mechanisms like this; I'm sure lots of app developers would be
happy if they did.)
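As a rough sketch of my own of what such a fallback service
does (the coverage test below stands in for a real cmap lookup;
it is not an actual API): walk an ordered list of fonts, use
the first one that covers the character, and only show the
.notdef white box if none does:

    def font_covers(font, ch):
        # Placeholder for a real cmap lookup in `font`.
        return ord(ch) in font["cmap"]

    def pick_font(ch, selected_font, fallback_fonts):
        for font in [selected_font] + fallback_fonts:
            if font_covers(font, ch):
                return font
        return None    # caller renders .notdef, i.e. the white box

    body = {"name": "BodyFont", "cmap": set(range(0x0000, 0x0250))}
    cjk  = {"name": "CJKFont",  "cmap": set(range(0x4E00, 0xA000))}

    chosen = pick_font("\u4E2D", body, [cjk])
    print(chosen["name"] if chosen else "white box")   # CJKFont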
I don't see what encoding glyphs and precomposed forms has to
do with application developers not being blamed for displaying
white boxes.
Peter