Re: A basic question on encoding Latin characters

From: peter_constable@sil.org
Date: Fri Sep 24 1999 - 11:16:00 EDT


       Marco:

       Interesting discussion. I've chosen to respond to the bottom
       half of your message.

>There is only a minority of Unicode characters that require
       more than one glyph.

       There are a whole lot of Unihan characters that need multiple
glyphs (according to the choice of simplified Chinese,
       traditional Chinese, Japanese or Korean). Of course, multiple
       glyphs per character
       are needed for Arabic, Devanagari, Thai, and several other
       scripts. A font designer may also want to include lots of
       composites for Latin, Hebrew or any other script that has
       diacritics in order to provide optimal output quality. Many
       scripts, e.g. Arabic, Devanagari, require lots of ligatures; a
type designer may also choose to add additional precomposed
       ligature forms for optimal output quality. Monotype, for
       example, has done type design for Nastaliq style Arabic in
       which they have something on the order of 20,000 glyphs to
       render a small number of characters.

       I.e. there are many ways in which this statement is not really
       representative of the actual situation.
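
       To illustrate (a deliberately simplified Python sketch; real
       shaping engines are driven by font tables, and the glyph
       names here are invented), a single Arabic character can map
       to four different glyphs depending on whether its neighbours
       join to it:

           # One character, four glyphs: isolated, final, initial, medial.
           # Glyph names are invented for illustration.
           BEH_FORMS = {
               (False, False): "beh_isolated",
               (True,  False): "beh_final",    # joined to the preceding letter
               (False, True):  "beh_initial",  # joined to the following letter
               (True,  True):  "beh_medial",   # joined on both sides
           }

           def beh_glyph(joins_before, joins_after):
               """Pick the contextual glyph for ARABIC LETTER BEH (U+0628)."""
               return BEH_FORMS[(joins_before, joins_after)]

           print(beh_glyph(False, True))  # -> 'beh_initial'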

>In my mind, all those letter-accent pairs, all those
       ligatures, all those "presentation forms" for Arabic and
       vertical CJK, all those ideograph variants, etc. are there to
       allow font designers to use Unicode as a glyph-indexing system.

       All the characters you refer to here are in the standard as
       part of a compromise with the past. There are many ways in
       which we'd be better off if they could have been left out. They
are *definitely not* in there for the benefit of font
       designers. They are only there because of legacy encodings and
       the requirement of round-trip conversion.

>What I am trying to say is that Unicode should pragmatically
give up the "abstract character" concept in some limited
       cases, and explicitly admit that some of the code points are
       not there to represent "abstract" characters, but rather to
       represent "actual" glyphs.

>If this distinction is made clear, then everything would fit
       nicely in its proper slot: it would become clear(er) that some
       "characters" are actually graphemes designed to be used as
       glyph indexes inside fonts (or inside rendering algorithms),
and that applications are not encouraged to use them to encode
       text.

       It could perhaps work if *everyone* recognised what should be
       used inside a document and what is there for font-internal
       purposes only, and if everyone *obeyed* the rule that only the
former get used except in the rendering process. The reality is
       that this could never be enforced. We'd have a *very serious
       mess*.

Also, it was clear from day one that developing a standard
       encoding for all the glyphs of the world (even assuming an
       abstract definition of glyph) would be a very unpleasant task
       that even a Vogon wouldn't want to impose on anybody - and
       possibly an impossible one - and that it would take a lot
       more than 16 bits, since the number of potential glyphs is,
       in principle, open ended. E.g. if Monotype wanted a hole into
       which to sink their capital, they could have created ligature
       glyphs for all possible combinations of characters of lengths
       2 to 20. To encode all those glyphs would take something like
       a 70-bit encoding (very quick and crude estimate).
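
       To see how quickly that blows up (a back-of-envelope sketch
       in Python; restricting to the 28 Arabic base letters is
       purely an illustrative assumption, and the exact bit count
       depends on what one chooses to count):

           import math

           BASE = 28  # assumed alphabet: the 28 Arabic base letters

           # Count every possible ligature sequence of lengths 2 to 20.
           total = sum(BASE ** k for k in range(2, 21))

           print(math.ceil(math.log2(total)))  # -> 97 bits for one script alone

       Opening the alphabet up to a full 16-bit repertoire pushes
       the count toward 2^320, which is the sense in which the space
       of potential glyphs is open ended.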

       Furthermore, it isn't necessary to do that. For all text
       processing purposes other than rendering, glyph IDs are of no
interest whatsoever. It is the text element that matters.
       Encoding every grapheme would be a far better suggestion, but
       it's possible to accomplish the same results by encoding
       characters as currently defined in Unicode, using
       decomposition wherever possible and reasonable (e.g. it's not
       reasonable to decompose "P" into stem and circle components),
       and it's a whole lot easier to develop an encoding standard
       this way.
       Ideally, it would have been done in a consistent manner with no
       precomposed forms, but for practical reasons some compromises
       were considered necessary for purposes of legacy encodings and
       round-trip conversion.
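
       For instance (a minimal sketch using Python's standard
       unicodedata module), the precomposed character U+00E1 and its
       decomposed spelling denote the same text element:

           import unicodedata

           precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
           decomposed = "a\u0301"   # "a" + COMBINING ACUTE ACCENT

           # Canonical decomposition (NFD) maps the precomposed form
           # onto the decomposed one, so the two spellings normalise
           # to the same sequence.
           print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True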

>This would open the door to 3 different things:

>1) Greater relaxation in adding new pre-composed glyphs: if
       font designers ask for them, they must have good reasons.

       However much font designers might ask, strictly speaking they
       don't actually need them in a character encoding. That may make
       things easier for end users in the short term, but as Ken
       Whistler suggested, everything will eventually work without
them; it's just a matter of time. Furthermore, adding new
       precomposed glyphs just creates a lot of additional work in
       the area of normalisation and canonical equivalence, and will
       only lead to grief.
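
       The cost is concrete: every precomposed addition must carry a
       canonical decomposition, and every normaliser must learn
       about it before comparisons come out right (again a small
       Python sketch):

           import unicodedata

           # The standard must record a decomposition for each
           # precomposed character...
           print(unicodedata.decomposition("\u00e1"))  # -> '0061 0301'

           # ...because a raw comparison misses canonical equivalence:
           print("\u00e1" == "a\u0301")                               # False
           print(unicodedata.normalize("NFC", "a\u0301") == "\u00e1") # True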

>2) The possibility to standardize, up to a certain degree, the
       process of transforming a string of "abstract characters" to a
       string of renderable glyphs. Of course, some details will
       always be totally dependent on the font and required quality.
       But at least some basic readability features could be expressed
       as simple mapping lists, rather than as lengthy algorithms
expressed in natural language.

       In reality, there is far too much that is dependent upon the
       font and how a type designer wants to handle certain problems.
       Consider again the Nastaliq case: Monotype chose to implement
       this using a particular set of precomposed ligature glyphs.
       They could have added more, but chose not to; they could also
       have left some out, but chose not to. There is no single,
       obvious set of rendering rules that could be immediately stated
       for most scripts, and even coming up with a set of reasonably
       obvious, basic rules would not be trivial. It's beyond the
       scope of this encoding standard. The prose descriptions in the
       standard for things like Arabic and Devanagari provide a
       guideline for how the characters in the standard should be
       understood and implemented. There are a whole lot of choices
       that are still left open to the font designer, and restating
       the prose descriptions as mapping tables wouldn't change this.
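
       To make that concrete (a hypothetical sketch; the glyph names
       and the table contents are invented, not Monotype's actual
       set), even a "simple mapping list" for ligatures encodes a
       per-font design decision about which sequences get ligated:

           # Hypothetical, font-private ligature table. Another font
           # might list only ("f", "i"), or dozens more sequences.
           LIGATURES = {
               ("f", "i"): "fi_ligature",
               ("f", "f", "i"): "ffi_ligature",
           }

           def substitute(chars):
               """Greedy longest-match ligature substitution."""
               glyphs, i = [], 0
               while i < len(chars):
                   for n in (3, 2):  # try longer sequences first
                       seq = tuple(chars[i:i + n])
                       if seq in LIGATURES:
                           glyphs.append(LIGATURES[seq])
                           i += n
                           break
                   else:
                       glyphs.append(chars[i])  # no ligature applies
                       i += 1
               return glyphs

           print(substitute(list("office")))
           # -> ['o', 'ffi_ligature', 'c', 'e']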

>3) Greater relaxation for application developers: they would
       still be free to nicely display a character like U+00E1, but
       they would no longer be blamed if they want to be extremist and
       show a white box instead.

??? I don't follow this point. Application developers are
       always free to display a white box if they want; nobody would
       blame them for anything, though users just might not buy
       their software. But it's not the application developer that
       chooses to display a white box. They simply display whatever
       characters
       are in the data using whatever font is specified. They rely on
       the OS provider to handle a mapping from characters to glyphs
       in a reasonable way (of course, they have to make use of the
       appropriate services the OS provider has made available), and
       they rely on the font designer to provide appropriate glyphs.
       If a white box appears, it's not because an app developer has
       done anything; it's because the data contains a character for
       which the selected font has no glyph. None of this is the app
       developer's responsibility, though some app developers go out
       of their way to keep the user from seeing white boxes and thus
implement clever techniques such as MS's font linking. (We may
       start seeing OS providers including services for font fallback
       mechanisms like this; I'm sure lots of app developers would be
       happy if they did.)
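
       A font-linking/fallback mechanism of that kind might look
       roughly like this (a schematic sketch; the Font class and its
       cmap dictionary stand in for whatever the OS actually
       provides):

           class Font:
               """Stand-in for a real font: a name plus a cmap
               dictionary (character code -> glyph name)."""
               def __init__(self, name, cmap):
                   self.name, self.cmap = name, cmap

           latin = Font("LatinFont", {ord("a"): "a_glyph"})
           arabic = Font("ArabicFont", {0x0628: "beh_glyph"})

           def glyph_for(char, font_chain):
               """Walk the fallback chain; take the first font that
               covers the character."""
               for font in font_chain:
                   glyph = font.cmap.get(ord(char))
                   if glyph is not None:
                       return glyph
               return ".notdef"  # no coverage anywhere: the white box

           print(glyph_for("\u0628", [latin, arabic]))  # found via fallback
           print(glyph_for("\u4e00", [latin, arabic]))  # -> '.notdef'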

       I don't see what encoding glyphs and precomposed forms has to
       do with application developers not being blamed for displaying
       white boxes.

       Peter


