Normalisation and font technology

From: John Hudson
Date: Tue May 28 2002 - 15:59:28 EDT

At 08:51 5/28/2002, Addison Phillips [wM] wrote, in reference to comments
from Python developers re. Unicode:

>I suspect that you could search-and-replace the word "Unicode" with the word
>"multibyte" or the word "Japanese" and successfully turn the clock back ten
>years. The difference between then and now is that internationalization
>retrofit projects are being undertaken just to get Unicode support, rather
>than to satisfy specific, transient language needs

The comments Roozbeh forwarded to the Unicode list, originally cited by
Just van Rossum on the OpenType list, did not include the context of Just's
commentary. The objection of Python developers appears to be very
specifically focused on Unicode normalisation:

         ...a programming language that supports Unicode is ideally
         supposed to do canonical decomposition before comparing two
         strings. Eg. Python has otherwise pretty good Unicode support,
         but the developers have decided that comparison will *not* do
         canonical decomposition. "Too complicated, too much overhead".

Frankly, their analysis is correct -- canonical decomposition in
normalisation *is* a headache -- even if their solution is not.
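
To make the comparison issue concrete, here is a minimal Python sketch using
the standard unicodedata module. The helper name canonically_equal is
hypothetical, introduced only for illustration; it is not something the
Python developers provide or endorse. It shows that a plain string comparison
treats the precomposed and decomposed forms as different, while a comparison
after canonical decomposition (NFD) treats them as equal.

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE as one code point
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT


def canonically_equal(s, t):
    """Compare two strings after canonical decomposition (NFD).

    Hypothetical helper: this is the extra step that a plain == skips.
    """
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# Plain comparison works on code points, so the two forms differ.
plain_result = precomposed == decomposed

# Comparison after decomposition treats them as the same text.
canonical_result = canonically_equal(precomposed, decomposed)

print(plain_result, canonical_result)
```

The cost the developers object to is visible here: every comparison has to
normalise both operands first, unless strings are normalised once on input.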

Normalisation is also shaping up to be a font support headache that will
have a major impact on all sorts of software. Apple recently started
applying normalisation to file names in Mac OS X, with the result that the
content of folders can now only be correctly displayed with fonts that
contain the AAT table information that Mac OS needs in order to
recompose the diacritics that have been decomposed in
normalisation. In practice, this means only Apple system fonts, because
Apple are pretty much alone in making AAT fonts. Now apply this model to
other software: Do you really want word processing applications or web
browsers that can only correctly display text in a handful of fonts on a
user's system?

Is it, in fact, wise to rely on resolving display of character level
decomposition in glyph space? It seems fraught with problems, especially
considering that we still have competing font formats and no documented
process agreed on by OS, app and font developers for how to handle this
stuff. We also have tens of thousands of fonts out there that don't even
contain glyphs or cmap references for most combining marks, meaning that
normalised strings are most likely to be represented by dozens of .notdef
boxes. Users are not going to want to purchase new versions of their entire
font collections and, frankly, a hell of a lot of font developers wouldn't
know where to begin if you told them they needed to make an OpenType -- let
alone AAT -- font that included glyph substitution or positioning to render
normalised Unicode text strings.

Wouldn't it make more sense for OS and app developers to try to resolve
display of normalised text at the character level, by querying font cmap
tables or encoding vectors? It seems to me to make very little sense to
normalise a+acute as a + combiningacute if this results in a font that
contains a perfectly good a+acute glyph being unable to render the text
correctly due to lack of appropriate combiningacute character and/or glyph
substitution or positioning information. Of course, confirming, from the
cmap, the presence of appropriately encoded glyphs -- a, combiningacute --
does not necessarily mean that the font contains the necessary substitution or
positioning information to render a correct typeform for a+acute. This
suggests that if the font does contain an a+acute glyph encoded as U+00E1,
this should always be used in display in preference to the glyphs
representing the normalised text. This in turn suggests that if text is
going to be decomposed in normalisation, it should be recomposed in a
buffered character string prior to rendering.

So the user enters U+00E1

This gets normalised to U+0061 U+0301

This gets recomposed in a buffered string to U+00E1

This gets passed to the font engine and rendered with the appropriate glyph

No glyph space processing required
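
The steps above can be sketched in Python with the standard unicodedata
module. This is only an illustration under stated assumptions: the function
prepare_for_rendering and the cmap sets are hypothetical, with a plain set of
code points standing in for a real font cmap query, and the fallback to the
decomposed sequence when no precomposed glyph exists is my reading of the
proposal rather than a documented process.

```python
import unicodedata


def prepare_for_rendering(text, cmap):
    """Recompose normalised text into a buffered string for display.

    cmap is a set of code points the font maps to glyphs; it is a
    hypothetical stand-in for querying a real font's cmap table.
    """
    composed = unicodedata.normalize("NFC", text)
    buffered = []
    for ch in composed:
        if ord(ch) in cmap:
            # The font has a glyph for this code point: use it directly.
            buffered.append(ch)
        else:
            # No such glyph: hand over the decomposed sequence, which
            # then depends on glyph substitution or positioning.
            buffered.append(unicodedata.normalize("NFD", ch))
    return "".join(buffered)


user_input = "\u00e1"                             # user enters U+00E1
stored = unicodedata.normalize("NFD", user_input)  # U+0061 U+0301

legacy_cmap = {0x0061, 0x00E1}     # assumed font: has a and a+acute only
combining_cmap = {0x0061, 0x0301}  # assumed font: has a and combining acute

for_legacy_font = prepare_for_rendering(stored, legacy_cmap)
for_combining_font = prepare_for_rendering(stored, combining_cmap)

print([hex(ord(c)) for c in for_legacy_font])
print([hex(ord(c)) for c in for_combining_font])
```

With the first font the buffered string is the single code point U+00E1, so
no glyph space processing is needed; with the second the decomposed sequence
is passed through and the font's own layout tables have to do the work.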

Obviously glyph substitution or positioning is a requirement for handling
sequences of base and combining characters that do not have precomposed
forms in Unicode. This is understood, and users realise that if they want
to typeset, for instance, African languages with tone marks, they are going
to need a font with appropriate cmap coverage and glyph processing support.
Nor is it difficult to conceive of a time in the not too distant future
when all fonts are made with such glyph processing support, because the
ability to compose arbitrary diacritic combinations with glyph positioning
is a good thing that increases the potential market for a font. It is
foolish, though, to act as if this were already the case, develop systems
that rely on such processing, and in the process make users' existing fonts

John Hudson

Tiro Typeworks
Vancouver, BC

When the pages of books fall in fiery scraps
Onto smashed leaves and twisted metal,
The tree of good and evil is stripped bare.
                                        - Czeslaw Milosz
