At 08:51 5/28/2002, Addison Phillips [wM] wrote, in reference to comments 
from Python developers re. Unicode:
>I suspect that you could search-and-replace the word "Unicode" with the word
>"multibyte" or the word "Japanese" and successfully turn the clock back ten
>years. The difference between then and now is that internationalization
>retrofit projects are being undertaken just to get Unicode support, rather
>than to satisfy specific, transient language needs
The comments Roozbeh forwarded to the Unicode list, originally cited by 
Just van Rossum on the OpenType list, did not include the context of Just's 
commentary. The objection of Python developers appears to be very 
specifically focused on Unicode normalisation:
         ...a programming language that supports Unicode is ideally
         supposed to do canonical decomposition before comparing two
         strings. Eg. Python has otherwise pretty good Unicode support,
         but the developers have decided that comparison will *not* do
         canonical decomposition. "Too complicated, too much overhead".
Frankly, their analysis is correct -- canonical decomposition in 
normalisation *is* a headache -- even if their solution is not.
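To make the comparison problem concrete, here is a minimal Python sketch using the standard library's unicodedata module (the exact literal syntax will of course vary with the Python version in use):

    import unicodedata

    s1 = "\u00e1"     # a-acute as the single precomposed character U+00E1
    s2 = "a\u0301"    # the same text as a + combining acute accent

    # A raw comparison sees two different code point sequences.
    print(s1 == s2)   # False

    # Canonical decomposition before comparing treats them as the same text.
    print(unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2))   # True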
Normalisation is also shaping up to be a font support headache that will 
have a major impact on all sorts of software. Apple recently started 
applying normalisation to file names in Mac OS X, with the result that folder 
contents can now be displayed correctly only with fonts containing the AAT 
table information the Mac OS needs in order to recompose the diacritics that 
normalisation decomposed. In practice, this means only Apple system fonts, because 
Apple are pretty much alone in making AAT fonts. Now apply this model to 
other software: Do you really want word processing applications or web 
browsers that can only correctly display text in a handful of fonts on a 
user's system?
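To see what that file-name situation looks like at the character level, here is a rough sketch (the Mac OS X file system actually applies its own variant of canonical decomposition, so the NFD call below is only an approximation):

    import unicodedata

    name = "café"                                   # what the user typed
    stored = unicodedata.normalize("NFD", name)     # roughly what the file system keeps
    print([hex(ord(c)) for c in stored])
    # ['0x63', '0x61', '0x66', '0x65', '0x301']
    # Simply listing the folder now requires a font that can either display
    # or recompose the combining acute accent.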
Is it, in fact, wise to rely on resolving display of character-level 
decomposition in glyph space? It seems fraught with problems, especially 
considering that we still have competing font formats and no documented 
process agreed on by OS, app and font developers for how to handle this 
stuff. We also have tens of thousands of fonts out there that don't even 
contain glyphs or cmap references for most combining marks, meaning that 
normalised strings are most likely to be represented by dozens of .notdef 
boxes. Users are not going to want to purchase new versions of their entire 
font collections and, frankly, a hell of a lot of font developers wouldn't 
know where to begin if you told them they needed to make an OpenType -- let 
alone AAT -- font that included glyph substitution or positioning to render 
normalised Unicode text strings.
Wouldn't it make more sense for OS and app developers to try to resolve 
display of normalised text at the character level, by querying font cmap 
tables or encoding vectors? It seems to me to make very little sense to 
normalise a+acute as a + combiningacute if this results in a font that 
contains a perfectly good a+acute glyph being unable to render the text 
correctly due to lack of appropriate combiningacute character and/or glyph 
substitution or positioning information. Of course, confirming, from the 
cmap, the presence of appropriately encoded glyphs -- a, combiningacute -- 
does not necessarily mean that the font contains the substitution or 
positioning information necessary to render a correct typeform for a+acute. This 
suggests that if the font does contain an a+acute glyph encoded as U+00E1, 
this should always be used in display in preference to the glyphs 
representing the normalised text. This in turn suggests that if text is 
going to be decomposed in normalisation, it should be recomposed in a 
buffered character string prior to rendering.
So the user enters U+00E1
This gets normalised to U+0061 U+0301
This gets recomposed in a buffered string to U+00E1
This gets passed to the font engine and rendered with the appropriate glyph
No glyph space processing required
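In code, that buffering step might look something like the sketch below. The font_cmap set is a stand-in for a real cmap query, and the function is only my illustration of the idea, not anyone's actual API:

    import unicodedata

    def recompose_for_display(normalised_text, font_cmap):
        # font_cmap: the set of code points the font actually encodes --
        # an assumption standing in for querying the font's real cmap table.
        buffered = unicodedata.normalize("NFC", normalised_text)
        result = []
        for ch in buffered:
            if ord(ch) in font_cmap or unicodedata.combining(ch):
                # The font encodes this character, or it is a combining mark
                # that has to be passed through as-is.
                result.append(ch)
            else:
                # No precomposed glyph in the font: fall back to the
                # decomposed sequence and hope for combining-mark coverage.
                result.append(unicodedata.normalize("NFD", ch))
        return "".join(result)

    stored = unicodedata.normalize("NFD", "\u00e1")      # U+0061 U+0301
    print(recompose_for_display(stored, {0x61, 0xE1}))   # U+00E1 again -- one glyph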
Obviously glyph substitution or positioning is a requirement for handling 
sequences of base and combining characters that do not have precomposed 
forms in Unicode. This is understood, and users realise that if they want 
to typeset, for instance, African languages with tone marks, they are going 
to need a font with appropriate cmap coverage and glyph processing support. 
Nor is it difficult to conceive of a time in the not-too-distant future 
when all fonts are made with such glyph processing support, because the 
ability to compose arbitrary diacritic combinations with glyph positioning 
is a good thing that increases the potential market for a font. It is 
foolish, though, to act as if this were already the case, develop systems 
that rely on such processing, and in the process make users' existing fonts 
useless.
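By way of contrast, a sketch of the case just described: a sequence with no precomposed form stays decomposed however it is normalised, so glyph positioning in the font is the only way to render it (open e with acute is my example; any tone-marked vowel without a precomposed code point would serve):

    import unicodedata

    seq = "\u025b\u0301"   # LATIN SMALL LETTER OPEN E + COMBINING ACUTE ACCENT
    # NFC has nothing to recompose this to, because no precomposed form exists.
    print(unicodedata.normalize("NFC", seq) == seq)   # True
    # Correct display therefore depends on mark positioning in the font itself.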
John Hudson
Tiro Typeworks		www.tiro.com
Vancouver, BC		tiro@tiro.com
When the pages of books fall in fiery scraps
Onto smashed leaves and twisted metal,
The tree of good and evil is stripped bare.
                                        - Czeslaw Milosz