RE: TTF and Unicode surrogates

From: Murray Sargent (murrays@microsoft.com)
Date: Fri May 15 1998 - 16:07:49 EDT


There seem to be at least two competing ways to represent surrogate
characters in TrueType. The simpler one is to overlay each higher plane's
code points on the BMP range, i.e., to make each font plane-dependent: a
plane 1 font, a plane 2 font, etc. This is the analog of the 8-bit
character set approach, but applied to 16 bits. Then the current Unicode
TrueType CMAP can work the same way it does now. The problem is that in
masking off hex digits 4 and 5 (the plane number), thereby obtaining a
16-bit code, some low-level routines might
misinterpret some common codes, such as ASCII control characters, according
to BMP semantics. Higher-level routines would work with the surrogate pairs
(or 32-bit characters) and not be affected in this way. One might be able
to agree to avoid such common codes in the higher planes. The cool thing
about this simple approach is that things work right away; e.g., no need to
change Windows GDI (ExtTextOutW), etc. Application code has to do some
minor translations (I used about 50 lines of code in an edit-control
prototype) to get this approach to work. But application code has to have
some minor changes anyhow to navigate in text containing surrogate pairs.
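
As a rough illustration of the translation involved, the sketch below (in
Python purely for brevity; the function names are hypothetical and not part
of any Windows or TrueType API) combines a surrogate pair into a 32-bit
code, splits that into a plane number and the 16-bit code a plane-dependent
font would see, and steps through text without breaking a pair apart. It
also shows the collision mentioned above: U+1000A masks down to 0x000A,
which BMP-oriented code would read as a line feed.

```python
from typing import List, Tuple


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_plane(code_point: int) -> Tuple[int, int]:
    """Return (plane, 16-bit code within that plane).

    The plane is hex digits 4 and 5; masking them off leaves the code
    that a plane-dependent font's 16-bit CMAP would be asked about.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16, code_point & 0xFFFF


def next_char_index(units: List[int], i: int) -> int:
    """Step past one character in a list of UTF-16 code units,
    treating a well-formed surrogate pair as a single character."""
    if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF):
        return i + 2
    return i + 1


# U+1000A lands on the same 16-bit value as the ASCII line feed.
plane, code = split_plane(combine_surrogates(0xD800, 0xDC0A))
print(plane, hex(code))
```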

The second, more general approach is to use a 32-bit CMAP. 16-bit glyph
indices would still be used so that TrueType rasterizers wouldn't have to be
modified. The code to translate from character codes to glyph indices would
have to be generalized to handle the new CMAP. This approach allows access
to the full Unicode code space up through 0x10FFFF, 64K glyphs at a time,
rather than a plane at a time. As with smaller index CMAPs, the table would be
highly compressed and quite efficient. But GDI's current ExtTextOutW
doesn't understand surrogate pairs and would have to be revised or apps
would have to call a routine that translates to 16-bit glyph indices, which
the various ExtTextOutWs do understand. It's easy to revise new versions of
the operating system to handle such changes, but it's hard to change all the
Win95 installations already out there. So one would need a new OS-independent component
that maps character codes to glyph indices and apps would have to be
modified to call this new component.
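
To make the second approach more concrete, here is a minimal sketch of a
range-compressed lookup from 32-bit character codes to 16-bit glyph indices.
The table layout (sorted groups of first code, last code, first glyph index)
and the sample ranges are assumptions chosen for illustration, not a defined
TrueType CMAP format, and glyph_index and text_to_glyphs are hypothetical
names for the kind of OS-independent component described above.

```python
from bisect import bisect_right

# Hypothetical table: (first code, last code, glyph index of first code),
# sorted by first code. Each range maps to consecutive glyph indices.
GROUPS = [
    (0x0020, 0x007E, 1),
    (0x4E00, 0x4E0F, 96),
    (0x10000, 0x1000F, 112),
    (0x20000, 0x2000F, 128),
]
STARTS = [group[0] for group in GROUPS]
MISSING_GLYPH = 0  # glyph index 0 is the TrueType missing glyph


def glyph_index(code_point):
    """Map a 32-bit character code to a 16-bit glyph index."""
    pos = bisect_right(STARTS, code_point) - 1  # last group starting <= code
    if pos >= 0:
        first, last, first_glyph = GROUPS[pos]
        if code_point <= last:
            glyph = first_glyph + (code_point - first)
            assert glyph <= 0xFFFF  # glyph indices stay 16-bit
            return glyph
    return MISSING_GLYPH


def text_to_glyphs(units):
    """Translate UTF-16 code units to glyph indices, merging surrogate pairs."""
    glyphs = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            code = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            code = unit
            i += 1
        glyphs.append(glyph_index(code))
    return glyphs


# 'A' followed by the surrogate pair for U+10003
print(text_to_glyphs([0x0041, 0xD800, 0xDC03]))
```

An app would hand the resulting list to the glyph-index drawing path rather
than passing the original character codes.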

A third approach is to use 32-bit codes for glyph indices as well. This is
an unlikely approach, partly because it's a lot of work and has
significantly increased memory requirements, and partly because the demand
for a single font with more than 64K glyphs doesn't seem to be large. To
display Unicode, one is generally better off using a coordinated set of
fonts rather than one monster font. The latter could be useful for general
low-quality displays, e.g., directory displays, but since one needs to solve
the higher-quality problem addressed by multiple fonts anyhow, one might as
well use the multiple-font solution to handle large glyph-set requirements too.

David Goldsmith outlines a fourth approach that we've also looked at, namely
using ligatures. Offhand, this would seem to be an expensive way to handle
surrogates. On Windows platforms, it would still require a separate line
layout component to handle the glyphing, since ExtTextOutW doesn't handle
ligatures unless given a ligature glyph index (or the character code
for the few ligatures that slipped into Unicode because they existed in
widespread code sets). It would be nice to have some performance results
from David's implementation.
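
For comparison, the ligature approach can be pictured roughly as follows.
This is only a conceptual sketch: the glyph numbers and the dictionary-based
ligature table are invented for illustration and do not reflect the actual
'mort' table format or David's implementation. The 16-bit cmap gives each
surrogate code unit its own placeholder glyph, and a later pass replaces
each high/low glyph pair with the glyph for the intended character.

```python
# Hypothetical 16-bit cmap: code unit -> glyph index (0 = missing glyph)
CMAP = {0x0041: 34, 0xD800: 500, 0xDC00: 600, 0xDC01: 601}

# Hypothetical mandatory ligatures: (high glyph, low glyph) -> final glyph
LIGATURES = {(500, 600): 1000, (500, 601): 1001}


def shape(units):
    """Map code units to glyphs, then fuse surrogate glyph pairs."""
    glyphs = [CMAP.get(unit, 0) for unit in units]
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])  # pair becomes one glyph
            i += 2
        else:
            out.append(glyphs[i])  # ordinary glyph or unpaired surrogate
            i += 1
    return out


# 'A', the pair D800 DC01, 'A'
print(shape([0x0041, 0xD800, 0xDC01, 0x0041]))
```

In this naive form the table needs one entry per surrogate-pair character,
which gives some feel for why the approach looks expensive; a real table
format may well do better.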

Can anyone think of better approaches?

Thanks
Murray

> -----Original Message-----
> From: David Goldsmith [SMTP:goldsmith@apple.com]
> Sent: Thursday, May 14, 1998 11:26 PM
> To: Unicode List
> Subject: Re: TTF and Unicode surrogates
>
> At 5/14/98 12:16 PM, Werner Lemberg (sx0005@sx2.hrz.uni-dortmund.de)
> wrote:
>
> >how will such surrogate characters be represented in a cmap of a TrueType
> >font?
>
> I'm not aware of any way to do this with only the cmap, given the current
> definition of TrueType fonts. A cmap can only map a single 16 bit
> character to a single glyph.
>
> On Apple platforms, the way we plan to support characters outside Plane 0
> in fonts is as mandatory ligatures of the high and low surrogates. Our
> software is also aware of surrogates and that they shouldn't be broken
> apart when editing. Ligatures for Mac OS are specified through the 'mort'
> (morph) table. This should work for any font stored in 'sfnt' format, not
> just TrueType.
>
> Apple's font tables are specified at:
>
> http://fonts.apple.com/
>
> Hope this helps.
>
>
> David Goldsmith
> International and Text Department Architect
> Apple Computer, Inc.
> goldsmith@apple.com
>


