From: Eric Muller (emuller@adobe.com)
Date: Mon May 02 2005 - 13:56:48 CDT
Rick Cameron wrote:
>Does 'most of our applications' include Acrobat? The last time I looked
>at the PDF file format (which is a couple of years ago) it did not allow
>text to be represented as Unicode.
>
>
The thing to understand is that fundamentally, a PDF content stream (the
name of the part that describes the content of a page) describes which
glyph of which font is positioned where on a page. When you see in a PDF
document "(office) Tj", it really means "display the glyph with glyph id
0x6F of the current font at the current point, and advance the current
point by the width of that glyph; display the glyph with glyph id 0x66
at the current point, ..."
It so happens that in the most common cases, the glyph with glyph id
0x6F renders as "o", etc; it also happens that the PDF spec calls these
glyph ids "character codes"; it also happens that the PDF spec calls the
byte sequence of the glyph ids a "string". Hence, it is easy to be
misled and believe that "(office) Tj" means "render the (Unicode)
character string 'office' at the current point." But that is not what
PDF content streams are about. In particular, there is no opportunity
for a PDF renderer to use an "ffi" ligature.
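
To make that reading of "(office) Tj" concrete, here is a minimal Python
sketch of what a renderer does with the string for a simple (single-byte)
font. The width values and the name show_text are hypothetical, made up for
illustration. A real renderer also applies character spacing, word spacing,
horizontal scaling and the text matrix, all of which are ignored here.

```python
# Hypothetical glyph widths in thousandths of a text-space unit.
# In a real PDF these come from the font, not from a hand-written table.
WIDTHS = {0x6F: 500, 0x66: 333, 0x69: 278, 0x63: 444, 0x65: 444}


def show_text(codes, x, y, widths, font_size=12.0):
    """Place one glyph per byte and advance the current point after each.

    Returns the list of (glyph_code, x, y) placements and the final x.
    The bytes are never interpreted as characters, only as glyph selectors.
    """
    placements = []
    for code in codes:
        placements.append((code, x, y))
        # Advance by the width of the glyph that was just shown.
        x += widths[code] / 1000 * font_size
    return placements, x


# "(office)" in the content stream is just these six bytes.
placements, end_x = show_text(b"office", 0.0, 0.0, WIDTHS)
for code, gx, gy in placements:
    print(f"glyph 0x{code:02X} at ({gx:.3f}, {gy:.3f})")
```

Six bytes go in and six glyph placements come out, two of them for glyph
0x66. Nothing in the loop could decide to substitute a single "ffi" glyph,
because that decision belongs to layout, which ran before the PDF was
written.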
The choice of capturing the glyphs, i.e. the result of layout, rather
than the characters, i.e. the input to layout, is what makes PDF so good
at providing fidelity (and is arguably necessary to achieve that fidelity).
Besides the content stream, PDF also allows the input to layout to be
captured. This is what the /ToUnicode entry in PDF /Font objects, the
/ActualText entry on marked content and the whole "tagged PDF" stuff is
about. Furthermore, this input is correlated with the glyph references,
i.e. it is possible to record that a given occurrence of the glyph with
glyph id 0x6F of some font does render the (Unicode) character "U+006F".
Or even that a sequence of glyph occurrences does render a given
(Unicode) character string. In many common cases the representation of
that correlation is very efficient.
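
As a rough illustration of that correlation, the Python sketch below uses a
plain dictionary as a stand-in for a font's /ToUnicode information. The
table contents, the code 0x01 for an "ffi" ligature glyph, and the name
extract_text are all assumptions made for the example. A real /ToUnicode
entry is a CMap stream, not a dictionary like this.

```python
# Hypothetical per-font table standing in for a /ToUnicode CMap.
# Code 0x01 is assumed to select an "ffi" ligature glyph in this made-up font.
TO_UNICODE = {
    0x6F: "o",
    0x01: "ffi",  # one glyph, three Unicode characters
    0x63: "c",
    0x65: "e",
}


def extract_text(codes, to_unicode):
    """Recover the characters behind a sequence of glyph codes.

    Codes with no recorded mapping become U+FFFD, since the glyph codes
    alone say nothing about which characters were the input to layout.
    """
    return "".join(to_unicode.get(code, "\uFFFD") for code in codes)


# The content stream shows four glyphs: o, the ffi ligature, c, e.
shown = bytes([0x6F, 0x01, 0x63, 0x65])
text = extract_text(shown, TO_UNICODE)
print(text)
```

Here the page shows four glyphs while the recovered text has six characters,
which is the kind of glyph-to-character correlation described above. One
small table per font covers every occurrence of its glyphs, which is why the
common case is cheap to represent.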
So the statement about the PDF format is: Whenever *characters* are
represented in PDFs, they can (and sometimes have to) be represented
using Unicode.
Whether a specific PDF generator does properly record the input to
layout along with the content stream (and their correlation), whether
it is even in a position to do so, is a separate issue.
Eric.