From: Eric Muller (emuller@adobe.com)
Date: Sat Feb 09 2008 - 19:46:17 CST
James Kass wrote:
>
>
> PDF has long been touted as *the* way to safely send text with the
> assurance that the recipients will be able to display that text exactly
> as the author intended.
Actually, it is "final form documents", not text.
>
> Without any real knowledge of the PDF format and what happens when
> converting a file to PDF, it appears to me that it is not text which is
> being embedded. Rather, the process is embedding glyphs.
Glyphs are the primary construct needed for "final form
documents". Glyphs are mandatory in PDFs.
When you see something like "(the car) Tj" in a PDF content stream, the
"the car" piece only accidentally looks like text (an intended
accident, of course, but an accident nevertheless).
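
To make the "accident" concrete, here is a minimal sketch in Python. It is a toy model, not PDF syntax or any library's API: the two encodings and the show_string helper are hypothetical, made up for illustration. It treats the operand of Tj as a sequence of codes that the font's encoding turns into glyph names, so a readable operand and an unreadable one can select exactly the same glyphs.

```python
def show_string(codes: bytes, encoding: dict) -> list:
    """Return the glyph names that a Tj operand selects under `encoding`.

    `encoding` maps a one-byte code to a glyph name; codes without an
    entry fall back to ".notdef", the conventional missing-glyph name.
    """
    return [encoding.get(code, ".notdef") for code in codes]


# Hypothetical encoding whose codes happen to coincide with ASCII,
# so the operand "looks like" text.
ASCII_LIKE = {ord(ch): ch for ch in "acehrt"}
ASCII_LIKE[ord(" ")] = "space"

# Hypothetical subset font: the generator numbered the glyphs in order
# of first use, so the operand no longer resembles text at all.
SUBSET = {1: "t", 2: "h", 3: "e", 4: "space", 5: "c", 6: "a", 7: "r"}

readable = show_string(b"the car", ASCII_LIKE)
opaque = show_string(b"\x01\x02\x03\x04\x05\x06\x07", SUBSET)

# Both operands draw the same sequence of glyphs.
print(readable == opaque)
```

In both cases the page shows the same thing; only the first operand is legible to someone reading the content stream, and only because of the encoding that font happens to use.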
> If a glyph
> is mapped to a Unicode value, at least some applications can return that
> value. But, if the glyph is not mapped to a Unicode value (which is
> normally the case with presentation forms used in complex scripts),
> there does not seem to be any effort made to preserve the Unicode
> string which generated the presentation form. And that's really a
> shame.
Actually, there are ways to include characters in addition to the
glyphs, even when the character/glyph correspondence is not one-for-one
(look for /ActualText in the PDF reference; /ToUnicode maps are
conceptually optimizations of that), but whether those ways are
exploited depends on the PDF generator. Some generators use nothing,
others will generate only /ToUnicode (what you describe), which can
account for only 1-to-1 character/glyph mappings, and others will use the
full apparatus.
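
The difference between the three kinds of generator can be sketched with a toy extractor in Python. This is a simplified model under stated assumptions, not the PDF syntax and not how any particular viewer works: the glyph codes, the TO_UNICODE table and the extract_text function are all hypothetical. A span is a run of glyph codes plus an optional replacement string standing in for /ActualText; the per-glyph table stands in for /ToUnicode.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, used here for glyphs with no mapping


def extract_text(spans, to_unicode):
    """Recover text from a list of (glyph_codes, actual_text) spans.

    If a span carries replacement text (the /ActualText role), use it
    for the whole span. Otherwise map each glyph code on its own through
    `to_unicode` (the /ToUnicode role), in glyph order.
    """
    out = []
    for glyph_codes, actual_text in spans:
        if actual_text is not None:
            out.append(actual_text)
        else:
            out.extend(to_unicode.get(code, REPLACEMENT) for code in glyph_codes)
    return "".join(out)


# Hypothetical glyph codes for a Devanagari font:
# 10 = glyph for KA (U+0915), 11 = glyph for VOWEL SIGN I (U+093F),
# 12 = a presentation form that the generator left unmapped.
TO_UNICODE = {10: "\u0915", 11: "\u093f"}

# The syllable U+0915 U+093F is drawn with the vowel sign first,
# so the glyph order is [11, 10].
glyph_run = [11, 10]

per_glyph_only = extract_text([(glyph_run, None)], TO_UNICODE)
with_actual_text = extract_text([(glyph_run, "\u0915\u093f")], TO_UNICODE)
unmapped = extract_text([([12], None)], TO_UNICODE)

print(per_glyph_only == "\u093f\u0915")    # glyph order, not character order
print(with_actual_text == "\u0915\u093f")  # the original character sequence
print(unmapped == REPLACEMENT)             # nothing to recover
```

The per-glyph table gets every glyph right and still returns the characters in display order; only the span-level replacement text gives back the string the author typed.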
For example, if you take the PDFs generated for the UDHR in Unicode
project (e.g.
http://www.unicode.org/udhr/assemblies/first_article_subset.pdf for a
small comprehensive example), then except for the space problem
mentioned earlier, I think that you can copy from Acrobat and paste into
Notepad and get back all the text.
Eric.