From: Eric Muller (emuller@adobe.com)
Date: Fri Oct 27 2006 - 00:59:02 CST
For the page content, a PDF document primarily records glyphs and their
positions. It can also optionally record the corresponding characters,
using some combination of a mapping from glyphs to characters and local
overrides. You can look at
<http://www.udhrinunicode.org/assemblies/first_article_subset.pdf> to
see how that can be done for a variety of writing systems. (I am aware
that a copy-paste from that document using Acrobat results in additional
SPACE characters; this seems to be a problem with Acrobat.)
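As a rough illustration of "a mapping from glyphs to characters and local overrides", here is a minimal sketch in Python. It does not parse a PDF. `GlyphRun`, `extract_text`, the glyph IDs and the mapping table are all hypothetical names and values made up for the example, and the replacement character for unmapped glyphs is my own choice.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GlyphRun:
    """A run of glyphs as drawn on the page (hypothetical structure)."""
    glyph_ids: List[int]
    # Optional local override: the characters this run stands for,
    # recorded directly instead of being derived glyph by glyph.
    override: Optional[str] = None


def extract_text(runs: List[GlyphRun], glyph_to_chars: Dict[int, str]) -> str:
    """Recover characters from glyph runs.

    A local override, when present, wins over the glyph-to-character
    mapping. Glyphs with no mapping entry become U+FFFD (an assumption
    of this sketch, not something the PDF format prescribes).
    """
    pieces = []
    for run in runs:
        if run.override is not None:
            pieces.append(run.override)
        else:
            pieces.append(
                "".join(glyph_to_chars.get(g, "\ufffd") for g in run.glyph_ids)
            )
    return "".join(pieces)


# Made-up mapping: glyph 12 is an "fi" ligature, glyph 20 is "t".
glyph_to_chars = {12: "fi", 20: "t"}

runs = [
    # One glyph can map to several characters (the ligature).
    GlyphRun([12, 20]),
    # Glyph order differs from character order (Devanagari "ki"),
    # so the characters are recorded as an override for the whole run.
    GlyphRun([31, 30], override="\u0915\u093f"),
]

text = extract_text(runs, glyph_to_chars)
```

The override case is the one that matters for writing systems where a per-glyph mapping cannot express the character sequence, for example when glyphs are reordered relative to the characters.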
What a PDF consumer does with that data is another story. Acrobat uses the
character data when present, but also attempts, with more or less
success, to infer it from whatever else is available in the PDF. This
currently works relatively well for Latin text, but relatively little
work has been done for other writing systems.
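To make the "use it when present, otherwise infer it" idea concrete, here is a small sketch of one possible fallback. This is not a description of what Acrobat actually does. It assumes the consumer has a glyph name available and that the name follows the "uniXXXX" naming convention (only the single four-digit form is handled here). `char_for_glyph` and its parameters are hypothetical.

```python
import re
from typing import Optional

# Glyph names of the form "uni" + four uppercase hex digits, e.g. "uni0042".
_UNI_NAME = re.compile(r"^uni([0-9A-F]{4})$")


def char_for_glyph(recorded: Optional[str],
                   glyph_name: Optional[str]) -> Optional[str]:
    """Return the character(s) for one glyph, or None if unknown.

    recorded   -- character data stored in the document, if any
    glyph_name -- the glyph's name in the font, if any (assumed available)
    """
    # Recorded character data always takes precedence.
    if recorded is not None:
        return recorded
    # Fallback heuristic: decode a "uniXXXX" glyph name.
    if glyph_name:
        match = _UNI_NAME.match(glyph_name)
        if match:
            return chr(int(match.group(1), 16))
    # Nothing usable: the caller has to cope with a missing character.
    return None
```

A heuristic like this depends on how the font names its glyphs, which may be part of why results vary between writing systems.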
Eric.