Re: extracting code values from PDF?

From: Eric Muller (emuller@adobe.com)
Date: Fri Oct 27 2006 - 00:59:02 CST

Next message: Otto Stolz: "Re: extracting code values from PDF?"

Previous message: Balasankar: "keys in CLDR"
In reply to: Rick Cameron: "RE: extracting code values from PDF?"

For the page content, a PDF document primarily records glyphs and their
positions. It can also optionally record the corresponding characters,
using some combination of a mapping from glyphs to characters and local
overrides. You can look at
<http://www.udhrinunicode.org/assemblies/first_article_subset.pdf> to
see how that can be done for a variety of writing systems. (I am aware
that a copy-paste from that document using Acrobat results in additional
SPACE characters; this seems to be a problem with Acrobat.)
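
To make the "mapping plus local overrides" idea concrete, here is a toy
sketch in Python. It is not a PDF parser: the ShownGlyph type, the
glyph_to_chars table and extract_text are hypothetical names invented for
illustration. In real PDF files the two mechanisms correspond, as far as I
know, to a font's ToUnicode CMap and to ActualText on marked content.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ShownGlyph:
    """One painted glyph: which glyph, where, and an optional local override."""
    glyph_id: int
    x: float
    y: float
    # Hypothetical field: text that replaces the mapped characters for this
    # glyph only. An empty string means "this glyph contributes no text".
    override: Optional[str] = None


def extract_text(glyphs: List[ShownGlyph], glyph_to_chars: Dict[int, str]) -> str:
    """Recover characters, preferring a local override over the font-wide map."""
    out = []
    for g in glyphs:
        if g.override is not None:
            out.append(g.override)
        else:
            # U+FFFD marks glyphs for which no character data was recorded.
            out.append(glyph_to_chars.get(g.glyph_id, "\ufffd"))
    return "".join(out)


# Font-wide mapping: glyph 2 is an "ffi" ligature, so one glyph maps to
# three characters.
glyph_to_chars = {1: "o", 2: "ffi", 3: "c", 4: "e"}

run = [
    ShownGlyph(1, 0.0, 0.0),
    ShownGlyph(2, 6.0, 0.0),
    ShownGlyph(3, 14.0, 0.0),
    ShownGlyph(4, 19.0, 0.0),
    # A line-break hyphen that is painted but should not appear in the text.
    ShownGlyph(9, 24.0, 0.0, override=""),
]

text = extract_text(run, glyph_to_chars)
```

Here text comes out as "office": the ligature expands through the mapping,
and the trailing hyphen glyph is suppressed by its override.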

What a PDF consumer does with that data is another story. Acrobat uses the
character data when present, but also attempts, with more or less
success, to squeeze it from whatever is available in the PDF. This
currently works relatively well for Latin text, but relatively little
work has been done for other writing systems.
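
As an illustration of "squeezing" characters out of what is available, here
is one plausible fallback, sketched in Python: use the recorded character
data if there is any, otherwise try to read a code point out of the glyph
name using the "uniXXXX" / "uXXXX" naming convention. This is only an
assumption about what a consumer might do, not a description of Acrobat;
chars_for_glyph and its parameters are hypothetical names.

```python
import re
from typing import Optional

# "uni" followed by one or more groups of four uppercase hex digits.
_UNI = re.compile(r"^uni((?:[0-9A-F]{4})+)$")
# "u" followed by four to six uppercase hex digits (a single code point).
_U = re.compile(r"^u([0-9A-F]{4,6})$")


def chars_for_glyph(recorded: Optional[str], glyph_name: str) -> Optional[str]:
    """Return the characters for a glyph, or None if nothing can be inferred."""
    # Recorded character data always wins over guesswork.
    if recorded is not None:
        return recorded

    # Drop a suffix such as ".liga" or ".alt" before looking at the name.
    base = glyph_name.split(".", 1)[0]

    m = _UNI.match(base)
    if m:
        digits = m.group(1)
        return "".join(
            chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
        )

    m = _U.match(base)
    if m:
        return chr(int(m.group(1), 16))

    # Names like "g123" carry no character information.
    return None
```

With this sketch, a glyph named "uni0041" yields "A" and "uni00660069.liga"
yields "fi", while a glyph named "g123" with no recorded data yields
nothing at all, which is the kind of case where extraction fails.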

Eric.
