From: Olaf Drümmer (o.druemmer@callassoftware.com)
Date: Thu Oct 26 2006 - 02:28:21 CST
Hi,
jefsey@jefsey.com wrote Thu, 26 Oct 2006 03:11:35 +0200
>I have a PDF document including some non-roman characters for which I
>would like to obtain the code element values. Is there a tool able to do that?
>Thank you for the tip.
>jfc
Programmatically, this is not an easy task in itself, and as far as I
know there is no good freeware/open-source implementation or code readily
available.
PDFlib does offer a text extraction toolkit (TET) that I think would do
the trick - see www.pdflib.com for more info.
Acrobat's API also offers access to text on a page (the SDK is publicly
available).
If you only need to do this occasionally, as a user, you could simply use
Acrobat (recent versions work better than older ones): select the text,
copy it to the pasteboard, and then paste it into some other app that can
show you the Unicode values.
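For example, a tiny script along these lines could serve as that "other
app" (a minimal sketch in Python; the file name codepoints.py and reading
the pasted text from standard input are just assumptions for illustration):

    # codepoints.py - paste the text copied from Acrobat into stdin;
    # prints each character with its Unicode code point and name.
    import sys
    import unicodedata

    for line in sys.stdin:
        for ch in line.rstrip("\n"):
            # unicodedata.name() returns the default for unnamed code points
            name = unicodedata.name(ch, "<unnamed>")
            print(f"U+{ord(ch):04X}  {ch}  {name}")

Run it with "python codepoints.py", paste the copied text, and end the
input with Ctrl-D (Ctrl-Z on Windows).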
Also: Adobe will probably release Acrobat 8 Professional next month. It
contains a component called "Preflight", which has two features that may
be of interest here:
- a browser for the internal structure of embedded fonts (it will give you
Unicode values for the glyphs in the fonts, provided they are defined or
can be established)
- an inventory feature that (among other things) creates tables of
glyphs for the fonts used in the PDF, together with character IDs and
Unicode code points (and Unicode glyph names).
For all of these suggestions, please keep in mind that in some cases
- it may not be possible to establish the Unicode value at all
- the Unicode value may be incorrect because the information in the
PDF/font is incorrect
Olaf Druemmer
callas software (which happens to be the company that developed
Preflight for Acrobat ;-> )