RE: minimizing size (was Re: allocation of Georgian letters)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Feb 09 2008 - 17:22:55 CST

Next message: Eric Muller: "Re: minimizing size (was Re: allocation of Georgian letters)"

Previous message: Andreas Stötzner: "Monetaria, Glareanus 1551"
In reply to: James Kass: "Re: minimizing size (was Re: allocation of Georgian letters)"
Next in thread: Eric Muller: "Re: minimizing size (was Re: allocation of Georgian letters)"

James Kass wrote:
> I'm coming from the old IBM-PC days when control "C" copied
> selected text into the buffer, and control "V" copied
> whatever is in the buffer to the active text area. (Still
> works, too! Except that the buffer now apparently accepts
> non-textual data.)

This is not a new feature of the clipboard; in fact the Windows clipboard
accepts several formats, and the "memory buffer" is not completely filled
until both applications agree on the format to use. The clipboard itself
also takes part: it supports some basic formats and keeps the copied data
internally, performing data conversion when the data actually gets copied
into it.

The clipboard has always held not a single piece of data but several
representations that need to be enumerated; the application accepting data
from the clipboard should enumerate the formats to see which one best fits
its needs. However, the clipboard itself does not verify what data is being
copied into it: if the source application says it is basic text, then the
clipboard keeps it as is, just converting it to Unicode internally if the
source application uses another encoding, or keeping it in the local
system's ANSI or OEM codepage.

The clipboard's internal formats should always be negotiated.
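
As a rough illustration of that negotiation, here is a toy model in Python.
It is not the Windows clipboard API: the class, the format names and the
automatic text conversion rule are all assumptions made up for the sketch.

```python
# Toy model only: the format names and the conversion rule are illustrative
# assumptions, not the real Windows clipboard API.

class ToyClipboard:
    def __init__(self):
        self._formats = {}  # format name -> data, in the order offered

    def offer(self, fmt, data):
        """Source application offers one representation of the copied data."""
        self._formats[fmt] = data
        # Assumed rule: legacy-encoded text is also made available as Unicode.
        if fmt == "ansi-text" and "unicode-text" not in self._formats:
            self._formats["unicode-text"] = data.decode("cp1252")

    def available_formats(self):
        """Formats the target application can enumerate."""
        return list(self._formats)

    def get(self, fmt):
        return self._formats[fmt]


def paste(clipboard, preferred):
    """Target application: take the first preferred format that is present."""
    available = clipboard.available_formats()
    for fmt in preferred:
        if fmt in available:
            return fmt, clipboard.get(fmt)
    return None, None


clip = ToyClipboard()
clip.offer("ansi-text", "caf\u00e9".encode("cp1252"))
clip.offer("bitmap", b"\x00\x01")

chosen, data = paste(clip, ["rich-text", "unicode-text", "ansi-text"])
print(chosen, data)
```

Note that the toy clipboard, like the real one, never checks that what was
offered as text really is text; it trusts the source application.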

But it's true that some applications put garbage data into the clipboard
when copying into it. One of them is Adobe Reader, but this most often comes
from the fact that the PDF documents were created with custom fonts that
don't follow a standard encoding, or where the encoding was "tweaked" to
reuse another "similar" encoding within these fonts, with a non-standard
mapping from text to glyphs.

This happens quite often with PDF creation tools that build custom fonts to
reduce the size of the PDF: they do not embed the original font definitions,
but assign sequential codes to each glyph as it appears in the source text,
in arbitrary order. How can Adobe Reader "guess" which character maps to the
glyph ids actually used in the PDF? That's a difficult task. Not all PDFs
are created to allow copy-pasting from them; they are just designed to be
viewed or printed the way they were laid out in the original document and
nowhere else.
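
To make the problem concrete, here is a small Python sketch of such a
subsetting scheme. It is only a model: the function names are hypothetical,
and the reverse table stands in for whatever code-to-character mapping a
well-behaved tool would store in the PDF (a ToUnicode CMap, for instance).

```python
def subset_encode(text):
    """Assign codes 1, 2, 3, ... to glyphs in order of first appearance."""
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
        codes.append(code_for[ch])
    # Reverse table: glyph code -> original character.
    to_unicode = {code: ch for ch, code in code_for.items()}
    return codes, to_unicode


def naive_extract(codes):
    """What a viewer gets if it treats glyph codes as character codes."""
    return "".join(chr(c) for c in codes)


def mapped_extract(codes, to_unicode):
    """What a viewer gets when the document carries the reverse table."""
    return "".join(to_unicode[c] for c in codes)


codes, table = subset_encode("allocation")
print(codes)                        # [1, 2, 2, 3, 4, 1, 5, 6, 3, 7]
print(repr(naive_extract(codes)))   # control characters, i.e. garbage
print(mapped_extract(codes, table)) # allocation
```

The page renders identically in both cases, since only the glyph codes are
needed for drawing; the difference only shows up when copying or searching.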

A PDF document is not a text document but a collection of drawing primitives
and collections of glyphs that are not necessarily indexed by some standard
character encoding, because the encoding actually used is local to the
document itself. However, not respecting some conventions will disable some
important features of PDF documents, such as the ability to reliably perform
full-text searches in them and to index large collections of documents.

Don't blame Adobe Reader too much; blame the PDF creation tools for not
respecting these conventions, and the authors of these tools for not
verifying that the tool will permit reuse of the document content by
legitimate document authors.

This archive was generated by hypermail 2.1.5 : Sat Feb 09 2008 - 18:58:11 CST