Greenwood, Timothy wrote:
>This question is pertinent to one asked me the other day for which I did not have an answer. Is the code set of an original document relevant for PDF - say EUC, SJIS, PDF - will the output perform text searches correctly for differing code set inputs?
>
PDF documents logically contain two streams: one of characters, and one 
of glyphs.
The glyph stream is always present physically, and is used for 
rendering. Depending on the fonts involved, the PDF generator, and all 
sorts of other factors, the meaning of the numbers in that glyph stream 
and the machinery to locate the actual outlines will vary quite a bit.
The character stream can be represented explicitly, in which case I am 
pretty sure it is always a Unicode stream. Alternatively, it can be 
computed from the glyph stream using various mechanisms; I believe that 
all the computations described in the PDF spec generate a Unicode stream.
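
To make the implicit case concrete, here is a minimal sketch of computing 
a character stream from a glyph stream with a per-font lookup table. The 
table contents and the names GLYPH_TO_UNICODE and glyphs_to_text are made 
up for illustration; a real PDF consumer would read the mapping from the 
font data in the file rather than hard-code it.

```python
# Hypothetical glyph-code -> Unicode table; the codes are invented for this sketch.
GLYPH_TO_UNICODE = {
    0x01: "P",
    0x02: "D",
    0x03: "F",
    0x10: "fi",  # one ligature glyph can stand for two characters
}


def glyphs_to_text(glyph_codes, table, missing="\ufffd"):
    """Map each glyph code to its Unicode text; unknown codes become U+FFFD."""
    return "".join(table.get(code, missing) for code in glyph_codes)


# The glyph stream [1, 2, 3] yields the searchable text "PDF".
text = glyphs_to_text([0x01, 0x02, 0x03], GLYPH_TO_UNICODE)
```

A glyph code with no entry in the table is where text search would start 
to fail, which is why the sketch marks it with U+FFFD instead of dropping it.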
The choice of explicit vs. implicit character representation is up to the 
PDF producer. In all cases, I believe that the producer has the 
responsibility of converting from whatever character standard is used in 
the original document to Unicode. When the producer is Distiller, it may 
not have access to the original character content and may be forced to 
create an approximation.
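
As a rough illustration of that conversion step, the sketch below decodes 
the same word from Shift_JIS and from EUC-JP into Unicode using Python's 
standard codecs. The helper name to_unicode is an assumption for this 
example, not anything a PDF producer actually exposes. If the producer 
does the equivalent, the original code set should not affect searching.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode bytes in the original document's code set into a Unicode string."""
    return raw.decode(source_encoding)


word = "日本語"

# The same text has different byte sequences in the two code sets.
sjis_bytes = word.encode("shift_jis")
euc_bytes = word.encode("euc_jp")

# After conversion, both inputs give the same Unicode text to search against.
from_sjis = to_unicode(sjis_bytes, "shift_jis")
from_euc = to_unicode(euc_bytes, "euc_jp")
```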
Eric.
This archive was generated by hypermail 2.1.2 : Tue Jul 09 2002 - 13:32:00 EDT