From: Wunna Ko Ko (wunnakoko@gmail.com)
Date: Sun Feb 17 2008 - 05:08:59 CST
Dear Sir,
On Feb 10, 2008 4:43 PM, James Kass <thunder-bird@earthlink.net> wrote:
>
> Eric Muller wrote,
>
> >> PDF has long been touted as *the* way to safely send text with the
> >> assurance that the recipients will be able to display that text exactly
> >> as the author intended.
> >
> > Actually, it is "final form documents", not text.
>
> "Portable document format" implies more than merely a method of
> exchanging graphic information intended to be sent to a printer
> device by the end user. Indeed, PNG (Portable Network Graphics)
> can probably be printed by users almost as well. PDF does have obvious
> advantages over graphic file formats, though.
>
> >> Without any real knowledge of the PDF format and what happens when
> >> converting a file to PDF, it appears to me that it is not text which is
> >> being embedded. Rather, the process is embedding glyphs.
> >
> > Glyphs are the primary construct that is needed for "final form
> > documents". Glyphs are mandatory in PDFs.
>
> I like glyphs and actually consider them useful.
>
> > When you see something like "(the car) Tj" in a PDF content stream, the
> > "the car" piece is only accidentally looking like text (of course an
> > intended accident, but an accident nevertheless).
> >
> >> If a glyph
> >> is mapped to a Unicode value, at least some applications can return that
> >> value. But, if the glyph is not mapped to a Unicode value (which is
> >> normally the case with presentation forms used in complex scripts),
> >> there does not seem to be any effort made to preserve the Unicode
> >> string which generated the presentation form. And that's really a
> >> shame.
> >
> > Actually, there are ways to include characters in addition to the
> > glyphs, even when the character/glyph correspondence is not one-for-one
> > (look for /ActualText in the PDF reference; /ToUnicode maps are
> > conceptually optimizations of that), but whether those ways are
> > exploited depends on the PDF generator. Some generators use nothing,
> > others will generate only /ToUnicode (what you describe), which can
> > account for only 1-to-1 character/glyph mappings, and others will use
> > the full apparatus.
>
> We all look forward to developers implementing proper mechanisms
> to preserve the original textual data.
>
> > For example, if you take the PDFs generated for the UDHR in Unicode
> > project (e.g.
> > http://www.unicode.org/udhr/assemblies/first_article_subset.pdf for a
> > small comprehensive example), then except for the space problem
> > mentioned earlier, I think that you can copy from Acrobat and paste in
> > Notepad and get back all the text.
>
> I've found the UDHR in Unicode PDF files to be quite helpful. It's a
> worthwhile project, indeed.
>
> Not having Acrobat installed here, I tried to test this anyway.
>
> Vai looks fine:
>
> ꕉꕜꕮ ꔔꘋ ꖸ ꔰ ꗋꘋ ꕮꕨ ꔔꘋ ꖸ ꕎ ꕉꖸꕊ ꕴꖃ ꕃꔤꘂ ꗱ, ꕉꖷ ꗪꗡ ꔻꔤ ꗏꗒꗡ ꕎ ꗪ ꕉꖸꕊ ꖏꕎ. ꕉꕡ ꖏ
> ꗳꕮꕊ ꗏ ꕪ ꗓ ꕉꖷ ꕉꖸ ꕘꕞ ꗪ. ꖏꖷ ꕉꖸꔧ ꖏ ꖸ ꕚꕌꘂ ꗷꔤ ꕞ ꘃꖷ ꘉꔧ ꗠꖻ ꕞ ꖴꘋ ꔳꕩ ꕉꖸ ꗳ.
>
> Tamil does not:
>
> ம?த? ?ற???ன? சகல?? ?த??ரம?க?வ ?ற???றன? ; அவ?க? ம?????,
> உ??மக??? சமம?னவ?க?, அவ?க? ?ய?ய??த?? மன?ச????ய??
> இய?ப?ப?க? ?ப?றவ?க?. அவ?க? ஒ?வ?ட?ன??வ? ச?க?தர உண???
> ப???? நட???க??ள? ?வ???.
>
> Kannada looks worse than Tamil:
>
> ಎ??? ??ನವರ? ಸ?ತಂತ?????? ಜ??ದ????. ??ಗ? ಘನ?? ಮತ?? ಹಕ??ಗಳ??? ಸ??ನ???ದ????. ????ಕ ಮತ??
> ಅಂತಃಕರಣ ಗಳನ?? ಪ??ದವ??ದ? ?ಂದ ಅವರ? ಪರಸ?ರ ಸ????ದರ ??ವ?ಂದ ವ??ಸ??ಕ?.
>
> Hebrew has spaces added:
>
> כ ל ב נ י א ד ם נ ו ל ד ו ב נ י ח ו ר י ן ו ש ו ו י ם ב ע ר כ ם ו ב ז כ ו י ו ת י ה ם . כ ו ל ם ח ו נ נ ו ב ת
> ב ו נ ה ו ב מ צ פ ו ן ,
> ל פ י כ ך ח ו ב ה ע ל י ה ם ל נ ה ו ג א י ש ב ר ע ה ו ב ר ו ח ש ל א ח ו ה .
>
> Burmese has some question marks, maybe a font problem here:
> (Or maybe my system doesn't support Unicode 6.0 yet?)
>
> လူတုိင်းသည် တူညီလွတ်လပ်?သာ ဂုဏ်သိက?ာဖ ြ င့်လည်း?ကာင်း၊ တူညီလွတ်လပ်?သာ
> အခွင့်အ?ရးများဖ ြ င့်လည်း?ကာင်း၊ ?မွးဖွားလာသူများဖ ြ စ်သည်။ ထုိသူတုိ့၌ပုိင်းခ ြ ား?ဝဖန်တတ်?သာ ဉာဏ်နှင့်
> ကျင့်ဝတ်သိတတ်?သာ စိတ်တုိ့ရှိက ြ ၍ ထုိသူတုိ့သည် အချင်းချင်း ?မတ?ာထား၍ ဆက်ဆံကျင့်သုံးသင့်၏။
Your text is not encoded in Unicode 5.1 (beta). It has different code points.
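
One way to check which code points a pasted string actually contains is to
list them one by one. A minimal sketch in Python, assuming only the standard
unicodedata module; the function name code_points and the sample string are
illustrative and do not come from the messages above:

```python
import unicodedata


def code_points(text):
    """Return (U+XXXX, character name) pairs for every character in text."""
    return [
        # Characters unknown to this Python's Unicode database get a placeholder name.
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name in this Unicode database>"))
        for ch in text
    ]


if __name__ == "__main__":
    # Illustrative sample only: a Latin letter, a question mark, a Tamil letter.
    sample = "A?\u0BAE"
    for cp, name in code_points(sample):
        print(cp, name)
```

The names reported depend on the Unicode version bundled with the Python
build in use, so a character added in a later version of the standard may
show up with the placeholder name.
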
>
> As Eric points out, success may well depend upon the application used for
> PDF generation as well as the application displaying the PDF from which
> the text was copied into Notepad. I used Sumatra to display the PDF and
> the CutePDF generating application.
>
> Getting back to Sinnathurai Srivas' question about when publishing
> applications will support complex scripts like Tamil... Tamil publishers can
> successfully embed Tamil text into a PDF document, send it to a publishing
> house, the publishing house can successfully print on paper from the PDF,
> bind the printed paper into a book, put the book on the market, and
> hope the book sells well.
>
> So, I'd say the answer is "now", at least for some aspects of publishing
> and some publishing applications.
>
> As far as any other problems associated with PDFs and complex scripts,
> if we look ten years into the past, there were *no* applications
> whatsoever which supported Unicode Tamil. We've come a long way
> in a relatively short time. We still have some distance to travel, though.
>
> Best regards,
>
> James Kass
>
>
>
--
Wunna Ko Ko