Re: Joined "ti" coded as "Ɵ" in PDF
lang.support at gmail.com
Sat Mar 19 18:06:29 CDT 2016
Latin is fine if you keep to simple well made fonts and avoid using more
sophisticated typographic features available in some fonts.
Dumb it down typographically and it works fine. PDF, despite all the
current rhetoric coming from PDF software developers, is a preprint format.
Not an archival format.
The PDF format is less than ideal. But it is widely used, often in a way
the format was never really created for. There are alternatives that
preserve the text. But they have never really taken off (compared to
PDF) for various reasons.
On Sunday, 20 March 2016, Don Osborn <dzo at bisharat.net> wrote:
> Thanks Andrew, Looking at the issue of ToUnicode mapping you mention, why
in the 1-many mapping of ligatures (for fonts that have them) do the "many"
not simply consist of the characters ligated? Maybe that's too simple (my
understanding of the process is clearly inadequate).
> The "string of random ASCII characters" (per Leonardo) used in the
Identity H system for hanzi raises other questions: (1) How are the ASCII
characters interpreted as a 1-many sequence representing a hanzi rather
than just a series of 1-1 mappings of themselves? (2) Why not just use the
Unicode code point?
> The details may or may not be relevant to the list topic, but as a user
of documents in PDF format, I fail to see the benefit of such obscure
mappings. And as a creator of PDFs ("save as") looking at others' PDFs I've
just encountered with these mappings, I'm wondering how the font & mapping
results turned out as they did. It is certain
that the creators of the documents didn't intend results that would not be
searchable by normal text, but it seems possible that a particular font
choice with these ligatures unwittingly produced these results. If the
latter, the software at the very least should show a caveat about such
mappings when generating PDFs.
> Maybe it's unrealistic to expect a simple implementation of Unicode in PDFs
(a topic we've discussed before but which I admit not fully grasping).
I recall that I once had some wild results copy/pasting from an N'Ko PDF, and
ended up having to obtain the .docx original to obtain text for insertion
in a blog posting. But while it's not surprising to encounter issues with
complex non-Latin scripts from PDFs, I'd gotten to expect predictability
when dealing with most Latin text.
> On 3/17/2016 7:34 PM, Andrew Cunningham wrote:
> There are a few things going on.
> In the first instance, it may be the font itself that is the source of
> My understanding is that PDF files contain a sequence of glyphs. A PDF
file will contain a ToUnicode mapping between glyphs and codepoints. This
is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides
support for ligatures and variation sequences.
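
A minimal sketch of the 1-1 versus 1-many mapping described above, in Python. The glyph IDs and the dict standing in for a ToUnicode CMap are hypothetical; a real PDF stores this mapping in a CMap stream, and the actual values depend on the embedded font subset.

```python
# Hypothetical glyph IDs for a subset font; a plain dict stands in for
# the ToUnicode CMap (this is not how a PDF stores it on disk).
GOOD_TO_UNICODE = {
    1: "n",
    2: "a",
    3: "o",
    4: "ti",  # 1-many: one ligature glyph maps to two characters
}

# Same glyphs, but the ligature is mapped to a private-use code point,
# as reported for the document discussed in this thread.
BAD_TO_UNICODE = dict(GOOD_TO_UNICODE)
BAD_TO_UNICODE[4] = "\U0010019F"


def extract_text(glyph_ids, to_unicode):
    """Turn a sequence of glyph IDs into text using the given mapping."""
    # Glyphs without a mapping become U+FFFD so the gap stays visible.
    return "".join(to_unicode.get(gid, "\uFFFD") for gid in glyph_ids)


glyph_run = [1, 2, 4, 3, 1]  # displays as "nation"
good_text = extract_text(glyph_run, GOOD_TO_UNICODE)
bad_text = extract_text(glyph_run, BAD_TO_UNICODE)

print(good_text)             # ordinary, searchable text
print("nation" in bad_text)  # the private-use character breaks the match
```

With the first mapping the ligature glyph extracts as the two letters it displays; with the second, the same page looks identical but a search for the plain word finds nothing.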
> I assume it uses the data in the font's cmap table. If the ligature
isn't mapped then you will have problems. I guess the problem could be
either the font or the font subsetting and embedding performed when the PDF
> Although it is worth noting that in OpenType fonts not all glyphs will
have mappings in the cmap table.
> The remedy is to extensively tag the PDF and add ActualText attributes
to the tags.
> But the PDF specs leave it up to the developer to decide what happens if
there is both a visible text layer and ActualText. So even in an ideal PDF,
results will vary from software to software when copying text or searching
> At least that's my current understanding.
> On 18 Mar 2016 7:47 am, "Don Osborn" <dzo at bisharat.net> wrote:
>> Thanks all for the feedback.
>> Doug, It may well be my clipboard (running Windows 7 on this particular
laptop). Get same results pasting into Word and EmEditor.
>> So, when I did a web search on "internaƟonal," as previously mentioned,
and came up with a lot of results (mostly PDFs), were those also a
consequence of many not fully Unicode compliant conversions by others?
>> A web search on what you came up with - "Internaonal" - yielded many
more (82k+) results, again mostly PDFs, with terms like "interna onal"
(such as what Steve noted) and "interna<onal" and perhaps others (given the
nature of, or how Google interprets, the private use character?).
>> Searching within the PDF document already mentioned, "international"
comes up with nothing (which is a major fail as far as usability).
Searching the PDF in a Firefox browser window, only "internaƟonal" finds
the occurrences of what displays as "international." However after
downloading the document and searching it in Acrobat, only a search for
"internaonal" will find what displays as "international."
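
One possible workaround for the search failures described above, sketched in Python: repair the extracted text before searching it. The repair table is an assumption built only from the two forms reported in this thread (the private-use code point U+10019F and its 16-bit truncation U+019F); `repair_extracted_text` is a hypothetical helper, not part of any PDF tool.

```python
# Assumed repair table, based only on the "ti" case reported in this thread.
# Other fonts may use different private-use code points for their ligatures.
LIGATURE_REPAIRS = {
    "\U0010019F": "ti",  # Plane 16 private-use code point used for the ligature
    "\u019F": "ti",      # the same code point truncated to 16 bits
}


def repair_extracted_text(text, repairs=LIGATURE_REPAIRS):
    """Replace known stand-in characters with the letters they display as."""
    # Caution: U+019F is also a real letter, so this replacement is only
    # safe for text already known to be affected by the problem.
    for bad, good in repairs.items():
        text = text.replace(bad, good)
    return text


pasted_truncated = "interna\u019Fonal"
pasted_private_use = "interna\U0010019Fonal"

print(repair_extracted_text(pasted_truncated))
print("international" in repair_extracted_text(pasted_private_use))
```

This only papers over the problem on the reader's side; the text stored in the PDF itself is unchanged.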
>> A separate web search on "Eīects" came up with 300+ results, including
some GoogleBooks which in the texts display "effects" (as far as I
checked). So this is not limited to Adobe?
>> Jörg, With regard to "Identity H," a quick search gives the impression
that this encoding has had a fairly wide and not so happy impact, even if
on the surface level it may have facilitated display in a particular style
of font in ways that no one complains about.
>> Altogether a mess, from my limited encounter with it. There must have
been a good reason for or saving grace of this solution?
>> On 3/17/2016 2:17 PM, Steve Swales wrote:
>>> Yes, it seems like your mileage varies with the PDF
viewer/interpreter/converter. Text copied from Preview on the Mac replaces
the ti ligature with a space. Certainly not a Unicode problem, per se, but
an interesting problem nevertheless.
>>>> On Mar 17, 2016, at 11:11 AM, Doug Ewell <doug at ewellic.org> wrote:
>>>> Don Osborn wrote:
>>>>> Odd result when copy/pasting text from a PDF: For some reason "ti" in
>>>>> the (English) text of the document at
>>>>> is coded as "Ɵ". Looking more closely at the original text, it does
>>>>> appear that the glyph is a "ti" ligature (which afaik is not coded as
>>>>> such in Unicode).
>>>> When I copy and paste the PDF text in question into BabelPad, I get:
>>>>> Internaonal Order and the Distribuon of Identy in 1950 (By
>>>>> invitaon only)
>>>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use
>>>> Truncating this character to 16 bits, which is a Bad Thing™, yields
>>>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like
>>>> Don's clipboard or the editor he pasted it into is not fully
>>>> Don's point about using alternative characters to implement ligatures,
>>>> thereby messing up web searches, remains valid.
>>>> Doug Ewell | http://ewellic.org | Thornton, CO
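
The truncation Doug Ewell describes can be checked with a short Python sketch. It uses only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

# The code point reported for the "ti" ligature in the PDF.
private_use = 0x10019F

# Keeping only the low 16 bits discards the plane number.
truncated = private_use & 0xFFFF

# General category "Co" marks a private-use character.
category = unicodedata.category(chr(private_use))

# Official Unicode name of the character left after truncation.
truncated_name = unicodedata.name(chr(truncated))

print(f"U+{private_use:06X} has category {category}")
print(f"U+{truncated:04X} is {truncated_name}")
```

The second line printed names the character that showed up in the pasted text, which is consistent with a 16-bit truncation somewhere between the PDF viewer and the editor.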
lang.support at gmail.com