Re: Joined "ti" coded as "Ɵ" in PDF
dzo at bisharat.net
Thu Mar 17 15:44:19 CDT 2016
Thanks all for the feedback.
Doug, it may well be my clipboard (running Windows 7 on this particular
laptop). I get the same results pasting into Word and EmEditor.
So, when I did a web search on "internaƟonal," as previously mentioned,
and came up with a lot of results (mostly PDFs), were those also a
consequence of many not fully Unicode-compliant conversions by others?
A web search on what you came up with - "Internaonal" - yielded many
more (82k+) results, again mostly PDFs, with terms like "interna onal"
(such as what Steve noted) and "interna<onal" and perhaps others (given
the nature of, or how Google interprets, the private use character?).
Within the PDF document already mentioned, a search for "international"
comes up with nothing (which is a major fail as far as usability).
Searching the PDF in a Firefox browser window, only "internaƟonal" finds
the occurrences of what displays as "international." However, after
downloading the document and searching it in Acrobat, only a search for
"internaonal" will find what displays as "international."
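For anyone who has to search or index text extracted from such PDFs, here is a minimal Python sketch of mapping the substitutes back to "ti" before searching. The table and the function name restore_ligatures are hypothetical, and the mapping covers only the two forms seen in this thread (U+10019F and its truncated form U+019F). Ɵ is also a legitimate letter in its own right, so a blanket replacement like this is only reasonable for text known to come from an affected PDF.

```python
# Hypothetical repair table: only the two substitutes discussed in this thread.
LIGATURE_SUBSTITUTES = {
    "\U0010019F": "ti",  # Plane 16 private-use character used for the "ti" ligature
    "\u019F": "ti",      # the same code point after truncation to 16 bits
}

# Build the translation table once; keys are single characters.
_TABLE = str.maketrans(LIGATURE_SUBSTITUTES)


def restore_ligatures(text: str) -> str:
    """Replace the known ligature substitutes with plain letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(restore_ligatures("interna\u019Fonal"))
    print(restore_ligatures("Interna\U0010019Fonal Order"))
```

Text that has lost the character entirely, or had it replaced by a space, cannot be repaired this way, since nothing is left to map from.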
A separate web search on "Eīects" came up with 300+ results, including
some GoogleBooks which in the texts display "effects" (as far as I
checked). So this is not limited to Adobe?
Jörg, with regard to "Identity H," a quick search gives the impression
that this encoding has had a fairly wide and not so happy impact, even
if on the surface level it may have facilitated display in a particular
style of font in ways that no one complains about.
Altogether a mess, from my limited encounter with it. There must have
been a good reason for, or some saving grace of, this solution?
On 3/17/2016 2:17 PM, Steve Swales wrote:
> Yes, it seems like your mileage varies with the PDF viewer/interpreter/converter. Text copied from Preview on the Mac replaces the ti ligature with a space. Certainly not a Unicode problem, per se, but an interesting problem nevertheless.
>> On Mar 17, 2016, at 11:11 AM, Doug Ewell <doug at ewellic.org> wrote:
>> Don Osborn wrote:
>>> Odd result when copy/pasting text from a PDF: For some reason "ti" in
>>> the (English) text of the document at
>>> is coded as "Ɵ". Looking more closely at the original text, it does
>>> appear that the glyph is a "ti" ligature (which afaik is not coded as
>>> such in Unicode).
>> When I copy and paste the PDF text in question into BabelPad, I get:
>>> Internaonal Order and the Distribuon of Identy in 1950 (By
>>> invitaon only)
>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use
>> character.
>> Truncating this character to 16 bits, which is a Bad Thing™, yields
>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either
>> Don's clipboard or the editor he pasted it into is not fully
>> Unicode-compliant.
>> Don's point about using alternative characters to implement ligatures,
>> thereby messing up web searches, remains valid.
>> Doug Ewell | http://ewellic.org | Thornton, CO
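The truncation described in the quoted message can be checked with a short Python sketch. The function names here are made up for illustration, and the masking step is only an assumption about what a non-compliant clipboard or editor might be doing; the thread does not establish the actual mechanism. For comparison, the sketch also shows the surrogate pair that a correct UTF-16 encoding of the same character would use.

```python
import unicodedata

# Code point reported in the quoted message for the "ti" ligature.
PUA_TI = 0x10019F


def truncate_to_16_bits(code_point: int) -> int:
    """Keep only the low 16 bits, discarding the plane number."""
    return code_point & 0xFFFF


def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units a compliant encoder produces."""
    data = chr(code_point).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    truncated = truncate_to_16_bits(PUA_TI)
    print(f"U+{PUA_TI:06X} truncated: U+{truncated:04X} "
          f"{unicodedata.name(chr(truncated))}")
    units = " ".join(f"{u:04X}" for u in utf16_code_units(PUA_TI))
    print(f"U+{PUA_TI:06X} as UTF-16: {units}")
```

The truncated value is U+019F, which matches the "Ɵ" seen after pasting, while a compliant UTF-16 encoding keeps the character intact as a pair of surrogate code units.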