From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Oct 04 2006 - 12:00:42 CST
On Wed, 4 Oct 2006, Paul Johnston wrote:
> I am using Unicode throughout my system (a web-based database for tracking
> work). I am forced to use a tool (htmldoc - for html to PDF conversion) that
> does not support unicode in any manner.
Are you sure? I don't mean the limitations of the tool but the necessity
of using that particular tool. I have successfully converted an HTML
document with over 1,000 different Unicode characters into PDF, using
free software available for a normal PC. But maybe you have some policy
restrictions. (I used PDFCreator. I've heard positive comments about
CutePDF Writer, and it appears to be cleaner and faster.)
> This should not be a significant
> problem in practice, as all the data is in English. However, I am having
> problems with a few characters, primarily an apostrophe-like character (don't
> know the code offhand; it's not in Latin-1).
It might be _the_ apostrophe used in correctly spelled English, the
curly apostrophe, called RIGHT SINGLE QUOTATION MARK in Unicode (and
distinct from the Ascii apostrophe, called APOSTROPHE in Unicode). That
character belongs to Windows-1252, also known as Windows Latin 1, but not
to ISO Latin 1.
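
The difference between the two encodings is easy to check. The following is only a small Python sketch, assuming the character in question really is U+2019; the helper name can_encode is made up for illustration.

```python
def can_encode(ch: str, encoding: str) -> bool:
    """Return True if the character can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

curly = "\u2019"   # assumed to be the problem character
ascii_apos = "'"   # U+0027 APOSTROPHE

print(can_encode(curly, "cp1252"))       # Windows Latin 1
print(can_encode(curly, "iso-8859-1"))   # ISO Latin 1
print(curly.encode("cp1252"))            # the single byte used in Windows-1252
```

In Windows-1252 the character occupies byte 0x92, a position that ISO Latin 1 reserves for a control character.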
> If I encode the output as Windows-1251, the character causes an error.
I'm not sure I understand the situation at all. I don't think you can mean
Windows-1251, which is Windows Cyrillic, with Cyrillic (Russian) letters
in the "upper half". I guess this was an "off by one" case and you meant
Windows-1252. Then the question is what is going on, if your tool can
produce Windows-1252 output, as one might expect. Is there some problem
with the _source_? In HTML, you can represent a curly apostrophe in
several ways; maybe the tool cannot handle all of them.
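
To illustrate the "several ways": the sketch below lists some common HTML spellings of the curly apostrophe and resolves them with Python's standard html.unescape. Which of these forms htmldoc accepts is not known here; the list is an assumption about what the source might contain.

```python
import html

# Common ways the curly apostrophe may appear in HTML source.
# "&#146;" is formally invalid but widely written; HTML5 parsing rules
# interpret it as the Windows-1252 byte 0x92.
representations = ["&rsquo;", "&#8217;", "&#x2019;", "&#146;", "\u2019"]

decoded = [html.unescape(r) for r in representations]

for source, result in zip(representations, decoded):
    print(f"{source!r:12} -> U+{ord(result):04X}")
```

Normalizing all of these to one form before handing the document to the converter would at least rule out the source as the cause.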
> If I used utf-8 it causes visual garbage in the output.
I'm afraid I cannot visualize the problem. How can you use utf-8 if the
tool does not support Unicode at all? We might need a more detailed
description of the process.
> What would be ideal is to
> perform a "visually approximate" conversion to Windows-1251, which would
> replace this with a regular apostrophe.
If this is really about curly apostrophe and about a system that cannot
deal with it, then the usual way is to replace it by the Ascii apostrophe.
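
As a sketch of that replacement in Python, assuming only the single curly quotation marks need handling (the table TO_ASCII and the function name are hypothetical, not part of any tool mentioned here):

```python
# Hypothetical mapping for illustration; extend it as the data requires.
TO_ASCII = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK (the curly apostrophe)
})

def flatten_apostrophes(text: str) -> str:
    """Replace curly single quotes by the ASCII apostrophe."""
    return text.translate(TO_ASCII)

cleaned = flatten_apostrophes("the client\u2019s work")
print(cleaned.encode("iso-8859-1"))  # now encodable in ISO Latin 1
```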
> I know Windows can do this, as retrieving values from controls using a
> non-Unicode interface does exactly this conversion.
I don't see what you mean by that, but I have seen Windows software map
some Windows Latin 1 characters to ISO Latin 1 characters. That's what
e.g. Outlook Express (silently!) does if the default encoding for outgoing
messages has been set to iso-8859-1 but the data contains e.g. a curly
apostrophe.
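
A similar silent mapping can be imitated with a custom encoding error handler. This is only a sketch: the table below is an assumption made for illustration, not the mapping that Outlook Express or Windows actually uses, and the handler name "approximate" is invented.

```python
import codecs

# Assumed approximations for some Windows Latin 1 characters missing from ISO Latin 1.
APPROXIMATIONS = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
    "\u2026": "...",                 # horizontal ellipsis
}

def approximate(error):
    """Error handler: substitute a look-alike, or '?' if none is known."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join(APPROXIMATIONS.get(ch, "?") for ch in bad)
    return replacement, error.end

codecs.register_error("approximate", approximate)

def encode_latin1_approx(text: str) -> bytes:
    """Encode to ISO Latin 1, approximating characters it lacks."""
    return text.encode("iso-8859-1", errors="approximate")
```

Unlike a silent conversion, the "?" fallback makes unmapped characters visible in the output, so they can be found and added to the table.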
My analysis (well, guess) might, as usual, be all wrong. Perhaps the
apostrophe-like character is really e.g. MODIFIER LETTER RIGHT HALF RING.
It's not used in normal English, but it is used in scientific
transliteration of Arabic words and might therefore conceivably appear
within English text.
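
Rather than guessing, one can let the data identify the character. A minimal Python sketch, assuming the text is available as a Unicode string (the function name report_non_latin1 is made up):

```python
import unicodedata

def report_non_latin1(text: str):
    """List each distinct character outside ISO Latin 1 with its code point and name."""
    found = {}
    for ch in text:
        if ord(ch) > 0xFF and ch not in found:
            found[ch] = unicodedata.name(ch, "<unnamed>")
    return [(f"U+{ord(ch):04X}", name) for ch, name in found.items()]

for code_point, name in report_non_latin1("don\u2019t, \u02beanf"):
    print(code_point, name)
```

Running this over the actual database content would show which characters need special treatment.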
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/