Re: "Visually approximate" conversion from unicode to Windows-1251 (or similar code page)

From: Addison Phillips (addison@yahoo-inc.com)
Date: Wed Oct 04 2006 - 12:24:31 CST

Next message: Kenneth Whistler: "Re: Unicode and RFC 4690"

Previous message: Jukka K. Korpela: "Re: "Visually approximate" conversion from unicode to Windows-1251 (or similar code page)"
In reply to: Paul Johnston: ""Visually approximate" conversion from unicode to Windows-1251 (or similar code page)"
Next in thread: Paul Hastings: "Re: "Visually approximate" conversion from unicode to Windows-1251 (or similar code page)"

I think you mean "windows-1252", the Western European code page. Code
page 1251 is the Cyrillic code page.

Windows-1252, like many Microsoft code pages, differs from the related
"standard" encoding. In this case, it is a superset of ISO 8859-1 (often
referred to as Latin-1). The difference is that Microsoft added 27
characters in the C1 control range (0x80-0x9F), including the Euro
symbol and a variety of "typesetter's quotes". These often cause
problems for software expecting pure ISO 8859-1.
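
As a rough illustration of that difference (a sketch in Python, which is not part of the original discussion; the helper name is made up for the example), the same byte can be decoded both ways. Python's "cp1252" codec is its name for windows-1252 and "latin-1" is its name for ISO 8859-1:

```python
# Compare how windows-1252 and ISO 8859-1 interpret bytes 0x80-0x9F.

def decode_cp1252_byte(byte_value):
    """Return the windows-1252 character for one byte, or None if unassigned."""
    try:
        return bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        # A few positions in this range are left undefined by the code page.
        return None

# Collect every byte in the range that windows-1252 assigns to a character.
assigned = {}
for b in range(0x80, 0xA0):
    ch = decode_cp1252_byte(b)
    if ch is not None:
        assigned[b] = ch

print(len(assigned))                      # how many positions are assigned
print(f"U+{ord(assigned[0x80]):04X}")     # what 0x80 means in windows-1252
print(f"U+{ord(assigned[0x92]):04X}")     # what 0x92 means in windows-1252

# ISO 8859-1 maps the same byte straight to a C1 control code point.
print(f"U+{ord(bytes([0x92]).decode('latin-1')):04X}")
```

Byte 0x92 is the one that usually bites: it is the right single quote in windows-1252 but an invisible control character when the data is read as ISO 8859-1.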

HTMLDOC has both command-line and GUI options that allow you to select
the appropriate Windows encoding (sorry, not UTF-8) to use when reading
the source files. You should also include a correct <meta> tag declaring
the encoding to be "windows-1252" and *not* "iso-8859-1" in your HTML
documents.
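
A minimal sketch of producing such a file, assuming the page is generated from Unicode strings in Python (the build_page helper and the sample text are hypothetical, and the exact HTMLDOC option for choosing the encoding should be taken from its documentation rather than from this example):

```python
# Hypothetical helper: wrap already-escaped body text in a page that declares
# windows-1252, then encode the bytes to match that declaration.

META = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'

def build_page(body_html):
    """Return the page as windows-1252 bytes.

    Assumes body_html is already HTML-safe. Raises UnicodeEncodeError if the
    text contains a character that windows-1252 cannot represent.
    """
    html = (
        "<html><head>\n"
        f"{META}\n"
        "</head><body>\n"
        f"<p>{body_html}</p>\n"
        "</body></html>\n"
    )
    return html.encode("cp1252")

# U+2019 (right single quote) and U+20AC (Euro) both exist in windows-1252.
page = build_page("It\u2019s priced at \u20ac5")

with open("report.html", "wb") as f:
    f.write(page)
```

The point of the sketch is only that the declared encoding and the actual bytes agree; a strict encode also surfaces any character outside the code page early instead of letting it reach the converter.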

If that doesn't work, you can also use HTML entities in your pages to
replace the characters. For example, &rsquo; (or the numeric form
&#8217;) is a right single quote. Or you can use a transliterating
converter (such as the //TRANSLIT option on libiconv) to approximate
the right results. (Caution: you may experience data degradation with
this last "solution".)
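
Both fallbacks can be sketched in Python under some assumptions. The first function uses the standard "xmlcharrefreplace" error handler to emit numeric entities. The second is a hand-rolled stand-in for transliteration, not libiconv's //TRANSLIT itself: the APPROXIMATIONS table and both function names are invented for this example, and the table covers only a handful of characters.

```python
# Illustrative look-alike table only; extend it for the characters in your data.
APPROXIMATIONS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}

def to_entities(text, encoding="iso-8859-1"):
    """Encode text, writing unencodable characters as numeric entities."""
    return text.encode(encoding, errors="xmlcharrefreplace")

def approximate(text, encoding="iso-8859-1"):
    """Swap in look-alikes, then turn anything still unencodable into '?'."""
    substituted = text.translate(str.maketrans(APPROXIMATIONS))
    return substituted.encode(encoding, errors="replace")

sample = "It\u2019s done"
print(to_entities(sample))
print(approximate(sample))
```

The entity route is lossless as long as the consumer understands entities; the look-alike route throws information away, which is the data degradation the caution refers to.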

Hope that helps.

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Internationalization is an architecture.
It is not a feature.

Paul Johnston wrote:
> Hi,
> 
> I am using Unicode throughout my system (a web-based database for 
> tracking work). I am forced to use a tool (htmldoc - for html to PDF 
> conversion) that does not support unicode in any manner. This should not 
> be a significant problem in practice, as all the data is in English. 
> However, I am having problems with a few characters, primarily an 
> apostrophe-like character (don't know the code offhand; it's not in 
> Latin-1).
> 
> If I encode the output as Windows-1251, the character causes an error. 
> If I used utf-8 it causes visual garbage in the output. What would be 
> ideal is to perform a "visually approximate" conversion to Windows-1251, 
> which would replace this with a regular apostrophe. I am happy to accept 
> the risks that such an approximation carries.
> 
> I know Windows can do this, as retrieving values from controls using a 
> non-Unicode interface does exactly this conversion. However, I have not 
> been able to find out how I can perform the conversion at will. I 
> apologise if this is not the most appropriate forum for this question, 
> but I have been looking long and hard for this without success.
> 
> Many thanks for any help you can offer,
> 
> Paul
> 
> P.S. If someone can suggest a unicode compatible replacement for 
> htmldoc, that would satisfy me too!

This archive was generated by hypermail 2.1.5 : Wed Oct 04 2006 - 12:26:23 CST