From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Jun 28 2010 - 15:36:31 CDT
On 6/28/2010 11:38 AM, Mark Davis ☕ wrote:
>
>
> The problem with slavishly following the charset parameter is that it
> is often incorrect. However, the charset parameter is a signal into
> the character detection module, so if the charset is correctly supplied
> from the message, then the results of the detection will be weighted
> in that direction.
>
The weighting factor or mechanism is something you might look at for
possible improvement.
Doug raised an interesting argument, i.e. that some values of a charset
parameter might have a higher probability of being correct than other
values.
If something is tagged Latin-1 or Windows-1252, the chances are that
this is merely an unexamined default setting. Most of the other 8859
values should be much less likely to be such "blind" defaults.
I wonder whether the probability of successful charset assignment
would increase if you were to give these more "specific" charset values
a higher weight.
When I played with simple recognition algorithms about 15 years ago, I
found that some simple methods for crude language detection gave
signatures that would allow charset detection. Even though these methods
weren't sophisticated enough to resolve actual languages (esp. among
closely related languages) they were good enough to narrow things down
to the point where one could pick or confirm charsets.
For example, significant stretches of German can be written without
diacritics, and can fool charset detection unless it picks up on the
statistical patterns for German. With that in hand, the first non-ASCII
character encountered is then likely to "nail" the charset. Or, absent
such a character, the statistics can be used to confirm that an existing
charset assignment is plausible. (8859-15 having been deliberately
designed to be "undetectable" is the exception, unless there's a Euro
sign in the scanned part of the document...)
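The German example can be sketched roughly as follows. This is not the code I used back then: the word list, the candidate charsets and the function names are hypothetical, and a real implementation would use proper letter or n-gram statistics instead of a handful of common words.

```python
# Hypothetical, tiny stand-ins for real language statistics.
GERMAN_WORDS = {"der", "die", "das", "und", "ist", "nicht", "ein", "mit", "auch"}
GERMAN_LETTERS = set("äöüÄÖÜß")
# Assumed candidate list; a real detector would consider many more.
CANDIDATES = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]


def looks_german(data: bytes, threshold: float = 0.2) -> bool:
    """Crude language guess using only the ASCII part of the data."""
    text = data.decode("ascii", errors="ignore").lower()
    words = [w.strip(".,;:!?") for w in text.split()]
    if not words:
        return False
    hits = sum(1 for w in words if w in GERMAN_WORDS)
    return hits / len(words) >= threshold


def plausible_charsets(data: bytes) -> list:
    """Candidates under which every non-ASCII byte is a German letter."""
    high = bytes(b for b in data if b >= 0x80)
    if not high:
        return list(CANDIDATES)  # no non-ASCII bytes to discriminate on
    result = []
    for name in CANDIDATES:
        try:
            decoded = high.decode(name)
        except UnicodeDecodeError:
            continue  # byte not defined in this charset
        if all(ch in GERMAN_LETTERS for ch in decoded):
            result.append(name)
    return result


sample = "Das ist nicht der Weg und die Tür ist auch zu".encode("iso-8859-1")
if looks_german(sample):
    print(plausible_charsets(sample))
```

Here a single "ü" (byte 0xFC) is enough to rule out the Cyrillic and Greek candidates once the text is known to look German. Charsets that place the German letters at the same positions as Latin-1 cannot be told apart this way.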
A./