Re: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)

From: Asmus Freytag (
Date: Mon Jun 28 2010 - 15:36:31 CDT

    On 6/28/2010 11:38 AM, Mark Davis ☕ wrote:
    > The problem with slavishly following the charset parameter is that it
    > is often incorrect. However, the charset parameter is a signal into
    > the character detection module, so if the charset is correctly supplied
    > from the message then the results of the detection will be weighted
    > in that direction.
    The weighting factor or mechanism may be worth looking at for possible
    improvement.

    Doug raised an interesting argument, i.e. that some values of a charset
    parameter might have a higher probability of being correct than others.

    If something is tagged Latin-1 or Windows-1252, the chances are that
    this is merely an unexamined default setting. Most of the other 8859
    values should be much less likely to be such "blind" defaults.

    I wonder whether the probability of successful charset assignment
    would increase if you were to give these more "specific" charset values
    a higher weight.
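
    As a rough illustration of that weighting idea, here is a minimal
    Python sketch. It is not how any real detector (Google's or otherwise)
    works. The names pick_charset, LABEL_TRUST and DEFAULT_TRUST are
    hypothetical, and the trust values are made-up numbers chosen only to
    show that a "blind default" label gets a smaller bonus than a more
    specific one.

```python
# Hypothetical trust bonus per declared charset label.
# The numbers are illustrative assumptions, not measured values.
LABEL_TRUST = {
    "iso-8859-1": 0.2,    # often an unexamined default
    "windows-1252": 0.2,  # often an unexamined default
}
DEFAULT_TRUST = 0.6       # more "specific" labels, e.g. iso-8859-2


def pick_charset(detector_scores, declared=None):
    """Combine detector scores with a label-dependent bonus.

    detector_scores: dict mapping charset name -> score in [0, 1],
                     as produced by some (assumed) detection module.
    declared:        charset parameter from the message, or None.
    """
    combined = dict(detector_scores)
    if declared is not None:
        label = declared.lower()
        trust = LABEL_TRUST.get(label, DEFAULT_TRUST)
        combined[label] = combined.get(label, 0.0) + trust
    # The candidate with the highest combined score wins.
    return max(combined, key=combined.get)
```

    With these assumed numbers, a declared iso-8859-2 label can override a
    detector that slightly prefers Latin-1, while a declared iso-8859-1
    label cannot override a detector that clearly prefers something else.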

    When I played with simple recognition algorithms about 15 years ago, I
    found that some simple methods for crude language detection gave
    signatures that would allow charset detection. Even though these methods
    weren't sophisticated enough to resolve actual languages (esp. among
    closely related languages) they were good enough to narrow things down
    to the point where one could pick or confirm charsets.
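
    The message does not say which methods were used. One crude technique
    of this general kind is counting common function words in the ASCII
    part of the text, sketched below in Python. The word lists and the
    name guess_language are assumptions made for illustration; real lists
    would be much larger.

```python
import re

# Tiny illustrative stopword lists (assumptions, far from complete).
STOPWORDS = {
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
    "english": {"the", "and", "is", "not", "of", "to", "in", "with"},
    "french": {"le", "la", "les", "et", "est", "pas", "un", "avec"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = {
        lang: sum(1 for w in words if w in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

    A guess like this is too coarse to separate closely related
    languages, but it may be enough to shrink the set of charsets worth
    considering.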

    For example, significant stretches of German can be written without
    diacritics, and can fool charset detection unless it picks up on the
    statistical patterns for German. With that in hand, the first non-ASCII
    character encountered is then likely to "nail" the charset. Or, absent
    such a character, the statistics can be used to confirm that an existing
    charset assignment is plausible. (8859-15 having been deliberately
    designed to be "undetectable" is the exception, unless there's a Euro
    sign in the scanned part of the document...)


