RE: Character set conversion question

From: Dreiheller, Albrecht (albrecht.dreiheller@siemens.com)
Date: Wed Jun 17 2009 - 02:46:00 CDT

Next message: Jeroen Ruigrok van der Werven: "Re: Jyutping Phrase Box to be removed (was: Unihan database: kCangjie field)"

Previous message: Leo Broukhis: "Re: Character set conversion question"
In reply to: Leo Broukhis: "Re: Character set conversion question"
Next in thread: Andreas Prilop: "Re: Character set conversion question"

The brute-force approach should be table-driven, using a data model that can express
composite conversions and "reverse misinterpretations", as Björn Höhrmann described it.
The first few table entries can be created manually and hold the favorites.
They are followed by generic one-stage entries covering all known encodings,
then by generic two-stage entries, each combining two one-stage conversions,
and so on.
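
A minimal sketch of such a table in Python, under some assumptions of mine: the damaged text is available as a str, a "one-stage entry" is a pair (wrong_decoding, real_encoding) that is undone by re-encoding with the first and decoding with the second, and the short ENCODINGS list stands in for "all known encodings". The names ENCODINGS, build_table, undo and candidates are hypothetical, chosen only for illustration.

```python
from itertools import product

# Illustrative subset only; a real table would list every codec available.
ENCODINGS = ["utf-8", "iso-8859-1", "cp1252", "koi8-r", "cp1251"]


def build_table(encodings, favorites=(), max_stages=2):
    """Return conversion chains: favorites first, then all one-stage
    chains, then all two-stage chains, up to max_stages."""
    singles = [(wrong, real)
               for wrong, real in product(encodings, repeat=2)
               if wrong != real]
    table, seen = [], set()
    for chain in favorites:
        chain = tuple(chain)
        if chain not in seen:
            seen.add(chain)
            table.append(chain)
    for stages in range(1, max_stages + 1):
        for chain in product(singles, repeat=stages):
            if chain not in seen:
                seen.add(chain)
                table.append(chain)
    return table


def undo(text, chain):
    """Reverse each misinterpretation in the chain, in order."""
    for wrong, real in chain:
        text = text.encode(wrong).decode(real)
    return text


def candidates(text, table):
    """Yield (chain, repaired_text) for every chain that applies cleanly,
    skipping results that are unchanged or already seen."""
    seen = {text}
    for chain in table:
        try:
            repaired = undo(text, chain)
        except UnicodeError:
            continue  # this chain cannot have produced the text
        if repaired not in seen:
            seen.add(repaired)
            yield chain, repaired
```

With the favorite (("iso-8859-1", "utf-8"),) placed first, a text such as "Björn" that was encoded as UTF-8 and misread as ISO-8859-1 comes back as the first candidate, and the same damage applied twice is recovered by one of the two-stage chains. The table grows quickly with the number of encodings and stages, which is why the ordering of entries matters.
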
The user then chooses the applicable decoding by checking the results visually in
an interactive user interface. (MS Word shows such a dialog when opening .txt files whose encoding is unknown.)
The user's choices can then be used to rank the table entries, so that the
most common encodings (or "misencodings") are found first, much as popular web search engines rank their results.
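
The ranking could be kept as a simple hit count per table entry. The sketch below is only one possible reading: RankedTable, record_choice and ordered are hypothetical names, and an entry can be any hashable description of a conversion.

```python
from collections import Counter


class RankedTable:
    """Table of conversion entries, reordered by how often users chose them."""

    def __init__(self, entries):
        self._entries = list(entries)   # initial order: favorites first
        self._hits = Counter()          # entry -> number of times chosen

    def record_choice(self, entry):
        """Register that a user picked this entry as the correct decoding."""
        if entry not in self._entries:
            raise ValueError(f"unknown table entry: {entry!r}")
        self._hits[entry] += 1

    def ordered(self):
        """Entries sorted by descending hit count.

        sorted() is stable, so entries with equal counts keep their
        original table order.
        """
        return sorted(self._entries, key=lambda entry: -self._hits[entry])
```

Because ties keep their original order, manually created favorites stay at the top until real usage data displaces them.
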

Albrecht

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Leo Broukhis
Sent: Wednesday, June 17, 2009 7:03 AM
To: Bjoern Hoehrmann
Cc: unicode Unicode Discussion
Subject: Re: Character set conversion question

That's exactly my question: how to organize the brute force approach
with 217 character maps in /usr/share/i18n/charmaps/ ?

Leo

On Tue, Jun 16, 2009 at 3:38 PM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
> * Leo Broukhis wrote:
>>What would be a way to find out what character set conversions were
>>applied to the text?
>
> Where the brute force approach fails and you have not misanalyzed the
> byte stream (copy and paste from a mail program may be unreliable), it
> is likely that you either have not tried enough encodings, or the
> encoding is the result of function composition. For example, it might
> have been ISO-8859-X which is then interpreted as ISO-8859-Y and then
> encoded using ISO-8859-Z by some process; a popular example is UTF-8
> encoded data re-interpreted as ISO-8859-1 and re-encoded as UTF-8.
> Then your brute force search has to include such compositions as well.
> --
> Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
> Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
> 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
>


This archive was generated by hypermail 2.1.5 : Wed Jun 17 2009 - 02:49:19 CDT