RE: Autodetection of CP437 vs. Latin-1

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Feb 15 2007 - 15:11:23 CST

    > From: Doug Ewell [mailto:dewell@adelphia.net]
    > > Consider also using a filesystem that can store more than just 8.3
    > > filenames, to allow such tagging; today, all systems have such
    > > capabilities (so forget FAT and FAT12, use FAT32 or NTFS to get long
    > > filenames on storage media, or Unix/Linux partitions...)
    >
    > I am using NTFS under Windows XP SP2. That has precisely nothing to do
    > with this. I have text files accumulated over the past 20 years that
    > are in various character sets that I would like to convert, or at least
    > view, with as much automatic charset recognition as possible. Renaming
    > the files to identify the charset is not part of the solution.

    This was just a suggestion. The main advantage is that you don't actually
    convert the data in a possibly destructive way: if your guess of the
    charset is wrong and you have already converted the text to Unicode,
    correcting the false guess becomes even more complicated if you have not
    tracked which conversion was made, because the matrix of possible
    conversions is squared in size!

    If you have to convert something using some automated guessing tool, make
    sure that you don't lose the original encoding, or at least that you keep a
    record of which conversion was done, so that it can easily be reversed and
    redone with another guess before reconverting to Unicode (a minimal sketch
    of this follows below).
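    To make that concrete, here is a minimal sketch in Python (my choice for
    illustration only) of a conversion that leaves the original file untouched
    and records which charset was guessed; the ".utf8" and ".conversion.json"
    file names are my own assumptions, not any standard:

    import json
    from pathlib import Path

    def convert_with_record(path: Path, guessed_charset: str) -> None:
        raw = path.read_bytes()
        # Decode with the guessed legacy charset, then re-encode as UTF-8;
        # the converted copy is written next to the untouched original.
        text = raw.decode(guessed_charset)
        path.with_name(path.name + ".utf8").write_bytes(text.encode("utf-8"))
        # Record which conversion was made, so a wrong guess can be reversed
        # and replaced by another guess later.
        record = {"source": path.name,
                  "guessed_charset": guessed_charset,
                  "target_encoding": "utf-8"}
        path.with_name(path.name + ".conversion.json").write_text(
            json.dumps(record, indent=2))

    # Example: convert_with_record(Path("notes.txt"), "cp437")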

    File renaming is interesting because the original encoding is not altered,
    so undoing it is basically a do-nothing operation that just consists of
    removing the tag to allow another guess.

    Similar alternatives are:
    * A separate database of meta-data storing the association between legacy
    texts and the guessed charset
    * File classification using directories, where all files in the same
    directory share the same charset information meta-data (which could be
    specified either as part of the directory path name, or in a special
    meta-data file stored in that directory)
    * A "mirror" directory using the same relative pathnames (except the base)
    to store meta-data files associated to text files (in a way similar to what
    MacOS does for storing the resource forks in a FAT filesystem). This is a
    costly solution in terms of storage space but is simpler to develop than a
    specific database. The interest is that you don't have to modify anything in
    the original storage of your legacy texts (you don't need extra spae there
    for storing the meta-data, you don't rename anything, you don't reencode
    anything, files keep their original modification dates, and you don't need
    to copy the source into a writable media if the source is a read-only
    long-term archive like a burnt CD or DVD, or a remote access to a read-only
    source where you don't have the administrator privileges).
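    As an illustration of that last alternative, here is a minimal sketch of
    such a mirror tree, again in Python; the directory names and the ".charset"
    sidecar suffix are my own assumptions, not any standard:

    from pathlib import Path

    SOURCE = Path("/mnt/archive")        # read-only legacy texts (assumed path)
    META = Path.home() / "archive-meta"  # writable mirror tree for the meta-data

    def tag(relative: str, charset: str) -> None:
        """Record a charset guess for SOURCE/relative without touching SOURCE."""
        sidecar = META / (relative + ".charset")
        sidecar.parent.mkdir(parents=True, exist_ok=True)
        sidecar.write_text(charset + "\n")

    def lookup(relative: str):
        """Return the recorded guess for SOURCE/relative, or None if untagged."""
        sidecar = META / (relative + ".charset")
        return sidecar.read_text().strip() if sidecar.exists() else None

    def read_text(relative: str) -> str:
        """Decode SOURCE/relative on the fly using the recorded guess."""
        charset = lookup(relative) or "latin-1"   # fallback charset is arbitrary
        return (SOURCE / relative).read_bytes().decode(charset)

    # Example: tag("dos/readme.txt", "cp437"); print(read_text("dos/readme.txt"))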

    For various reasons, I don't like converting the data sources as long as we
    need to keep long-term archives:
    * there are legal and contractual requirements that archives must not be
    altered.
    * on-the-fly conversion of charset encodings to/from Unicode is really not
    costly: a small conversion table is enough to support any volume of legacy
    files, the conversion routines are extremely fast, and they are now part of
    the standard features of any current OS or standard string library (see the
    sketch after this list).
    * technologies are changing all the time: today's representations are not
    always the best ones for future data processing.
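    To show how small such a conversion table really is, here is a sketch in
    Python, whose standard codecs already include cp437 and latin-1: the whole
    table for a single-byte charset is just 256 entries, and decoding is a
    plain table lookup.

    # The entire "conversion table" for a single-byte charset: byte -> code point.
    CP437_TABLE = bytes(range(256)).decode("cp437")   # a 256-character string

    def decode_cp437(data: bytes) -> str:
        """Decode legacy bytes by table lookup (same result as data.decode('cp437'))."""
        return "".join(CP437_TABLE[b] for b in data)

    if __name__ == "__main__":
        sample = bytes([0xC4, 0xE9, 0xB0])    # bytes where CP437 and Latin-1 differ
        print(decode_cp437(sample))           # via the explicit table
        print(sample.decode("cp437"))         # via the built-in codec: identical
        print(sample.decode("latin-1"))       # same bytes read as Latin-1: other text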


