RE: Autodetection of CP437 vs. Latin-1

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Feb 15 2007 - 15:11:23 CST

    > From: Doug Ewell [mailto:dewell@adelphia.net]
    > > Consider also using a filesystem that can store more than just 8.3
    > > filenames, to allow such tagging; today, all systems have such
    > > capabilities (so forget FAT and FAT12, use FAT32 or NTFS to get long
    > > filenames on storage media, or Unix/Linux partitions...)
    >
    > I am using NTFS under Windows XP SP2. That has precisely nothing to do
    > with this. I have text files accumulated over the past 20 years that
    > are in various character sets that I would like to convert, or at least
    > view, with as much automatic charset recognition as possible. Renaming
    > the files to identify the charset is not part of the solution.

    This was just a suggestion. The main advantage is that you don't actually
    convert the data in a possibly destructive way: if your guess of the
    charset is wrong and you have already converted the text to Unicode,
    correcting the false guess becomes even more complicated if you have not
    tracked which conversion was made, because the matrix of possible
    conversions is squared in size!

    If you have to convert something using some automated guessing tool, make
    sure that you don't lose the original encoding, or at least that you keep a
    record of which conversion was done, so that it can easily be reversed and
    redone with another guess before reconverting to Unicode (a minimal sketch
    of this follows below).
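    To make that concrete, here is a minimal sketch in Python (my choice for
    illustration only) of a conversion that leaves the original file untouched
    and records which charset was guessed; the ".utf8" and ".conversion.json"
    file names are my own assumptions, not any standard:

    import json
    from pathlib import Path

    def convert_with_record(path: Path, guessed_charset: str) -> None:
        raw = path.read_bytes()
        # Decode with the guessed legacy charset, then re-encode as UTF-8;
        # the converted copy is written next to the untouched original.
        text = raw.decode(guessed_charset)
        path.with_name(path.name + ".utf8").write_bytes(text.encode("utf-8"))
        # Record which conversion was made, so a wrong guess can be reversed
        # and replaced by another guess later.
        record = {"source": path.name,
                  "guessed_charset": guessed_charset,
                  "target_encoding": "utf-8"}
        path.with_name(path.name + ".conversion.json").write_text(
            json.dumps(record, indent=2))

    # Example: convert_with_record(Path("notes.txt"), "cp437")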

    File renaming is interesting because the original encoding is not altered,
    so undoing it is basically a do-nothing operation that just consists of
    removing the tag to allow another guess.

    Similar alternatives are:
    * A separate database of meta-data storing the association between legacy
    texts and the guessed charset
    * File classification using directories, where all files in the same
    directory share the same charset information meta-data (which could be
    specified either as part of the directory path name, or in a special
    meta-data file stored in that directory)
    * A "mirror" directory using the same relative pathnames (except the base)
    to store meta-data files associated to text files (in a way similar to what
    MacOS does for storing the resource forks in a FAT filesystem). This is a
    costly solution in terms of storage space but is simpler to develop than a
    specific database. The interest is that you don't have to modify anything in
    the original storage of your legacy texts (you don't need extra spae there
    for storing the meta-data, you don't rename anything, you don't reencode
    anything, files keep their original modification dates, and you don't need
    to copy the source into a writable media if the source is a read-only
    long-term archive like a burnt CD or DVD, or a remote access to a read-only
    source where you don't have the administrator privileges).
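    As an illustration of that last alternative, here is a minimal sketch of
    such a mirror tree, again in Python; the directory names and the ".charset"
    sidecar suffix are my own assumptions, not any standard:

    from pathlib import Path

    SOURCE = Path("/mnt/archive")        # read-only legacy texts (assumed path)
    META = Path.home() / "archive-meta"  # writable mirror tree for the meta-data

    def tag(relative: str, charset: str) -> None:
        """Record a charset guess for SOURCE/relative without touching SOURCE."""
        sidecar = META / (relative + ".charset")
        sidecar.parent.mkdir(parents=True, exist_ok=True)
        sidecar.write_text(charset + "\n")

    def lookup(relative: str):
        """Return the recorded guess for SOURCE/relative, or None if untagged."""
        sidecar = META / (relative + ".charset")
        return sidecar.read_text().strip() if sidecar.exists() else None

    def read_text(relative: str) -> str:
        """Decode SOURCE/relative on the fly using the recorded guess."""
        charset = lookup(relative) or "latin-1"   # fallback charset is arbitrary
        return (SOURCE / relative).read_bytes().decode(charset)

    # Example: tag("dos/readme.txt", "cp437"); print(read_text("dos/readme.txt"))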

    For various reasons, I don't like converting the data sources as long as we
    need to keep long-term archives:
    * there are legal and contractual requirements that archives must not be
    altered.
    * on-the-fly conversion of charset encodings to/from Unicode is really not
    costly: a small conversion table is enough to support any volume of legacy
    files, the conversion routines are extremely fast, and they are now part of
    the standard features of any current OS or standard string library (see the
    sketch after this list).
    * technologies are changing all the time: today's representations are not
    always the best ones for future data processing.
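    To show how small such a conversion table really is, here is a sketch in
    Python, whose standard codecs already include cp437 and latin-1: the whole
    table for a single-byte charset is just 256 entries, and decoding is a
    plain table lookup.

    # The entire "conversion table" for a single-byte charset: byte -> code point.
    CP437_TABLE = bytes(range(256)).decode("cp437")   # a 256-character string

    def decode_cp437(data: bytes) -> str:
        """Decode legacy bytes by table lookup (same result as data.decode('cp437'))."""
        return "".join(CP437_TABLE[b] for b in data)

    if __name__ == "__main__":
        sample = bytes([0xC4, 0xE9, 0xB0])    # bytes where CP437 and Latin-1 differ
        print(decode_cp437(sample))           # via the explicit table
        print(sample.decode("cp437"))         # via the built-in codec: identical
        print(sample.decode("latin-1"))       # same bytes read as Latin-1: other text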


