From: Sebastian Hofer (sebastian.hofer@gistec-online.de)
Date: Wed May 14 2003 - 04:19:49 EDT
Hi List:
Thanks to all who answered. Since the hints and links all take different
approaches, it is hard to give a general statement, so give them a try.
#####################################
Thanks to:
Edward Trager (Linux solution)
T. "Kuro" Kurosaka (basistech)
Marco Cimarosti (languageidentifier)
Ben Dougall (mlmassociates/Dcpcmd)
#####################################
Here are the links and solutions:
-----------------------------
> On Linux there is the command line utility called "file" which will
> certainly segregate ASCII and UTF-8. Although it doesn't go very
> far in detecting other unicode encoding possibilities, I'm sure one could
> combine this with a little bit of Perl to meet your specific needs:
> $> file *
> images: directory
> index.html: HTML document text
> java.data: ASCII text
> ucs2.data: MP3, 56 kBits2, 64 kBits, 48 kHz, Stereo
> utf-16-be.data: data
> utf-16-le.data: data
> utf-7.data: ASCII text
> utf8.data: UTF-8 Unicode text
> utf8.data.png: PNG image data, 914 x 676, 2-bit colormap, non-interlaced
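
For anyone who would rather not shell out to "file", here is a minimal
sketch of the same ASCII / UTF-8 split in Python instead of Perl. The name
classify_bytes is made up for illustration, and like "file" it will report
UTF-7 as ASCII and say nothing useful about UTF-16.

```python
def classify_bytes(data: bytes) -> str:
    """Rough three-way split: 'ascii', 'utf-8', or 'unknown'."""
    try:
        # Pure 7-bit data decodes cleanly as ASCII.
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Strict UTF-8 decoding rejects most legacy 8-bit text.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


if __name__ == "__main__":
    import sys

    # Usage: python classify.py file1 file2 ...
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(f"{path}: {classify_bytes(fh.read())}")
```

A clean UTF-8 decode is strong evidence but not proof; short 8-bit
samples can occasionally pass by accident.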
===============
http://www.basistech.com/products/text-processing/euclid.html
This is good although it is expensive. Free online demo!
===============
http://www.languageidentifier.com/
===============
have a look at the very recent thread on this list, in the archives:
"suggestions for strategy on dealing with plain text in potentially any
(unspecified) encoding?" there's a lot of useful stuff in that.
basically nearly all text encodings just go ahead and use their
encoding without stating "i'm 7bit ascii" or whatever, first. (even
unicode, when it doesn't use a bom). so, often the required info simply
isn't there. some html, most (maybe all) xml, some unicode (via a bom)
and most (maybe all) emails carry information about which encoding is
being used.
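
where the format does carry the information, reading it can be sketched
roughly as follows. this is only an illustration: declared_encoding and
both regular expressions are my own assumptions, not a full html/xml
parser, and the approach only works when the file is in an
ascii-compatible encoding so the declaration itself is readable.

```python
import re

# Assumed patterns for illustration; real parsers handle many more cases.
_XML_DECL = re.compile(
    rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)
_HTML_META = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def declared_encoding(data: bytes):
    """Return the encoding named in an XML declaration or HTML meta tag,
    lowercased, or None if no declaration is found near the start."""
    head = data[:1024]  # declarations are expected near the top
    for pattern in (_XML_DECL, _HTML_META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None
```

a declared label can still be wrong, so treat it as a claim to check
rather than a guarantee.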
so it seems if anything is going to tell you explicitly which encoding
is being used, it's going to be the text format rather than the
encoding itself (apart from unicode and its boms). if the text or the
encoding itself does not specify the encoding, i don't think there is
any absolute, sure way to find out. but there are various methods to
make good, educated guesses (see the thread i mentioned).
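
the bom case is the one part that can be checked mechanically. a small
sketch, assuming python's standard codecs constants; sniff_bom is a
made-up name, and a missing bom tells you nothing either way.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

one known ambiguity: utf-16-le text whose first character after the bom
is U+0000 looks the same as a utf-32-le bom, so this sketch would report
utf-32-le for it.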
also someone on this list pointed me to this which you might find
useful:
<http://www.mlmassociates.cc/dl-win32.htm>
Dcpcmd is a command line program that illustrates using the Windows
IMultiLanguage interface to detect a code page.
Cheers!
Seb
This archive was generated by hypermail 2.1.5 : Wed May 14 2003 - 05:26:03 EDT