From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 14 2003 - 17:26:29 EDT
On Monday, July 14, 2003 10:14 PM, Peter_Constable@sil.org <Peter_Constable@sil.org> wrote:
> Are there any libraries out there (open-source or otherwise) that can
> be used to detect the character encoding of a file or data stream?
Yes, but these libraries really try to detect the encoded language
as well as the encoding. They first apply strict validity rules to
eliminate impossible encodings, then statistical rules to match
languages against their typical encoded byte sequences, and finally
lists of common words. The result is probabilistic: what you
get is an ordered list of language-encoding pairs. There are many
cases where the final decision is ambiguous, so the choice may have
to be adjusted by the reader.
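
The two stages described above can be sketched roughly as follows. This is only a minimal illustration, not the method of any particular library: the candidate list, the tiny word lists and the names strictly_valid, word_score and rank_guesses are all assumptions made up for the example, and a real detector would also use byte-sequence statistics rather than common words alone.

```python
# Hypothetical candidate list and word lists, chosen only for illustration.
CANDIDATES = ["ascii", "utf-8", "gb2312", "iso-8859-1"]
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "fr": {"le", "la", "les", "des", "et", "est"},
}


def strictly_valid(data, candidates=CANDIDATES):
    """Step 1: keep only encodings under which the bytes decode without error."""
    survivors = []
    for enc in candidates:
        try:
            survivors.append((enc, data.decode(enc, errors="strict")))
        except UnicodeDecodeError:
            pass  # invalid byte sequence: this encoding is ruled out
    return survivors


def word_score(text, words):
    """Step 2: fraction of whitespace-separated tokens that are common words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in words for token in tokens) / len(tokens)


def rank_guesses(data):
    """Return (score, language, encoding) triples, best guess first."""
    guesses = [
        (word_score(text, words), lang, enc)
        for enc, text in strictly_valid(data)
        for lang, words in COMMON_WORDS.items()
    ]
    guesses.sort(key=lambda g: g[0], reverse=True)
    return guesses


sample = "l'été est la saison des fêtes et les".encode("utf-8")
for score, lang, enc in rank_guesses(sample):
    print(f"{score:.2f}  {lang}  {enc}")
```

Several encodings tie for the same language here, because the words that match are pure ASCII and decode identically under each surviving candidate. That is the kind of ambiguity that leaves the final choice to the reader.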
Simple algorithms are used in Internet Explorer for its
"auto-determined" mode, but it often fails and detects Chinese
text encoded with EUC-CN or UTF-7, when in fact the text is just
plain English coded in ASCII. This failure occurs with Chinese
simply because there is no actual dictionary against which to match
the common ideographs often used in Chinese text (notably its
ideographic punctuation and square spaces).
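
A small check shows why validity rules alone cannot prevent this kind of misdetection: plain ASCII text is also a valid byte sequence in EUC-CN and in UTF-7, so neither can be eliminated without further evidence. The sketch below uses Python's standard codecs ("gb2312" stands in for EUC-CN) and a hypothetical helper name; it says nothing about how Internet Explorer itself works.

```python
def encodings_accepting(data, encodings):
    """Return the encodings under which `data` decodes without any error."""
    accepted = []
    for enc in encodings:
        try:
            data.decode(enc, errors="strict")
        except UnicodeDecodeError:
            continue
        accepted.append(enc)
    return accepted


english = b"Plain English text."
# "gb2312" is Python's codec for the EUC-CN encoding of GB 2312.
print(encodings_accepting(english, ["ascii", "gb2312", "utf-7"]))

# All three decodings yield exactly the same characters, so the byte
# stream by itself carries no evidence for one encoding over another.
print({enc: english.decode(enc) for enc in ("ascii", "gb2312", "utf-7")})
```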
However, purely statistical rules often work when only the
encoding has to be detected (but with no guarantee).
I don't use Mozilla, but it may have such a mode for the detection
of the actual encoding; if so it should be in its sources (I did not
check).
-- Philippe. Spam is not tolerated: any unsolicited message will be reported to your Internet service providers.