From: Jungshik Shin (jshin@mailaps.org)
Date: Sat May 10 2003 - 22:04:39 EDT
On Sat, 10 May 2003, Maurice Bauhahn wrote:
> It would appear to be a three step process:
>
> (1) First, detect whether there are patterns reflecting single or multiple
> byte encoding and separate the text into apparent units. Hence work out
> for the last two). I'm not aware of Shift-JIS, Big5, or EUC encoding
> patterns, but presumably there are some characters for these. The units
SJIS, Big5, JOHAB, GBK, and GB18030 form a class of multibyte CJK
encodings that are not ISO 2022 compliant, while EUC-JP, EUC-KR, EUC-CN,
and EUC-TW are ISO 2022 compliant multibyte CJK encodings. ISO-2022-JP(-x),
ISO-2022-KR, and ISO-2022-CN belong to another class of ISO 2022 compliant
encodings, one that uses ISO 2022 escape sequences. HZ is in a class of
its own. For details, see Ken Lunde's CJKV Information Processing.
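
As a rough illustration of how these classes show up at the byte level,
here is a minimal Python sketch. The function name and the class labels
are made up for this example, and the checks are only a first-pass
filter, not a real detector: 7-bit text carrying ESC sequences points at
the ISO-2022-xx family, "~{" points at HZ, and for 8-bit text any byte
outside 0xA1-0xFE (other than the single shifts 0x8E and 0x8F) rules
out EUC.

```python
def coarse_class(data: bytes) -> str:
    """Place a byte string in one of the coarse classes described above.

    Hypothetical helper for illustration; the labels are not standard names.
    """
    high = [b for b in data if b >= 0x80]
    if high:
        # EUC-* uses only 0xA1-0xFE plus the single shifts 0x8E and 0x8F,
        # so any other 8-bit byte means the text cannot be EUC
        # (SJIS, Big5, JOHAB, GBK and GB18030 remain possible).
        if any(not (0xA1 <= b <= 0xFE or b in (0x8E, 0x8F)) for b in high):
            return "8bit-not-euc"
        # All 8-bit bytes fit EUC, but some Big5/GBK text fits here too.
        return "8bit-maybe-euc"
    # Pure 7-bit data from here on.
    if b"\x1b$" in data or b"\x1b(" in data:
        return "iso2022-escape"   # ISO-2022-JP/KR/CN style designations
    if b"~{" in data:
        return "hz"               # HZ switches to GB 2312 with "~{"
    return "ascii"


if __name__ == "__main__":
    for codec in ("iso2022_jp", "euc_jp", "shift_jis"):
        print(codec, coarse_class("日本語".encode(codec)))
```

A result of "8bit-maybe-euc" still leaves several candidates open, which
is where the frequency comparison in step (2) would come in.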
> (2) Second, compare this list against a hash of reference frequencies versus
> (3) Third, with a generous bit of fuzzy logic (!!), test against the most
> likely encodings (normalising the assumed code points to Unicode) and run
This is all good advice. As already mentioned, the final touch would
be to let the user override what your program comes up with. Web browsers
also need this encoding detection technique (there are numerous unlabelled
or mislabelled web pages and email messages), and Mozilla has a couple of
detectors: a 'universal' one and language/script-specific ones. Needless
to say, the latter have a higher chance of getting it right than the
former. Take a look at intl/unichardet in Mozilla's CVS.
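
To make the override and the universal vs. language-specific point
concrete, here is a small Python sketch. It is not Mozilla's algorithm;
it only tries strict decoding against a candidate list and returns the
first codec that accepts the bytes. The function name and both candidate
lists are assumptions made for this example.

```python
from typing import Optional, Sequence

# Hypothetical candidate lists; a real detector would rank by likelihood
# rather than rely on list order.
UNIVERSAL = ["utf_8", "euc_jp", "euc_kr", "shift_jis", "big5", "gb18030"]
KOREAN = ["utf_8", "euc_kr", "johab"]


def guess_encoding(data: bytes,
                   candidates: Sequence[str],
                   override: Optional[str] = None) -> Optional[str]:
    """Return the user's override if given, else the first candidate
    codec that decodes the data without error, else None."""
    if override is not None:
        return override          # the user always has the last word
    for name in candidates:
        try:
            data.decode(name)    # strict decoding: invalid bytes raise
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    sample = "한글".encode("euc_kr")
    print(guess_encoding(sample, KOREAN))
    print(guess_encoding(sample, UNIVERSAL))
    print(guess_encoding(sample, UNIVERSAL, override="euc_kr"))
```

Many byte sequences are valid in more than one of these encodings, so a
broad candidate list can settle on the wrong codec where a narrower,
language-specific list would not. That is the reason to keep the
override.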
Jungshik