At 09:11 29.3.1999 -0800, Mark Davis wrote:
>Heuristics for distinguishing between ASCII-family encodings (ASCII, 8859
>series, etc.) and Unicode (UTF-8, UTF-16BE, UTF-16LE) are pretty easy.
>They work well if you have a reasonable amount of data to analyse (a few
>hundred bytes). [If you try to distinguish among all character sets
>(Unicode, ASCII-family, EUC-family, EBCDIC-family, ISO 2022), it gets
>quite complicated.]
>
>Off the top of my head, here are some things to check for (others are
>welcome to add to this):
>UTF-16BE/LE
The most obvious test is to check the size of the record or the total size
of the file (on Win 9x/NT using GetFileSize, or more portably by calling
fseek to seek to the end of the file and then ftell to get the file size):
if the byte count is odd, it cannot be a UTF-16 file.
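In C, the portable version could look something like this (just a sketch;
the function name is my own):

#include <stdio.h>

/* Return 1 if the file size is odd (so the file cannot be well-formed
   UTF-16), 0 if it is even, -1 on error. */
int has_odd_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    long size;

    if (f == NULL)
        return -1;
    if (fseek(f, 0L, SEEK_END) != 0) {
        fclose(f);
        return -1;
    }
    size = ftell(f);
    fclose(f);
    if (size < 0)
        return -1;
    return (size % 2 != 0) ? 1 : 0;
}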
>However, there are checks you can use for the likelihood of text being
>UTF-16BE/LE:
>
>- If you get a 00 byte (or other unusual control-character bytes) then
>you are probably in UTF-16. SPACE (0020), TAB (0009), CR (000D) and LF
>(000A) and common punctuation will often cause this to happen, even in
>non-Latin texts.
Since so many programs are written in C, where a 00 byte terminates the
string, it is extremely unlikely to find 00 bytes in single-byte or
variable-length character code text files; if 00 bytes do appear, the text
must be UTF-16 (a sketch combining this check with the next two quoted
points follows below).
>- If you get lots of cases where every other byte is identical, you are
>probably in UTF-16.
>
>- When you hit the above cases, you can use the polarity of the byte
>index (even or odd) to distinguish between UTF-16BE and UTF-16LE.
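Both checks could be combined into something like the following sketch
(the thresholds are arbitrary illustrations, not tuned values):

#include <stddef.h>

enum guess { UNKNOWN, UTF16BE, UTF16LE };

/* Count 00 bytes at even and odd offsets. Many zero bytes at even
   offsets suggest UTF-16BE (the high byte of e.g. U+0020 SPACE lands
   on the even position), many at odd offsets suggest UTF-16LE. */
enum guess guess_utf16(const unsigned char *buf, size_t len)
{
    size_t i, even_zeros = 0, odd_zeros = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] == 0x00) {
            if (i % 2 == 0)
                even_zeros++;
            else
                odd_zeros++;
        }
    }
    /* Require a noticeable share of zero bytes (here: 5% of the
       buffer) before guessing at all. */
    if ((even_zeros + odd_zeros) * 20 < len)
        return UNKNOWN;
    if (even_zeros > 4 * odd_zeros)
        return UTF16BE;
    if (odd_zeros > 4 * even_zeros)
        return UTF16LE;
    return UNKNOWN;
}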
At least for non-CJK languages, it is a good idea to build 256-element
histograms separately for the odd and even bytes in the data stream. This
not only detects whether the text is UTF-16 and settles the BE/LE question,
but also gives some hints about which language(s) are used, so that
language-specific processing (such as a spell checker or better
language-specific fonts) can be selected.
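The histogram building itself is trivial (again just a sketch; the harder
part, interpreting the histograms, e.g. matching them against known
language profiles, is left open):

#include <stddef.h>

/* Fill two 256-entry histograms: one for bytes at even offsets, one
   for bytes at odd offsets. For UTF-16 text in a Latin-based language
   one histogram is dominated by 0x00, while the shape of the other
   hints at the language. */
void build_histograms(const unsigned char *buf, size_t len,
                      unsigned long even_hist[256],
                      unsigned long odd_hist[256])
{
    size_t i;

    for (i = 0; i < 256; i++)
        even_hist[i] = odd_hist[i] = 0;
    for (i = 0; i < len; i++) {
        if (i % 2 == 0)
            even_hist[buf[i]]++;
        else
            odd_hist[buf[i]]++;
    }
}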
Paul Keinanen