RE: Identifying file encoding scheme

From: Montgomery Securities (mkrebs@primebroker.com)
Date: Thu Sep 09 1999 - 17:38:13 EDT

Next message: John Cowan: "Re: First draft of proposed XML TC for Unicode 3.0 (unofficial)"
Previous message: peter_constable@sil.org: "Re: IPA a vowels"
Maybe in reply to: Montgomery Securities: "Identifying file encoding scheme"
Next in thread: Addison Phillips: "RE: Identifying file encoding scheme"

The side discussions that I have had with some of the very helpful members
of this list have led to some conclusions:

1) Windows NT uses some extra heuristics, in addition to simply looking for
the signature at the beginning of the file, to identify a file as
"Unicoded". When a file's text is highly repetitive, these heuristics can
misidentify it. Unfortunately for me, database exports can sometimes be
highly repetitive.

2) This is a problem with Windows NT, not Notepad. If you create a file
that confuses NT's Unicode detection algorithm, and use the command "type
confused.txt > confuse2.txt" to make a copy of the file, the file
"confuse2.txt" is half the length of "confused.txt". For plain text files,
the lengths shouldn't differ at all. Since "type" is not a separate
program, but rather a built-in command of the command shell ("cmd.exe"),
I conclude that the problem lies in NT, and not in any particular
application.

3) For Perl programmers, this program will generate a file that will
confuse NT:

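# Write two identical lines: one "A" followed by 100 copies of "AAAB".
open(OUTFILE, ">c:\\confused.txt") or die("cannot open file: $!\n");
$c1 = "A";
$c2 = "B";
$line = $c1 . ((($c1 x 3) . $c2) x 100) . "\n";
# print, not printf: the data is not a format string.
print OUTFILE $line x 2;
close(OUTFILE);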

The file looks something like this:
AAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAAB
AAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAAB

Two lines, each 401 characters long (not including the CR/LF),
consisting of one "A" followed by 100 "AAAB"s. There are many other
variations on this that will confuse NT, but this one is fairly easy to
create. You can even type it into Notepad by hand, save it to disk, and
then try to read it right back in. A perfectly normal text file to everyone
but Microsoft. In an NT command shell, the "type" command shows the file
only as a bunch of ?'s.
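
The halved length and the ?'s are both consistent with the file's bytes
being read two at a time as 16-bit code units and then converted to an
8-bit code page. Here is a minimal Python sketch of that reading. The
choice of little-endian UTF-16 and of CP1252 as the target code page are
my assumptions for illustration, not confirmed details of what NT does.

```python
# Build the same bytes the Perl program writes on Windows (CR/LF line ends).
line = "A" + "AAAB" * 100                    # 401 characters
raw = ((line + "\r\n") * 2).encode("ascii")  # 806 bytes in total

# Assumption: the bytes are misread as little-endian 16-bit code units.
misread = raw.decode("utf-16-le")

# Assumption: the result is then converted to an 8-bit code page (CP1252
# here); characters with no mapping in that code page become "?".
copied = misread.encode("cp1252", errors="replace")

print(len(raw), len(misread), len(copied))   # prints: 806 403 403
print(copied[:10])                           # prints: b'??????????'
```

Every pair of ASCII bytes lands on a character that CP1252 cannot
represent, so the copy comes out as 403 question marks, half of 806.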

Does anyone know of any Unicode detection heuristics that are currently in
use by any software packages? This might help me rewrite the program that
exports the data in a way that won't confuse NT.
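
To make the question concrete, here is a rough Python sketch of the two
kinds of check I mean: a signature test, and one statistic that a
heuristic might plausibly use. The names SIGNATURES, signature and
byte_variety are made up for this example, and the statistic is only a
guess at how repetitive text could mislead a detector. It is not a
description of NT's actual algorithm.

```python
# Signatures (byte-order marks) that can open a 16-bit Unicode text file.
SIGNATURES = {
    b"\xff\xfe": "UTF-16LE",
    b"\xfe\xff": "UTF-16BE",
}

def signature(data: bytes):
    """Return the encoding named by a leading signature, or None."""
    for mark, name in SIGNATURES.items():
        if data.startswith(mark):
            return name
    return None

def byte_variety(data: bytes):
    """Count distinct byte values at even and at odd offsets.

    Hypothetical statistic: in 16-bit text from a single script, one byte
    of each pair tends to stay nearly constant while the other varies.
    """
    return len(set(data[0::2])), len(set(data[1::2]))

line = b"A" + b"AAAB" * 100     # first line of confused.txt, without CR/LF
print(signature(line))          # prints: None (the file has no signature)
print(byte_variety(line))       # prints: (2, 1)
```

In the first line of the test file every "B" falls on an even offset, so
the odd-offset bytes never change. That is the same shape that 16-bit text
with a constant high byte would have, which may be why such a file looks
"Unicoded" to a statistical test.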

Thanks!

Michael Krebs

Michael Everson scripsit:

> Does that mean this e-mail confuses MS software?

Seemingly not: Windows NT 4.0 Notepad treats it resolutely as CP1252,
and I don't know why or how. Conversely, the oddball ASCII file
is treated as UCS-2, and I don't understand that either.

Microsoft folks?

--
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT