The side discussions that I have had with some of the very helpful members 
of this list have led to some conclusions:
1) Windows NT uses some extra heuristics in addition to simply looking for 
the signature at the beginning of the file to identify a file as 
"Unicoded". These heuristics can, in the case of files that have textual 
contents that are highly repetitive, lead to misidentification. 
Unfortunately for me, database exports can sometimes be highly repetitive.
2) This is a problem with Windows NT, not Notepad. If you create a file 
that confuses NT's Unicode detection algorithm and copy it with the command 
"type confused.txt > confuse2.txt", the copy "confuse2.txt" is half the 
length of "confused.txt". For plain text files, the lengths shouldn't differ 
at all. Since "type" is not a separate program but a built-in function of 
the command shell ("cmd.exe"), I conclude that the fault lies with NT itself 
and not with any particular application.
3) For Perl programmers, this program will generate a file that will 
confuse NT:
open(OUTFILE, ">c:\\confused.txt") or die("cannot open file: $!\n");
$c1 = "A";
$c2 = "B";
# One "A" followed by 100 copies of "AAAB".
$line = $c1 . ((($c1 x 3) . $c2) x 100) . "\n";
# print, not printf: the data must not be interpreted as a format string.
print OUTFILE $line, $line;
close(OUTFILE);
The file looks something like this:
AAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAAB
AAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAAB
Two lines, each 401 characters long (not counting the CR/LF), consisting of 
one "A" followed by 100 "AAAB"s. There are many other variations that will 
confuse NT, but this one is fairly easy to create. You can even type it into 
Notepad by hand, save it to disk, and then try to read it right back in. It 
is a perfectly normal text file to everyone but Microsoft. In an NT command 
shell, typing the file shows only a bunch of ?'s.
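One possible reading of the halved length and the ?'s, sketched in Python rather than Perl: if the bytes are taken as 16-bit little-endian Unicode, every pair of bytes becomes one character, and none of those characters exists in a single-byte code page. This is only an illustration under those two assumptions, not a description of what NT actually does internally; the "ascii" codec merely stands in for whatever code page the shell uses.

```python
# Rebuild the bytes of confused.txt as described above: two lines, each one
# "A" followed by 100 copies of "AAAB", each terminated by CR/LF.
line = b"A" + b"AAAB" * 100 + b"\r\n"
data = line * 2

# Assumption: the file is misread as 16-bit little-endian Unicode, so each
# pair of bytes is treated as a single character.
text = data.decode("utf-16-le")

# Assumption: the result is written through a single-byte code page that
# contains none of these characters ("ascii" is a stand-in for it here).
shown = text.encode("ascii", errors="replace")

print(len(data), len(shown))
print(shown[:20])
```

Under these assumptions the 806 input bytes come out as 403 bytes, all of them "?", which matches both the halved copy and the screen output.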
Does anyone know of any Unicode detection heuristics that are currently in 
use by any software packages? This might help me rewrite the program that 
exports the data in a way that won't confuse NT.
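One heuristic that can be tried directly is the Win32 function IsTextUnicode in advapi32.dll, which accepts a flag for a statistical test. Whether NT's "type" or Notepad relies on this particular function is an assumption here, not something established in this thread. The helper name below is made up for illustration, and the function returns None when not run on Windows.

```python
import ctypes
import sys

# Flag value from the Win32 headers: request only the statistical test.
IS_TEXT_UNICODE_STATISTICS = 0x0002


def passes_statistics_test(data: bytes):
    """Ask Win32 IsTextUnicode whether data passes its statistical test.

    Hypothetical helper for experimenting with export files.
    Returns True or False on Windows, None on any other platform.
    """
    if sys.platform != "win32":
        return None
    is_text_unicode = ctypes.windll.advapi32.IsTextUnicode
    is_text_unicode.argtypes = [
        ctypes.c_char_p,
        ctypes.c_int,
        ctypes.POINTER(ctypes.c_int),
    ]
    is_text_unicode.restype = ctypes.c_int
    # The third argument is in/out: tests to run on input, results on output.
    flags = ctypes.c_int(IS_TEXT_UNICODE_STATISTICS)
    return bool(is_text_unicode(data, len(data), ctypes.byref(flags)))


if __name__ == "__main__":
    sample = (b"A" + b"AAAB" * 100 + b"\r\n") * 2
    print(passes_statistics_test(sample))
```

An export program could run a check like this on its own output and alter the data if the check comes back True, though that only helps if the assumption about which function is involved holds.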
Thanks!
Michael Krebs
Michael Everson wrote:
> Does that mean this e-mail confuses MS software?
Seemingly not: Windows NT 4.0 Notepad treats it resolutely as CP1252,
and I don't know why or how.  Conversely, the oddball ASCII file
is treated as UCS-2, and I don't understand that either.
Microsoft folks?
--
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT