"Robert A. Rosenberg" wrote:
> Unless you
> can show where there is a mismatch in the x00-x7F and/or xA0-xFF
> glyphs/characters between ISO-8859-1 and Windows-1252 (or one of the other
> 125x sets and the corresponding 8859 set)
Well, here's a table of correspondences, based on the
latest mappings at the Unicode FTP site:
8859-1.TXT   CP1252.TXT   224 bulls,  0 cows, 27 adds, 32 drops
8859-2.TXT   CP1250.TXT   209 bulls, 15 cows, 27 adds, 32 drops
8859-5.TXT   CP1251.TXT   130 bulls, 94 cows, 31 adds, 32 drops
8859-6.TXT   CP1256.TXT   154 bulls, 25 cows, 77 adds, 32 drops
8859-7.TXT   CP1253.TXT   214 bulls,  4 cows, 21 adds, 32 drops
8859-8.TXT   CP1255.TXT   186 bulls,  0 cows, 47 adds, 34 drops
8859-9.TXT   CP1254.TXT   224 bulls,  0 cows, 25 adds, 32 drops
8859-13.TXT  CP1257.TXT   220 bulls,  4 cows, 20 adds, 32 drops
8859-15.TXT  CP1252.TXT   216 bulls,  8 cows, 27 adds, 32 drops
(CP1258/Vietnamese does not correspond closely to any 8859 charset.)
A "bull" means the same character is present in both charsets at the
same codepoint; a "cow" means that the same character is present
but at different codepoints. An "add" is a character present only in
the Windows code page; a "drop" is one present only in the 8859 charset.
(This terminology comes from the game of Bulls and Cows, a
pencil-and-paper relative of Mastermind.) Cows represent
incompatibilities, so only the pairs with no cows,
8859-1/CP1252/Western Latin, 8859-8/CP1255/Hebrew, and
8859-9/CP1254/Turkish, can safely be treated as interchangeable.
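
To make the categories concrete, here is a minimal sketch that classifies
three bytes of 8859-15 against CP1252. The two hashes are hand-copied
excerpts, not the full mapping tables, and the helper name classify is
made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-copied excerpts (byte => Unicode value); NOT the full tables.
my %iso15  = (0x41 => 0x0041, 0x80 => 0x0080, 0xA4 => 0x20AC);
my %cp1252 = (0x41 => 0x0041, 0x80 => 0x20AC, 0xA4 => 0x00A4);

# Classify the character that $from has at byte $code, relative to $to.
# (Hypothetical helper, for illustration only.)
sub classify {
    my ($from, $to, $code) = @_;
    my $u = $from->{$code};
    return 'undefined' unless defined $u;
    my %where = reverse %$to;    # Unicode value => byte in the other charset
    return 'drop' unless exists $where{$u};
    return $where{$u} == $code ? 'bull' : 'cow';
}

for my $code (sort { $a <=> $b } keys %iso15) {
    printf "0x%02X U+%04X %s\n", $code, $iso15{$code},
        classify(\%iso15, \%cp1252, $code);
}
```

The letter A at 0x41 is a bull, the C1 control at 0x80 is a drop, and the
euro sign (0xA4 in 8859-15, 0x80 in CP1252) is a cow. An "add" is simply a
drop seen from the other direction: calling classify with the two charsets
swapped reports the currency sign at CP1252 0xA4 as a drop.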
I have attached the Perl script that generated these results, for
any interested parties. The "32 drops" in most of the rows are
the C1 control characters (0x80-0x9F), which the 8859 mapping tables
define and the Windows code pages do not.
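
One way to check that the drops really are the C1 range is to list the
dropped bytes rather than just count them. The sketch below does that on toy
tables: an identity table standing in for 8859-1, and a cut-down Windows-style
table with only the euro sign filled in at 0x80. The function name
dropped_codes and both tables are assumptions for illustration, not the real
mapping files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the bytes of $from whose characters occur nowhere in $to.
# (Hypothetical helper, for illustration only.)
sub dropped_codes {
    my ($from, $to) = @_;
    my %present = map { ($_ => 1) } values %$to;
    return grep { !$present{ $from->{$_} } } sort { $a <=> $b } keys %$from;
}

# Toy tables: 8859-1 is the identity on 0x00-0xFF; the Windows-style table
# agrees outside 0x80-0x9F and here defines only the euro sign inside it.
my %iso = map { ($_ => $_) } 0x00 .. 0xFF;
my %win = map { ($_ => $_) } 0x00 .. 0x7F, 0xA0 .. 0xFF;
$win{0x80} = 0x20AC;

my @drops  = dropped_codes(\%iso, \%win);
my $all_c1 = !grep { $_ < 0x80 || $_ > 0x9F } @drops;
printf "%d drops, all in C1 range: %s\n", scalar(@drops), $all_c1 ? 'yes' : 'no';
```

Fed the real tables instead of the toy ones, the same function should also
show which two extra characters account for the 34 drops in the 8859-8 row.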
(Interesting historical fact: the charsets 8859-[1234], which
were the original parts of 8859, have no cows in any pair.
If a character appears in any of those parts, it appears
in the other 3 parts either at the same codepoint or not at all.)
--
Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

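#!/usr/bin/perl
# bulcow: compare two Unicode mapping tables, e.g. 8859-1.TXT and CP1252.TXT
use strict;
use warnings;

die "usage: bulcow charmap1 charmap2\n" unless @ARGV == 2;

# Read a mapping file; return references to the forward map
# (byte => Unicode) and the reverse map (Unicode => byte).
sub read_map {
    my ($file) = @_;
    my (%map, %rmap);
    open(my $fh, '<', $file) || die "bulcow: can't open $file\n";
    while (<$fh>) {
        # Data lines look like "0x41<TAB>0x0041"; skip comments and
        # undefined entries, which do not match.
        my ($code, $ucode) = /^0x(..)\t0x(....)/ or next;
        $map{hex($code)} = hex($ucode);
        $rmap{hex($ucode)} = hex($code);
    }
    close($fh);
    return (\%map, \%rmap);
}

my ($map1, $rmap1) = read_map($ARGV[0]);
my ($map2, $rmap2) = read_map($ARGV[1]);
my ($bulls, $cows, $adds, $drops) = (0, 0, 0, 0);

for my $code (0 .. 255) {
    my $ucode1 = $map1->{$code};
    my $ucode2 = $map2->{$code};
    if (defined($ucode1)) {
        # Guard against an undefined $ucode2 comparing equal to U+0000.
        if (defined($ucode2) && $ucode1 == $ucode2) {
            $bulls++;    # same character at the same codepoint
        }
        elsif (defined($rmap2->{$ucode1})) {
            $cows++;     # same character at a different codepoint
        }
        else {
            $drops++;    # character missing from the second charset
        }
    }
    if (defined($ucode2) && !defined($rmap1->{$ucode2})) {
        $adds++;         # character present only in the second charset
    }
}
print "$ARGV[0]\t$ARGV[1]\t$bulls bulls, $cows cows, $adds adds, $drops drops\n";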