Re: 8859-1, 8859-15, 1252 and Euro

From: John Cowan (jcowan@reutershealth.com)
Date: Thu Feb 10 2000 - 14:39:02 EST


"Robert A. Rosenberg" wrote:

> Unless you
> can show where there is a mismatch in the x00-x7F and/or xA0-xFF
> glyphs/characters between ISO-8859-1 and Windows-1252 (or one of the other
> 125x sets and the corresponding 8859 set)

Well, here's a table of correspondences, based on the
latest mappings at the Unicode FTP site:

8859-1.TXT   CP1252.TXT   224 bulls,  0 cows, 27 adds, 32 drops
8859-2.TXT   CP1250.TXT   209 bulls, 15 cows, 27 adds, 32 drops
8859-5.TXT   CP1251.TXT   130 bulls, 94 cows, 31 adds, 32 drops
8859-6.TXT   CP1256.TXT   154 bulls, 25 cows, 77 adds, 32 drops
8859-7.TXT   CP1253.TXT   214 bulls,  4 cows, 21 adds, 32 drops
8859-8.TXT   CP1255.TXT   186 bulls,  0 cows, 47 adds, 34 drops
8859-9.TXT   CP1254.TXT   224 bulls,  0 cows, 25 adds, 32 drops
8859-13.TXT  CP1257.TXT   220 bulls,  4 cows, 20 adds, 32 drops
8859-15.TXT  CP1252.TXT   216 bulls,  8 cows, 27 adds, 32 drops

(CP1258/Vietnamese does not correspond closely to any 8859 charset.)

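A "bull" means the same character is present in both charsets at the
same codepoint; a "cow" means that the same character is present
but at different codepoints. An "add" is a character that the second
charset has and the first lacks; a "drop" is a character of the first
charset that is absent from the second. (The terminology comes from
the game of Bulls and Cows, the pencil-and-paper relative of
Mastermind.) Cows represent incompatibilities, so it can be seen that
only in the pairs 8859-1/CP1252 (Western Latin), 8859-8/CP1255 (Hebrew),
and 8859-9/CP1254 (Turkish) does no shared character change codepoint;
only those pairs can safely be treated as interchangeable.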
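
To make the four categories concrete, here is a minimal sketch that
classifies two made-up three-entry charsets instead of the real mapping
files. The hashes %cs1 and %cs2 and the function classify() are
hypothetical names chosen for illustration; the counting rules are meant
to follow the attached script, not to replace it.

```perl
use strict;
use warnings;

# Two toy charsets (hypothetical data): codepoint => Unicode value.
my %cs1 = (0x41 => 0x0041,      # LATIN CAPITAL LETTER A
           0xA4 => 0x00A4,      # CURRENCY SIGN
           0x80 => 0x0080);     # a C1 control
my %cs2 = (0x41 => 0x0041,      # LATIN CAPITAL LETTER A
           0x80 => 0x20AC,      # EURO SIGN
           0xA5 => 0x00A4);     # CURRENCY SIGN, at a different codepoint

# Count bulls, cows, adds and drops going from charset 1 to charset 2.
sub classify {
        my ($map1, $map2) = @_;
        my %rmap1 = reverse %$map1;     # Unicode value => codepoint
        my %rmap2 = reverse %$map2;
        my %n = (bulls => 0, cows => 0, adds => 0, drops => 0);
        for my $code (0 .. 255) {
                my ($u1, $u2) = ($map1->{$code}, $map2->{$code});
                if (defined $u1) {
                        if (defined $u2 && $u1 == $u2)  { $n{bulls}++ }
                        elsif (exists $rmap2{$u1})      { $n{cows}++ }
                        else                            { $n{drops}++ }
                        }
                $n{adds}++ if defined $u2 && !exists $rmap1{$u2};
                }
        return \%n;
        }

my $n = classify(\%cs1, \%cs2);
print "$n->{bulls} bulls, $n->{cows} cows, $n->{adds} adds, $n->{drops} drops\n";
# prints "1 bulls, 1 cows, 1 adds, 1 drops"
```

Here A is the bull, the currency sign is the cow, the euro sign is the
add, and the C1 control is the drop.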

I have attached the Perl script that generated these results, for
any interested parties. The "32 drops" in most of the rows
are the C1 control characters (0x80-0x9F), which the 8859 mapping
tables define and the Windows code pages do not contain.

(Interesting historical fact: the charsets 8859-[1234], which
were the original parts of 8859, have no cows in any pair.
If a character appears in any of those parts, it appears
in the other 3 parts either at the same codepoint or not at all.)
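
For anyone who wants to re-check that claim, here is a rough sketch that
counts cows for every pair among the first four parts. It assumes the
mapping tables are saved in the current directory as 8859-1.TXT through
8859-4.TXT (missing files are skipped); parse_map(), count_cows() and
slurp() are hypothetical helper names. The cow test used here matches
the attached script's as long as no Unicode value occurs twice in a table.

```perl
use strict;
use warnings;

# Parse mapping-table text into a hash: codepoint => Unicode value.
sub parse_map {
        my ($text) = @_;
        my %map;
        for (split /\n/, $text) {
                # Comment lines and undefined codepoints do not match.
                my ($code, $ucode) =
                        /^0x([0-9A-Fa-f]{2})\t0x([0-9A-Fa-f]{4})/ or next;
                $map{hex $code} = hex $ucode;
                }
        return \%map;
        }

# Count characters present in both maps but at different codepoints.
sub count_cows {
        my ($map1, $map2) = @_;
        my %rmap2 = reverse %$map2;     # Unicode value => codepoint
        my $cows = 0;
        for my $code (keys %$map1) {
                my $u = $map1->{$code};
                $cows++ if exists $rmap2{$u} && $rmap2{$u} != $code;
                }
        return $cows;
        }

# Read a whole file into one string.
sub slurp {
        my ($file) = @_;
        open(my $fh, '<', $file) || die "can't open $file\n";
        local $/;
        return <$fh>;
        }

# Assumed file names; only the tables actually present are compared.
my @files = grep { -e } map { "8859-$_.TXT" } 1 .. 4;
for my $i (0 .. $#files) {
        for my $j ($i + 1 .. $#files) {
                my $cows = count_cows(parse_map(slurp($files[$i])),
                                      parse_map(slurp($files[$j])));
                print "$files[$i]\t$files[$j]\t$cows cows\n";
                }
        }
```

If the observation above holds for the tables you download, every line
printed should report 0 cows.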

-- 

Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

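#!/usr/bin/perl
# bulcow: compare two Unicode charset mapping tables (e.g. 8859-1.TXT
# and CP1252.TXT) and count bulls, cows, adds and drops.
use strict;
use warnings;

die "usage: bulcow charmap1 charmap2\n" unless @ARGV == 2;
my ($bulls, $cows, $adds, $drops) = (0, 0, 0, 0);

# Read one mapping file; return refs to the forward (code => Unicode)
# and reverse (Unicode => code) maps.
sub read_map {
        my ($file) = @_;
        my (%map, %rmap);
        open(my $fh, '<', $file) || die "bulcow: can't open $file\n";
        while (<$fh>) {
                # Data lines look like "0xA4<TAB>0x00A4<TAB># NAME";
                # comments and undefined codepoints are skipped.
                my ($code, $ucode) = /^0x(..)\t0x(....)/ or next;
                $map{hex($code)} = hex($ucode);
                $rmap{hex($ucode)} = hex($code);
                }
        close($fh);
        return (\%map, \%rmap);
        }

my ($map1, $rmap1) = read_map($ARGV[0]);
my ($map2, $rmap2) = read_map($ARGV[1]);

for my $code (0 .. 255) {
        my $ucode1 = $map1->{$code};
        my $ucode2 = $map2->{$code};
        if (defined($ucode1)) {
                # Check defined($ucode2) so that an undefined entry is
                # never compared as if it were U+0000.
                if (defined($ucode2) && $ucode1 == $ucode2) {
                        $bulls++;       # same character, same codepoint
                        }
                elsif (defined($rmap2->{$ucode1})) {
                        $cows++;        # same character, other codepoint
                        }
                else {
                        $drops++;       # character absent from charmap2
                        }
                }
        if (defined($ucode2) && !defined($rmap1->{$ucode2})) {
                $adds++;                # character absent from charmap1
                }
        }

print "$ARGV[0]\t$ARGV[1]\t$bulls bulls, $cows cows, $adds adds, $drops drops\n";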


