Re: UTF-8: Michael takes the plunge

From: schererm@us.ibm.com
Date: Tue Apr 06 1999 - 09:51:20 EDT

Next message: schererm@us.ibm.com: "Re: UTF-8: Michael takes the plunge"
Previous message: Markus Kuhn: "Re: Character converter"
Maybe in reply to: Constantine Stathopoulos: "UTF-8: Michael takes the plunge"
Next in thread: schererm@us.ibm.com: "Re: UTF-8: Michael takes the plunge"

From Markus to Markus (Kuhn):
Could you please add a branch for characters >= 0x10000 to your Perl script?
Not being a Perl programmer, I am guessing at something like this:

    } elsif ($c < 0x10000) {
        return sprintf("%c%c%c",
                       0xe0 | ($c >> 12),
                       0x80 | (($c >> 6) & 0x3f),
                       0x80 | ($c & 0x3f));
    } elsif ($c < 0x200000) {    # the 4-byte form covers up to 0x1fffff inclusive
        return sprintf("%c%c%c%c",
                       0xf0 | ($c >> 18),
                       0x80 | (($c >> 12) & 0x3f),
                       0x80 | (($c >> 6) & 0x3f),
                       0x80 | ($c & 0x3f));
    } else {
        return utf8(0xfffd);
    }

Of course, purists would also add the remaining two branches, for
values up to 0x3ffffff and up to 0x7fffffff.
I don't think HTML and the related standards limit the number range to 64K.
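
A minimal sketch of what those two extra branches could look like, written
as a table rather than as further elsif arms. It assumes the original
31-bit UTF-8 definition with forms of up to six bytes. The name utf8_full
and the use of pack("C*") in place of sprintf("%c") are choices made for
this illustration and are not part of Markus Kuhn's script; the argument
is assumed to be a non-negative integer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: encode a code point in the 1- to 6-byte UTF-8 forms.
sub utf8_full {
    my ($c) = @_;
    return pack("C", $c) if $c < 0x80;       # 1-byte form (ASCII)

    # [exclusive upper limit, lead-byte marker, number of continuation bytes]
    my @forms = (
        [0x800,      0xc0, 1],
        [0x10000,    0xe0, 2],
        [0x200000,   0xf0, 3],
        [0x4000000,  0xf8, 4],
        [0x80000000, 0xfc, 5],
    );
    for my $f (@forms) {
        my ($limit, $lead, $n) = @$f;
        next if $c >= $limit;
        # Lead byte carries the top bits, each continuation byte six more.
        my @bytes = ($lead | ($c >> (6 * $n)));
        for (my $i = $n - 1; $i >= 0; $i--) {
            push @bytes, 0x80 | (($c >> (6 * $i)) & 0x3f);
        }
        return pack("C*", @bytes);
    }
    return utf8_full(0xfffd);                # out of range: U+FFFD
}

# Show the bytes for the first code point beyond 0xffff.
print join(" ", map { sprintf "%02x", ord } split //, utf8_full(0x10000)), "\n";
# prints: f0 90 80 80
```

The table keeps the limit, lead-byte marker and byte count for each form in
one place, so the 5- and 6-byte cases cost one row each.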

Thanks and bye,

markus

Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
schererm@us.ibm.com
Unicode is here! --> http://www.unicode.org/

Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> on 99-04-06 05:09:47

To: Unicode List <unicode@unicode.org>
Subject: Re: UTF-8: Michael takes the plunge

...

Another alternative is this Perl program, which replaces HTML/SGML
numeric character references with the corresponding UTF-8 sequences and
is well suited to quickly entering UTF-8 test documents:

------------------------------------------------------------------
#!/usr/bin/perl
# Convert HTML numeric character identifiers to UTF-8. M. Kuhn, 1998

sub utf8 ($) {
    my $c = shift(@_);

    if ($c < 0x80) {
        return sprintf("%c", $c);
    } elsif ($c < 0x800) {
        return sprintf("%c%c", 0xc0 | ($c >> 6), 0x80 | ($c & 0x3f));
    } elsif ($c < 0x10000) {
        return sprintf("%c%c%c",
                       0xe0 | ($c >> 12),
                       0x80 | (($c >> 6) & 0x3f),
                       0x80 | ($c & 0x3f));
    } else {
        return utf8(0xfffd);
    }
}

while (<>) {
    while (/&\#[xX]([0-9a-fA-F]+);/) {
        $c = hex($1);
        $utf = utf8($c);
        s/$&/$utf/;
    }
    while (/&\#([0-9]+);/) {
        $utf = utf8($1);
        s/$&/$utf/;
    }
    print;
};
------------------------------------------------------------------
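
The two inner while loops can also be written as one substitution with the
/g and /e modifiers. The following is only a sketch under stated
assumptions, not the author's method: expand_ncrs is a made-up name, and
utf8 is a trimmed copy of the routine from the script, repeated so that the
block runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Trimmed copy of the utf8 routine from the script (forms up to 3 bytes).
sub utf8 {
    my ($c) = @_;
    if ($c < 0x80) {
        return sprintf("%c", $c);
    } elsif ($c < 0x800) {
        return sprintf("%c%c", 0xc0 | ($c >> 6), 0x80 | ($c & 0x3f));
    } elsif ($c < 0x10000) {
        return sprintf("%c%c%c",
                       0xe0 | ($c >> 12),
                       0x80 | (($c >> 6) & 0x3f),
                       0x80 | ($c & 0x3f));
    }
    return utf8(0xfffd);
}

# Hypothetical helper: expand hex and decimal references in a single pass.
sub expand_ncrs {
    my ($line) = @_;
    # $1 is set for &#x...; references, $2 for decimal &#...; references.
    $line =~ s/&\#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/utf8(defined $1 ? hex($1) : $2)/ge;
    return $line;
}

print expand_ncrs("caf&#233; costs 2 &#x20AC;"), "\n";
```

Because each reference is replaced in one left-to-right pass, text produced
by a replacement is not scanned again: the input &#38;#65; comes out as
&#65; rather than being expanded a second time to A, which is what the
looping version would do.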

You can get a Perl interpreter from

http://www.perl.com/pace/pub/perldocs/latest.html

and there is even a Mac version on

http://www.iis.ee.ethz.ch/~neeri/macintosh/perl.html

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:45 EDT