from markus to markus (kuhn) -
could you please add a branch for characters >=0x10000 to your perl script?
without being a perl programmer, i am guessing
  } elsif ($c < 0x10000) {
    return sprintf("%c%c%c",
                   0xe0 |  ($c >> 12),
                   0x80 | (($c >>  6) & 0x3f),
                   0x80 |  ($c        & 0x3f));
  } elsif ($c < 0x200000) {
    return sprintf("%c%c%c%c",
                   0xf0 |  ($c >> 18),
                   0x80 | (($c >> 12) & 0x3f),
                   0x80 | (($c >>  6) & 0x3f),
                   0x80 |  ($c        & 0x3f));
  } else {
    return utf8(0xfffd);
  }
of course, purists would also add the remaining two branches up to
<=0x3ffffff and <=0x7fffffff ...
i don't think the html etc standards limit the number range to 64k.
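For illustration, here is a minimal sketch of an encoder that covers all six sequence lengths, including the two remaining branches mentioned above. The name utf8_long and the table-driven layout are assumptions made for this example and are not part of the original script; it uses pack instead of sprintf, which should produce the same bytes.

```perl
#!/usr/bin/perl
# Sketch only: encode a code point with the original UTF-8 scheme (1 to 6 bytes).
# utf8_long is a hypothetical name, not taken from the script in this thread.
use strict;
use warnings;

sub utf8_long {
    my ($c) = @_;
    return pack("C", $c) if $c < 0x80;   # plain ASCII, one byte

    # Lead-byte marker and exclusive upper limit for 2- to 6-byte sequences.
    my @forms = ([0xc0, 0x800], [0xe0, 0x10000], [0xf0, 0x200000],
                 [0xf8, 0x4000000], [0xfc, 0x80000000]);

    my $trail = 0;                       # number of continuation bytes
    for my $form (@forms) {
        $trail++;
        my ($lead, $limit) = @$form;
        next if $c >= $limit;            # does not fit, try a longer form

        # Lead byte carries the top bits, each trailing byte carries 6 bits.
        my @bytes = ($lead | ($c >> (6 * $trail)));
        for (my $i = $trail - 1; $i >= 0; $i--) {
            push @bytes, 0x80 | (($c >> (6 * $i)) & 0x3f);
        }
        return pack("C*", @bytes);
    }
    return utf8_long(0xfffd);            # out of range: REPLACEMENT CHARACTER
}
```

For example, utf8_long(0x20ac) should yield the three bytes E2 82 AC. Later revisions of UTF-8 (RFC 3629) restrict the range to U+10FFFF, so the 5- and 6-byte forms are mainly of historical interest.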
thanks and bye,
markus
Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
schererm@us.ibm.com
Unicode is here! --> http://www.unicode.org/
Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> on 99-04-06 05:09:47
To: Unicode List <unicode@unicode.org>
Subject: Re: UTF-8: Michael takes the plunge
...
Another alternative is this Perl program that replaces HTML/SGML
numerical character references by the corresponding UTF-8 sequences and
is excellently suited to quickly enter UTF-8 test documents:
------------------------------------------------------------------
#!/usr/bin/perl
# Convert HTML numeric character identifiers to UTF-8. M. Kuhn, 1998
sub utf8 ($) {
    my $c = shift(@_);

    if ($c < 0x80) {
        return sprintf("%c", $c);
    } elsif ($c < 0x800) {
        return sprintf("%c%c", 0xc0 | ($c >> 6), 0x80 | ($c & 0x3f));
    } elsif ($c < 0x10000) {
        return sprintf("%c%c%c",
                       0xe0 |  ($c >> 12),
                       0x80 | (($c >>  6) & 0x3f),
                       0x80 |  ($c        & 0x3f));
    } else {
        return utf8(0xfffd);
    }
}

while (<>) {
    while (/&\#[xX]([0-9a-fA-F]+);/) {
        $c = hex($1);
        $utf = utf8($c);
        s/$&/$utf/;
    }
    while (/&\#([0-9]+);/) {
        $utf = utf8($1);
        s/$&/$utf/;
    }
    print;
};
------------------------------------------------------------------
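As a side note, the same replacement can probably be done in a single pass per pattern with the /ge modifiers, which avoids rescanning the line and interpolating $& as a pattern. The sketch below is an assumption-laden variant, not the script above: the names expand_ncrs and encode_cp are hypothetical, and it relies on the chr and utf8::encode built-ins of newer Perl versions (5.8 and later) rather than on the hand-written encoder.

```perl
#!/usr/bin/perl
# Sketch only: single-pass expansion of numeric character references.
# expand_ncrs and encode_cp are hypothetical names used for illustration.
use strict;
use warnings;

# Encode one code point as UTF-8 bytes; substitute U+FFFD for surrogates
# and for values above U+10FFFF (assumption: modern UTF-8 range limits).
sub encode_cp {
    my ($c) = @_;
    $c = 0xfffd if $c > 0x10ffff || ($c >= 0xd800 && $c <= 0xdfff);
    my $s = chr($c);
    utf8::encode($s);                    # character string -> UTF-8 bytes
    return $s;
}

# Replace every &#xHHHH; and &#DDDD; reference in the given text.
sub expand_ncrs {
    my ($text) = @_;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/encode_cp(hex($1))/ge;
    $text =~ s/&#([0-9]+);/encode_cp($1)/ge;
    return $text;
}
```

A filter built on this sketch would reduce to a loop such as: while (<>) { print expand_ncrs($_); }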
You can get a Perl interpreter from
http://www.perl.com/pace/pub/perldocs/latest.html
and there is even a Mac version on
http://www.iis.ee.ethz.ch/~neeri/macintosh/perl.html
Markus
-- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>