Re: Unicode to UTF-8

From: John Cowan (jcowan@reutershealth.com)
Date: Wed Mar 15 2000 - 17:31:15 EST


Kenneth Whistler wrote:

> Someday I'll write myself a little command line convertor for this --
> I spend way too much time hand converting these little examples
> back and forth!

Oh, very well, here it is:

---cut here---
#!/usr/bin/perl
# This silly script examines its first argument.
# It converts a U+xxxx or U-xxxxxxxx string into UTF-8.
# If the argument doesn't look like that, it's assumed
# to be UTF-8 already (hex bytes run together, e.g. E282AC),
# and is converted to UTF-16 and UTF-32 instead.
# No significant error checking; do not use in production.
#
# John Cowan (cowan@ccil.org) wrote this because Ken Whistler and I got
# tired of doing the job by hand all the time.
# No copyright, no warranty, use as you will.

unless (($_) = @ARGV) {
        die "usage: utf (U+xxxx | U-xxxxxxxx | xxxx...)\n";
        }

if (/^U\+(....)$/) {
        $v = hex($1);
        if ($v < 0x80) {
                printf "%-2.2X\n", $v;
                }
        elsif ($v < 0x800) {    # two-byte forms cover U+0080..U+07FF inclusive
                $lead = 0xc0 + (($v >> 6) & 0x1f);
                $t1 = 0x80 + ($v & 0x3f);
                printf "%-2.2X %-2.2X\n", $lead, $t1;
                }
        else {
                $lead = 0xe0 + (($v >> 12) & 0xf);
                $t1 = 0x80 + (($v >> 6) & 0x3f);
                $t2 = 0x80 + ($v & 0x3f);
                printf "%-2.2X %-2.2X %-2.2X\n", $lead, $t1, $t2;
                }
        }
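elsif (/^U-(........)$/) {
        # Always emits four bytes, so the value is assumed to lie above U+FFFF;
        # anything smaller comes out as an overlong (invalid) sequence.
        $v = hex($1);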
        $lead = 0xf0 + (($v >> 18) & 0x7);
        $t1 = 0x80 + (($v >> 12) & 0x3f);
        $t2 = 0x80 + (($v >> 6) & 0x3f);
        $t3 = 0x80 + ($v & 0x3f);
        printf "%-2.2X %-2.2X %-2.2X %-2.2X\n", $lead, $t1, $t2, $t3;
        }
else {
        if (/^(..)$/) {
                $lead = hex($1);
                printf "U+%-4.4X\n", $lead;
                }
        elsif (/^(..)(..)$/) {
                $lead = hex($1);
                $t1 = hex($2);
                printf "U+%-4.4X\n", (($lead & 0x1f) << 6) + ($t1 & 0x3f);
                }
        elsif (/^(..)(..)(..)$/) {
                $lead = hex($1);
                $t1 = hex($2);
                $t2 = hex($3);
                printf "U+%-4.4X\n", (($lead & 0xf) << 12) +
                        (($t1 & 0x3f) << 6) + ($t2 & 0x3f);
                }
        elsif (/^(..)(..)(..)(..)$/) {
                $lead = hex($1);
                $t1 = hex($2);
                $t2 = hex($3);
                $t3 = hex($4);
                # The four-byte lead (11110xxx) carries three payload bits.
                $v = (($lead & 0x7) << 18) + (($t1 & 0x3f) << 12) +
                        (($t2 & 0x3f) << 6) + ($t3 & 0x3f);
                $s1 = 0xd800 + ((($v - 0x10000) >> 10) & 0x3ff);
                $s2 = 0xdc00 + ($v & 0x3ff);
                printf "U+%-4.4X U+%-4.4X\n", $s1, $s2;
                printf "U-%-8.8X\n", $v;
                }
        else {
                die "eh?\n";
                }
        }
---cut here---
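
For anyone who wants to check the bit arithmetic by hand, here is a minimal sketch of the same shifts and masks written as two subroutines and run on a few boundary values. The names encode_utf8_hex and surrogates are hypothetical helpers made up for this illustration; they do not appear in the script above, and like it they do no real error checking.

```perl
#!/usr/bin/perl
# Sketch only: encode_utf8_hex and surrogates are hypothetical helper names.
use strict;
use warnings;

# Return the UTF-8 form of a code point as space-separated hex bytes.
sub encode_utf8_hex {
    my ($v) = @_;
    my @bytes;
    if ($v < 0x80) {            # 1 byte:  0xxxxxxx
        @bytes = ($v);
    }
    elsif ($v < 0x800) {        # 2 bytes: 110xxxxx 10xxxxxx
        @bytes = (0xc0 | ($v >> 6), 0x80 | ($v & 0x3f));
    }
    elsif ($v < 0x10000) {      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        @bytes = (0xe0 | ($v >> 12),
                  0x80 | (($v >> 6) & 0x3f),
                  0x80 | ($v & 0x3f));
    }
    else {                      # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        @bytes = (0xf0 | (($v >> 18) & 0x7),
                  0x80 | (($v >> 12) & 0x3f),
                  0x80 | (($v >> 6) & 0x3f),
                  0x80 | ($v & 0x3f));
    }
    return join ' ', map { sprintf '%02X', $_ } @bytes;
}

# Return the UTF-16 surrogate pair for a code point above U+FFFF.
sub surrogates {
    my ($v) = @_;
    my $u = $v - 0x10000;       # 20 bits, split into two 10-bit halves
    return (0xd800 + ($u >> 10), 0xdc00 + ($u & 0x3ff));
}

print encode_utf8_hex(0x7ff), "\n";            # DF BF
print encode_utf8_hex(0x20ac), "\n";           # E2 82 AC
print encode_utf8_hex(0x10ffff), "\n";         # F4 8F BF BF
printf "%04X %04X\n", surrogates(0x10ffff);    # DBFF DFFF
```

U+07FF is the last code point that fits in two bytes, and U-0010FFFF is the largest value that uses all three payload bits of the four-byte lead, so these are the cases most likely to expose an off-by-one in a range test or a mask.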

-- 

Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)


