Re: Non-ascii string processing?

From: John Delacour (JD@BD8.COM)
Date: Tue Oct 07 2003 - 08:57:35 CST


At 4:20 am -0700 7/10/03, Peter Kirk wrote:

> Suppose I have a UTF-8 string and want to know
> how many default grapheme clusters it contains.
> How do I do so? Well, I step through the string
> character by character, combining successive
> characters into grapheme clusters. To do this
> without having to decode the UTF-8 myself, I
> need to be able to get at the string character
> by character, and very likely use a loop based
> on the number of characters in the string, e.g.
> the following Basic (horrid language but good
> for making my point here):
>
> For i% = 1 to Len(utf8string$)
> c$ = Mid(utf8string$, i%, 1)
> Process c$
> Next i%
>
> Such a loop would be more efficient in UTF-32
> of course, but this is still a real need for
> working with character counts.

Why use a horrid language when there's a nice one? Here it is in Perl:

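#!/usr/bin/perl
use utf8 ;   # not needed here: the source itself is pure ASCII
use Encode qw(encode_utf8) ;
binmode STDOUT, ":encoding(UTF-8)" ;   # avoids "Wide character" warnings
my $string = "alpha \x{03b1}\ntagspace \x{e0020}" ;
my @utf8chars = split //, $string ;   # one element per character
foreach my $char (@utf8chars) {
        # length in bytes of this character's UTF-8 encoding
        # (unpack "a*" only gives bytes on Perl 5.8; 5.10+ gives characters)
        my $len = length encode_utf8($char) ;
        print "$char\[$len\]" ;
}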

### a[1]l[1]p[1]h[1]a[1] [1]α[2]
### [1]t[1]a[1]g[1]s[1]p[1]a[1]c[1]e[1] [1]󠀠[4]
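
As for the original question, counting default grapheme clusters: Perl's regex engine offers \X, which matches one grapheme cluster, so there is no need to combine characters by hand. The following is only a minimal sketch, assuming a Perl recent enough (5.12 or later) that \X follows the extended grapheme cluster rules. The helper name count_graphemes is made up for illustration and is not from any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative only): count grapheme
# clusters by collecting every \X match in list context and then
# taking the number of matches.
sub count_graphemes {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

binmode STDOUT, ':encoding(UTF-8)';

# "e" followed by U+0301 COMBINING ACUTE ACCENT is two characters
# (code points) but only one grapheme cluster.
my $string = "e\x{0301}a";
printf "%d characters, %d grapheme clusters\n",
    length($string), count_graphemes($string);

### 3 characters, 2 grapheme clusters
```

Note that length() still counts characters, not bytes and not clusters, so the three levels (bytes, characters, grapheme clusters) each need their own way of counting.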


