Re: How does Python Unicode treat surrogates?

From: Rick McGowan (rick@unicode.org)
Date: Mon Jun 25 2001 - 17:12:09 EDT


Marc-Andre Lemburg wrote:

> Do you have references which we could look at
> to determine which of these boundary kinds would actually be
> useful in daily programming ?

There are two things that are really useful in daily programming. One is
to get a "character", whether it's encoded as a surrogate pair or not; the
other is to get a base character and all of its associated combining marks.

It's useful to find the range covered by a "character" at some given
index. That allows the programmer to easily write an increment loop:

        while (index i is valid) {
            c = next_char_at_Index [i] of string s;
            i += lengthOfChar_at_Index [i] of string s;
            // do something with c...
        }

or similar...
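
As a rough Python sketch of that loop (not anyone's actual API): the
helper names utf16_units, char_length_at, char_at and iter_chars are made
up for illustration, and the string is assumed to be held as a list of
UTF-16 code units so that the surrogate case is visible. An unpaired
surrogate is simply treated as a one-unit "character" here.

```python
def utf16_units(text):
    """Hypothetical helper: return the UTF-16 code units of text as ints."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def char_length_at(units, i):
    """Number of code units (1 or 2) covered by the character at index i."""
    if (0xD800 <= units[i] <= 0xDBFF          # high (lead) surrogate
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF):  # low (trail) surrogate
        return 2
    return 1


def char_at(units, i):
    """Code point of the character that starts at index i."""
    if char_length_at(units, i) == 2:
        return 0x10000 + ((units[i] - 0xD800) << 10) + (units[i + 1] - 0xDC00)
    return units[i]


def iter_chars(units):
    """The increment loop: yield one code point per step, pair or not."""
    i = 0
    while i < len(units):
        yield char_at(units, i)
        i += char_length_at(units, i)
```

For example, list(iter_chars(utf16_units("a\U00010400b"))) walks four code
units but yields three code points.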

Also, in a similar vein, it's useful to find the "range" covered by the
combining character sequence or "locale-independent grapheme" at a given
index.
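
A minimal Python sketch of that range lookup, under a simplifying
assumption: a combining character sequence is taken to be one character
followed by any run of characters whose general category is a mark (Mn,
Mc or Me). The function name combining_sequence_end is hypothetical, and
this is only an approximation of the more precise definitions.

```python
import unicodedata


def combining_sequence_end(s, i):
    """Return the index just past the character at i and its trailing marks.

    Assumption: any character whose general category starts with "M"
    counts as a combining mark attached to the preceding character.
    """
    j = i + 1
    while j < len(s) and unicodedata.category(s[j]).startswith("M"):
        j += 1
    return j


def iter_combining_sequences(s):
    """Yield each base character together with its combining marks."""
    i = 0
    while i < len(s):
        end = combining_sequence_end(s, i)
        yield s[i:end]
        i = end
```

The sequence starting at index i then covers s[i:combining_sequence_end(s, i)].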

Please also see the FAQ pages on combining marks, and section 3.3 of
Unicode Technical Report #18:

http://www.unicode.org/unicode/reports/tr18/#Locale-Independent Graphemes

There is now some work going on toward a more precise definition of
such useful chunking units.

I would also take a look at the specifications for NSString and NSText in
Apple's Cocoa environment. Python has some of these operations already
built in, of course.

        Rick


