Re: How does Python Unicode treat surrogates?

From: Gaute B Strokkenes (gs234@cam.ac.uk)
Date: Mon Jun 25 2001 - 08:03:31 EDT


[I'm cc:-ing the unicode list to make sure that I've gotten my
terminology right, and to solicit comments.]

On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> Tim Peters wrote:
>>
>> [M.-A. Lemburg]
>> > ...
>> > 2. What to do when slicing of Unicode strings would break
>> > a surrogate pair ?
>>
>> To me a string is a sequence of characters, and s[0] returns the
>> first, s[1] the second, and so on. The internal details of how the
>> implementation chooses to torture itself <0.7 wink> should be
>> invisible. That is, breaking a surrogate via slicing should be
>> impossible: s[i:j] returns j-i characters, and that's that.
>
> It's not that simple: lone surrogates are true Unicode char points
> in their own right; it's just that they are pretty useless without
> their resp. partners in the data stream. And with this "feature"
> they are in good company: the Unicode combining characters (e.g. the
> combining acute) have the same property.

This is completely and totally wrong. The Unicode standard, version
3.1, states (conformance requirement C12(c)): A conformant process
shall not interpret illegal UTF code unit sequences as characters.

The precise definition of "illegal" in this context is given
elsewhere. See <http://www.unicode.org/unicode/reports/tr17/>:

  0xD800 is incomplete in Unicode. Unless followed by another 16-bit
  value of the right form, it is illegal.

("Unicode" here should read "UTF-16", of course. The reason it does
not is that the language of the technical report has not been updated
to that of 3.1.)
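
A minimal sketch of that rule, assuming a current Python 3 interpreter
whose strict "utf-16-le" codec rejects unpaired surrogates. The helper
name is_legal_utf16 is made up for illustration and is not part of any
library:

```python
def is_legal_utf16(code_units):
    """Return True if the 16-bit code units form well-formed UTF-16.

    Hypothetical helper: it leans on Python 3's strict utf-16-le
    decoder, which raises UnicodeDecodeError on unpaired surrogates.
    """
    # Serialize each 16-bit code unit as two little-endian bytes.
    raw = b"".join(unit.to_bytes(2, "little") for unit in code_units)
    try:
        raw.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True


print(is_legal_utf16([0xD800]))          # lone high surrogate -> False
print(is_legal_utf16([0xD800, 0xDC00]))  # proper surrogate pair -> True
print(is_legal_utf16([0xD800, 0x0041]))  # high surrogate + 'A' -> False
```

Only the second sequence is accepted, because 0xD800 is followed by
a 16-bit value of the right form (a low surrogate in the range
0xDC00..0xDFFF).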

-- 
Big Gaute                               http://www.srcf.ucam.org/~gs234/
Hello?  Enema Bondage?  I'm calling because I want to be happy, I guess..


