Re: How does Python Unicode treat surrogates?

From: Mark Davis (mark@macchiato.com)
Date: Mon Jun 25 2001 - 10:24:28 EDT


You cannot interpret isolated UTF-16 surrogate code units as characters. For
example, you can't interpret the sequence of D800 followed by 0061 as if it
were some private use character (say, Klingon) followed by an 'a'.
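
As a rough illustration of the point above, here is a minimal Python sketch that hands the sequence D800 0061 to a strict UTF-16 decoder. The helper names (decode_utf16_units, is_well_formed) are hypothetical and exist only for this example. It assumes Python 3, whose utf-16-le codec rejects unpaired surrogates by default.

```python
import struct


def decode_utf16_units(units):
    """Strictly decode a sequence of 16-bit code units as UTF-16.

    Hypothetical helper: packs the units as little-endian 16-bit values
    and lets Python's strict utf-16-le codec do the validation.
    """
    raw = struct.pack("<%dH" % len(units), *units)
    return raw.decode("utf-16-le")  # raises UnicodeDecodeError if ill-formed


def is_well_formed(units):
    """Return True if the code units form a well-formed UTF-16 sequence."""
    try:
        decode_utf16_units(units)
        return True
    except UnicodeDecodeError:
        return False


# D800 followed by 0061: the surrogate is unpaired, so this is rejected
# rather than read as "some character" followed by 'a'.
lone = [0xD800, 0x0061]

# D800 DC00 is a complete surrogate pair (U+10000), so this is accepted.
paired = [0xD800, 0xDC00, 0x0061]
```

The decoder does not produce a character for the lone D800. It reports the whole sequence as undecodable.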

(For those unfamiliar with the terminology, see
http://www.unicode.org/glossary, and my paper at
http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/.)

However, you can certainly deal with surrogate code units in storage, and it
is permissible on that level to handle them. For example, most UTF-16 string
interfaces use code unit indices, so that a substring starting at position 3
with length 5 will include precisely 5 code units, however many code points
(or graphemes!) those code units make up. Similarly, for UTF-8 strings the
low-level units are bytes.

In most people's experience, it is best to leave the low-level interfaces
with indices in terms of code units, and then supply some utility routines
that give you information about code points. The most useful are:

- given a string and an index into that string, how many code points are
before it?
- given a string and a number of code points, what is the lowest index that
has that many code points before it?
- given a string and an index into that string, is the index on a code point
boundary?
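
The three routines listed above might be sketched in Python as follows. This is only an illustration, not the ICU4J API: the function names are hypothetical, the string is modelled as a list of 16-bit code units, and an unpaired surrogate is assumed to count as one unit of its own.

```python
def is_high(unit):
    """True for a high (leading) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    """True for a low (trailing) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def _step(units, i):
    """Return the index just past the code point that starts at index i."""
    if is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1]):
        return i + 2  # a complete surrogate pair
    return i + 1      # a BMP code unit, or an unpaired surrogate


def count_code_points(units, index):
    """How many code points start before code unit `index`?"""
    count = 0
    i = 0
    while i < index:
        i = _step(units, i)
        count += 1
    return count


def index_after_code_points(units, n):
    """Lowest code unit index that has n code points before it."""
    i = 0
    for _ in range(n):
        if i >= len(units):
            raise IndexError("string has fewer than n code points")
        i = _step(units, i)
    return i


def is_boundary(units, index):
    """Is `index` on a code point boundary (not inside a surrogate pair)?"""
    if index <= 0 or index >= len(units):
        return True
    return not (is_high(units[index - 1]) and is_low(units[index]))


# 'a', U+10400 (as D801 DC00), 'b'
sample = [0x0061, 0xD801, 0xDC00, 0x0062]
```

With sample as above, index 2 falls between the two halves of the pair, so is_boundary(sample, 2) is false, while indices 1 and 3 are valid boundaries.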

An example for Java is at
http://oss.software.ibm.com/icu4j/doc/com/ibm/text/UTF16.html.

Mark

----- Original Message -----
From: "Gaute B Strokkenes" <gs234@cam.ac.uk>
To: "M.-A. Lemburg" <mal@lemburg.com>
Cc: "Tim Peters" <tim.one@home.com>; <i18n-sig@python.org>;
<unicode@unicode.org>
Sent: Monday, June 25, 2001 05:03
Subject: Re: How does Python Unicode treat surrogates?

>
> [I'm cc:-ing the unicode list to make sure that I've gotten my
> terminology right, and to solicit comments.]
>
> On Mon, 25 Jun 2001, mal@lemburg.com wrote:
> > Tim Peters wrote:
> >>
> >> [M.-A. Lemburg]
> >> > ...
> >> > 2. What to do when slicing of Unicode strings would break
> >> > a surrogate pair ?
> >>
> >> To me a string is a sequence of characters, and s[0] returns the
> >> first, s[1] the second, and so on. The internal details of how the
> >> implementation chooses to torture itself <0.7 wink> should be
> >> invisible. That is, breaking a surrogate via slicing should be
> >> impossible: s[i:j] returns j-i characters, and that's that.
> >
> > It's not that simple: lone surrogates are true Unicode char points
> > in their own right; it's just that they are pretty useless without
> > their resp. partners in the data stream. And with this "feature"
> > they are in good company: the Unicode combining characters (e.g. the
> > combining acute) have the same property.
>
> This is completely and totally wrong. The Unicode standard version
> 3.1 states (conformance requirement C12(c)): A conformant process shall
> not interpret illegal UTF code unit sequences as characters.
>
> The precise definition of "illegal" in this context is given
> elsewhere. See <http://www.unicode.org/unicode/reports/tr17/>:
>
> 0xD800 is incomplete in Unicode. Unless followed by another 16-bit
> value of the right form, it is illegal.
>
> (Unicode here should read UTF-16, of course. The reason it does not
> is that the language of the technical report has not been updated to
> that of 3.1)
>
> --
> Big Gaute http://www.srcf.ucam.org/~gs234/
> Hello? Enema Bondage? I'm calling because I want to be happy, I guess..
>
>


