Re: [I18n-sig] Re: How does Python Unicode treat surrogates?

From: Mark Davis (mark@macchiato.com)
Date: Mon Jun 25 2001 - 14:18:52 EDT


Comments below.

----- Original Message -----
From: "M.-A. Lemburg" <mal@lemburg.com>
To: "Mark Davis" <mark@macchiato.com>
Cc: "Gaute B Strokkenes" <gs234@cam.ac.uk>; "Tim Peters" <tim.one@home.com>;
<i18n-sig@python.org>; <unicode@unicode.org>
Sent: Monday, June 25, 2001 09:46
Subject: Re: [I18n-sig] Re: How does Python Unicode treat surrogates?

[snip]
>
> My question was aimed in a slightly different direction,
> though. I know that UTF-16 does not allow lone surrogates, but
> how does Unicode itself treat these? If I have a sequence of Unicode
> code points which includes an isolated surrogate code point,
> would this be considered a legal Unicode sequence or not?

It is a legal Unicode code point sequence. However, it is not a legal
Unicode *character* sequence, since it contains code points that by
definition cannot be used to represent characters.
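
To make the distinction concrete, here is a minimal sketch. It assumes a
modern Python 3 interpreter (whose str type holds code points, unlike the
UTF-16 storage discussed in this thread); the helper names is_surrogate and
is_legal_character_sequence are hypothetical, chosen only for illustration.

```python
# Sketch only: assumes Python 3, where str is a sequence of code points.
# The helper names below are illustrative, not part of any library.

def is_surrogate(cp):
    """True if cp lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def is_legal_character_sequence(code_points):
    """A code point sequence is a character sequence only if it
    contains no surrogate code points."""
    return not any(is_surrogate(cp) for cp in code_points)

# 'A', an isolated high surrogate, 'B': a legal code point sequence.
seq = [0x0041, 0xD800, 0x0042]
text = "".join(chr(cp) for cp in seq)

# It cannot be serialized as well-formed UTF-8, because the surrogate
# does not represent a character.
try:
    text.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

The string can be built and held in memory as code points, but the strict
UTF-8 codec refuses it, which mirrors the code point vs. character split.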

>
> > However, you can certainly deal with surrogate code units in storage,
> > and it is permissible on that level to handle them. For example, most
> > UTF-16 string interfaces use code unit indices, so that a string from
> > position 3 of length 5 will include precisely 5 code units, not however
> > many code points (or graphemes!) they take up. Similarly for UTF-8
> > strings, the low-level units are bytes.
>
> FYI, Python currently uses UTF-16 as its internal storage format
> and also exposes this through its indexing interfaces. In that
> sense isolated surrogates would be illegal. The codecs which
> convert such Unicode objects to other encodings would raise an
> exception.

> Unicode object constructors, slicing and concatenating
> Unicode objects currently do not apply any checks though.

That is what is typically done, since using code point indices on each
operation is a very significant performance burden.
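
A rough sketch of why, assuming Python 3 and a list of integers standing in
for UTF-16 storage (the helpers utf16_units and code_point_to_unit_index are
hypothetical names, not an actual Python API): a code unit index is a direct
array offset, while a code point index needs a scan from the start.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def code_point_to_unit_index(units, cp_index):
    """Translate a code point index into a code unit index.
    This is a linear scan, which is the cost paid on every
    operation if the interface is indexed by code points."""
    i = 0
    for _ in range(cp_index):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
    return i

# 'a', U+10400 (stored as a surrogate pair), 'b'
units = utf16_units("a\U00010400b")

# Code unit slicing is O(1) per index but can split a pair,
# leaving an isolated high surrogate in the result.
lone = units[1:2]

# Finding the third code point ('b') requires walking the units.
b_index = code_point_to_unit_index(units, 2)
```

Slicing by code units without any check is cheap, which is why an isolated
surrogate such as the one in lone can appear in storage.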

Mark


