Re: How does Python Unicode treat surrogates?

From: DougEwell2@cs.com
Date: Tue Jun 26 2001 - 00:34:40 EDT


In a message dated 2001-06-25 20:19:18 Pacific Daylight Time, gs234@cam.ac.uk
writes:

> (For instance, I
> don't see how it would be possible to encode a sequence of unicode
> scalar values corresponding to a low and a high surrogate; if you
> tried to map this back then you would get a single unicode scalar
> value outside of the BMP). Perhaps someone on the unicode list could
> elaborate?

This is the source of my remaining confusion about definition D29. It
requires UTFs to round-trip all Unicode code points, and by extension all
sequences of code points; yet if you encode the sequence <D800 DC00> in
UTF-16 and decode it again, you don't get that sequence back -- you get the
single code point <10000>.
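
To make the round-trip problem concrete, here is a minimal Python sketch. The
names encode_utf16 and decode_utf16 are hypothetical helpers written for this
illustration, not any library's API, and the encoder is deliberately naive: it
assumes that unpaired surrogate code points are written out as single 16-bit
code units, which is the reading of D29 being questioned here.

```python
def encode_utf16(code_points):
    """Naive UTF-16 encoder: code points -> list of 16-bit code units.

    Assumption for illustration: surrogate code points (D800..DFFF) are
    passed through as single code units rather than rejected.
    """
    units = []
    for cp in code_points:
        if cp > 0xFFFF:
            # Supplementary code point: split into a high/low surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))
            units.append(0xDC00 + (offset & 0x3FF))
        else:
            units.append(cp)
    return units


def decode_utf16(units):
    """UTF-16 decoder: list of 16-bit code units -> code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A high surrogate followed by a low surrogate is always read
            # as one supplementary code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


original = [0xD800, 0xDC00]                      # two surrogate code points
round_tripped = decode_utf16(encode_utf16(original))
```

Under these assumptions, round_tripped is the one-element sequence <10000>
rather than the two-element sequence <D800 DC00>, because the code units
produced for <D800 DC00> are identical to those produced for <10000>.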

The way it was explained to me on this list made it sound as though UTF-16 is
the "master" UTF that other UTFs have to accommodate. That didn't make sense
to me, but I've been trying to cope with it.

Proposed UTFs that are based on UTF-16 code units, and are thus subject to
the same D29 limitations as UTF-16, really annoy me, though.

-Doug Ewell
 Fullerton, California


