Re: How does Python Unicode treat surrogates?

From: DougEwell2@cs.com
Date: Tue Jun 26 2001 - 00:34:40 EDT

Next message: B: "Re: Playing with Unicode (was: Re: UTF-17)"
Previous message: Curtis Clark: "Re: Playing with Unicode (was: Re: UTF-17)"
Maybe in reply to: Gaute B Strokkenes: "Re: How does Python Unicode treat surrogates?"

In a message dated 2001-06-25 20:19:18 Pacific Daylight Time, gs234@cam.ac.uk
writes:

> (For instance, I
> don't see how it would be possible to encode a sequence of unicode
> scalar values corresponding to a low and a high surrogate; if you
> tried to map this back then you would get a single unicode scalar
> value outside of the BMP). Perhaps someone on the unicode list could
> elaborate?

This is the source of my remaining confusion about definition D29. It
requires UTFs to round-trip all Unicode code points, and by extension all
sequences of code points; yet if you use UTF-16 and start with the sequence
<D800 DC00>, you don't end up with that -- you end up with <10000>.
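
The round-trip failure described above can be reproduced with a small sketch. It is not taken from Python's own codec or from the standard's text; utf16_encode and utf16_decode are hypothetical helper names, and the sketch assumes an encoder that passes lone surrogate code points through unchanged, which is the naive behavior the quoted question has in mind.

```python
def utf16_encode(code_points):
    """Map a list of code points to a list of 16-bit code units.

    Assumption: surrogate code points (D800..DFFF) are passed through
    unchanged, each one handled independently of its neighbors.
    """
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)
        else:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low surrogate
    return units


def utf16_decode(units):
    """Map a list of 16-bit code units back to a list of code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A high surrogate followed by a low surrogate is read as one
            # supplementary code point, whatever the encoder's input was.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


original = [0xD800, 0xDC00]
round_tripped = utf16_decode(utf16_encode(original))
print([hex(cp) for cp in original], "->", [hex(cp) for cp in round_tripped])
```

Under these assumptions the two-element input comes back as a single element, because the code units produced for <D800 DC00> are identical to those produced for <10000> and the decoder cannot tell the two apart.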

The way it was explained to me on this list made it sound as though UTF-16 is
the "master" UTF that other UTFs have to accommodate. That didn't make sense
to me, but I've been trying to cope with it.

Proposed UTFs that are based on UTF-16 code units, and are thus subject to
the same D29 limitations as UTF-16, really annoy me, though.

-Doug Ewell
Fullerton, California


This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:19 EDT