Re: How does Python Unicode treat surrogates?

From: J M Sykes (mike.sykes@acm.org)
Date: Mon Jun 25 2001 - 13:38:09 EDT


Mark Davis said:
>
> In most people's experience, it is best to leave the low level interfaces
> with indices in terms of code units, then supply some utility routines that
> tell you information about code points. ...

Anyone on the list interested in the treatment of UCS aka Unicode in
programming languages might like to know that a meeting of ISO/IEC JTC 1/SC
32/WG 3 recently approved a paper that specifies how SQL implementations
should do it.

The proposal can be found at:

ftp://sqlstandards.org/SC32/WG3/Meetings/PER_2001_04_Perth_AUS/per054r1.pdf

The current CD of the next SQL standard (ISO/IEC 9075), as amended by this
proposal (and many others) can be found at:

ftp://sqlstandards.org/SC32/WG3/Progression_Documents/CD/cd1r1-foundation-2001-06.pdf

Briefly, the SQL functions CHARACTER_LENGTH, POSITION (the SQL string
indexing function), and SUBSTRING will all accept a parameter specifying the
units to be used. The alternatives are OCTETS, CODE_UNITS and CHARACTERS
(which to SQL means code points); the default is CHARACTERS.
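
To make the three units concrete, here is a rough Python 3 sketch of how one
string measures differently under each. This is not SQL and not taken from
the proposal: the function name length_in and its encoding parameter are
made up for illustration, and the sketch assumes the stored form is one of
the listed UTF encodings.

```python
# Illustrative only: length_in and its parameters are hypothetical names,
# not part of SQL or of the proposal discussed above.

# Size in octets of one code unit for each supported encoding (assumption:
# the string is stored in one of these forms).
CODE_UNIT_SIZE = {
    "utf-8": 1,
    "utf-16-le": 2,
    "utf-16-be": 2,
    "utf-32-le": 4,
    "utf-32-be": 4,
}


def length_in(text, units="CHARACTERS", encoding="utf-16-le"):
    """Return the length of text in CHARACTERS, CODE_UNITS or OCTETS."""
    if units == "CHARACTERS":
        # A Python 3 str is a sequence of code points.
        return len(text)
    encoded = text.encode(encoding)
    if units == "OCTETS":
        return len(encoded)
    if units == "CODE_UNITS":
        return len(encoded) // CODE_UNIT_SIZE[encoding]
    raise ValueError("unknown units: %r" % (units,))


# 'a' followed by U+10400, which needs a surrogate pair in UTF-16.
sample = "a\U00010400"
print(length_in(sample))                # 2 characters (code points)
print(length_in(sample, "CODE_UNITS"))  # 3 UTF-16 code units
print(length_in(sample, "OCTETS"))      # 6 octets in UTF-16
```

With UTF-8 as the assumed encoding, the same string is still 2 characters
but 5 code units and 5 octets, which is why the units parameter matters.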

This proposal was agreed by major SQL implementors.

Which doesn't mean that it's right, nor that it can't be changed. But that's
how it is at the moment.

Mike.

***********************************************************

J M Sykes Email: Mike.Sykes@acm.org
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire SK8 3SN
UK Tel: (44) 161 437 5413

***********************************************************


