Re: CESU-8 marches on

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Sat Dec 22 2001 - 17:55:56 EST


-----BEGIN PGP SIGNED MESSAGE-----

DougEwell2@cs.com wrote:
[...]

I agree with everything in your post, but I have some additional comments:

> The promoters of CESU-8 say that data in this format already exists in the
> real world, and the purpose in describing it in a UTR is to codify an
> existing de-facto standard.

Note that it is *and has always been* non-compliant to generate the 6-byte
form for supplementary characters. So if any process generated such data,
that was a bug.
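
To make the "6-byte form" concrete, here is a minimal sketch comparing it with the compliant 4-byte UTF-8 form. The function names are hypothetical, and the 6-byte encoder simply encodes each half of the surrogate pair as its own 3-byte sequence, which is my reading of what CESU-8 describes.

```python
def utf8(cp: int) -> bytes:
    """Compliant UTF-8 for a single code point."""
    return chr(cp).encode("utf-8")

def six_byte_form(cp: int) -> bytes:
    """Hypothetical helper: encode a code point the CESU-8 way.

    BMP characters are unchanged; a supplementary character is split into
    a surrogate pair and each surrogate is encoded as a 3-byte sequence.
    """
    if cp < 0x10000:
        return chr(cp).encode("utf-8")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    # "surrogatepass" lets Python emit the bytes a strict encoder refuses.
    return b"".join(chr(s).encode("utf-8", "surrogatepass") for s in (high, low))

def strict_utf8_accepts(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(utf8(0x10000).hex(" "))           # 4 bytes
print(six_byte_form(0x10000).hex(" "))  # 6 bytes
```

A strict UTF-8 decoder rejects the second sequence, which is the sense in which generating it is non-compliant.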

> So my question is: What supplementary characters are currently, TODAY,
> stored in Oracle or PeopleSoft databases that require the creation of a new
> encoding scheme to ensure they can continue to be sorted consistently?

I suspect effectively none. However, even though we don't *expect* such data
to exist, we would still like to guarantee that an existing database stored
using 16-bit code units will not become inconsistent if it does contain any
surrogate codes.
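
The inconsistency in question is a difference in sort order. A small sketch, assuming the database compares strings by 16-bit code unit (the key function name is hypothetical; Python's own string comparison is by code point):

```python
def utf16_code_unit_key(s: str) -> bytes:
    # Big-endian UTF-16 bytes compare the same way as 16-bit code units.
    return s.encode("utf-16-be")

strings = ["\U00010000", "\uE000", "A"]

# Order an index built on 16-bit code units would have.
by_code_unit = sorted(strings, key=utf16_code_unit_key)

# Order by code point (what UTF-8 and UTF-32 binary comparison give).
by_code_point = sorted(strings)

print([hex(ord(s)) for s in by_code_unit])
print([hex(ord(s)) for s in by_code_point])
```

U+10000 is stored as <D800, DC00>, so in code unit order it sorts before U+E000, while in code point order it sorts after. An index built under one order is inconsistent under the other only if it actually contains surrogate codes.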

There are at least three ways to guarantee this:

Option A:
 - Prevent strings with surrogate codes from being added to the database.
 - On the request of an administrator, do the following:
   1. Verify that the database does not contain any surrogate codes.
   2. Obtain a global lock.
   3. Set a flag that switches to using code point order, and enables
      adding strings with supplementary characters.
   4. Release the global lock.
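
A rough sketch of Option A follows, as a toy in-memory model rather than anything a real database does. The class and method names are hypothetical, rows are held as Python strings, and the "global lock" is a single threading.Lock.

```python
import threading

def contains_surrogate_codes(s: str) -> bool:
    """True if s would contain surrogate code units when stored as UTF-16."""
    return any(ord(c) > 0xFFFF or 0xD800 <= ord(c) <= 0xDFFF for c in s)

class ToyDatabase:
    """Hypothetical model of a store that sorts by 16-bit code unit."""

    def __init__(self, existing=()):
        self._lock = threading.Lock()
        self._rows = list(existing)
        self._code_point_order = False

    def add(self, s: str) -> None:
        with self._lock:
            # Before the switch, refuse anything needing surrogate codes.
            if not self._code_point_order and contains_surrogate_codes(s):
                raise ValueError("surrogate codes are not accepted yet")
            self._rows.append(s)

    def switch_to_code_point_order(self) -> None:
        # Step 1: verify. add() cannot introduce surrogate codes while the
        # flag is unset, so a clean result stays valid until the flag is set.
        if any(contains_surrogate_codes(s) for s in self._rows):
            raise RuntimeError("database already contains surrogate codes")
        # Steps 2 to 4: take the lock, set the flag, release the lock.
        with self._lock:
            self._code_point_order = True

    def sorted_rows(self):
        with self._lock:
            if self._code_point_order:
                return sorted(self._rows)  # code point order
            return sorted(
                self._rows,
                key=lambda s: s.encode("utf-16-be", "surrogatepass"),
            )
```

Because no stored string contains surrogate codes at the moment of the switch, the two orders agree on every existing row, so nothing needs to be re-sorted.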

Option B:
 - Treat strings as having a flag specifying 'new' or 'old', where all
   existing strings are old.
 - Tag new strings added to the database as 'new'.
 - Sort characters in the following order:
     U+0000..U+D7FF
     U+D800..U+DFFF in 'old' strings
     U+E000..U+FFFF
     U+10000..U+10FFFF in 'new' strings.
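
One way Option B's ordering could be expressed is as a sort key. This is only a sketch under stated assumptions: each record is a (text, is_new) pair, the function names are hypothetical, and 'new' strings are assumed to contain no unpaired surrogates.

```python
def utf16_units(text: str) -> tuple:
    """The 16-bit code units of text, as a tuple of ints."""
    data = text.encode("utf-16-be", "surrogatepass")
    return tuple(int.from_bytes(data[i:i + 2], "big")
                 for i in range(0, len(data), 2))

def option_b_key(record) -> tuple:
    """Sort key for a (text, is_new) record.

    'old' strings compare by 16-bit code unit, so their surrogates fall
    between U+D7FF and U+E000. 'new' strings compare by code point, so
    their supplementary characters fall after U+FFFF.
    """
    text, is_new = record
    if is_new:
        return tuple(ord(c) for c in text)
    return utf16_units(text)

records = [
    ("\uE000", False),
    ("\U00010000", True),    # 'new': sorts as U+10000
    ("\U00010000", False),   # 'old': sorts as <D800, DC00>
    ("\uD7FF", False),
]
ordered = sorted(records, key=option_b_key)
```

Both kinds of key are tuples of integers on one scale, so 'old' and 'new' strings can be compared with each other directly.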

Option C:
 - Represent supplementary characters internally as three code units:
   <0xFFFF, high_surrogate, low_surrogate>. This will sort in the
   same way as Option B (using the database's existing sorting algorithm)
   provided that there are no instances of U+FFFF followed by a
   non-surrogate, which there should not be.
 - Make sure that U+FFFF never appears outside the database implementation,
   i.e. delete it on export or when passing a string to a stored procedure,
   and add it where necessary when storing strings.
 - Despite slightly changing the representation of supplementary characters,
   this is conformant to Unicode 3.1 because the U+FFFF noncharacter only
   occurs internally.
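
Option C could look something like the sketch below. The function names are hypothetical, and rejecting U+FFFF on input is my assumption about how an implementation would keep the marker unambiguous; the text above only requires that it never appear outside the database.

```python
MARKER = 0xFFFF  # noncharacter used only inside the database

def to_internal(text: str) -> tuple:
    """Convert a string to internal 16-bit code units (store path)."""
    units = []
    for c in text:
        cp = ord(c)
        if cp == MARKER:
            # Assumption of this sketch: refuse U+FFFF in incoming data.
            raise ValueError("U+FFFF is reserved for internal use")
        if cp > 0xFFFF:
            v = cp - 0x10000
            units += [MARKER, 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
        else:
            units.append(cp)
    return tuple(units)

def from_internal(units) -> str:
    """Convert internal code units back to a string (export path)."""
    kept = [u for u in units if u != MARKER]  # delete the marker
    data = b"".join(u.to_bytes(2, "big") for u in kept)
    return data.decode("utf-16-be", "surrogatepass")

# Existing rows keep their bare surrogate pairs; new rows get the marker.
old_row = (0xD800, 0xDC00)
new_row = to_internal("\U00010000")
bmp_row = to_internal("\uE000")
ordered = sorted([new_row, bmp_row, old_row])  # plain code unit comparison
```

The existing code unit comparison then places bare surrogates before U+E000 and marked supplementary characters after it, with no change to the sorting algorithm itself.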

Each of these options has advantages and disadvantages, but they would
all work, they allow for interoperability even between vendors who
choose different options, and none of them inflicts broken UTFs or sort
orders on anything that does not directly work on the database file
format.

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPCUPSzkCAxeYt5gVAQHxrggAnEqKE12ELPdKxgfEM09E2nwpHuAK6s7r
tJQjBm+BqNCgu+NzJ3xZ11ntZpHCDQONqbBsRD81DTsI/HJw6/XgVvstwtW4i06E
uXaD5tq3Xcy98Zm6LfTOlRir80/+wcXSo73BGY1t5gu9HkheqBwET2qOYOsQdGaG
QdUguqdhD3XzPFSz33xc62CdgKKPMhNiCqP+xZ4hUKqUChHcAd1xPBc+UXmi/Woa
DSU9NPbfcNNCNR5BUGOy8oc6ycGHpac7C2kxRqXiqdUWWXm6CoTto38T6J9iCG2q
4YJ43Q99Wh96PdQ+4kRX2WlPLp5HuZ+rzV2V/arFp0nUlKRfKXdq+Q==
=cCfl
-----END PGP SIGNATURE-----


