Re: Handling of Surrogates

From: Asmus Freytag ([email protected])
Date: Thu Apr 16 2009 - 15:04:30 CDT

Next message: Peter Zilahy Ingerman, PhD: "Re: Localizable Sentences Experiment"

Previous message: Sam Mason: "Handling of Surrogates"
In reply to: Sam Mason: "Handling of Surrogates"
Next in thread: Asmus Freytag: "Re: Handling of Surrogates"
Reply: Asmus Freytag: "Re: Handling of Surrogates"
Reply: Mark Davis: "Re: Handling of Surrogates"
Reply: Philippe Verdy: "RE: Handling of Surrogates"
Reply: Sam Mason: "Re: Handling of Surrogates"

On 4/16/2009 12:04 PM, Sam Mason wrote:
> Hi All,
>
> I've got myself in a discussion about the correct handling of surrogate
> pairs. The background is as follows; the Postgres database server[1]
> currently assumes that the SQL it's receiving is in some user specified
> encoding, and it's been proposed that it would be nicer to be able to
> enter Unicode characters directly in the form of escape codes in a
> similar form to Python, i.e. support would be added for:
>
> '\uxxxx'
> and
> '\Uxxxxxxxx'
>
> The currently proposed patch[2] specifically handles surrogate pairs
> in the input. For example '\uD800\uDF02' and '\U00010302' would be
> considered to be valid and identical strings containing exactly one
> character. I was wondering if this should indeed be considered valid or
> if an error should be returned instead.
>
>
As long as there are pairs of the surrogate code points provided as
escape sequences, there's an unambiguous relation between each pair and
a code point in the supplementary planes. So far, so good.
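
To make the pair-to-code-point relation concrete, here is a minimal sketch in Python. The function name combine_surrogates is made up for illustration and is not taken from the proposed patch; the arithmetic is the standard UTF-16 one.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a high/low surrogate pair to its supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD800, 0xDF02)))  # 0x10302
```

So '\uD800\uDF02' and '\U00010302' from the example above name the same character.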

The upside is that the dual escape sequences facilitate conversion
to/from UTF-16. Each code unit in UTF-16 can be processed separately.
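
As a rough illustration of that convenience, the following Python sketch escapes a string one UTF-16 code unit at a time, with no special casing for supplementary characters. The helper name to_u_escapes is hypothetical.

```python
def to_u_escapes(text: str) -> str:
    """Escape every UTF-16 code unit of `text` as a 4-digit \\u sequence."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    # Each UTF-16 code unit is exactly two bytes.
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join(f"\\u{unit:04X}" for unit in units)


print(to_u_escapes("a\U00010302"))  # \u0061\uD800\uDF02
```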

The downside is that you now have two equivalent escape mechanisms, and
you can no longer compare two strings containing escape sequences byte
for byte without first bringing them into a canonical form.

However, if one is allowed to represent the character "a" both as 'a'
and as '\u0061' (which I assume is possible) then there's already a
certain ambiguity built into the escape sequence mechanism.

What should definitely result in an error is to write '\U0000D800',
because the 8-digit form is to be understood as UTF-32, and surrogate
code points are not valid in UTF-32.
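
A check along these lines could look like the Python sketch below. The name parse_long_escape and the error messages are invented for illustration; the only assumptions are that the long form takes exactly eight hex digits and must denote a valid UTF-32 value.

```python
import string


def parse_long_escape(hex_digits: str) -> str:
    """Decode the eight hex digits that follow a backslash and capital U."""
    if len(hex_digits) != 8 or any(c not in string.hexdigits for c in hex_digits):
        raise ValueError("expected exactly eight hex digits")
    value = int(hex_digits, 16)
    if value > 0x10FFFF:
        raise ValueError("beyond the Unicode code space")
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates only have meaning as UTF-16 code units.
        raise ValueError("surrogate code point is not valid in UTF-32")
    return chr(value)


print(parse_long_escape("00010302") == "\U00010302")  # True
```

With this check, parse_long_escape("0000D800") raises ValueError.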

So, in short, if the definition of the escapes is as follows

'\uxxxx' - escape sequence for a UTF-16 code unit

'\Uxxxxxxxx' - escape sequence for a UTF-32 code unit

then everything is fine and predictable. If the definition of the
shorter sequence is instead "a code point on the BMP", then it's not
clear how to handle surrogate pairs.
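
Under the two definitions above, a decoder might be sketched as follows in Python. This is an illustration under stated assumptions, not the proposed Postgres patch: the name decode_escapes is hypothetical, malformed escapes are simply left as literal text, and escaping of the backslash itself is ignored. Decoding to code points also gives the canonical form needed for comparison.

```python
import re

# Four hex digits after a small u, eight after a capital U.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(source: str) -> str:
    out = []
    pending = None  # high surrogate waiting for its low partner
    pos = 0
    for match in _ESCAPE.finditer(source):
        if match.start() > pos:
            if pending is not None:
                raise ValueError("unpaired high surrogate")
            out.append(source[pos:match.start()])
        pos = match.end()
        short, long_ = match.groups()
        value = int(short or long_, 16)
        is_high = 0xD800 <= value <= 0xDBFF
        is_low = 0xDC00 <= value <= 0xDFFF
        if long_ is not None:
            # Long form: a UTF-32 value, so surrogates are never allowed.
            if pending is not None:
                raise ValueError("unpaired high surrogate")
            if is_high or is_low or value > 0x10FFFF:
                raise ValueError("invalid UTF-32 value")
            out.append(chr(value))
        elif pending is not None:
            # Short form: this UTF-16 code unit must complete the pair.
            if not is_low:
                raise ValueError("high surrogate not followed by low surrogate")
            out.append(chr(0x10000 + ((pending - 0xD800) << 10) + (value - 0xDC00)))
            pending = None
        elif is_high:
            pending = value
        elif is_low:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(chr(value))
    if pending is not None:
        raise ValueError("unpaired high surrogate")
    out.append(source[pos:])
    return "".join(out)


# Both spellings decode to the same one-character string.
print(decode_escapes(r"\uD800\uDF02") == decode_escapes(r"\U00010302"))  # True
```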

A./

This archive was generated by hypermail 2.1.5 : Thu Apr 16 2009 - 15:07:15 CDT