From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Apr 16 2009 - 15:41:30 CDT
On 4/16/2009 1:04 PM, Asmus Freytag wrote:
> On 4/16/2009 12:04 PM, Sam Mason wrote:
>> Hi All,
>>
>> I've got myself in a discussion about the correct handling of surrogate
>> pairs. The background is as follows: the Postgres database server[1]
>> currently assumes that the SQL it's receiving is in some user specified
>> encoding, and it's been proposed that it would be nicer to be able to
>> enter Unicode characters directly in the form of escape codes in a
>> similar form to Python, i.e. support would be added for:
>>
>> '\uxxxx'
>> and
>> '\Uxxxxxxxx'
>>
>> The currently proposed patch[2] specifically handles surrogate pairs
>> in the input. For example '\uD800\uDF02' and '\U00010302' would be
>> considered to be valid and identical strings containing exactly one
>> character. I was wondering if this should indeed be considered valid or
>> if an error should be returned instead.
>>
>>
> As long as there are pairs of the surrogate code points provided as
> escape sequences, there's an unambiguous relation between each pair
> and a code point in the supplementary planes. So far, so good.
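To make the pair-to-code-point relation concrete, here is a minimal Python sketch of the standard UTF-16 arithmetic. The function names are made up for illustration and are not taken from the proposed patch.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to its supplementary-plane code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits of the value offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp: int) -> tuple[int, int]:
    """Inverse mapping: supplementary code point to (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {cp:#x}")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

With the example from the original question, combine_surrogates(0xD800, 0xDF02) gives 0x10302, and split_code_point(0x10302) gives the pair back.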
>
> The upside is that the dual escape sequences facilitate conversion
> to/from UTF-16. Each code unit in UTF-16 can be processed separately.
>
> The downside is that you now have two equivalent escape mechanisms,
> and you can no longer compare two strings containing escape sequences
> byte for byte without first bringing them into a canonical form.
>
> However, if one is allowed to represent the character "a" both as 'a'
> and as '\u0061' (which I assume is possible) then there's already a
> certain ambiguity built into the escape sequence mechanism.
>
> What should definitely result in an error is to write '\U0000D800',
> because the 8-digit form is to be understood as UTF-32, and a
> surrogate value is not valid in that context.
>
> So, in short, if the definition of the escapes is as follows
>
> '\uxxxx' - escape sequence for a UTF-16 code point
>
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code point
recte: code unit in both cases.
>
> then everything is fine and predictable. If the definition of the
> shorter sequence is instead, "a code point on the BMP" then it's not
> clear how to handle surrogate pairs.
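As a rough illustration of the two definitions above, here is a Python sketch of a decoder that treats '\uxxxx' as a UTF-16 code unit and '\Uxxxxxxxx' as a UTF-32 code unit. The name decode_escapes is hypothetical, and this says nothing about how the Postgres patch is actually written. Two choices here are my own assumptions: unpaired surrogates are rejected outright, and any text that is not a well-formed escape is passed through unchanged.

```python
import re

# \u followed by exactly 4 hex digits, or \U followed by exactly 8.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text: str) -> str:
    out = []
    pending = None  # high surrogate still waiting for its low partner
    pos = 0
    while pos < len(text):
        m = _ESCAPE.match(text, pos)
        if m is None:
            # Ordinary character (assumption: copied through as-is).
            if pending is not None:
                raise ValueError("unpaired high surrogate")
            out.append(text[pos])
            pos += 1
            continue
        pos = m.end()
        if m.group(2) is not None:
            # 8-digit form: a UTF-32 code unit, so surrogates are invalid.
            if pending is not None:
                raise ValueError("unpaired high surrogate")
            value = int(m.group(2), 16)
            if 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
                raise ValueError(f"invalid UTF-32 code unit: {value:#x}")
            out.append(chr(value))
            continue
        unit = int(m.group(1), 16)  # 4-digit form: a UTF-16 code unit
        if pending is not None:
            if not 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            out.append(chr(0x10000 + ((pending - 0xD800) << 10) + (unit - 0xDC00)))
            pending = None
        elif 0xD800 <= unit <= 0xDBFF:
            pending = unit
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(chr(unit))
    if pending is not None:
        raise ValueError("unpaired high surrogate")
    return "".join(out)
```

Under these rules, r'\uD800\uDF02' and r'\U00010302' decode to the same one-character string, while r'\U0000D800' raises an error.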
>
> A./
>
>
>