Re: Handling of Surrogates

From: Mark Davis ([email protected])
Date: Thu Apr 16 2009 - 16:55:39 CDT

Next message: Philippe Verdy: "RE: Handling of Surrogates"

Previous message: Asmus Freytag: "Re: Handling of Surrogates"
In reply to: Asmus Freytag: "Re: Handling of Surrogates"
Next in thread: Asmus Freytag: "Re: Handling of Surrogates"
Reply: Asmus Freytag: "Re: Handling of Surrogates"
Reply: Peter Constable: "RE: Handling of Surrogates"

I disagree somewhat, if I understand what you wrote. When the \u and \U
conventions are used:

U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061> ( a ) LATIN
SMALL LETTER A could be represented as any of:

   1. 'a'
   2. \u0061
   3. \U00000061

The use of #3 is a waste of space, but should not be illegal (except where \U
is not available). E.g.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\U00000061
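
As a small illustration in a language that has both conventions, the three
spellings can be compared in Python 3 source. This is only a sketch; the
variable names are made up for the example.

```python
# Three spellings of U+0061 LATIN SMALL LETTER A as Python 3 string literals.
plain = 'a'
short_escape = '\u0061'
long_escape = '\U00000061'

# All three literals denote the same one-character string.
same = (plain == short_escape == long_escape)
print(same, len(long_escape))
```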

U+1D41A <http://unicode.org/cldr/utility/character.jsp?a=1D41A> ( 𝐚 )
MATHEMATICAL BOLD SMALL A could be represented as any of:

   1. '𝐚'
   2. \uD835\uDC1A
   3. \U0000D835\U0000DC1A
   4. \U0001D41A

Similarly, #3 is a waste of space, but should not be illegal. #2 and #3 are
discouraged where \U is available or UTF-16 is not used, but #2 is necessary
where \U is not available (e.g. Java). [Myself, I like \x{...} escaping
better, since it is more uniform. Having a terminator allows variable
length.]
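
A minimal sketch of a lenient reader along these lines, in Python 3. The
helper name decode_escapes and the regular expression are assumptions made
for the example, not part of any proposed patch. It expands both escape
forms and then joins surrogate pairs, so forms #1 to #4 above all come out
as the same single character.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def decode_escapes(text):
    # Hypothetical helper: expand each escape to the code point it names.
    # Surrogate values are accepted here whichever escape form spelled them.
    def expand(match):
        return chr(int(match.group(1) or match.group(2), 16))

    expanded = _ESCAPE.sub(expand, text)
    # Round-trip through UTF-16 so that a high+low surrogate pair becomes
    # one supplementary code point. An unpaired surrogate makes the decode
    # step raise UnicodeDecodeError.
    return expanded.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')

forms = ['𝐚', r'\uD835\uDC1A', r'\U0000D835\U0000DC1A', r'\U0001D41A']
results = {decode_escapes(form) for form in forms}
```

Here results ends up holding just one string, U+1D41A. Comparing escaped
strings only works after this kind of expansion, since the raw spellings
differ.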

Mark

On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <[email protected]> wrote:

> On 4/16/2009 12:04 PM, Sam Mason wrote:
>
>> Hi All,
>>
>> I've got myself in a discussion about the correct handling of surrogate
>> pairs. The background is as follows; the Postgres database server[1]
>> currently assumes that the SQL it's receiving is in some user specified
>> encoding, and it's been proposed that it would be nicer to be able to
>> enter Unicode characters directly in the form of escape codes in a
>> similar form to Python, i.e. support would be added for:
>>
>> '\uxxxx'
>> and
>> '\Uxxxxxxxx'
>>
>> The currently proposed patch[2] specifically handles surrogate pairs
>> in the input. For example '\uD800\uDF02' and '\U00010302' would be
>> considered to be valid and identical strings containing exactly one
>> character. I was wondering if this should indeed be considered valid or
>> if an error should be returned instead.
>>
>>
>>
> As long as there are pairs of the surrogate code points provided as escape
> sequences, there's an unambiguous relation between each pair and a code
> point in the supplementary planes. So far, so good.
>
> The upside is that the dual escape sequences facilitate conversion to/from
> UTF-16. Each code unit in UTF-16 can be processed separately.
>
> The downside is that you now have two equivalent escape mechanisms, and you
> can no longer take a string with escape sequences and binarily compare it
> without bringing it into a canonical form.
>
> However, if one is allowed to represent the character "a" both as 'a' and
> as '\u0061' (which I assume is possible) then there's already a certain
> ambiguity built into the escape sequence mechanism.
>
> What should definitely result in an error is to write '\U0000D800' because
> the 8-byte form is to be understood as UTF-32, and in that context there
> would be an issue.
>
> So, in short, if the definition of the escapes is as follows
>
> '\uxxxx' - escape sequence for a UTF-16 code point
>
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code point
>
> then everything is fine and predictable. If the definition of the shorter
> sequence is instead "a code point on the BMP", then it's not clear how to
> handle surrogate pairs.
>
> A./
>
>
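
For contrast, the stricter reading in the quoted message (\u is a UTF-16 code
unit, \U is a UTF-32 value) might be sketched as follows in Python 3. The
name decode_strict and the error message are assumptions for illustration
only. Under this reading '\uD800\uDF02' and '\U00010302' are the same
string, while '\U0000D800' is rejected.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def decode_strict(text):
    # Hypothetical helper implementing the stricter definition.
    def expand(match):
        if match.group(1) is not None:
            # \u form: a UTF-16 code unit, which may be a surrogate.
            return chr(int(match.group(1), 16))
        # \U form: must be a code point that is valid in UTF-32.
        value = int(match.group(2), 16)
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError('invalid value for \\U escape: %08X' % value)
        return chr(value)

    expanded = _ESCAPE.sub(expand, text)
    # Join surrogate pairs written with \u; an unpaired surrogate makes the
    # decode step raise UnicodeDecodeError.
    return expanded.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')

pair = decode_strict(r'\uD800\uDF02')
single = decode_strict(r'\U00010302')
```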


This archive was generated by hypermail 2.1.5 : Thu Apr 16 2009 - 16:57:51 CDT