From: Mark Davis (mark.edward.davis@gmail.com)
Date: Thu Apr 16 2009 - 16:55:39 CDT
I disagree somewhat, if I understand what you wrote. When the \u and \U
conventions are used:
U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061> ( a ) LATIN
SMALL LETTER A could be represented as any of:
1. 'a'
2. \u0061
3. \U00000061
The use of #3 is a waste of space, but should not be illegal (except where \U
is not available). E.g.:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\U00000061
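
As a quick illustration, here is a minimal sketch in Python, whose string
literals happen to accept both conventions. The names `forms` and `raw` are
only for this example. It shows that the three spellings denote the same
one-character string, and how much source text each escape costs:

```python
# Three spellings of U+0061 as Python string literals.
forms = ['a', '\u0061', '\U00000061']

# All three literals denote the same single-character string.
assert all(f == 'a' for f in forms)
assert [len(f) for f in forms] == [1, 1, 1]

# Raw strings keep the escape text itself, so we can measure its length
# in the source: 1, 6, and 10 characters respectively.
raw = [r'a', r'\u0061', r'\U00000061']
print([len(r) for r in raw])
```
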
U+1D41A <http://unicode.org/cldr/utility/character.jsp?a=1D41A> ( 𝐚 )
MATHEMATICAL BOLD SMALL A could be represented as any of:
1. '𝐚'
2. \uD835\uDC1A
3. \U0000D835\U0000DC1A
4. \U0001D41A
Similarly, #3 is a waste of space, but should not be illegal. #2 and #3 are
discouraged where \U is available or UTF-16 is not used, but #2 is necessary
where \U is not available (e.g. Java). [Myself, I like \x{...} escaping
better, since it is more uniform. Having a terminator allows variable
length.]
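
To make the equivalence concrete, here is a rough sketch of a lenient decoder
in Python. It is not the Postgres patch or any existing library routine; the
function name `decode_escapes` and the `strict_big_u` flag are hypothetical.
It collects the values of \uXXXX and \UXXXXXXXX escapes, then joins a high
surrogate followed by a low surrogate into one supplementary code point. It
assumes no escaped backslashes in the input and does not handle \x{...}.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')


def decode_escapes(text, strict_big_u=False):
    # With strict_big_u=True, a surrogate value written with \U is
    # rejected; by default it is accepted and paired like \u.
    values = []  # code points or UTF-16 code units, in input order
    pos = 0
    for m in _ESCAPE.finditer(text):
        values.extend(ord(c) for c in text[pos:m.start()])
        value = int(m.group(1) or m.group(2), 16)
        if value > 0x10FFFF:
            raise ValueError('value beyond U+10FFFF: ' + m.group(0))
        if strict_big_u and m.group(2) and 0xD800 <= value <= 0xDFFF:
            raise ValueError('surrogate written with \\U: ' + m.group(0))
        values.append(value)
        pos = m.end()
    values.extend(ord(c) for c in text[pos:])

    out = []
    i = 0
    while i < len(values):
        v = values[i]
        nxt = values[i + 1] if i + 1 < len(values) else None
        if 0xD800 <= v <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Standard UTF-16 arithmetic: 10 bits from each surrogate.
            out.append(chr(0x10000 + ((v - 0xD800) << 10) + (nxt - 0xDC00)))
            i += 2
        elif 0xD800 <= v <= 0xDFFF:
            raise ValueError('unpaired surrogate: U+%04X' % v)
        else:
            out.append(chr(v))
            i += 1
    return ''.join(out)


# Forms #2, #3, and #4 above all decode to U+1D41A.
for form in (r'\uD835\uDC1A', r'\U0000D835\U0000DC1A', r'\U0001D41A'):
    assert decode_escapes(form) == '\U0001D41A'
```

Passing strict_big_u=True makes form #3 an error instead, which is the
stricter reading in the message quoted below.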
Mark
On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <asmusf@ix.netcom.com> wrote:
> On 4/16/2009 12:04 PM, Sam Mason wrote:
>
>> Hi All,
>>
>> I've got myself in a discussion about the correct handling of surrogate
>> pairs. The background is as follows; the Postgres database server[1]
>> currently assumes that the SQL it's receiving is in some user specified
>> encoding, and it's been proposed that it would be nicer to be able to
>> enter Unicode characters directly in the form of escape codes in a
>> similar form to Python, i.e. support would be added for:
>>
>> '\uxxxx'
>> and
>> '\Uxxxxxxxx'
>>
>> The currently proposed patch[2] specifically handles surrogate pairs
>> in the input. For example '\uD800\uDF02' and '\U00010302' would be
>> considered to be valid and identical strings containing exactly one
>> character. I was wondering if this should indeed be considered valid or
>> if an error should be returned instead.
>>
>>
>>
> As long as there are pairs of the surrogate code points provided as escape
> sequences, there's an unambiguous relation between each pair and a code
> point in the supplementary planes. So far, so good.
>
> The upside is that the dual escape sequences facilitate conversion to/from
> UTF-16. Each code unit in UTF-16 can be processed separately.
>
> The downside is that you now have two equivalent escape mechanisms, and you
> can no longer take a string with escape sequences and binarily compare it
> without bringing it into a canonical form.
>
> However, if one is allowed to represent the character "a" both as 'a' and
> as '\u0061' (which I assume is possible) then there's already a certain
> ambiguity built into the escape sequence mechanism.
>
> What should definitely result in an error is to write '\U0000D800' because
> the 8-digit form is to be understood as UTF-32, and in that context there
> would be an issue.
>
> So, in short, if the definition of the escapes is as follows
>
> '\uxxxx' - escape sequence for a UTF-16 code unit
>
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code point
>
> then everything is fine and predictable. If the definition of the shorter
> sequence is, instead, "a code point on the BMP" then it's not clear how to
> handle surrogate pairs.
>
> A./
>
>