From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Apr 16 2009 - 17:42:48 CDT
On 4/16/2009 2:55 PM, Mark Davis wrote:
> I disagree somewhat, if I understand what you wrote.
I think that you misunderstood what I wrote.
> When the \u and \U conventions are used:
>
> U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061> ( a )
> LATIN SMALL LETTER A could be represented as any of:
>
> 1. 'a'
> 2. \u0061
> 3. \U00000061
>
> The use of #3 is a waste of space, but should not be illegal (except
> where \U is not available).
I agree completely so far.
> U+1D41A <http://unicode.org/cldr/utility/character.jsp?a=1D41A>
> ( 𝐚 ) MATHEMATICAL BOLD SMALL A could be represented as any of:
>
> 1. '𝐚'
> 2. \uD835\uDC1A
> 3. \U0000D835\U0000DC1A
> 4. \U0001D41A
>
> Similarly #3 is a waste of space, but should not be illegal. #2 and #3
> are discouraged where \U is available or UTF-16 is not used, but #2 is
> necessary where \U is not available (eg Java). [Myself, I like \x{...}
> escaping better, since it is more uniform. Having a terminator allows
> variable length.]
OK. Here's where I think it matters how the escapes are defined.
If you use the definition
'\uxxxx'     - escape sequence for a UTF-16 code unit
'\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
then everything is well-defined. Examples 1, 2, and 4 in your second set
are clearly legal, and example 3 is clearly not equivalent. Note that the
lack of equivalence follows from the definition of UTF-32, just as the
equivalence between examples 2 and 3 in the *first* set follows from the
definitions of UTF-32 and UTF-16.
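Under this strict reading, a decoder is easy to state. Here is a minimal
sketch in Python (a hypothetical illustration of mine, not the proposed
Postgres patch): '\u' escapes are collected as UTF-16 code units with
surrogate pairs combined, while each '\U' escape must itself be a valid
UTF-32 code unit, so '\U0000D835' is rejected outright.

```python
import re

# Strict reading: \u is always a UTF-16 code unit, \U is always a
# UTF-32 code unit. (Sketch only; names and structure are mine.)
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def decode_escapes(s):
    out = []
    pending_high = None  # a high surrogate waiting for its low half
    pos = 0
    while pos < len(s):
        m = _ESCAPE.match(s, pos)
        if m is None:
            if pending_high is not None:
                raise ValueError("unpaired high surrogate")
            out.append(s[pos])
            pos += 1
            continue
        pos = m.end()
        if m.group(1) is not None:
            # \uXXXX: a UTF-16 code unit; surrogates must come in pairs.
            unit = int(m.group(1), 16)
            if pending_high is not None:
                if not 0xDC00 <= unit <= 0xDFFF:
                    raise ValueError("high surrogate not followed by low")
                cp = 0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                out.append(chr(cp))
                pending_high = None
            elif 0xD800 <= unit <= 0xDBFF:
                pending_high = unit
            elif 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("unpaired low surrogate")
            else:
                out.append(chr(unit))
        else:
            # \UXXXXXXXX: a UTF-32 code unit; by definition it cannot be
            # a surrogate code point, so '\U0000D835' is rejected here.
            cp = int(m.group(2), 16)
            if pending_high is not None:
                raise ValueError("high surrogate not followed by low")
            if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                raise ValueError("not a valid UTF-32 code unit: %08X" % cp)
            out.append(chr(cp))
    if pending_high is not None:
        raise ValueError("unpaired high surrogate")
    return ''.join(out)
```

With these rules, r'\uD835\uDC1A' and r'\U0001D41A' both decode to the
single character U+1D41A, while r'\U0000D835\U0000DC1A' raises an error.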
How would you rigorously define these two styles of escapes, so that
example #3 (second set) becomes legal? You would have to do something
complicated like
'\uxxxx'     - escape sequence for a UTF-16 code unit
'\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
               if xxxxxxxx >= 0x10000, but escape
               sequence for a UTF-16 code unit
               if xxxxxxxx < 0x10000.
To me, that seems unnecessarily convoluted.
Further, you create the problem that illegal UTF-32 can get converted to
legal UTF-32.
Here's how: Client 1 starts out with illegal UTF-32 containing the
sequence <0000D835, 0000DC1A>. Assume this gets turned into the escapes
"\U0000D835\U0000DC1A" and sent to the server. Client 2 receives this
escaped sequence and interprets it as the single character sequence
<0001D41A>. Had client 1 sent the UTF-32 string to client 2 directly,
client 2 would have been able to reject it as illegal UTF-32.
However, now we have client 3, which works in UTF-16 and has data of the
form <D835, DC1A>. Under your scheme, client 3 has a choice: it can send
any one of these four escaped forms containing surrogates:
"\uD835\U0000DC1A"
"\U0000D835\uDC1A"
"\U0000D835\U0000DC1A"
or
"\uD835\uDC1A"
To the server, the third sequence of escapes is indistinguishable from
what client 1 produced starting from an illegal UTF-32 sequence.
You now have introduced into your distributed application a way to
convert illegal UTF-32 sequences silently to legal UTF-32 sequences.
From a security point of view, that would give me pause.
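The failure mode can be played out concretely. A hypothetical sketch in
Python (the clients and the permissive pairing rule are my illustration
of the scheme being criticized, not code from any real implementation):

```python
import struct

# Client 1's ill-formed UTF-32: the surrogate code points U+D835 and
# U+DC1A encoded as two 32-bit code units (not legal UTF-32).
raw = struct.pack('<2I', 0xD835, 0xDC1A)

# Sent directly, a conforming UTF-32 decoder rejects the data.
try:
    raw.decode('utf-32-le')
    direct_ok = True
except UnicodeDecodeError:
    direct_ok = False

# Escaped without validation, each 32-bit unit becomes \UXXXXXXXX ...
escaped = ''.join('\\U%08X' % u for u in struct.unpack('<2I', raw))

# ... and a receiver applying the permissive reading (\U values below
# 0x10000 treated as UTF-16 code units and paired up) silently turns
# the illegal sequence into the single legal character U+1D41A.
units = [int(escaped[i + 2:i + 10], 16) for i in range(0, len(escaped), 10)]
high, low = units
laundered = chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

print('direct UTF-32 accepted:', direct_ok)      # False
print(escaped, '->', 'U+%04X' % ord(laundered))  # ... -> U+1D41A
```

The illegal input is rejected on the direct path but accepted, and
silently repaired, on the escaped path; that asymmetry is the problem.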
A./
>
> Mark
>
>
> On Thu, Apr 16, 2009 at 13:04, Asmus Freytag
> <asmusf@ix.netcom.com> wrote:
>
> On 4/16/2009 12:04 PM, Sam Mason wrote:
>
> Hi All,
>
> I've got myself in a discussion about the correct handling of
> surrogate pairs. The background is as follows: the Postgres
> database server[1] currently assumes that the SQL it's receiving
> is in some user-specified encoding, and it's been proposed that
> it would be nicer to be able to enter Unicode characters directly
> in the form of escape codes in a similar form to Python, i.e.
> support would be added for:
>
> '\uxxxx'
> and
> '\Uxxxxxxxx'
>
> The currently proposed patch[2] specifically handles surrogate
> pairs in the input. For example '\uD800\uDF02' and '\U00010302'
> would be considered to be valid and identical strings containing
> exactly one character. I was wondering if this should indeed be
> considered valid or if an error should be returned instead.
>
>
>
> As long as there are pairs of the surrogate code points provided
> as escape sequences, there's an unambiguous relation between each
> pair and a code point in the supplementary planes. So far, so good.
>
> The upside is that the dual escape sequences facilitate conversion
> to/from UTF-16. Each code unit in UTF-16 can be processed separately.
>
> The downside is that you now have two equivalent escape
> mechanisms, and you can no longer take a string containing escape
> sequences and compare it byte-for-byte without first bringing it
> into a canonical form.
>
> However, if one is allowed to represent the character "a" both as
> 'a' and as '\u0061' (which I assume is possible) then there's
> already a certain ambiguity built into the escape sequence mechanism.
>
> What should definitely result in an error is to write '\U0000D800',
> because the 8-digit form is to be understood as UTF-32, and in that
> context there would be an issue.
>
> So, in short, if the definition of the escapes is as follows
>
> '\uxxxx' - escape sequence for a UTF-16 code unit
>
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
>
> then everything is fine and predictable. If the definition of the
> shorter sequence is, instead, "a code point on the BMP", then it's
> not clear how to handle surrogate pairs.
>
> A./
>
>
This archive was generated by hypermail 2.1.5 : Thu Apr 16 2009 - 17:45:36 CDT