From: Mark Davis (mark.edward.davis@gmail.com)
Date: Thu Apr 16 2009 - 23:32:51 CDT
It only implies it if it was spec'ed to imply it, which it isn't, at least
in the cases I'm familiar with.
Mark
On Thu, Apr 16, 2009 at 20:56, Doug Ewell <doug@ewellic.org> wrote:
> I have to agree with Asmus on this. Even if the \Uxxxxxxxx notation was
> originally created to get around the four-hex-digit limit of \uxxxx, it does
> imply a 32-bit value. Writing \U0000D835\U0000DC1A would strongly imply
> that two characters are being represented, not one. With this extended
> notation, there should be no reason to fall back to UTF-16 code units.
>
> --
> Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
> http://www.ewellic.org
> http://www1.ietf.org/html.charters/ltru-charter.html
> http://www.alvestrand.no/mailman/listinfo/ietf-languages
>
> ----- Original Message -----
> From: Mark Davis
> To: Asmus Freytag
> Cc: Sam Mason ; unicode@unicode.org
> Sent: Thursday, April 16, 2009 17:42
> Subject: Re: Handling of Surrogates
>
>
> > If you use the definition
> > '\uxxxx' - escape sequence for a UTF-16 code unit
> > '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
>
> You'll have to stop there, because that isn't the definition typically (or
> certainly not universally) used. In practice, these conventions arose as an
> adaptation as we went from UCS-2 to UTF-16:
>
>
> \u means the code point and the equivalent code unit (for these cases they
> have identical numeric values).
>
> \U means the code point and the equivalent code unit if <= FFFF (for these
> cases they have identical numeric values),
> and otherwise a code point and the equivalent two paired code units.
>
>
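> To make this concrete, here is a minimal Python sketch of a decoder
> under this convention (the name "unescape" and the exact error
> behavior are illustrative assumptions, not any particular
> implementation):
>
>   import re
>
>   _ESC = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')
>
>   def unescape(s):
>       # Each escape carries a 16-bit code unit if its value is <= FFFF,
>       # otherwise a supplementary code point. Surrogate-valued escapes
>       # must pair up, regardless of which spelling produced them.
>       out, hi, pos = [], None, 0
>       for m in _ESC.finditer(s):
>           if hi is not None and m.start() != pos:
>               raise ValueError('unpaired high surrogate')
>           out.append(s[pos:m.start()])
>           pos = m.end()
>           v = int(m.group(1) or m.group(2), 16)
>           if hi is not None:
>               if not 0xDC00 <= v <= 0xDFFF:
>                   raise ValueError('unpaired high surrogate')
>               out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (v - 0xDC00)))
>               hi = None
>           elif 0xD800 <= v <= 0xDBFF:
>               hi = v                     # wait for the low surrogate
>           elif 0xDC00 <= v <= 0xDFFF:
>               raise ValueError('unpaired low surrogate')
>           else:
>               out.append(chr(v))         # BMP or supplementary code point
>       if hi is not None:
>           raise ValueError('unpaired high surrogate')
>       return ''.join(out) + s[pos:]
>
> Under this reading, \uD835\uDC1A, \U0000D835\U0000DC1A, and \U0001D41A
> all decode to the same single character, U+1D41A.
>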
> While it would be possible to restrict the recognized escapes, best
> practice for interoperability is to accept all 7. When generating, as I
> said, it would be cleaner to not do A.3 or B.3, and to do B.2 if and only if
> B.4 is unavailable.
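>
> A generating side that follows this preference might look like the
> following sketch ("supports_cap_u" is a hypothetical flag for whether
> the target syntax has \U at all):
>
>   def escape_cp(cp, supports_cap_u=True):
>       # prefer the direct forms (A.2/B.4); emit the surrogate
>       # pair (B.2) only when \UXXXXXXXX is unavailable (e.g. Java)
>       if cp <= 0xFFFF:
>           return '\\u%04X' % cp
>       if supports_cap_u:
>           return '\\U%08X' % cp
>       hi = 0xD800 + ((cp - 0x10000) >> 10)
>       lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
>       return '\\u%04X\\u%04X' % (hi, lo)
>
>   # escape_cp(0x1D41A)        -> r'\U0001D41A'   (B.4)
>   # escape_cp(0x1D41A, False) -> r'\uD835\uDC1A' (B.2)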
>
> > Further, you create the problem that illegal UTF-32 can get converted to
> legal UTF-32.
>
> These conventions are designed for and typically used for UTF-16 literal
> text. And if they are used with other UTFs, they should be interpreted as
> representing what they would be in UTF-16. That is, each of the 7 formats I
> listed would have the corresponding meaning.
>
> > You now have introduced into your distributed application a way to
> convert illegal UTF-32 sequences silently to legal UTF-32 sequences. From a
> security point of view, that would give me pause.
>
> First off, essentially nobody uses UTF-32 for interchange, so your example
> would have been better as UTF-8 (you can do the same example). Secondly,
> yes, this escaping format is based on UTF-16, and thus has some history
> behind it. But it doesn't present any significant problem. You can get a
> well-formed result from an ill-formed source if:
>
> a. If you convert ill-formed UTF-32 or UTF-8 to the escaped form without
>    checking for ill-formed source, OR
> b. If you convert ill-formed UTF-32 or UTF-8 to UTF-16 without checking
>    for ill-formed source.
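>
> A toy illustration of (a), assuming UTF-32LE input:
>
>   import struct
>
>   # ill-formed UTF-32LE: two surrogate code points in sequence
>   bad = struct.pack('<2I', 0xD835, 0xDC1A)
>
>   # escaping each 32-bit unit without validating it first...
>   escaped = ''.join('\\U%08X' % u for u in struct.unpack('<2I', bad))
>   # escaped == r'\U0000D835\U0000DC1A', which then decodes, under the
>   # UTF-16 reading above, to the well-formed single character U+1D41A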
>
> Of course, if I also do a conversion from UTF-32 where I replace surrogate
> code points by FFFD, I also get a valid result. The key problem for security
> is where I can sneak harmful characters past a gatekeeper. Very few servers
> use surrogate characters (or FFFD) as syntax characters ;-)
>
>
> And yes, I did forget B.3a and B.3b, which are also possible.
>
> a. \U0000D835\uDC1A
> b. \uD835\U0000DC1A
>
> Ugly, but the meaning is well-defined.
>
> Mark
>
>
>
> On Thu, Apr 16, 2009 at 15:42, Asmus Freytag <asmusf@ix.netcom.com>
> wrote:
>
> On 4/16/2009 2:55 PM, Mark Davis wrote:
>
> I disagree somewhat, if I understand what you wrote.
>
> I think that you misunderstood what I wrote.
>
> When the \u and \U conventions are used:
>
>
> U+0061 <http://unicode.org/cldr/utility/character.jsp?a=0061> ( a )
> LATIN SMALL LETTER A could be represented as any of:
>
> 1. 'a'
> 2. \u0061
> 3. \U00000061
>
> The use of #3 is a waste of space, but should not be illegal (except
> where \U is not available).
>
> I agree completely so far.
>
> U+1D41A <http://unicode.org/cldr/utility/character.jsp?a=1D41A> ( 𝐚 )
> MATHEMATICAL BOLD SMALL A could be represented as any of:
>
> 1. '𝐚'
> 2. \uD835\uDC1A
> 3. \U0000D835\U0000DC1A
> 4. \U0001D41A
>
>
> Similarly #3 is a waste of space, but should not be illegal. #2 and #3
> are discouraged where \U is available or UTF-16 is not used, but #2 is
> necessary where \U is not available (e.g. Java). [Myself, I like \x{...}
> escaping better, since it is more uniform. Having a terminator allows
> variable length.]
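>
> For comparison, a minimal sketch of the \x{...} style (the closing
> brace terminates the escape, so any count of hex digits works and no
> pairing rules are needed; the function name here is made up):
>
>   import re
>
>   def unescape_braced(s):
>       # each escape names a full code point, so surrogate pairing
>       # never enters the picture
>       return re.sub(r'\\x\{([0-9A-Fa-f]{1,6})\}',
>                     lambda m: chr(int(m.group(1), 16)), s)
>
>   # unescape_braced(r'\x{61}\x{1D41A}') == 'a\U0001D41A'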
>
> OK. Here's where I think it matters how the escapes are defined.
>
> If you use the definition
>
> '\uxxxx' - escape sequence for a UTF-16 code unit
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
>
> then everything is well-defined. Examples 1, 2, and 4 in your second set
> are clearly legal, and example 3 is clearly not equivalent. Note that the
> lack of equivalence follows from the definition of UTF-32, just as the
> equivalence between examples 2 and 3 in the *first* set follows from the
> definitions of UTF-32 and UTF-16.
>
> How would you rigorously define these two styles of escapes, so that
> example #3 (second set) becomes legal? You would have to do something
> complicated like
>
> '\uxxxx' - escape sequence for a UTF-16 code unit
>
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
>                if xxxxxxxx >= 0x10000, but escape sequence
>                for a UTF-16 code unit if xxxxxxxx < 0x10000.
>
> To me, that seems unnecessarily convoluted.
>
> Further, you create the problem that illegal UTF-32 can get converted to
> legal UTF-32.
>
> Here's how: Client 1 starts out with illegal UTF-32 containing the
> sequence <0000D835, 0000DC1A>. Assume this gets turned into the escapes
> "\U0000D835\U0000DC1A" and sent to the server. Client 2 receives this
> escaped sequence and interprets it as the single character sequence
> <0001D41A>. Had client 1 sent the UTF-32 string to client 2 directly, client
> 2 would have been able to reject it as illegal UTF-32.
>
> However, now we have client 3, which works in UTF-16 and has data of the
> form <D835, DC1A>. Under your scheme, client 3 has a choice. It can send
> any one of these four escape sequences containing surrogate values:
> "\uD835\U0000DC1A"
> "\U0000D835\uDC1A"
> "\U0000D835\U0000DC1A"
> or
> "\uD835\uDC1A"
>
> To the server, the third sequence of escapes matches what client 1
> produced starting from an illegal UTF-32 sequence.
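>
> A quick check that the four spellings are indistinguishable once
> reduced to code units (sketch only; it looks at nothing but the
> escapes):
>
>   import re
>
>   def units(s):
>       # the 16-bit value carried by each escape, whichever
>       # spelling carries it (all values here are <= FFFF)
>       return [int(a or b, 16) for a, b in
>               re.findall(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})', s)]
>
>   spellings = [r'\uD835\U0000DC1A', r'\U0000D835\uDC1A',
>                r'\U0000D835\U0000DC1A', r'\uD835\uDC1A']
>   assert all(units(x) == [0xD835, 0xDC1A] for x in spellings)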
>
> You now have introduced into your distributed application a way to
> convert illegal UTF-32 sequences silently to legal UTF-32 sequences. From a
> security point of view, that would give me pause.
>
> A./
>
>
> Mark
>
>
>
> On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <asmusf@ix.netcom.com> wrote:
>
> On 4/16/2009 12:04 PM, Sam Mason wrote:
>
> Hi All,
>
>     I've got myself in a discussion about the correct handling of
>     surrogate pairs. The background is as follows; the Postgres
>     database server[1] currently assumes that the SQL it's receiving
>     is in some user specified encoding, and it's been proposed that
>     it would be nicer to be able to enter Unicode characters directly
>     in the form of escape codes in a similar form to Python, i.e.
>     support would be added for:
>
>         '\uxxxx'
>     and
>         '\Uxxxxxxxx'
>
>     The currently proposed patch[2] specifically handles surrogate
>     pairs in the input. For example '\uD800\uDF02' and '\U00010302'
>     would be considered to be valid and identical strings containing
>     exactly one character. I was wondering if this should indeed be
>     considered valid or if an error should be returned instead.
>
>
> As long as the surrogate code points are provided in pairs as escape
> sequences, there's an unambiguous relation between each pair and a
> code point in the supplementary planes. So far, so good.
>
> The upside is that the dual escape sequences facilitate conversion
> to/from UTF-16. Each code unit in UTF-16 can be processed
> separately.
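>
> The relation is just the standard UTF-16 one; in Python terms:
>
>   def pair_to_cp(hi, lo):
>       # hi in D800..DBFF, lo in DC00..DFFF
>       return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
>
>   # the proposal's own example: '\uD800\uDF02' <-> '\U00010302'
>   assert pair_to_cp(0xD800, 0xDF02) == 0x10302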
>
> The downside is that you now have two equivalent escape
> mechanisms, and you can no longer compare strings with escape
> sequences byte-for-byte without first bringing them into a
> canonical form.
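>
> One possible canonical form, sketched in Python (it keys only on the
> escapes themselves and ignores surrounding literal text, which a real
> implementation would have to handle too):
>
>   import re
>
>   def canonical_key(s):
>       # pair up surrogate-valued escapes, then compare scalar values
>       units = [int(a or b, 16) for a, b in
>                re.findall(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})', s)]
>       out, i = [], 0
>       while i < len(units):
>           u = units[i]
>           if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
>                   and 0xDC00 <= units[i + 1] <= 0xDFFF):
>               out.append(0x10000 + ((u - 0xD800) << 10)
>                          + (units[i + 1] - 0xDC00))
>               i += 2
>           else:
>               out.append(u)
>               i += 1
>       return out
>
>   # byte-wise different, canonically equal:
>   assert canonical_key(r'\uD835\uDC1A') == canonical_key(r'\U0001D41A')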
>
> However, if one is allowed to represent the character "a" both as
> 'a' and as '\u0061' (which I assume is possible) then there's
> already a certain ambiguity built into the escape sequence
> mechanism.
>
> What should definitely result in an error is to write '\U0000D800',
> because the eight-digit form is to be understood as UTF-32, and in
> UTF-32 a lone surrogate code point is ill-formed.
>
> So, in short, if the definition of the escapes is as follows
>
> '\uxxxx' - escape sequence for a UTF-16 code unit
>
> '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
>
> then everything is fine and predictable. If the definition of the
> shorter sequence is instead "a code point on the BMP", then it's
> not clear how to handle surrogate pairs.
>
> A./