Re: Handling of Surrogates

From: Mark Davis
Date: Thu Apr 16 2009 - 23:32:51 CDT

    It only implies it if it was spec'ed to imply it. Which it doesn't, at least
    in cases I'm familiar with.

    On Thu, Apr 16, 2009 at 20:56, Doug Ewell <> wrote:

    > I have to agree with Asmus on this. Even if the \Uxxxxxxxx notation was
    > originally created to get around the four-hex-digit limit of \uxxxx, it does
    > imply a 32-bit value. Writing \U0000D835\U0000DC1A would strongly imply
    > that two characters are being represented, not one. With this extended
    > notation, there should be no reason to fall back to UTF-16 code units.
    > --
    > Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
    > ----- Original Message ----- From: Mark Davis
    > To: Asmus Freytag
    > Cc: Sam Mason ;
    > Sent: Thursday, April 16, 2009 17:42
    > Subject: Re: Handling of Surrogates
    > > If you use the definition
    > > '\uxxxx' - escape sequence for a UTF-16 code unit
    > > '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
    > You'll have to stop there, because that isn't the definition typically (or
    > certainly not universally) used. In practice, these conventions arose as an
    > adaptation as we went from UCS-2 to UTF-16:
    > \u means the code point and the equivalent code unit (for these cases they
    > have identical numeric values).
    > \U means the code point and the equivalent code unit if <= FFFF (for these
    > cases they have identical numeric values),
    > and otherwise a code point and the equivalent two paired code units.
    > While it would be possible to restrict the recognized escapes, best
    > practice for interoperability is to accept all 7. When generating, as I
    > said, it would be cleaner to not do A.3 or B.3, and to do B.2 if and only if
    > B.4 is unavailable.
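
A minimal sketch of the lenient reading described above, in which any escape with a value <= FFFF stands for a UTF-16 code unit and adjacent surrogates are paired regardless of whether they were written with \u or \U. The names `decode_lenient` and `_ESCAPE` are hypothetical and not taken from any library or from the Postgres patch; the sketch only handles these two escape forms.

```python
import re

# Hypothetical helper: matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_lenient(text):
    """Decode \\u and \\U escapes, pairing surrogates however they were written."""
    units = []  # UTF-16 code units (<= 0xFFFF) or supplementary code points
    pos = 0
    for m in _ESCAPE.finditer(text):
        # Literal characters between escapes pass through unchanged.
        units.extend(ord(ch) for ch in text[pos:m.start()])
        units.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    units.extend(ord(ch) for ch in text[pos:])

    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High + low surrogate: combine into one supplementary code point.
            u = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1
        elif 0xD800 <= u <= 0xDFFF:
            raise ValueError("unpaired surrogate %04X" % u)
        elif u > 0x10FFFF:
            raise ValueError("value %X is beyond U+10FFFF" % u)
        out.append(chr(u))
        i += 1
    return "".join(out)
```

Under this reading, `\uD835\uDC1A`, `\U0000D835\U0000DC1A`, the two mixed forms, and `\U0001D41A` all decode to the same single character, while an unpaired surrogate is still rejected.
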
    > > Further, you create the problem that illegal UTF-32 can get converted to
    > > legal UTF-32.
    > These conventions are designed for and typically used for UTF-16 literal
    > text. And if they are used with other UTFs, they should be interpreted as
    > representing what they would be in UTF-16. That is, each of the 7 formats I
    > listed would have the corresponding meaning.
    > > You now have introduced into your distributed application a way to
    > > convert illegal UTF-32 sequences silently to legal UTF-32 sequences. From a
    > > security point of view, that would give me pause.
    > First off, essentially nobody uses UTF-32 for interchange, so your example
    > would have been better as UTF-8 (you can do the same example). Secondly,
    > yes, this escaping format is based on UTF-16, and thus has some history
    > behind it. But it doesn't present any significant problem. You can get a
    > well-formed result from an ill-formed source if:
    > a. you convert ill-formed UTF-32 or UTF-8 to the escaped form without
    > checking for ill-formed source, OR
    > b. you convert ill-formed UTF-32 or UTF-8 to UTF-16 without checking
    > for ill-formed source.
    > Of course, if I also do a conversion from UTF-32 where I replace surrogate
    > code points by FFFD, I also get a valid result. The key problem for security
    > is where I can sneak harmful characters past a gatekeeper. Very few servers
    > use surrogate characters (or FFFD) as syntax characters ;-)
    > And yes, I did forget B.3a and B.3b, which are also possible.
    > a \U0000D835\uDC1A
    > b \uD835\U0000DC1A
    > Ugly, but the meaning is well-defined.
    > Mark
    > On Thu, Apr 16, 2009 at 15:42, Asmus Freytag <> wrote:
    > On 4/16/2009 2:55 PM, Mark Davis wrote:
    > I disagree somewhat, if I understand what you wrote.
    > I think that you misunderstood what I wrote.
    > When the \u and \U conventions are used:
    > U+0061 ( a ) LATIN SMALL LETTER A could be represented as any of:
    > 1. 'a'
    > 2. \u0061
    > 3. \U00000061
    > The use of #3 is a waste of space, but should not be illegal (except
    > where \U is not available).
    > I agree completely so far.
    > U+1D41A ( 𝐚 ) MATHEMATICAL BOLD SMALL A could be represented as any of:
    > 1. '𝐚'
    > 2. \uD835\uDC1A
    > 3. \U0000D835\U0000DC1A
    > 4. \U0001D41A
    > Similarly #3 is a waste of space, but should not be illegal. #2 and #3
    > are discouraged where \U is available or UTF-16 is not used, but #2 is
    > necessary where \U is not available (e.g. Java). [Myself, I like \x{...}
    > escaping better, since it is more uniform. Having a terminator allows
    > variable length.]
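
The generation preference stated here (emit the single \U form for supplementary characters when it is available, and fall back to a \u surrogate pair only when it is not) can be sketched as follows. `escape_char` and its `have_big_u` flag are hypothetical names chosen for illustration; the sketch assumes its input is a single well-formed character, not a lone surrogate, and it escapes everything outside printable ASCII.

```python
def escape_char(ch, have_big_u=True):
    """Escape one character, preferring \\UXXXXXXXX over a surrogate pair."""
    cp = ord(ch)
    if cp < 0x80 and ch.isprintable() and ch != "\\":
        return ch  # plain printable ASCII needs no escape
    if cp <= 0xFFFF:
        return "\\u%04X" % cp  # BMP: code point and code unit coincide
    if have_big_u:
        return "\\U%08X" % cp  # form 4: one escape per code point
    # Form 2, used only where \U does not exist (e.g. Java):
    # split the code point into a UTF-16 surrogate pair.
    offset = cp - 0x10000
    return "\\u%04X\\u%04X" % (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))
```

For U+1D41A this yields `\U0001D41A` by default and `\uD835\uDC1A` when `have_big_u=False`; the discouraged form 3 is never produced.
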
    > OK. Here's where I think it matters how the escapes are defined.
    > If you use the definition
    > '\uxxxx' - escape sequence for a UTF-16 code unit
    > '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
    > then everything is well-defined. Examples 1, 2, and 4 in your second set
    > are clearly legal, and example 3 is clearly not equivalent. Note that the lack
    > of equivalence follows from the definition of UTF-32, just as the
    > equivalence between examples 2 and 3 in the *first* set follows from the
    > definitions of UTF-32 and UTF-16.
    > How would you rigorously define these two styles of escapes, so that
    > example #3 (second set) becomes legal? You would have to do something
    > complicated like
    > '\uxxxx' - escape sequence for a UTF-16 code unit
    > '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit if xxxxxxxx >= 0x10000,
    > but escape sequence for a UTF-16 code unit if xxxxxxxx < 0x10000.
    > To me, that seems unnecessarily convoluted.
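
For comparison, the simpler definition (\u is always a UTF-16 code unit, \U is always a UTF-32 code unit) can be sketched as a strict decoder. Under it, surrogates may only be written with \u and must come in pairs, and a \U escape carrying a surrogate value is an error. `decode_strict` and `_ESCAPE` are hypothetical names, and this is only one way to realise that definition.

```python
import re

# Hypothetical helper: matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_strict(text):
    """\\u is a UTF-16 code unit; \\U is a UTF-32 code unit (never a surrogate)."""
    tokens = []  # (kind, value) where kind is "lit", "u" or "U"
    pos = 0
    for m in _ESCAPE.finditer(text):
        tokens.extend(("lit", ord(c)) for c in text[pos:m.start()])
        if m.group(1):
            tokens.append(("u", int(m.group(1), 16)))
        else:
            tokens.append(("U", int(m.group(2), 16)))
        pos = m.end()
    tokens.extend(("lit", ord(c)) for c in text[pos:])

    out = []
    pending = None  # high surrogate from a \u escape, awaiting its partner
    for kind, v in tokens:
        if pending is not None:
            if kind == "u" and 0xDC00 <= v <= 0xDFFF:
                out.append(chr(0x10000 + ((pending - 0xD800) << 10) + (v - 0xDC00)))
                pending = None
                continue
            raise ValueError("high surrogate not followed by a \\u low surrogate")
        if kind == "u" and 0xD800 <= v <= 0xDBFF:
            pending = v
        elif 0xD800 <= v <= 0xDFFF or v > 0x10FFFF:
            # Covers \U0000D835, a lone \uDC1A, and out-of-range values.
            raise ValueError("%X is not a valid value here" % v)
        else:
            out.append(chr(v))
    if pending is not None:
        raise ValueError("unpaired high surrogate at end of input")
    return "".join(out)
```

With this decoder `\uD835\uDC1A` and `\U0001D41A` are accepted and equal, while `\U0000D835\U0000DC1A` raises an error instead of being treated as a pair.
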
    > Further, you create the problem that illegal UTF-32 can get converted to
    > legal UTF-32.
    > Here's how: Client 1 starts out with illegal UTF-32 containing the
    > sequence <0000D835, 0000DC1A>. Assume this gets turned into the escapes
    > "\U0000D835\U0000DC1A" and sent to the server. Client 2 receives this
    > escaped sequence and interprets it as the single character sequence
    > <0001D41A>. Had client 1 sent the UTF-32 string to client 2 directly, client
    > 2 would have been able to reject it as illegal UTF-32.
    > However, now we have client 3, which works in UTF-16 and has data of the
    > form <D835, DC1A>. Under your scheme, client 3 has a choice. It can send any
    > one of these four sequences of escape sequences containing surrogates
    > "\uD835\U0000DC1A"
    > "\U0000D835\uDC1A"
    > "\U0000D835\U0000DC1A"
    > or
    > "\uD835\uDC1A"
    > To the server, the third sequence of escapes matches what client 1 has
    > produced starting with an illegal UTF-32 sequence.
    > You now have introduced into your distributed application a way to
    > convert illegal UTF-32 sequences silently to legal UTF-32 sequences. From a
    > security point of view, that would give me pause.
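
The scenario turns on whether client 1 validates its UTF-32 before escaping it. A minimal sketch of that step, with a validation gate that can be switched off to reproduce the problem; `is_well_formed_utf32` and `escape_utf32` are hypothetical names, and the input is assumed to be a list of integer code units.

```python
def is_well_formed_utf32(code_units):
    """True if every code unit is a Unicode scalar value."""
    return all(
        0 <= cu <= 0x10FFFF and not (0xD800 <= cu <= 0xDFFF)
        for cu in code_units
    )


def escape_utf32(code_units, validate=True):
    """Write each UTF-32 code unit as a \\UXXXXXXXX escape."""
    if validate and not is_well_formed_utf32(code_units):
        raise ValueError("ill-formed UTF-32: surrogate or out-of-range value")
    return "".join("\\U%08X" % cu for cu in code_units)


# Client 1's data from the scenario: two surrogate values stored as UTF-32.
ill_formed = [0x0000D835, 0x0000DC1A]
```

Calling `escape_utf32(ill_formed, validate=False)` produces `\U0000D835\U0000DC1A`, which a lenient receiver would read as U+1D41A; with validation left on, the ill-formed input is rejected before it is ever escaped.
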
    > A./
    > Mark
    > On Thu, Apr 16, 2009 at 13:04, Asmus Freytag <> wrote:
    > On 4/16/2009 12:04 PM, Sam Mason wrote:
    > Hi All,
    > I've got myself in a discussion about the correct handling of surrogate
    > pairs. The background is as follows: the Postgres database server[1]
    > currently assumes that the SQL it's receiving is in some user-specified
    > encoding, and it's been proposed that it would be nicer to be able to
    > enter Unicode characters directly in the form of escape codes in a
    > similar form to Python, i.e. support would be added for:
    > '\uxxxx'
    > and
    > '\Uxxxxxxxx'
    > The currently proposed patch[2] specifically handles surrogate pairs
    > in the input. For example '\uD800\uDF02' and '\U00010302' would be
    > considered to be valid and identical strings containing exactly one
    > character. I was wondering if this should indeed be considered valid or
    > if an error should be returned instead.
    > As long as there are pairs of the surrogate code points provided
    > as escape sequences, there's an unambiguous relation between each
    > pair and a code point in the supplementary planes. So far, so good.
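
The unambiguous relation between a surrogate pair and a supplementary code point is plain arithmetic, sketched below with the pair from the example in the question. The function names `combine_surrogates` and `split_code_point` are hypothetical.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to its supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Map a supplementary code point back to its surrogate pair."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

For the pair D800 DF02 the high part contributes 0 and the low part contributes 0x302, so the result is 0x10000 + 0x302 = 0x10302, which is why '\uD800\uDF02' and '\U00010302' name the same character.
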
    > The upside is that the dual escape sequences facilitate conversion
    > to/from UTF-16. Each code unit in UTF-16 can be processed
    > separately.
    > The downside is that you now have two equivalent escape
    > mechanisms, and you can no longer take a string with escape
    > sequences and binarily compare it without bringing it into a
    > canonical form.
    > However, if one is allowed to represent the character "a" both as
    > 'a' and as '\u0061' (which I assume is possible) then there's
    > already a certain ambiguity built into the escape sequence
    > mechanism.
    > What should definitely result in an error is to write '\U0000D800'
    > because the 8-digit form is to be understood as UTF-32, and in that
    > context a surrogate value is not allowed.
    > So, in short, if the definition of the escapes is as follows
    > '\uxxxx' - escape sequence for a UTF-16 code unit
    > '\Uxxxxxxxx' - escape sequence for a UTF-32 code unit
    > then everything is fine and predictable. If the definition of the
    > shorter sequence is instead "a code point on the BMP", then it's
    > not clear how to handle surrogate pairs.
    > A./
