RE: Handling of Surrogates

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Apr 16 2009 - 17:01:51 CDT

  • Next message: Asmus Freytag: "Re: Handling of Surrogates"

    I don't think that the proposed syntax means that the Unicode characters
    need to be stored in the database using this form. It is only meant to be
    used within the syntax of SQL statements themselves, and is not needed for
    bound variables. The database can then store the encoded characters
    directly, using any convenient binary UTF format.

    The right question to ask is what the semantics will be of table
    definitions that count "characters" for field-length restrictions, or of
    functions manipulating substrings or performing various operations on
    them. The database could still perform a validation for the proposed
    syntax, so that it will not accept unpaired surrogates (this would
    generate a constraint violation error, just like the one that occurs when
    you violate a length restriction).

    The actual number of bytes taken up in the database storage is independent
    of the SQL definition of tables (if you use a syntax like "VARCHAR(n)" or
    "CHAR(n)"), but many RDBMS engines (Oracle, Sybase, Informix, Microsoft
    SQL Server) have chosen to define a new type (like "[VAR]NLSCHAR(n)") to
    allow storing sequences of atomic Unicode code points rather than
    sequences of code units (with [VAR]CHAR(n)). Some engines do not use a
    separate type, but allow the database to be created with a global
    parameter specifying the encoding used for CHAR, CHAR(n), VARCHAR and
    VARCHAR(n).

    So for me, the proposal makes sense: the syntax will only be used for the
    representation of text in quoted strings within SQL statements. The
    escaping mechanism is not new in SQL, and one has been needed for even
    longer to escape the surrounding syntactic quote characters themselves.

    Whenever you run a SQL query specifying:
    SELECT ... FROM... WHERE field='a'
    or
    SELECT ... FROM... WHERE field='\x61'
    or
    SELECT ... FROM... WHERE field='\u0061'
    or
    SELECT ... FROM... WHERE field='\U00000061'

    the queries should run equivalently in all cases, as long as the
    character is representable in the target database table's storage. The
    behavior should be exactly analogous to what has been adopted in HTML.

    Same thing if you use surrogates with '\u' pairs or a single '\U'.
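    To make that equivalence concrete, here is a minimal Python sketch (an
    illustration, not any engine's actual lexer) combining a '\u' surrogate
    pair into the single code point that the '\U' form denotes:

```python
def combine_surrogates(hi: int, lo: int) -> str:
    """Combine a UTF-16 surrogate pair into one supplementary-plane character."""
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))

# The pair '\uD800\uDF02' and the single escape '\U00000061'-style form
# '\U00010302' denote the same character, so a query lexer can normalize
# one spelling to the other before storage or comparison.
assert combine_surrogates(0xD800, 0xDF02) == '\U00010302'
```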

    For historic reasons, C, C++, and Java (this is also true for Python, PHP,
    Javascript...) have been quite relaxed about the possibility of storing
    arbitrary sequences of 16-bit code units, including those invalid in
    Unicode; there are pros and cons to this approach, but nothing forbids a
    database engine from using the relaxed model, as long as the adopted
    syntax still allows specifying additional validity constraints (that's why
    I think it's simpler to use a separate datatype for Unicode code points,
    independently of the internal UTF used for them, which may be UTF-8,
    UTF-16, or a compressed UTF like BOCU or CESU). But in all cases, the
    syntax used in SQL queries is completely independent of the actual storage
    representation.

    I see no rationale in allowing one syntax and treating the others as
    errors: just consider that the '\U' 32-bit syntax will always map to the
    same '\u' pairs; this holds as long as the 32-bit syntax is NOT used to
    indicate a surrogate code point: you only need to reject '\U0000D800' to
    '\U0000DFFF' and '\U00110000' to '\UFFFFFFFF' as invalid SQL syntax,
    because they would break the equivalence. You also need to check, in the
    SQL query lexer, that '\uD800' to '\uDBFF' are correctly paired with a
    matching '\uDC00' to '\uDFFF'. There will never be any ambiguity; all
    will be predictable.
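    A small Python sketch of such a lexer-side validator (hypothetical names,
    not PostgreSQL's actual implementation) that applies exactly these
    checks: rejecting out-of-range '\U' escapes and unpaired surrogates,
    while decoding everything else predictably:

```python
import re

# Matches the two escape forms under discussion: \uXXXX and \UXXXXXXXX.
ESC = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def unescape(s: str) -> str:
    out, pos, pending_hi = [], 0, None  # pending_hi: high surrogate awaiting its low half
    for m in ESC.finditer(s):
        literal = s[pos:m.start()]
        if pending_hi is not None and (literal or m.group(2) is not None):
            raise ValueError("unpaired high surrogate")
        out.append(literal)
        if m.group(1) is not None:                    # \uXXXX form
            cp = int(m.group(1), 16)
            if pending_hi is not None:                # must be the low half
                if not 0xDC00 <= cp <= 0xDFFF:
                    raise ValueError("unpaired high surrogate")
                out.append(chr(0x10000 + ((pending_hi - 0xD800) << 10) + (cp - 0xDC00)))
                pending_hi = None
            elif 0xD800 <= cp <= 0xDBFF:
                pending_hi = cp                       # wait for the matching low half
            elif 0xDC00 <= cp <= 0xDFFF:
                raise ValueError("lone low surrogate")
            else:
                out.append(chr(cp))
        else:                                         # \UXXXXXXXX form
            cp = int(m.group(2), 16)
            if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                raise ValueError("invalid \\U escape")
            out.append(chr(cp))
        pos = m.end()
    if pending_hi is not None:
        raise ValueError("unpaired high surrogate")
    return ''.join(out) + s[pos:]
```

    With these rules, '\uD800\uDF02' and '\U00010302' decode to the same
    one-character string, while '\U0000D800' or a lone '\uD800' raise an
    error, which is the constraint-violation behavior described above.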

    Handling the legacy '\xNN' syntax is generally more complex because it is
    generally dependent on the encoding used by the SQL client (within its
    session parameters at logon time, or in subsequently set locale
    parameters) when communicating with the SQL server. But the '\u' and '\U'
    syntaxes must be made independent of these encodings and client locale
    preferences.

    One difficulty will be to specify the behavior of the RDBMS engine if the
    client connects to it using a client-side encoding that does not support
    Unicode: should it be allowed to connect to the database at all, if this
    is a global setting, or to perform queries returning data columns
    containing Unicode characters, if the Unicode capability is specified per
    table or per column? If it is allowed to perform the query, how can the
    RDBMS perform the conversion (using a single substitute, or an
    approximation to other "near" characters)? Or should it return a SQL
    query execution error (invalidating the current transaction)? What will
    happen if the change of encoding changes the number of code points after
    the conversion? Should the encoding conversion be performed by the
    server, or by the client within its local client libraries, or within the
    supporting libraries of the programming language or of the client OS?

    All these options are possible in Oracle, Sybase, MSSQL and Informix
    (most probably in IBM DB2 too, though I'm not an expert in it); they all
    have their advantages (compatibility and connectivity from legacy
    clients) and inconveniences (in terms of data constraints and coherence).
    Generally, if you allow a database to store Unicode code points, you
    should make sure that it will store only valid Unicode text, and prepare
    your clients to handle Unicode text as well, in an encoding suitable for
    full round-trip compatibility with Unicode.

    I don't see why this should be a new difficulty in Postgres, when it was
    solved long ago in Oracle, Sybase, MS SQL and DB2, and more recently in
    MySQL too...

    > -----Original Message-----
    Asmus Freytag wrote:
    > On 4/16/2009 12:04 PM, Sam Mason wrote:
    > > Hi All,
    > >
    > > I've got myself in a discussion about the correct handling of
    > > surrogate pairs. The background is as follows; the Postgres database
    > > server[1] currently assumes that the SQL it's receiving is in some
    > > user specified encoding, and it's been proposed that it would be nicer
    > > to be able to enter Unicode characters directly in the form of escape
    > > codes in a similar form to Python, i.e. support would be added for:
    > >
    > > '\uxxxx'
    > > and
    > > '\Uxxxxxxxx'
    > >
    > > The currently proposed patch[2] specifically handles surrogate pairs
    > > in the input. For example '\uD800\uDF02' and '\U00010302' would be
    > > considered to be valid and identical strings containing exactly one
    > > character. I was wondering if this should indeed be considered valid
    > > or if an error should be returned instead.
    > >
    > As long as there are pairs of the surrogate code points
    > provided as escape sequences, there's an unambiguous relation
    > between each pair and a code point in the supplementary
    > planes. So far, so good.
    >
    > The upside is that the dual escape sequences facilitate
    > conversion to/from UTF-16. Each code unit in UTF-16 can be
    > processed separately.
    >
    > The downside is that you now have two equivalent escape
    > mechanisms, and you can no longer take a string with escape
    > sequences and binarily compare it without bringing it into a
    > canonical form.
    >
    > However, if one is allowed to represent the character "a" both as 'a'
    > and as '\u0061' (which I assume is possible) then there's
    > already a certain ambiguity built into the escape sequence mechanism.
    >
    > What should definitely result in an error is to write '\U0000D800'
    > because the 8-byte form is to be understood as UTF-32, and in
    > that context there would be an issue.
    >
    > So, in short, if the definition of the escapes is as follows
    >
    > '\uxxxx' - escape sequence for a UTF-16 code point
    >
    > '\Uxxxxxxxx' - escape sequence for a UTF-32 code point
    >
    > then everything is fine and predictable. If the definition of
    > the shorter sequence, is instead, "a code point on the BMP"
    > then it's not clear how to handle surrogate pairs.



    This archive was generated by hypermail 2.1.5 : Thu Apr 16 2009 - 17:03:55 CDT