RE: Subj: MySQl and Unicode support

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Feb 22 2007 - 07:59:45 CST

  • Next message: Henrik Theiling: "Re: Subj: MySQl and Unicode support"

    MySQL car perfectly support Unicode texts, using UTF-8 as the base encoding
    for the strings; however some SQL operators will not work as expected,
    because they use byte positions, instead of character positions (in UTF-8,
    Uniode characters are encoded with a variable number of bytes, from 1 byte
    for ASCII characters up to U+007F, to 4).

    If you don't use the SQL string functions that compute:
    * substrings based on character indices instead of byte indices
    * case conversion,
    * uncased string matching,
    * sorting the results,
    then you an perfectly use it. Some newer versions of MySQL have the support
    of "nvarchar" to support Unicode strings with the Unicode character
    semantics, but be careful: the data type maximum length will still be based
    on the internal code units, not really on the Unicode characters (i.e. code
    points here). To allow this support, you need to install MySQL with extra
    support libraries, notably a database of character properties (used for
    proper case conversion and uncased/case-folded string matching).

    If your MySQL does not have such support, you can still use MySQL, but all
    string handling that require Unicode semantics should be performed on the
    SQL client side (for example in JSP, ASP, PHP, C, C++, #, ... provided that
    the client side language is also installed with the Unicode support).

    In fact, ANY database engine that supports storing and retrieving vectors of
    8-bit bytes without altering them (for example a database that fully
    supports ISO-8859-1) can be used to store UTF-8 encoded text. If you
    carefully use the good practice of CASE design for your database, all
    queries to the SQL engine will not require any Unicode-dependant features.

    Note that sorting (SELECT... ORDER BY...) may be an issue if you expect the
    results from the database to be in the appropriate order; but sorting also
    has a cost on the database engine and should not influence the results (if
    you follow the good practices of CASE design, using unique IDs for table
    rows rather than matching them with complex joins with strings), so you may
    as well accumulate the results on the client-side, and realize the sort in
    the client, rather than in the database engine.

    If you still want to use strings for table row unique identifiers used as
    joining keys, make sure that your system will accept variable length keys;
    if the key represents a person name, make sure that the string length
    restriction in the table declaration will be sufficient to store all
    possible encoded key lengths after computing the effective key string, if
    the key string can contain any Unicode character.

    Note also that if you need a particular sort to be still performed by the
    SQL engine, one solution is to include in your tables a column for storing
    the sort keys using a binary encoding of collation keys (see the Unicode UCA
    algorithm about how to compute collation keys); then store the collation key
    in the table in addition to the effective string (this will of course have
    an impact on the table storage size, and on the performance due to extra
    I/O, as less rows will fit in the same page on disk).

    Note that database engines that have the full support for Unicode use
    UCA-computed collation keys when they create an index on columns declared to
    use an Unicode-compatible string data type, for example NVARCHAR(n). You
    don't need to declare an additional column for the collation key, as the
    database engine performs itself the computation of collation keys (used for
    sorts, sub-selections, range-selection, and joining) within its indexes.

    There are many applications, notably for the web where MySQL is used with
    PHP, that DO support Unicode strings which are stored in a MySQL-driven
    database. If you know the limitations of MySQL before designing the
    database, then it's not complicate to design the client-side code so that it
    will make SQL queries without requiring Unicode-support in the MySQL
    database, when you can provide this support directly with the client-side
    libraries (for example in PHP, there are tons of modules that do accept
    Unicode strings, perform encoding conversions, ...).

    Note that PHP itself is quite similar to MySQL: it also has no native
    support of Unicode for its strings, which are just vectors of 8-bit units.
    But despite this limitation (where a "char" or "string" in PHP have not the
    same meaning as in Unicode), the supported data type is enough to allow
    handling UTF-8 strings correctly. There are constraints to respect in your
    code to avoid breaking Unicode-compatibility: the language will not prevent
    you to break those rules (notably not breaking a UTF-8 encoded character by
    splitting it in the middle of a sequence), but it is perfectly possible.

    The situation is almost identical in MOST other languages:
    * JavaScript/ECMA-Script for example uses characters and strings that are
    effectively vectors of 16-bit code units, and not Unicode characters (i.e.
    full code points); the difference is that code points above U+FFFF must be
    represented by two "characters" in JavaScript, each one corresponding to
    "surrogates", and the language does not enforce the rule of not breaking a
    character in the middle of a surrogate pair.
    * C/C++ may use 8-bit "char" or 8/16/32-bit "wchar_t"; you have to study the
    platform specification of the supported data types to find which format of
    code units are supported. Then you need to use appropriate code or libraries
    to use vectors of code units according to the expected Unicode semantics and
    requirement). The language itself will not enforce the Unicode rules, but it
    perfectly has all what it needed to support it

    In fact you could as well program in assembly language with the same
    capabilities and lacks of native support: it's up to you to add this support
    or to use libraries that will help you in this task.

    Note that if your MySQL instance only supports Unicode characters encoded
    with up to 3 bytes per characters, it is enough to support UTF-32 (after
    dropping the high byte which is always null and not stored, but will be
    regenerated implicitly at retrieval time). What this means is that it will
    use fixed-length encoding for fixed-length Unicode strings, so the storage
    space can be easily sized to a safe maximum. It will however probably not be
    very efficient in terms of I/O because most text will then be stored with 1
    or possibly 2 null bytes per characters, so there will be less rows per disk
    page.

    Really, for compact database storage, a SQL engine should propose some
    standard compaction algorithm (SCSU being a good candidate); what it needs
    to enforce is the effective Unicode string length, but not the internal
    storage length, but it must offer a way for database designers to estimate
    the average size and maximum size for the storage of tables. The average
    size will be data-dependant, but not the maximum which is very strict and
    can be enforced by the SQL engine as a simple measure in addition to the
    maximum Unicode string length. The effective internal representation used in
    the database storage should not affect the semantics of the strings you are
    storing and retrieving, if your database says it supports Unicode.

    But no database should say it supports Unicode if it cannot accept strings
    containing EVERY character of the 17 planes (including NULL and other
    control characters, but possibly excluding non-characters whose exclusion
    should better be enforced by the database engine).

    Full support of Unicode is illusory: new algorithms get developed everyday,
    as well as new characters and new properties. A SQL engine really conforms
    only to a set of Unicode algorithms specified in a precise version, which
    should be documented. The behaviour of the database with Unicode characters
    that were still not standardized in that Unicode version is not specified:

    Some database will adopt a "safe" behaviour and will reject strings
    containing them, others will accept them using default properties, but this
    may affect future evolutions when the engine gets updated to support a new
    version of Unicode: indices may need to be rebuilt for correct sorting or
    range sub-selections according to newer UCA collation rules including the
    newer characters. But this reordering of indices and queries may affect
    applications and that's why some database engines offer you the choice,
    either in the table store format (where the strict enforcement to the known
    standardized Unicode version may be encoded), or in the client connection
    handshakes, or in the stored user properties (that determine the client
    compatibility, but this may produce inconsistent results if there are
    different user profiles for the same database).

    I personally think that the best place for such Unicode version enforcement
    is in the application, not in the database itself, and not in the user
    profiles; but this is just an opinion, which may be wrong in other
    environments where there are multiple applications with different needs
    connecting to the same database.

    The per-user profile is for me a bad choice because it easily produces
    inconstant results (where the queries do not reflect the actual content of
    the database, which may produce sometimes inconsistent results, like
    rejecting a row creation due to a unique key constraint, despite that a row
    with that key could not be retrieved by a simple SELECT, even within the
    same SQL transaction). However, an application may still be built safely
    with such per-user profiles (regarding handshaked Unicode versions), if it
    is prepared to handle row creation exceptions, and do not assume that a ROW
    NOT FOUND indication (or COUNT(*)==0) after a simple SELECT (or a
    range-SELECT) means that the row will be creatable in the same transaction.

    > -----Message d'origine-----
    > De : unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] De la
    > part de Magda Danish (Unicode)
    > Envoyé : mercredi 21 février 2007 23:34
    > À : unicode@unicode.org
    > Cc : sinead.martin@traventec.com
    > Objet : FW: Subj: MySQl and Unicode support
    >
    > Sinead --
    > I am forwarding your question to the Unicode mailing list.
    >
    > Unicode subscribers --
    > Does anyone have an answer to Sinead's email?
    > -----Original Message-----
    > Date/Time: Tue Feb 20 07:22:21 CST 2007
    > Contact: sinead.martin@traventec.com
    > Name: Sinead Martin
    > Report Type: Other Question, Problem, or Feedback Opt Subject: MySQl and
    > Unicode support
    >
    > Hi there,
    >
    > I notice that MySQl is not listed on your website as a database that
    > supports Unicode.
    > Would you recommend mySQL as a database that needs to support 10 languages
    > incuding Traditional / simplified Chinese and Japanese?
    >
    > I wondered whether the fact that MySQL support only 3 bytes instead of 4
    > has any significance?
    >
    > Thanks in advance,
    > Sinead Martin



    This archive was generated by hypermail 2.1.5 : Thu Feb 22 2007 - 08:01:56 CST