RE: Subj: MySQl and Unicode support

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Feb 22 2007 - 07:59:45 CST

Next message: Henrik Theiling: "Re: Subj: MySQl and Unicode support"

Previous message: Omi Azad: "Re: Subj: MySQl and Unicode support"
In reply to: Magda Danish (Unicode): "FW: Subj: MySQl and Unicode support"
Next in thread: Henrik Theiling: "Re: Subj: MySQl and Unicode support"
Reply: Henrik Theiling: "Re: Subj: MySQl and Unicode support"
Maybe reply: Philippe Verdy: "RE: Subj: MySQl and Unicode support"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

MySQL car perfectly support Unicode texts, using UTF-8 as the base encoding
for the strings; however some SQL operators will not work as expected,
because they use byte positions, instead of character positions (in UTF-8,
Uniode characters are encoded with a variable number of bytes, from 1 byte
for ASCII characters up to U+007F, to 4).

If you don't use the SQL string functions that compute:
* substrings based on character indices instead of byte indices
* case conversion,
* uncased string matching,
* sorting the results,
then you an perfectly use it. Some newer versions of MySQL have the support
of "nvarchar" to support Unicode strings with the Unicode character
semantics, but be careful: the data type maximum length will still be based
on the internal code units, not really on the Unicode characters (i.e. code
points here). To allow this support, you need to install MySQL with extra
support libraries, notably a database of character properties (used for
proper case conversion and uncased/case-folded string matching).

If your MySQL does not have such support, you can still use MySQL, but all
string handling that require Unicode semantics should be performed on the
SQL client side (for example in JSP, ASP, PHP, C, C++, #, ... provided that
the client side language is also installed with the Unicode support).

In fact, ANY database engine that supports storing and retrieving vectors of
8-bit bytes without altering them (for example a database that fully
supports ISO-8859-1) can be used to store UTF-8 encoded text. If you
carefully use the good practice of CASE design for your database, all
queries to the SQL engine will not require any Unicode-dependant features.

Note that sorting (SELECT... ORDER BY...) may be an issue if you expect the
results from the database to be in the appropriate order; but sorting also
has a cost on the database engine and should not influence the results (if
you follow the good practices of CASE design, using unique IDs for table
rows rather than matching them with complex joins with strings), so you may
as well accumulate the results on the client-side, and realize the sort in
the client, rather than in the database engine.

If you still want to use strings for table row unique identifiers used as
joining keys, make sure that your system will accept variable length keys;
if the key represents a person name, make sure that the string length
restriction in the table declaration will be sufficient to store all
possible encoded key lengths after computing the effective key string, if
the key string can contain any Unicode character.

Note also that if you need a particular sort to be still performed by the
SQL engine, one solution is to include in your tables a column for storing
the sort keys using a binary encoding of collation keys (see the Unicode UCA
algorithm about how to compute collation keys); then store the collation key
in the table in addition to the effective string (this will of course have
an impact on the table storage size, and on the performance due to extra
I/O, as less rows will fit in the same page on disk).

Note that database engines that have the full support for Unicode use
UCA-computed collation keys when they create an index on columns declared to
use an Unicode-compatible string data type, for example NVARCHAR(n). You
don't need to declare an additional column for the collation key, as the
database engine performs itself the computation of collation keys (used for
sorts, sub-selections, range-selection, and joining) within its indexes.

There are many applications, notably for the web where MySQL is used with
PHP, that DO support Unicode strings which are stored in a MySQL-driven
database. If you know the limitations of MySQL before designing the
database, then it's not complicate to design the client-side code so that it
will make SQL queries without requiring Unicode-support in the MySQL
database, when you can provide this support directly with the client-side
libraries (for example in PHP, there are tons of modules that do accept
Unicode strings, perform encoding conversions, ...).

Note that PHP itself is quite similar to MySQL: it also has no native
support of Unicode for its strings, which are just vectors of 8-bit units.
But despite this limitation (where a "char" or "string" in PHP have not the
same meaning as in Unicode), the supported data type is enough to allow
handling UTF-8 strings correctly. There are constraints to respect in your
code to avoid breaking Unicode-compatibility: the language will not prevent
you to break those rules (notably not breaking a UTF-8 encoded character by
splitting it in the middle of a sequence), but it is perfectly possible.

The situation is almost identical in MOST other languages:
* JavaScript/ECMA-Script for example uses characters and strings that are
effectively vectors of 16-bit code units, and not Unicode characters (i.e.
full code points); the difference is that code points above U+FFFF must be
represented by two "characters" in JavaScript, each one corresponding to
"surrogates", and the language does not enforce the rule of not breaking a
character in the middle of a surrogate pair.
* C/C++ may use 8-bit "char" or 8/16/32-bit "wchar_t"; you have to study the
platform specification of the supported data types to find which format of
code units are supported. Then you need to use appropriate code or libraries
to use vectors of code units according to the expected Unicode semantics and
requirement). The language itself will not enforce the Unicode rules, but it
perfectly has all what it needed to support it

In fact you could as well program in assembly language with the same
capabilities and lacks of native support: it's up to you to add this support
or to use libraries that will help you in this task.

Note that if your MySQL instance only supports Unicode characters encoded
with up to 3 bytes per characters, it is enough to support UTF-32 (after
dropping the high byte which is always null and not stored, but will be
regenerated implicitly at retrieval time). What this means is that it will
use fixed-length encoding for fixed-length Unicode strings, so the storage
space can be easily sized to a safe maximum. It will however probably not be
very efficient in terms of I/O because most text will then be stored with 1
or possibly 2 null bytes per characters, so there will be less rows per disk
page.

Really, for compact database storage, a SQL engine should propose some
standard compaction algorithm (SCSU being a good candidate); what it needs
to enforce is the effective Unicode string length, but not the internal
storage length, but it must offer a way for database designers to estimate
the average size and maximum size for the storage of tables. The average
size will be data-dependant, but not the maximum which is very strict and
can be enforced by the SQL engine as a simple measure in addition to the
maximum Unicode string length. The effective internal representation used in
the database storage should not affect the semantics of the strings you are
storing and retrieving, if your database says it supports Unicode.

But no database should say it supports Unicode if it cannot accept strings
containing EVERY character of the 17 planes (including NULL and other
control characters, but possibly excluding non-characters whose exclusion
should better be enforced by the database engine).

Full support of Unicode is illusory: new algorithms get developed everyday,
as well as new characters and new properties. A SQL engine really conforms
only to a set of Unicode algorithms specified in a precise version, which
should be documented. The behaviour of the database with Unicode characters
that were still not standardized in that Unicode version is not specified:

Some database will adopt a "safe" behaviour and will reject strings
containing them, others will accept them using default properties, but this
may affect future evolutions when the engine gets updated to support a new
version of Unicode: indices may need to be rebuilt for correct sorting or
range sub-selections according to newer UCA collation rules including the
newer characters. But this reordering of indices and queries may affect
applications and that's why some database engines offer you the choice,
either in the table store format (where the strict enforcement to the known
standardized Unicode version may be encoded), or in the client connection
handshakes, or in the stored user properties (that determine the client
compatibility, but this may produce inconsistent results if there are
different user profiles for the same database).

I personally think that the best place for such Unicode version enforcement
is in the application, not in the database itself, and not in the user
profiles; but this is just an opinion, which may be wrong in other
environments where there are multiple applications with different needs
connecting to the same database.

The per-user profile is for me a bad choice because it easily produces
inconstant results (where the queries do not reflect the actual content of
the database, which may produce sometimes inconsistent results, like
rejecting a row creation due to a unique key constraint, despite that a row
with that key could not be retrieved by a simple SELECT, even within the
same SQL transaction). However, an application may still be built safely
with such per-user profiles (regarding handshaked Unicode versions), if it
is prepared to handle row creation exceptions, and do not assume that a ROW
NOT FOUND indication (or COUNT(*)==0) after a simple SELECT (or a
range-SELECT) means that the row will be creatable in the same transaction.

> -----Message d'origine-----
> De : unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] De la
> part de Magda Danish (Unicode)
> Envoyé : mercredi 21 février 2007 23:34
> À : unicode@unicode.org
> Cc : sinead.martin@traventec.com
> Objet : FW: Subj: MySQl and Unicode support
>
> Sinead --
> I am forwarding your question to the Unicode mailing list.
>
> Unicode subscribers --
> Does anyone have an answer to Sinead's email?
> -----Original Message-----
> Date/Time: Tue Feb 20 07:22:21 CST 2007
> Contact: sinead.martin@traventec.com
> Name: Sinead Martin
> Report Type: Other Question, Problem, or Feedback Opt Subject: MySQl and
> Unicode support
>
> Hi there,
>
> I notice that MySQl is not listed on your website as a database that
> supports Unicode.
> Would you recommend mySQL as a database that needs to support 10 languages
> incuding Traditional / simplified Chinese and Japanese?
>
> I wondered whether the fact that MySQL support only 3 bytes instead of 4
> has any significance?
>
> Thanks in advance,
> Sinead Martin

Next message: Henrik Theiling: "Re: Subj: MySQl and Unicode support"
Previous message: Omi Azad: "Re: Subj: MySQl and Unicode support"
In reply to: Magda Danish (Unicode): "FW: Subj: MySQl and Unicode support"
Next in thread: Henrik Theiling: "Re: Subj: MySQl and Unicode support"
Reply: Henrik Theiling: "Re: Subj: MySQl and Unicode support"
Maybe reply: Philippe Verdy: "RE: Subj: MySQl and Unicode support"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Thu Feb 22 2007 - 08:01:56 CST