From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Nov 05 2003 - 06:38:24 EST
On 04/11/2003 21:49, Doug Ewell wrote:
>Peter Kirk <peterkirk at qaya dot org> wrote:
>
>>>... (a very old, legacy application, unaware of the existence of
>>>codepoints above U+FFFF) ...
>>>
>>Such applications are not "very old", they are still being written.
>>For example (see http://www.mysql.com/doc/en/Charset-Unicode.html),
>>MySQL 4.1 adds UCS-2 and UTF-8 support to previous versions but for
>>single two-byte codes in UCS-2 and up to three bytes per UTF-8
>>character only :-( - and this is still in alpha!
>
>At the risk of upsetting the open-source faithful, that is just plain
>lazy. Anyone who can master the wizardly details of building a powerful
>(and commercially successful) database program can figure out how to
>slap two surrogates together without destroying performance.
>Constraining UTF-8 to the BMP is even less defensible, since there is no
>performance penalty in allowing four-byte UTF-8 sequences.
>
>-Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
>
Agreed. But to be fair to MySQL, they do mention one potential problem:
three bytes have to be allocated in strings for each UTF-8 character.
Full UTF-8 support would need four bytes per character, which from
their perspective would be a greater problem. I also suspect that
Unicode data is actually being stored in 16-bit entities, so that the
major issue is the extra complication of handling surrogate pairs
within that representation, rather than the trivial one of converting
such pairs to and from valid UTF-8.
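
To illustrate why that conversion is the trivial part, here is a minimal
Python sketch of the arithmetic defined by the Unicode standard. The
function names are hypothetical, chosen only for this example, and
nothing here reflects MySQL's actual internals.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one supplementary code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8, including four-byte sequences."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        if 0xD800 <= cp <= 0xDFFF:
            # A lone surrogate is not a character and has no valid UTF-8 form.
            raise ValueError("surrogate code point: %#06x" % cp)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond U+10FFFF: %#x" % cp)


# U+1D11E (MUSICAL SYMBOL G CLEF) is stored in UTF-16 as D834 DD1E.
cp = combine_surrogates(0xD834, 0xDD1E)
encoded = utf8_encode(cp)
```

For the pair D834 DD1E this yields U+1D11E and the four bytes
F0 9D 84 9E. The harder work for a database lies elsewhere: lengths,
indexes and comparisons all have to treat the pair as one character.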
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/