From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Feb 05 2008 - 12:56:49 CST
Philippe gave some very interesting arguments (complete with specific
figures) but without citing his evidence or stating his assumptions. A
thorough performance comparison of the various encoding forms on large
data volumes would be interesting.
Assuming for the moment that the general arguments Philippe presented
are not that far off the mark, it would seem that UTF-16 is
not such a bad choice either. Because all but very specialized data
collections can expect to have 99+% of their character codes in the BMP,
the cost of decompressing the data to UTF-32 is dominated by the case
for BMP characters. Even if handling surrogates were to take 100 times
as long, that would only double the average (0.99 x 1 + 0.01 x 100 =
1.99).
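
To make that back-of-the-envelope figure concrete, here is a minimal
sketch in Python. The function name and parameters are mine, chosen for
illustration; the 99% and 100x values are the assumptions from the
paragraph above, not measurements.

```python
def average_decode_cost(bmp_fraction, surrogate_penalty):
    """Average per-character decode cost, relative to a BMP character.

    A BMP character is assigned cost 1.0; a character that needs a
    surrogate pair is assumed to cost `surrogate_penalty` times as much.
    """
    supplementary_fraction = 1.0 - bmp_fraction
    return bmp_fraction * 1.0 + supplementary_fraction * surrogate_penalty


# Assumed figures: 99% BMP characters, surrogates 100 times as slow.
cost = average_decode_cost(0.99, 100.0)
print(f"average cost relative to pure BMP data: {cost:.2f}")
```

With a BMP share above 99%, or a smaller penalty, the average falls
below that factor of two.
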
Meanwhile, the benefits of more localized memory access are those of a
50% reduction in size relative to UTF-32, not the 25% reduction that a
3-byte form would give. Plus, in many cases, you get the benefit of
direct library support without the need to convert the strings, if you
want.
That's the real argument I see against a 3-byte form.
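
The 50% versus 25% comparison can be checked with a short Python sketch.
It assumes BMP-only text and a hypothetical fixed-width 3-byte form
(exactly three bytes per code point); the helper names are illustrative
only.

```python
def storage_bytes(text):
    """Bytes needed to store `text` in each encoding form."""
    n_code_points = len(text)  # Python strings are indexed by code point
    return {
        "utf16": len(text.encode("utf-16-le")),
        "three_byte": 3 * n_code_points,  # hypothetical 3-byte form
        "utf32": len(text.encode("utf-32-le")),
    }


def reduction_vs_utf32(sizes, form):
    """Fraction of storage saved by `form` compared with UTF-32."""
    return 1.0 - sizes[form] / sizes["utf32"]


# Sample text restricted to the BMP (an assumption of this sketch).
sizes = storage_bytes("Unicode text in the BMP: αβγ 漢字")
for form in ("utf16", "three_byte"):
    print(form, f"{reduction_vs_utf32(sizes, form):.0%}")
```

For text with supplementary characters the UTF-16 saving shrinks
slightly, since each of those takes four bytes instead of two.
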
But, knowing programmers, they won't rest until every single permutation
of possible encoding forms has been used and foisted on some
unsuspecting user.
A./