From: Hans Aberg (haberg@math.su.se)
Date: Sun Jun 04 2006 - 07:21:44 CDT
On 4 Jun 2006, at 03:53, Asmus Freytag wrote:
> UTF-32 loses on all counts: it's so space inefficient that for
> large scale text processing it's swamped by cache misses,
What do you have in mind here?
> and the slight gain in efficiency for accessing character property
> values matters only for selected text corpora, such as cuneiform
> etc, that are entirely off the BMP.
This just says that for character sets confined to a particular
region, an encoding optimized for that region is more efficient,
though it will lose out in general use. It might be better to choose
a more efficient optimization method than a particular legacy encoding.
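
To make the space argument concrete, here is a minimal sketch that measures how many bytes the same text occupies in each encoding form. The helper name encoded_sizes and the sample strings are assumptions chosen for illustration; they are not from the thread. The little-endian codecs are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le: no BOM prepended
        "utf-32": len(text.encode("utf-32-le")),
    }

# Illustrative samples (assumptions, not from the thread).
latin = "hello world"               # 11 characters, all inside the BMP
cuneiform = "\U00012000\U00012001"  # 2 cuneiform signs, both off the BMP

print(encoded_sizes(latin))      # {'utf-8': 11, 'utf-16': 22, 'utf-32': 44}
print(encoded_sizes(cuneiform))  # {'utf-8': 8, 'utf-16': 8, 'utf-32': 8}
```

For text entirely off the BMP the three forms take the same space, so UTF-32 costs nothing extra there; for BMP text it is twice the size of UTF-16.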
> Therefore, if you need to perform more than one operation on UTF-32
> or hold large data in memory, it almost always pays to convert it
> to some other encoding form - UTF-16 being the easier conversion.
I am not sure what you have in mind here: with modern use of
virtual memory, the OS emulates a large data space. For 32-bit
computers, this is typically 2^31 bytes (or words), but these are now
on the way out, in favor of 64-bit computers with an even larger
address space.
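
For reference, the UTF-32 to UTF-16 conversion mentioned in the quoted text can be sketched as follows. This is only an illustration under stated assumptions: the function name utf32_to_utf16 is hypothetical, and the input is assumed to be a sequence of integer code points rather than a byte buffer.

```python
def utf32_to_utf16(code_points):
    """Convert Unicode scalar values (ints) to a list of UTF-16 code units."""
    units = []
    for cp in code_points:
        # Reject values outside the Unicode range and lone surrogates.
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        if cp < 0x10000:
            # BMP code point: one 16-bit unit, value unchanged.
            units.append(cp)
        else:
            # Supplementary code point: split the 20-bit offset into a
            # high (lead) and a low (trail) surrogate.
            offset = cp - 0x10000
            units.append(0xD800 | (offset >> 10))
            units.append(0xDC00 | (offset & 0x3FF))
    return units

# 'A' followed by U+12000 (a cuneiform sign, off the BMP)
print([hex(u) for u in utf32_to_utf16([0x41, 0x12000])])
# ['0x41', '0xd808', '0xdc00']
```

The conversion is a range check plus a shift and a mask per code point, which is why it is cheap compared with the memory traffic it saves on BMP-heavy text.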
Hans Aberg