Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Hans Aberg (haberg@math.su.se)
Date: Sun Jun 04 2006 - 07:21:44 CDT

Next message: Hans Aberg: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"

Previous message: Adam Twardoch: "Re: Glyphs for German quotation marks"
In reply to: Asmus Freytag: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"
Next in thread: Philippe Verdy: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"
Reply: Philippe Verdy: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"

On 4 Jun 2006, at 03:53, Asmus Freytag wrote:

> UTF-32 loses on all counts: it's so space inefficient that for
> large scale text processing it's swamped by cache misses,

What do you have in mind here?

> and the slight gain in efficiency for accessing character property
> values matters only for selected text corpora, such as cuneiform
> etc, that are entirely off the BMP.

This only says that for a character set confined to a particular
region, an encoding optimized for that region is more efficient,
though it will lose out in general use. It might be better to choose
a more efficient optimization method than a particular legacy encoding.
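
As a rough illustration of that size trade-off, here is a small Python
sketch that counts the bytes the same text takes in each encoding form.
The helper name encoded_sizes and the sample strings are my own
assumptions for illustration, not anything from the thread; the
cuneiform sample simply uses code points starting at U+12000.

```python
# Sketch: byte size of the same text in the three Unicode encoding forms.
# The function name and sample strings are illustrative assumptions.

def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

samples = {
    "ASCII": "hello",
    "BMP (Greek)": "\u03b1\u03b2\u03b3\u03b4\u03b5",
    "off-BMP (cuneiform)": "\U00012000\U00012001\U00012002\U00012003\U00012004",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For the five-character samples, UTF-32 always needs 20 bytes, while
UTF-8 and UTF-16 only reach that size for the text that lies entirely
off the BMP.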

> Therefore, if you need to perform more than one operation on UTF-32
> or hold large data in memory, it almost always pays to convert it
> to some other encoding form - UTF-16 being the easier conversion.

I am not sure what you have in mind here. With modern virtual
memory, the OS emulates a large data space. On 32-bit computers
this is typically 2^31 bytes (or words), but these are now on the
way out, in favor of 64-bit computers with an even larger address
space.
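
To put a number on it, a back-of-the-envelope sketch in Python,
assuming (unrealistically) that the whole 2^31-byte space held nothing
but text. The names ADDRESS_SPACE_BYTES and capacity are mine, for
illustration only.

```python
# Sketch: how many code points fit in a 2^31-byte address space,
# assuming the whole space held nothing but text (an idealization).

ADDRESS_SPACE_BYTES = 2 ** 31  # figure quoted above for 32-bit machines

def capacity(bytes_per_code_point):
    """Code points that fit when each one takes a fixed number of bytes."""
    return ADDRESS_SPACE_BYTES // bytes_per_code_point

utf32_capacity = capacity(4)      # UTF-32: always 4 bytes per code point
utf16_bmp_capacity = capacity(2)  # UTF-16: 2 bytes per BMP character
print(utf32_capacity, utf16_bmp_capacity)
```

That is 2^29 code points in UTF-32 against 2^30 BMP characters in
UTF-16, so a factor of two at most for BMP text.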

Hans Aberg
