Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 03 2004 - 16:36:48 CST


    > From: "Asmus Freytag" <asmusf@ix.netcom.com>
    >> A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
    >>
    >> 1) 1 extra test per character (to see whether it's a surrogate)
    >>
    >> 2) special handling every 100 to 1000 characters (say 10 instructions)
    >>
    >> 3) additional cost of accessing 16-bit registers (per character)
    >>
    >> 4) reduction in cache misses (each the equivalent of many instructions)
    >>
    >> 5) reduction in disk access (each the equivalent of many, many
    >> instructions)
    >> (...)
    >> For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each
    >> occurrence depending on the architecture. Their relative weight depends
    >> not only on cache sizes, but also on how many other instructions per
    >> character are performed. For text scanning operations, their cost
    >> does predominate with large data sets.
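
    For concreteness, the "extra test" in your point 1 and the "special
    handling" in point 2 amount to something like the following minimal C
    sketch (the function name is my own, and it assumes well-formed UTF-16
    input):

    #include <stdint.h>
    #include <stddef.h>

    static size_t count_codepoints_utf16(const uint16_t *s, size_t n)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            /* point 1: one extra test per 16-bit code unit */
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                i++;  /* point 2: rarely-taken branch, skip the low surrogate */
            count++;
        }
        return count;
    }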

    I tend to disagree with you on points 4 and 5: cache misses and disk
    accesses (more commonly referred to as "data locality" in computing
    performance) actually favor UTF-16 over UTF-32, simply because UTF-16
    will be more compact for almost every text you need to process, unless
    you are working on texts that only contain characters from a script *not
    present at all* in the BMP (this excludes Han, even though there are
    tons of ideographs outside the BMP, because those ideographs are almost
    never used alone: they appear only occasionally, scattered among tons of
    conventional Han characters that are in the BMP).

    Given that these scripts are all historic, or were encoded for technical
    purposes with very specific usage, the vast majority of texts will not
    use significant numbers of characters outside the BMP, so surrogate
    pairs will remain a small minority of the code units in UTF-16. And in
    all cases, even for texts made only of characters outside the BMP,
    UTF-16 cannot be larger than UTF-32: a supplementary character costs two
    16-bit code units (4 bytes), exactly the size of one UTF-32 code unit.
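
    A minimal sketch of that size argument, in C (the function name and the
    lack of validation are my own; it assumes a valid Unicode scalar value):

    #include <stdint.h>

    /* A code point is either one 16-bit unit (2 bytes, half the UTF-32
       size) or one surrogate pair (4 bytes, exactly the UTF-32 size). */
    static int utf16_encode(uint32_t cp, uint16_t out[2])
    {
        if (cp < 0x10000) {                 /* BMP: one unit, 2 bytes */
            out[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;                      /* supplementary: two units, 4 bytes */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }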

    The only case where it would be worse than UTF-32 is for the internal
    representation of strings in memory, on an architecture where a 16-bit
    code unit cannot be stored in only 16 bits: for example, if memory cells
    are not individually addressable below units of at least 32 bits, and
    the CPU is very inefficient when working with 16-bit bitfields within
    32-bit memory units or registers, due to the extra shift and mask
    operations needed to pack and unpack two 16-bit bitfields in a single
    32-bit memory cell.
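
    To make that shift-and-mask overhead concrete, here is a sketch of what
    every 16-bit access would cost on such a hypothetical word-addressed
    machine (the helper names and the little-endian packing order are my own
    assumptions):

    #include <stdint.h>
    #include <stddef.h>

    static uint16_t get_unit16(const uint32_t *cells, size_t i)
    {
        uint32_t cell  = cells[i >> 1];               /* one 32-bit load */
        unsigned shift = (i & 1) * 16;                /* which half of the cell? */
        return (uint16_t)((cell >> shift) & 0xFFFF);  /* shift + mask */
    }

    static void put_unit16(uint32_t *cells, size_t i, uint16_t v)
    {
        unsigned shift = (i & 1) * 16;
        uint32_t mask  = (uint32_t)0xFFFF << shift;
        /* read-modify-write: load, clear the old field, merge, store */
        cells[i >> 1] = (cells[i >> 1] & ~mask) | ((uint32_t)v << shift);
    }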

    I doubt that such an architecture would be very successful, given that
    too many standard protocols depend on being able to work with data
    streams made of 8-bit bytes: on such an architecture, all data I/O would
    need to store each 8-bit byte in a separate, addressable 32-bit memory
    cell, which would be a very poor use of the available central memory
    (such an architecture would require much more RAM to reach equivalent
    I/O performance, and even the very costly fast RAM caches would need to
    grow a lot, meaning higher hardware construction costs).

    So even on such 32-bit-only (or 64-bit-only...) architectures (where,
    for example, the C datatype "char" would be 32-bit or 64-bit), there
    would be efficient CPU instructions for packing and unpacking bytes in
    32-bit (or 64-bit) memory cells (or at least at the register level, with
    instructions that operate efficiently on such bitfields).
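
    In C terms, such an instruction would replace a helper like the
    hypothetical one below; compare the EXTBL/INSBL instructions with which
    the original DEC Alpha, which lacked byte loads and stores, did roughly
    this in a single instruction:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical byte accessor for a machine whose smallest addressable
       cell is 32 bits: four 8-bit bytes packed per cell. */
    static uint8_t get_byte(const uint32_t *cells, size_t i)
    {
        return (uint8_t)(cells[i >> 2] >> ((i & 3) * 8)); /* load, shift, truncate */
    }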


