Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 03 2004 - 16:36:48 CST


    > From: "Asmus Freytag" <asmusf@ix.netcom.com>
    >> A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
    >>
    >> 1) 1 extra test per character (to see whether it's a surrogate)
    >>
    >> 2) special handling every 100 to 1000 characters (say 10 instructions)
    >>
    >> 3) additional cost of accessing 16-bit registers (per character)
    >>
    >> 4) reduction in cache misses (each the equivalent of many instructions)
    >>
    >> 5) reduction in disk access (each the equivalent of many, many
    >> instructions)
    >> (...)
    >> For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each
    >> occurrence depending on the architecture. Their relative weight depends
    >> not only on cache sizes, but also on how many other instructions per
    >> character are performed. For text scanning operations, their cost
    >> does predominate with large data sets.
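
    For concreteness, the "extra test" in your point 1 and the "special
    handling" in point 2 amount to something like the following minimal C
    sketch (the function name is my own, and it assumes well-formed UTF-16
    input):

    #include <stdint.h>
    #include <stddef.h>

    static size_t count_codepoints_utf16(const uint16_t *s, size_t n)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            /* point 1: one extra test per 16-bit code unit */
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                i++;  /* point 2: rarely-taken branch, skip the low surrogate */
            count++;
        }
        return count;
    }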

    I tend to disagree with you on points 4 and 5: cache misses and disk
    accesses (more commonly referred to as "data locality" in computing
    performance) actually favor UTF-16 over UTF-32, simply because UTF-16
    will be more compact for almost every text you need to process, unless
    you are working on texts that only contain characters from a script *not
    present at all* in the BMP (this excludes Han, even though there are
    tons of ideographs outside the BMP, because those ideographs are almost
    never used alone: they appear only occasionally, scattered among tons of
    conventional Han characters that are in the BMP).

    Given that these scripts are all historic, or were encoded for technical
    purposes with very specific usage, the vast majority of texts will not
    use significant numbers of characters outside the BMP, so surrogate
    pairs will remain a small minority of the code units in UTF-16. And in
    all cases, even for texts made only of characters outside the BMP,
    UTF-16 cannot be larger than UTF-32: a supplementary character costs two
    16-bit code units (4 bytes), exactly the size of one UTF-32 code unit.
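
    A minimal sketch of that size argument, in C (the function name and the
    lack of validation are my own; it assumes a valid Unicode scalar value):

    #include <stdint.h>

    /* A code point is either one 16-bit unit (2 bytes, half the UTF-32
       size) or one surrogate pair (4 bytes, exactly the UTF-32 size). */
    static int utf16_encode(uint32_t cp, uint16_t out[2])
    {
        if (cp < 0x10000) {                 /* BMP: one unit, 2 bytes */
            out[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;                      /* supplementary: two units, 4 bytes */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }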

    The only case where it would be worse than UTF-32 is for the internal
    representation of strings in memory, on an architecture where a 16-bit
    code unit cannot be stored in only 16 bits: for example, if memory cells
    are not individually addressable below units of at least 32 bits, and
    the CPU is very inefficient when working with 16-bit bitfields within
    32-bit memory units or registers, due to the extra shift and mask
    operations needed to pack and unpack two 16-bit bitfields in a single
    32-bit memory cell.
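
    To make that shift-and-mask overhead concrete, here is a sketch of what
    every 16-bit access would cost on such a hypothetical word-addressed
    machine (the helper names and the little-endian packing order are my own
    assumptions):

    #include <stdint.h>
    #include <stddef.h>

    static uint16_t get_unit16(const uint32_t *cells, size_t i)
    {
        uint32_t cell  = cells[i >> 1];               /* one 32-bit load */
        unsigned shift = (i & 1) * 16;                /* which half of the cell? */
        return (uint16_t)((cell >> shift) & 0xFFFF);  /* shift + mask */
    }

    static void put_unit16(uint32_t *cells, size_t i, uint16_t v)
    {
        unsigned shift = (i & 1) * 16;
        uint32_t mask  = (uint32_t)0xFFFF << shift;
        /* read-modify-write: load, clear the old field, merge, store */
        cells[i >> 1] = (cells[i >> 1] & ~mask) | ((uint32_t)v << shift);
    }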

    I doubt that such an architecture would be very successful, given that
    too many standard protocols depend on being able to work with data
    streams made of 8-bit bytes: on such an architecture, all data I/O would
    need to store each 8-bit byte in a separate, addressable 32-bit memory
    cell, which would be a very poor use of the available central memory
    (such an architecture would require much more RAM to reach equivalent
    I/O performance, and even the very costly fast RAM caches would need to
    grow a lot, meaning higher hardware construction costs).

    So even on such 32-bit-only (or 64-bit-only...) architectures (where,
    for example, the C datatype "char" would be 32-bit or 64-bit), there
    would be efficient CPU instructions for packing and unpacking bytes in
    32-bit (or 64-bit) memory cells (or at least at the register level, with
    instructions that operate efficiently on such bitfields).
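
    In C terms, such an instruction would replace a helper like the
    hypothetical one below; compare the EXTBL/INSBL instructions with which
    the original DEC Alpha, which lacked byte loads and stores, did roughly
    this in a single instruction:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical byte accessor for a machine whose smallest addressable
       cell is 32 bits: four 8-bit bytes packed per cell. */
    static uint8_t get_byte(const uint32_t *cells, size_t i)
    {
        return (uint8_t)(cells[i >> 2] >> ((i & 3) * 8)); /* load, shift, truncate */
    }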


