Re: Nicest UTF

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Dec 03 2004 - 11:56:38 CST

    That's a good response. I would add a couple of other factors:

    - What APIs will you be using? If most of the APIs take/return a particular
    UTF, the cost of constant conversions will swamp many if not most other
    performance considerations. (A sketch of that cost follows this list.)
    - Asmus mentioned memory, but I'd like to add to that. When you are using
    virtual memory, significant increases in memory usage will cause a
    considerable slowdown because of swapping. This is especially important in
    server environments.
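
    A minimal sketch of that conversion cost (platform_draw_text and
    utf8_to_utf16 below are hypothetical stand-ins, not real APIs): if the
    internal text is UTF-8 but the platform call wants UTF-16, every call
    pays a full conversion pass over the string.

        #include <stdlib.h>

        /* hypothetical platform entry point taking UTF-16 */
        extern void platform_draw_text(const unsigned short *utf16, size_t n);
        /* hypothetical converter; returns UTF-16 units written */
        extern size_t utf8_to_utf16(const char *src, size_t srclen,
                                    unsigned short *dst, size_t dstcap);

        void draw_utf8(const char *utf8, size_t len)
        {
            /* worst case: one UTF-16 unit per UTF-8 byte (all ASCII) */
            unsigned short *buf = malloc(len * sizeof *buf);
            if (!buf)
                return;
            size_t n = utf8_to_utf16(utf8, len, buf, len);
            platform_draw_text(buf, n);  /* paid on every single call */
            free(buf);
        }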

    Mark

    ----- Original Message -----
    From: "Asmus Freytag" <asmusf@ix.netcom.com>
    To: "Doug Ewell" <dewell@adelphia.net>; "Unicode Mailing List"
    <unicode@unicode.org>
    Sent: Friday, December 03, 2004 07:55
    Subject: Re: Nicest UTF

    > At 09:56 PM 12/2/2004, Doug Ewell wrote:
    > >I use ... and UTF-32 for most internal processing that I write
    > >myself. Let people say UTF-32 is wasteful if they want; I don't tend to
    > >store huge amounts of text in memory at once, so the overhead is much
    > >less important than one code unit per character.
    >
    >
    > For performance-critical applications on the other hand, you need to use
    > whichever UTF gives you the correct balance in speed and average storage
    > size for your data.
    >
    > If you have very large amounts of data, you'll be sensitive to cache
    > overruns, enough so that UTF-32 may be disqualified from the start.
    > I have encountered systems for which that was true.
    >
    > If your 'per character' operations are based on parsing for ASCII symbols,
    > e.g. HTML parsing, then both UTF-8 and UTF-16 allow you to process your
    > data directly, w/o need to worry about the longer sequences. For such
    > tasks, some processors may work faster when operating in 32-bit
    > chunks.
    >
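    > A minimal sketch of that direct scan (the function name is mine, not
    > from any real parser): in UTF-8, bytes below 0x80 never occur inside
    > a multi-byte sequence, so a parser can hunt for '<' byte by byte
    > without decoding anything. The same holds for ASCII code units in
    > UTF-16, which never occur as part of a surrogate pair.
    >
    >     #include <stddef.h>
    >
    >     const char *find_tag_open(const char *utf8, size_t len)
    >     {
    >         for (size_t i = 0; i < len; i++)
    >             if (utf8[i] == '<')   /* safe: never mid-sequence */
    >                 return utf8 + i;
    >         return NULL;
    >     }
    >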
    > However, many 'inner loop' algorithms, such as copy, can be implemented
    > using native machine words, handling multiple characters, or parts of
    > characters, at once, independent of the UTF.
    >
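    > For instance, a minimal sketch of such a word-at-a-time copy (a
    > hand-rolled stand-in for what memcpy does anyway): the loop never
    > looks at character boundaries at all, so its cost per byte is the
    > same whichever UTF the buffer holds.
    >
    >     #include <stddef.h>
    >     #include <stdint.h>
    >     #include <string.h>
    >
    >     void copy_text(unsigned char *dst, const unsigned char *src,
    >                    size_t nbytes)
    >     {
    >         size_t i = 0;
    >         /* move whole machine words; each may hold several code
    >          * units, or pieces of a multi-unit character */
    >         for (; i + sizeof(uintptr_t) <= nbytes; i += sizeof(uintptr_t)) {
    >             uintptr_t w;
    >             memcpy(&w, src + i, sizeof w);  /* alignment-safe load */
    >             memcpy(dst + i, &w, sizeof w);
    >         }
    >         for (; i < nbytes; i++)             /* leftover bytes */
    >             dst[i] = src[i];
    >     }
    >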
    > And even in those situations, those savings had better not be
    > offset by cache limitations.
    >
    > A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
    >
    > 1) 1 extra test per character (to see whether it's a surrogate)
    >
    > 2) special handling every 100 to 1000 characters (say 10 instructions)
    >
    > 3) additional cost of accessing 16-bit registers (per character)
    >
    > 4) reduction in cache misses (each the equivalent of many instructions)
    >
    > 5) reduction in disk accesses (each the equivalent of many, many
    > instructions)
    >
    > For many operations, e.g. string length, both 1 and 2 are no-ops,
    > so you need to apply a reduction factor based on the mix of operations
    > you do perform, say 50%-75%.
    >
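    > Where you do decode, a minimal sketch of what items 1 and 2 look
    > like, assuming a code-point count over UTF-16: one range check per
    > unit, plus a skip for the occasional surrogate pair. The UTF-32
    > loop body would be a bare increment.
    >
    >     #include <stddef.h>
    >
    >     size_t utf16_codepoints(const unsigned short *s, size_t units)
    >     {
    >         size_t count = 0;
    >         for (size_t i = 0; i < units; i++) {
    >             count++;
    >             /* item 1: one extra test per unit */
    >             if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
    >                 i++;   /* item 2: skip the trail surrogate */
    >         }
    >         return count;
    >     }
    >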
    > For many processors, item 3 is not an issue.
    >
    > For 4 and 5, the multiplier is somewhere in the 100s or 1000s for each
    > occurrence, depending on the architecture. Their relative weight depends
    > not only on cache sizes, but also on how many other instructions per
    > character are performed. For text scanning operations, their cost
    > does predominate with large data sets.
    >
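    > To plug in some round numbers (assumptions of my own, purely for
    > illustration): 1 extra test per character, plus 10 instructions per,
    > say, 300 characters, puts items 1 and 2 at about 1.03 instructions
    > per character. A single avoided cache miss at ~300 cycles then pays
    > for roughly 300 characters of that overhead, so once the smaller
    > encoding saves even a modest fraction of misses on a large data set,
    > items 4 and 5 dominate.
    >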
    > Given this little model and some additional assumptions about your
    > own project(s), you should be able to determine the 'nicest' UTF for
    > your own performance-critical case.
    >
    > A./
    >


