If only MS Word was coded this well (was Re: Nicest UTF)

From: Theodore H. Smith (delete@elfdata.com)
Date: Tue Dec 07 2004 - 16:42:40 CST


    > From: "D. Starner" <shalesller@writeme.com>

    > (Sorry for sending this twice, Marcin.)
    >
    > "Marcin 'Qrczak' Kowalczyk" writes:
    >> UTF-8 is poorly suitable for internal processing of strings in a
    >> modern programming language (i.e. one which doesn't already have a
    >> pile of legacy functions working on bytes, but which can be designed
    >> to make Unicode convenient at all). It's because code points have
    >> variable lengths in bytes, so extracting individual characters is
    >> almost meaningless

    Same with UTF-16 and UTF-32. A character can be multiple code points,
    remember? (Decomposed characters?)
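
    To make the decomposed-character point concrete, here is a minimal
    Python sketch (not code from this thread). The helper name
    unit_counts is hypothetical; it only counts storage units for the
    same visible character in its composed and decomposed forms.

```python
import unicodedata

composed = "\u00e9"                                  # 'é' as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + U+0301 COMBINING ACUTE ACCENT

def unit_counts(s):
    """Return (code points, UTF-8 bytes, UTF-16 code units, UTF-32 code units)."""
    return (
        len(s),
        len(s.encode("utf-8")),
        len(s.encode("utf-16-le")) // 2,
        len(s.encode("utf-32-le")) // 4,
    )

print(unit_counts(composed))    # (1, 2, 1, 1)
print(unit_counts(decomposed))  # (2, 3, 2, 2)
```

    Even in UTF-32 the decomposed form takes two units, so "one unit per
    character" does not hold in any of the three encodings.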

    >> (unless you care only about the ASCII subset, and
    >> sequences of all other characters are treated as non-interpreted bags
    >> of bytes).

    Nope. I've done tons of UTF-8 string processing. I've even done a
    case-insensitive word-frequency measuring algorithm on UTF-8. It runs
    blastingly fast, because I can do the processing with bytes.

    It just requires you to understand the actual logic of UTF-8 well
    enough to know that you can treat it as bytes, most of the time.
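
    One case where treating UTF-8 as bytes is safe is substring search:
    lead bytes and continuation bytes occupy disjoint ranges, so a valid
    UTF-8 needle can only match a valid UTF-8 haystack on a character
    boundary. A minimal Python sketch follows, assuming both inputs are
    valid UTF-8; find_all is a hypothetical helper, not code from this
    thread.

```python
def find_all(haystack: bytes, needle: bytes):
    """Return the byte offset of every occurrence of needle in haystack."""
    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets

text = "naïve café, naïve thé".encode("utf-8")
needle = "naïve".encode("utf-8")
hits = find_all(text, needle)
print(hits)  # [0, 14]

# No hit starts on a continuation byte (bit pattern 10xxxxxx).
for pos in hits:
    assert (text[pos] & 0xC0) != 0x80
```

    No decoding happens anywhere in the search; the offsets are byte
    offsets, which is what a byte-oriented caller wants anyway.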

    And the times you can't treat it as bytes, usually you can't even treat
    UTF-32 as bytes!

    If you are talking about creating an editfield or text control or
    something, it is true that UTF-32 is better. However, UTF-16 is the
    worst of all cases; you'd be better off using UTF-8 as the native
    encoding of an editfield.
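
    For reference, UTF-16 is variable-width too: anything outside the
    Basic Multilingual Plane takes a surrogate pair. A small Python
    sketch, with utf16_units as a hypothetical helper name:

```python
def utf16_units(ch):
    """Return the UTF-16 code units that encode a single code point."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

bmp = utf16_units("\u20ac")         # EURO SIGN, inside the BMP
astral = utf16_units("\U0001d11e")  # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in bmp])     # ['0x20ac']
print([hex(u) for u in astral])  # ['0xd834', '0xdd1e']
```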

    The thing is, very very very few people write editfields.

    I've seen tons of XML parsers in my lifetime (at least 3 I wrote
    myself), but only a few editfield libraries.

    It's a shame that very few people understand the different UTFs properly.

    As for isspace... sure, there are spaces that take more than one byte
    in UTF-8.

    My case-insensitive UTF-8 word frequency counter (which runs blastingly
    fast), however, didn't find this to be any problem. It dealt with all
    sorts of non-single-byte word breaks :o)
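
    As a rough illustration of the idea (not the counter described
    here, whose source is not shown in this message), a byte-level word
    counter can be sketched in Python. The separator list is an
    assumption and covers only a small subset of Unicode spaces, and
    case folding is limited to ASCII letters because bytes.lower() does
    not touch bytes above 0x7F.

```python
import re
from collections import Counter

# Separators as byte patterns: ASCII non-alphanumerics plus a few
# multi-byte spaces (an illustrative subset, not the full Unicode list).
SEPARATORS = re.compile(
    rb"(?:[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]"  # ASCII controls, space, punctuation
    rb"|\xc2\xa0"                                  # U+00A0 NO-BREAK SPACE
    rb"|\xe2\x80[\x80-\x8a]"                       # U+2000..U+200A typographic spaces
    rb"|\xe3\x80\x80)+"                            # U+3000 IDEOGRAPHIC SPACE
)

def word_frequencies(data: bytes) -> Counter:
    """Count words in UTF-8 data without ever decoding it."""
    counts = Counter()
    # bytes.lower() folds ASCII letters only; other bytes pass through.
    for word in SEPARATORS.split(data.lower()):
        if word:
            counts[word] += 1
    return counts

sample = "The café\u00a0the Café\u3000THE".encode("utf-8")
freq = word_frequencies(sample)
print(freq[b"the"], freq["café".encode("utf-8")])  # 3 2
```

    Every multi-byte separator pattern starts with a lead byte, so a
    match cannot begin in the middle of another character. Folding case
    for non-ASCII letters would need extra byte-sequence mappings that
    this sketch leaves out.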

    It appears to run at about 3MB/second on my laptop, and for every
    word that involves doing a lookup against the entire previous
    collection of words.

    That's like having MS Word spell-check 3MB of pure Unicode text (no
    style junk bloating up the file size) in one second, for you. (The
    words would all have to be spelt correctly though, so as not to
    require expensive RAM copying when doing the replacements.)

    Yes, I do know how to code ;o)

    Too bad so few others do.

    --
        Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
        Industrial strength string processing code, made easy.
        (If you believe that's an oxymoron, see for yourself.)
    

