If only MS Word was coded this well (was Re: Nicest UTF)

From: Theodore H. Smith (delete@elfdata.com)
Date: Tue Dec 07 2004 - 16:42:40 CST

Next message: Philippe Verdy: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"

Previous message: Philippe Verdy: "Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ..."
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: If only MS Word was coded this well"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: If only MS Word was coded this well"

> From: "D. Starner" <shalesller@writeme.com>

> (Sorry for sending this twice, Marcin.)
>
> "Marcin 'Qrczak' Kowalczyk" writes:
>> UTF-8 is poorly suitable for internal processing of strings in a
>> modern programming language (i.e. one which doesn't already have a
>> pile of legacy functions working on bytes, but which can be designed
>> to make Unicode convenient at all). It's because code points have
>> variable lengths in bytes, so extracting individual characters is
>> almost meaningless

The same goes for UTF-16 and UTF-32. A character can be multiple code
points, remember? (Decomposed characters?)
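
To make that concrete, here is a small Python sketch (my own
illustration; the helper name unit_counts is made up for this example).
It counts the code units that a precomposed and a decomposed "é" take
in each encoding form:

```python
import unicodedata

composed = "\u00e9"                                  # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + U+0301 combining acute

def unit_counts(s):
    """Count code points and code units of s (hypothetical helper)."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "utf32_units": len(s.encode("utf-32-le")) // 4,
    }

print(unit_counts(composed))    # 1 code point, 2 UTF-8 bytes
print(unit_counts(decomposed))  # 2 code points, 3 UTF-8 bytes
```

The decomposed form is one user-visible character but two units even
in UTF-32, so "one unit = one character" fails in every UTF.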

>> (unless you care only about the ASCII subset, and
>> sequences of all other characters are treated as non-interpreted bags
>> of bytes).

Nope. I've done tons of UTF-8 string processing. I've even written a
case-insensitive word-frequency counter that works on UTF-8. It runs
blastingly fast, because I can do the processing with bytes.

It just requires you to understand the actual logic of UTF-8 well
enough to know that you can treat it as bytes, most of the time.
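
A rough sketch of what "treat it as bytes" can look like, in Python
(not my actual code; the function names are invented for this
example). It relies on one property of UTF-8: continuation bytes look
like 10xxxxxx and lead bytes never do, so a valid UTF-8 needle found in
valid UTF-8 text always starts on a character boundary.

```python
def is_char_boundary(buf, i):
    """True if byte offset i is not a UTF-8 continuation byte (10xxxxxx)."""
    return i == len(buf) or (buf[i] & 0xC0) != 0x80

def find_all(haystack, needle):
    """Byte offsets of every non-overlapping occurrence of needle."""
    hits, start = [], 0
    while True:
        pos = haystack.find(needle, start)  # plain byte search, no decoding
        if pos == -1:
            return hits
        hits.append(pos)
        start = pos + len(needle)

text = "caf\u00e9 na\u00efve caf\u00e9".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")
hits = find_all(text, needle)
print(hits)  # [0, 13]
```

Every hit lands on a boundary, so nothing has to be decoded to code
points first. This assumes both buffers are valid UTF-8.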

And the times you can't treat it as bytes, usually you can't even treat
UTF-32 as bytes!

If you are talking about creating an editfield or text control or
something, it is true that UTF-32 is better. However, UTF-16 is the
worst of all cases; you'd be better off using UTF-8 as the native
encoding of an editfield.

The thing is, very very very few people write editfields.

I've seen tons of XML parsers in my lifetime (at least 3 I wrote
myself), but only a few editfield libraries.

It's a shame that very few people understand the different UTFs properly.

As for isspace... sure, some space characters take more than one byte
in UTF-8.
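
One way to handle that at the byte level, sketched in Python. The list
of spaces here is a deliberately incomplete sample picked for
illustration, and space_len is a made-up name:

```python
# Sample white-space code points (assumption: incomplete, illustration only),
# stored as their UTF-8 byte sequences.
SPACES = tuple(
    ch.encode("utf-8")
    for ch in (" ", "\t", "\n", "\u00a0", "\u2003", "\u3000")
)

def space_len(buf, i):
    """Length in bytes of the space sequence starting at offset i, or 0."""
    for seq in SPACES:
        if buf.startswith(seq, i):
            return len(seq)
    return 0

buf = "a\u3000b\u00a0c d".encode("utf-8")
print(space_len(buf, 1))  # 3: U+3000 is E3 80 80
print(space_len(buf, 5))  # 2: U+00A0 is C2 A0
```

A byte-wise isspace() can't see these, but matching the whole byte
sequence works without decoding anything.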

My case-insensitive UTF-8 word-frequency counter (which runs
blastingly fast), however, didn't find this to be any problem. It dealt
with all sorts of non-single-byte word breaks :o)

It appears to run at about 3MB/second on my laptop, which involves,
for every word, checking it against the entire collection of words
seen so far.
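
For anyone wondering what such a counter might look like, here is a
minimal Python sketch. It is not my plugin's code. Assumptions: the
separator set is a small sample, case folding covers ASCII letters only
(bytes.lower() leaves bytes >= 0x80 alone), and the per-word lookup is
a hash table via collections.Counter.

```python
import re
from collections import Counter

# Word breaks as UTF-8 byte sequences: ASCII controls, space and punctuation,
# plus three multi-byte spaces (U+00A0, U+2003, U+3000). Sample set only.
_SEP = re.compile(
    rb"(?:[\x00-\x20!-/:-@\[-\x60{-\x7f]|\xc2\xa0|\xe2\x80\x83|\xe3\x80\x80)+"
)

def word_counts(data):
    """Count words in UTF-8 bytes without decoding to code points."""
    folded = data.lower()  # ASCII-only case folding (stated limitation)
    return Counter(w for w in _SEP.split(folded) if w)

sample = "Caf\u00e9 caf\u00e9\u3000CAF\u00e9, th\u00e9".encode("utf-8")
counts = word_counts(sample)
print(counts["caf\u00e9".encode("utf-8")])  # 3
```

Non-ASCII letters pass through untouched, so folding "É" to "é" would
need an extra byte-sequence table on top of this.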

That's like having MS Word spell-check 3MB of pure Unicode text (no
style junk bloating up the file size) in one second, for you. (The
words would all have to be spelt correctly though, so as to not require
expensive RAM copying when doing the replacements.)

Yes, I do know how to code ;o)

Too bad so few others do.

--
    Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
    Industrial strength string processing code, made easy.
    (If you believe that's an oxymoron, see for yourself.)

