From: Theodore H. Smith (delete@elfdata.com)
Date: Tue Dec 07 2004 - 16:42:40 CST
> From: "D. Starner" <shalesller@writeme.com>
> (Sorry for sending this twice, Marcin.)
>
> "Marcin 'Qrczak' Kowalczyk" writes:
>> UTF-8 is poorly suitable for internal processing of strings in a
>> modern programming language (i.e. one which doesn't already have a
>> pile of legacy functions working of bytes, but which can be designed
>> to make Unicode convenient at all). It's because code points have
>> variable lengths in bytes, so extracting individual characters is
>> almost meaningless
Same with UTF-16 and UTF-32. A single character can span multiple
code points, remember? (Decomposed characters?)
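To make the decomposed-character point concrete, here is a minimal Python sketch. It is an illustration only and not code from this thread; it uses the standard `unicodedata` module to decompose "é" and counts the code units each encoding needs.

```python
import unicodedata

composed = "\u00e9"  # "é" as one precomposed code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" + U+0301 combining acute

# One user-visible character, but two code points after decomposition.
print(len(composed))    # 1
print(len(decomposed))  # 2

# Code units needed for the decomposed form in each encoding.
utf8_units = len(decomposed.encode("utf-8"))            # 3 bytes
utf16_units = len(decomposed.encode("utf-16-le")) // 2  # 2 sixteen-bit units
utf32_units = len(decomposed.encode("utf-32-le")) // 4  # 2 thirty-two-bit units
print(utf8_units, utf16_units, utf32_units)
```

Even in UTF-32 the character occupies more than one unit, so indexing by code unit does not give indexing by character in any of the three encodings.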
>> (unless you care only about the ASCII subset, and
>> sequences of all other characters are treated as non-interpreted bags
>> of bytes).
Nope. I've done tons of UTF-8 string processing. I've even written a
case-insensitive word-frequency counter that works on UTF-8. It runs
blastingly fast, because I can do the processing with bytes.
It just requires you to understand the actual logic of UTF-8 well
enough to know that you can treat it as bytes, most of the time.
And the times you can't treat it as bytes, usually you can't even treat
UTF-32 as bytes!
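As a rough illustration of treating UTF-8 as bytes (a sketch, not the code described in this message): in valid UTF-8, lead bytes and continuation bytes occupy disjoint ranges, so a plain byte-wise search for a valid UTF-8 needle can only match at a character boundary. The helper name `find_all` and the sample strings are made up for the example.

```python
def find_all(haystack: bytes, needle: bytes):
    """Return every byte offset where needle occurs in haystack."""
    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets

text = "naïve café, naïve thé".encode("utf-8")
needle = "naïve".encode("utf-8")

# Plain byte search; no decoding to code points is needed.
hits = find_all(text, needle)

# A continuation byte has the bit pattern 10xxxxxx. None of the hits
# starts on one, so every match begins at a character boundary.
on_boundary = all((text[p] & 0xC0) != 0x80 for p in hits)
print(hits, on_boundary)
```

The same property is what lets byte-oriented routines such as splitting on ASCII delimiters work unchanged on UTF-8 input.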
If you are talking about creating an editfield or text control or
something, it is true that UTF-32 is better. However, UTF-16 is the
worst of all cases; you'd be better off using UTF-8 as the native
encoding of an editfield.
The thing is, very very very few people write editfields.
I've seen tons of XML parsers in my lifetime (at least 3 I wrote
myself), but only a few editfield libraries.
It's a shame that very few people understand the different UTFs properly.
As for isspace... sure, there are space characters that take more than
one byte in UTF-8. My case-insensitive UTF-8 word-frequency counter
(which runs blastingly fast) however didn't find this to be any problem.
It dealt with all sorts of multi-byte word breaks :o)
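For readers who want to see the shape of such a counter, here is a small Python sketch that works directly on UTF-8 bytes. It is not the implementation described in this message. The separator set is an assumption (ASCII whitespace and punctuation plus a handful of multi-byte spaces), and case folding is ASCII-only because `bytes.lower()` leaves non-ASCII bytes alone.

```python
import re
from collections import Counter

# Assumed separator set: ASCII controls, whitespace and punctuation,
# plus a few multi-byte space characters written as their UTF-8 bytes.
SEPARATORS = re.compile(
    rb"(?:[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]"
    rb"|\xc2\xa0"              # U+00A0 NO-BREAK SPACE
    rb"|\xe2\x80[\x80-\x8a]"   # U+2000..U+200A typographic spaces
    rb"|\xe3\x80\x80)+"        # U+3000 IDEOGRAPHIC SPACE
)

def word_frequencies(data: bytes) -> Counter:
    """Count words in UTF-8 bytes without decoding to code points."""
    # bytes.lower() folds only ASCII letters; non-ASCII bytes pass through.
    words = (w.lower() for w in SEPARATORS.split(data) if w)
    return Counter(words)

sample = "Café\u00a0noir, CAFÉ au lait\u3000café".encode("utf-8")
counts = word_frequencies(sample)
for word, n in counts.most_common():
    print(word.decode("utf-8"), n)
```

Note the limit of the ASCII-only folding: "Café" and "café" are counted together, but "CAFÉ" ends up under a separate key because É is not lowercased at the byte level. Full case-insensitivity for non-ASCII letters would need an extra folding step, which this sketch leaves out.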
It appears to run at about 3MB/second on my laptop, which involves,
for every word, a lookup against the entire collection of words seen
so far.
That's like having MS Word spell-check 3MB of pure Unicode text (no
style junk bloating up the file size) in one second. (The words would
all have to be spelt correctly though, so as not to require expensive
RAM copying when doing the replacements.)
Yes, I do know how to code ;o)
Too bad so few others do.
-- Theodore H. Smith - Software Developer - www.elfdata.com/plugin/ Industrial strength string processing code, made easy. (If you believe that's an oxymoron, see for yourself.)