From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Jun 04 2006 - 03:59:27 CDT
On 4 Jun 2006, at 02:53, Asmus Freytag wrote:
> All nice advantages. On the other hand, the minute you do text
> processing on the actual text data, such as morphological analysis,
What's that? Something like Levenshtein distance (EditDistance)? If you
mean running a Levenshtein-like algorithm on Unicode, you can't do it
with codepoint processing, because a character is not a codepoint; a
character is a string of codepoints. So if your "cells" must now be
strings instead of bytes or UInt32s... you might as well use a string
of UTF-8 instead of a string of UTF-32.
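To make that concrete, here is a minimal sketch (an illustration, not
code from my library) of Levenshtein distance where each cell is a
string of codepoints rather than a byte or UInt32. It assumes the
caller has already segmented the text into user-perceived characters,
e.g. along the lines of UAX #29 grapheme clusters:

#include <stdlib.h>
#include <string.h>

/* Edit distance over "cells", where a cell is one user-perceived
 * character stored as a NUL-terminated string of codepoints. */
size_t cell_levenshtein(const char **a, size_t alen,
                        const char **b, size_t blen)
{
    size_t *row = malloc((blen + 1) * sizeof *row);
    size_t i, j, result;

    if (!row)
        return (size_t)-1;
    for (j = 0; j <= blen; j++)
        row[j] = j;

    for (i = 1; i <= alen; i++) {
        size_t diag = row[0];          /* row[i-1][j-1] */
        row[0] = i;
        for (j = 1; j <= blen; j++) {
            size_t up = row[j];        /* row[i-1][j] */
            size_t cost = strcmp(a[i - 1], b[j - 1]) ? 1 : 0;
            size_t best = up + 1;                      /* deletion */
            if (row[j - 1] + 1 < best)                 /* insertion */
                best = row[j - 1] + 1;
            if (diag + cost < best)                    /* substitution */
                best = diag + cost;
            row[j] = best;
            diag = up;
        }
    }
    result = row[blen];
    free(row);
    return result;
}

Note that nothing in it cares whether the cell strings hold UTF-8 or
UTF-32; once the cells are strings, the encoding argument disappears.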
> case transformation,
I have case transformation code running directly on UTF-8.
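Here's a minimal sketch of why that is even possible, covering just the
ASCII range (this shows the principle, not my actual code): every byte
of a UTF-8 multi-byte sequence is >= 0x80, so a scan that only rewrites
bytes below 0x80 can never corrupt a multi-byte character.

#include <stddef.h>

/* Lowercase the ASCII letters of a UTF-8 buffer in place. Bytes
 * >= 0x80 belong to multi-byte sequences and are left untouched,
 * so no character can be corrupted. */
void utf8_ascii_tolower(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z')
            s[i] = (char)(c + ('a' - 'A'));
    }
}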
Once I figure out how to do normalisation in theory at all, on UTF-32
or anything else, I'll be able to do it on UTF-8 as well.
> linguistically aware search, etc.
I bet I could use my existing code to do this as well :) Besides, what
if your search must treat ß as equal to SS? Then you need a string
processing library, not a codepoint processing library, because the
equivalence operates on variable-length units! (I happen to have
exactly that kind of "variable length unit" processing code for UTF-8 :) )
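For illustration, a minimal sketch of such a variable-length
comparison, hard-coding just the ß/ss pair plus ASCII case
(hypothetical code; a real implementation would be driven by the
Unicode case-folding data):

#include <stdbool.h>

/* Compare two UTF-8 strings for equality under a toy fold:
 * ASCII letters compare caselessly, and "ß" (U+00DF, bytes
 * 0xC3 0x9F) on either side matches "ss"/"SS" on the other.
 * One codepoint can thus match two, which a fixed-width,
 * codepoint-at-a-time comparison cannot express. */
static bool fold_eq(const char *a, const char *b)
{
    while (*a && *b) {
        if ((unsigned char)a[0] == 0xC3 && (unsigned char)a[1] == 0x9F &&
            (b[0] == 's' || b[0] == 'S') && (b[1] == 's' || b[1] == 'S')) {
            a += 2; b += 2;                 /* ß matches ss */
        } else if ((unsigned char)b[0] == 0xC3 && (unsigned char)b[1] == 0x9F &&
                   (a[0] == 's' || a[0] == 'S') && (a[1] == 's' || a[1] == 'S')) {
            a += 2; b += 2;                 /* ss matches ß */
        } else if (*a == *b ||
                   (*a >= 'A' && *a <= 'Z' && *a + 32 == *b) ||
                   (*b >= 'A' && *b <= 'Z' && *b + 32 == *a)) {
            a++; b++;                       /* byte-wise (caseless) match */
        } else {
            return false;
        }
    }
    return *a == '\0' && *b == '\0';
}

With this, fold_eq("Straße", "STRASSE") is true even though the two
strings differ in length in both bytes and codepoints.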
> you will need to perform an implicit conversion to integral
> character values in order to get at their properties, which you
> will need to drive your algorithm.
But you need string processing to do case transformation, because in
Unicode, characters are sequences of codepoints.
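And the conversion to integral values the quote asks for is only a
local decode at the point of use, not a conversion of the whole buffer.
A minimal sketch (error handling reduced to returning U+FFFD; a
production decoder would also reject overlong forms and surrogate
values):

#include <stdint.h>
#include <stddef.h>

/* Decode the next UTF-8 sequence at *p into a codepoint, advancing
 * *p past it. The returned integer is what you feed to a property
 * table; the buffer itself stays UTF-8 throughout. */
uint32_t utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    size_t extra;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xC2) { (*p)++; return 0xFFFD; }  /* stray/overlong lead */
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else if (s[0] < 0xF5) { cp = s[0] & 0x07; extra = 3; }
    else                  { (*p)++; return 0xFFFD; }  /* invalid lead byte */

    for (size_t i = 1; i <= extra; i++) {
        if ((s[i] & 0xC0) != 0x80) { (*p)++; return 0xFFFD; }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *p = s + 1 + extra;
    return cp;
}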
I've made the same points, in response to the same objections, a few
times now in this thread. No one has found a flaw in them, and no one
has even responded to my offer to put up an example on my web host
that does Unicode case transformations.
-- http://elfdata.com/plugin/