From: Doug Ewell (dewell@adelphia.net)
Date: Fri Dec 10 2004 - 10:20:18 CST
Arcane Jill <arcanejill at ramonsky dot com> wrote:
> Here's something that's been bothering me. Suppose I write a function
> - let's call it trim(), which removes leading and trailing spaces from
> a string, represented as one of the UTFs. If I've understood this
> correctly, I'm supposed to validate the input, yes?
>
> Okay, now suppose I write a second function - let's call it tolower(),
> which lowercases a string, again represented as one of the UTFs.
> Again, I guess I'm supposed to validate the input. yes?...
This is one reason why I work with "strings" of code points, and only
convert strings of UTF code units when I read them in and write them
out. The read and write functions do the necessary validation, allowing
the rest of the code to focus on characters. If you operate directly on
strings of UTF-8 bytes, you have to worry about things like this.
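A minimal sketch of that arrangement, assuming Python, where str is already a sequence of code points. The names read_text, write_text, trim, to_lower and process are hypothetical, chosen only for illustration; validation happens only at the two boundaries.

```python
def read_text(raw: bytes) -> str:
    """Decode UTF-8 input. Strict mode raises UnicodeDecodeError on ill-formed bytes."""
    return raw.decode("utf-8", errors="strict")


def write_text(text: str) -> bytes:
    """Encode to UTF-8. Strict mode raises UnicodeEncodeError on lone surrogates."""
    return text.encode("utf-8", errors="strict")


def trim(text: str) -> str:
    """Remove leading and trailing U+0020 SPACE. No validation needed here."""
    return text.strip(" ")


def to_lower(text: str) -> str:
    """Lowercase code points. No validation needed here either."""
    return text.lower()


def process(raw: bytes) -> bytes:
    # Validate once on the way in, work on code points, encode once on the way out.
    return write_text(to_lower(trim(read_text(raw))))
```

Because trim() and to_lower() only ever see decoded strings, neither has to know anything about UTF-8 well-formedness.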
To answer your question, if you've already validated your input, and you
generate only valid output (which I hope is the case :-), and your
second function ONLY gets (valid) data from your first function, then
you probably don't need to re-validate it. But I'd hate to have to do
tolower() for non-Basic-Latin on strings of UTF-8 bytes.
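To illustrate why lowercasing directly on UTF-8 bytes is awkward, here is a small sketch, again assuming Python. The function names ascii_lower_bytes and lower_via_code_points are hypothetical. A byte-wise pass can only handle Basic Latin, and a correct lowercase mapping can change the byte length of the string, so it cannot be done in place.

```python
def ascii_lower_bytes(data: bytes) -> bytes:
    """Naive byte-wise lowercasing: only touches A-Z (0x41..0x5A)."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in data)


def lower_via_code_points(data: bytes) -> bytes:
    """Decode (validating), lowercase code points, then re-encode."""
    return data.decode("utf-8").lower().encode("utf-8")


ecole = "\u00c9COLE".encode("utf-8")   # E-acute followed by "COLE"
kelvin = "\u212a".encode("utf-8")      # KELVIN SIGN, three bytes in UTF-8

naive = ascii_lower_bytes(ecole)         # leaves the two bytes of E-acute untouched
correct = lower_via_code_points(ecole)   # lowercases E-acute as well

# KELVIN SIGN lowercases to plain "k", so the byte length shrinks.
kelvin_lower = lower_via_code_points(kelvin)
print(len(kelvin), len(kelvin_lower))
```

The byte-wise version is not wrong for pure ASCII input, but anything beyond Basic Latin needs the code point view, or a table-driven routine that understands multi-byte sequences.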
For me, conversion from any CES or TES always implies validation.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/