From: David Starner (prosfilaes@gmail.com)
Date: Mon Jul 07 2008 - 19:33:06 CDT
On Mon, Jul 7, 2008 at 7:17 PM, William J Poser <wjposer@ldc.upenn.edu> wrote:
> Of course you want to be prepared for any possible input, but
> in some cases you do know what the range of possible inputs is.
> The input may not be coming directly from the user. It may be user
> input that has already been cleaned or validated, or it may be
> data that you yourself have generated.
Most low-level string processing code shouldn't need to be rewritten
for each application. If you've got UCS-2 only code, you have to
reëvaluate it for each project, or introduce a subtle bug by the reuse
of code. If you don't reuse code, you're probably rewriting code,
which introduces bugs, especially in the parts that aren't
well-tested--which for most people will include non-BMP characters.
And just because you can clean and validate user input doesn't mean
that you should arbitrarily forbid non-BMP characters. One of the
principles of Unicode is that you can pass through arbitrary scripts
and not worry about the difference.
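
As a minimal illustrative sketch of the kind of subtle bug at issue (in Python; the helper names here are hypothetical and not from this thread), code that assumes one 16-bit unit per character gives the right answer for BMP-only text and silently miscounts once a non-BMP character arrives as a surrogate pair. The surrogate-aware count assumes well-formed UTF-16.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def count_chars_ucs2(units):
    """UCS-2 assumption: every 16-bit unit is one character."""
    return len(units)

def count_chars_utf16(units):
    """Surrogate-aware count: skip low (trailing) surrogates.

    Assumes the input is well-formed UTF-16.
    """
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

bmp_only = utf16_units("abc")
with_clef = utf16_units("a\U0001D11Eb")  # U+1D11E MUSICAL SYMBOL G CLEF

print(count_chars_ucs2(bmp_only), count_chars_utf16(bmp_only))
print(count_chars_ucs2(with_clef), count_chars_utf16(with_clef))
```

The two counts agree on the BMP-only string and disagree on the second one, which is why tests that never include a non-BMP character would not catch the difference.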
> I don't get the point. Whether you're dealing with one character or
> many, life is simpler if they're all the same size.
If I have to look up a single character in an array, it makes a
difference. If I'm looking up multiple characters, the length of any
one of them no longer matters; you're passing and returning
strings.
> But for some purposes, yes, you can assume that input is BMP-only.
> Not all input comes direct from the user.
Even for the times that you can assume integer input is positive, you
generally need to guard that code carefully with run-time tests. I
would regard nothing less as reasonable and necessary for code that
assumes the input is in the BMP. If simplicity is your goal, why not
use UTF-32?
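
A rough sketch of both options, again in Python with hypothetical names (`require_bmp` and `utf32_units` are illustrative, not from any library): a run-time guard that rejects non-BMP input before BMP-only code sees it, and a UTF-32 view in which every character occupies exactly one fixed-size unit.

```python
def require_bmp(text):
    """Return text unchanged, or raise ValueError if any code point
    lies outside the Basic Multilingual Plane (above U+FFFF)."""
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "non-BMP character U+%04X at index %d" % (ord(ch), index)
            )
    return text

def utf32_units(text):
    """Return the UTF-32 code units of text: one 32-bit unit per code point."""
    data = text.encode("utf-32-le")
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data), 4)]

# The guard passes BMP-only input through and fails loudly otherwise.
require_bmp("abc")
try:
    require_bmp("a\U0001D11E")
except ValueError as err:
    print("rejected:", err)

# With UTF-32 there is nothing to guard: the unit count is the character count.
print(len(utf32_units("a\U0001D11E")))
```

The guard keeps the BMP-only assumption explicit and checked at the boundary; the UTF-32 form removes the assumption at the cost of larger storage.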