From: William J Poser (wjposer@ldc.upenn.edu)
Date: Mon Jul 07 2008 - 20:07:28 CDT
>Most low-level string processing code shouldn't need to be rewritten
>for each application. If you've got UCS-2 only code, you have to
>reëvaluate it for each project, or introduce a subtle bug by the reuse
>of code. If you don't reuse code, you're probably rewriting code,
>which introduces bugs, especially in the parts that aren't
>well-tested--which for most people will include non-BMP characters.
True, but it depends on what code you're writing. Some things are general
purpose and should handle anything. Some things are very
application-specific.
>And just because you can clean and validate user input doesn't mean
>that you should arbitrarily forbid non-BMP characters. One of the
>principles of Unicode is that you can pass through arbitrary scripts
>and not worry about the difference.
That's a principle of Unicode, not a design requirement for particular
applications. It is perfectly appropriate to write programs
using Unicode that are aimed at a particular language or writing
system and so can make "arbitrary" assumptions.
>>I don't get the point. Whether you're dealing with one character or
>>many, life is simpler if they're all the same size.
>If I have to look up a single character in an array, it makes a
>difference. If I'm looking up multiple characters, it no longer
>matters the length of any one of them; you're passing and returning
>strings.
True, for string lookup it doesn't matter. For determining how much storage
to allocate assuming you know how many characters are in a string, it does.
For moving from character to character, it does.
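
To make the storage and stepping points concrete, here is a rough sketch in
Python. The function names are made up for illustration and are not from any
library. It only assumes the standard encoding rules: UTF-32 uses four bytes
per code point, and UTF-16 uses one 16-bit unit for BMP code points and a
surrogate pair for everything above U+FFFF.

```python
from typing import Sequence


def utf32_bytes(s: str) -> int:
    """Storage for s in UTF-32 (no BOM): the character count alone decides it."""
    return 4 * len(s)


def utf16_bytes(s: str) -> int:
    """Storage for s in UTF-16 (no BOM): every character must be inspected."""
    # Code points above U+FFFF need a surrogate pair (two 16-bit units).
    return sum(4 if ord(ch) > 0xFFFF else 2 for ch in s)


def next_index_utf16(units: Sequence[int], i: int) -> int:
    """Index of the character after the one starting at units[i]."""
    # A high surrogate followed by a low surrogate is one character in two units.
    if (0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF):
        return i + 2
    return i + 1
```

With fixed-width units the "next character" step is just i + 1 and the
allocation is a multiplication; with UTF-16 both need a test on the data.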
>>But for some purposes, yes, you can assume that input is BMP-only.
>>Not all input comes direct from the user.
>Even for the times that you can assume integer input is positive, you
>generally need to guard that code carefully with run-time tests. I
>would regard nothing less as reasonable and necessary for code that
>assumes the input is in the BMP.
I agree that you need run-time tests, but you can't put them everywhere.
You have to have a few points at which you test and elsewhere rely
on your code not to take the data out of bounds.
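
As a sketch of what I mean by testing at a few points, something like the
following could sit at the place where data enters the program. The name
require_bmp is hypothetical, and raising an error is just one possible policy;
code past this check then relies on the guarantee instead of re-testing.

```python
def require_bmp(text: str) -> str:
    """Return text unchanged if every code point is in the BMP, else raise."""
    for pos, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            # Reject at the boundary so later code can assume BMP-only input.
            raise ValueError(
                f"non-BMP character U+{ord(ch):04X} at index {pos}"
            )
    return text
```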
>If simplicity is your goal, why not use UTF-32?
That is what I normally do. You seem to think that I am advocating
UTF-16. Far from it. I never use it. I was simply listing pros and cons
of the various formats.