Re: Getting A Newb Started

From: William J Poser (wjposer@ldc.upenn.edu)
Date: Mon Jul 07 2008 - 20:07:28 CDT

Next message: William J Poser: "Re: Getting A Newb Started"

Previous message: Kenneth Whistler: "Re: Getting A Newb Started"
In reply to: David Starner: "Re: Getting A Newb Started"
Next in thread: John H. Jenkins: "Re: Getting A Newb Started"

>Most low-level string processing code shouldn't need to be rewritten
>for each application. If you've got UCS-2 only code, you have to
>reëvaluate it for each project, or introduce a subtle bug by the reuse
>of code. If you don't reuse code, you're probably rewriting code,
>which introduces bugs, especially in the parts that aren't
>well-tested--which for most people will include non-BMP characters.

True, but it depends what code you're writing. Some things are general
purpose and should handle anything. Some things are very application
specific.

>And just because you can clean and validate user input doesn't mean
>that you should arbitrarily forbid non-BMP characters. One of the
>principles of Unicode is that you can pass through arbitrary scripts
>and not worry about the difference.

That's a principle of Unicode, not a design requirement for particular
applications. It is perfectly appropriate to write programs
using Unicode that are aimed at a particular language or writing
system and so can make "arbitrary" assumptions.

>>I don't get the point. Whether you're dealing with one character or
>>many, life is simpler if they're all the same size.

>If I have to look up a single character in an array, it makes a
>difference. If I'm looking up multiple characters, it no longer
>matters the length of any one of them; you're passing and returning
>strings.

True, for string lookup it doesn't matter. For determining how much storage
to allocate when you know how many characters are in a string, it does.
For moving from character to character, it does too.
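
To make the storage and stepping points concrete, here is a minimal Python
sketch. It is an illustration only, not code from this discussion, and the
helper names utf16_units_needed and iter_utf16 are made up for the example.
It assumes well-formed UTF-16 and does not check for unpaired surrogates.

```python
def utf16_units_needed(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    # Characters above U+FFFF need a surrogate pair (two code units).
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def iter_utf16(units):
    """Yield code points from a sequence of 16-bit UTF-16 code units.

    Assumption: the input is well-formed; unpaired surrogates are not checked.
    """
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
            # High surrogate: combine it with the low surrogate that follows.
            low = units[i + 1]
            yield 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


text = "a\U00010400b"  # U+10400 lies outside the BMP
print(len(text))                 # 3 characters
print(utf16_units_needed(text))  # 4 code units in UTF-16
print(len(text) * 4)             # 12 bytes in UTF-32, always 4 per character
```

With a fixed-width form the allocation is a single multiplication and the
step to the next character is always the same size. With UTF-16 both
depend on the content of the string.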

>>But for some purposes, yes, you can assume that input is BMP-only.
>>Not all input comes direct from the user.

>Even for the times that you can assume integer input is positive, you
>generally need to guard that code carefully with run-time tests. I
>would regard nothing less as reasonable and necessary for code that
>assumes the input is in the BMP.

I agree that you need run-time tests, but you can't put them everywhere.
You have to have a few points at which you test and elsewhere rely
on your code not to take the data out of bounds.
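
One way to picture testing at a few points is the following Python sketch.
It is only an illustration under my own assumptions: require_bmp and
utf16_byte_length are hypothetical names, and the idea is that data is
checked once where it enters the program, after which internal code relies
on that check.

```python
def require_bmp(text: str) -> str:
    """Boundary check: reject any character outside the BMP."""
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                f"non-BMP character U+{ord(ch):04X} at index {index}"
            )
    return text


def utf16_byte_length(bmp_text: str) -> int:
    """Internal helper: valid only for text that passed require_bmp()."""
    # Every BMP character is exactly one 16-bit code unit.
    return 2 * len(bmp_text)


checked = require_bmp("abc")       # test once, at the point of entry
print(utf16_byte_length(checked))  # 6
```

The internal helper does no checking of its own, so it is only as safe as
the discipline of calling require_bmp at every point where data comes in.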

>If simplicity is your goal, why not use UTF-32?

That is what I normally do. You seem to think that I am advocating
UTF-16. Far from it. I never use it. I was simply listing pros and cons
of the various formats.
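
As a small illustration of the simplicity argument for UTF-32, here is a
Python sketch (mine, not from the thread; nth_char_utf32 is a made-up
helper name). It assumes a little-endian UTF-32 buffer with no byte order
mark, so character n always starts at byte offset 4 * n.

```python
import struct


def nth_char_utf32(buf: bytes, n: int) -> str:
    """Return character n from a UTF-32LE buffer by direct offset."""
    # Each character occupies exactly four bytes, so no scanning is needed.
    (code_point,) = struct.unpack_from("<I", buf, 4 * n)
    return chr(code_point)


text = "a\U00010400b"             # includes one non-BMP character
buf = text.encode("utf-32-le")    # the "-le" codec writes no byte order mark
print(len(buf))                   # 12
print(nth_char_utf32(buf, 1) == "\U00010400")  # True
```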


This archive was generated by hypermail 2.1.5 : Mon Jul 07 2008 - 20:09:03 CDT