Re: What does it mean to "not be a valid string in Unicode"?

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Tue, 8 Jan 2013 03:06:27 -0800

>> Wouldn't the clean way be to ensure valid strings (only) when they're
>> built
>
> Of course, the earlier erroneous data gets caught, the better. The problem
> is that error checking is expensive, both in lines of code and in execution
> time (I think there is data showing that in any real-life programs, more
> than 50% or 80% or so is error checking, but I forgot the details).
>
> So indeed as Ken has explained with a very good example, it doesn't make
> sense to check at every corner.

What I meant: The idea was to check only when a string is constructed. As
soon as it's been fed into a collation/whatever algorithm, the algorithm
should assume the original input was well-formed and shouldn't do any more
error-checking, yes.
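
A rough sketch of that division of labour, in Python (chosen only because
its str type can hold lone surrogates, which makes the check easy to show).
The names WellFormedText and code_point_order_key are hypothetical, and the
key function is only a stand-in for a real collation algorithm:

```python
class WellFormedText:
    """Hypothetical wrapper: validation happens once, at construction."""

    __slots__ = ("_s",)

    def __init__(self, s: str) -> None:
        # Python's str can hold lone surrogates (e.g. "\ud800"),
        # so reject them here, exactly once.
        for i, ch in enumerate(s):
            if 0xD800 <= ord(ch) <= 0xDFFF:
                raise ValueError(f"surrogate U+{ord(ch):04X} at index {i}")
        self._s = s

    def scalars(self) -> list:
        # Every element is a Unicode scalar value by construction.
        return [ord(ch) for ch in self._s]


def code_point_order_key(text: WellFormedText) -> list:
    # Stand-in for a real algorithm. It trusts its input:
    # no surrogate checks, no error paths.
    return text.scalars()


words = [WellFormedText(w) for w in ["b", "a", "\U0001F600"]]
ordered = sorted(words, key=code_point_order_key)
```

The only place that can raise is the constructor; everything downstream of
it works on data that is already known to be well-formed.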

Not having facilities for dealing with ill-formed values (the surrogate
code points U+D800..U+DFFF) in an algorithm will surely make *something*
faster, even if it's just some table that's being used indirectly having
fewer entries.

What I had in mind is a library where the public interface only ever allows
Unicode scalar values to be in- and output. This will lead to a cleaner
interface. A data structure that can hold surrogate values can and should
be used algorithm-*internally*, if that makes things more efficient, safer,
etc.
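
To make the public/internal split concrete, here is a minimal sketch in
Python. ScalarString is a hypothetical type, not an existing library: its
public interface accepts and yields only Unicode scalar values, while the
internal storage is UTF-16 code units and therefore does contain surrogate
values. The choice of UTF-16 internally is an assumption made purely for
illustration.

```python
from array import array
from typing import Iterable, Iterator


class ScalarString:
    """Hypothetical type: scalar values at the boundary, UTF-16 inside."""

    def __init__(self, scalars: Iterable[int]) -> None:
        units = array("H")  # unsigned 16-bit code units
        for cp in scalars:
            if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
                raise ValueError(f"not a Unicode scalar value: {cp:#x}")
            if cp < 0x10000:
                units.append(cp)
            else:
                # Internal representation only: a surrogate pair.
                v = cp - 0x10000
                units.append(0xD800 + (v >> 10))
                units.append(0xDC00 + (v & 0x3FF))
        self._units = units

    def __iter__(self) -> Iterator[int]:
        # Pairs are always complete, because the constructor is the
        # only code that writes to the internal buffer.
        it = iter(self._units)
        for u in it:
            if 0xD800 <= u <= 0xDBFF:
                low = next(it)
                yield 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
            else:
                yield u


s = ScalarString([0x41, 0x1F600])
print([hex(cp) for cp in s])         # scalar values out
print([hex(u) for u in s._units])    # surrogates exist only internally
```

A caller never sees a surrogate through this interface, even though the
buffer behind it holds them.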

> Convenience of implementation is an important aspect in programming.

For a user yes, but not for a library writer/maintainer, I would suggest.
Typical STL implementations use red-black trees (for std::map and
std::set); these are annoyingly difficult to implement but invisible to
the user.

Stephan
Received on Tue Jan 08 2013 - 05:09:27 CST
