In practice and by design, treating isolated surrogates the same as
reserved code points during processing, and then cleaning them up on
conversion to the UTFs, works just fine. It is a tradeoff that is up to the
implementation.

It has nothing to do with a "legacy of C pointer arithmetic". It does
represent a pragmatic choice made some time ago, but there is no need to
get worked up about it. Human scripts and their representation on
computers are quite complex enough; in the grand scheme of things, the
handling of surrogates in implementations pales in significance.
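
As a rough illustration of that tradeoff, here is a minimal Python sketch.
It assumes an implementation whose strings are sequences of code points (as
Python's str is), so an isolated surrogate can pass through processing like
any other unassigned code point. The helper names clean_for_utf and to_utf8
are hypothetical, and replacing with U+FFFD is only one possible cleanup
policy; it is not something the Unicode Standard mandates for every API.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER (assumed cleanup policy)


def clean_for_utf(text: str) -> str:
    """Replace every surrogate code point (U+D800..U+DFFF) with U+FFFD.

    In a Python str, supplementary characters are single code points, so
    any surrogate code point found here is necessarily an isolated one.
    """
    return "".join(
        REPLACEMENT if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )


def to_utf8(text: str) -> bytes:
    """Convert to UTF-8, cleaning up isolated surrogates at the boundary."""
    return clean_for_utf(text).encode("utf-8")


# Processing step: the isolated surrogate is carried along untouched.
raw = "ab\ud800cd"
processed = raw.upper()

# Conversion step: only here is the string made well-formed.
encoded = to_utf8(processed)
```

Without the cleanup step, Python's strict UTF-8 encoder raises
UnicodeEncodeError for such a string, which is the other end of the same
tradeoff: reject at the boundary instead of repairing.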
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*
On Mon, Jan 7, 2013 at 9:43 PM, Stephan Stiller
<stephan.stiller_at_gmail.com> wrote:
>
>>> Things like this are called "garbage in, garbage out" (GIGO). It may be
>>> harmless, or it may hurt you later.
>>>
>> So in this kind of a case, what we are actually dealing with is: garbage
>> in, principled, correct results out. ;-)
>>
>
> Wouldn't the clean way be to ensure valid strings (only) when they're
> built and then make sure that string algorithms (only) preserve
> well-formedness of input?
>
> Perhaps this is how the system grew, but it seems to me that it's
> yet another legacy of C pointer arithmetic and about convenience of
> implementation rather than a safety or performance issue.
>
> Stephan
>
>
>
Received on Tue Jan 08 2013 - 00:53:34 CST