RE: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Doug Ewell (doug@ewellic.org)
Date: Fri Nov 05 2010 - 14:56:09 CST

    Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

    >> Doing conversion and validation at different stages isn't a great
    >> idea; that's how character encodings get involved with security
    >> problems.
    >
    > Note that I am careful not to suggest that (and I'm sure Markus isn't
    > either). "Handling" includes much more than code conversion. It
    > includes uppercasing, spell checking, sorting, searching, the whole
    > lot. Burdening every single one of those tasks with policing the
    > integrity of the encoding seems wasteful, and, as I tried to explain,
    > puts the error detection in a place where you'll be most likely
    > prevented from doing something useful in recovery.

    Right, but as I said, those downstream tasks shouldn't be consumers of
    UTF-16 code units anyway. They should be consumers of Unicode code
    points, which by definition exclude loose surrogates.
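
    As an illustration of that division of labor, here is a minimal
    sketch (not code from this thread) of a single boundary step that
    turns UTF-16 code units into code points, so that later stages never
    see a surrogate. The function name utf16_to_code_points and the
    repair flag are hypothetical, and substituting U+FFFD for a loose
    surrogate is only one possible repair policy.

```python
from typing import Iterable, List

REPLACEMENT_CHARACTER = 0xFFFD  # assumed repair policy: substitute U+FFFD


def utf16_to_code_points(units: Iterable[int], repair: bool = False) -> List[int]:
    """Decode UTF-16 code units into code points at a single boundary.

    Hypothetical helper: with repair=False a loose surrogate raises
    ValueError; with repair=True it is replaced by U+FFFD.
    """
    units = list(units)
    code_points: List[int] = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        next_is_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and next_is_low:
            # Well-formed pair: combine into one supplementary code point.
            high_bits = (unit - 0xD800) << 10
            low_bits = units[i + 1] - 0xDC00
            code_points.append(0x10000 + high_bits + low_bits)
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            # Loose surrogate: report it here, or repair it here.
            if not repair:
                raise ValueError(f"loose surrogate 0x{unit:04X} at index {i}")
            code_points.append(REPLACEMENT_CHARACTER)
            i += 1
        else:
            # Ordinary BMP code unit maps directly to a code point.
            code_points.append(unit)
            i += 1
    return code_points
```

    Everything after this call (uppercasing, sorting, searching) can
    then work on the returned list without checking for surrogates,
    because none can appear in it.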

    --
    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
    

