From: verdy_p (verdy_p@wanadoo.fr)
Date: Mon Dec 28 2009 - 02:16:01 CST
"Dominikus Scherkl" wrote:
>
> Asmus Freytag wrote:
> > On 12/27/2009 9:56 AM, - - wrote:
> >> 1) Validate that UTF-8 is well-formed with no overlong byte sequences
> >> or 5 to 6 byte sequences.
> >>
> >> 2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the following:
> >> * 0x0000 - 0x001F (1st bunch of control characters)
> >>
> > This would eliminate the TAB character. That doesn't seem promising for
> > "text".
> It would also filter CR and LF. At least these three should not be
> filtered. I personally would also allow VT (vertical tab).
Simply for compatibility with many text editors, if I had to keep only one end-of-line control character (all
others being normalized to it in plain text), I would keep just LF, which maps conveniently to the default "\n"
character in C/C++ (but to CR on MacOS platforms, where the mappings of \n and \r were historically swapped) and in
Java and C# (where you don't have this choice). VT is rarely used as an end-of-line mark; most editors will render
it with some glyph or with some escaped meta-notation (e.g. Emacs, and vi or vim with classic console charsets).
But I would definitely not filter the newline controls: normalizing these controls (or the CR+LF sequence) on input
from external sources will remain necessary (notably because CR+LF is normally mandatory in MIME plain-text formats
and in many text-based Web protocols, including HTTP or FTP and their secure variants).
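
As an illustration of the normalization described above, here is a minimal sketch in C that rewrites a buffer in
place so that CR+LF and a lone CR both become a single LF. The function name normalize_newlines is hypothetical, and
the sketch assumes the input is UTF-8 (or another ASCII-compatible encoding), where the bytes 0x0D and 0x0A never
occur inside a multi-byte sequence. Numeric byte values are used instead of '\r' and '\n' so the result does not
depend on how a given compiler maps those escapes.

```c
#include <stddef.h>

#define BYTE_LF 0x0A
#define BYTE_CR 0x0D

/* Hypothetical helper: normalizes CR+LF and lone CR to LF, in place.
   Returns the new length of the buffer contents (never longer than len). */
size_t normalize_newlines(char *buf, size_t len)
{
    size_t in = 0, out = 0;

    while (in < len) {
        char c = buf[in++];
        if (c == BYTE_CR) {
            /* Swallow the LF of a CR+LF pair so the pair yields one LF. */
            if (in < len && buf[in] == BYTE_LF)
                in++;
            c = BYTE_LF;
        }
        buf[out++] = c;
    }
    return out;
}
```

A writer that must emit CR+LF (for MIME or HTTP, say) would apply the inverse mapping on output; that step is not
shown here.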
And I would also include FF (mappable as the escape sequence "\f" in C/C++/Java/J#/C#) as another newline and as
whitespace: it occurs quite frequently in many C/C++ sources, to specify a page break position when printing or
rendering the source to paged media such as a PDF report (it occurs in fact much more frequently than VT, which
I've never seen and which is probably rejected as an invalid source character in many computer languages, including
by C/C++ compilers even when they support the "\v" escape for mapping it in literal string or character constants, or
in character array initializers).
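
To make the resulting filter concrete, here is a rough sketch in C of a C0 control filter that keeps TAB, LF, CR
and FF and drops the rest of the range 0x00 - 0x1F. The names keep_byte and strip_c0_controls are hypothetical.
Dropping VT is an assumption of this sketch (the quoted message would allow it), and the code works on raw bytes,
assuming UTF-8 input that has already been validated, so that bytes below 0x20 can only be C0 controls.

```c
#include <stddef.h>

/* Returns 1 if the byte may stay in plain text, 0 if it is a C0 control
   that this sketch filters out. Bytes >= 0x20 are not examined here. */
int keep_byte(unsigned char b)
{
    if (b >= 0x20)
        return 1;
    switch (b) {
    case 0x09: /* TAB */
    case 0x0A: /* LF */
    case 0x0C: /* FF, treated as whitespace and page break */
    case 0x0D: /* CR */
        return 1;
    default:   /* all other C0 controls, including VT (0x0B) */
        return 0;
    }
}

/* Removes the filtered controls in place and returns the new length. */
size_t strip_c0_controls(unsigned char *buf, size_t len)
{
    size_t in, out = 0;

    for (in = 0; in < len; in++) {
        if (keep_byte(buf[in]))
            buf[out++] = buf[in];
    }
    return out;
}
```

Allowing VT as well would only take one more case label for 0x0B.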
Philippe.