Re: Filtering and displaying untrusted UTF-8

From: Jason Schauberger (crossroads0000@googlemail.com)
Date: Mon Dec 28 2009 - 17:05:23 CST

Next message: verdy_p: "Re: Filtering and displaying untrusted UTF-8"

Previous message: Jukka K. Korpela: "Re: Filtering and displaying untrusted UTF-8"
In reply to: Jukka K. Korpela: "Re: Filtering and displaying untrusted UTF-8"
Next in thread: verdy_p: "Re: Filtering and displaying untrusted UTF-8"
Reply: verdy_p: "Re: Filtering and displaying untrusted UTF-8"
Reply: Andrew West: "Re: Filtering and displaying untrusted UTF-8"
Reply: Andrew West: "Re: Filtering and displaying untrusted UTF-8"
Reply: Doug Ewell: "Re: Filtering and displaying untrusted UTF-8"

Hello again.

On Mon, Dec 28, 2009 at 7:50 PM, Jukka K. Korpela <jkorpela@cs.tut.fi> wrote:
> Therefore, regarding U+FEFF as not allowed in plain text datastream would be
> a big mistake, even though filtering it out would normally result in
> inferior typography at most.

Then it's probably best to disallow it only if it's the first code
point, and otherwise let it through.
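
A minimal sketch of that rule in Python, assuming the input has already
been decoded to a string. The function name is only illustrative.

```python
BOM = "\ufeff"  # U+FEFF: byte order mark / ZERO WIDTH NO-BREAK SPACE


def strip_leading_bom(text: str) -> str:
    """Drop U+FEFF only when it is the first code point; keep it elsewhere."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text
```

A U+FEFF in the middle of the text passes through untouched, which
keeps its use as a zero-width no-break space intact.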

On Mon, Dec 28, 2009 at 3:48 AM, verdy_p <verdy_p@wanadoo.fr> wrote:
> May be the NEXT LINE (U+0085) character, in C1 controls, part of all ISO 8859 charsets (for MIME) at position 0x85,
> which is valid as a line separator or as a blank in HTML?
> You may want to replace it with CRLF sequences, or you may want to uniformize the various encodings of newlines (CR
> not followed by LF, CR+LF, LF not following CR, NL) into a single one (such as LF, for compatibility with C language
> standard I/O) on input (and generate CR+LF on output).
>

That's a good idea. I wonder if there are any more code points
which should be encoded in HTML?
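
The newline handling suggested above could be sketched like this in
Python. The function names are my own, and choosing LF as the internal
form is just the option verdy_p mentioned, not a requirement.

```python
import re

# CR+LF must come first so it is matched as one newline, not two.
_NEWLINE = re.compile("\r\n|\r|\n|\u0085")


def normalize_newlines(text: str) -> str:
    """Map CR+LF, lone CR, lone LF and NEL (U+0085) to a single LF."""
    return _NEWLINE.sub("\n", text)


def to_crlf(text: str) -> str:
    """Turn already-normalized LF newlines into CR+LF for output."""
    return text.replace("\n", "\r\n")
```

Note that to_crlf() expects input that has been through
normalize_newlines() first; otherwise an existing CR+LF would become
CR+CR+LF.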

On Mon, Dec 28, 2009 at 4:29 AM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
>>
>> 2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the following:
>> * 0x0000 - 0x001F (1st bunch of control characters)
>>
>
> This would eliminate the TAB character. That doesn't seem promising for
> "text".

Agreed. As others have pointed out, the newline character(s) and
similar should be exempt from the filter as well.
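
As a rough sketch in Python, the 0x0000 - 0x001F filter with such
exemptions might look like this. Which controls to exempt is an
assumption on my part; here it is TAB, LF and CR.

```python
# C0 controls that are kept; everything else below 0x20 is dropped.
ALLOWED_C0 = {0x09, 0x0A, 0x0D}  # TAB, LF, CR (an assumed allow list)


def drop_c0_controls(text: str) -> str:
    """Remove code points 0x0000-0x001F except the allowed ones."""
    return "".join(
        ch for ch in text
        if ord(ch) > 0x1F or ord(ch) in ALLOWED_C0
    )
```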

>>
>> For the rest, allow all ***assigned*** code points, filter unassigned.
>>
>
> That's a fool's game, because assigned code points are version dependent.
> Even if one could adopt a "supported version" for one's own code, nothing
> guarantees that the codes were assigned at the time the originating software
> was written. If not, they could represent data that wasn't really text in
> the context it was created in. Further, the minute the next version of
> Unicode comes along, this will prevent the software from handling perfectly
> well-defined and standardized characters.

I tend to disagree. Of course it's likely that currently unassigned
code points will be assigned characters in future Unicode versions.
However, it's also possible that some of them will be assigned
non-characters. Then what's the point in filtering out any
non-characters at all, if you're completely neglecting the possibility
that new non-characters or control characters may be added in future
versions and your algorithm potentially leaves them unfiltered? This is
not only inconsistent, but renders the current attempt at filtering
completely moot. And if you argue that you could update the algorithm
to be aware of the new control characters, the same can be said about
updating the algorithm to be aware of newly assigned text characters.

I think it is much more consistent to offer an API call to get the
current Unicode database version used and an easy way to update it
when a new Unicode version is released, especially since AIUI most
written and spoken languages are already represented in the current
Unicode version. Hence, the possibility of interchanged text becoming
illegible due to completely new characters being filtered is rather
slim.
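
For illustration, Python's standard unicodedata module already exposes
both pieces: the version of the bundled Unicode database and the
general category of a code point. The wrapper names below are my own.
In Python the database is tied to the interpreter release, so
"updating" there means upgrading the interpreter, which may or may not
count as easy.

```python
import unicodedata


def ucd_version() -> str:
    """Version of the Unicode Character Database this runtime ships."""
    return unicodedata.unidata_version


def is_assigned(ch: str) -> bool:
    """True unless the general category is Cn (unassigned or
    noncharacter) in the runtime's database version."""
    return unicodedata.category(ch) != "Cn"
```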

>
> At the same time, there's no attempt to filter the non-characters in the
> FDD0-FDEF range, which looks like a clear omission.

I agree, FDD0-FDEF should be added to the list of characters to
filter/replace. The same goes for the last two code points of every
plane: FFFE, FFFF, 1FFFE, 1FFFF, 2FFFE, 2FFFF, and so on up to 10FFFF.
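
A short Python sketch of a noncharacter test, assuming code points are
handled as integers. It covers FDD0-FDEF plus the last two code points
of each of the 17 planes (FFFE, FFFF, 1FFFE, 1FFFF, ... 10FFFE,
10FFFF), 66 code points in total.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```

The bit mask works because the low 16 bits of a plane-final
noncharacter are always FFFE or FFFF, whatever the plane number.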

>>
>> 3) For code points in planes 3 to 13 (unassigned planes) filter the
>> complete range 0x30000 to 0xDFFFF.
>>
>> 4) For code points in plane 14 (SSP) allow all ***assigned*** code
>> points, filter unassigned.
>>
>
> The "Tag characters" from E0000 to E007F are deprecated and have no business
> in ordinary text. Much more useful set of characters to consider for
> filtering than those that are merely "not yet assigned".

I agree here, too.
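
Filtering that range could be sketched in Python as follows; the
function name is illustrative only.

```python
TAG_FIRST, TAG_LAST = 0xE0000, 0xE007F  # the "Tag characters" range


def strip_tag_characters(text: str) -> str:
    """Remove every code point in E0000..E007F from the text."""
    return "".join(
        ch for ch in text
        if not (TAG_FIRST <= ord(ch) <= TAG_LAST)
    )
```

One caveat: later Unicode versions use E0020-E007F again for emoji tag
sequences, so a blanket filter like this may break some flag emoji.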

I wonder if it would be better not to drop filtered code points, but
to replace each of them with a replacement code point such as U+FFFD,
"used to replace an incoming character whose value is unknown or
unrepresentable in Unicode". Any thoughts?
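
In Python terms, replacing instead of dropping might look like the
sketch below. The predicate argument is a hypothetical hook standing
in for whatever filter rules are finally agreed on.

```python
from typing import Callable

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_filtered(text: str, is_filtered: Callable[[int], bool]) -> str:
    """Substitute U+FFFD for each code point the predicate rejects."""
    return "".join(
        REPLACEMENT if is_filtered(ord(ch)) else ch
        for ch in text
    )
```

For example, replace_filtered(s, lambda cp: cp < 0x20) marks every C0
control in s instead of silently removing it, so the reader can see
that something was there.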

Kind regards.

This archive was generated by hypermail 2.1.5 : Mon Dec 28 2009 - 17:16:29 CST