From: Dominikus Scherkl (lyratelle@gmx.de)
Date: Mon Dec 28 2009 - 01:35:23 CST
Asmus Freytag wrote:
> On 12/27/2009 9:56 AM, - - wrote:
>> 1) Validate that UTF-8 is well-formed with no overlong byte sequences
>> or 5 to 6 byte sequences.
>>
>> 2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the following:
>> * 0x0000 - 0x001F (1st bunch of control characters)
>>
> This would eliminate the TAB character. That doesn't seem promising for
> "text".
It would also filter CR and LF. At least these three should not be
filtered. I personally would also allow VT (vertical tab).
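A minimal Python sketch of that exception list, assuming the filter works on already-decoded text. The names `ALLOWED_C0`, `strip_c0_controls` and `allow_vt` are made up for illustration and are not part of the original proposal:

```python
# C0 controls that ordinary text relies on: TAB, LF and CR.
ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def strip_c0_controls(text, allow_vt=False):
    """Drop U+0000..U+001F except the whitelisted controls.

    allow_vt additionally keeps U+000B (vertical tab).
    """
    allowed = ALLOWED_C0 | ({0x0B} if allow_vt else set())
    return "".join(
        ch for ch in text
        if ord(ch) > 0x1F or ord(ch) in allowed
    )
```

Whether VT belongs on the whitelist is a policy choice, which is why it is a parameter here rather than hard-coded.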
>> * 0x007F - 0x009F (2nd bunch of control characters)
>> * 0xD800 - 0xDFFF (surrogate pairs, have no use in UTF-8)
>>
> These surrogates don't occur in well-formed UTF-8. (See
> http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf)
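To illustrate: a strict UTF-8 decoder already rejects encoded surrogates, along with the overlong and 5 to 6 byte forms from step 1, so no separate surrogate filter is needed afterwards. A small Python sketch, relying on the strict default of `bytes.decode` (the helper name `is_well_formed_utf8` is only illustrative):

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# ED A0 80 would be U+D800 (a surrogate): rejected.
surrogate = b"\xed\xa0\x80"
# C0 AF is an overlong encoding of "/": rejected.
overlong = b"\xc0\xaf"
# F8 88 80 80 80 is a 5-byte sequence: rejected.
five_byte = b"\xf8\x88\x80\x80\x80"
```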
>> * 0xE000 - 0xF900 (private use; since everyone can make up a
>> different character for a code point in private use, filter them all)
>>
> The private use range ends at F8FF, not F900
>> * 0xFEFF (byte order mark, no use in UTF-8 and may be
>> potentially dangerous if converted later to UTF-16 without proper
>> filtering)
>> * 0xFFFE (byte order mark in wrong endian format, guaranteed
>> never to be assigned as a Unicode character)
>> * 0xFFFF (also guaranteed never to be assigned as a Unicode
>> character).
How about the other noncharacters at 1FFFE, 1FFFF, 2FFFE, 2FFFF, ...
(the last two code points of every plane)?
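For reference, the standard defines 66 noncharacters in total: the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF) plus the 32 code points U+FDD0..U+FDEF. A Python sketch of a test for them (the function name `is_noncharacter` is just an illustration):

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, or the low 16 bits are FFFE / FFFF in any plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Count them over the whole code space U+0000..U+10FFFF.
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
```

Unlike the set of assigned characters, this set is fixed by the standard, so a filter based on it does not need updating with each Unicode version.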
>>
>> For the rest, allow all ***assigned*** code points, filter
>> unassigned.
>>
> That's a fool's game, because assigned code points are version
> dependent. Even if one could adopt a "supported version" for one's own
> code, nothing guarantees that the codes were assigned at the time the
> originating software was written. If not, they could represent data that
> wasn't really text in the context it was created in. Further, the minute
> the next version of Unicode comes along, this will prevent the software
> from handling perfectly well-defined and standardized characters.
>
> At the same time, there's no attempt to filter the non-characters in the
> FDD0-FDEF range, which looks like a clear omission.
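The version dependence is easy to see in practice. As a sketch, Python's `unicodedata` module answers "is this assigned?" only relative to the Unicode version it was built with, so the same code point can be filtered by one installation and accepted by a newer one. The helper name `is_assigned` is made up for this example:

```python
import unicodedata


def is_assigned(ch: str) -> bool:
    """True unless the character has General_Category Cn.

    Cn covers unassigned code points and noncharacters. Surrogates (Cs)
    and private use (Co) are not Cn, so they count as assigned here.
    The answer reflects unicodedata.unidata_version, not the version
    the sender's software knew about.
    """
    return unicodedata.category(ch) != "Cn"


# The Unicode version this interpreter's tables were generated from.
supported_version = unicodedata.unidata_version
```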
>> 3) For code points in planes 3 to 13 (unassigned planes) filter the
>> complete range 0x30000 to 0xDFFFF.
>>
>> 4) For code points in plane 14 (SSP) allow all ***assigned*** code
>> points, filter unassigned.
>>
> The "Tag characters" from E0000 to E007F are deprecated and have no
> business in ordinary text. They are a much more useful set of characters
> to consider for filtering than those that are merely "not yet assigned".
>> 5) For code points in plane 15 and 16 (private use) filter the
>> complete range 0xF0000 - 0x10FFFF. Same argument as before: since
>> everyone can make up a different character for a code point in private
>> use, filter them all.
>>
>>
> In principle, this might be a defensible choice, especially if there's a
> need to compare data from different sources against each other. But,
> I'm afraid that this depends on the purposes for which the text is being
> accepted, and that is not clear enough in the context of this discussion.
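Since the right answer depends on the purpose, one option is to make private use handling a switch rather than a fixed rule. A rough Python sketch under that assumption; `PRIVATE_USE_RANGES`, `is_private_use`, `filter_private_use` and the `allow_private_use` flag are all hypothetical names, and the ranges cover the BMP private use area plus planes 15 and 16 without their trailing noncharacters:

```python
# Private use areas: BMP, plane 15, plane 16.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(cp: int) -> bool:
    """True if cp lies in one of the three private use areas."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def filter_private_use(text: str, allow_private_use: bool = False) -> str:
    """Remove private use characters unless the caller opts in."""
    if allow_private_use:
        return text
    return "".join(ch for ch in text if not is_private_use(ord(ch)))
```

An application that exchanges data under a private agreement would pass `allow_private_use=True`; one that compares text from unrelated sources would leave the default.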
Best Regards,
--
Dominikus Dittes Scherkl
This archive was generated by hypermail 2.1.5 : Mon Dec 28 2009 - 01:37:10 CST