From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Jun 01 2009 - 11:31:48 CDT
The reason for the strict enforcement has to do with security, i.e. by
adhering to [2] you will be ruling out certain types of "bad UTF-8" attacks
that are possible under [1].
Not a minor "practical" concern.
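
For illustration, a minimal sketch in C of choice [2], strict acceptance
per the "well formed UTF-8 byte sequences" table in Unicode 5.0 (the
function name and interface below are placeholders, not taken from any
existing implementation):

#include <stddef.h>

/* Returns 1 if s[0..len-1] is a well-formed UTF-8 byte sequence per
 * Unicode 5.0, Table 3-7; returns 0 otherwise.  The second-byte ranges
 * are what reject overlong forms, surrogates, and values > U+10FFFF. */
static int utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];

        if (b <= 0x7F) {                      /* U+0000..U+007F */
            i += 1;
        } else if (b >= 0xC2 && b <= 0xDF) {  /* U+0080..U+07FF */
            if (i + 1 >= len || (s[i+1] & 0xC0) != 0x80)
                return 0;
            i += 2;
        } else if (b >= 0xE0 && b <= 0xEF) {  /* U+0800..U+FFFF */
            unsigned char lo = 0x80, hi = 0xBF;
            if (b == 0xE0) lo = 0xA0;         /* no overlong forms */
            if (b == 0xED) hi = 0x9F;         /* no surrogates     */
            if (i + 2 >= len ||
                s[i+1] < lo || s[i+1] > hi ||
                (s[i+2] & 0xC0) != 0x80)
                return 0;
            i += 3;
        } else if (b >= 0xF0 && b <= 0xF4) {  /* U+10000..U+10FFFF */
            unsigned char lo = 0x80, hi = 0xBF;
            if (b == 0xF0) lo = 0x90;         /* no overlong forms      */
            if (b == 0xF4) hi = 0x8F;         /* nothing above U+10FFFF */
            if (i + 3 >= len ||
                s[i+1] < lo || s[i+1] > hi ||
                (s[i+2] & 0xC0) != 0x80 ||
                (s[i+3] & 0xC0) != 0x80)
                return 0;
            i += 4;
        } else {
            return 0;  /* C0, C1, F5..FF never start a well-formed sequence */
        }
    }
    return 1;
}

The second-byte checks are the whole point of the security argument: a
lax decoder that skips them will happily turn the ill-formed pair 0xC0
0xAF into "/", the classic overlong-encoding attack called out in the
security considerations of RFC 3629.
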
A./
On 6/1/2009 8:46 AM, Mark Crispin wrote:
> On Mon, 1 Jun 2009, Phillips, Addison wrote:
>> Uh... the IETF does not define UTF-8. The Unicode Consortium does.
>> But even if you want to build on the IETF documents, RFC 3629 was
>> published six years ago. Basing a new implementation on something
>> published 11 years ago and obsolete for the last six years? Not a good idea.
>
> This is true; but generally, within the IETF, specifications are upwards
> compatible.
>
> I think there are two obvious implementation choices:
>
> [1] Recognize the sequences for the 0x110000 - 0x7fffffff range,
> never generate them, and if a value in that range is encountered, treat
> it as an "error" or "not in Unicode" value. This is the traditional
> IETF philosophy.
>
> [2] Strictly enforce the rules for "well formed UTF-8 byte sequences"
> on page 104 of Unicode 5.0, and reject any string which fails to
> comply (note in particular the requirements of the second byte).
>
> In all cases, what is generated must strictly comply with "well formed
> UTF-8 byte sequences".
>
> I have little doubt that Unicode would tend to advocate choice [2],
> but as noted above the "IETF way" would be choice [1].
>
> As a practical matter, it should not make any difference. You should
> never expect anything other than a well-formed sequence to work.
>
> -- Mark --
>
> http://panda.com/mrc
> Democracy is two wolves and a sheep deciding what to eat for lunch.
> Liberty is a well-armed sheep contesting the vote.
>
>