Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Mark Davis ☕️ via Unicode <unicode_at_unicode.org>
Date: Thu, 3 Aug 2017 17:34:15 -0700

FYI, the UTC retracted the following.

*[151-C19 <http://www.unicode.org/cgi-bin/GetL2Ref.pl?151-C19>] Consensus:*
Modify the section on "Best Practices for Using FFFD" in section "3.9 Encoding
Forms" of TUS per the recommendation in L2/17-168
<http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/17-168>, for Unicode
version 11.0.

Mark

(https://twitter.com/mark_e_davis)

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson via Unicode <
unicode_at_unicode.org> wrote:

> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> On 2017/05/24 05:57, Karl Williamson via Unicode wrote:
>>
>>> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:
>>>
>>>> Adding a "recommendation" this late in the game is just bad standards
>>>> policy.
>>>>
>>> Unless I misunderstand, you are missing the point. There is already a
>>> recommendation listed in TUS,
>>>
>>
>> That's indeed correct.
>>
>>> and that recommendation appears to have
>>> been added without much thought.
>>>
>>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in use
>> widely (among others, in major programming languages and browsers) without
>> problems for quite some time.
>>
>
> Could you supply a reference to the PRI and its feedback?
>
> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."
>
> And I agree with that. And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input. An overlong was a single unit. When they became illegal, the code
> still considered them a single unit.
>
> I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.
>
> But I assert that my interpretation is just as valid as that one. And
> perhaps more so, because of historical precedent.
>
> It appears to me that little thought was given to the fact that these
> changes would cause overlongs to now be at least two units instead of one,
> making long-existing code no longer best practice. You are effectively
> saying I'm wrong about this. I thought I had been paying attention to
> PRIs since the 5.x series, and I don't remember anything about this. If
> you have evidence to the contrary, please give it. However, I would have
> thought Markus would have dug any up and given it in his proposal.
>
>
>
>>
>>> There is no proposal to add a
>>> recommendation "this late in the game".
>>>
>>
>> True. The proposal isn't for an addition, it's for a change. The "late in
>> the game" however, still applies.
>>
>> Regards, Martin.
>>
>>
>
>
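
The two readings of "maximal subpart" debated in the quoted text can be
sketched in Python. This is an illustration under stated assumptions, not
code from anyone in the thread. decode_structural is a hypothetical helper
that treats a lead byte plus the continuation bytes it announces as one
unit, so an overlong becomes a single U+FFFD. The other reading is shown
with Python's built-in decoder and errors="replace", on the assumption
that CPython 3 replaces each maximal subpart separately.

```python
REPLACEMENT = "\ufffd"


def decode_structural(data: bytes) -> str:
    """Hypothetical policy: one U+FFFD per structurally complete unit.

    A lead byte and the continuation bytes it announces are consumed
    together; if the resulting chunk is not well-formed UTF-8 (for
    example an overlong such as C0 80), the whole chunk is replaced
    by a single U+FFFD.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # Number of continuation bytes announced by the lead byte.
        if 0xC0 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF7:
            need = 3
        else:
            need = 0  # ASCII, stray continuation byte, or F8..FF
        # Consume at most `need` continuation bytes (80..BF).
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunk = data[i:j]
        try:
            out.append(chunk.decode("utf-8"))  # strict decoding
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i = j
    return "".join(out)


overlong_nul = b"\xc0\x80"  # overlong encoding of U+0000

# Reading 1 (overlong as one unit): a single replacement character.
one_unit = decode_structural(overlong_nul)

# Reading 2 (C0 alone is a maximal subpart, then 80 is a stray byte):
# the built-in decoder is assumed to emit one U+FFFD per byte here.
per_subpart = overlong_nul.decode("utf-8", errors="replace")

print(len(one_unit), len(per_subpart))
```

Both policies agree on well-formed input and on truncated sequences; they
differ only in how many U+FFFD are produced for sequences such as
overlongs, which is the point under dispute.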
Received on Thu Aug 03 2017 - 19:35:09 CDT
