Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Markus Scherer via Unicode
unicode at unicode.org
Tue May 16 13:36:39 CDT 2017
Let me try to address some of the issues raised here.
The proposal changes a recommendation, not a requirement. Conformance
applies to finding and interpreting valid sequences properly. This includes
not consuming parts of valid sequences when dealing with illegal ones, as
explained in the section "Constraints on Conversion Processes".
Otherwise, what you do with illegal sequences is a matter of what you think
makes sense -- a matter of opinion and convenience. Nothing more.
I wrote my first UTF-8 handling code some 18 years ago, before joining the
ICU team. At the time, I believe the ISO UTF-8 definition was not yet
limited to U+10FFFF, and decoding overlong sequences and those yielding
surrogate code points was regarded as a misdemeanor. The spec has been
tightened up, but I am pretty sure that most people familiar with how UTF-8
came about would recognize <C0 AF> and <E0 9F 80> as single sequences.
I believe that the discussion of how to handle illegal sequences came out
of security issues a few years ago, when some implementations swallowed
valid single bytes and lead bytes into preceding illegal sequences. Beyond the
"Constraints on Conversion Processes", there was evidently also a desire to
recommend how to handle illegal sequences.
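To illustrate the kind of bug meant here, the following is a minimal Python
sketch. The function name naive_decode is hypothetical and does not come from
ICU or any real library; it shows a decoder that trusts the lead byte's
declared length and therefore swallows a valid ASCII byte that follows a
broken lead byte. Python's built-in codec is shown next to it for contrast.

```python
def naive_decode(data: bytes) -> str:
    """UNSAFE sketch: trusts the lead byte's length, never checks trail bytes."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        # Length implied by the lead byte alone (hypothetical naive logic).
        length = 1 if b < 0xC0 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        chunk = data[i:i + length]
        try:
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            # The whole chunk is replaced, including any valid bytes in it.
            out.append("\ufffd")
        i += length
    return "".join(out)


raw = b'x\xc2"y'                       # <C2> is a lead byte with no trail byte
swallowed = naive_decode(raw)          # the quotation mark disappears
preserved = raw.decode("utf-8", errors="replace")  # the quotation mark survives
```

If the swallowed byte is a quote, an angle bracket, or a path separator, a
later parsing stage sees different syntax than a validating stage did.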
I think that the current recommendation was an extrapolation of common
practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for
UTF-8, too, but "it feels like" (yes, that's the level of argument for
stuff that doesn't really matter) not treating <C0 AF> and <E0 9F 80> as
single sequences is "weird".
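The difference between the two ways of carving up <C0 AF> and <E0 9F 80> can
be sketched in Python. This is an illustration under stated assumptions, not
ICU's code and not the text of the proposal: decode_lossy and its helpers are
hypothetical names, structural=False approximates the "maximal subpart"
practice, and structural=True groups bytes by the lead byte's bit pattern. How
bytes F5..FF are treated here is also an assumption of the sketch.

```python
REPLACEMENT = "\ufffd"


def _wellformed_shape(lead):
    """(trail_count, lo, hi) for the second byte of a well-formed sequence
    starting with `lead`, or None if `lead` can never start one."""
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if 0xE0 <= lead <= 0xEF:
        return 2, (0xA0 if lead == 0xE0 else 0x80), (0x9F if lead == 0xED else 0xBF)
    if 0xF0 <= lead <= 0xF4:
        return 3, (0x90 if lead == 0xF0 else 0x80), (0x8F if lead == 0xF4 else 0xBF)
    return None


def _structural_trail_count(lead):
    """Trail bytes implied by the lead byte's bit pattern alone (assumption)."""
    if 0xC0 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF7:
        return 3
    return 0


def decode_lossy(data: bytes, structural: bool = False) -> str:
    out, i, n = [], 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII
            out.append(chr(lead))
            i += 1
            continue
        j = i + 1
        shape = _wellformed_shape(lead)
        if shape:
            count, lo, hi = shape
            # Extend while the bytes still form a prefix of a valid sequence.
            while j < n and j - i <= count and lo <= data[j] <= hi:
                j += 1
                lo, hi = 0x80, 0xBF
            if j - i == count + 1:           # complete, well-formed sequence
                out.append(data[i:j].decode("utf-8"))
                i = j
                continue
        if structural:
            # Regroup: lead byte plus as many trail bytes as its pattern implies.
            k = _structural_trail_count(lead)
            j = i + 1
            while j < n and j - i <= k and 0x80 <= data[j] <= 0xBF:
                j += 1
        out.append(REPLACEMENT)              # one U+FFFD per error unit
        i = j                                # never skips a non-trail byte
    return "".join(out)


two = decode_lossy(b"\xc0\xaf")                          # two U+FFFD
one = decode_lossy(b"\xc0\xaf", structural=True)         # one U+FFFD
three = decode_lossy(b"\xe0\x9f\x80")                    # three U+FFFD
single = decode_lossy(b"\xe0\x9f\x80", structural=True)  # one U+FFFD
```

Both modes consume only trail bytes (80..BF) after a lead byte, so neither one
eats a following ASCII byte or lead byte. They differ only in how many U+FFFD
they emit.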
Why do we care how we carve up an illegal sequence into subsequences? Only
for debugging and visual inspection. Maybe some process is using illegal,
overlong sequences to encode something special (à la Java string
serialization, "modified UTF-8"), and for that it might also be convenient
to treat overlong sequences as single errors.
If you don't like some recommendation, then do something else. It does not
matter. If you don't reject the whole input but instead choose to replace
illegal sequences with something, then make sure the something is not
nothing -- replacing with an empty string can cause security issues.
Otherwise, what the something is, or how many of them you put in, is not
very relevant. One or more U+FFFDs is customary.
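A small Python example of why "nothing" is dangerous, using only the built-in
codec. The input bytes are made up for illustration. Dropping the illegal byte
splices the text around it together, so a string that a byte-level filter did
not recognize can appear after decoding.

```python
raw = b"<scr\xc0ipt>"          # <C0> can never appear in well-formed UTF-8

# A filter that inspects the raw bytes does not find the tag name...
seen_by_filter = b"<script>" in raw

# ...but deleting the illegal byte reassembles it.
dropped = raw.decode("utf-8", errors="ignore")

# Replacing with U+FFFD keeps the two halves apart.
replaced = raw.decode("utf-8", errors="replace")
```

Here dropped is "<script>", while replaced still has a U+FFFD between "<scr"
and "ipt>".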
When the current recommendation came in, I thought it was reasonable but
didn't like the edge cases. At the time, I didn't think it was important to
twiddle with the text in the standard, and I didn't care that ICU didn't
exactly implement that particular recommendation.
I have seen implementations that clobber every byte in an illegal sequence
with a space, because it's easier than writing a U+FFFD for each byte or
for some subsequences. Fine. Someone might write a single U+FFFD for an
arbitrarily long illegal subsequence; that's fine, too.
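Both variants can be sketched with Python's standard codecs machinery. The
handler name "space_per_byte" and the helper collapse_runs are hypothetical
names chosen for this example. The collapse step is deliberately crude: it
also merges adjacent U+FFFD characters that were legitimately present in the
input.

```python
import codecs
import re


def _space_per_byte(exc):
    """Error handler: one space for every byte in the reported error range."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return " " * (exc.end - exc.start), exc.end


codecs.register_error("space_per_byte", _space_per_byte)


def collapse_runs(data: bytes) -> str:
    """One U+FFFD per run of adjacent replacements (crude sketch, see above)."""
    text = data.decode("utf-8", errors="replace")
    return re.sub("\ufffd+", "\ufffd", text)


spaced = b"a\xc0\xafb".decode("utf-8", errors="space_per_byte")
collapsed = collapse_runs(b"a\xc0\xafb")
```

Neither output is empty where the illegal bytes were, which is the property
that matters.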
Karl Williamson sent feedback to the UTC, "In short, I believe the best
practices are wrong." I think "wrong" is far too strong, but I got an
action item to propose a change in the text. I proposed a modified
recommendation. Nothing gets elevated to "right" that wasn't, nothing gets
demoted to "wrong" that was "right".
None of this is motivated by which UTF is used internally.
It is true that it takes a tiny bit more thought and work to recognize a
wider set of sequences, but a capable implementer will optimize
successfully for valid sequences, and maybe even for a subset of those for
what might be expected high-frequency code point ranges. Error handling can
go into a slow path. In a true state table implementation, it will require
more states but should not affect the performance of valid sequences.
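The fast-path and slow-path split can be sketched at a very coarse level in
Python. This is only an illustration of the structure, not how ICU or any
state-table decoder is written, and decode_utf8 is a hypothetical name. Valid
input never reaches the replacement logic, so the cost of whichever error
policy is chosen falls only on ill-formed input.

```python
def decode_utf8(data: bytes) -> str:
    # Hottest path: an expected high-frequency subset (here, pure ASCII).
    if data.isascii():
        return data.decode("ascii")
    try:
        # Fast path: strict decoding, no replacement handling involved.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Slow path: only ill-formed input pays for error handling.
        return data.decode("utf-8", errors="replace")
```

Swapping the slow path for a different replacement policy leaves the two
faster paths untouched.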
Many years ago, I decided for ICU to add a small amount of slow-path
error-handling code for more human-friendly illegal-sequence reporting. In
other words, this was not done out of convenience; it was an inconvenience
that seemed justified by nicer error reporting. If you don't like to do so,
then don't.
Which UTF is better? It depends. They all have advantages and problems.
It's all Unicode, so it's all good.
ICU largely uses UTF-16 but also UTF-8. It has data structures and code for
charset conversion, property lookup, sets of characters (UnicodeSet), and
collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly
growing set of APIs working directly with UTF-8.
So, please take a deep breath. No conformance requirement is being touched,
no one is forced to do something they don't like, no special consideration
is given for one UTF over another.