Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Philippe Verdy via Unicode
unicode at unicode.org
Tue May 16 16:19:52 CDT 2017
Another alternative for your API is to not return simple integer values, but
return (read-only) instances of a Char32 class whose "scalar" property
would normally be a valid code point with a scalar value, whose "string"
property would be the actual character, and whose "isValidScalar" property
would return "true". For ill-formed sequences, "isValidScalar" will be
false, the scalar value will be the initial code unit from the input
(decoded from the internal representation in the backing store), and the
"string" property will be empty. You may also add a special static "Char32"
instance representing end-of-file/end-of-string, whose "isEOF" property
will be true, whose "scalar" property will typically be -1, whose
"isValidScalar" property will be false, and whose "string" property will be
the empty string.
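
The description above could be sketched roughly as follows. This is only an
illustration in Python, not an existing API: the class layout, the snake_case
property names, and the "valid"/"ill_formed" constructors are assumptions made
for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: instances are read-only
class Char32:
    """Result of reading one position from a string (hypothetical sketch)."""
    scalar: int              # scalar value, raw code unit, or -1 at EOF
    is_valid_scalar: bool
    is_eof: bool = False

    @property
    def string(self) -> str:
        # The actual character for a valid scalar value, otherwise empty.
        return chr(self.scalar) if self.is_valid_scalar else ""

    @classmethod
    def valid(cls, code_point: int) -> "Char32":
        # Scalar values are 0..0x10FFFF excluding the surrogate range.
        if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("not a Unicode scalar value")
        return cls(code_point, True)

    @classmethod
    def ill_formed(cls, code_unit: int) -> "Char32":
        # Carries the initial code unit of the ill-formed sequence.
        return cls(code_unit, False)


# Shared instance representing end-of-file/end-of-string.
Char32.EOF = Char32(scalar=-1, is_valid_scalar=False, is_eof=True)
```

A caller would then test "is_eof" first, then "is_valid_scalar", and only use
"string" or "scalar" as a character in the valid case.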
All this is possible independently of the internal representation made in
the backing store for its own code units (where it may use any extension of
the standard UTFs or any data compression scheme without exposing it).
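
As an illustration of such a hidden representation, here is a minimal sketch
of a 16-bit backing store that keeps each ill-formed UTF-8 byte as an isolated
trailing surrogate (0xDC00 or'ed with the byte), as mentioned in the quoted
message below. It relies on Python's "surrogateescape" error handler, which
happens to produce the same mapping; the function names and the choice of
returning the negated byte for ill-formed input are assumptions made for the
example.

```python
def to_backing_store(data: bytes) -> list[int]:
    """Convert UTF-8 input to 16-bit code units.

    Each byte of an ill-formed sequence is stored as 0xDC00 | byte.
    """
    # surrogateescape maps every undecodable byte b to U+DC00 + b.
    text = data.decode("utf-8", errors="surrogateescape")
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Supplementary code point: store as a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        else:
            units.append(cp)
    return units


def read_code_points(units: list[int]) -> list[int]:
    """Return code points; an escaped ill-formed byte b is returned as -b."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else 0
        if 0xD800 <= u <= 0xDBFF and 0xDC00 <= nxt <= 0xDFFF:
            # Well-formed surrogate pair: recombine into one code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Isolated trailing surrogate: recover the original byte.
            out.append(-(u & 0xFF))
            i += 1
        else:
            out.append(u)
            i += 1
    return out
```

The caller of read_code_points() never sees the surrogates used internally,
only valid code points or negative values.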
2017-05-16 23:08 GMT+02:00 Philippe Verdy <verdy_p at wanadoo.fr>:
> 2017-05-16 20:50 GMT+02:00 Shawn Steele <Shawn.Steele at microsoft.com>:
>> But why change a recommendation just because it “feels like”. As you
>> said, it’s just a recommendation, so if that really annoyed someone, they
>> could do something else (eg: they could use a single FFFD).
>> If the recommendation is truly that meaningless or arbitrary, then we
>> just get into silly discussions of “better” that nobody can really answer.
>> Alternatively, how about “one or more FFFDs?” for the recommendation?
>> To me it feels very odd to perhaps require writing extra code to detect
>> an illegal case. The “best practice” here should maybe be “one or more
>> FFFDs, whatever makes your code faster”.
> Faster OK, provided this does not break other uses, notably random
> access within strings, where UTF-8 is designed to allow searching backward
> over a limited number of bytes (maximum 3) in order to find the leading
> byte, and then check its value:
> - if it's not found, return to the initial position and make the next
> access return U+FFFD to signal the error at that position: this trailing
> byte is part of an ill-formed sequence, and for coherence, any further
> trailing bytes found after it will **also** return U+FFFD (because these
> other trailing bytes may also be reached by random access).
> - if the leading byte is found backward but does not match the expected
> number of trailing bytes after it, return to the initial random
> position, where you'll also return U+FFFD. This means that the initial
> leading byte (part of the ill-formed sequence) must also return a separate
> U+FFFD, given that each following trailing byte will return U+FFFD
> in isolation when accessed.
> If we want coherent decoding with text handling primitives allowing random
> access within encoded sequences, there's no other choice than treating EACH
> byte of the ill-formed sequence as an individual error mapped to the
> same replacement code point (U+FFFD if that is what is chosen, but these
> APIs could as well specify another replacement character, or could
> even return a non-code point if the API return value is not restricted
> to only valid code points (for example, the replacement could be a negative
> value whose absolute value matches the invalid code unit, or some other
> invalid code unit outside the valid range for code points with scalar
> values: isolated surrogates in UTF-16, for example, could be returned as is,
> or made negative either by returning their opposite or by setting (or'ing)
> the most significant bit of the return value)).
> The problem will arise when you need to store the replacement values if
> the internal backing store is limited to 16-bit or 8-bit code
> units: this internal backing store may use its own internal extension of
> the standard UTFs, including the possibility of encoding NULLs as C0,80
> (like what Java does with the "modified UTF-8" internal encoding used in its
> compiled binary classes and serializations), or internally using isolated
> trailing surrogates to store ill-formed UTF-8 input by or'ing these bytes
> with 0xDC00, which will be returned as a code point with no valid scalar
> value. For internally representing ill-formed UTF-16 sequences, there's no
> need to change anything. For internally representing ill-formed UTF-32
> sequences (in fact limited to one 32-bit code unit), with a 16-bit internal
> backing store you may need to store three 16-bit values (three isolated
> trailing surrogates). For internally representing ill-formed UTF-32 in an
> 8-bit backing store, you could use 0xC1 followed by five trailing bytes (each
> one storing 7 bits of the initial ill-formed code unit from the UTF-32
> input).
> What you do in the internal backing store will not be exposed by your
> API, which will just return either valid code points with valid scalar
> values, or values outside the two valid subranges (so they could possibly
> be negative values, or isolated trailing surrogates). That backing store can
> also substitute some valid input causing problems (such as NULLs) using
> 0xC0 plus another byte, that sequence not being exposed by your API, which
> will still be able to return the expected code points (with the minor
> caveat that the total number of returned code points will not match the
> actual size allocated for the internal backing store; applications
> using that API won't even need to know how it is internally represented).
> In other words: any private extensions are possible internally, but it is
> possible to isolate them within a black-box API, which will still be able to
> choose how to represent the input text (it may as well use a zlib-compressed
> backing store, or some stateless Huffman compression based on a static
> statistics table configured and stored elsewhere, initialized when you first
> instantiate your API).
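
The random-access rule described in the quoted message (look back at most 3
bytes for a leading byte, and return U+FFFD for each byte of an ill-formed
sequence) could be sketched as below. This is a hedged illustration, not a
reference implementation: the function names are hypothetical, and Python's
strict UTF-8 decoder is used as the well-formedness check.

```python
REPLACEMENT = 0xFFFD


def _expected_length(lead: int) -> int:
    """Sequence length announced by a leading byte, or 0 if it is not one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # trailing byte (0x80..0xBF) or a byte never valid in UTF-8


def _scalar_at(data: bytes, start: int) -> int:
    """Scalar value of the well-formed sequence starting at start, else -1."""
    n = _expected_length(data[start])
    chunk = data[start:start + n]
    if n == 0 or len(chunk) < n:
        return -1
    try:
        return ord(chunk.decode("utf-8"))  # strict decoding
    except UnicodeDecodeError:
        return -1


def code_point_at(data: bytes, pos: int) -> int:
    """Code point covering the byte at pos; U+FFFD for any ill-formed byte."""
    if 0x80 <= data[pos] <= 0xBF:
        # Trailing byte: search backward over at most 3 bytes.
        for back in range(1, 4):
            start = pos - back
            if start < 0:
                break
            if not 0x80 <= data[start] <= 0xBF:
                # Candidate leading byte: it must announce a sequence that
                # reaches pos, and that sequence must be well-formed.
                if start + _expected_length(data[start]) > pos:
                    cp = _scalar_at(data, start)
                    if cp >= 0:
                        return cp
                break
        return REPLACEMENT
    cp = _scalar_at(data, pos)
    return cp if cp >= 0 else REPLACEMENT
```

With this rule, every byte of a well-formed sequence yields the same code
point, and every byte of an ill-formed sequence yields its own U+FFFD, no
matter which position is accessed first.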